Lesson 11: Applied Problems for Survey Sampling

In Section 11.1, we discuss non-sampling error, which tends to persist even as the sample size grows. We then discuss one common source of non-sampling error: non-response. To address this problem, we discuss how to use double sampling, in the form of callbacks, to adjust for non-response. An example illustrates how to compute the estimate and its estimated variance. In addition, we provide the formula for selecting the optimal number of callbacks. In the last part of Section 11.1, we discuss how to design surveys to reduce non-response.

In Section 11.2, we discuss the technique of interpenetrating subsamples. An example shows how this technique takes the interviewer effect into consideration.

In Section 11.3, we discuss the case when one does not know whether an element belongs to the subpopulation until after it has been sampled and how to estimate the mean and total of this subpopulation. An example is then given to illustrate the method.

Lesson 11: Ch. 14.5 of Sampling by Steven Thompson, 3rd edition

Objectives

Upon completion of this lesson you should be able to:

  1. Distinguish between sampling error and non-sampling error,
  2. Apply double sampling to adjust for non-response by callbacks,
  3. Compute the optimal allocation for the number of callbacks,
  4. Apply the interpenetrating subsample technique to account for the interviewer effect, and
  5. Estimate the mean and total over a subpopulation.

11.1 - Double Sampling for Nonresponse

Non-Sampling Error

Non-sampling error
Differences between estimates and population quantities that do not arise solely from the fact that only a sample, instead of the whole population, is observed.

Here is an example of when the sampling frame does not match up perfectly with the target population.

Example: For telephone surveys, if we are interested in sampling the entire population of a given city, telephone directories are inadequate because they omit unlisted numbers and people, such as the homeless, who do not have telephones.

What would be another example where an important part of the target population would be missed? Non-response can also produce a serious non-sampling error.

Non-response
The self-selection of respondents may produce bias. For example, only people with certain opinions may respond to some questions.

In a well-designed research study, handling non-responses is important. How do we handle this common phenomenon?

An important use of double sampling...

Double sampling can be used to adjust for non-response in the form of callbacks.

Non-response is an important problem to consider in any survey. We can treat the two groups, respondents and non-respondents, as two strata.

The two steps of double sampling for non-response:

  1. Step 1:

    An initial simple random sample of \(n'\) units is selected from a population of N units. These units are classified into two strata: response and non-response.

    \(n'_1\) of these respond - stratum 1
    \(n'_2\) of these do not respond - stratum 2

  2. Step 2:

    Select \(n_2\) of the \(n'_2\) non-respondents by simple random sampling and call them back (for example, by offering more incentives).

    Thus, we are in a double sampling setting where \(n_1=n'_1\) and \(n_2\) is the number of callbacks.
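The two steps above can be sketched in Python. This is only an illustration on a made-up population: the population values, the 45% response probability, and the names `population`, `responds`, and `callbacks` are all assumptions, not part of the lesson's data.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of N units: a value y and a flag saying whether
# the unit would answer the first contact (both invented for illustration).
N = 1000
population = [
    {"y": rng.gauss(15, 6), "responds": rng.random() < 0.45}
    for _ in range(N)
]

# Step 1: initial simple random sample of n' units, split into two strata.
n_prime = 106
initial = rng.sample(population, n_prime)
respondents = [u for u in initial if u["responds"]]         # stratum 1
nonrespondents = [u for u in initial if not u["responds"]]  # stratum 2

# Step 2: simple random sample of n_2 callbacks from the non-respondents.
n2 = min(20, len(nonrespondents))
callbacks = rng.sample(nonrespondents, n2)

# Stratum weights and the double sampling estimate of the mean.
w1 = len(respondents) / n_prime
w2 = len(nonrespondents) / n_prime
ybar1 = sum(u["y"] for u in respondents) / len(respondents)
ybar2 = sum(u["y"] for u in callbacks) / len(callbacks)
y_d = w1 * ybar1 + w2 * ybar2

print(len(respondents), len(nonrespondents), n2, round(y_d, 2))
```

In a real survey the `responds` flag is of course not known in advance; it is observed only after the first contact attempt.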

Example 11-1: Time spent studying

In a college with 1000 students, a questionnaire is mailed to a simple random sample of 106 students asking them about the amount of time they spend per week studying. Out of these students, 46 respond. From the 60 non-respondents, a simple random sample of 20 is selected and intensive efforts are made by telephone and personal visits to obtain responses. The data obtained are as follows:

                  Students responding    Students contacted and responding
                  to questionnaire       to telephone and visit
Sample mean       20.5 hours             10.9 hours
Sample st. dev.   6.2 hours              5.1 hours
Sample size       46                     20

Now, let's estimate the mean and the variance of the estimate.

Solution

To estimate the average hours students spend per week studying, the survey uses double sampling:

  1. Step 1:

    106 students are randomly sampled; 46 responded, and 60 were non-respondents.

    \(n'=106,\ n'_1=46,\ n'_2=60\)

    \(w_1=\dfrac{n'_1}{n'}=\dfrac{46}{106}=0.434\)

    \(w_2=\dfrac{n'_2}{n'}=\dfrac{60}{106}=0.566\)

  2. Step 2:

    From the 60 non-respondents, a simple random sample of 20 students was drawn and the following results were obtained:

    \(n_1=46,\ n_2=20,\ \bar{y}_1=20.5,\ \bar{y}_2=10.9\)

Try it!

Provide an estimate for the mean. Also, estimate the variance of this estimate.

The estimate for the mean is:

\begin{align}
\bar{y}_d &= w_1\bar{y}_1+w_2\bar{y}_2\\
&= 0.434 \times 20.5+ 0.566 \times 10.9\\
&= 15.0664\\
\end{align}

And, the estimated variance of this estimate is:

\begin{align}
\hat{V}ar(\bar{y}_d)&= \dfrac{N-n'}{N(n'-1)}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+\dfrac{N-1}{N}\sum\limits_{h=1}^L \left(\dfrac{n'_h-1}{n'-1}- \dfrac{n_h-1}{N-1}\right)\dfrac{w_h s^2_h}{n_h}\\
&= \dfrac{1000-106}{1000(106-1)}\cdot [0.434(20.5-15.066)^2+0.566(10.9-15.066)^2]\\
&+ \dfrac{1000-1}{1000}\left[\left(\dfrac{46-1}{106-1}-\dfrac{46-1}{1000-1}\right)\dfrac{0.434(6.2)^2}{46}+\left(\dfrac{60-1}{106-1}-\dfrac{20-1}{1000-1}\right)\dfrac{0.566(5.1)^2}{20}\right]\\
&= 0.1928+0.5382\\
&= 0.731\\
\end{align}
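As a check on the arithmetic, the estimate and its two variance components can be reproduced with a short Python sketch. The function name `double_sampling_estimate` and the dictionary layout are assumptions made for illustration; the formula is the one used above. The code keeps the weights unrounded, so results may differ from the hand calculation in the last digit.

```python
def double_sampling_estimate(N, n_prime, strata):
    """Return (mean estimate, between-strata term, within-strata term).

    Each stratum is a dict with keys:
      n_prime_h (first-phase count), n_h (units actually measured),
      mean (sample mean), sd (sample standard deviation).
    """
    weights = [s["n_prime_h"] / n_prime for s in strata]
    y_d = sum(w * s["mean"] for w, s in zip(weights, strata))

    # First term: spread of the stratum means around the overall estimate.
    between = (N - n_prime) / (N * (n_prime - 1)) * sum(
        w * (s["mean"] - y_d) ** 2 for w, s in zip(weights, strata)
    )
    # Second term: sampling variability within each stratum.
    within = (N - 1) / N * sum(
        ((s["n_prime_h"] - 1) / (n_prime - 1) - (s["n_h"] - 1) / (N - 1))
        * w * s["sd"] ** 2 / s["n_h"]
        for w, s in zip(weights, strata)
    )
    return y_d, between, within


# Example 11-1: respondents (stratum 1) and callbacks (stratum 2).
strata = [
    {"n_prime_h": 46, "n_h": 46, "mean": 20.5, "sd": 6.2},
    {"n_prime_h": 60, "n_h": 20, "mean": 10.9, "sd": 5.1},
]
y_d, between, within = double_sampling_estimate(1000, 106, strata)
total_var = between + within
print(round(y_d, 3), round(between, 4), round(within, 4), round(total_var, 3))
```

The square root of `total_var` gives an estimated standard error of roughly 0.85 hours.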

Selecting the Number of Call Backs

  • \(c_0\): the initial cost of sampling each respondent (the set-up cost for each respondent)
  • \(c_1\): the cost of a standard response (cost of producing the response)
  • \(c_2\): the cost of a call-back response

Total cost =\(\left(n^{\prime} \times c_{0}\right)+\left(n_{1}^{\prime} \times c_{1}\right)+\left(n_{2} \times c_{2}\right)\)

We want to determine the value of k (k > 1), where \(n_2=\dfrac{n'_2}{k}\).

Since \(\bar{y}_d= \sum\limits_{h=1}^2 w_h \bar{y}_h\), its variance can be derived, and one can find the values of k and n' that minimize the expected cost of sampling for a desired fixed value of \(\hat{V}ar(\bar{y}_d)\), which we denote as \(V_0\).

When N is large, the optimal values of k and n' are:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}\)

where \(\sigma^2\) is the variance of the entire population and \(\sigma_2^2\) is the variance of the non-response group. [Sometimes we are given only the variances of the response group and the non-response group. In that case, we can use the formula for the variance of a mixture distribution, provided after Example 11-2, to compute the variance of the entire population.]

Example 11-2: Weekly living expenditures

In a college of 1000 students, we want to find out students' average weekly living expenditure. The response rate is anticipated to be about 60%. It is thought that the response group has a higher variance than the non-response group. The overall variance is \(\sigma^2\approx 120\) and the variance of the non-response group is \(\sigma_2^2\approx 80\). The costs are \(c_0=0,\ c_1=1,\ c_2=4.\)

Try it!

Find k, n' and also \(n_2\) so that the variance of the resulting estimator is approximately 5 units.

\(w_1=0.6\)
\(w_2=1-w_1=0.4\)

The optimal value for k is:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}=\sqrt{\dfrac{4\times (120-(0.4\times 80))}{80(0+(1\times 0.6))}}=2.71\)

Note: \(V_0=5\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}=\dfrac{1000(120+(2.71-1)0.4\times 80)}{1000\times 5+120}=34.1\)

Round up to 35.

\(n'_2=0.4\times 35=14\)

We should sample 35 students in step 1 and call back

\(n_2=\dfrac{0.4\times 35}{2.71}=5.2\), rounded up to 6, in step 2.
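The same calculation can be sketched in Python. The function name `optimal_callback_design` and its parameter names are assumptions chosen for readability; the formulas are the large-N expressions for k and n' given earlier, and the expected cost uses the total cost formula with the anticipated 60% response rate.

```python
import math


def optimal_callback_design(N, sigma2, sigma2_nonresp, w1, c0, c1, c2, V0):
    """Return the optimal (k, n') for a target variance V0 (large-N formulas)."""
    w2 = 1.0 - w1
    k = math.sqrt(
        c2 * (sigma2 - w2 * sigma2_nonresp) / (sigma2_nonresp * (c0 + c1 * w1))
    )
    n_prime = N * (sigma2 + (k - 1) * w2 * sigma2_nonresp) / (N * V0 + sigma2)
    return k, n_prime


# Example 11-2 inputs.
c0, c1, c2 = 0, 1, 4
w1 = 0.6
k, n_prime = optimal_callback_design(
    N=1000, sigma2=120, sigma2_nonresp=80, w1=w1, c0=c0, c1=c1, c2=c2, V0=5
)

n_prime_up = math.ceil(n_prime)                 # round the first-phase size up
expected_nonresp = (1 - w1) * n_prime_up        # anticipated non-respondents
n2 = math.ceil(expected_nonresp / k)            # callbacks, rounded up
expected_cost = n_prime_up * c0 + w1 * n_prime_up * c1 + n2 * c2

print(round(k, 2), round(n_prime, 1), n_prime_up, n2, expected_cost)
```

The actual number of non-respondents is random, so in practice \(n_2\) would be recomputed as \(n'_2/k\) once the first-phase response is known.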

Variance of a Mixture Distribution

\(X \sim\left(\mu_{X}, (\sigma_{X})^{2}\right)\)
\(Y \sim\left(\mu_{Y}, (\sigma_{Y})^{2}\right)\)

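W is a mixture of X and Y: W is drawn from the distribution of X with probability p and from the distribution of Y with probability 1 - p.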

\(W=pX\oplus (1-p)Y\)

where \(0\leq p \leq 1\)

Then:

\(\mu_{W}=p \mu_{X}+(1-p) \mu_{Y}\)

What is the variance of the overall population?

\((\sigma_{W})^{2}=p (\sigma_{X})^{2}+(1-p) (\sigma_{Y})^{2}+p(1-p)\left(\mu_{X}-\mu_{Y}\right)^{2}\)

You can think of these p's as weights of the two variables. When one wants to apply this formula to find the mean and variance of a population that consists of the response and the nonresponse group, then p is the proportion of the response group and (1-p) the proportion of the nonresponse group.
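The mixture formulas can be checked numerically on a small finite population. In the sketch below, the two groups and their values are invented purely for illustration; the check compares the formulas with the mean and variance computed directly from the pooled values, using population (divide-by-size) variances throughout.

```python
from statistics import fmean, pvariance

# Made-up groups: 4 "responders" and 6 "non-responders", so p = 0.4.
group_x = [10, 12, 14, 16]
group_y = [1, 3, 5, 7, 9, 11]
p = len(group_x) / (len(group_x) + len(group_y))

mu_x, mu_y = fmean(group_x), fmean(group_y)
var_x, var_y = pvariance(group_x), pvariance(group_y)

# Mixture formulas from the text.
mu_w = p * mu_x + (1 - p) * mu_y
var_w = p * var_x + (1 - p) * var_y + p * (1 - p) * (mu_x - mu_y) ** 2

# Direct computation on the pooled population.
pooled = group_x + group_y
print(mu_w, fmean(pooled))
print(var_w, pvariance(pooled))
```

The last term, \(p(1-p)(\mu_X-\mu_Y)^2\), is what makes the overall variance exceed the weighted average of the group variances whenever the group means differ.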

Designing surveys to reduce non-response

Good survey practice is to discover why the non-response occurs and resolve as many of the problems as possible before commencing the survey.

Example: Designing an experiment to find out how best to improve the response rate.

A factorial experiment was employed in the 1992 Census Implementation Test to explore the individual effects and the interactions of 3 factors on the response rate (in %):

  1. pre-notice letter
  2. stamped return envelope
  3. reminder postcard

[Figure: cube of response rates for the factor combinations]

  • Letter, Postcard, Envelope 64.3%
  • Letter, Postcard, no envelope 62.7% (not bad!)

Some factors that may influence the response rate and data accuracy:

  • Survey content: sensitive questions will have a high non-response rate. Try the randomized response technique.
  • Time of survey: select wisely.
  • Data collection method: Computer Assisted Telephone Interviewing (CATI), for example, has been shown to improve data accuracy. In CATI, interview questions are stored in a computer, recalled in programmable sequences, and displayed for each interviewer on a video display terminal. Interviewers enter the answers received by telephone directly into the computer.
  • Incentives and penalties

Note! The quality of the survey data is largely determined at the design stage.

11.2 - Interpenetrating Subsample

There are k interviewers, each with a different manner of interviewing, so they may obtain slightly different responses. To keep the notation simple, we assume that each interviewer conducts the same number of interviews. Let n denote the total sample size, with n = k * m: the sample is randomly divided into k subsamples, and each interviewer is assigned one subsample of m subjects.

Objective: to use simple random sampling to estimate \(\mu\)

  • Interviewer 1: \(y_{11}, y_{12}, y_{13},...,y_{1m}\)
  • Interviewer 2: \(y_{21}, y_{22}, y_{23},...,y_{2m}\)
  • Interviewer 3: \(y_{31}, y_{32}, y_{33},...,y_{3m}\)
  • Interviewer k: \(y_{k1}, y_{k2}, y_{k3},...,y_{km}\)

The average for the ith interviewer is denoted as:

\(\bar{y}_i=\dfrac{1}{m}\sum\limits_{j=1}^m y_{ij}\)

The grand average is denoted as:

\(\bar{y}=\dfrac{1}{k}\sum\limits_{i=1}^k \bar{y}_i\)

The grand average \(\bar{y}\) is unbiased for μ and the estimated variance of \(\bar{y}\) is:

\(\hat{V}ar(\bar{y})=\dfrac{N-n}{N}\cdot \dfrac{s^2_k}{k}\)

\(\text{where } s^2_k=\dfrac{\sum\limits_{i=1}^k (\bar{y}_i-\bar{y})^2}{k-1}\)

The interpenetrating subsample technique gives an estimate of the variance of \(\bar{y}\) that accounts for interviewer biases. In practice, the estimated variance given by the above formula is usually larger than the variance estimate obtained by treating the data as a single simple random sample.

Example 11-3: Interpenetrating subsample

A researcher has 10 research assistants, each with his/her own equipment that they use to measure the time (in seconds) it takes for people to respond to a command. A simple random sample of 80 people is taken. Since the researcher believes the assistants will produce slightly biased measurements, he decides to randomly divide the 80 people into 10 subsamples of 8 persons each. Each assistant is then assigned to one subsample. The measurements are given in the following table.

Assistant   Time it takes to respond (seconds)
1           52 73 62 75 71 68 55 65
2           62 65 73 67 78 71 67 59
3           43 54 52 48 56 51 62 57
4           73 64 63 59 71 78 67 76
5           88 76 69 83 85 66 74 73
6           55 71 63 75 68 72 69 60
7           72 65 77 69 74 82 73 67
8           55 43 58 62 42 61 53 61
9           62 52 59 63 69 72 64 58
10          77 65 79 69 72 68 71 67

Minitab output:

               Mean
Subsample 1    65.125
Subsample 2    67.750
Subsample 3    52.875
Subsample 4    68.875
Subsample 5    76.750
Subsample 6    66.625
Subsample 7    72.375
Subsample 8    54.375
Subsample 9    62.375
Subsample 10   71.000

Try it!

Estimate the mean and the variance of the estimate.

We estimate the mean by:

\(\bar{y}=\dfrac{1}{10}(\bar{y}_1+\bar{y}_2+\ldots+\bar{y}_{10})=\dfrac{1}{10}(65.125+\ldots+71.000)=65.81\)

Since the population size N is not given, we omit the finite population correction. The variance is estimated to be:

\(\hat{V}ar(\bar{y})=\dfrac{\sum\limits_{i=1}^k (\bar{y}_i-65.81)^2}{(10-1)\times 10}=5.71\)

\(\hat{S}D(\bar{y})=2.39\)

If one neglects the interviewer effect, then \(\hat{S}D(\bar{y})\approx 1\), thus it is important to take into consideration the interviewer effect. Otherwise, one underestimates \(\hat{S}D(\bar{y})\).
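The comparison can be reproduced from the raw measurements with a short Python sketch. The variable names are assumptions; the finite population correction is omitted, as in the worked example, and the "naive" figure treats all 80 measurements as one simple random sample.

```python
from statistics import fmean, variance

# Response times (seconds), one row per assistant, from the table above.
times = [
    [52, 73, 62, 75, 71, 68, 55, 65],
    [62, 65, 73, 67, 78, 71, 67, 59],
    [43, 54, 52, 48, 56, 51, 62, 57],
    [73, 64, 63, 59, 71, 78, 67, 76],
    [88, 76, 69, 83, 85, 66, 74, 73],
    [55, 71, 63, 75, 68, 72, 69, 60],
    [72, 65, 77, 69, 74, 82, 73, 67],
    [55, 43, 58, 62, 42, 61, 53, 61],
    [62, 52, 59, 63, 69, 72, 64, 58],
    [77, 65, 79, 69, 72, 68, 71, 67],
]

k = len(times)
subsample_means = [fmean(row) for row in times]
grand_mean = fmean(subsample_means)

# Interpenetrating subsample estimate: s_k^2 / k, using the k subsample means.
var_interp = variance(subsample_means) / k
sd_interp = var_interp ** 0.5

# Naive estimate that ignores the assistants: s^2 / n over all observations.
all_obs = [y for row in times for y in row]
sd_naive = (variance(all_obs) / len(all_obs)) ** 0.5

print(round(grand_mean, 2), round(var_interp, 2),
      round(sd_interp, 2), round(sd_naive, 2))
```

The estimated variance comes out at about 5.7 and the standard deviation at about 2.39, while the naive standard deviation is close to 1, which is the gap the text attributes to the interviewer effect.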


11.3 - Estimation of means and totals over subpopulation

Quite often, it is impossible to obtain a frame that lists only those elements of the population that one is interested in. For example, you may want to sample households with children, but the best frame available is a list of all households. Therefore, we wish to estimate the parameters of a subpopulation of the population represented in the frame.

Main Issue: You do not know the size of the subpopulation.

Notation:

  • N - the number of elements in the population
  • \(N_1\)- the number of elements in the subpopulation
  • n - sample size from the population
  • \(n_1\) - the number of sampled elements from the subpopulation
  • \(y_{1j}\) - the jth sampled observation that falls in the subpopulation

An unbiased estimator of \(\mu_1\), the subpopulation mean, is:

\(\bar{y}_1=\dfrac{1}{n_1}\sum\limits_{j=1}^{n_1} y_{1j}\)

Its variance is estimated by:

\(\hat{V}ar(\bar{y}_1)=\left(\dfrac{N_1-n_1}{N_1}\right)\dfrac{s^2_1}{n_1}\)

\(\text{where } s^2_1=\dfrac{\sum\limits_{j=1}^{n_1} (y_{1j}-\bar{y}_1)^2}{n_1-1}\)

Usually, we do not know \(N_1\), so we estimate the finite population correction factor \(\dfrac{N_1-n_1}{N_1}\) by \(\dfrac{N-n}{N}\).

Example 11-4: Amount spent on food

Let's say we want to estimate the average weekly amount spent on food by married graduate students in a certain college at Penn State. There are 80 graduate students in the college. 15 are sampled and 10 are married. A summary of the data follows:

Variable    Marital status   N    Mean     SE Mean   StDev
food cost   m                10   135.3    14.1      44.4
            s                5    87.60    9.73      21.76

Try it!

What is the average food cost for married students in that college at Penn State? Provide an estimate for the standard deviation for the estimate.

The average food cost for married students is:

\(\bar{y}_m=135.3\)

An estimate for the standard deviation for the estimate is:

\(\hat{V}ar(\bar{y}_m)=\dfrac{80-15}{80}\cdot \dfrac{44.4^2}{10}=160.173\)

\(\hat{S}D(\bar{y}_m)=12.656\)
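A minimal Python sketch of this calculation follows. The function name `subpop_mean_variance` is an assumption; it applies the variance formula above with the finite population correction estimated by \((N-n)/N\), since \(N_1\) is unknown.

```python
import math


def subpop_mean_variance(N, n, n1, s1):
    """Estimated variance and SD of a subpopulation sample mean.

    N, n : population size and total sample size (used for the estimated fpc)
    n1   : number of sampled units that fall in the subpopulation
    s1   : sample standard deviation within the subpopulation
    """
    fpc = (N - n) / N  # stands in for (N1 - n1) / N1, which is unknown
    var = fpc * s1 ** 2 / n1
    return var, math.sqrt(var)


# Example 11-4: 80 graduate students, 15 sampled, 10 of them married.
var_m, sd_m = subpop_mean_variance(N=80, n=15, n1=10, s1=44.4)
print(round(var_m, 3), round(sd_m, 3))
```

Note that the Minitab "SE Mean" of 14.1 in the table is \(s_1/\sqrt{n_1}\) without any finite population correction, which is why it is larger than the 12.656 obtained here.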

