11.1 - Double Sampling for Nonresponse

Non-Sampling Error

Non-sampling error
The differences between estimates and population quantities do not arise solely from the fact that only a sample, instead of the whole population, is observed.

Here is an example of when the sampling frame does not match up perfectly with the target population.

Example: For telephone surveys, if we are interested in sampling the entire population of a given city, telephone directories are inadequate because some telephone numbers are unlisted and homeless people may not have telephones.

What would be another example where an important part of the target population is missed? Non-response can cause a serious non-sampling error.

Non-response
The self-selection of respondents may produce bias. For example, only people with certain opinions may respond to some questions.

In a well-designed research study, handling non-responses is important. How do we handle this common phenomenon?

An important use of double sampling...

Double sampling can be used to adjust for non-response in the form of callbacks.

Non-response is an important problem to consider in any survey. We can treat the two groups, respondents and non-respondents, as two strata.

The two steps of double sampling for non-response:

  1. Step 1:

    An initial simple random sample of \(n'\) units is selected from a population of \(N\) units. These units are classified into two strata: response and non-response.

    \(n'_1\) of these respond - stratum 1
    \(n'_2\) of these do not respond - stratum 2

  2. Step 2:

    Call back \(n_2\) units selected by simple random sampling from the \(n'_2\) non-respondents (by giving more incentives, etc.)

    Thus, we are in a double sampling setting where \(n_1=n'_1\) and \(n_2\) is the number of callbacks.
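The two steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the population values, the 45% response propensity, and the assumption that every called-back unit responds are all hypothetical and do not come from the lesson.

```python
import random

random.seed(0)

# Hypothetical population of N units: (value, responds_to_first_contact).
# Respondents and non-respondents are given different means on purpose.
N = 1000
population = []
for _ in range(N):
    responds = random.random() < 0.45
    value = random.gauss(20, 6) if responds else random.gauss(11, 5)
    population.append((value, responds))

# Step 1: simple random sample of n' units, classified into two strata.
n_prime = 106
first_phase = random.sample(population, n_prime)
respondents = [v for v, r in first_phase if r]
nonrespondents = [v for v, r in first_phase if not r]

# Step 2: simple random sample of n_2 callbacks from the non-respondents
# (assumed here to all respond after the extra effort).
n2 = 20
callbacks = random.sample(nonrespondents, n2)

# Stratum weights come from the first-phase counts.
w1 = len(respondents) / n_prime
w2 = len(nonrespondents) / n_prime
ybar1 = sum(respondents) / len(respondents)
ybar2 = sum(callbacks) / len(callbacks)

# Double sampling estimate versus the respondent-only mean.
ybar_d = w1 * ybar1 + w2 * ybar2
print("respondent-only mean:", round(ybar1, 2))
print("double sampling estimate:", round(ybar_d, 2))
```

When the two strata differ, as they do in this made-up population, the respondent-only mean is pulled toward the respondents, while the weighted estimate uses the callbacks to represent the non-response stratum.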

Example 11-1: Time spent studying

In a college with 1000 students, a questionnaire is mailed to a simple random sample of 106 students asking them about the amount of time they spend per week studying. Out of these students, 46 respond. From the 60 non-respondents, a simple random sample of 20 is selected and intensive efforts are made by telephone and personal visits to obtain responses. The data obtained are as follows:

                | Students responding to questionnaire | Students contacted and responded to telephone and visit
Sample mean     | 20.5 hours                           | 10.9 hours
Sample st. dev. | 6.2 hours                            | 5.1 hours
Sample size     | 46                                   | 20

Now, let's estimate the mean and also the variance of that estimate.

Solution

To estimate the average number of hours students spend per week studying, we use double sampling:

  1. Step 1:

    106 students are randomly sampled; 46 responded, and 60 were non-respondents.

    \(n'=106,\ n'_1=46,\ n'_2=60\)

    \(w_1=\dfrac{n'_1}{n'}=\dfrac{46}{106}=0.434\)

    \(w_2=\dfrac{n'_2}{n'}=\dfrac{60}{106}=0.566\)

  2. Step 2:

    From the 60 non-respondents, a simple random sample of 20 students was selected and the following responses were obtained:

    \(n_1=46,\ n_2=20,\ \bar{y}_1=20.5,\ \bar{y}_2=10.9\)

Try it!

Provide an estimate for the mean. Also, estimate the variance of this estimate.

The estimate for the mean is:

\begin{align}
\bar{y}_d &= w_1\bar{y}_1+w_2\bar{y}_2\\
&= 0.434 \times 20.5+ 0.566 \times 10.9\\
&= 15.0664\\
\end{align}

And, the estimated variance of this estimate is:

\begin{align}
\hat{V}ar(\bar{y}_d)&= \dfrac{N-n'}{N(n'-1)}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+\dfrac{N-1}{N}\sum\limits_{h=1}^L \left(\dfrac{n'_h-1}{n'-1}- \dfrac{n_h-1}{N-1}\right)\dfrac{w_h s^2_h}{n_h}\\
&= \dfrac{1000-106}{1000(106-1)}\cdot [0.434(20.5-15.066)^2+0.566(10.9-15.066)^2]\\
&+ \dfrac{1000-1}{1000}\left[\left(\dfrac{46-1}{106-1}-\dfrac{46-1}{1000-1}\right)\dfrac{0.434(6.2)^2}{46}+\left(\dfrac{60-1}{106-1}-\dfrac{20-1}{1000-1}\right)\dfrac{0.566(5.1)^2}{20}\right]\\
&= 0.1928+0.5382\\
&= 0.731\\
\end{align}
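The hand calculation above can be checked with a short script. This is a minimal sketch: the function name `double_sampling_estimate` and the tuple layout for each stratum are our own choices for illustration, and the script uses the unrounded weights \(46/106\) and \(60/106\), so the last digits can differ slightly from the rounded arithmetic shown above.

```python
def double_sampling_estimate(N, n_prime, strata):
    """Double sampling mean and estimated variance.

    strata: list of (n_prime_h, n_h, ybar_h, s_h) tuples, one per stratum,
    where n_prime_h is the first-phase count and n_h the second-phase count.
    """
    weights = [n_ph / n_prime for n_ph, _, _, _ in strata]

    # Weighted mean: sum of w_h * ybar_h.
    mean = sum(w * ybar for w, (_, _, ybar, _) in zip(weights, strata))

    # Between-strata term: sum of w_h * (ybar_h - ybar_d)^2.
    between = sum(
        w * (ybar - mean) ** 2 for w, (_, _, ybar, _) in zip(weights, strata)
    )

    # Within-strata term of the variance formula.
    within = sum(
        ((n_ph - 1) / (n_prime - 1) - (n_h - 1) / (N - 1)) * w * s ** 2 / n_h
        for w, (n_ph, n_h, _, s) in zip(weights, strata)
    )

    variance = (N - n_prime) / (N * (n_prime - 1)) * between + (N - 1) / N * within
    return mean, variance


# Example 11-1: stratum 1 = respondents, stratum 2 = non-respondents.
mean, variance = double_sampling_estimate(
    N=1000,
    n_prime=106,
    strata=[(46, 46, 20.5, 6.2), (60, 20, 10.9, 5.1)],
)
print(round(mean, 3), round(variance, 3))
```

The result should agree with the values 15.07 and 0.731 obtained above up to rounding of the weights.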

Selecting the Number of Call Backs

  • \(c_0\): the initial cost of sampling each respondent (the set-up cost for each respondent)
  • \(c_1\): the cost of a standard response (cost of producing the response)
  • \(c_2\): the cost of a call-back response

Total cost =\(\left(n^{\prime} \times c_{0}\right)+\left(n_{1}^{\prime} \times c_{1}\right)+\left(n_{2} \times c_{2}\right)\)

We want to determine the value \(k\) (\(k > 1\)) where \(n_2=\dfrac{n'_2}{k}\).

As \(\bar{y}_d= \sum\limits_{h=1}^2 w_h \bar{y}_h\), its variance can be derived and one can find the values of \(k\) and \(n'\) that minimize the expected cost of sampling for a desired fixed value of \(\hat{V}ar(\bar{y}_d)\), which we denote as \(V_0\).

When \(N\) is large, the optimal values of \(k\) and \(n'\) are:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}\)

where \(\sigma^2\) is the variance of the entire population and \(\sigma_2^2\) is the variance of the non-response group. [Sometimes we are given only the variances of the response group and the non-response group. In such a case, we can use the formula for the variance of a mixture distribution, provided after the example, to compute the variance of the entire population.]

Example 11-2: Weekly living expenditures

In a college of 1000 students, we want to find out students' average weekly living expenditure. The response rate is anticipated to be about 60%. It is thought that the response group has a higher variance than the non-response group. The overall variance is \(\sigma^2\approx 120\), the variance of the non-response group is \(\sigma_2^2\approx 80\), and the costs are \(c_0=0,\ c_1=1,\ c_2=4\).

Try it!

Find k, n' and also \(n_2\) so that the variance of the resulting estimator is approximately 5 units.

\(w_1=0.6\)
\(w_2=1-w_1=0.4\)

The optimal value for k is:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}=\sqrt{\dfrac{4\times (120-(0.4\times 80))}{80(0+(1\times 0.6))}}=2.71\)

Note: \(V_0=5\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}=\dfrac{1000(120+(2.71-1)0.4\times 80)}{1000\times 5+120}=34.1\)

Round up to 35.

\(n'_2=0.4\times 35=14\)

We should sample 35 in step 1 and call back

\(n_2=\dfrac{0.4\times 35}{2.71}=5.2 \text{ or }6\) in step 2.
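The same planning calculation can be scripted. This is a sketch only: the function name `optimal_callback_design` and its parameter names are hypothetical, the formulas are the large-\(N\) expressions given above, and the expected cost uses the anticipated response rate in place of the unknown \(n'_1\). The script keeps \(k\) unrounded, so intermediate values differ slightly from the hand calculation with \(k=2.71\).

```python
import math


def optimal_callback_design(N, w2, var_all, var_nonresp, c0, c1, c2, V0):
    """Return (k, n_prime), unrounded, from the large-N formulas in the text."""
    w1 = 1 - w2
    k = math.sqrt(
        c2 * (var_all - w2 * var_nonresp) / (var_nonresp * (c0 + c1 * w1))
    )
    n_prime = N * (var_all + (k - 1) * w2 * var_nonresp) / (N * V0 + var_all)
    return k, n_prime


# Planning values from Example 11-2.
w2 = 0.4
c0, c1, c2 = 0, 1, 4
k, n_prime_raw = optimal_callback_design(
    N=1000, w2=w2, var_all=120, var_nonresp=80, c0=c0, c1=c1, c2=c2, V0=5
)

n_prime = math.ceil(n_prime_raw)   # first-phase sample size, rounded up
n2 = math.ceil(w2 * n_prime / k)   # number of callbacks, rounded up

# Expected total cost: n' * c0 + (expected n'_1) * c1 + n_2 * c2.
expected_cost = c0 * n_prime + c1 * (1 - w2) * n_prime + c2 * n2

print(round(k, 2), n_prime, n2, expected_cost)
```

Rounding both sample sizes up is a conservative choice: it keeps the achieved variance at or below the target rather than slightly above it.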

Variance of a Mixture Distribution

\(X \sim\left(\mu_{X}, (\sigma_{X})^{2}\right)\)
\(Y \sim\left(\mu_{Y}, (\sigma_{Y})^{2}\right)\)

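W is a mixture of X and Y with mixing proportions p and (1 - p); that is, W is drawn from the distribution of X with probability p and from the distribution of Y with probability (1 - p):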

\(W=pX\oplus (1-p)Y\)

where \(0\leq p \leq 1\)

Then:

\(\mu_{W}=p \mu_{X}+(1-p) \mu_{Y}\)

What is the variance of the overall population?

\((\sigma_{W})^{2}=p (\sigma_{X})^{2}+(1-p) (\sigma_{Y})^{2}+p(1-p)\left(\mu_{X}-\mu_{Y}\right)^{2}\)

You can think of these p's as weights of the two variables. When one wants to apply this formula to find the mean and variance of a population that consists of the response and the nonresponse group, then p is the proportion of the response group and (1-p) the proportion of the nonresponse group.
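As an illustration, the mixture formulas can be wrapped in a small helper. This is a sketch under assumptions: the function name `mixture_moments` is hypothetical, and the numbers plugged in are the sample statistics from Example 11-1, used here as stand-ins for the unknown population means and variances of the two groups.

```python
def mixture_moments(p, mu_x, var_x, mu_y, var_y):
    """Mean and variance of a mixture with proportions p and 1 - p."""
    mu_w = p * mu_x + (1 - p) * mu_y
    var_w = p * var_x + (1 - p) * var_y + p * (1 - p) * (mu_x - mu_y) ** 2
    return mu_w, var_w


# Response group: proportion 46/106, mean 20.5, st. dev. 6.2.
# Non-response group: mean 10.9, st. dev. 5.1.
p = 46 / 106
mu_w, var_w = mixture_moments(p, 20.5, 6.2 ** 2, 10.9, 5.1 ** 2)
print(round(mu_w, 2), round(var_w, 2))
```

The third term, \(p(1-p)(\mu_X-\mu_Y)^2\), is what makes the overall variance larger than a simple weighted average of the two group variances whenever the group means differ.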

Designing surveys to reduce non-response

Good survey practice is to discover why the non-response occurs and resolve as many of the problems as possible before commencing the survey.

Example: Designing an experiment to find out how best to improve the response rate.

A factorial experiment was employed in the 1992 Census Implementation Test to explore the individual effects and the interactions of 3 factors on the response rate (in %):

  1. pre-notice letter
  2. stamped return envelope
  3. reminder postcard

[Figure: cube of response rates for the factor combinations]

  • Letter, Postcard, Envelope 64.3%
  • Letter, Postcard, no envelope 62.7% (not bad!)

Some factors that may influence the response rate and data accuracy:

  • Survey content: sensitive questions will have a high non-response rate. Try the randomized response technique.
  • Time of survey: select wisely.
  • Data collection method: Computer Assisted Telephone Interviewing (CATI) has been shown to improve data accuracy. In CATI, interview questions are stored in a computer, recalled in programmable sequences, and displayed for each interviewer on a video display terminal. Interviewers enter answers received via telephone directly into the computer.
  • Incentives and penalties
Note! The quality of the survey data is largely determined at the design stage.