## Non-Sampling Error

- Non-sampling error
- The differences between estimates and population quantities that do not arise solely from the fact that only a sample, instead of the whole population, is observed.

Here is an example of when the sampling frame does not match up perfectly with the target population.

**Example**: For telephone surveys, if we want to sample the entire population of a given city, telephone directories are an inadequate sampling frame because they omit unlisted numbers and people without telephones, such as the homeless.

What would be another example where an important part of the target population would be missed? Non-response is another potentially serious source of non-sampling error.

- Non-response
- The self-selection of respondents may produce bias. For example, only people with certain opinions may respond to some questions.

In a well-designed research study, handling non-response is important. How do we handle this common phenomenon?

**An important use of double sampling...**

Double sampling can be used to adjust for non-response in the form of call backs.

Non-response is an important problem to consider in any survey. We can treat the two groups, respondents and non-respondents, as two strata.

The two steps of double sampling for non-response:

**Step 1:** An initial simple random sample of *n*' units is selected from a population of *N* units. These units are classified into two strata, response and non-response:

- \(n'_1\) of these respond (stratum 1)
- \(n'_2\) of these do not respond (stratum 2)

**Step 2:** Call back \(n_2\) units, chosen by simple random sampling from the \(n'_2\) non-respondents (by giving more incentives, etc.).

Thus, we are in a double sampling setting where \(n_1=n'_1\) (every initial respondent is used) and \(n_2\) is the number of call backs.
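For reference, the estimator used in the worked example that follows can be written out in this notation. This is only a restatement under the assumption of two strata, with \(\bar{y}_1\) the sample mean of the \(n_1\) initial respondents and \(\bar{y}_2\) the sample mean of the \(n_2\) call backs:

\(w_1=\dfrac{n'_1}{n'},\quad w_2=\dfrac{n'_2}{n'},\quad \bar{y}_d=w_1\bar{y}_1+w_2\bar{y}_2\)

The weights come from the full first-phase sample, so the non-response stratum keeps its share \(w_2\) even though only \(n_2\) of its units are measured.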

## Example 11-1: Time spent studying

In a college with 1000 students, a questionnaire is mailed to a simple random sample of 106 students asking them about the amount of time they spend per week studying. Out of these students, 46 respond. From the 60 non-respondents, a simple random sample of 20 is selected and intensive efforts are made by telephone and personal visit to obtain responses. The data obtained are as follows:

| | Students responding to questionnaire | Students contacted and responded to telephone and visit |
|---|---|---|
| Sample mean | 20.5 hours | 10.9 hours |
| Sample st. dev. | 6.2 hours | 5.1 hours |
| Sample size | 46 | 20 |

Now, let's estimate the mean and also the variance of the estimate.

#### Solution

To estimate the average hours students spend per week studying, we use double sampling:

**Step 1:** 106 students are randomly sampled; 46 respond and 60 do not.

\(n'=106,\ n'_1=46,\ n'_2=60\)

\(w_1=\dfrac{n'_1}{n'}=\dfrac{46}{106}=0.434\)

\(w_2=\dfrac{n'_2}{n'}=\dfrac{60}{106}=0.566\)

**Step 2:** From the 60 non-respondents, a simple random sample of 20 students was selected and the following responses were obtained:

\(n_1=46,\ n_2=20,\ \bar{y}_1=20.5,\ \bar{y}_2=10.9\)

#### Try it!

The estimate for the mean is:

\begin{align}
\bar{y}_d &= w_1\bar{y}_1+w_2\bar{y}_2\\
&= 0.434 \times 20.5+ 0.566 \times 10.9\\
&= 15.0664
\end{align}

And, the estimated variance of this estimate is:

\begin{align}
\hat{V}ar(\bar{y}_d)&= \dfrac{N-n'}{N(n'-1)}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+\dfrac{N-1}{N}\sum\limits_{h=1}^L \left(\dfrac{n'_h-1}{n'-1}- \dfrac{n_h-1}{N-1}\right)\dfrac{w_h s^2_h}{n_h}\\
&= \dfrac{1000-106}{1000(106-1)}\cdot [0.434(20.5-15.066)^2+0.566(10.9-15.066)^2]\\
&\quad + \dfrac{1000-1}{1000}\left[\left(\dfrac{46-1}{106-1}-\dfrac{46-1}{1000-1}\right)\dfrac{0.434(6.2)^2}{46}+\left(\dfrac{60-1}{106-1}-\dfrac{20-1}{1000-1}\right)\dfrac{0.566(5.1)^2}{20}\right]\\
&= 0.1928+0.5382\\
&= 0.731
\end{align}
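The arithmetic above can be checked with a short script. The following is a minimal Python sketch, not part of the original notes: the function name `double_sampling_estimate` and the tuple layout for each stratum are illustrative assumptions, and the formulas are simply the estimator and variance estimate written above.

```python
def double_sampling_estimate(N, strata):
    """Return (mean estimate, estimated variance) for double sampling.

    `strata` is a list of tuples (n_prime_h, n_h, mean_h, sd_h); this
    layout is a hypothetical choice made for this sketch.
    """
    n_prime = sum(s[0] for s in strata)            # n' = total first-phase sample
    weights = [s[0] / n_prime for s in strata]     # w_h = n'_h / n'
    y_d = sum(w * s[2] for w, s in zip(weights, strata))

    # First term: spread of the stratum means around the overall estimate.
    between = (N - n_prime) / (N * (n_prime - 1)) * sum(
        w * (s[2] - y_d) ** 2 for w, s in zip(weights, strata)
    )
    # Second term: sampling variability within each stratum.
    within = (N - 1) / N * sum(
        ((s[0] - 1) / (n_prime - 1) - (s[1] - 1) / (N - 1)) * w * s[3] ** 2 / s[1]
        for w, s in zip(weights, strata)
    )
    return y_d, between + within


# Example 11-1: respondents (stratum 1) and call backs (stratum 2).
respondents = (46, 46, 20.5, 6.2)
call_backs = (60, 20, 10.9, 5.1)
mean_hat, var_hat = double_sampling_estimate(1000, [respondents, call_backs])
print(f"mean estimate: {mean_hat:.3f}, estimated variance: {var_hat:.3f}")
```

Because the script uses unrounded weights, its results may differ from the hand calculation in the last decimal place.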

## Selecting the Number of Call Backs

- \(c_0\): the initial cost of sampling each respondent (the set-up cost for each respondent)
- \(c_1\): the cost of a standard response (cost of producing the response)
- \(c_2\): the cost of a call back response

Total cost = \(\left(n^{\prime} \times c_{0}\right)+\left(n_{1}^{\prime} \times c_{1}\right)+\left(n_{2} \times c_{2}\right)\)

We want to determine the value of *k* (*k* > 1), where \(n_2=\dfrac{n'_2}{k}\).

Since \(\bar{y}_d= \sum\limits_{h=1}^2 w_h \bar{y}_h\), its variance can be derived, and one can find the values of *k* and *n*' that minimize the expected cost of sampling for a desired fixed value of \(\hat{V}ar(\bar{y}_d)\), which we denote as \(V_0\).

When *N* is large, the optimal values of *k* and *n*' are:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}\)

where \(\sigma^2\) is the variance of the entire population and \(\sigma_2^2\) is the variance of the non-response group. *[Sometimes, we are only given the variances of the response group and the non-response group. In such a case, we can use the formula for the variance of a mixture distribution, provided after the Example, to compute the variance of the entire population.]*

## Example 11-2: Weekly living expenditures

In a college of 1000 students, we want to find out students' average weekly living expenditure. The response rate is anticipated to be about 60%. It is thought that the response group has a higher variance than the non-response group. The overall variance is \(\sigma^2\approx 120\) and the variance of the non-response group is \(\sigma_2^2\approx 80\). The costs are \(c_0=0,\ c_1=1,\ c_2=4\).

#### Try it!

Find *k*, *n*', and \(n_2\) so that the variance of the resulting estimator is approximately 5 units.

\(w_1=0.6\)

\(w_2=1-w_1=0.4\)

The optimal value for *k* is:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}=\sqrt{\dfrac{4\times (120-(0.4\times 80))}{80(0+(1\times 0.6))}}=2.71\)

Note: \(V_0=5\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}=\dfrac{1000(120+(2.71-1)0.4\times 80)}{1000\times 5+120}=34.1\)

Round up to 35.

\(n'_2=0.4\times 35=14\)

We should sample 35 in step 1 and call back \(n_2=\dfrac{0.4\times 35}{2.71}=5.2\), rounded up to 6, in step 2.
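The same design calculation can be scripted. The following Python sketch is an illustration only: the function name `optimal_call_back_design` and its parameter names are hypothetical, it implements the large-*N* formulas for *k* and *n*' given earlier, and it assumes the same rounding-up convention as the worked solution. The expected cost line plugs the expected number of respondents into the total cost formula.

```python
import math


def optimal_call_back_design(N, var_total, var_nonresp, w1, c0, c1, c2, V0):
    """Large-N optimal k and n' (both unrounded) for a target variance V0."""
    w2 = 1.0 - w1
    k = math.sqrt(
        c2 * (var_total - w2 * var_nonresp) / (var_nonresp * (c0 + c1 * w1))
    )
    n_prime = N * (var_total + (k - 1) * w2 * var_nonresp) / (N * V0 + var_total)
    return k, n_prime


# Inputs from Example 11-2.
w1, c0, c1, c2 = 0.6, 0, 1, 4
k, n_prime_raw = optimal_call_back_design(1000, 120, 80, w1, c0, c1, c2, V0=5)

n_prime = math.ceil(n_prime_raw)     # initial sample size, rounded up
n2_prime = (1 - w1) * n_prime        # expected number of non-respondents
n2 = math.ceil(n2_prime / k)         # number of call backs, rounded up

# Expected total cost: n' * c0 + (expected respondents) * c1 + n2 * c2.
expected_cost = n_prime * c0 + w1 * n_prime * c1 + n2 * c2

print(f"k = {k:.2f}, n' = {n_prime}, n2 = {n2}, expected cost = {expected_cost:.1f}")
```

Keeping *k* unrounded inside the function avoids compounding rounding error; only the final sample sizes are rounded.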

## Variance of a Mixture Distribution

\(X \sim\left(\mu_{X}, (\sigma_{X})^{2}\right)\)

\(Y \sim\left(\mu_{Y}, (\sigma_{Y})^{2}\right)\)

*W* is a mixture of *X* and *Y* with mixing proportions *p* and (1 - *p*):

\(W=pX\oplus (1-p)Y\)

where \(0\leq p \leq 1\)

Then:

\(\mu_{W}=p \mu_{X}+(1-p) \mu_{Y}\)

What is the variance of the overall population?

\((\sigma_{W})^{2}=p (\sigma_{X})^{2}+(1-p) (\sigma_{Y})^{2}+p(1-p)\left(\mu_{X}-\mu_{Y}\right)^{2}\)

You can think of *p* and (1 - *p*) as the weights of the two variables. When one wants to apply this formula to find the mean and variance of a population that consists of a response group and a non-response group, *p* is the proportion of the response group and (1 - *p*) is the proportion of the non-response group.
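As a sanity check on the formula, the following Python sketch compares it with a direct calculation on a tiny pooled population. The function name `mixture_mean_var` and the two small groups are made up for illustration and are not data from the examples above.

```python
from statistics import mean, pvariance


def mixture_mean_var(p, mean_x, var_x, mean_y, var_y):
    """Mean and variance of a mixture with proportions p and 1 - p."""
    mean_w = p * mean_x + (1 - p) * mean_y
    var_w = p * var_x + (1 - p) * var_y + p * (1 - p) * (mean_x - mean_y) ** 2
    return mean_w, var_w


# Two made-up groups; pooling them gives a mixture with p = 3/5.
x = [1, 2, 3]
y = [10, 12]
p = len(x) / (len(x) + len(y))

# Population variances (divide by the group size) are used throughout.
mean_w, var_w = mixture_mean_var(p, mean(x), pvariance(x), mean(y), pvariance(y))
direct_mean, direct_var = mean(x + y), pvariance(x + y)

print(mean_w, direct_mean)
print(var_w, direct_var)
```

The two pairs of values should agree up to floating-point error. The last term of the formula captures the extra spread caused by the gap between the two group means.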

## Designing surveys to reduce non-response

Good survey practice is to discover why the non-response occurs and resolve as many of the problems as possible before commencing the survey.

**Example**: Designing an experiment to find out how best to improve the response rate.

A factorial experiment was employed in the 1992 Census Implementation Test to explore the individual effects and the interactions of 3 factors on the response rate (in %):

- pre-notice letter
- stamped return envelope
- reminder postcard

- Letter, Postcard, Envelope 64.3%
- Letter, Postcard, no envelope 62.7% (not bad!)
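To make the idea of a factorial layout concrete, the sketch below enumerates every on/off combination of the three factors. It assumes a full two-level design, which the text above does not confirm, and it does not reproduce any response rates beyond those already listed.

```python
from itertools import product

factors = ("pre-notice letter", "stamped return envelope", "reminder postcard")

# Each design cell records whether each factor is used (True) or not (False).
design = [
    dict(zip(factors, levels))
    for levels in product((True, False), repeat=len(factors))
]

for cell in design:
    used = [name for name, on in cell.items() if on]
    print(", ".join(used) if used else "no treatment")
```

Crossing all levels lets the interactions between factors be estimated, not just the effect of each factor on its own.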

Some factors that may influence the response rate and data accuracy:

- Survey content: sensitive questions tend to have a high non-response rate. Try the randomized response technique.
- Time of survey: select wisely.
- Data collection method: Computer Assisted Telephone Interviewing (CATI), for example, has been shown to improve data accuracy. With CATI, interview questions are stored in a computer, recalled in programmable sequences, and displayed for each interviewer on a video display terminal. Interviewers enter the answers received by telephone directly into the computer.
- Incentives and penalties

**Note!** The quality of the survey data is largely determined at the design stage.