Nonsampling Error
 Nonsampling error
 The differences between estimates and population quantities do not arise solely from the fact that only a sample, instead of the whole population, is observed.
Here is an example of when the sampling frame does not match up perfectly with the target population.
Example: For telephone surveys, if we are interested in sampling the entire population of a given city, telephone directories are inadequate because they omit unlisted numbers and people, such as the homeless, who do not have telephones.
What would be another example where an important part of the target population would be missed? Nonresponse is another potentially serious source of nonsampling error.
 Nonresponse
 The self-selection of respondents may produce bias. For example, only people with certain opinions will respond to some questions.
In a well-designed research study, handling nonresponse is important. How do we handle this common phenomenon?
An important use of double sampling...
Double sampling can be used to adjust for nonresponse in the form of callbacks.
Nonresponse is an important problem to consider in any survey. We can consider the two groups: response and nonresponse in two strata.
The two steps of double sampling for nonresponse:

Step 1:
A simple random sample of n' units is selected from a population of N units. These units are classified into two strata: response and nonresponse.
\(n'_1\) of these respond (stratum 1)
\(n'_2\) of these do not respond (stratum 2)
Step 2:
Call back \(n_2\) units, selected by simple random sampling from the \(n'_2\) nonrespondents (by giving more incentives, etc.)
Thus, we are in a double sampling setting where \(n_1=n'_1\) and \(n_2\) is the number of callbacks.
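The two steps can be sketched in Python on simulated data. This is only an illustration under stated assumptions: the population, the 45% response probability, the function name `double_sample_for_nonresponse`, and the `responds` flags are all hypothetical, and every selected callback is assumed to respond.

```python
import random


def double_sample_for_nonresponse(y, responds, n_prime, n2, rng):
    """Simulate the two steps of double sampling for nonresponse.

    y        : list of population values (hypothetical, simulated below)
    responds : parallel list of booleans, True if the unit would answer
               the first contact (a simulation device, unknown in practice)
    """
    # Step 1: simple random sample of n' units, classified into two strata.
    first_phase = rng.sample(range(len(y)), n_prime)
    stratum1 = [i for i in first_phase if responds[i]]      # respondents
    stratum2 = [i for i in first_phase if not responds[i]]  # nonrespondents

    # Step 2: simple random sample of callbacks from the nonrespondents.
    callbacks = rng.sample(stratum2, min(n2, len(stratum2)))
    if not stratum1 or not callbacks:
        raise ValueError("both strata must contain at least one response")

    w1 = len(stratum1) / n_prime
    w2 = len(stratum2) / n_prime
    ybar1 = sum(y[i] for i in stratum1) / len(stratum1)
    ybar2 = sum(y[i] for i in callbacks) / len(callbacks)
    return {
        "n1": len(stratum1), "n2_prime": len(stratum2), "n2": len(callbacks),
        "w1": w1, "w2": w2, "ybar1": ybar1, "ybar2": ybar2,
        "ybar_d": w1 * ybar1 + w2 * ybar2,  # weighted combination of strata
    }


# Hypothetical population: nonrespondents tend to have smaller values.
rng = random.Random(0)
responds = [rng.random() < 0.45 for _ in range(1000)]
y = [rng.gauss(20, 6) if r else rng.gauss(11, 5) for r in responds]

result = double_sample_for_nonresponse(y, responds, n_prime=106, n2=20, rng=rng)
naive_mean = result["ybar1"]  # what we would report if we ignored nonresponse
print(result["ybar_d"], naive_mean)
```

Comparing the two printed values shows how far the respondent-only mean can sit from the estimate that also uses the callbacks.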
Example 11-1: Time spent studying
In a college with 1000 students, a questionnaire is mailed to a simple random sample of 106 students asking them about the amount of time they spend per week studying. Out of these students, 46 respond. From the 60 nonrespondents, a simple random sample of 20 is selected and intensive efforts are made by telephone and personal visits to obtain responses. The data obtained are as follows:
| | Students responding to questionnaire | Students contacted and responded to telephone and visit |
|---|---|---|
| Sample mean | 20.5 hours | 10.9 hours |
| Sample st. dev. | 6.2 hours | 5.1 hours |
| Sample size | 46 | 20 |
Now, let's estimate the mean and the variance of that estimate.
Solution
To estimate the average number of hours students spend studying per week, we use double sampling:

Step 1:
106 students are randomly sampled; 46 responded, and 60 were nonrespondents.
\(n'=106,\ n'_1=46,\ n'_2=60\)
\(w_1=\dfrac{n'_1}{n'}=\dfrac{46}{106}=0.434\)
\(w_2=\dfrac{n'_2}{n'}=\dfrac{60}{106}=0.566\)

Step 2:
From the 60 nonrespondents, a simple random sample of 20 students was selected and the following responses were obtained:
\(n_1=46,\ n_2=20,\ \bar{y}_1=20.5,\ \bar{y}_2=10.9\)
The estimate for the mean is:
\begin{align}
\bar{y}_d &= w_1\bar{y}_1+w_2\bar{y}_2\\
&= 0.434 \times 20.5+ 0.566 \times 10.9\\
&= 15.0664\\
\end{align}
And, the estimated variance of this estimate is:
\begin{align}
\hat{V}ar(\bar{y}_d)&= \dfrac{N-n'}{N(n'-1)}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+\dfrac{N-1}{N}\sum\limits_{h=1}^L \left(\dfrac{n'_h-1}{n'-1}-\dfrac{n_h-1}{N-1}\right)\dfrac{w_h s^2_h}{n_h}\\
&= \dfrac{1000-106}{1000(106-1)}\cdot [0.434(20.5-15.066)^2+0.566(10.9-15.066)^2]\\
&\quad+ \dfrac{1000-1}{1000}\left[\left(\dfrac{46-1}{106-1}-\dfrac{46-1}{1000-1}\right)\dfrac{0.434(6.2)^2}{46}+\left(\dfrac{60-1}{106-1}-\dfrac{20-1}{1000-1}\right)\dfrac{0.566(5.1)^2}{20}\right]\\
&= 0.1928+0.5382\\
&= 0.731\\
\end{align}
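The arithmetic in this example can be checked with a short Python sketch. It reads the variance formula with the differences \(N-n'\), \(n'-1\), \(\bar{y}_h-\bar{y}_d\), \(N-1\), \(n'_h-1\) and \(n_h-1\), and uses unrounded weights, so the last digits may differ slightly from the hand calculation. The function name and the dictionary keys are illustrative choices, not part of the original notes.

```python
def double_sampling_estimate(N, n_prime, strata):
    """Estimate of the mean and its variance under double sampling.

    strata: one dict per stratum with keys
        n_prime_h (first-phase count), n_h (second-phase count),
        ybar (sample mean), s (sample standard deviation).
    """
    weights = [h["n_prime_h"] / n_prime for h in strata]
    ybar_d = sum(w * h["ybar"] for w, h in zip(weights, strata))

    # First term: spread of the stratum means around the overall estimate.
    between = (N - n_prime) / (N * (n_prime - 1)) * sum(
        w * (h["ybar"] - ybar_d) ** 2 for w, h in zip(weights, strata)
    )
    # Second term: sampling variability within each stratum.
    within = (N - 1) / N * sum(
        ((h["n_prime_h"] - 1) / (n_prime - 1) - (h["n_h"] - 1) / (N - 1))
        * w * h["s"] ** 2 / h["n_h"]
        for w, h in zip(weights, strata)
    )
    return ybar_d, between, within


strata = [
    {"n_prime_h": 46, "n_h": 46, "ybar": 20.5, "s": 6.2},  # respondents
    {"n_prime_h": 60, "n_h": 20, "ybar": 10.9, "s": 5.1},  # callbacks
]
ybar_d, between, within = double_sampling_estimate(N=1000, n_prime=106, strata=strata)
variance = between + within
print(round(ybar_d, 3), round(between, 4), round(within, 4), round(variance, 3))
```

The printed values should agree with the worked solution to about three decimal places.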
Selecting the Number of Callbacks
 \(c_0\): the initial cost of sampling each respondent (the setup cost for each respondent)
 \(c_1\): the cost of a standard response (cost of producing the response)
 \(c_2\): the cost of a callback response
Total cost =\(\left(n^{\prime} \times c_{0}\right)+\left(n_{1}^{\prime} \times c_{1}\right)+\left(n_{2} \times c_{2}\right)\)
We want to determine the value of k (k > 1), where \(n_2=\dfrac{n'_2}{k}\).
As \(\bar{y}_d= \sum\limits_{h=1}^2 w_h \bar{y}_h\), its variance can be derived, and one can find the values of k and n' that minimize the expected cost of sampling for a desired fixed value of \(\hat{V}ar(\bar{y}_d)\), which we denote as \(V_0\).
When N is large, the optimal values of k and n' are:
\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}\)
\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}\)
where \(\sigma^2\) is the variance of the entire population and \(\sigma_2^2\) is the variance of the nonresponse group. [Sometimes, we are only given the variance of the response group and nonresponse group. In such a case, we can use the formula for the variance of a mixture distribution provided after the Example to compute the variance of the entire population.]
Example 11-2: Weekly living expenditures
In a college of 1000 students, we want to find out students' average weekly living expenditure. The response rate is anticipated to be about 60%. It is thought that the response group has a higher variance than the nonresponse group. The overall variance is \(\sigma^2 \approx 120\), the variance of the nonresponse group is \(\sigma_2^2 \approx 80\), and the costs are \(c_0=0,\ c_1=1,\ c_2=4\).
\(w_1=0.6\)
\(w_2=1-w_1=0.4\)
The optimal value for k is:
\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}=\sqrt{\dfrac{4\times (120-(0.4\times 80))}{80(0+(1\times 0.6))}}=2.71\)
Note: \(V_0=5\)
\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}=\dfrac{1000(120+(2.71-1)0.4\times 80)}{1000\times 5+120}=34.1\)
Round up to 35.
\(n'_2=0.4\times 35=14\)
We should sample 35 students in step 1 and call back
\(n_2=\dfrac{0.4\times 35}{2.71}=5.2\), rounded up to 6, in step 2.
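As a check on this example, here is a minimal Python sketch of the design calculation. It reads the formulas as \(k=\sqrt{c_2(\sigma^2-w_2\sigma_2^2)/(\sigma_2^2(c_0+c_1w_1))}\) and \(n'=N(\sigma^2+(k-1)w_2\sigma_2^2)/(NV_0+\sigma^2)\). The function and variable names are illustrative, and the expected-cost line assumes that the number of first-step respondents equals its expected value \(w_1 n'\).

```python
import math


def optimal_callback_design(N, var_total, var_nonresp, w1, c0, c1, c2, V0):
    """Return the optimal k and the (unrounded) first-phase sample size n'."""
    w2 = 1 - w1
    k = math.sqrt(
        c2 * (var_total - w2 * var_nonresp) / (var_nonresp * (c0 + c1 * w1))
    )
    n_prime = N * (var_total + (k - 1) * w2 * var_nonresp) / (N * V0 + var_total)
    return k, n_prime


w1, c0, c1, c2 = 0.6, 0, 1, 4
k, n_prime_raw = optimal_callback_design(
    N=1000, var_total=120, var_nonresp=80, w1=w1, c0=c0, c1=c1, c2=c2, V0=5
)

n_prime = math.ceil(n_prime_raw)            # round up the first-phase size
n2 = math.ceil((1 - w1) * n_prime / k)      # callbacks among nonrespondents

# Expected total cost = n' * c0 + n'_1 * c1 + n_2 * c2,
# with n'_1 replaced by its expected value w1 * n' (an assumption).
expected_cost = n_prime * c0 + w1 * n_prime * c1 + n2 * c2
print(round(k, 2), n_prime, n2, expected_cost)
```

Because k is not rounded to 2.71 before it is used, the intermediate value of n' differs slightly from the hand calculation, but the rounded sample sizes are the same.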
Variance of a Mixture Distribution
\(X \sim\left(\mu_{X}, (\sigma_{X})^{2}\right)\)
\(Y \sim\left(\mu_{Y}, (\sigma_{Y})^{2}\right)\)
W is a mixture of pX and (1 - p)Y
\(W=pX\oplus (1-p)Y\)
where \(0\leq p \leq 1\)
Then:
\(\mu_{W}=p \mu_{X}+(1-p) \mu_{Y}\)
What is the variance of the overall population?
\((\sigma_{W})^{2}=p (\sigma_{X})^{2}+(1-p) (\sigma_{Y})^{2}+p(1-p)\left(\mu_{X}-\mu_{Y}\right)^{2}\)
You can think of these p's as weights of the two variables. When one wants to apply this formula to find the mean and variance of a population that consists of the response and the nonresponse group, then p is the proportion of the response group and (1 - p) the proportion of the nonresponse group.
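A small Python sketch of the mixture formulas follows, reading them as \(\mu_W=p\mu_X+(1-p)\mu_Y\) and \(\sigma_W^2=p\sigma_X^2+(1-p)\sigma_Y^2+p(1-p)(\mu_X-\mu_Y)^2\). The function name and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mixture_mean_var(p, mu_x, var_x, mu_y, var_y):
    """Mean and variance of a mixture with weight p on X and (1 - p) on Y."""
    mu_w = p * mu_x + (1 - p) * mu_y
    var_w = (
        p * var_x
        + (1 - p) * var_y
        + p * (1 - p) * (mu_x - mu_y) ** 2  # extra spread from differing means
    )
    return mu_w, var_w


# Hypothetical inputs: 60% response group, 40% nonresponse group.
mu_w, var_w = mixture_mean_var(p=0.6, mu_x=20, var_x=100, mu_y=10, var_y=80)
print(mu_w, var_w)  # approximately 16 and 116
```

The last term shows why the overall variance can exceed both group variances when the group means differ.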
Designing surveys to reduce nonresponse
Good survey practice is to discover why the nonresponse occurs and resolve as many of the problems as possible before commencing the survey.
Example: Designing an experiment to find out how best to improve the response rate.
A factorial experiment was employed in the 1992 Census Implementation Test to explore the individual effects and the interactions of 3 factors on the response rate (in %):
 prenotice letter
 stamped return envelope
 reminder postcard
 Letter, postcard, envelope: 64.3%
 Letter, postcard, no envelope: 62.7% (not bad!)
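To see what a factorial layout with these three factors looks like, the sketch below enumerates the treatment combinations and compares the two response rates quoted above. Only those two cells are filled in, because the other rates are not given here. The tuple ordering (letter, envelope, postcard) and the variable names are assumptions made for illustration.

```python
from itertools import product

factors = ("prenotice letter", "stamped return envelope", "reminder postcard")

# Each factor is either used (True) or not (False): 2 x 2 x 2 = 8 cells.
design = list(product((True, False), repeat=len(factors)))

# Response rates (%) for the two cells quoted in the text.
# Keys follow the order (letter, envelope, postcard).
reported = {
    (True, True, True): 64.3,   # letter, postcard, envelope
    (True, False, True): 62.7,  # letter, postcard, no envelope
}

# Difference attributable to the envelope when letter and postcard are used.
envelope_difference = reported[(True, True, True)] - reported[(True, False, True)]
print(len(design), round(envelope_difference, 1))
```

With all eight cells observed, the same kind of comparison averaged over the other factors would give each main effect, and contrasts between such comparisons would give the interactions.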
Some factors that may influence the response rate and data accuracy:
 Survey content: sensitive questions will have a high nonresponse rate. Try the randomized response technique.
 Time of survey: select wisely
 Data collection method: a method such as Computer Assisted Telephone Interviewing (CATI) has been shown to improve data accuracy. In CATI, interview questions are stored in a computer, recalled in programmable sequences, and displayed for each interviewer on a video display terminal. Interviewers enter answers received via telephone directly into the computer right away.
 Incentives and penalties
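For the randomized response technique mentioned under survey content, one common variant is Warner's design, and that choice is an assumption here since the text does not say which variant is meant. Each respondent privately uses a random device that, with known probability P, presents the statement "I have the sensitive trait" and otherwise presents its negation; the respondent answers yes or no truthfully, and the interviewer never learns which statement was answered. The sketch below, with hypothetical counts, recovers the estimated proportion with the trait.

```python
def warner_estimate(yes_count, n, p_statement):
    """Estimate a sensitive proportion under Warner's randomized response design.

    yes_count   : number of "yes" answers
    n           : number of respondents
    p_statement : known probability that the device presents the
                  sensitive statement (must differ from 0.5)
    """
    if p_statement == 0.5:
        raise ValueError("p_statement = 0.5 makes the proportion unidentifiable")
    lam = yes_count / n  # observed proportion of "yes" answers
    # P(yes) = p * pi + (1 - p) * (1 - pi), solved for pi:
    pi_hat = (lam - (1 - p_statement)) / (2 * p_statement - 1)
    # Large-sample variance estimate of pi_hat.
    var_hat = lam * (1 - lam) / (n * (2 * p_statement - 1) ** 2)
    return pi_hat, var_hat


# Hypothetical survey: 200 respondents, 80 "yes" answers, P = 0.7.
pi_hat, var_hat = warner_estimate(yes_count=80, n=200, p_statement=0.7)
print(round(pi_hat, 3), round(var_hat, 4))
```

The price of the added privacy is a larger variance than a direct question would give, which grows as P approaches 0.5.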