# Lesson 11: Applied Problems for Survey Sampling

In Section 11.1, we discuss non-sampling errors, which tend to persist even as the sample size gets larger and larger. We then focus on one common non-sampling error: non-response. To address this problem, we show how double sampling, in the form of call backs, can be used to adjust for non-response. An example illustrates how to compute the estimate and its estimated variance. In addition, we provide the formula for selecting the optimal number of call backs. In the last part of Section 11.1, we discuss how to design surveys to reduce non-response.

In Section 11.2, the technique of interpenetrating subsamples is discussed. An example shows how this technique takes the interviewer effect into consideration.

In Section 11.3, we discuss the case when one does not know whether an element belongs to the subpopulation until after it has been sampled and how to estimate the mean and total of this subpopulation. An example is then given to illustrate the method.

*Sampling* by Steven Thompson, 3rd edition

## Objectives

- understand the difference between sampling error and non-sampling error,
- use double sampling to adjust for non-response by callbacks,
- find the optimal allocation for the number of callbacks,
- use the interpenetrating subsample technique to account for the interviewer effect, and
- estimate the mean and total over a subpopulation.

# 11.1 - Double Sampling for Nonresponse

## Non-Sampling Error

- **Non-sampling error**: the differences between estimates and population quantities that do not arise solely from the fact that only a sample, instead of the whole population, is observed.

Here is an example of when the sampling frame does not match up perfectly with the target population.

**Example**: For telephone surveys, if we are interested in sampling the entire population in a given city, telephone directories are inadequate because of telephone numbers that are unlisted or the homeless that do not have telephones.

Another way in which an important part of the target population can be missed is non-response, which can be a serious source of non-sampling error.

- **Non-response**: the self-selection of respondents may produce bias. For example, only people with certain opinions may respond to some questions.

In a well-designed research study, handling non-response is important. How do we handle this common phenomenon?

**An important use of double sampling...**

Double sampling can be used to adjust for non-response in the form of call backs.

Non-response is an important problem to consider in any survey. We can treat the respondents and the non-respondents as two strata.

The two steps of double sampling for non-response:

**Step 1:** An initial simple random sample of *n*' units is selected from a population of *N* units. These units are classified into two strata: response and non-response.

- \(n'_1\) of these respond (stratum 1)
- \(n'_2\) of these do not respond (stratum 2)

**Step 2:** Call back \(n_2\) units, selected by simple random sampling from the \(n'_2\) non-respondents (for example, by offering more incentives).

Thus, we are in a double sampling setting where \(n_1=n'_1\) and \(n_2\) is the number of call backs.

## Example 11-1: Time spent studying

In a college with 1000 students, a questionnaire is mailed to a simple random sample of 106 students asking them about the amount of time they spend per week studying. Out of these students, 46 respond. From the 60 non-respondents, a simple random sample of 20 is selected and intensive efforts are made by telephone and personal visit to obtain responses. The data obtained are as follows:

| | Students responding to questionnaire | Students contacted and responded to telephone and visit |
|---|---|---|
| Sample mean | 20.5 hours | 10.9 hours |
| Sample st. dev. | 6.2 hours | 5.1 hours |
| Sample size | 46 | 20 |

Now, let's estimate the mean and also the variance of the estimate.

#### Solution

To estimate the average number of hours students spend studying per week, double sampling is used:

**Step 1:** 106 students are randomly sampled; 46 respond and 60 do not.

\(n'=106,\ n'_1=46,\ n'_2=60\)

\(w_1=\dfrac{n'_1}{n'}=\dfrac{46}{106}=0.434\)

\(w_2=\dfrac{n'_2}{n'}=\dfrac{60}{106}=0.566\)

**Step 2:** From the 60 non-respondents, a simple random sample of 20 students was selected and the following responses were obtained:

\(n_1=46,\ n_2=20,\ \bar{y}_1=20.5,\ \bar{y}_2=10.9\)

#### Try it!

The estimate for the mean is:

\begin{align}
\bar{y}_d &= w_1\bar{y}_1+w_2\bar{y}_2\\
&= 0.434 \times 20.5+ 0.566 \times 10.9\\
&= 15.0664
\end{align}

And, the estimated variance of this estimate is:

\begin{align}
\hat{V}ar(\bar{y}_d)&= \dfrac{N-n'}{N(n'-1)}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+\dfrac{N-1}{N}\sum\limits_{h=1}^L \left(\dfrac{n'_h-1}{n'-1}- \dfrac{n_h-1}{N-1}\right)\dfrac{w_h s^2_h}{n_h}\\
&= \dfrac{1000-106}{1000(106-1)}\cdot [0.434(20.5-15.066)^2+0.566(10.9-15.066)^2]\\
&\quad + \dfrac{1000-1}{1000}\left[\left(\dfrac{46-1}{106-1}-\dfrac{46-1}{1000-1}\right)\dfrac{0.434(6.2)^2}{46}+\left(\dfrac{60-1}{106-1}-\dfrac{20-1}{1000-1}\right)\dfrac{0.566(5.1)^2}{20}\right]\\
&= 0.1928+0.5382\\
&= 0.731
\end{align}
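The calculation in Example 11-1 can be checked numerically. The following is a minimal sketch, not part of the original lesson: the function name `double_sampling_nonresponse` and the tuple layout for each stratum are assumptions made for illustration, and the code simply evaluates the estimator and the variance formula stated above.

```python
from math import sqrt


def double_sampling_nonresponse(N, strata):
    """Double-sampling estimate of the mean and its estimated variance.

    `strata` is a list of (n_prime_h, n_h, ybar_h, s_h) tuples (an
    illustrative layout): first-phase count, second-phase count,
    sample mean, and sample standard deviation for each stratum.
    """
    n_prime = sum(s[0] for s in strata)
    weights = [s[0] / n_prime for s in strata]  # w_h = n'_h / n'

    # Point estimate: weighted average of the stratum means.
    y_d = sum(w * s[2] for w, s in zip(weights, strata))

    # First term: spread of the stratum means around the estimate.
    between = sum(w * (s[2] - y_d) ** 2 for w, s in zip(weights, strata))
    term1 = (N - n_prime) / (N * (n_prime - 1)) * between

    # Second term: within-stratum variability.
    within = sum(
        ((s[0] - 1) / (n_prime - 1) - (s[1] - 1) / (N - 1)) * w * s[3] ** 2 / s[1]
        for w, s in zip(weights, strata)
    )
    term2 = (N - 1) / N * within
    return y_d, term1 + term2


# Example 11-1: respondents (46 of 46 observed), non-respondents (20 of 60 called back).
y_d, var_y_d = double_sampling_nonresponse(
    N=1000,
    strata=[(46, 46, 20.5, 6.2), (60, 20, 10.9, 5.1)],
)
print(round(y_d, 3), round(var_y_d, 3), round(sqrt(var_y_d), 3))
```

Because the code uses unrounded weights (46/106 and 60/106), its output can differ from the hand calculation in the last decimal place.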

## Selecting the Number of Call Backs

- \(c_0\): the initial cost of sampling each respondent (the set-up cost for each respondent)
- \(c_1\): the cost of a standard response (cost of producing the response)
- \(c_2\): the cost of a call back response

Total cost = \(\left(n^{\prime} \times c_{0}\right)+\left(n_{1}^{\prime} \times c_{1}\right)+\left(n_{2} \times c_{2}\right)\)

We want to determine the value of *k* (*k* > 1), where \(n_2=\dfrac{n'_2}{k}\); that is, one out of every *k* non-respondents is called back.

Since \(\bar{y}_d= \sum\limits_{h=1}^2 w_h \bar{y}_h\), its variance can be derived, and one can find the values of *k* and *n*' that minimize the expected cost of sampling for a desired fixed value of \(\hat{V}ar(\bar{y}_d)\), which we denote by \(V_0\).

When *N* is large, the optimal values of *k* and *n*' are:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}\)

where \(\sigma^2\) is the variance of the entire population and \(\sigma_2^2\) is the variance of the non-response group. *[Sometimes we are given only the variances of the response group and the non-response group. In such a case, we can use the formula for the variance of a mixture distribution, given after Example 11-2, to compute the variance of the entire population.]*

## Example 11-2: Weekly living expenditures

In a college of 1000 students, we want to find out students' average weekly living expenditure. The response rate is anticipated to be about 60%. It is thought that the response group has a higher variance than the non-response group. The overall variance is \(\sigma^2 \approx 120\), the variance of the non-response group is \(\sigma_2^2 \approx 80\), and the costs are \(c_0=0\), \(c_1=1\), \(c_2=4\).

#### Try it!

Find *k*, *n*', and \(n_2\) so that the variance of the resulting estimator is approximately 5 units.

\(w_1=0.6\)

\(w_2=1-w_1=0.4\)

The optimal value for *k* is:

\(k=\sqrt{\dfrac{c_2(\sigma^2-w_2\sigma^2_2)}{\sigma^2_2(c_0+c_1w_1)}}=\sqrt{\dfrac{4\times (120-(0.4\times 80))}{80(0+(1\times 0.6))}}=2.71\)

Note: \(V_0=5\)

\(n'=\dfrac{N(\sigma^2+(k-1)w_2\sigma^2_2)}{NV_0+\sigma^2}=\dfrac{1000(120+(2.71-1)0.4\times 80)}{1000\times 5+120}=34.1\)

Round up to 35.

\(n'_2=0.4\times 35=14\)

We should sample 35 students in step 1 and, in step 2, call back

\(n_2=\dfrac{0.4\times 35}{2.71}=5.2\), which we round up to 6.
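The two large-*N* formulas above are easy to evaluate in code. The sketch below is illustrative only: the function name `optimal_callback_design` and its parameter names are assumptions, and rounding both sample sizes up follows the choice made in Example 11-2.

```python
from math import ceil, sqrt


def optimal_callback_design(N, sigma2, sigma2_nr, w2, c0, c1, c2, V0):
    """Return (k, n_prime) from the large-N formulas, before rounding.

    sigma2    : variance of the whole population
    sigma2_nr : variance of the non-response group
    w2        : anticipated non-response proportion
    V0        : target variance of the estimator
    """
    w1 = 1 - w2
    k = sqrt(c2 * (sigma2 - w2 * sigma2_nr) / (sigma2_nr * (c0 + c1 * w1)))
    n_prime = N * (sigma2 + (k - 1) * w2 * sigma2_nr) / (N * V0 + sigma2)
    return k, n_prime


# Example 11-2 inputs.
k, n_prime_raw = optimal_callback_design(
    N=1000, sigma2=120, sigma2_nr=80, w2=0.4, c0=0, c1=1, c2=4, V0=5
)
n_prime = ceil(n_prime_raw)           # first-phase sample size, rounded up
expected_nonresp = 0.4 * n_prime      # anticipated number of non-respondents
n2 = ceil(expected_nonresp / k)       # number of call backs, rounded up
print(round(k, 2), n_prime, n2)
```

Rounding up rather than to the nearest integer is the conservative choice here, since a smaller sample would push the variance above the target \(V_0\).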

## Variance of a Mixture Distribution

\(X \sim\left(\mu_{X}, (\sigma_{X})^{2}\right)\)

\(Y \sim\left(\mu_{Y}, (\sigma_{Y})^{2}\right)\)

*W* is a mixture of *X* and *Y* with mixing proportions *p* and 1 - *p*:

\(W=pX\oplus (1-p)Y\)

where \(0\leq p \leq 1\)

Then:

\(\mu_{W}=p \mu_{X}+(1-p) \mu_{Y}\)

The variance of the overall population is:

\((\sigma_{W})^{2}=p (\sigma_{X})^{2}+(1-p) (\sigma_{Y})^{2}+p(1-p)\left(\mu_{X}-\mu_{Y}\right)^{2}\)

You can think of *p* and 1 - *p* as the weights of the two variables. When this formula is applied to find the mean and variance of a population that consists of a response group and a non-response group, *p* is the proportion in the response group and 1 - *p* is the proportion in the non-response group.
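As a quick sanity check of the mixture formulas, the sketch below compares them with the mean and variance computed directly from a small combined population. The two groups of values are hypothetical numbers chosen for illustration, and the function name `mixture_mean_var` is an assumption, not part of the lesson.

```python
from statistics import mean, pvariance


def mixture_mean_var(p, mu_x, var_x, mu_y, var_y):
    """Mean and variance of a mixture with proportions p and 1 - p."""
    mu_w = p * mu_x + (1 - p) * mu_y
    var_w = p * var_x + (1 - p) * var_y + p * (1 - p) * (mu_x - mu_y) ** 2
    return mu_w, var_w


# Hypothetical finite population: 6 "respondents" and 4 "non-respondents".
group_x = [1, 2, 3, 4, 5, 6]
group_y = [10, 12, 14, 16]
p = len(group_x) / (len(group_x) + len(group_y))

mu_w, var_w = mixture_mean_var(
    p,
    mean(group_x), pvariance(group_x),   # population (not sample) variances
    mean(group_y), pvariance(group_y),
)

# Direct calculation on the combined population for comparison.
combined = group_x + group_y
direct_mean, direct_var = mean(combined), pvariance(combined)
print(mu_w, direct_mean, var_w, direct_var)
```

The formula uses population variances (divisor equal to the group size), which is why `pvariance` rather than `variance` is used in this check.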

## Designing surveys to reduce non-response

Good survey practice is to discover why the non-response occurs and resolve as many of the problems as possible before commencing the survey.

**Example**: Designing an experiment to find out how best to improve the response rate.

A factorial experiment was employed in the 1992 Census Implementation Test to explore the individual effects and the interactions of 3 factors on the response rate (in %):

- pre-notice letter
- stamped return envelope
- reminder postcard

Response rates for two of the factor combinations:

- Letter, postcard, envelope: 64.3%
- Letter, postcard, no envelope: 62.7% (not bad!)

Some factors that may influence the response rate and data accuracy:

- Survey content: sensitive questions tend to have a high non-response rate. Consider the randomized response technique.
- Time of survey: select it wisely.
- Data collection method: Computer Assisted Telephone Interviewing (CATI), for example, has been shown to improve data accuracy. In CATI, the interview questions are stored in a computer, recalled in programmable sequences, and displayed for each interviewer on a video display terminal. Interviewers enter the answers received by telephone directly into the computer.
- Incentives and penalties

**Note!** The quality of the survey data is largely determined at the design stage.

# 11.2 - Interpenetrating Subsample

There are *k* interviewers, each with a different manner of interviewing, so they may obtain slightly different responses. To keep the notation simple, we assume that each interviewer conducts the same number of interviews. Let *n* denote the total sample size, with \(n = k \times m\). There are *k* subsamples, and each interviewer is assigned *m* subjects.

Objective: to use simple random sampling to estimate \(\mu\)

- Interviewer 1: \(y_{11}, y_{12}, y_{13},\ldots,y_{1m}\)
- Interviewer 2: \(y_{21}, y_{22}, y_{23},\ldots,y_{2m}\)
- Interviewer 3: \(y_{31}, y_{32}, y_{33},\ldots,y_{3m}\)
- \(\vdots\)
- Interviewer *k*: \(y_{k1}, y_{k2}, y_{k3},\ldots,y_{km}\)

The average for the *i*th interviewer is denoted as:

\(\bar{y}_i=\dfrac{1}{m}\sum\limits_{j=1}^m y_{ij}\)

The grand average is denoted as:

\(\bar{y}=\dfrac{1}{k}\sum\limits_{i=1}^k \bar{y}_i\)

The grand average \(\bar{y}\) is unbiased for \(\mu\), and the estimated variance of \(\bar{y}\) is:

\(\hat{V}ar(\bar{y})=\dfrac{N-n}{N}\cdot \dfrac{s^2_k}{k}\)

\(\text{where } s^2_k=\dfrac{\sum\limits_{i=1}^k (\bar{y}_i-\bar{y})^2}{k-1}\)

The technique of interpenetrating subsamples gives an estimate of the variance of \(\bar{y}\) that accounts for interviewer biases. In practice, the estimated variance given by the above formula is usually larger than the variance estimate obtained by treating the data as a single simple random sample.

## Example 11-3: Interpenetrating subsample

A researcher has 10 research assistants, each with his or her own equipment, who measure the time (in seconds) it takes for people to respond to a command. A simple random sample of 80 people is taken. Since the researcher believes the assistants will produce slightly biased measurements, he decides to randomly divide the 80 people into 10 subsamples of 8 persons each. Each assistant is then assigned to one subsample. The measurements are given in the following table.

| Assistant | Time it takes to respond (seconds) | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 1 | 52 | 73 | 62 | 75 | 71 | 68 | 55 | 65 |
| 2 | 62 | 65 | 73 | 67 | 78 | 71 | 67 | 59 |
| 3 | 43 | 54 | 52 | 48 | 56 | 51 | 62 | 57 |
| 4 | 73 | 64 | 63 | 59 | 71 | 78 | 67 | 76 |
| 5 | 88 | 76 | 69 | 83 | 85 | 66 | 74 | 73 |
| 6 | 55 | 71 | 63 | 75 | 68 | 72 | 69 | 60 |
| 7 | 72 | 65 | 77 | 69 | 74 | 82 | 73 | 67 |
| 8 | 55 | 43 | 58 | 62 | 42 | 61 | 53 | 61 |
| 9 | 62 | 52 | 59 | 63 | 69 | 72 | 64 | 58 |
| 10 | 77 | 65 | 79 | 69 | 72 | 68 | 71 | 67 |

##### Minitab output:

| | Mean |
|---|---|
| Subsample 1 | 65.125 |
| Subsample 2 | 67.750 |
| Subsample 3 | 52.875 |
| Subsample 4 | 68.875 |
| Subsample 5 | 76.750 |
| Subsample 6 | 66.625 |
| Subsample 7 | 72.375 |
| Subsample 8 | 54.375 |
| Subsample 9 | 62.375 |
| Subsample 10 | 71.000 |

#### Try it!

We estimate the mean by:

\(\bar{y}=\dfrac{1}{10}(\bar{y}_1+\bar{y}_2+\ldots+\bar{y}_{10})=\dfrac{1}{10}(65.125+\ldots+71.000)=65.81\)

Its variance, omitting the finite population correction because *N* is not given, is estimated to be:

\(\hat{V}ar(\bar{y})=\dfrac{\sum\limits_{i=1}^k (\bar{y}_i-65.81)^2}{(10-1)\times 10}=5.71\)

\(\hat{S}D(\bar{y})=2.39\)

If one neglects the interviewer effect and treats the 80 measurements as a single simple random sample, then \(\hat{S}D(\bar{y})\approx 1\). It is therefore important to take the interviewer effect into consideration; otherwise, one underestimates \(\hat{S}D(\bar{y})\).
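The comparison in Example 11-3 can be reproduced from the data table. This is a minimal sketch under stated assumptions: the function name `interpenetrating_estimate` and the optional `N` argument are illustrative, and because the population size is not given in the example, the finite population correction is set to 1.

```python
from math import sqrt
from statistics import mean, variance

# Response times (seconds), one row per research assistant (Example 11-3).
times = [
    [52, 73, 62, 75, 71, 68, 55, 65],
    [62, 65, 73, 67, 78, 71, 67, 59],
    [43, 54, 52, 48, 56, 51, 62, 57],
    [73, 64, 63, 59, 71, 78, 67, 76],
    [88, 76, 69, 83, 85, 66, 74, 73],
    [55, 71, 63, 75, 68, 72, 69, 60],
    [72, 65, 77, 69, 74, 82, 73, 67],
    [55, 43, 58, 62, 42, 61, 53, 61],
    [62, 52, 59, 63, 69, 72, 64, 58],
    [77, 65, 79, 69, 72, 68, 71, 67],
]


def interpenetrating_estimate(subsamples, N=None):
    """Grand mean and its estimated variance from k equal-sized subsamples."""
    k = len(subsamples)
    sub_means = [mean(s) for s in subsamples]
    grand = mean(sub_means)
    s2_k = variance(sub_means)            # sample variance of the k subsample means
    n = sum(len(s) for s in subsamples)
    fpc = (N - n) / N if N is not None else 1.0  # N unknown here, so fpc = 1
    return grand, fpc * s2_k / k


grand_mean, var_hat = interpenetrating_estimate(times)
sd_hat = sqrt(var_hat)

# Naive alternative: pool all 80 values as one simple random sample.
all_obs = [y for row in times for y in row]
naive_sd = sqrt(variance(all_obs) / len(all_obs))

print(round(grand_mean, 2), round(sd_hat, 2), round(naive_sd, 2))
```

The naive standard deviation comes out at roughly half of the interpenetrating one, which is the underestimation the example warns about.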

# 11.3 - Estimation of means and totals over subpopulation

Quite often, it is impossible to obtain a frame that lists only those elements of the population that one is interested in. For example, perhaps you want to sample households with children, but the best frame available is a list of all households. Therefore, we wish to estimate the parameters of a subpopulation of the population represented in the frame.

**Main Issue**: You do not know the size of the subpopulation.

#### Notation:

- *N* - the number of elements in the population
- \(N_1\) - the number of elements in the subpopulation
- *n* - the sample size from the population
- \(n_1\) - the number of sampled elements from the subpopulation
- \(y_{1j}\) - the *j*th sampled observation that falls in the subpopulation

An unbiased estimator of \(\mu_1\), the subpopulation mean, is:

\(\bar{y}_1=\dfrac{1}{n_1}\sum\limits_{j=1}^{n_1} y_{1j}\)

Its variance is estimated by:

\(\hat{V}ar(\bar{y}_1)=\left(\dfrac{N_1-n_1}{N_1}\right)\dfrac{s^2_1}{n_1}\)

\(\text{where } s^2_1=\dfrac{\sum\limits_{j=1}^{n_1} (y_{1j}-\bar{y}_1)^2}{n_1-1}\)

Usually we do not know \(N_1\), so we estimate the finite population correction factor \(\dfrac{N_1-n_1}{N_1}\) by \(\dfrac{N-n}{N}\).

## Example 11-4: Amount spent on food

Let's say we want to estimate the average weekly amount spent on food by married graduate students in a certain college at Penn State. There are 80 graduate students in the college. Fifteen are sampled, and 10 of them are married. A summary of the data follows:

| Variable | Marital status | N | Mean | SE Mean | StDev |
|---|---|---|---|---|---|
| food cost | m | 10 | 135.3 | 14.1 | 44.4 |
| | s | 5 | 87.60 | 9.73 | 21.76 |

#### Try it!

The average food cost for married students is:

\(\bar{y}_m=135.3\)

The estimated variance and standard deviation of this estimate are:

\(\hat{V}ar(\bar{y}_m)=\dfrac{80-15}{80}\cdot \dfrac{44.4^2}{10}=160.173\)

\(\hat{S}D(\bar{y}_m)=12.656\)
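The subpopulation calculation can be written as a short function. This is a sketch from summary statistics only, since the raw food-cost data are not listed in the example; the function name `subpop_mean_variance` and its parameter names are assumptions for illustration. It uses the substitution of \((N-n)/N\) for the unknown \((N_1-n_1)/N_1\) described above.

```python
from math import sqrt


def subpop_mean_variance(N, n, n1, s1):
    """Estimated variance and SD of a subpopulation sample mean.

    N, n : population size and overall sample size
    n1   : number of sampled elements that fall in the subpopulation
    s1   : sample standard deviation within the subpopulation
    """
    fpc = (N - n) / N          # stands in for (N1 - n1) / N1, since N1 is unknown
    var_hat = fpc * s1 ** 2 / n1
    return var_hat, sqrt(var_hat)


# Example 11-4: 80 graduate students, 15 sampled, 10 married, StDev 44.4.
var_hat_m, sd_hat_m = subpop_mean_variance(N=80, n=15, n1=10, s1=44.4)
print(round(var_hat_m, 3), round(sd_hat_m, 3))
```

Note that the estimated standard deviation is smaller than the "SE Mean" of 14.1 in the summary table, because the table value does not include the finite population correction.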