10  Double or Two-Phase Sampling

Overview

In Section 10.1, we introduce double sampling and discuss its application to ratio estimation. We then provide the formula for the variance of the ratio estimator when double sampling is used. An example illustrates how to conduct double sampling and how to compute the ratio estimator and its estimated variance. The allocation in double sampling is then discussed.

In Section 10.2, double sampling for stratification is discussed. An example is then used to illustrate the use of double sampling for stratification and the computation of the estimate as well as the estimated variance of the estimate.

Lesson 10: Ch. 14.1-14.3 of Sampling by Steven Thompson, 3rd Edition.

Objectives

By the end of this lesson you should be able to:

  1. Use double sampling to collect information for ratio estimation,
  2. Compute the optimal allocation in double sampling for ratio estimation,
  3. Use double sampling to collect information for stratification,
  4. Compute the estimate when double sampling is used to collect information for stratification, and
  5. Compute the estimated variance of the estimate when double sampling is used to collect information for stratification.

10.1 Double Sampling for Ratio Estimation

What is double sampling?

Double sampling refers to designs in which a first sample of units is selected to obtain auxiliary information only, and then a second sample is selected in which the variable of interest is observed in addition to the auxiliary information.

Double sampling is also called two-phase sampling. It is useful in obtaining auxiliary variables for ratio and regression estimation. Double sampling is also useful for finding information for stratified sampling.

Ratio estimation with double sampling

  • \(y_i\) - variable of interest
  • \(x_i\) - auxiliary variable
  • \(n'\) - number of units in the first sample (which includes the second sample)
  • \(n\) - number of units in the second sample

Both \(x_i\) and \(y_i\) are observed only in the second sample. In the remaining units (those in the first sample but not the second), \(x_i\) is observed but \(y_i\) is not. Note that observing the \(y_i\)'s is expensive whereas observing the \(x_i\)'s is not.

If \(x_i\) and \(y_i\) are highly linearly correlated and the relationship passes approximately through the origin, then ratio estimation with double sampling may lead to improved estimates. In the ratio estimate for double sampling, the ratio is estimated from the units where both \(x\) and \(y\) are observed (i.e., the second sample), whereas \(\tau_x\) is estimated from the larger first sample.

The ratio estimator is:

\[\hat{\tau}_r=r\hat{\tau}_x\]

where \(r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n x_i}\), and \(\hat{\tau}_x=\dfrac{N}{n'}\sum\limits_{i=1}^{n'} x_i\)

Let \(s^2\) be the sample variance of the \(y\)-values in the second sample; then the estimated variance of the ratio estimator is:

\[\hat{\operatorname{Var}}(\hat{\tau}_r)=\underbrace{N(N-n')\dfrac{s^2}{n'}}_{\textstyle \hat{\operatorname{Var}}[E(\hat{\tau}_r|s_1)]}+\underbrace{N^2\dfrac{n'-n}{n'n(n-1)}\displaystyle\sum_{i=1}^{n}(y_i-rx_i)^2}_{\textstyle E[\hat{\operatorname{Var}}(\hat{\tau}_r|s_1)]}\]

Note that \(s_1\) stands for the first sample.

Example 10.1 (Double Sampling) A forest resource manager is interested in estimating the total number of dead trees in a 400-acre area of heavy infestation. She subdivides the area into 200 plots of equal sizes and uses photo counts to find the number of dead trees in 18 randomly sampled plots. She then randomly samples 8 plots out of these 18 plots and conducts a ground count on these 8 plots.

Estimate the total number of dead trees in the 400-acre area.

Let \(x\) denote the number of dead trees in the plot by photo count and \(y\) the number of dead trees by ground count. The data are given as:

Plot   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
x'     5   7  10   6   7   9   3   6   8  11   5   9  12  13   3  20  15   4

Out of these 18 plots, 8 are randomly selected and a ground count is conducted.

Plot         2        3        5        6       12       15       16       17
x            7       10        7        9        9        3       20       15
y            9       13       10       11       10        4       25       17
y - rx  0.3375   0.6250   1.3375  -0.1375  -1.1375   0.2875   0.2500  -1.5625

Minitab output:

Variable N Mean StDev
x' 18 8.50 4.46
x 8 10.00 5.26
y 8 12.37 6.28

Sum of x’ = 153.00, Sum of x = 80.00, Sum of y = 99.00.

Sum of squares (uncorrected) of y-rx = 6.1928

For this example:

  • \(N = 200\)
  • \(n' = 18\)
  • \(n = 8\)

Try It!

Compute the ratio estimate for the population total.

The ratio estimator for this population total is:

\(r=\dfrac{99}{80}=1.2375\)

\(\hat{\tau}_x=\dfrac{200}{18}\sum\limits_{i=1}^{18} x_i =\dfrac{200}{18} \cdot 153=1700\)

\(\hat{\tau}_r=r\hat{\tau}_x=1.2375 \times 1700=2103.75\)
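The arithmetic above can be checked from the raw plot data. The following is a minimal Python sketch, not part of the original lesson; the function name `ratio_estimate_total` and the dictionary layout are illustrative assumptions.

```python
# Photo counts (x') on all 18 first-phase plots, keyed by plot number.
photo = {1: 5, 2: 7, 3: 10, 4: 6, 5: 7, 6: 9, 7: 3, 8: 6, 9: 8,
         10: 11, 11: 5, 12: 9, 13: 12, 14: 13, 15: 3, 16: 20, 17: 15, 18: 4}

# Ground counts (y) on the 8 second-phase plots.
ground = {2: 9, 3: 13, 5: 10, 6: 11, 12: 10, 15: 4, 16: 25, 17: 17}

N = 200  # number of plots in the population


def ratio_estimate_total(photo, ground, N):
    """Double-sampling ratio estimate of the population total of y."""
    n_first = len(photo)
    # r uses only the plots where both x and y were observed.
    r = sum(ground.values()) / sum(photo[plot] for plot in ground)
    # tau_x is estimated from the larger first-phase sample.
    tau_x_hat = N / n_first * sum(photo.values())
    return r, tau_x_hat, r * tau_x_hat


r, tau_x_hat, tau_r_hat = ratio_estimate_total(photo, ground, N)
print(r, tau_x_hat, tau_r_hat)
```

Keying both dictionaries by plot number makes it explicit that the second-phase \(x\)-values are a subset of the first-phase photo counts.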

Try It!

Now, compute the estimated variance of the ratio estimator.

The estimated variance of the ratio estimator is:

\[\hat{\operatorname{Var}}(\hat{\tau}_r)=N(N-n')\dfrac{s^2}{n'}+N^2 \dfrac{n'-n}{n'n(n-1)}\sum\limits_{i=1}^n (y_i-rx_i)^2\]

\[s^2=(\text{SD}_y)^2=6.28^2\]

\[\sum\limits_{i=1}^8 (y_i-rx_i)^2=6.1928\]

\[\begin{align} \hat{\operatorname{Var}}(\hat{\tau}_r)&= 200(200-18)\times \dfrac{6.28^2}{18}+200^2 \dfrac{18-8}{18\times 8(8-1)}\times 6.1928 \\ &= 79753.2+ 2457.46\\ &= 82210.66\\ \end{align}\]

\[\hat{\text{SD}}(\hat{\tau}_r)=286.724\]
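As a cross-check, the two variance components can be computed directly from the data rather than from the rounded summary statistics. This is a sketch under the formula stated above; the function name `ratio_variance_estimate` is a hypothetical choice, and `statistics.variance` is used for the sample variance with divisor \(n-1\).

```python
import math
from statistics import variance

x_first = [5, 7, 10, 6, 7, 9, 3, 6, 8, 11, 5, 9, 12, 13, 3, 20, 15, 4]
x_second = [7, 10, 7, 9, 9, 3, 20, 15]
y_second = [9, 13, 10, 11, 10, 4, 25, 17]
N = 200


def ratio_variance_estimate(x_first, x_second, y_second, N):
    """Return the two components and the total of the estimated variance."""
    n_first, n = len(x_first), len(y_second)
    r = sum(y_second) / sum(x_second)
    s2 = variance(y_second)  # sample variance of y, divisor n - 1
    ss_resid = sum((y - r * x) ** 2 for x, y in zip(x_second, y_second))
    term1 = N * (N - n_first) * s2 / n_first
    term2 = N ** 2 * (n_first - n) / (n_first * n * (n - 1)) * ss_resid
    return term1, term2, term1 + term2


term1, term2, var_hat = ratio_variance_estimate(x_first, x_second, y_second, N)
sd_hat = math.sqrt(var_hat)
print(term1, term2, var_hat, sd_hat)
```

With unrounded inputs the total comes out near 82,155 (standard deviation about 286.6), slightly below the hand calculation because the hand calculation squares the rounded value 6.28.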

Allocation in double sampling for ratio estimation

  • \(c'\) - the cost of observing an \(x\)-variable on one unit
  • \(c\) - the cost of observing a \(y\)-variable on one unit

The total cost \(= c' n' + cn\)

For a fixed total cost, the lowest variance of \(\hat{\tau}_r\) is obtained by:

\[\dfrac{n}{n'}=\sqrt{\dfrac{c'}{c}\times \dfrac{\sigma^2_r}{\sigma^2-\sigma^2_r}}\]

where \(\sigma_r^2\) is the variance of \(Y\) about the ratio line and \(\sigma^2\) is the variance of \(Y\).

\(\sigma_r^2\) can be estimated by \(s^2_r=\dfrac{\sum\limits_{i=1}^n (y_i-rx_i)^2}{n-1}\) and \(\sigma^2\) can be estimated by \(s^2=\dfrac{\sum\limits_{i=1}^n(y_i-\bar{y})^2}{n-1}\).

Try It!

If the cost of counting dead trees on a plot by photo count is 1/4 of the cost of a ground count, what is the optimal subsampling fraction \(n/n'\)?

\[s^2_r=\dfrac{\sum\limits_{i=1}^n (y_i-rx_i)^2}{n-1}=\dfrac{6.1928}{8-1}=0.885\]

\[\dfrac{n}{n'}=\sqrt{\dfrac{1}{4}\times \dfrac{0.885}{(6.28)^2-0.885}}\approx 0.076\]

Note that here we use \(s_r^2\) to estimate \(\sigma_r^2\) and \(s^2\) to estimate \(\sigma^2\). For these to be reasonably good estimates, the sample size should not be too small in practical use.

To understand the above result, suppose the study is very large-scale, say \(n' = 1000\); then we would select \(n\) of about 76. The fraction is small because 0.885 is small compared with \((6.28)^2\).
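The allocation rule can also be evaluated in code. The sketch below is illustrative only: the function names `optimal_fraction` and `split_budget` are assumptions, costs are expressed in relative units (photo count = 1, ground count = 4), and the budget of 100 units is a made-up figure used purely to show how a fixed total cost \(c'n' + cn\) would be split.

```python
import math
from statistics import variance

x_second = [7, 10, 7, 9, 9, 3, 20, 15]
y_second = [9, 13, 10, 11, 10, 4, 25, 17]


def optimal_fraction(x, y, cost_x, cost_y):
    """Estimated optimal n / n' for double sampling with ratio estimation."""
    n = len(y)
    r = sum(y) / sum(x)
    s2_r = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
    s2 = variance(y)
    if s2 <= s2_r:
        raise ValueError("variance about the ratio line is not below the variance of y")
    return math.sqrt(cost_x / cost_y * s2_r / (s2 - s2_r))


def split_budget(budget, cost_x, cost_y, fraction):
    """Solve cost_x * n' + cost_y * n = budget with n = fraction * n'."""
    n_first = budget / (cost_x + cost_y * fraction)
    return n_first, fraction * n_first


fraction = optimal_fraction(x_second, y_second, cost_x=1.0, cost_y=4.0)
# Hypothetical budget of 100 cost units, for illustration only.
n_first, n_second = split_budget(100.0, 1.0, 4.0, fraction)
print(fraction, n_first, n_second)
```

The unrounded inputs give a fraction of roughly 0.0758. In practice the resulting sample sizes would be rounded to integers, and a pilot sample this small gives only a rough estimate of the fraction.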

10.2 Double Sampling for Stratification

In some sampling situations, units can be assigned to strata only after the sample is selected.

Try It!

Provide two examples where the units can be assigned to strata only after the sample is selected…

A couple of examples might include things like:

  • A phone survey in which the respondent's gender is known only after the person is sampled.
  • A survey in which respondents can be stratified by political affiliation only after they are surveyed.

The method of post-stratification is useful only if the relative proportion of each stratum in the population \(W_h=\dfrac{N_h}{N}\) is known for each stratum \(h\). If these proportions are not known, double sampling may be used, with an initial (large) sample used to classify the units into strata and then a stratified sample selected from the initial sample.

The two steps of double sampling for stratification:

Step 1

An initial simple random sample of \(n'\) units is selected from a population of \(N\) units. These units are classified into strata, with \(n'_h\) observed to be in stratum \(h\). The population proportion \(W_h=\dfrac{N_h}{N}\) is estimated by the sample proportion \(w_h=\dfrac{n'_h}{n'}, h = 1, \dots, L\).

Step 2

A second sample is then selected by stratified random sampling from the first sample, with \(n_h\) units selected from the \(n'_h\) first-sample units in stratum \(h\). The value \(y_{hi}\) is recorded for each unit in the second sample.

We denote the sample mean in stratum \(h\) in the second sample as \(\bar{y}_h=\sum\limits_{i=1}^{n_h} y_{hi}/n_h\)

An estimate for population mean is thus:

\[\bar{y}_d=\sum\limits_{h=1}^L w_h \bar{y}_h\]

Note that \(\bar{y}_d\) is an unbiased estimator of the population mean.

Let \(s_1\) denote the first-phase sample. The variance in two-phase sampling then decomposes as:

\[\operatorname{Var}(\bar{y}_d)=\operatorname{Var}[E(\bar{y}_d|s_1)]+E[\operatorname{Var}(\bar{y}_d|s_1)]\]

Thus, the variance of the estimate for the population mean is:

\[\operatorname{Var}(\bar{y}_d)=\dfrac{N-n'}{N} \times \dfrac{\sigma^2}{n'}+E\sum\limits_{h=1}^L \left[\left(\dfrac{n'_h}{n'}\right)^2 \left(\dfrac{n'_h-n_h}{n'_h}\right) \dfrac{\sigma^2_{h(s_1)}}{n_h} \right]\]

where \(\sigma^2\) is the overall population variance and \(\sigma^2_{h(s_1)}\) is the population variance within stratum \(h\) for the particular first-phase sample \(s_1\).

An unbiased estimate for the variance of the estimate is:

\[\hat{\operatorname{Var}}(\bar{y}_d)=\dfrac{N-n'}{N} \times \dfrac{1}{n'-1}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+ \dfrac{N-1}{N}\sum\limits_{h=1}^L \left[\left(\dfrac{n'_h-1}{n'-1}-\dfrac{n_h-1}{N-1}\right) \dfrac{w_h s^2_h}{n_h}\right]\]

where \(s_h^2\) is the stratum sample variance from the second sample.
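The estimator and its variance estimate translate fairly directly into code. The following Python sketch assumes the formulas exactly as written above; the function name `double_sampling_stratified` and the toy inputs are hypothetical and invented only to exercise the function. Calling it with a very large \(N\) shows what happens to the finite-population terms when the population is much larger than the first-phase sample.

```python
from statistics import mean, variance


def double_sampling_stratified(first_counts, second_samples, N):
    """Return (ybar_d, estimated variance) for double sampling for stratification.

    first_counts[h]   -- n'_h, first-phase units classified into stratum h
    second_samples[h] -- y-values measured on the n_h second-phase units
    N                 -- population size
    """
    n_first = sum(first_counts)
    w = [c / n_first for c in first_counts]
    ybar = [mean(ys) for ys in second_samples]
    ybar_d = sum(wh * yh for wh, yh in zip(w, ybar))

    # First term: spread of the stratum means around the overall estimate.
    between = (N - n_first) / N / (n_first - 1) * sum(
        wh * (yh - ybar_d) ** 2 for wh, yh in zip(w, ybar)
    )
    # Second term: within-stratum sample variances from the second sample.
    within = (N - 1) / N * sum(
        ((c - 1) / (n_first - 1) - (len(ys) - 1) / (N - 1))
        * wh * variance(ys) / len(ys)
        for c, wh, ys in zip(first_counts, w, second_samples)
    )
    return ybar_d, between + within


# Hypothetical toy inputs, invented purely to exercise the function.
toy_first_counts = [30, 70]
toy_second_samples = [[2, 4, 3, 5], [10, 12, 9, 14, 11]]

ybar_d, var_finite = double_sampling_stratified(toy_first_counts, toy_second_samples, N=500)
_, var_large = double_sampling_stratified(toy_first_counts, toy_second_samples, N=10**9)
print(ybar_d, var_finite, var_large)
```

For these toy inputs the finite-population version is smaller than the large-\(N\) version, since both correction factors shrink the estimate when \(N\) is not much larger than \(n'\).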

Example 10.2 (Double Sampling for Stratification) A shoe store wants to estimate the average number of pairs of shoes owned by the students who live in a certain college town neighborhood. They think that a stratified sample based on gender is a good approach but do not know the gender composition of that neighborhood. They also do not know the gender of a respondent until after contacting them. So, they use double sampling by first contacting 160 randomly selected students in that neighborhood and asking them about their gender. It turns out that 64 are males and 96 are females. They then randomly sample 8 males and 12 females and offer each a $10.00 incentive to go home, count their pairs of shoes, and report the number.

The data are given in the table below:

Male     5   6   9   5   9   7   5   8   -   -   -   -
Female  17  19  13  16   8  11  15  19  12  13  33  20

Summary statistics:

Variable   N   Mean    StDev
male       8   6.750   1.753
female    12   16.33   6.37

Answer

To estimate the average number of pairs of shoes, they use double sampling:

Step 1:

160 students are randomly sampled to find out their gender. Result: 64 male, 96 female.

Step 2:

Stratify by gender and randomly sample 8 males and 12 females.

male: \(n'_1 = 64\), female: \(n'_2 = 96\)

\[n'=n'_1+n'_2=160\]

Try It!

Compute \(\bar{y}_d\) and its estimated standard deviation.

\(w_1=\dfrac{64}{160}=0.4\) , \(w_2=\dfrac{96}{160}=0.6\)

\(n_1=8\), \(n_2=12\)

\(\bar{y}_1=6.75\), \(\bar{y}_2=16.33\)

\[\bar{y}_d=\sum\limits_{h=1}^L w_h \bar{y}_h=(0.4 \times 6.75)+(0.6\times 16.33)=12.498\]

In this example, \(N\) is not known but we know it is large. Thus, approximately,

\[\hat{\operatorname{Var}}(\bar{y}_d)=\dfrac{1}{n'-1}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+\sum\limits_{h=1}^L \dfrac{n'_h-1}{n'-1}\dfrac{w_h s^2_h}{n_h}\]

where \(s^2_1=(1.753)^2=3.073\) and \(s^2_2=(6.37)^2=40.5769\)

\[\begin{align} \hat{\operatorname{Var}}(\bar{y}_d) &= \dfrac{1}{160-1}[0.4(6.75-12.498)^2+0.6(16.33-12.498)^2]\\ & + \dfrac{64-1}{160-1}\left[ 0.4 \times \dfrac{3.073}{8}\right]+\dfrac{96-1}{160-1}\left[0.6\times \dfrac{40.5769}{12}\right]\\ &= 0.1385+1.2731\\ &= 1.4116\\ \end{align}\]

\[\hat{\text{SD}}(\bar{y}_d)=1.188\]
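The same calculation can be reproduced from the raw shoe counts. This is a minimal Python sketch of the large-\(N\) approximation used above, not the lesson's own code; the variable names are illustrative.

```python
import math
from statistics import mean, variance

# Second-phase measurements (pairs of shoes) by stratum.
second_phase = {
    "male": [5, 6, 9, 5, 9, 7, 5, 8],
    "female": [17, 19, 13, 16, 8, 11, 15, 19, 12, 13, 33, 20],
}
# First-phase classification counts n'_h.
first_phase_counts = {"male": 64, "female": 96}

n_first = sum(first_phase_counts.values())
weights = {h: c / n_first for h, c in first_phase_counts.items()}
ybar = {h: mean(ys) for h, ys in second_phase.items()}

# Double-sampling estimate of the population mean.
ybar_d = sum(weights[h] * ybar[h] for h in weights)

# Large-N approximation: finite-population corrections are dropped.
between = sum(weights[h] * (ybar[h] - ybar_d) ** 2 for h in weights) / (n_first - 1)
within = sum(
    (first_phase_counts[h] - 1) / (n_first - 1)
    * weights[h] * variance(second_phase[h]) / len(second_phase[h])
    for h in weights
)
var_hat = between + within
sd_hat = math.sqrt(var_hat)
print(ybar_d, between, within, var_hat, sd_hat)
```

Using unrounded stratum means and variances, the estimate is 12.5 with an estimated variance near 1.413. The small differences from the hand calculation come from rounding the female mean to 16.33 and the standard deviations to three or four digits.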