10.1 - Double Sampling for Ratio Estimation

What is double sampling?

Double sampling refers to designs in which a sample of units is first selected to obtain auxiliary information only, and a second sample is then selected in which the variable of interest is observed in addition to the auxiliary information.

Double sampling is also called two-phase sampling. It is useful for obtaining auxiliary information for ratio and regression estimation, and for obtaining the information needed for stratified sampling.

Ratio estimation with double sampling

  • \(y_i\) - variable of interest
  • \(x_i\) - auxiliary variable
  • n' - number of units in the first sample (which includes the second sample)
  • n - number of units in the second sample

Both \(x_i\) and \(y_i\) are observed only in the second sample. For the remaining units (those in the first sample but not the second), \(x_i\) is observed but \(y_i\) is not. Note that observing the \(y_i\)'s is expensive, whereas observing the \(x_i\)'s is not.

If \(x_i\) and \(y_i\) are highly correlated and their relationship is approximately linear through the origin, then ratio estimation with double sampling may lead to improved estimates. In the ratio estimate for double sampling, the ratio is estimated from the units where both x and y are observed, i.e., the second sample, whereas the population total of x, \(\tau_x\), is estimated from the larger first sample.

The ratio estimator is:

\(\hat{\tau}_r=r\hat{\tau}_x\)

where \(r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n x_i}\),

and \(\hat{\tau}_x=\dfrac{N}{n'}\sum\limits_{i=1}^{n'} x_i\)

Let \(s^2\) be the sample variance of the y-values in the second sample. Then the estimated variance of the ratio estimator is:

\(\hat{V}ar(\hat{\tau}_r)=\underbrace{N(N-n')\dfrac{s^2}{n'}}_{\textstyle  \hat{V}ar[E(\hat{\tau}_r|s_1)]}+\underbrace{N^2\dfrac{n'-n}{n'n(n-1)}\displaystyle\sum_{i=1}^{n}(y_i-rx_i)^2}_{\textstyle  E[\hat{V}ar(\hat{\tau}_r|s_1)]}\)

 

Note that \(s_1\) stands for the first sample.

Example 10-1: Double Sampling

A forest resource manager is interested in estimating the total number of dead trees in a 400-acre area of heavy infestation. She subdivides the area into 200 plots of equal sizes and uses photo counts to find the number of dead trees in 18 randomly sampled plots. She then randomly samples 8 plots out of these 18 plots and conducts a ground count on these 8 plots.

Estimate the total number of dead trees in the 400-acre area.

Let x denote the number of dead trees in the plot by photo count and y the number of dead trees by ground count. The data are given as:

Plot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
x' 5 7 10 6 7 9 3 6 8 11 5 9 12 13 3 20 15 4

Out of these 18 plots, 8 are randomly selected and a ground count is conducted.

Plot 2 3 5 6 12 15 16 17
x 7 10 7 9 9 3 20 15
y 9 13 10 11 10 4 25 17
y-rx 0.3375 0.6250 1.3375 -0.1375 -1.1375 0.2875 0.2500 -1.5625

 

 Minitab output:

Variable N Mean StDev
x' 18 8.50 4.46
x 8 10.00 5.26
y 8 12.37 6.28

Sum of x' = 153.00, Sum of x = 80.00, Sum of y = 99.00.

Sum of squares (uncorrected) of y-rx = 6.1928

For this example:

  • N = 200
  • n' = 18
  • n = 8

Try it!

Compute the ratio estimate for the population total.

The ratio estimator for this population total is:

\(r=\dfrac{99}{80}=1.2375\)

\(\hat{\tau}_x=\dfrac{200}{18}\sum\limits_{i=1}^{18} x_i =\dfrac{200}{18} \cdot 153=1700\)

\(\hat{\tau}_r=r\hat{\tau}_x=1.2375 \times 1700=2103.75\)
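The arithmetic above can be reproduced in a few lines. The following is a minimal Python sketch using the counts from Example 10-1; the variable names (`x_first`, `plots_second`, and so on) are illustrative choices and not part of the lesson.

```python
# Photo counts x' on the 18 first-sample plots (Example 10-1).
x_first = [5, 7, 10, 6, 7, 9, 3, 6, 8, 11, 5, 9, 12, 13, 3, 20, 15, 4]
# Plots chosen for the ground count, and their ground counts y.
plots_second = [2, 3, 5, 6, 12, 15, 16, 17]
y_second = [9, 13, 10, 11, 10, 4, 25, 17]
N = 200  # total number of plots in the 400-acre area

# Photo counts for the second-sample plots (plot numbers are 1-based).
x_second = [x_first[p - 1] for p in plots_second]

r = sum(y_second) / sum(x_second)            # ratio from the second sample
tau_x_hat = N / len(x_first) * sum(x_first)  # total of x from the first sample
tau_r_hat = r * tau_x_hat                    # ratio estimate of the total of y

print(r, tau_x_hat, tau_r_hat)
```

Note that only the second sample enters `r`, while all 18 photo counts enter `tau_x_hat`, which is the point of the two-phase design.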

Try it!

Now, compute the estimated variance of the ratio estimator.

The estimated variance of the ratio estimator is:

\(\hat{V}ar(\hat{\tau}_r)=N(N-n')\dfrac{s^2}{n'}+N^2 \dfrac{n'-n}{n'n(n-1)}\sum\limits_{i=1}^n (y_i-rx_i)^2\)

\(s^2=(\text{st.dev.}y)^2=6.28^2\)
\(\sum\limits_{i=1}^8 (y_i-rx_i)^2=6.1928\)

\begin{align}
\hat{V}ar(\hat{\tau}_r)&= 200(200-18)\times \dfrac{6.28^2}{18}+200^2 \dfrac{18-8}{18\times 8(8-1)}\times 6.1928 \\
&= 79753.2+ 2457.46\\
&= 82210.66\\
\end{align}

\(\widehat{\text{s.d.}}(\hat{\tau}_r)=\sqrt{82210.66}=286.724\)
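As a check on the hand calculation, the two variance components can be computed directly from the data. This is a minimal Python sketch, assuming the eight paired counts from Example 10-1; the function name `double_sampling_ratio_variance` is hypothetical. Because it uses the unrounded sample variance of y rather than \(6.28^2\), the first component (and hence the total) comes out slightly smaller than the rounded figures above.

```python
from math import sqrt
from statistics import variance


def double_sampling_ratio_variance(x, y, n_first, N):
    """Return the two components of the estimated variance of the
    double-sampling ratio estimator of a population total.

    x, y    : paired values observed in the second sample
    n_first : size n' of the first sample
    N       : number of units in the population
    """
    n = len(y)
    r = sum(y) / sum(x)
    s2 = variance(y)  # sample variance of y (divisor n - 1)
    ss_resid = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y))
    term1 = N * (N - n_first) * s2 / n_first
    term2 = N**2 * (n_first - n) / (n_first * n * (n - 1)) * ss_resid
    return term1, term2


# Second-sample data from Example 10-1.
x = [7, 10, 7, 9, 9, 3, 20, 15]
y = [9, 13, 10, 11, 10, 4, 25, 17]

term1, term2 = double_sampling_ratio_variance(x, y, n_first=18, N=200)
var_hat = term1 + term2
print(term1, term2, var_hat, sqrt(var_hat))
```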

Allocation in double sampling for ratio estimation

  • c' - the cost of observing the x-variable on one unit
  • c - the cost of observing the y-variable on one unit

The total cost = c'n' + cn

For a fixed total cost, the lowest variance of \(\hat{\tau}_r\) is obtained when:

\(\dfrac{n}{n'}=\sqrt{\dfrac{c'}{c}\times \dfrac{\sigma^2_r}{\sigma^2-\sigma^2_r}}\)

where \(\sigma_r^2\) is the variance of Y about the ratio line and \(\sigma^2\) is the variance of Y.

\(\sigma_r^2\) can be estimated by \(s^2_r=\dfrac{\sum\limits_{i=1}^n (y_i-rx_i)^2}{n-1}\) and \(\sigma^2\) can be estimated by \(s^2=\dfrac{\sum\limits_{i=1}^n(y_i-\bar{y})^2}{n-1}\).

Try it!

If the cost of counting dead trees on a plot by photo count is 1/4 of the cost of a ground count, what is the optimal subsampling fraction n/n'?

\(s^2_r=\dfrac{\sum\limits_{i=1}^n (y_i-rx_i)^2}{n-1}=\dfrac{6.1928}{8-1}=0.885\)

\(\dfrac{n}{n'}=\sqrt{\dfrac{1}{4}\times \dfrac{0.885}{(6.28)^2-0.885}}\approx 0.076\)

Note that here we use \(s_r^2\) to estimate \(\sigma_r^2\) and \(s^2\) to estimate \(\sigma^2\). For these to be reasonably good estimates, the sample size should not be too small in practice.

To interpret this result, suppose the study is very large in scale: if n' = 1000, then we would select n of about 76. The fraction is small because \(s_r^2=0.885\) is small compared to \(s^2=(6.28)^2\); that is, the ground counts vary little about the ratio line.
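The allocation formula can also be checked numerically. The following is a minimal Python sketch, assuming the eight paired counts from Example 10-1 and a photo-count cost of 1 unit against a ground-count cost of 4; the function name `optimal_subsampling_fraction` is illustrative and not part of the lesson. It uses the unrounded sample variance of y, so the third decimal may differ slightly from a hand calculation with rounded inputs.

```python
from math import sqrt
from statistics import variance


def optimal_subsampling_fraction(x, y, cost_x, cost_y):
    """Estimate the optimal n / n' for double sampling with ratio estimation.

    x, y   : paired values from a (pilot) second sample
    cost_x : cost c' of observing x on one unit
    cost_y : cost c of observing y on one unit
    """
    n = len(y)
    r = sum(y) / sum(x)
    # Estimated variance of y about the ratio line, s_r^2.
    s2_r = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
    # Estimated variance of y, s^2 (divisor n - 1).
    s2 = variance(y)
    return sqrt((cost_x / cost_y) * s2_r / (s2 - s2_r))


# Second-sample data from Example 10-1.
x = [7, 10, 7, 9, 9, 3, 20, 15]
y = [9, 13, 10, 11, 10, 4, 25, 17]

fraction = optimal_subsampling_fraction(x, y, cost_x=1, cost_y=4)
n_first = 1000  # first-sample size used in the illustration above
print(fraction, round(fraction * n_first))
```

Only the ratio of the two costs matters here, so any units can be used for `cost_x` and `cost_y` as long as they are the same.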