4 Auxiliary Data and Ratio Estimation
Overview
This lesson starts with the rationale for using auxiliary information about the population to estimate an unknown population parameter of interest. It then motivates the ratio estimate and the condition under which a ratio estimate is appropriate. Section 4.1 gives an example of using a ratio estimate when the population size is unknown. Section 4.2 gives an example of using a ratio estimate when the population size is known, and then discusses the sample size needed for a specified margin of error and confidence level when the ratio estimate is used. At the end of section 4.2, a small population example illustrates that the ratio estimate is biased and demonstrates that the ratio estimate is indeed better than the expansion estimate when the condition for using the ratio estimate is satisfied.
Lesson 4: Ch. 7.1, 7.2 of Sampling by Steven Thompson, 3rd Edition.
Objectives
Upon completion of this lesson, you should be able to:
- Identify the appropriate reasons and situations for utilizing ratio estimates,
- Evaluate the conditions for determining the feasibility of using a ratio estimate,
- Compute the ratio estimate and its estimated variance,
- Compute confidence intervals based on ratio estimates,
- Compute the sample size needed when the ratio estimate is used,
- Recognize the bias of the ratio estimate, as demonstrated by a small population example, and
- Recognize the improved performance of the ratio estimate over the expansion estimate when the appropriate conditions for using the ratio estimate are met.
4.1 Auxiliary Data, Ratio Estimator and Its Computation
Using Auxiliary Information
The auxiliary information about the population may include a known variable to which the variable of interest is approximately related. The auxiliary information typically is easy to measure, whereas the variable of interest may be expensive to measure.
- Population units: \(1, 2, \dots, N\)
- Variable of interest: \(y_1, y_2, \dots,y_N\) (expensive to measure)
- Auxiliary variable: \(x_1, x_2, \dots,x_N\) (known)
For example, consider:
A national park is partitioned into \(N\) units.
- \(y_i=\) the number of animals in unit \(i\)
- \(x_i=\) the size of unit \(i\)
Another example might be where a certain city has \(N\) bookstores.
- \(y_i =\) the sales of a given book title at bookstore \(i\)
- \(x_i=\) the size of the bookstore \(i\)
A third example would be a forest that has \(N\) trees.
- \(y_i =\) the volume of the tree
- \(x_i =\) the diameter of the tree
Ratio Estimators
If \(\tau_y=\sum\limits_{i=1}^N y_i\) and \(\tau_x=\sum\limits_{i=1}^N x_i\) then, \(\dfrac{\tau_y}{\tau_x}=\dfrac{\mu_y}{\mu_x}\) and \(\tau_y=\dfrac{\mu_y}{\mu_x}\cdot \tau_x\)
The ratio estimator, denoted as \(\hat{\tau}_r\), is \(\hat{\tau}_r=\dfrac{\bar{y}}{\bar{x}}\cdot \tau_x\)
The estimator is useful in the following situations:
- When \(X\) and \(Y\) are highly linearly correlated through the origin, \(\operatorname{Var}(\hat{\tau}_r)\) is less than \(\operatorname{Var}(N\bar{y})\).
- When \(N\) is unknown, one cannot use \(N\bar{y}\), but the ratio estimator still provides a way to estimate \(\tau_y\).
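As a minimal sketch of the definition above, the following Python snippet computes \(\hat{\tau}_r\) next to the expansion estimate \(N\bar{y}\). The function names and the small data set are hypothetical and chosen only for illustration; they are not taken from the lesson.

```python
from statistics import mean

def ratio_total(y, x, tau_x):
    """Ratio estimate of the population total of y: (ybar / xbar) * tau_x."""
    return mean(y) / mean(x) * tau_x

def expansion_total(y, N):
    """Expansion estimate N * ybar; it requires the population size N."""
    return N * mean(y)

# Hypothetical sample (illustration only): y is roughly proportional to x.
y_sample = [12.0, 20.0, 31.0, 39.0]
x_sample = [3.0, 5.0, 8.0, 10.0]
tau_x = 140.0   # assumed known population total of x
N = 20          # assumed known population size

print(ratio_total(y_sample, x_sample, tau_x))   # needs tau_x but not N
print(expansion_total(y_sample, N))             # needs N but not tau_x
```

Note that `ratio_total` never uses `N`, which is why the ratio estimator remains available when the population size is unknown.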
Historical Use
When was this type of estimator first used? Probably the first instance occurred in France in 1802. At that time there was no population census, and Laplace wanted to estimate the total population of France. He did not have the resources to count every individual, so he sampled 30 communities in France. Thus \(n = 30\), and the total number of inhabitants in these communities was 2,037,615. What type of information did the government already have?
Laplace looked for auxiliary information and found that good records existed of the number of registered births. The total number of registered births for the 30 communities that he had selected was 71,866.33.
Dividing 2,037,615 by 71,866.33, he estimated that there is one registered birth for every 28.35 persons. Therefore, he estimated the total population by the total number of annual births × 28.35.
Rationale: Communities with larger populations are likely to have a larger number of registered births.
This is an example of the early use of ratio estimation.
Example 4.1 (Total Weight of Apple Juice) For a juice company, the price paid for apples in large shipments is based on the amount of apple juice in the load. Therefore, we need to determine the amount of apple juice in the whole load prior to extraction. We can sample \(n\) apples and find \(y_1,\dots,y_n\), the amount of apple juice in those apples. \(N\bar{y}\) is hard to get in this case because \(N\) is hard to count. What could we measure instead?
The total weight of the load is a natural choice because it is easy to obtain. We will use the relationship between the weight of an apple and the weight of the juice it yields.
\(y\) is related to \(x\), the weight of each apple in the sample, and the total weight \(\tau_x\) is easy to get for the entire shipment. We can thus estimate the total apple juice by:
\[\hat{\tau}_r=\dfrac{\bar{y}}{\bar{x}}\cdot \tau_x\]
For this example, \(N\) is unknown and we cannot use \(N\bar{y}\). Even when \(N\) is known, the ratio estimator may work better than \(N\bar{y}\) if the condition for using the ratio estimator is satisfied.
Similarly, to estimate \(\mu_y\), we can use
\[\hat{\mu}_r=\dfrac{\bar{y}}{\bar{x}}\cdot \mu_x\]
Neither estimator is unbiased: \(\hat{\tau}_r\) is biased for \(\tau_y\) and \(\hat{\mu}_r\) is biased for \(\mu_y\). Under simple random sampling, however, both are approximately unbiased for large samples. The MSE of \(\hat{\mu}_r\) is then approximately \(\operatorname{Var}(\hat{\mu}_r)\), which is given by:
\[\operatorname{Var}(\hat{\mu}_r) \approx \left( \dfrac{N-n}{N} \right)\cdot \dfrac{\sigma^2_r}{n}\]
Here \(\sigma^2_r\) is the population variance of the deviations of \(y_i\) from the ratio line:
\[\sigma^2_r=\dfrac{1}{N-1} \sum\limits_{i=1}^N \left( y_i-\dfrac{\tau_y}{\tau_x}\cdot x_i \right)^2\]
Because \(\sigma^2_r\) depends on the whole population, we estimate it from the sample using:
\[s^2_r=\dfrac{1}{n-1} \sum\limits_{i=1}^n \left( y_i-\dfrac{\bar{y}}{\bar{x}}\cdot x_i \right)^2\]
Given all of this, when do we know that the estimate \(\hat{\mu}_r\) is good? We can compare it to:
\[\operatorname{Var}(\bar{y})=\left(\dfrac{N-n}{N}\right)\cdot \dfrac{\sigma^2}{n}\]
\(\hat{\mu}_r\) will perform better if \(\sigma^2_r < \sigma^2\). That is the case for populations for which \(y\)’s and \(x\)’s are highly correlated and with roughly a linear relationship through the origin.
An approximate \(100(1 - \alpha)\)% CI for \(\mu_y\) is:
\[\hat{\mu}_r \pm t_{n-1,\alpha/2}\sqrt{\hat{\operatorname{Var}}(\hat{\mu}_r)}\]
For \(\tau_y\), the estimator and its estimated variance are:
\[\hat{\tau}_r=N \hat{\mu}_r=\dfrac{\bar{y}}{\bar{x}}\cdot \tau_x\]
\[\hat{\operatorname{Var}}(\hat{\tau}_r)=N \cdot (N-n) \dfrac{s^2_r}{n}\]
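The formulas for \(s^2_r\), the estimated variance, and the confidence interval can be sketched in Python as follows. This is an illustrative sketch, not the lesson's own code: the function names and the data are hypothetical, and the critical value \(t_{n-1,\alpha/2}\) is assumed to be supplied by the caller (for example, read from a t table) so that only the standard library is needed.

```python
from math import sqrt
from statistics import mean

def ratio_mean_ci(y, x, mu_x, N, t_crit):
    """Approximate CI for mu_y from the ratio estimator under SRS.

    t_crit is the quantile t_{n-1, alpha/2}, supplied by the caller.
    """
    n = len(y)
    r = mean(y) / mean(x)                        # sample ratio
    mu_hat = r * mu_x                            # ratio estimate of mu_y
    # s_r^2: sum of squared residuals y_i - r * x_i, divided by n - 1
    s2_r = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    var_mu = (N - n) / N * s2_r / n              # estimated Var(mu_hat)
    half_width = t_crit * sqrt(var_mu)
    return mu_hat, var_mu, (mu_hat - half_width, mu_hat + half_width)

def ratio_total_ci(y, x, mu_x, N, t_crit):
    """Same interval scaled to the total: tau_hat = N * mu_hat."""
    mu_hat, var_mu, (low, high) = ratio_mean_ci(y, x, mu_x, N, t_crit)
    # N^2 * var_mu equals N * (N - n) * s_r^2 / n
    return N * mu_hat, N ** 2 * var_mu, (N * low, N * high)

# Hypothetical data (illustration only); t_{3, 0.025} = 3.182.
y = [12.0, 20.0, 31.0, 39.0]
x = [3.0, 5.0, 8.0, 10.0]
print(ratio_mean_ci(y, x, mu_x=7.0, N=20, t_crit=3.182))
print(ratio_total_ci(y, x, mu_x=7.0, N=20, t_crit=3.182))
```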
Back to the context for this example…
Using Minitab
As it turns out in this example, 15 apples selected by simple random sampling were weighed and then juiced. The total weight of the apple shipment was found to be 2000 pounds. Given the data described below, we need a point estimate of the total weight of the juice for the shipment of apples and a 95% confidence interval. The data are in a Minitab worksheet (Apple_Juice.txt) which contains the following columns:
- C1 is the sampled Apple
- C2 is \(Y\), the weight of the Apple juice in lbs.
- C3 is \(X\), the weight of the Apple in lbs.
- C4 is \(Y-rX\), the observed \(y\) value - estimated \(y\) value, and
- C5 is \((Y-rX)^2\), the (observed \(y\) value - estimated \(y\) value) squared.
Total Apple juice weight is 2.85 lbs. (mean = 0.19 lbs.)
Total Apple weight is 4.32 lbs. (mean = 0.288 lbs.)
We can use Minitab to generate the scatterplot, the regression analysis, and the descriptive statistics:
Graph > Scatterplot
Stat > Regression > Regression
Choose \(Y\) for the Response and \(X\) for the continuous predictor.
Stat > Basic Statistics
Question
Is it appropriate to use the ratio estimate?
Use the graph and analysis below to justify your answer.
Minitab output:
Regression Analysis Apple juice \(Y\) versus Weight \(X\)
Coefficients
Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | -0.00904 | 0.02003 | -0.45 | 0.659 |
Weight X | 0.69112 | 0.06754 | 10.23 | 0.000 |
Regression Equation
\(\text{Apple juice } Y = - 0.0090 + 0.691 \text{Weight } X\)
\(S = 0.0185258\); R-Sq \(= 89.0\%\); R-Sq(adj) \(= 88.1\%\)
The scatterplot of the data shows a linear relationship between the \(y\) and \(x\) variables. Moreover, the intercept in the regression analysis is not significantly different from zero (p-value of constant \(= 0.659 > 0.05\)), which is consistent with a regression line through the origin. Therefore, it appears appropriate to use the ratio estimate.
Use the ratio estimate to estimate the total weight of the juice and provide a 95% confidence interval.
Minitab output:
Descriptive Statistics: Apple juice Y, Weight X
Variable | N | N* | Mean | SE Mean | StDev | Minimum | Q1 | Median | Q3 | Maximum |
---|---|---|---|---|---|---|---|---|---|---|
Apple juice Y | 15 | 0 | 0.1900 | 0.0139 | 0.0537 | 0.1100 | 0.1600 | 0.1700 | 0.2400 | 0.2800 |
Weight X | 15 | 0 | 0.2880 | 0.0189 | 0.0733 | 0.1600 | 0.2200 | 0.2900 | 0.3500 | 0.4000 |
The ratio estimate of the total weight is:
\[\hat{\tau}_r=r\tau_x=\dfrac{0.190}{0.288}\times 2000=1319.44\]
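How accurate is this result? To compute a confidence interval, we need the estimated variance of \(\hat{\tau}_r\).
The squared residuals in column C5 sum to 0.004536, so \(s^2_r = 0.004536 / 14 = 0.000324\). Because \(N\) is unknown here, the reported standard deviation can be reproduced by approximating \(N\) with \(\tau_x/\bar{x} = 2000/0.288 \approx 6944.4\):
\[\hat{\operatorname{Var}}(\hat{\tau}_r)=N \cdot (N-n) \dfrac{s^2_r}{n} \approx 6944.4 \times 6929.4 \times \dfrac{0.000324}{15} \approx 1039.4, \qquad \hat{\text{SD}}(\hat{\tau}_r) \approx 32.24\]
An approximate 95% CI for \(\tau_y\) is then:
\[\begin{align} \hat{\tau}_r \pm t_{14,0.025}\, \hat{\text{SD}}(\hat{\tau}_r) &= 1319.44 \pm 2.145 \times 32.24 \\ &= 1319.44 \pm 69.15 \end{align}\]
For comparison, let \(s^2\) denote the sample variance of the \(y_i\)’s. Then \(s^2= (0.0537)^2=0.002884\), whereas \(s^2_r = 0.000324\).
We see that for this example, the ratio estimate does reduce the variance by using the information contained in \(x\) about \(y\).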
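The arithmetic of this example can be checked with a short Python sketch that uses only the summary values quoted in the lesson. One step is an assumption rather than something the lesson states: because \(N\) is unknown, it is approximated here by \(\tau_x/\bar{x}\), which is the choice that reproduces the reported standard deviation of 32.24.

```python
from math import sqrt

# Summary values quoted in the lesson for the apple juice example.
n = 15
y_bar = 2.85 / n        # mean juice weight per apple, lb
x_bar = 4.32 / n        # mean apple weight, lb
tau_x = 2000.0          # total weight of the shipment, lb
ss_resid = 0.004536     # sum of (y_i - r * x_i)^2, the numerator of s_r^2
t_crit = 2.145          # t_{14, 0.025}

r = y_bar / x_bar
tau_hat = r * tau_x                         # ratio estimate of total juice weight
s2_r = ss_resid / (n - 1)

# Assumption: N is unknown, so approximate it by tau_x / x_bar.
N_hat = tau_x / x_bar
var_tau = N_hat * (N_hat - n) * s2_r / n    # N (N - n) s_r^2 / n
sd_tau = sqrt(var_tau)
half_width = t_crit * sd_tau

print(round(tau_hat, 2), round(sd_tau, 2), round(half_width, 2))
```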
Using R
Here is the R code for this example:
Datafile: Apple.txt
R code: Chapter4_apple.R.txt
Estimation for Ratio
In some cases we are interested in estimating:
\[R=\dfrac{\tau_y}{\tau_x}\left(\text{also, } \dfrac{\mu_y}{\mu_x}\right)\]
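For example, sociologists are interested in ratios such as the monthly food budget compared to the monthly income per family. The sample ratio \(r=\dfrac{\bar{y}}{\bar{x}}\) is the estimate for \(R\). Since \(\hat{\mu}_r=r\mu_x\), its estimated variance follows from that of \(\hat{\mu}_r\):
\[\hat{\operatorname{Var}}(r)=\dfrac{1}{\mu^2_x}\cdot\dfrac{N-n}{N}\cdot \dfrac{s^2_r}{n}\]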
4.2 Selecting Sample Size and Small Population Example for Ratio Estimate
Example 4.2 (Estimating average number of trees per acre)
See p. 196 of Scheaffer, Mendenhall and Ott
Part 1: Estimate the average number of trees per acre
Suppose we want to estimate the average number of trees per acre on a 1000-acre plantation. The investigator samples 10 one-acre plots by simple random sampling and counts the number of trees (\(y\)) on each plot. She also has aerial photographs of the plantation from which she can estimate the number of trees (\(x\)) on each plot of the entire plantation. Hence, she knows \(\mu_x = 19.7\), and since the two counts are approximately proportional through the origin, she uses a ratio estimate to estimate \(\mu_y\).
Plot | Actual number per acre \(Y\) | Aerial estimate \(X\) | \(y_i-rx_i\) |
---|---|---|---|
1 | 25 | 23 | 0.5625 |
2 | 15 | 14 | 0.1250 |
3 | 22 | 20 | 0.7500 |
4 | 24 | 25 | -2.5625 |
5 | 13 | 12 | 0.2500 |
6 | 18 | 18 | -1.1250 |
7 | 35 | 30 | 3.1250 |
8 | 30 | 27 | 1.3125 |
9 | 10 | 8 | 1.5000 |
10 | 29 | 31 | -3.9375 |
Mean | 22.10 | 20.80 | - |
Here is a scatterplot of this data:
The Minitab output for regression is given below.
Coefficients
Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | 1.239 | 2.007 | 0.62 | 0.554 |
X | 1.002 | 0.09094 | 11.03 | 0.000 |
Regression Equation
\(Y = 1.24 + 1.00 X\)
The scatter plot of the data shows a linear relationship between \(y\) and \(x\). Moreover, the intercept in the regression analysis is not significantly different from zero (\(p\)-value of constant \(= 0.554 > 0.05\)), which is consistent with a regression line through the origin. Therefore, it may be appropriate to use the ratio estimate.
Part 2: Estimate the average number of trees per acre (Computation)
- \(N = 1000\) (plantation size)
- \(n = 10\) (taken by SRS)
- \(y_i =\) the actual count of trees in the 1-acre plots, \(i = 1, 2, \dots, 10.\)
- \(x_i =\) the aerial estimate for each plot
- \(\bar{y}=22.10\)
- \(\bar{x}=20.80\)
- \(\mu_x\) is given to be 19.70
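Answer
The sample ratio is \(r=\bar{y}/\bar{x}=22.10/20.80=1.0625\), so the ratio estimate is \(\hat{\mu}_r=r\mu_x=1.0625\times 19.70=20.93\). From the residuals in the table, \(s^2_r=\dfrac{1}{9}\sum (y_i-rx_i)^2=\dfrac{38.03}{9}=4.2257\), which gives:
\[\hat{\operatorname{Var}}(\hat{\mu}_r)=\dfrac{N-n}{N}\cdot \dfrac{s^2_r}{n}=\dfrac{1000-10}{1000}\cdot \dfrac{4.2257}{10}=0.4183, \qquad \hat{\text{SD}}(\hat{\mu}_r)=0.6468\]
The approximate 95% confidence interval for \(\mu_y\) is:
\[\begin{align} \hat{\mu}_r \pm t_9 \cdot \hat{\text{SD}}(\hat{\mu}_r)&=20.93 \pm 2.262\cdot0.6468\\ &=20.93 \pm 1.46\\ \end{align}\]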
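The computation for this example can be reproduced from the table with a short Python sketch. It applies the variance formula \(\dfrac{N-n}{N}\cdot\dfrac{s^2_r}{n}\) used in this lesson; the variable names are illustrative, and the critical value \(t_9 = 2.262\) is typed in rather than computed.

```python
from math import sqrt
from statistics import mean

# Tree counts from the table in Example 4.2.
y = [25, 15, 22, 24, 13, 18, 35, 30, 10, 29]   # actual number per acre
x = [23, 14, 20, 25, 12, 18, 30, 27, 8, 31]    # aerial estimate
N, mu_x = 1000, 19.7
t_crit = 2.262                                  # t_{9, 0.025}

n = len(y)
r = mean(y) / mean(x)                           # sample ratio
mu_hat = r * mu_x                               # ratio estimate of mu_y
residuals = [yi - r * xi for yi, xi in zip(y, x)]
s2_r = sum(e ** 2 for e in residuals) / (n - 1)
sd_mu = sqrt((N - n) / N * s2_r / n)            # estimated SD of mu_hat
half_width = t_crit * sd_mu

print(f"r = {r:.4f}, mu_hat = {mu_hat:.2f}, s2_r = {s2_r:.4f}")
print(f"SD = {sd_mu:.4f}, CI = {mu_hat:.2f} +/- {half_width:.2f}")
```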
Part 3: Sample size needed to estimate \(\mu_y\)
The goal is to find the sample size needed to estimate \(\mu_y\) when the ratio estimator is used.
Derivation
Let \(d\) denote the margin of error of the \(100(1 - \alpha)\)% confidence interval for \(\mu_y\). Then we know that:
\[t\cdot\sqrt{\dfrac{N-n}{N}\cdot \dfrac{s^2_r}{n}}=d\]
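Solving for \(n\) gives:
\[n=\dfrac{N t^2 s^2_r}{N d^2+t^2 s^2_r}\]
For the tree count with the aerial estimate example, if we want to estimate \(\mu_y\) to within 1 tree with 95% confidence, how many plots should we sample if the ratio estimate is to be used?
Using \(s^2_r=4.2257\) computed from the 10 sampled plots, \(d=1\), and \(t\approx 1.96\):
\[n=\dfrac{1000 \times 1.96^2 \times 4.2257}{1000 \times 1^2+1.96^2 \times 4.2257}=15.97\]
Round up to 16 acres.
(One can also use the refined method, which replaces 1.96 by the \(t\) value with \(n-1\) degrees of freedom and recomputes \(n\) until it stabilizes.)
Answer
With the refined method, \(n = 19\).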
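The sample size calculation can be sketched in Python as follows. The function name `ratio_sample_size` is hypothetical, and the \(t\) values are typed in from a t table rather than computed. The refined pass shown here uses \(t_{18,0.025}=2.101\), which corresponds to \(n-1=18\) degrees of freedom for \(n=19\).

```python
from math import ceil

def ratio_sample_size(N, s2_r, d, t):
    """Solve t * sqrt((N - n) / N * s2_r / n) = d for n (before rounding up)."""
    return N * t ** 2 * s2_r / (N * d ** 2 + t ** 2 * s2_r)

# Tree example: s_r^2 from the residual column of the table in Example 4.2.
residuals = [0.5625, 0.1250, 0.7500, -2.5625, 0.2500,
             -1.1250, 3.1250, 1.3125, 1.5000, -3.9375]
s2_r = sum(e ** 2 for e in residuals) / (len(residuals) - 1)

# First pass: approximate t by 1.96.
n_first = ratio_sample_size(N=1000, s2_r=s2_r, d=1.0, t=1.96)
print(ceil(n_first))      # 16

# Refined pass: t_{18, 0.025} = 2.101 from a t table.
n_refined = ratio_sample_size(N=1000, s2_r=s2_r, d=1.0, t=2.101)
print(ceil(n_refined))    # 19
```

Because the \(t\) quantile grows as the degrees of freedom shrink, the refined answer is larger than the first-pass answer.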
When is using the ratio estimate advantageous?
For estimating \(\mu_y\), we want to compare \(\bar{y}\) versus \(\hat{\mu}_r\) under simple random sampling. For large samples, \(\hat{\mu}_r\) is roughly unbiased and thus we can compare their variances. Recall that \(\hat{\operatorname{Var}}(\bar{y})=\left(1-\dfrac{n}{N}\right)\dfrac{s^2_y}{n}\) where \(s^2_y\) is the sample variance of the \(y_i\)’s, and \(\hat{\operatorname{Var}}(\hat{\mu}_r)=\dfrac{N-n}{N}\cdot \dfrac{s^2_r}{n}\) where \(s^2_r=s^2_y+r^2s^2_x-2r\hat{\rho} s_x s_y\). Then the ratio becomes:
\[\dfrac{\hat{\operatorname{Var}}(\hat{\mu}_r)}{\hat{\operatorname{Var}}(\bar{y})}=\dfrac{s^2_r}{s^2_y}=\dfrac{s^2_y+r^2s^2_x-2r\hat{\rho} s_x s_y}{s^2_y}\]
where \(r=\dfrac{\bar{y}}{\bar{x}}\) and \(\hat{\rho}\) is the sample correlation between \(X\) and \(Y\).
This ratio is less than 1, that is, \(\hat{\operatorname{Var}}(\bar{y})>\hat{\operatorname{Var}}(\hat{\mu}_r)\), exactly when \(r^2s^2_x<2r\hat{\rho}s_xs_y\). For \(r>0\) this is equivalent to \(\hat{\rho}>\dfrac{1}{2} \dfrac{s_x/\bar{x}}{s_y/\bar{y}}\), and in that case it is advantageous to use \(\hat{\mu}_r\).
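As an illustrative check, the condition can be evaluated on the tree data of Example 4.2. The sketch below also confirms numerically that \(s^2_r\) computed directly from the residuals agrees with the expression \(s^2_y+r^2s^2_x-2r\hat{\rho}s_xs_y\). The variable names are hypothetical.

```python
from statistics import mean, stdev

# Tree data from Example 4.2.
y = [25, 15, 22, 24, 13, 18, 35, 30, 10, 29]
x = [23, 14, 20, 25, 12, 18, 30, 27, 8, 31]
n = len(y)
y_bar, x_bar = mean(y), mean(x)
s_y, s_x = stdev(y), stdev(x)        # sample standard deviations (n - 1 divisor)
r = y_bar / x_bar

# Sample covariance and correlation between x and y.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
rho_hat = s_xy / (s_x * s_y)

# s_r^2 computed two ways: from the residuals and from the identity.
s2_r_direct = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
s2_r_identity = s_y ** 2 + r ** 2 * s_x ** 2 - 2 * r * rho_hat * s_x * s_y

# Ratio estimation is advantageous when rho_hat exceeds this threshold.
threshold = 0.5 * (s_x / x_bar) / (s_y / y_bar)

print(f"rho_hat = {rho_hat:.4f}, threshold = {threshold:.4f}")
print(f"s2_r = {s2_r_direct:.4f}, s2_y = {s_y ** 2:.4f}")
```

For these data the correlation is well above the threshold, which agrees with \(s^2_r\) being much smaller than \(s^2_y\).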
Example 4.3 (Illustrating bias with a small population)
(See Section 7.2 of the textbook)
This is an artificial small population example that we will use to demonstrate how to compute the bias and MSE of the ratio estimator.
Site \(i\) | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Nets, \(x_i\) | 4 | 5 | 8 | 5 |
Fishes, \(y_i\) | 200 | 300 | 500 | 400 |
\(\tau_x = 22\), \(\tau_y = 1400\)
All possible simple random samples of size \(n = 2\):
Samples | \(\hat{\tau}_r=\dfrac{\bar{y}}{\bar{x}}\cdot \tau_x\) |
---|---|
(1, 2) | \(\hat{\tau}_r=\dfrac{(200+300)/2}{(4+5)/2}\cdot 22=1222\) |
(1, 3) | \(\hat{\tau}_r=\dfrac{(200+500)/2}{(4+8)/2}\cdot 22=1283\) |
(1, 4) | 1467 |
(2, 3) | 1354 |
(2, 4) | 1540 |
(3, 4) | 1523 |
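Each of the six samples is equally likely under simple random sampling, so \(E(\hat{\tau}_r)\) is the average of the six estimates, approximately 1398.2, compared with \(\tau_y = 1400\). Thus, there is a very slight bias.
Because of this bias, \(\operatorname{MSE}(\hat{\tau}_r) = \operatorname{Var}(\hat{\tau}_r) + \text{bias}^2 \ne \operatorname{Var}(\hat{\tau}_r)\).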
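The bias and MSE of \(\hat{\tau}_r\) can be computed exactly by enumerating all six samples. The following Python sketch does so under the assumption stated in the example, that every simple random sample of size 2 is equally likely; the variable names are illustrative.

```python
from itertools import combinations

x = [4, 5, 8, 5]             # nets at sites 1..4
y = [200, 300, 500, 400]     # fish at sites 1..4
tau_x, tau_y = sum(x), sum(y)

# Every SRS of size 2 is equally likely, so expectations are plain averages.
estimates = []
for i, j in combinations(range(4), 2):
    y_bar = (y[i] + y[j]) / 2
    x_bar = (x[i] + x[j]) / 2
    estimates.append(y_bar / x_bar * tau_x)   # ratio estimate for this sample

expected = sum(estimates) / len(estimates)
bias = expected - tau_y
variance = sum((t - expected) ** 2 for t in estimates) / len(estimates)
mse = sum((t - tau_y) ** 2 for t in estimates) / len(estimates)

print(f"E = {expected:.2f}, bias = {bias:.2f}")
print(f"Var = {variance:.1f}, MSE = {mse:.1f}")
```

The difference between `mse` and `variance` equals the squared bias, which is small here because the bias itself is small.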
On the other hand, if one uses the expansion estimator \(\hat{\tau}=N \cdot \bar{y}\), the six possible samples give:
Samples | \(\hat{\tau}=N \cdot \bar{y}\) |
---|---|
(1, 2) | \(4 × (200 + 300) / 2 = 1000\) |
(1, 3) | \(4 × (200 + 500) / 2 = 1400\) |
(1, 4) | \(4 × (200 + 400) / 2 = 1200\) |
(2, 3) | \(4 × (300 + 500) / 2 = 1600\) |
(2, 4) | \(4 × (300 + 400) / 2 = 1400\) |
(3, 4) | \(4 × (500 + 400) / 2 = 1800\) |
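Here \(\hat{\tau}=N\bar{y}\) is unbiased, since the six estimates average to 1400, so its MSE equals its variance. That variance, approximately 66667, is much larger than the MSE of \(\hat{\tau}_r\).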
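The comparison between the two estimators can be verified by enumerating the same six samples for both. This is a minimal sketch with a hypothetical helper `mse`; it assumes, as in the example, that all samples of size 2 are equally likely.

```python
from itertools import combinations

x = [4, 5, 8, 5]             # nets at sites 1..4
y = [200, 300, 500, 400]     # fish at sites 1..4
N, tau_x, tau_y = len(y), sum(x), sum(y)
samples = list(combinations(range(N), 2))

def mse(estimates, target):
    """Mean squared error over equally likely samples."""
    return sum((e - target) ** 2 for e in estimates) / len(estimates)

# Expansion estimator N * ybar and ratio estimator (ybar / xbar) * tau_x.
expansion = [N * (y[i] + y[j]) / 2 for i, j in samples]
ratio = [(y[i] + y[j]) / (x[i] + x[j]) * tau_x for i, j in samples]

print(sum(expansion) / len(expansion))    # mean of the expansion estimates
print(round(mse(expansion, tau_y)), round(mse(ratio, tau_y)))
```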