5  Auxiliary Data and Regression Estimation

Overview

This lesson discusses when and how to use regression estimation, and works through an example. We then compare the regression estimate with the estimate obtained from the sample mean alone, which does not take advantage of the auxiliary information. To illustrate that one has to choose the right model, we also apply the ratio estimate to the same example even though the condition for using the ratio estimate is not satisfied; not surprisingly, the ratio estimate performs poorly because it is not the appropriate model for that data set.

Lesson 5: Ch. 8.1 of Sampling by Steven Thompson, 3rd Edition.

Objectives

Upon completion of this lesson you should be able to:

  1. Identify the appropriate reasons and situations for using regression estimates,
  2. Assess the conditions to determine whether it is appropriate to use the regression estimate,
  3. Compute the regression estimate and its estimated variance,
  4. Compute a confidence interval based on the regression estimate,
  5. Compare the performance of the regression estimate and the expansion estimate, and recognize that the regression estimate outperforms the expansion estimate when the auxiliary data are useful, and
  6. Compare the performance of the regression estimate and the ratio estimate, and recognize that the regression estimate outperforms the ratio estimate when the condition for using the ratio estimate is not satisfied.

5.1 Linear Regression Estimator

Looking at the data, how do we decide which estimator will work well, and which model should we use? These are key questions, and the variance of each estimator will be an important indicator.

The Idea Behind Regression Estimation

When the auxiliary variable \(x\) is linearly related to \(y\) but the relationship does not pass through the origin, a linear regression estimator is appropriate. This does not mean that the regression estimate cannot be used when the intercept is close to zero: in such cases the regression and ratio estimates may be quite close, and you can choose the one you want to use.

In addition, if multiple auxiliary variables each have a linear relationship with \(y\), a multiple regression estimate may be appropriate.

To estimate the mean and total of \(y\)-values, denoted as \(\mu\) and \(\tau\), one can use the linear relationship between \(y\) and known \(x\)-values.

Let’s start with a simple example:

\(\hat{y}=a+bx\), which is our basic regression equation. Then,

\[b=\dfrac{\sum\limits_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n(x_i-\bar{x})^2} \text{ and}\]

\[a=\bar{y}-b\bar{x}\]

Then, to estimate the mean of \(y\), substitute the known population mean \(\mu_x\) for \(x\) in the regression equation:

\[\hat{\mu}_L=a+b\mu_x=(\bar{y}-b\bar{x})+b\mu_x\]

\[\hat{\mu}_L=\bar{y}+b(\mu_x-\bar{x})\]
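As a concrete illustration, here is a minimal Python sketch of this estimator. The function name and the variables xs, ys, and mu_x are ours, not from the text:

```python
def regression_estimate(xs, ys, mu_x):
    """Regression estimate of the population mean of y, given a simple
    random sample (xs, ys) and the known population mean mu_x of x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least-squares slope b; the intercept is a = y_bar - b * x_bar
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    return y_bar + b * (mu_x - x_bar)  # equivalently a + b * mu_x
```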

Note that even though \(\hat{\mu}_L\) is not unbiased under simple random sampling, it is roughly so (asymptotically unbiased) for large samples.

Thus, the Mean Square Error of the estimate, denoted as \(\operatorname{MSE}(\hat{\mu}_L)\) is not the same as \(\operatorname{Var}(\hat{\mu}_L)\) due to the bias but can be roughly estimated by the following when the sample size is large:

\[\begin{align} \hat{\operatorname{Var}}(\hat{\mu}_L) &=\dfrac{N-n}{N \times n}\cdot \dfrac{\sum\limits_{i=1}^n(y_i-a-bx_i)^2}{n-2}\\ &= \dfrac{N-n}{N \times n}\cdot \operatorname{MSE}\\ \end{align}\]

where MSE is the MSE of the linear regression model of \(y\) on \(x\).

Therefore, an approximate \((1-\alpha)100\)% CI for \(\mu\) is:

\[\hat{\mu}_L \pm t_{n-2,\alpha/2}\sqrt{\hat{\operatorname{Var}}(\hat{\mu}_L)}\]
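Continuing the sketch above, the variance estimate and the confidence interval can be computed as follows. This is a sketch of the formulas just given; SciPy is assumed to be available for the t critical value:

```python
from scipy.stats import t

def regression_mean_ci(xs, ys, mu_x, N, alpha=0.05):
    """Regression estimate of the mean, its estimated variance, and an
    approximate (1 - alpha)100% confidence interval."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    mu_hat = a + b * mu_x
    # Residual mean square from regressing y on x (n - 2 degrees of freedom)
    mse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    var_hat = (N - n) / (N * n) * mse
    half = t.ppf(1 - alpha / 2, df=n - 2) * var_hat ** 0.5
    return mu_hat, var_hat, (mu_hat - half, mu_hat + half)
```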

For the population total \(\tau\), it follows that:

\[\hat{\tau}_L=N\cdot \hat{\mu}_L=N\bar{y}+b(\tau_x-N\bar{x})\]

\[\begin{align} \hat{\operatorname{Var}}(\hat{\tau}_L) &= N^2 \hat{\operatorname{Var}}(\hat{\mu}_L) \\ &= \dfrac{N \times (N-n)}{n} \cdot \operatorname{MSE}\\ \end{align}\]

And, an approximate \((1-\alpha)100\)% CI for \(\tau\) is:

\[\hat{\tau}_L \pm t_{n-2,\alpha/2}\sqrt{\hat{\operatorname{Var}}(\hat{\tau}_L)}\]
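Since the total-level quantities are just rescaled versions of the mean-level ones, a sketch needs only one extra step (this reuses the hypothetical regression_mean_ci helper above, with \(\mu_x=\tau_x/N\)):

```python
def regression_total_ci(xs, ys, tau_x, N, alpha=0.05):
    """Regression estimate of the population total and its CI, obtained
    by scaling the mean-level results by N (note mu_x = tau_x / N)."""
    mu_hat, var_mu, (lo, hi) = regression_mean_ci(xs, ys, tau_x / N, N, alpha)
    return N * mu_hat, N**2 * var_mu, (N * lo, N * hi)
```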

Example 5.1 (Average First Year Calculus Scores)  

Reference: p. 205 of Scheaffer, Mendenhall, and Ott

The institutional researcher of a college wants to estimate the average first-year Calculus score of first-year students. Since the students take Calculus from different instructors, it is expensive to find out their Calculus scores, so the researcher takes a simple random sample of only 10 students and obtains their first-year Calculus scores. The researcher also has records of the college mathematics achievement test that all 486 first-year students took prior to entering college; the average achievement test score for the 486 students was 52. The scatterplot of both scores for the 10 sampled students is given below. The researcher would like to use this information to help estimate the average first-year Calculus score of the 486 students.

Figure: Scatterplot of Calculus scores (\(y\)) versus achievement test scores (\(x\)) for the 10 sampled students.

The scatter plot shows that there is a strong positive linear relationship.

\[\hat{\mu}_L=\bar{y}+b(\mu_x-\bar{x})=a+b\mu_x\]

Student   Achievement test score (X)   Calculus score (Y)
 1        39                           65
 2        43                           78
 3        21                           52
 4        64                           82
 5        57                           92
 6        47                           89
 7        28                           73
 8        75                           98
 9        34                           56
10        52                           75
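For use in the sketches below, the data can be entered as Python lists:

```python
# Achievement test scores (x) and Calculus scores (y) for the 10 students
xs = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
ys = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
```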

Minitab output:

Regression Analysis

The regression equation is:

\(Y = 40.8 + 0.766 X\)

Analysis of Variance

Source           DF   SS       MS       F       P
Regression        1   1450.0   1450.0   19.14   0.002
Residual Error    8   606.0    75.8
Total             9   2056.0

Coefficients

Predictor   Coef     StDev    T      P
Constant    40.784   8.507    4.79   0.001
X           0.7656   0.1750   4.38   0.002

S = 8.704; R-Sq = 70.5%; R-Sq(adj) = 66.8%
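If you want to check the Minitab fit yourself, a short sketch using NumPy's polyfit (with the xs and ys lists above) reproduces the coefficients and the residual mean square:

```python
import numpy as np

xs = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52])
ys = np.array([65, 78, 52, 82, 92, 89, 73, 98, 56, 75])

b, a = np.polyfit(xs, ys, deg=1)   # slope first, then intercept
resid = ys - (a + b * xs)
mse = (resid**2).sum() / (len(xs) - 2)
print(f"a = {a:.3f}, b = {b:.4f}, MSE = {mse:.1f}")
# Prints a = 40.784, b = 0.7656, MSE = 75.8, matching the output above
```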

Try It!

Using the results from the Minitab output here, what do you get for the regression estimate?

\[\begin{align} \hat{\mu}_L &= 40.8+0.766 \times 52\\ &= 80.63 \end{align}\]

The Minitab output provides us with \(p\)-values for the constant and the coefficient of \(X\). We can see that both terms are significant. (The ratio estimate is not appropriate here since the constant term is significantly different from zero.)

Try It!

What is the variance of the regression estimate?

\[\begin{align} \hat{\operatorname{Var}}(\hat{\mu}_L) &=\dfrac{N-n}{N \times n}\cdot \operatorname{MSE}\\ &= \dfrac{486-10}{486 \times 10} \times 75.8\\ &= 7.42\\ \end{align}\]

Try It!

What is then, an approximate 95% CI for \(\mu\)?

\[\hat{\mu}_L \pm t_{n-2,\,\alpha/2}\sqrt{\hat{\operatorname{Var}}(\hat{\mu}_L)}, \quad df=n-2=8\]

\[\begin{align} &= 80.63 \pm 2.306 \times \sqrt{7.42} \\ &= 80.63 \pm 6.28 \end{align}\]
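The three Try It! answers can be verified with a short script. This is a sketch using the rounded figures from the output above; SciPy supplies the t critical value:

```python
from scipy.stats import t

N, n = 486, 10
mu_x = 52
a, b, mse = 40.784, 0.7656, 75.8   # from the regression output above

mu_hat = a + b * mu_x                # about 80.6 (the text rounds a and b
                                     # first and gets 80.63)
var_hat = (N - n) / (N * n) * mse    # 7.42
half = t.ppf(0.975, df=n - 2) * var_hat**0.5   # 2.306 * sqrt(7.42) = 6.28
print(mu_hat, var_hat, half)
```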

5.2 Comparison of Estimators

Compare the regression estimate to the estimate \(\bar{y}\)

To compare the regression estimate to the estimate \(\bar{y}\) (which does not use the auxiliary information in \(x\)), we see that:

\[\hat{\operatorname{Var}}(\bar{y})=\dfrac{N-n}{N}\cdot \dfrac{s^2}{n}\]

The sample variance of the \(y\)-values is \(s^2=(15.11)^2=228.31\).

Try It!

  1. What is \(\hat{\operatorname{Var}}(\bar{y})\)?

    \[\begin{align} \hat{\operatorname{Var}}(\bar{y}) &= \dfrac{486-10}{486 \times 10} \cdot 228.31 \\ &= 22.36\\ \end{align}\]

  2. Next, what is an approximate 95% CI for \(\mu\)?


    \[\bar{y} \pm t_{n-1}\sqrt{\hat{\operatorname{Var}}(\bar{y})}\] \[\begin{align} &= 76 \pm 2.262 \times \sqrt{22.36} \\ &= 76 \pm 10.70 \end{align}\]

Recall that the 95% confidence interval using the regression estimate is \(80.63 \pm 6.28\), a much shorter confidence interval.

This regression estimate is more precise than \(\bar{y}\).
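A quick numeric check of this comparison, reusing the figures above (a sketch):

```python
from scipy.stats import t

N, n = 486, 10
s2 = 15.11**2                       # sample variance of y, 228.31
var_ybar = (N - n) / N * s2 / n     # 22.36
half_ybar = t.ppf(0.975, df=n - 1) * var_ybar**0.5   # 2.262 * sqrt(22.36) = 10.70
half_reg = 6.28                     # half-width of the regression CI above
print(half_ybar, half_reg)          # the regression interval is much shorter
```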

Additionally, we have another estimator that we can look at.

Compare \(\hat{\mu}_L\) to the ratio estimator \(\hat{\mu}_r\)

Next, Minitab was used to find out the mean and standard deviation for \(X\) and \(Y\).

Variable   N    Mean    StDev   SE Mean
X          10   46.00   16.58   5.24
Y          10   76.00   15.11   4.78

The ratio estimate is inappropriate for this example. However, just to show a counter-example, we can compute the ratio estimate and its variance using the following Minitab printout and compare them to the regression estimate.

X    Y    Y - rX
39   65     0.572
43   78     6.964
21   52    17.308
64   82   -23.728
57   92    -2.164
47   89    11.356
28   73    26.744
75   98   -25.900
34   56    -0.168
52   75   -10.904

The uncorrected sum of squares of \(Y - rX\) is \(\sum\limits_{i=1}^{10}(y_i-rx_i)^2 = 2550.03\).

Note! For the Calculus Scores example, we should not use the ratio estimator \(\hat{\mu}_r\) because the \(p\)-value for the constant term is 0.001, which implies that the regression line does not pass through the origin; for this reason the ratio estimate is not appropriate. But for the purposes of a counter-example, we will work it out here anyway:

\[\hat{\mu}_r=r\mu_x=\dfrac{\bar{y}}{\bar{x}}\cdot \mu_x=\dfrac{76}{46}\cdot 52=85.91\]

Next, we need to figure out the variance, and for this we need the sample variance of the residuals \(y_i - rx_i\) under the ratio model. Dividing the uncorrected sum of squares above by \(n-1\) gives:

\[s^2_r=\dfrac{1}{10-1} \sum\limits_{i=1}^{10} (y_i-rx_i)^2=\dfrac{2550.03}{9}=283.33\]

This is huge compared to the regression MSE of 75.8!

Now we can compute the variance:

Try It!

What is the variance of \(\hat{\mu}_r\)?

\[\begin{align} \hat{\operatorname{Var}}(\hat{\mu}_r) &=\dfrac{N-n}{N}\cdot \dfrac{s^2_r}{n}\\ &= \dfrac{486-10}{486}\cdot \dfrac{283.33}{10}=27.75\\ \end{align}\]

Now we can compute a 95% confidence interval for \(\mu\).

Try It!

What is an approximate 95% confidence interval for \(\mu\) based on the ratio estimate?

\[\hat{\mu}_r \pm t_{n-1}\sqrt{\hat{\operatorname{Var}}(\hat{\mu}_r)}\]

\[\begin{align} &= 85.91 \pm 2.262 \times \sqrt{27.75} \\ &= 85.91 \pm 11.92 \end{align}\]
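All of the ratio-estimate arithmetic can be checked from the raw data with a short sketch (small differences from the text come from rounding \(r\)):

```python
from scipy.stats import t

N, n = 486, 10
mu_x = 52
xs = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
ys = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

r = sum(ys) / sum(xs)                 # 76/46 = 1.652
mu_hat_r = r * mu_x                   # 85.91
s2_r = sum((y - r * x)**2 for x, y in zip(xs, ys)) / (n - 1)  # about 283.3
var_hat_r = (N - n) / N * s2_r / n    # about 27.75
half = t.ppf(0.975, df=n - 1) * var_hat_r**0.5   # about 11.92
print(mu_hat_r, half)
```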

We can see that the ratio estimate is even worse than \(\bar{y}\) when it is used in an inappropriate situation.

The width of the interval is larger than the one for the regression estimate.

The moral of this story here is, “Use the right model!”