Given the data, how do we decide which estimator or which model to use? These are key questions, and the variance of each estimator is an important indicator.
The Idea Behind Regression Estimation
When the auxiliary variable x is linearly related to y but the line does not pass through the origin, a linear regression estimator is appropriate. This does not mean that the regression estimate cannot be used when the intercept is close to zero. In such cases the regression and ratio estimates may be quite close, and you can use either one.
In addition, if multiple auxiliary variables have a linear relationship with y, multiple regression estimates may be appropriate.
To estimate the mean and total of y-values, denoted as \(\mu\) and \(\tau\), one can use the linear relationship between y and known x-values.
Let's start with a simple example:
\(\hat{y}=a+bx\), which is our basic regression equation.
Then,
\(b=\dfrac{\sum\limits_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n(x_i-\bar{x})^2}\) and
\(a=\bar{y}-b\bar{x}\)
Then to estimate the mean for y, substitute as follows:
\(x=\mu_x,\quad a=\bar{y}-b\bar{x},\text{then}\)
\(\hat{\mu}_L=(\bar{y}-b\bar{x})+b\mu_x\)
\(\hat{\mu}_L=\bar{y}+b(\mu_x-\bar{x}),\quad \hat{\mu}_L=a+b\mu_x\)
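The substitution above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `regression_estimate_mean` and the toy data are assumptions made for the example.

```python
def regression_estimate_mean(x, y, mu_x):
    """Return (a, b, mu_hat_L) for the linear regression estimator of the mean.

    x, y : sample values of the auxiliary and study variables
    mu_x : known population mean of x
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope: b = S_xy / S_xx
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    # Intercept: a = y_bar - b * x_bar
    a = y_bar - b * x_bar
    # Regression estimator: mu_hat_L = y_bar + b * (mu_x - x_bar) = a + b * mu_x
    mu_hat = y_bar + b * (mu_x - x_bar)
    return a, b, mu_hat


# Hypothetical toy data lying exactly on the line y = 1 + 2x
a, b, mu_hat = regression_estimate_mean([1, 2, 3, 4], [3, 5, 7, 9], mu_x=5)
print(a, b, mu_hat)
```

Both forms of the estimator, \(\bar{y}+b(\mu_x-\bar{x})\) and \(a+b\mu_x\), give the same value; the code uses the first.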
Note that even though \(\hat{\mu}_L\) is not unbiased under simple random sampling, it is approximately unbiased (asymptotically unbiased) for large samples.
Thus, the mean square error of the estimator, denoted MSE(\(\hat{\mu}_L\)), is not the same as Var(\(\hat{\mu}_L\)) because of the bias, but when the sample size is large it can be approximated by:
\begin{align}
\hat{V}ar(\hat{\mu}_L) &=\dfrac{N-n}{N \times n}\cdot \dfrac{\sum\limits_{i=1}^n(y_i-a-bx_i)^2}{n-2}\\
&= \dfrac{N-n}{N \times n}\cdot MSE\\
\end{align}
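where MSE is the mean square error (residual mean square) of the linear regression of y on x, not to be confused with MSE(\(\hat{\mu}_L\)), the mean square error of the estimator.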
Therefore, an approximate (1-\(\alpha\))100% CI for \(\mu\) is:
\(\hat{\mu}_L \pm t_{n-2,\alpha/2}\sqrt{\hat{V}ar(\hat{\mu}_L)}\)
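As a rough sketch of the variance and confidence-interval formulas, the following Python function computes \(\hat{\mu}_L\), its estimated variance, and the interval. The function name, the toy data, and the choice to pass the t critical value in as an argument (rather than look it up with a statistics library) are assumptions made for this illustration.

```python
import math


def regression_mean_ci(x, y, mu_x, N, t_crit):
    """Return (mu_hat_L, var_hat, (lower, upper)).

    N      : population size
    t_crit : t critical value t_{n-2, alpha/2}, supplied by the caller
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    mu_hat = a + b * mu_x
    # Residual mean square of the regression of y on x (n - 2 degrees of freedom)
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    # Estimated variance with the finite population correction (N - n) / N
    var_hat = (N - n) / (N * n) * mse
    half_width = t_crit * math.sqrt(var_hat)
    return mu_hat, var_hat, (mu_hat - half_width, mu_hat + half_width)


# Hypothetical toy data; 4.303 is the two-sided 95% t value for 2 degrees of freedom
mu_hat, var_hat, ci = regression_mean_ci(
    [1, 2, 3, 4], [2, 4, 5, 8], mu_x=3, N=20, t_crit=4.303
)
print(mu_hat, var_hat, ci)
```

With only four observations the large-sample approximation would not be trusted in practice; the toy data are there only to make the arithmetic easy to follow.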
It follows that:
\(\hat{\tau}_L=N\cdot \hat{\mu}_L=N\bar{y}+b(\tau_x-N\bar{x})\)
\begin{align}
\hat{V}ar(\hat{\tau}_L) &= N^2 \hat{V}ar(\hat{\mu}_L) \\
&= \dfrac{N \times (N-n)}{n} \cdot MSE\\
\end{align}
And, an approximate (1-\(\alpha\))100% CI for \(\tau\) is:
\(\hat{\tau}_L \pm t_{n-2,\alpha/2}\sqrt{\hat{V}ar(\hat{\tau}_L)}\)
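The step from the mean to the total is a simple rescaling, which a short Python sketch can make explicit. The function name and the input values below are hypothetical and are not taken from the example that follows.

```python
import math


def regression_total_from_mean(mu_hat, var_mu_hat, N, t_crit):
    """Scale the regression estimate of the mean up to the population total.

    Returns (tau_hat, var_tau_hat, (lower, upper)).
    """
    tau_hat = N * mu_hat                 # tau_hat_L = N * mu_hat_L
    var_tau_hat = N ** 2 * var_mu_hat    # Var(tau_hat_L) = N^2 * Var(mu_hat_L)
    half_width = t_crit * math.sqrt(var_tau_hat)
    return tau_hat, var_tau_hat, (tau_hat - half_width, tau_hat + half_width)


# Hypothetical inputs: mu_hat = 5.7, Var(mu_hat) = 0.07, N = 20, t value for 2 df
tau_hat, var_tau_hat, ci_tau = regression_total_from_mean(5.7, 0.07, 20, 4.303)
print(tau_hat, var_tau_hat, ci_tau)
```

Because the standard error scales by N, the interval for \(\tau\) is just N times the interval for \(\mu\).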
Example 5-1: Average First-Year Calculus Scores
Reference: p. 205 of Scheaffer, Mendenhall, and Ott
The institutional researcher of a college wants to estimate the average first-year Calculus score of first-year students. Because the students take the Calculus class from different instructors, it is expensive to find out all of their Calculus scores, so the researcher takes a simple random sample of only 10 students and finds out their first-year Calculus scores. The researcher also has a record of the college mathematics achievement test that the 486 first-year students took prior to entering college; the average achievement test score for the 486 students was 52. The scatterplot of both scores for the 10 sampled students is given below. The researcher would like to use this information to help estimate the average first-year Calculus score of these 486 students.
The scatter plot shows that there is a strong positive linear relationship.
\(\hat{\mu}_L=\bar{y}+b(\mu_x-\bar{x})=a+b\mu_x\)
Student | Achievement test score X | Calculus score Y |
---|---|---|
1 | 39 | 65 |
2 | 43 | 78 |
3 | 21 | 52 |
4 | 64 | 82 |
5 | 57 | 92 |
6 | 47 | 89 |
7 | 28 | 73 |
8 | 75 | 98 |
9 | 34 | 56 |
10 | 52 | 75 |
Minitab output
Regression Analysis
The regression equation is
Y = 40.8 + 0.766 X
Analysis of Variance
Source | DF | SS | MS | F | P |
---|---|---|---|---|---|
Regression | 1 | 1450.0 | 1450.0 | 19.14 | 0.002 |
Resid. Error | 8 | 606.0 | 75.8 | | |
Total | 9 | 2056.0 | | | |
Coefficients
Predictor | Coef | StDev | T | P |
---|---|---|---|---|
Constant | 40.784 | 8.507 | 4.79 | 0.001 |
X | 0.7656 | 0.1750 | 4.38 | 0.002 |
S = 8.704 R-Sq = 70.5% R-Sq(adj) = 66.8%
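For readers without Minitab, the key numbers in this output can be checked with plain Python from the ten data pairs in the table. This is a sketch using the formulas for a and b given earlier; the variable names are chosen for this illustration.

```python
import math

# Data from the table: achievement test score (x) and Calculus score (y)
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx              # slope
a = y_bar - b * x_bar        # intercept

ss_reg = b * s_xy            # regression sum of squares
ss_err = s_yy - ss_reg       # residual (error) sum of squares
mse = ss_err / (n - 2)       # residual mean square, 8 degrees of freedom
s = math.sqrt(mse)           # residual standard deviation
r_sq = ss_reg / s_yy         # coefficient of determination
f_stat = ss_reg / mse        # F statistic

print(f"Y = {a:.3f} + {b:.4f} X")
print(f"SS Regression = {ss_reg:.1f}, SS Error = {ss_err:.1f}, SS Total = {s_yy:.1f}")
print(f"MSE = {mse:.1f}, S = {s:.3f}, R-Sq = {100 * r_sq:.1f}%, F = {f_stat:.2f}")
```

The printed values should agree with the Minitab table up to rounding.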
Substituting \(\mu_x=52\) into the fitted regression equation gives the estimate:
\begin{align}
\hat{\mu}_L &= 40.8+0.766 \times 52\\
&= 80.63\\
\end{align}
The Minitab output provides p-values for the constant and for the coefficient of X, and both terms are significant. A ratio estimate is therefore not appropriate, since the constant term is significantly different from zero.
Now we can compute the variance.
\begin{align}
\hat{V}ar(\hat{\mu}_L) &=\dfrac{N-n}{N \times n}\cdot MSE\\
&= \dfrac{486-10}{486 \times 10} \times 75.8\\
&= 7.42\\
\end{align}
An approximate 95% CI for \(\mu\) is then:
\(\hat{\mu}_L \pm t_{n-2,\alpha/2}\sqrt{\hat{V}ar(\hat{\mu}_L)}, \quad df=8\)
\begin{array}{lcl}
& = & 80.63 \pm 2.306 \times \sqrt{7.42} \\
& = & 80.63 \pm 6.28
\end{array}
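The whole calculation for this example can be reproduced with a short Python sketch. It takes the t value 2.306 from the text rather than computing it, and the variable names are assumptions for this illustration. Because it carries the unrounded coefficients through, the point estimate comes out near 80.59 rather than the 80.63 obtained from the rounded equation Y = 40.8 + 0.766 X; the variance and the half-width of the interval are unaffected at two decimals.

```python
import math

# Sample data from the table and population information from the example
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
N = 486        # number of first-year students
mu_x = 52      # known mean achievement test score
t_crit = 2.306  # t value with 8 df used in the text

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Regression estimate of the mean Calculus score
mu_hat = a + b * mu_x

# Estimated variance: (N - n) / (N * n) * MSE
mse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
var_hat = (N - n) / (N * n) * mse

# Approximate 95% confidence interval
half_width = t_crit * math.sqrt(var_hat)
lower, upper = mu_hat - half_width, mu_hat + half_width

print(f"mu_hat = {mu_hat:.2f}, var_hat = {var_hat:.2f}")
print(f"95% CI: {mu_hat:.2f} +/- {half_width:.2f} = ({lower:.2f}, {upper:.2f})")
```

Carrying full precision through to the end and rounding only the final answer is the usual practice; the small gap between 80.59 and 80.63 shows how rounding intermediate coefficients can move the last digits.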