Lesson 8: More Regression

Overview

In the previous lesson, we learned that one of the primary uses of an estimated regression line:

\(\hat{y}=\hat{\alpha}+\hat{\beta}(x-\bar{x})\)

is to determine whether or not a linear relationship exists between the predictor \(x\) and the response \(y\). In that lesson, we learned how to calculate a confidence interval for the slope parameter \(\beta\) as a way of determining whether a linear relationship does exist. In this lesson, we'll learn two other primary uses of an estimated regression line:

  1. If we are interested in knowing the value of the mean response \(E(Y)=\mu_Y\) for a given value \(x\) of the predictor, we'll learn how to calculate a confidence interval for the mean \(E(Y)=\mu_Y\).

  2. If we are interested in knowing the value of a new observation \(Y_{n+1}\) for a given value \(x\) of the predictor, we'll learn how to calculate a prediction interval for the new observation \(Y_{n+1}\).


8.1 - A Confidence Interval for the Mean of Y

We have gotten so good at deriving confidence intervals for various parameters that we can just jump right in and state (and prove) the result.

Theorem

A \((1-\alpha)100\%\) confidence interval for the mean \(\mu_Y\) is:

\(\hat{y} \pm t_{\alpha/2,n-2}\sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\)

Proof

We know from our work in the previous lesson that a point estimate of the mean \(\mu_Y\) is:

\(\hat{y}=\hat{\alpha}+\hat{\beta}(x-\bar{x})\)

Now, recall that:

\(\hat{\alpha} \sim N\left(\alpha,\dfrac{\sigma^2}{n}\right)\) and \(\hat{\beta}\sim N\left(\beta,\dfrac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\right)\) and \(\dfrac{n\hat{\sigma}^2}{\sigma^2}\sim \chi^2_{(n-2)}\)

are independent. Therefore, \(\hat{Y}\) is a linear combination of independent normal random variables with mean:

\(E(\hat{y})=E[\hat{\alpha}+\hat{\beta}(x-\bar{x})]=E(\hat{\alpha})+(x-\bar{x})E(\hat{\beta})=\alpha+\beta(x-\bar{x})=\mu_Y\)

and variance:

\(Var(\hat{y})=Var[\hat{\alpha}+\hat{\beta}(x-\bar{x})]=Var(\hat{\alpha})+(x-\bar{x})^2 Var(\hat{\beta})=\dfrac{\sigma^2}{n}+\dfrac{(x-\bar{x})^2\sigma^2}{\sum(x_i-\bar{x})^2}=\sigma^2\left[\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right]\)

The first equality holds by the definition of \(\hat{Y}\). The second equality holds because \(\hat{\alpha}\) and \(\hat{\beta}\) are independent. The third equality comes from the distributions of \(\hat{\alpha}\) and \(\hat{\beta}\) that are recalled above. And, the last equality comes from simple algebra. Putting it all together, we have:

\(\hat{Y} \sim N\left(\mu_Y, \sigma^2\left[\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right]\right)\)

Now, the definition of a \(T\) random variable tells us that:

\(T=\dfrac{\dfrac{\hat{Y}-\mu_Y}{\sigma \sqrt{\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}}}}{\sqrt{\dfrac{n\hat{\sigma}^2}{\sigma^2}/(n-2)}}=\dfrac{\hat{Y}-\mu_Y}{\sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} \sim t_{n-2}\)

So, finding the confidence interval for \(\mu_Y\) again reduces to manipulating the quantity inside the parentheses of a probability statement:

\(P\left(-t_{\alpha/2,n-2} \leq \dfrac{\hat{Y}-\mu_Y}{\sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} \leq +t_{\alpha/2,n-2}\right)=1-\alpha\)

Upon doing the manipulation, we get that a \((1-\alpha)100\%\) confidence interval for \(\mu_Y\) is:

\(\hat{y} \pm t_{\alpha/2,n-2}\sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\)

as was to be proved.

Example 8-1

[Photo: Old Faithful Geyser]

The eruptions of Old Faithful Geyser in Yellowstone National Park, Wyoming are quite regular (hence its name). Rangers post the predicted time until the next eruption (\(y\), in minutes) based on the duration of the previous eruption (\(x\), in minutes). Using data on 107 eruptions collected by park geologist R. A. Hutchinson, what is the mean time until the next eruption if the previous eruption lasted 4.8 minutes? If it lasted 3.5 minutes? (Photo credit: Tony Lehrman)

Answer

The easiest (and most practical!) way of calculating the confidence interval for the mean is to let Minitab do the work for us. Here's what the resulting analysis looks like:

The regression equation is NEXT = 33.828 + 10.741 DURATION

Analysis of Variance

Source           DF     SS     MS       F      P
Regression        1  13133  13133  294.08  0.000
Residual Error  105   4689     45
Total           106  17822

Predicted Values for New Observations

New Obs (x)     Fit  SE Fit        95% CI            95% PI
        4.8  85.385   1.059  (83.286, 87.484)  (71.969, 98.801)
        3.5  71.422   0.646  (70.140, 72.703)  (58.109, 84.734)

That is, we can be 95% confident that, if the previous eruption lasted 4.8 minutes, then the mean time until the next eruption is between 83.286 and 87.484 minutes. And, we can be 95% confident that, if the previous eruption lasted 3.5 minutes, then the mean time until the next eruption is between 70.140 and 72.703 minutes.

Let's do one of the calculations by hand, though. When the previous eruption lasted \(x=4.8\) minutes, then the predicted time until the next eruption is:

\(\hat{y}=33.828 + 10.741(4.8)=85.385\)

Now, we can use Minitab or a probability calculator to determine that \(t_{0.025, 105}=1.9828\). We can also use Minitab to determine that MSE equals 44.66 (it is rounded to 45 in the above output), the mean duration is 3.46075 minutes, and:

\(\sum\limits_{i=1}^n (x_i-\bar{x})^2=113.835\)

Putting it all together, we get:

\(85.385 \pm 1.9828 \sqrt{44.66} \sqrt{\dfrac{1}{107}+\dfrac{(4.8-3.46075)^2}{113.835}}\)

which simplifies to this:

\(85.385 \pm 2.099\)

and finally this:

\((83.286,87.484)\)

as we (thankfully) obtained previously using Minitab. Incidentally, you might note that the length of the confidence interval for \(\mu_Y\) when \(x=4.8\) is:

\(87.484-83.286=4.198\)

and the length of the confidence interval when \(x=3.5\) is:

\(72.703-70.140=2.563\)

Hmmm. That suggests that the confidence interval is narrower when the \(x\) value is close to the mean of all of the \(x\) values. That is, in fact, one generalization, among others, that we can make about the length of the confidence interval for \(\mu_Y\).

Ways of Getting a Narrow(er) Confidence Interval for \(\mu_Y\)

If we take a look at the formula for the confidence interval for \(\mu_Y\):

\(\hat{y} \pm t_{\alpha/2,n-2}\sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{(x-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\)

we can determine four ways in which we can get a narrow confidence interval for \(\mu_Y\). We can:

  1. Estimate the mean \(\mu_Y\) at the mean of the predictor values. That's because, when \(x=\bar{x}\), the term circled in blue:

    \(\hat{y} \pm t_{\alpha / 2, n-2} \sqrt{M S E} \sqrt{\frac{1}{n}+\frac{\color{blue}\boxed{\color{black}(x-\bar{x})^{2}}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}\)

    contributes nothing to the length of the interval. That is, the shortest confidence interval for \(\mu_Y\) occurs when \(x=\bar{x}\).

  2. Decrease the confidence level. That's because the smaller the confidence level, the smaller the term circled in blue:

    \(\hat{y} \pm \color{blue}\boxed{\color{black}t_{\alpha / 2, n-2}}\color{black} \sqrt{M S E} \sqrt{\frac{1}{n}+\frac{(x-\bar{x})^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}\)

    and therefore the shorter the length of the interval.

  3. Increase the sample size. That's because the larger the sample size, the larger the term circled in blue (which appears in a denominator):

    \(\hat{y} \pm t_{\alpha / 2, n-2} \sqrt{M S E} \sqrt{\frac{1}{\color{blue}\boxed{\color{black}n}}+\frac{(x-\bar{x})^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}\)

    and therefore the shorter the length of the interval.

  4. Choose predictor values \(x_i\) so that they are quite spread out. That's because the more spread out the predictor values, the larger the term circled in blue (again, a denominator):

    \(\hat{y} \pm t_{\alpha / 2, n-2} \sqrt{M S E} \sqrt{\frac{1}{n}+\frac{(x-\bar{x})^{2}}{\color{blue}\boxed{\color{black}\sum\left(x_{i}-\bar{x}\right)^{2}}}}\)

    and therefore the shorter the length of the interval.

Each of these four effects is illustrated numerically in the short sketch following this list.
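Here is a small Python sketch of those effects. It reuses the Old Faithful summary numbers from the example above purely as an assumption for illustration, and reports the half-width of the interval as \(x\) and the confidence level change.

from math import sqrt
from scipy import stats

def half_width(x, alpha=0.05, n=107, mse=44.66, x_bar=3.46075, sxx=113.835):
    # Half-width of the (1 - alpha)100% confidence interval for E(Y) at x
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t * sqrt(mse) * sqrt(1 / n + (x - x_bar) ** 2 / sxx)

# (1) The interval is narrowest at x = x_bar and widens as x moves away:
for x in (3.46075, 3.5, 4.8):
    print(x, round(half_width(x), 3))         # 1.281, 1.282, 2.099

# (2) Lowering the confidence level (larger alpha) also shrinks the interval:
print(round(half_width(4.8, alpha=0.10), 3))  # smaller than the 95% value of 2.099

# (3) and (4): increasing n or spreading out the x's enters through the 1/n
# and sum-of-squares terms, shrinking the interval in the same way.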


8.2 - A Prediction Interval for a New Y

On the previous page, we focused our attention on deriving a confidence interval for the mean \(\mu_Y\) at \(x\), a particular value of the predictor variable. Now, we'll turn our attention to deriving a prediction interval, not for a mean, but rather for predicting a (that's one!) new observation of the response, which we'll denote \(Y_{n+1}\), at \(x\), a particular value of the predictor variable. Let's again just jump right in and state (and prove) the result.

Theorem

A \((1-\alpha)100\%\) prediction interval for a new observation \(Y_{n+1}\) when the predictor \(x=x_{n+1}\) is:

\(\hat{y}_{n+1} \pm t_{\alpha/2,n-2}\sqrt{MSE} \sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\)

Proof

First, recall that:

\(Y_{n+1} \sim N(\alpha+\beta(x_{n+1}-\bar{x}),\sigma^2)\) and \(\hat{\alpha} \sim N\left(\alpha,\dfrac{\sigma^2}{n}\right)\) and \(\hat{\beta}\sim N\left(\beta,\dfrac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\right)\) and \(\dfrac{n\hat{\sigma}^2}{\sigma^2}=\dfrac{(n-2)MSE}{\sigma^2}\sim \chi^2_{(n-2)}\)

are independent. Therefore:

\(W=Y_{n+1}-\hat{Y}_{n+1}=Y_{n+1}-\hat{\alpha}-\hat{\beta}(x_{n+1}-\bar{x})\)

is a linear combination of independent normal random variables with mean:

\(\begin{aligned}
E(W)=E\left[Y_{n+1}-\hat{\alpha}-\hat{\beta}\left(x_{n+1}-\bar{x}\right)\right] &=\color{blue}E\left(Y_{n+1}\right)\color{black}-\color{red}E(\hat{\alpha})\color{black}-\color{green}\left(x_{n+1}-\bar{x}\right) E(\hat{\beta})\color{black} \\
&=\color{blue}\alpha+\beta\left(x_{n+1}-\bar{x}\right)\color{black}-\color{red}\alpha\color{black}-\color{green}\left(x_{n+1}-\bar{x}\right) \beta\color{black} \\
&=0
\end{aligned}\)

and variance:

\(\begin{aligned}
\operatorname{Var}(W)=\operatorname{Var}\left[Y_{n+1}-\hat{\alpha}-\hat{\beta}\left(x_{n+1}-\bar{x}\right)\right] &\stackrel{\text{IND}}{=}\color{blue}\operatorname{Var}\left(Y_{n+1}\right)\color{black}+\color{red}\operatorname{Var}(\hat{\alpha})\color{black}+\color{green}\left(x_{n+1}-\bar{x}\right)^{2} \operatorname{Var}(\hat{\beta})\color{black} \\
&=\color{blue}\sigma^{2}\color{black}+\color{red}\frac{\sigma^{2}}{n}\color{black}+\color{green}\frac{\left(x_{n+1}-\bar{x}\right)^{2} \sigma^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}\color{black} \\
&=\sigma^{2}\left[1+\frac{1}{n}+\frac{\left(x_{n+1}-\bar{x}\right)^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}\right]
\end{aligned}\)

The first equality holds by the definition of \(W\). The second equality holds because \(Y_{n+1}\), \(\hat{\alpha}\) and \(\hat{\beta}\) are independent. The third equality comes from the distributions of \(Y_{n+1}\), \(\hat{\alpha}\) and \(\hat{\beta}\) that are recalled above. And, the last equality comes from simple algebra. Putting it all together, we have:

\(W=(Y_{n+1}-\hat{Y}_{n+1})\sim N\left(0,\sigma^2\left[1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right]\right)\)

Now, the definition of a \(T\) random variable tells us that:

\(T=\dfrac{\dfrac{(Y_{n+1}-\hat{Y}_{n+1})-0}{\sqrt{\sigma^2\left(1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}\right)}}}{\sqrt{\dfrac{n\hat{\sigma}^2}{\sigma^2}/(n-2)}}=\dfrac{(Y_{n+1}-\hat{Y}_{n+1})}{\sqrt{MSE}\sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} \sim t_{n-2}\)

So, finding the prediction interval for \(Y_{n+1}\) again reduces to manipulating the quantity inside the parentheses of a probability statement:

\(P\left(-t_{\alpha/2,n-2} \leq \dfrac{(Y_{n+1}-\hat{Y}_{n+1})}{\sqrt{MSE}\sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} \leq +t_{\alpha/2,n-2}\right)=1-\alpha\)

Upon doing the manipulation, we get that a \((1-\alpha)100\%\) prediction interval for \(Y_{n+1}\) is:

\(\hat{y}_{n+1} \pm t_{\alpha/2,n-2}\sqrt{MSE} \sqrt{1+\dfrac{1}{n}+\dfrac{(x_{n+1}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}\)

as was to be proved.

Example 8-1 (continued)

[Photo: Old Faithful Geyser]

The eruptions of Old Faithful Geyser in Yellowstone National Park, Wyoming are quite regular (hence its name). Rangers post the predicted time until the next eruption (\(y\), in minutes) based on the duration of the previous eruption (\(x\), in minutes). Using data on 107 eruptions collected by park geologist R. A. Hutchinson, what is the predicted time until the next eruption if the previous eruption lasted 4.8 minutes? If it lasted 3.5 minutes?

Answer

Again, the easiest (and most practical!) way of calculating the prediction interval for the new observation is to let Minitab do the work for us. Here's what the resulting analysis looks like:

The regression equation is NEXT = 33.828 + 10.741 DURATION

Analysis of Variance

Source           DF     SS     MS       F      P
Regression        1  13133  13133  294.08  0.000
Residual Error  105   4689     45
Total           106  17822

Predicted Values for New Observations

New Obs (x)     Fit  SE Fit        95% CI            95% PI
        4.8  85.385   1.059  (83.286, 87.484)  (71.969, 98.801)
        3.5  71.422   0.646  (70.140, 72.703)  (58.109, 84.734)

That is, we can be 95% confident that, if the previous eruption lasted 4.8 minutes, then the time until the next eruption will be between 71.969 and 98.801 minutes. And, we can be 95% confident that, if the previous eruption lasted 3.5 minutes, then the time until the next eruption will be between 58.109 and 84.734 minutes.

Let's do one of the calculations by hand, though. When the previous eruption lasted \(x=4.8\) minutes, then the predicted time until the next eruption is:

\(\hat{y}=33.828 + 10.741(4.8)=85.385\)

Now, we can use Minitab or a probability calculator to determine that \(t_{0.025, 105}=1.9828\). We can also use Minitab to determine that MSE equals 44.66 (it is rounded to 45 in the above output), the mean duration is 3.46075 minutes, and:

\(\sum\limits_{i=1}^n (x_i-\bar{x})^2=113.835\)

Putting it all together, we get:

\(85.385 \pm 1.9828 \sqrt{44.66} \sqrt{1+\dfrac{1}{107}+\dfrac{(4.8-3.46075)^2}{113.835}}\)

which simplifies to this:

\(85.385 \pm 13.416\)

and finally this:

\((71.969,98.801)\)

as we (thankfully) obtained previously using Minitab. Incidentally, you might note that the length of the confidence interval for \(\mu_Y\) when \(x=4.8\) is:

\(87.484-83.286=4.198\)

and the length of the prediction interval when \(x=4.8\) is:

\(98.801-71.969=26.832\)

Hmmm. I wonder if that means that the confidence interval will always be narrower than the prediction interval? That is indeed the case. Let's take note of that, as well as a few other things.

Note!

  1. For a given value \(x\) of the predictor variable, and confidence level \((1-\alpha)\), the prediction interval for a new observation \(Y_{n+1}\) is always longer than the corresponding confidence interval for the mean \(\mu_Y\). That's because the prediction interval has an extra term (MSE, the estimate of the population variance) in its standard error:

    \(\displaystyle{\hat{y}_{n+1} \pm t_{\alpha / 2, n-2} \sqrt{M S E} \sqrt{\color{blue}\boxed{\color{black}1}\color{black}+\frac{1}{n}+\frac{\left(x_{n+1}-\bar{x}\right)^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}}\)

  2. The prediction interval for a new observation \(Y_{n+1}\) can be made to be narrower in the same ways that we can make the confidence interval for the mean \(\mu_Y\) narrower. That is, we can make a prediction interval for a new observation \(Y_{n+1}\) narrower by:

    1. decreasing the confidence level

    2. increasing the sample size

    3. choosing predictor values \(x_i\) so that they are quite spread out

    4. predicting \(Y_{n+1}\) at the mean of the predictor values.

  3. We cannot make the standard error of the prediction for \(Y_{n+1}\) approach 0, as we can for the standard error of the estimate for \(\mu_Y\). That's again because the prediction interval has an extra term (MSE, the estimate of the population variance) in its standard error:

    \(\displaystyle{\hat{y}_{n+1} \pm t_{\alpha / 2, n-2} \sqrt{M S E} \sqrt{\color{blue}\boxed{\color{black}1}\color{black}+\frac{1}{n}+\frac{\left(x_{n+1}-\bar{x}\right)^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}}\)
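To see that limiting behavior numerically, consider the following Python sketch. It holds MSE and the per-observation spread of the \(x\) values fixed (an artificial assumption made purely for illustration) and lets the sample size grow: the standard error for estimating \(\mu_Y\) shrinks toward 0, while the standard error of prediction can never drop below \(\sqrt{MSE}\).

from math import sqrt

mse = 44.66                 # estimate of the error variance, from the example above
x, x_bar = 4.8, 3.46075

# Assume the spread of the x's per observation stays roughly constant,
# so that sum (x_i - x_bar)^2 grows like 1.064 * n (113.835 / 107 ~ 1.064).
for n in (107, 1_000, 100_000):
    sxx = 1.064 * n
    se_fit = sqrt(mse) * sqrt(1 / n + (x - x_bar) ** 2 / sxx)
    se_pred = sqrt(mse) * sqrt(1 + 1 / n + (x - x_bar) ** 2 / sxx)
    print(n, round(se_fit, 3), round(se_pred, 3))

# se_fit shrinks toward 0 as n grows, but se_pred stays above sqrt(mse), about 6.68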


8.3 - Using Minitab to Lighten the Workload

Minitab®

Use Minitab to calculate the confidence and/or prediction intervals

For any sizeable data set, and even for the small ones, you'll definitely want to use Minitab to calculate the confidence and/or prediction intervals for you. To do so:

  1. Under the Stat menu, select Regression, and then select Regression:

    [Minitab screenshot]

  2. In the pop-up window that appears, in the box labeled Response, specify the response and, in the box labeled Predictors, specify the predictor variable:

    [Minitab screenshot]

    Then, click on the Options... button.

  3. In the pop-up window that appears, in the box labeled Prediction intervals for new observations, type the value of the predictor variable for which you'd like a confidence interval and/or prediction interval:

    [Minitab screenshot]

    In the box labeled Confidence level, type your desired confidence level. (The default is 95.) Then, select OK.

  4. And, select OK on the main pop-up window. The output should appear in the session window. The first part of the output should look something like this:

    The regression equation is NEXT = 33.8 + 10.7 DURATION

    Predictor      Coef  SE Coef      T      P
    Constant     33.828    2.262  14.96  0.000
    DURATION    10.7410   0.6263  17.15  0.000

    Analysis of Variance

    Source           DF     SS     MS       F      P
    Regression        1  13133  13133  294.08  0.000
    Residual Error  105   4689     45
    Total           106  17822

    while the second part of the output, which contains the requested intervals, should look something like this:

    Predicted Values for New Observations

    New Obs     Fit  SE Fit        95% CI            95% PI
          1  85.385   1.059  (83.286, 87.484)  (71.969, 98.801)

    Values of Predictors for New Observations

    New Obs  DURATION
          1      4.80
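If you would rather not use Minitab, the same fit and intervals can be reproduced in other software. The following Python sketch uses statsmodels and assumes the eruption data sit in a file named faithful.csv with columns DURATION and NEXT (the file name and column names are hypothetical, chosen to match the output above).

import pandas as pd
import statsmodels.api as sm

# Hypothetical data layout: one row per eruption, columns DURATION and NEXT
data = pd.read_csv("faithful.csv")
X = sm.add_constant(data["DURATION"])          # adds the intercept column
model = sm.OLS(data["NEXT"], X).fit()
print(model.summary())                         # coefficients and ANOVA-style statistics

# 95% confidence and prediction intervals at DURATION = 4.8 and 3.5
new_x = sm.add_constant(pd.DataFrame({"DURATION": [4.8, 3.5]}), has_constant="add")
pred = model.get_prediction(new_x)
print(pred.summary_frame(alpha=0.05))
# mean_ci_lower / mean_ci_upper give the confidence interval for the mean response;
# obs_ci_lower / obs_ci_upper give the prediction interval for a new observation.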
