2.12 - Further Examples

Example 2-5: Highway Sign Reading Distance and Driver Age

The data are n = 30 observations on driver age and the maximum distance (feet) at which individuals can read a highway sign (Sign Distance data).

(Data source: Mind On Statistics, 3rd edition, Utts and Heckard)

The plot below gives a scatterplot of the highway sign data along with the least squares regression line.

scatterplot of highway sign data

Here is the accompanying Minitab output, which is found by performing Stat >> Regression >> Regression on the highway sign data.

Regression Analysis: Distance, Age

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 576.68 23.47 24.57 0.000
Age -3.0068 0.4243 -7.09 0.000

Regression Equation

Distance = 577 - 3.01 Age

Hypothesis Test for the Intercept (\(\beta_{0}\))

This test is rarely of interest, but it does come up when one is considering a regression through the origin (which we touched on earlier in this lesson). In the Minitab output above, the row labeled Constant gives the information used to make inferences about the intercept. The null and alternative hypotheses for a hypothesis test about the intercept are written as:

\(H_{0} \colon \beta_{0} = 0\)
\(H_{A} \colon \beta_{0} \ne 0\)

In other words, the null hypothesis is testing if the population intercept is equal to 0 versus the alternative hypothesis that the population intercept is not equal to 0. In most problems, we are not particularly interested in hypotheses about the intercept. For instance, in our example, the intercept is the mean distance when the age is 0, a meaningless age. Also, the intercept does not give information about how the value of y changes when the value of x changes. Nevertheless, to test whether the population intercept is 0, the information from the Minitab output is used as follows:

  1. The sample intercept is \(b_{0}\) = 576.68, the value under Coef.
  2. The standard error (SE) of the sample intercept, written as se(\(b_{0}\)), is se(\(b_{0}\)) = 23.47, the value under SE Coef. The SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{0}\) gives, very roughly, the average difference between the sample \(b_{0}\) and the true population intercept \(\beta_{0}\), for random samples of this size (and with these x-values).
  3. The test statistic is t = \(b_{0}\)/se(\(b_{0}\)) = 576.68/23.47 = 24.57, the value under T-Value.
  4. The p-value for the test is p = 0.000 and is given under P-Value. The p-value is actually very small and not exactly 0.
  5. The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the population intercept is not equal to 0.
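The arithmetic in step 3 can be checked with a short Python sketch. The numbers are copied from the Minitab Coefficients table above; the names `rows` and `t_statistic` are illustrative only and are not part of Minitab.

```python
# Coef and SE Coef values copied from the Minitab "Coefficients" table above.
rows = {
    "Constant": (576.68, 23.47),
    "Age": (-3.0068, 0.4243),
}

def t_statistic(coef, se_coef):
    """Test statistic for H0: coefficient = 0, i.e. estimate / standard error."""
    return coef / se_coef

# Round to two decimals to match the T-Value column of the output.
t_values = {name: round(t_statistic(coef, se), 2) for name, (coef, se) in rows.items()}
print(t_values)
```

The same division applied to the Age row reproduces the T-Value reported for the slope.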

So how exactly is the p-value found? For simple regression, the p-value is determined using a t distribution with n − 2 degrees of freedom (df), which is written as \(t_{n−2}\), and is calculated as 2 × area past |t| under a \(t_{n−2}\) curve. In this example, df = 30 − 2 = 28. The p-value region is the type of region shown in the figure below. The negative and positive versions of the calculated t provide the interior boundaries of the two shaded regions. As the absolute value of t increases, the p-value (area in the shaded regions) decreases.

t curve with both tails shaded, beyond −t and +t; the p-value is 2 × the area to the right of \(\mid t \mid\)
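To see why Minitab prints 0.000, the two-tailed area can be computed numerically. The following is a minimal sketch using only the Python standard library; it is not how Minitab computes the value. The function name `two_sided_p_value` and the use of Simpson's rule are assumptions made for illustration.

```python
import math

def two_sided_p_value(t_stat, df, steps=2000):
    """2 x area past |t| under a t curve with df degrees of freedom.

    The substitution x = sqrt(df) * tan(theta) turns the tail area into an
    integral of cos(theta)**(df - 1) over a finite interval, which is then
    approximated with Simpson's rule (steps must be even).
    """
    lower = math.atan(abs(t_stat) / math.sqrt(df))
    upper = math.pi / 2
    h = (upper - lower) / steps
    total = math.cos(lower) ** (df - 1) + math.cos(upper) ** (df - 1)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * math.cos(lower + i * h) ** (df - 1)
    integral = total * h / 3
    # Normalizing constant of the t density after the substitution.
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    return 2 * const * integral

n = 30
df = n - 2  # 28 degrees of freedom for simple regression
for label, t_stat in (("Constant", 24.57), ("Age", -7.09)):
    print(label, two_sided_p_value(t_stat, df))
```

Both values are positive but far below 0.0005, which is why they display as 0.000 when rounded to three decimals.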
 

Hypothesis Test for the Slope (\(\beta_{1}\))

This test can be used to test whether or not x and y are linearly related. The row pertaining to the variable Age in the Minitab output from earlier gives information used to make inferences about the slope. The slope directly tells us about the link between the mean y and x. When the true population slope does not equal 0, the variables y and x are linearly related. When the slope is 0, there is not a linear relationship because the mean y does not change when the value of x is changed. The null and alternative hypotheses for a hypothesis test about the slope are written as:

\(H_{0} \colon \beta_{1}\) = 0
\(H_{A} \colon \beta_{1}\) ≠ 0

In other words, the null hypothesis is testing if the population slope is equal to 0 versus the alternative hypothesis that the population slope is not equal to 0. To test whether the population slope is 0, the information from the Minitab output is used as follows:

  1. The sample slope is \(b_{1}\) = −3.0068, the value under Coef in the Age row of the output.
  2. The SE of the sample slope, written as se(\(b_{1}\)), is se(\(b_{1}\)) = 0.4243, the value under SE Coef. Again, the SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{1}\) gives, very roughly, the average difference between the sample \(b_{1}\) and the true population slope \(\beta_{1}\), for random samples of this size (and with these x-values).
  3. The test statistic is t = \(b_{1}\)/se(\(b_{1}\)) = −3.0068/0.4243 = −7.09, the value under T-Value.
  4. The p-value for the test is p = 0.000 and is given under P-Value.
  5. The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the variables of Distance and Age are linearly related.

As before, the p-value is the two-tailed area illustrated in the figure above.

Confidence Interval for the Slope (\(\beta_{1}\))

A confidence interval for the unknown value of the population slope \(\beta_{1}\) can be computed as

sample statistic ± multiplier × standard error of statistic

→ \(b_{1}\) ± t* × se(\(b_{1}\))

To find the t* multiplier, you can do one of the following:

  1. In simple regression, the t* multiplier is determined using a \(t_{n−2}\) distribution. The value of t* is such that the confidence level is the area (probability) between −t* and +t* under the t-curve.
  2. A table such as the one in the textbook can be used to look up the multiplier.
  3. Alternatively, software like Minitab can be used.
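As one illustration of the software route, the sketch below finds t* by searching for the value whose central area equals the confidence level. It uses only the Python standard library; the function names `central_area` and `t_multiplier`, the Simpson's rule integration, and the bisection search are all assumptions made for illustration, not Minitab's method.

```python
import math

def central_area(t_star, df, steps=1000):
    """Area between -t_star and +t_star under a t curve with df degrees of freedom."""
    # With x = sqrt(df) * tan(theta), the t density becomes proportional
    # to cos(theta)**(df - 1); integrate from 0 to atan(t_star / sqrt(df)).
    upper = math.atan(t_star / math.sqrt(df))
    h = upper / steps
    total = 1.0 + math.cos(upper) ** (df - 1)  # endpoint terms; cos(0) = 1
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    return 2 * const * total * h / 3  # doubled because the curve is symmetric

def t_multiplier(confidence, df):
    """Bisection search for t* such that central_area(t*, df) = confidence."""
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if central_area(mid, df) < confidence:
            low = mid
        else:
            high = mid
    return (low + high) / 2

for confidence in (0.95, 0.99):
    print(confidence, round(t_multiplier(confidence, 28), 2))
```

For df = 28 this gives multipliers of about 2.05 for 95% confidence and 2.76 for 99% confidence, in agreement with a standard t table.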

95% Confidence Interval

In our example, n = 30 and df = n − 2 = 28. For 95% confidence, t* = 2.05. A 95% confidence interval for \(\beta_{1}\), the true population slope, is:

−3.0068 ± (2.05 × 0.4243)
−3.0068 ± 0.870
or about −3.88 to −2.14.

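Interpretation: With 95% confidence, we can say the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each one-year increase in age. It is incorrect to say that with 95% probability the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each one-year increase in age. The population slope is a fixed (though unknown) number, so it either lies in this particular interval or it does not. The 95% describes the procedure, which captures the true slope in about 95% of repeated samples.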

99% Confidence Interval

For 99% confidence, t* = 2.76. A 99% confidence interval for \(\beta_{1}\), the true population slope, is:

−3.0068 ± (2.76 × 0.4243)
−3.0068 ± 1.1711
or about −4.18 to −1.84.

Interpretation: With 99% confidence, we can say the mean sign reading distance decreases somewhere between 1.84 and 4.18 feet per each one-year increase in age. Notice that as we increase our confidence, the interval becomes wider. So as we approach 100% confidence, our interval grows to become the whole real line.
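Both intervals can be reproduced with a few lines of Python. This is only a sketch of the formula sample statistic ± multiplier × standard error; the function name `slope_confidence_interval` is illustrative, and the t* values are the rounded multipliers used in this example.

```python
def slope_confidence_interval(b1, se_b1, t_star):
    """Return (lower, upper) for b1 ± t* × se(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

# Sample slope and its standard error from the Age row of the Minitab output.
b1, se_b1 = -3.0068, 0.4243

for level, t_star in (("95%", 2.05), ("99%", 2.76)):
    lower, upper = slope_confidence_interval(b1, se_b1, t_star)
    print(f"{level} CI for the slope: {lower:.2f} to {upper:.2f}")
```

Swapping in \(b_{0}\) = 576.68 and se(\(b_{0}\)) = 23.47 would give the corresponding interval for the intercept.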

As a final note, the above procedures can be used to calculate a confidence interval for the population intercept. Just use \(b_{0}\) (and its standard error) rather than \(b_{1}\).

Example 2-6: Handspans Data

Stretched handspans and heights are measured in inches for n = 167 college students (Hand Height data). We’ll use y = height and x = stretched handspan. A scatterplot with a regression line superimposed is given below, together with results of a simple linear regression model fit to the data.

scatterplot with a regression line superimposed

Regression Analysis: Height versus HandSpan

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 1500.06 1500.06 199.17 0.000
HandSpan 1 1500.06 1500.06 199.17 0.000
Error 165 1242.70 7.53    
Lack-of-Fit 17 96.24 5.66 0.73 0.767
Pure Error 148 1146.46 7.75    
Total 166 2742.76      

Model Summary

S R-sq R-sq(adj) R-sq(pred)
2.74436 54.69% 54.42% 53.76%

Coefficients

Predictor Coef SE Coef T-Value P-Value VIF
Constant 35.53 2.32 15.34 0.000  
HandSpan 1.560 0.111 14.11 0.000 1.00

Regression Equation

Height = 35.53 + 1.560 HandSpan

Some things to note are:

  • The residual standard deviation S is 2.744 and this estimates the standard deviation of the errors.
  • \(r^2\) = (SSTO-SSE) / SSTO = SSR / (SSR+SSE) = 1500.1 / (1500.1+1242.7) = 1500.1 / 2742.8 = 0.547 or 54.7%. The interpretation is that handspan differences explain 54.7% of the variation in heights.
  • The value of the F statistic is F = 199.2 with 1 and 165 degrees of freedom, and the p-value for this F statistic is 0.000. Thus we reject the null hypothesis \(H_{0} \colon \beta_{1}\) = 0 in favor of \(H_A\colon\beta_1\neq 0\). In other words, the observed relationship is statistically significant.
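The three quantities in these notes all come from the Analysis of Variance table. The sketch below recomputes them from the sums of squares and degrees of freedom printed there; the variable names are illustrative.

```python
import math

# Values copied from the Analysis of Variance table above.
ssr, sse = 1500.06, 1242.70        # regression and error sums of squares
df_regression, df_error = 1, 165   # n - 2 = 167 - 2 = 165 error df

ssto = ssr + sse                   # total sum of squares
msr = ssr / df_regression          # mean square for regression
mse = sse / df_error               # mean square error
s = math.sqrt(mse)                 # residual standard deviation S
r_sq = ssr / ssto                  # proportion of variation explained
f_stat = msr / mse                 # F statistic with 1 and 165 df

print(f"S = {s:.3f}, R-sq = {100 * r_sq:.2f}%, F = {f_stat:.2f}")
```

These match the S, R-sq, and F-Value entries in the Minitab output up to rounding.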

Example 2-7: Quality Data

You are a manufacturer who wants to obtain a quality measure on a product, but the procedure to obtain the measure is expensive. There is an indirect approach, which uses a different product score (Score 1) in place of the actual quality measure (Score 2). This approach is less costly but also is less precise. You can use regression to see if Score 1 explains a significant amount of the variance in Score 2 to determine if Score 1 is an acceptable substitute for Score 2. The results from a simple linear regression analysis are given below:

Regression Analysis: Score2 versus Score1

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 2.5419 2.5419 156.56 0.000
Residual Error 7 0.1136 0.0162    
Total 8 2.6556      

Model Summary

S R-sq R-sq(adj)
0.127419 95.7% 95.1%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 1.1177 0.1093 10.23 0.000
Score1 0.21767 0.01740 12.51 0.000

Regression Equation

Score2 = 1.12 + 0.218 Score1

We are interested in testing the null hypothesis that Score 1 is not a significant predictor of Score 2 versus the alternative that Score 1 is a significant predictor of Score 2. More formally, we are testing:

\(H_{0} \colon\beta_{1}\) = 0
\(H_{A} \colon \beta_{1}\) ≠ 0

The p-value in the ANOVA table (0.000) indicates that the relationship between Score 1 and Score 2 is statistically significant at an α-level of 0.05. This is also shown by the p-value for the estimated coefficient of Score 1, which is 0.000.
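The agreement between the two p-values is expected: in simple linear regression the ANOVA F statistic equals the square of the slope's t statistic, so the two tests are equivalent. A quick check with the rounded values from the output (variable names are illustrative):

```python
# Values copied from the Minitab output above.
ssr, ssto = 2.5419, 2.6556   # regression and total sums of squares
f_value = 156.56             # F-Value from the ANOVA table
t_value = 12.51              # T-Value for the slope coefficient

r_sq = ssr / ssto            # proportion of variance in Score 2 explained by Score 1
t_squared = t_value ** 2     # should agree with F up to rounding

print(f"R-sq = {100 * r_sq:.1f}%")
print(f"t^2 = {t_squared:.2f}, F = {f_value:.2f}")
```

The small gap between t² and F comes only from rounding in the printed output.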