## Example 2-5: Highway Sign Reading Distance and Driver Age

The data are *n* = 30 observations on driver age and the maximum distance (feet) at which individuals can read a highway sign (Sign Distance data).

(Data source: *Mind On Statistics*, 3rd edition, Utts and Heckard)

The plot below gives a scatterplot of the highway sign data along with the least squares regression line.

Here is the accompanying Minitab output, which is found by performing *Stat* >> *Regression* >> *Regression* on the highway sign data.

#### Regression Analysis: Distance versus Age

##### Coefficients

Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | 576.68 | 23.47 | 24.57 | 0.000 |
Age | -3.0068 | 0.4243 | -7.09 | 0.000 |

##### Regression Equation

Distance = 577 - 3.01 Age

#### Hypothesis Test for the Intercept (\(\beta_{0}\))

This test is rarely of interest, but it does show up when one wants to perform a regression through the origin (which we touched on earlier in this lesson). In the Minitab output above, the row labeled **Constant** gives the information used to make inferences about the intercept. The null and alternative hypotheses for a hypothesis test about the intercept are written as:

\(H_{0} \colon \beta_{0} = 0\)

\(H_{A} \colon \beta_{0} \ne 0\).

In other words, the null hypothesis is testing if the population intercept is equal to 0 versus the alternative hypothesis that the population intercept is not equal to 0. In most problems, we are not particularly interested in hypotheses about the intercept. For instance, in our example, the intercept is the mean distance when the age is 0, a meaningless age. Also, the intercept does not give information about how the value of *y* changes when the value of *x* changes. Nevertheless, to test whether the population intercept is 0, the information from the Minitab output is used as follows:

- The sample intercept is \(b_{0}\) = 576.68, the value under **Coef**.
- The standard error (SE) of the sample intercept, written as se(\(b_{0}\)), is se(\(b_{0}\)) = 23.47, the value under **SE Coef**. The SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{0}\) gives, very roughly, the average difference between the sample \(b_{0}\) and the true population intercept \(\beta_{0}\), for random samples of this size (and with these *x*-values).
- The test statistic is *t* = \(b_{0}\)/se(\(b_{0}\)) = 576.68/23.47 = 24.57, the value under **T-Value**.
- The *p*-value for the test is *p* = 0.000 and is given under **P-Value**. The *p*-value is actually very small and *not* exactly 0.
- The decision rule at the 0.05 significance level is to reject the null hypothesis since our *p* < 0.05. Thus, we conclude that there is statistically significant evidence that the population intercept is not equal to 0.

So how exactly is the *p*-value found? For simple regression, the *p*-value is determined using a *t* distribution with *n* − 2 degrees of freedom (*df*), which is written as \(t_{n−2}\), and is calculated as 2 × area past |*t*| under a \(t_{n−2}\) curve. In this example, *df* = 30 − 2 = 28. The *p*-value region is the type of region shown in the figure below. The negative and positive versions of the calculated *t* provide the interior boundaries of the two shaded regions. As the value of *t* increases, the *p*-value (area in the shaded regions) decreases.
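To make this computation concrete, here is a short sketch in Python (using scipy, which is not part of the original example) that reproduces the intercept test statistic and its two-sided *p*-value from the values in the Minitab output:

```python
# Sketch only: recomputes the intercept test from the Minitab output above.
from scipy import stats

n = 30
df = n - 2                     # degrees of freedom for simple regression: 28
b0 = 576.68                    # sample intercept (Coef)
se_b0 = 23.47                  # its standard error (SE Coef)

t_stat = b0 / se_b0            # matches the T-Value of 24.57
p_value = 2 * stats.t.sf(abs(t_stat), df)  # 2 x area past |t| under t_28
```

The computed *p*-value confirms the point made above: Minitab prints 0.000, but the value is a tiny positive number, not exactly 0.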

#### Hypothesis Test for the Slope (\(\beta_{1}\))

This test can be used to test whether or not *x* and *y* are linearly related. The row pertaining to the variable **Age** in the Minitab output from earlier gives information used to make inferences about the slope. The slope directly tells us about the link between the mean *y* and *x*. When the true population slope does not equal 0, the variables *y* and *x* are linearly related. When the slope is 0, there is not a linear relationship because the mean *y* does not change when the value of *x* is changed. The null and alternative hypotheses for a hypothesis test about the slope are written as:

\(H_{0} \colon \beta_{1} = 0\)

\(H_{A} \colon \beta_{1} \ne 0\).

In other words, the null hypothesis is testing if the population slope is equal to 0 versus the alternative hypothesis that the population slope is not equal to 0. To test whether the population slope is 0, the information from the Minitab output is used as follows:

- The sample slope is \(b_{1}\) = −3.0068, the value under **Coef** in the **Age** row of the output.
- The SE of the sample slope, written as se(\(b_{1}\)), is se(\(b_{1}\)) = 0.4243, the value under **SE Coef**. Again, the SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{1}\) gives, very roughly, the average difference between the sample \(b_{1}\) and the true population slope \(\beta_{1}\), for random samples of this size (and with these *x*-values).
- The test statistic is *t* = \(b_{1}\)/se(\(b_{1}\)) = −3.0068/0.4243 = −7.09, the value under **T-Value**.
- The *p*-value for the test is *p* = 0.000 and is given under **P-Value**.
- The decision rule at the 0.05 significance level is to reject the null hypothesis since our *p* < 0.05. Thus, we conclude that there is statistically significant evidence that the variables Distance and Age are linearly related.

As before, the *p*-value is the region illustrated in the figure above.
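The same arithmetic works for the slope. This sketch (Python with scipy, not part of the original output) reproduces the test statistic and *p*-value from the **Age** row:

```python
# Sketch only: recomputes the slope test from the Minitab output above.
from scipy import stats

df = 28                        # n - 2 = 30 - 2
b1 = -3.0068                   # sample slope (Coef in the Age row)
se_b1 = 0.4243                 # its standard error (SE Coef)

t_stat = b1 / se_b1            # matches the T-Value of -7.09
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value under t_28
```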

#### Confidence Interval for the Slope (\(\beta_{1}\))

A confidence interval for the unknown value of the population slope \(\beta_{1}\) can be computed as

sample statistic ± multiplier × standard error of statistic

→ \(b_{1 }\)± *t** × se(\(b_{1}\)).

To find the *t** multiplier, you can do one of the following:

- In simple regression, the *t** multiplier is determined using a \(t_{n−2}\) distribution. The value of *t** is such that the confidence level is the area (probability) between −*t** and +*t** under the *t*-curve.
- A table such as the one in the textbook can be used to look up the multiplier.
- Alternatively, software like Minitab can be used.
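As a check on the table lookup, the *t** multiplier can also be computed directly. A sketch in Python (scipy assumed available; not part of the original lesson): for confidence level *C*, *t** is the quantile that leaves area (1 − *C*)/2 in each tail.

```python
# Sketch only: t* multipliers for a t-distribution with n - 2 = 28 df.
from scipy import stats

df = 28
t_star_95 = stats.t.ppf(0.975, df)   # 95% confidence: 2.5% in each tail
t_star_99 = stats.t.ppf(0.995, df)   # 99% confidence: 0.5% in each tail
```

Rounded to two decimal places, these reproduce the multipliers 2.05 and 2.76 used below.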

#### 95% Confidence Interval

In our example, *n* = 30 and *df* = *n* − 2 = 28. For 95% confidence, *t** = 2.05. A 95% confidence interval for \(\beta_{1}\), the true population slope, is:

−3.0068 ± (2.05 × 0.4243)

−3.0068 ± 0.870

or about −3.88 to −2.14.

**Interpretation**: With 95% confidence, we can say the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each one-year increase in age. It is incorrect to say that with 95% *probability* the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each one-year increase in age. Make sure you understand why!!!

#### 99% Confidence Interval

For 99% confidence, *t** = 2.76. A 99% confidence interval for \(\beta_{1}\), the true population slope, is:

−3.0068 ± (2.76 × 0.4243)

−3.0068 ± 1.1711

or about −4.18 to −1.84.

**Interpretation**: With 99% confidence, we can say the mean sign reading distance decreases somewhere between 1.84 and 4.18 feet per each one-year increase in age. Notice that as we increase our confidence, the interval becomes wider. So as we approach 100% confidence, our interval grows to become the whole real line.
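Both intervals can be reproduced with a few lines. This is a sketch in Python (scipy assumed available); the slope and standard error come from the Minitab output, and `slope_ci` is a helper name introduced here for illustration:

```python
# Sketch only: confidence intervals for the slope from the Minitab output.
from scipy import stats

b1, se_b1, df = -3.0068, 0.4243, 28

def slope_ci(level):
    """Return (lower, upper) for a confidence interval at the given level."""
    t_star = stats.t.ppf(1 - (1 - level) / 2, df)
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

ci_95 = slope_ci(0.95)   # roughly (-3.88, -2.14)
ci_99 = slope_ci(0.99)   # roughly (-4.18, -1.84), wider as expected
```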

As a final note, the above procedures can be used to calculate a confidence interval for the population intercept. Just use \(b_{0}\) (and its standard error) rather than \(b_{1}\).

## Example 2-6: Handspans Data

Stretched handspans and heights are measured in inches for *n* = 167 college students (Hand Height data). We’ll use *y* = height and *x* = stretched handspan. A scatterplot with a regression line superimposed is given below, together with results of a simple linear regression model fit to the data.

#### Regression Analysis: Height versus HandSpan

##### Analysis of Variance

Source | DF | Adj SS | Adj MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 1500.06 | 1500.06 | 199.17 | 0.000 |
HandSpan | 1 | 1500.06 | 1500.06 | 199.17 | 0.000 |
Error | 165 | 1242.70 | 7.53 | | |
Lack-of-Fit | 17 | 96.24 | 5.66 | 0.73 | 0.767 |
Pure Error | 148 | 1146.46 | 7.75 | | |
Total | 166 | 2742.76 | | | |

##### Model Summary

S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|
2.74436 | 54.69% | 54.42% | 53.76% |

##### Coefficients

Predictor | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 35.53 | 2.32 | 15.34 | 0.000 | |
HandSpan | 1.560 | 0.111 | 14.11 | 0.000 | 1.00 |

##### Regression Equation

Height = 35.53 + 1.560 HandSpan

**Note!** Some things to note are:

- The residual standard deviation *S* is 2.744, and this estimates the standard deviation of the errors.
- \(r^2\) = (SSTO − SSE) / SSTO = SSR / (SSR + SSE) = 1500.1 / (1500.1 + 1242.7) = 1500.1 / 2742.8 = 0.547 or 54.7%. The interpretation is that handspan differences explain 54.7% of the variation in heights.
- The value of the *F* statistic is *F* = 199.2 with 1 and 165 degrees of freedom, and the *p*-value for this *F* statistic is 0.000. Thus we reject the null hypothesis \(H_{0} \colon \beta_{1} = 0\) in favor of \(H_{A} \colon \beta_{1} \ne 0\). In other words, the observed relationship is statistically significant.
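The \(r^2\) and *F* calculations above can be verified directly from the ANOVA table. A sketch in Python (scipy assumed available; the sums of squares are the Adj SS values from the output):

```python
# Sketch only: r-squared and the F test from the ANOVA table above.
from scipy import stats

ssr = 1500.06      # regression sum of squares (Adj SS, Regression row)
sse = 1242.70      # error sum of squares (Adj SS, Error row)
df_reg, df_err = 1, 165

r_sq = ssr / (ssr + sse)                   # about 0.547, i.e. 54.7%
f_stat = (ssr / df_reg) / (sse / df_err)   # MSR / MSE, about 199.2
p_value = stats.f.sf(f_stat, df_reg, df_err)   # upper-tail F probability
```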

## Example 2-7: Quality Data

You are a manufacturer who wants to obtain a quality measure on a product, but the procedure to obtain the measure is expensive. There is an indirect approach, which uses a different product score (Score 1) in place of the actual quality measure (Score 2). This approach is less costly but also is less precise. You can use regression to see if Score 1 explains a significant amount of the variance in Score 2 to determine if Score 1 is an acceptable substitute for Score 2. The results from a simple linear regression analysis are given below:

#### Regression Analysis: Score2 versus Score1

##### Analysis of Variance

Source | DF | Adj SS | Adj MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 2.5419 | 2.5419 | 156.56 | 0.000 |
Residual Error | 7 | 0.1136 | 0.0162 | | |
Total | 8 | 2.6556 | | | |

##### Model Summary

S | R-sq | R-sq(adj) |
---|---|---|
0.127419 | 95.7% | 95.1% |

##### Coefficients

Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | 1.1177 | 0.1093 | 10.23 | 0.000 |
Score1 | 0.21767 | 0.01740 | 12.51 | 0.000 |

##### Regression Equation

Score2 = 1.12 + 0.218 Score1

We are interested in testing the null hypothesis that Score 1 is not a significant predictor of Score 2 versus the alternative that Score 1 is a significant predictor of Score 2. More formally, we are testing:

\(H_{0} \colon \beta_{1} = 0\)

\(H_{A} \colon \beta_{1} \ne 0\)

The *p*-value in the ANOVA table (0.000) indicates that the relationship between Score 1 and Score 2 is statistically significant at an α-level of 0.05. This is also shown by the *p*-value for the estimated coefficient of Score 1, which is 0.000.
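The agreement between the two *p*-values is no accident: in simple linear regression, the ANOVA *F* test and the *t* test for the slope are equivalent, with *F* = *t*². A quick check against the printed output (a sketch; the small discrepancy comes from rounding in the displayed values):

```python
# Sketch only: in simple regression, F equals t squared for the slope test.
t_val = 12.51      # T-Value for the Score1 coefficient
f_val = 156.56     # F-Value from the ANOVA table

diff = abs(t_val ** 2 - f_val)   # small; nonzero only because of rounding
```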