3.6 - Further SLR Evaluation Examples

Example 1: Are Sprinters Getting Faster?

The following data set (mens200m.txt) contains the winning times (in seconds) of the 22 men's 200-meter Olympic sprints held between 1900 and 1996. (Notice that the Olympics were not held during the World War I and II years.) Is there a linear relationship between year and the winning times? The plot of the estimated regression line sure makes it look so!
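As a hedged sketch, the fit could be reproduced in R along the following lines, assuming mens200m.txt has a header row with two columns named Year and Time (the column names are an assumption; check the file's header):

sprints <- read.table("mens200m.txt", header = TRUE)  # read the data file
fit <- lm(Time ~ Year, data = sprints)                # fit the SLR model
plot(Time ~ Year, data = sprints)                     # scatterplot of the data
abline(fit)                                           # add the least squares line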

[Scatterplot of winning time versus year, with the estimated regression line superimposed]

To answer the research question, let's conduct the formal F-test of the null hypothesis H0: β1 = 0 against the alternative hypothesis HA: β1 ≠ 0.

[Analysis of variance table for the regression of winning time on year]

From a scientific point of view, what we ultimately care about is the P-value, which is 0.000 (to three decimal places). That is, the P-value is less than 0.001. The P-value is very small. It is unlikely that we would have obtained such a large F* statistic if the null hypothesis were true. Therefore, we reject the null hypothesis H0: β1 = 0 in favor of the alternative hypothesis HA: β1 ≠ 0. There is sufficient evidence at the α = 0.05 level to conclude that there is a linear relationship between year and winning time.
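In R, this F-test could be carried out with the anova function, continuing the hypothetical sketch above:

sprints <- read.table("mens200m.txt", header = TRUE)  # as in the sketch above
fit <- lm(Time ~ Year, data = sprints)
anova(fit)   # ANOVA table; F* and its P-value appear in the F value and Pr(>F) columns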

Equivalence of the analysis of variance F-test and the t-test

As we noted in the first two examples, the P-value associated with the t-test is the same as the P-value associated with the analysis of variance F-test. This will always be true for the simple linear regression model, and it is illustrated in the year and winning time example as well. Both P-values are 0.000 (to three decimal places):

[Minitab output for the winning time regression, showing the t-test and F-test results]

The P-values are the same because of a well-known relationship between a t random variable and an F random variable that has 1 numerator degree of freedom. Namely:

\[(t^{*}_{(n-2)})^2=F^{*}_{(1,n-2)}\]

This will always hold for the simple linear regression model. In our example, the relationship is demonstrated by:

\[(-13.33)^{2}=177.7\]
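A quick numerical check of this relationship in R (a sketch; n = 22 here, so there are n − 2 = 20 degrees of freedom):

(-13.33)^2                                        # 177.6889, i.e. F* = 177.7
2 * pt(-abs(-13.33), df = 20)                     # two-tailed t-test P-value
pf(177.7, df1 = 1, df2 = 20, lower.tail = FALSE)  # F-test P-value (matches up to rounding)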

In short:

  • For a given significance level α, the F-test of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to the two-tailed t-test.
  • We will get exactly the same P-values, so…
    • If one test rejects H0, then so will the other.
    • If one test does not reject H0, then neither will the other.

The natural question then is ... when should we use the F-test and when should we use the t-test?

  • The F-test is only appropriate for testing that the slope differs from 0 (β1 ≠ 0).
  • Use the t-test to test that the slope is positive (β1 > 0) or negative (β1 < 0). Remember, though, that you will have to divide the reported two-tail P-value by 2 to get the appropriate one-tailed P-value.
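For example, with t* = −13.33 and 20 degrees of freedom as above, a sketch of the halving step for testing HA: β1 < 0 is:

two_tail <- 2 * pt(-abs(-13.33), df = 20)   # reported two-tailed P-value
one_tail <- two_tail / 2                    # one-tailed P-value for HA: beta1 < 0
# Halving is appropriate here because the estimated slope is negative,
# i.e., it falls on the side of the alternative hypothesis.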

The F-test is more useful for the multiple regression model when we want to test that more than one slope parameter is 0. We'll learn more about this later in the course!

Example 2: Highway Sign Reading Distance and Driver Age

The data are n = 30 observations on driver age and the maximum distance (feet) at which individuals can read a highway sign (signdist.txt). (Data source: Mind On Statistics, 3rd edition, Utts and Heckard).

The plot below gives a scatterplot of the highway sign data along with the least squares regression line. 

[Scatterplot of sign reading distance versus driver age, with the least squares regression line superimposed]

Here is the accompanying regression output:

[Minitab regression output for the highway sign data]
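A hedged R sketch that should reproduce the essentials of this output (it assumes signdist.txt has a header row with columns named Age and Distance; the names are an assumption):

signs <- read.table("signdist.txt", header = TRUE)  # read the data file
fit <- lm(Distance ~ Age, data = signs)             # fit the SLR model
summary(fit)   # coefficient table analogous to Minitab's Coef, SE Coef, T, and P columns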

Hypothesis Test for the Intercept (β0)

This test is rarely of interest, but it does show up when one is interested in performing a regression through the origin (which we touched on earlier in this lesson). In the software output above, the row labeled Constant gives the information used to make inferences about the intercept. The null and alternative hypotheses for a hypothesis test about the intercept are written as:

H0 : β0 = 0
HA : β0 ≠ 0.

In other words, the null hypothesis is testing if the population intercept is equal to 0 versus the alternative hypothesis that the population intercept is not equal to 0. In most problems, we are not particularly interested in hypotheses about the intercept. For instance, in our example, the intercept is the mean distance when the age is 0, a meaningless age. Also, the intercept does not give information about how the value of y changes when the value of x changes. Nevertheless, to test whether the population intercept is 0, the information from the software output would be used as follows:

  1. The sample intercept is b0 = 576.68, the value under Coef.
  2. The standard error (SE) of the sample intercept, written as se(b0), is se(b0) = 23.47, the value under SE Coef. The SE of any statistic is a measure of its accuracy. In this case, the SE of b0 gives, very roughly, the average difference between the sample b0 and the true population intercept β0, for random samples of this size (and with these x-values).
  3. The test statistic is t = b0/se(b0) = 576.68/23.47 = 24.57, the value under T.
  4. The p-value for the test is p = 0.000 and is given under P. The p-value is actually very small and not exactly 0.
  5. The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the population intercept is not equal to 0.

So how exactly is the p-value found? For simple regression, the p-value is determined using a t distribution with n − 2 degrees of freedom (df), written as t_{n−2}, and is calculated as 2 × the area past |t| under a t_{n−2} curve. In this example, df = 30 − 2 = 28. The p-value region is the type of region shown in the figure below. The negative and positive versions of the calculated t provide the interior boundaries of the two shaded regions. As the value of t increases, the p-value (area in the shaded regions) decreases.

[Figure: the p-value is 2 × the area beyond |t| under the t_{n−2} curve, shown as two shaded tail regions]
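To make the computation concrete, here is a minimal R sketch using the rounded values above (the exact output value would come from the unrounded statistic):

t_stat <- 576.68 / 23.47                          # test statistic, 24.57
2 * pt(abs(t_stat), df = 28, lower.tail = FALSE)  # 2 x the area past |t|; essentially 0 here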

Hypothesis Test for the Slope (β1)

This test can be used to test whether or not x and y are linearly related. The row pertaining to the variable Age in the software output from earlier gives information used to make inferences about the slope. The slope directly tells us how the mean of y is linked to x. When the true population slope does not equal 0, the variables y and x are linearly related. When the slope is 0, there is no linear relationship, because the mean of y does not change when the value of x changes. The null and alternative hypotheses for a hypothesis test about the slope are written as:

H0 : β1 = 0
HA : β1 ≠ 0.

In other words, the null hypothesis is testing if the population slope is equal to 0 versus the alternative hypothesis that the population slope is not equal to 0. To test whether the population slope is 0, the information from the software output is used as follows:

  1. The sample slope is b1 = −3.0068, the value under Coef in the Age row of the output.
  2. The SE of the sample slope, written as se(b1), is se(b1) = 0.4243, the value under SE Coef. Again, the SE of any statistic is a measure of its accuracy. In this case, the SE of b1 gives, very roughly, the average difference between the sample b1 and the true population slope β1, for random samples of this size (and with these x-values).
  3. The test statistic is t = b1/se(b1) = −3.0068/0.4243 = −7.09, the value under T.
  4. The p-value for the test is p = 0.000 and is given under P.
  5. The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the variables of Distance and Age are linearly related.
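The analogous hand computation for the slope, again as a sketch with the rounded values:

t_stat <- -3.0068 / 0.4243                        # test statistic, -7.09
2 * pt(abs(t_stat), df = 28, lower.tail = FALSE)  # far below 0.001, reported as 0.000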

As before, the p-value is twice the tail area beyond |t|, the type of region illustrated in the figure above.

Confidence Interval for the Slope (β1)

A confidence interval for the unknown value of the population slope β1 can be computed as

sample statistic ± multiplier × standard error of statistic

b1 ± t* × se(b1).

In simple regression, the t* multiplier is determined using a t_{n−2} distribution. The value of t* is such that the confidence level is the area (probability) between −t* and +t* under the t-curve. To find the t* multiplier, you can do one of the following:

  1. A table such as the one in the textbook can be used to look up the multiplier.
  2. Alternatively, software like Minitab can be used.
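For instance, the multipliers used below could be obtained in R with the qt function (df = 28 for this example):

qt(0.975, df = 28)   # 95% multiplier: 2.048, which rounds to 2.05
qt(0.995, df = 28)   # 99% multiplier: 2.763, which rounds to 2.76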

95% Confidence Interval

In our example, n = 30 and df = n − 2 = 28. For 95% confidence, t* = 2.05. A 95% confidence interval for β1, the true population slope, is:

−3.0068 ± (2.05 × 0.4243)
−3.0068 ± 0.870
or about −3.88 to −2.14.

Interpretation: With 95% confidence, we can say the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet for each one-year increase in age. It is incorrect to say that with 95% probability the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet for each one-year increase in age. Make sure you understand why!!!
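As a check, the same interval can be computed directly in R (or with confint(fit, level = 0.95), if the fitted model object from the sketch earlier is available):

b1 <- -3.0068                               # sample slope
se_b1 <- 0.4243                             # its standard error
b1 + c(-1, 1) * qt(0.975, df = 28) * se_b1  # approximately -3.88 to -2.14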

99% Confidence Interval

For 99% confidence, t* = 2.76. A 99% confidence interval for β1, the true population slope, is:

−3.0068 ± (2.76 × 0.4243)
−3.0068 ± 1.1711
or about −4.18 to −1.84.

Interpretation: With 99% confidence, we can say the mean sign reading distance decreases somewhere between 1.84 and 4.18 feet for each one-year increase in age. Notice that as we increase our confidence level, the interval becomes wider. So as we approach 100% confidence, our interval grows to become the whole real line.

As a final note, the above procedures can be used to calculate a confidence interval for the population intercept. Just use b0 (and its standard error) rather than b1.

Example 3: Handspans Data

Stretched handspans and heights are measured in centimeters for n = 167 college students (handheight.txt). We’ll use y = height and x = stretched handspan. A scatterplot with a regression line superimposed is given below, together with results of a simple linear regression model fit to the data.

fitted line plot

lm(formula = Height ~ HandSpan)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  35.5250     2.3160   15.34   <2e-16 ***
HandSpan      1.5601     0.1105   14.11   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.744 on 165 degrees of freedom
Multiple R-squared:  0.5469,    Adjusted R-squared:  0.5442
F-statistic: 199.2 on 1 and 165 DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: Height
           Df Sum Sq Mean Sq F value    Pr(>F)    
HandSpan    1 1500.1 1500.06  199.17 < 2.2e-16 ***
Residuals 165 1242.7    7.53
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Some things to note are:

  • The residual standard error S is 2.744; this estimates the standard deviation of the errors.
  • r2 = (SSTO − SSE) / SSTO = SSR / (SSR + SSE) = 1500.1 / (1500.1 + 1242.7) = 1500.1 / 2742.8 = 0.547, or 54.7%. The interpretation is that handspan differences explain 54.7% of the variation in heights.
  • The value of the F statistic is F = 199.2 with 1 and 165 degrees of freedom, and the p-value for this F statistic is reported as < 2.2e-16, i.e., essentially 0. Thus we reject the null hypothesis H0 : β1 = 0 because the p-value is so small. In other words, the observed relationship is statistically significant. (A short sketch for reproducing these quantities follows this list.)
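These quantities could be reproduced in R along the following lines (a sketch; it assumes handheight.txt has a header row with the column names Height and HandSpan used in the output above):

hh <- read.table("handheight.txt", header = TRUE)  # read the data file
fit <- lm(Height ~ HandSpan, data = hh)            # fit the SLR model
summary(fit)                # S = 2.744, R-squared = 0.5469, F = 199.2
anova(fit)                  # SSR = 1500.1, SSE = 1242.7
1500.1 / (1500.1 + 1242.7)  # r-squared by hand: 0.547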