Lesson 2: SLR Model Evaluation


Overview

This lesson presents two alternative methods for testing whether a linear association exists between the predictor x and the response y in a simple linear regression model:

\(H_{0}\): \(\beta_{1}\) = 0 versus \(H_{A}\): \(\beta_{1}\) ≠ 0.

One is the t-test for the slope while the other is an analysis of variance (ANOVA) F-test.

As you know, one of the primary goals of this course is to be able to translate a research question into a statistical procedure. Here are two examples of research questions and the alternative statistical procedures that could be used to answer them:

  1. Is there a (linear) relationship between skin cancer mortality and latitude?
    • What statistical procedure answers this research question? We could estimate the regression line and then use the t-test to determine if the slope \(\beta_{1}\) of the population regression line is 0.
    • Alternatively, we could perform an (analysis of variance) F-test.
  2. Is there a (linear) relationship between height and grade point average?
    • What statistical procedure answers this research question? We could estimate the regression line and then use the t-test to see if the slope \(\beta_{1}\) of the population regression line is 0.
    • Again, we could alternatively perform an (analysis of variance) F-test.

We also learn a way to check for linearity — the "L" in the "LINE" conditions — using the linear lack of fit test. This test requires replicates, that is, multiple observations of y for at least one (and preferably more than one) value of x, and concerns the following hypotheses:

  • \(H_{0}\): There is no lack of linear fit.
  • \(H_{A}\): There is a lack of linear fit.

Objectives

Upon completion of this lesson, you should be able to:

  • Calculate confidence intervals and conduct hypothesis tests for the population intercept \(\beta_{0}\) and population slope \(\beta_{1}\) using Minitab's regression analysis output.
  • Draw research conclusions about the population intercept \(\beta_{0}\) and population slope \(\beta_{1}\) using the above confidence intervals and hypothesis tests.
  • Know the six possible outcomes concerning the slope \(\beta_{1}\) whenever we test whether there is a linear relationship between a predictor x and a response y.
  • Understand the "derivation" of the analysis of variance F-test for testing \(H_{0}\): \(\beta_{1} = 0\). That is, understand how the total variation in a response y is broken down into two parts — a component that is due to the predictor x and a component that is just due to random error. And, understand how the expected mean squares tell us to use the ratio MSR/MSE to conduct the test.
  • Know how each element of the analysis of variance table is calculated.
  • Know what scientific questions can be answered with the analysis of variance F-test.
  • Conduct the analysis of variance F-test to test \(H_{0}\): \(\beta_{1} = 0\) versus \(H_{A}\): \(\beta_{1} ≠ 0\).
  • Know the similarities and distinctions of the t-test and F-test for testing \(H_{0}\):\(\beta_{1} = 0\).
  • Know that the t-test for testing \(\beta_{1}\) = 0, the F-test for testing \(\beta_{1}\) = 0, and the t-test for testing \(\rho = 0\) yield equivalent results, and understand when it makes sense to report the results of each one.
  • Calculate all of the values in the lack of fit analysis of variance table.
  • Conduct the F-test for lack of fit.
  • Know that the (linear) lack of fit test only gives you evidence against linearity. If you reject the null and conclude a lack of linear fit, it doesn't tell you what (non-linear) regression function would work.
  • Understand the "derivation" of the linear lack of fit test. That is, understand the decomposition of the error sum of squares, and how the expected mean squares tell us to use the ratio MSLF/MSPE to test for lack of linear fit.

Lesson 2 Code Files

Below is a zip file that contains all the data sets used in this lesson:

STAT501_Lesson02.zip

  • couplesheight.txt
  • handheight.txt
  • heightgpa.txt
  • husbandwife.txt
  • leadcord.txt
  • mens200m.txt
  • newaccounts.txt
  • signdist.txt
  • skincancer.txt
  • solutions_conc.txt
  • whitespruce.txt

2.1 - Inference for the Population Intercept and Slope


Recall that we are ultimately always interested in drawing conclusions about the population, not the particular sample we observed. In the simple regression setting, we are often interested in learning about the population intercept \(\beta_{0}\) and the population slope \(\beta_{1}\). As you know, confidence intervals and hypothesis tests are two related, but different, ways of learning about the values of population parameters. Here, we will learn how to calculate confidence intervals and conduct hypothesis tests for both \(\beta_{0}\) and \(\beta_{1}\).

Let's revisit the example concerning the relationship between skin cancer mortality and state latitude (Skin Cancer data). The response variable y is the mortality rate (number of deaths per 10 million people) of white males due to malignant skin melanoma from 1950-1959. The predictor variable x is the latitude (degrees North) at the center of each of the 49 states in the United States. A subset of the data looks like this:

Mortality Rate of White Males Due to Malignant Skin Melanoma

#   State        Latitude   Mortality
1   Alabama      33.0       219
2   Arizona      34.5       160
3   Arkansas     35.0       170
4   California   37.5       182
5   Colorado     39.0       149
\(\vdots\)   \(\vdots\)   \(\vdots\)   \(\vdots\)
49  Wyoming      43.0       134

and a plot of the data with the estimated regression equation looks like this:

mortality vs latitude plot

Is there a relationship between state latitude and skin cancer mortality? Certainly, since the estimated slope of the line, \(b_{1}\), is -5.98, not 0, there is a relationship between state latitude and skin cancer mortality in the sample of 49 data points. But, we want to know if there is a relationship between the population of all the latitudes and skin cancer mortality rates. That is, we want to know if the population slope \(\beta_{1}\) is unlikely to be 0.

(1-\(\alpha\))100% t-interval for the slope parameter \(\beta_{1}\)

Confidence Interval for \(\beta_{1}\)

The formula for the confidence interval for \(\beta_{1}\), in words, is:

Sample estimate ± (t-multiplier × standard error)

and, in notation, is:

\(b_1 \pm t_{(\alpha/2, n-2)}\times \left( \dfrac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}} \right)\)

The resulting confidence interval not only gives us a range of values that is likely to contain the true unknown value \(\beta_{1}\); it also allows us to answer the research question "is the predictor x linearly related to the response y?" If the confidence interval for \(\beta_{1}\) contains 0, then we conclude that there is no evidence of a linear relationship between the predictor x and the response y in the population. On the other hand, if the confidence interval for \(\beta_{1}\) does not contain 0, then we conclude that there is evidence of a linear relationship between the predictor x and the response y in the population.
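In code, this interval takes only a few lines once you have the summary statistics. Below is a minimal Python sketch (not part of the course materials); the function name and the summary values passed to it are hypothetical, chosen only to illustrate the formula.

```python
import numpy as np
from scipy import stats

def slope_ci(b1, mse, sxx, n, alpha=0.05):
    """(1 - alpha)100% t-interval for the slope beta_1 in simple linear regression."""
    se_b1 = np.sqrt(mse) / np.sqrt(sxx)            # standard error of b1
    t_mult = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t-multiplier with n-2 df
    return b1 - t_mult * se_b1, b1 + t_mult * se_b1

# hypothetical summary statistics, for illustration only
# (sxx denotes the sum of squared deviations of the x values from their mean)
print(slope_ci(b1=-2.5, mse=12.0, sxx=800.0, n=30))
```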

An \(\alpha\)-level hypothesis test for the slope parameter \(\beta_{1}\) 

We follow standard hypothesis test procedures in conducting a hypothesis test for the slope \(\beta_{1}\). First, we specify the null and alternative hypotheses:

  • Null hypothesis \(H_{0} \colon \beta_{1}\) = some number \(\beta\)
  • Alternative hypothesis \(H_{A} \colon \beta_{1}\) ≠ some number \(\beta\)

The phrase "some number \(\beta\)" means that you can test whether or not the population slope takes on any value. Most often, however, we are interested in testing whether \(\beta_{1}\) is 0. By default, Minitab conducts the hypothesis test with the null hypothesis, \(\beta_{1}\) is equal to 0, and the alternative hypothesis, \(\beta_{1}\)is not equal to 0. However, we can test values other than 0 and the alternative hypothesis can also state that \(\beta_{1}\) is less than (<) some number \(\beta\) or greater than (>) some number \(\beta\).

Second, we calculate the value of the test statistic using the following formula:

\(t^*=\dfrac{b_1-\beta}{\left(\dfrac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}} \right)}=\dfrac{b_1-\beta}{se(b_1)}\)

Third, we use the resulting test statistic to calculate the P-value. As always, the P-value is the answer to the question "how likely is it that we’d get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n-2 degrees of freedom.

Finally, we make a decision:

  • If the P-value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. We conclude that "there is sufficient evidence at the \(\alpha\) level to conclude that there is a linear relationship in the population between the predictor x and response y."
  • If the P-value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis. We conclude "there is not enough evidence at the \(\alpha\) level to conclude that there is a linear relationship in the population between the predictor x and response y."
Note! As with any statistical hypothesis test, there are assumptions underlying the test that need to be satisfied for it to be valid. We'll cover those assumptions in more detail in Lesson 4, but for now, keep in mind that this test only tells us whether or not the slope is significantly different from 0, assuming the relationship (if any) between x and y is linear rather than nonlinear.
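The same summary statistics give the test statistic and two-tailed P-value. Here is a companion Python sketch; as above, the inputs are hypothetical placeholders, not one of this lesson's data sets.

```python
import numpy as np
from scipy import stats

def slope_t_test(b1, beta_null, mse, sxx, n):
    """t-test of H0: beta_1 = beta_null in simple linear regression."""
    se_b1 = np.sqrt(mse) / np.sqrt(sxx)
    t_star = (b1 - beta_null) / se_b1
    p_two_tailed = 2 * stats.t.sf(abs(t_star), df=n - 2)  # two-tailed P-value
    return t_star, p_two_tailed

# hypothetical summary statistics, for illustration only
t_star, p_value = slope_t_test(b1=-2.5, beta_null=0.0, mse=12.0, sxx=800.0, n=30)
print(t_star, p_value)
```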

Minitab®

Drawing conclusions about the slope parameter \(\beta_{1}\) using Minitab

Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the slope \(\beta_{1}\). Minitab's regression analysis output for our skin cancer mortality and latitude example appears below.

The line pertaining to the latitude predictor, Lat, in the summary table of predictors has been bolded. It tells us that the estimated slope coefficient \(b_{1}\), under the column labeled Coef, is -5.9776. The estimated standard error of \(b_{1}\), denoted se(\(b_{1}\)), in the column labeled SE Coef for "standard error of the coefficient," is 0.5984.

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 36464 36464 99.80 0.000
Residual Error 47 17173 365    
Total 48 53637      

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 389.19 23.81 16.34 0.000
Lat -5.9776 0.5984 -9.99 0.000

Model Summary

S R-sq R-sq(adj)
19.12 68.0% 67.3%

The Regression equation

Mort = 389 - 5.98 Lat

By default, the test statistic is calculated assuming the user wants to test that the slope is 0. Dividing the estimated coefficient of -5.9776 by the estimated standard error of 0.5984, Minitab reports that the test statistic T is -9.99.

By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to" hypothesis. Upon calculating the probability that a t-random variable with n-2 = 47 degrees of freedom would be larger than 9.99, and multiplying the probability by 2, Minitab reports that P is 0.000 (to three decimal places). That is, the P-value is less than 0.001. (Note we multiply the probability by 2 since this is a two-tailed test.)

Minitab Note! The P-value in Minitab's regression analysis output is always calculated assuming the alternative hypothesis is the two-tailed \(\beta_{1} ≠ 0\). If your alternative hypothesis is the one-tailed \(\beta_{1} < 0\) or \(\beta_{1} > 0\), you have to divide the P-value that Minitab reports in the summary table of predictors by 2. (However, be careful if the test statistic is negative for an upper-tailed test or positive for a lower-tailed test, in which case you have to divide by 2 and then subtract from 1. Draw a picture of an appropriately shaded density curve if you're not sure why.)

Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that \(\beta_{1}\) does not equal 0. There is sufficient evidence, at the \(\alpha\) = 0.05 level, to conclude that there is a linear relationship in the population between skin cancer mortality and latitude.

It's easy to calculate a 95% confidence interval for \(\beta_{1}\) using the information in the Minitab output. You just need to use Minitab to find the t-multiplier for you. It is \(t_{\left(0.025, 47\right)} = 2.0117\). Then, the 95% confidence interval for \(\beta_{1}\) is \(-5.9776 ± 2.0117(0.5984) \) or (-7.2, -4.8). (Alternatively, Minitab can display the interval directly if you click the "Results" tab in the Regression dialog box, select "Expanded Table" and check "Coefficients.")

We can be 95% confident that the population slope is between -7.2 and -4.8. That is, we can be 95% confident that for every additional one-degree increase in latitude, the mean skin cancer mortality rate decreases between 4.8 and 7.2 deaths per 10 million people.

Video: Using Minitab for the Slope Test

Factors affecting the width of a confidence interval for \(\beta_{1}\)

Recall that, in general, we want our confidence intervals to be as narrow as possible. If we know what factors affect the length of a confidence interval for the slope \(\beta_{1}\), we can control them to ensure that we obtain a narrow interval. The factors can be easily determined by studying the formula for the confidence interval:

\(b_1 \pm t_{\alpha/2, n-2}\times \left(\frac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}} \right) \)

First, subtracting the lower endpoint of the interval from the upper endpoint of the interval, we determine that the width of the interval is:

\(\text{Width}=2 \times t_{\alpha/2, n-2}\times \left(\frac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}} \right)\)

So, how can we affect the width of our resulting interval for \(\beta_{1}\)?

  • As the confidence level decreases, the width of the interval decreases. Therefore, if we decrease our confidence level, we decrease the width of our interval. Clearly, we don't want to decrease the confidence level too much; confidence levels are rarely set below 90%.
  • As MSE decreases, the width of the interval decreases. The value of MSE depends on only two factors — how much the responses vary naturally around the estimated regression line, and how well your regression function (line) fits the data. Clearly, you can't control the first factor all that much other than to ensure that you are not adding any unnecessary error in your measurement process. Throughout this course, we'll learn ways to make sure that the regression function fits the data as well as it can.
  • The more spread out the predictor x values, the narrower the interval. The quantity \(\sum(x_i-\bar{x})^2\) in the denominator summarizes the spread of the predictor x values. The more spread out the predictor values, the larger the denominator, and hence the narrower the interval. Therefore, we can decrease the width of our interval by ensuring that our predictor values are sufficiently spread out.
  • As the sample size increases, the width of the interval decreases. The sample size plays a role in two ways. First, recall that the t-multiplier depends on the sample size through n-2. Therefore, as the sample size increases, the t-multiplier decreases, and the length of the interval decreases. Second, the denominator \(\sum(x_i-\bar{x})^2\) also depends on n. The larger the sample size, the more terms you add to this sum, the larger the denominator, and the narrower the interval. Therefore, in general, you can ensure that your interval is narrow by having a large enough sample.
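The Python sketch below makes the last two bullets concrete; the summary values are hypothetical, chosen only so that the effect of more spread-out predictor values and a larger sample size is easy to see.

```python
import numpy as np
from scipy import stats

def ci_width(mse, sxx, n, alpha=0.05):
    """Width of the (1 - alpha)100% t-interval for the slope."""
    return 2 * stats.t.ppf(1 - alpha / 2, df=n - 2) * np.sqrt(mse) / np.sqrt(sxx)

# hypothetical values, for illustration only
print(ci_width(mse=12.0, sxx=800.0, n=30))    # baseline
print(ci_width(mse=12.0, sxx=1600.0, n=30))   # more spread-out x values: narrower
print(ci_width(mse=12.0, sxx=800.0, n=60))    # larger n (smaller t-multiplier): slightly narrower
```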

Six possible outcomes concerning slope \(\beta_{1}\)

There are six possible outcomes whenever we test whether there is a linear relationship between the predictor x and the response y, that is, whenever we test the null hypothesis \(H_{0} \colon \beta_{1}\) = 0 against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).

When we don't reject the null hypothesis, \(H_{0} \colon \beta_{1} = 0\), any of the following three realities are possible:

  1. We committed a Type II error. That is, in reality \(\beta_{1} ≠ 0\) and our sample data just didn't provide enough evidence to conclude that \(\beta_{1} ≠ 0\).
  2. There really is not much of a linear relationship between x and y.
  3. There is a relationship between x and y — it is just not linear.

When we do reject the null hypothesis, \(H_{0} \colon \beta_{1} = 0\), in favor of the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\), any of the following three realities are possible:

  1. We committed a Type I error. That is, in reality \(\beta_{1} = 0\), but we have an unusual sample that suggests that \(\beta_{1} ≠ 0\).
  2. The relationship between x and y is indeed linear.
  3. A linear function fits the data reasonably well, but a curved ("curvilinear") function would fit the data even better.

(1-\(\alpha\))100% t-interval for intercept parameter \(\beta_{0}\)

Calculating confidence intervals and conducting hypothesis tests for the intercept parameter \(\beta_{0}\) is not done as often as it is for the slope parameter \(\beta_{1}\). The reason for this becomes clear upon reviewing the meaning of \(\beta_{0}\). The intercept parameter \(\beta_{0}\) is the mean of the responses at x = 0. If x = 0 is meaningless, as it would be, for example, if your predictor variable was height, then \(\beta_{0}\) is not meaningful. For the sake of completeness, we present the methods here for those situations in which \(\beta_{0}\) is meaningful.

Confidence Interval for \(\beta_{0}\)

The formula for the confidence interval for \(\beta_{0}\), in words, is:

Sample estimate ± (t-multiplier × standard error)

and, in notation, is:

\(b_0 \pm t_{\alpha/2, n-2} \times \sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{\bar{x}^2}{\sum(x_i-\bar{x})^2}}\)

The resulting confidence interval gives us a range of values that is likely to contain the true unknown value \(\beta_{0}\). The factors affecting the length of a confidence interval for \(\beta_{0}\) are identical to the factors affecting the length of a confidence interval for \(\beta_{1}\).

An \(\alpha\)-level hypothesis test for intercept parameter \(\beta_{0}\)

Again, we follow standard hypothesis test procedures. First, we specify the null and alternative hypotheses:

  • Null hypothesis \(H_{0}\): \(\beta_{0}\) = some number \(\beta\)
  • Alternative hypothesis \(H_{A}\): \(\beta_{0}\) ≠ some number \(\beta\)

The phrase "some number \(\beta\)" means that you can test whether or not the population intercept takes on any value. By default, Minitab conducts the hypothesis test for testing whether or not \(\beta_{0}\) is 0. But, the alternative hypothesis can also state that \(\beta_{0}\) is less than (<) some number \(\beta\) or greater than (>) some number \(\beta\).

Second, we calculate the value of the test statistic using the following formula:

\(t^*=\dfrac{b_0-\beta}{\sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{\bar{x}^2}{\sum(x_i-\bar{x})^2}}}=\dfrac{b_0-\beta}{se(b_0)}\)

Third, we use the resulting test statistic to calculate the P-value. Again, the P-value is the answer to the question "how likely is it that we’d get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n-2 degrees of freedom.

Finally, we make a decision. If the P-value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. If we conduct a "two-tailed, not-equal-to-0" test, we conclude "there is sufficient evidence at the \(\alpha\) level to conclude that the mean of the responses is not 0 when x = 0." If the P-value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis.

Minitab®

Drawing conclusions about intercept parameter \(\beta_{0}\) using Minitab

Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the intercept \(\beta_{0}\). Minitab's regression analysis output for our skin cancer mortality and latitude example appears below. The work involved is very similar to that for the slope \(\beta_{1}\).

The line pertaining to the intercept, which Minitab always refers to as Constant, in the summary table of predictors has been bolded. It tells us that the estimated intercept coefficient \(b_{0}\), under the column labeled Coef, is 389.19. The estimated standard error of \(b_{0}\), denoted se(\(b_{0}\)), in the column labeled SE Coef is 23.81.

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 36464 36464 99.80 0.000
Residual Error 47 17173 365    
Total 48 53637      

Model Summary

S R-sq R-sq(adj)
19.12 68.0% 67.3%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 389.19 23.81 16.34 0.000
Lat -5.9776 0.5984 -9.99 0.000

Regression Equation 

Mort = 389 - 5.98 Lat

By default, the test statistic is calculated assuming the user wants to test that the mean response is 0 when x = 0. Note that this is an ill-advised test here because the predictor values in the sample do not include a latitude of 0. That is, such a test involves extrapolating outside the scope of the model. Nonetheless, for the sake of illustration, let's proceed to assume that it is an okay thing to do.

Dividing the estimated coefficient of 389.19 by the estimated standard error of 23.81, Minitab reports that the test statistic T is 16.34. By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to-0" hypothesis. Upon calculating the probability that a t random variable with n-2 = 47 degrees of freedom would be larger than 16.34, and multiplying the probability by 2, Minitab reports that P is 0.000 (to three decimal places). That is, the P-value is less than 0.001.

Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that \(\beta_{0}\) does not equal 0. There is sufficient evidence, at the \(\alpha\) = 0.05 level, to conclude that the mean mortality rate at a latitude of 0 degrees North is not 0. (Again, note that we have to extrapolate in order to arrive at this conclusion, which in general is not advisable.)

Proceed as previously described to calculate a 95% confidence interval for \(\beta_{0}\). Use Minitab to find the t-multiplier for you. Again, it is \(t_{\left(0.025, 47\right)} = 2.0117 \). Then, the 95% confidence interval for \(\beta_{0}\) is \(389.19 ± 2.0117\left(23.81\right) = \left(341.3, 437.1\right) \). (Alternatively, Minitab can display the interval directly if you click the "Results" tab in the Regression dialog box, select "Expanded Table" and check "Coefficients.") We can be 95% confident that the population intercept is between 341.3 and 437.1. That is, we can be 95% confident that the mean mortality rate at a latitude of 0 degrees North is between 341.3 and 437.1 deaths per 10 million people. (Again, it is probably not a good idea to make this claim because of the severe extrapolation involved.)

Statistical inference conditions

We've made no mention yet of the conditions that must be true in order for it to be okay to use the above confidence interval formulas and hypothesis testing procedures for \(\beta_{0}\) and \(\beta_{1}\). In short, the "LINE" assumptions we discussed earlier — linearity, independence, normality, and equal variance — must hold. It is not a big deal if the error terms (and thus responses) are only approximately normal. If you have a large sample, then the error terms can even deviate somewhat far from normality.

Regression Through the Origin (RTO)

In rare circumstances, it may make sense to consider a simple linear regression model in which the intercept, \(\beta_{0}\), is assumed to be exactly 0. For example, suppose we have data on the number of items produced per hour along with the number of rejects in each of those time spans. If we have a period where no items were produced, then there are obviously 0 rejects. Such a situation may indicate deleting \(\beta_{0}\) from the model since \(\beta_{0}\) reflects the amount of the response (in this case, the number of rejects) when the predictor is assumed to be 0 (in this case, the number of items produced). Thus, the model to estimate becomes

\(\begin{equation*} y_{i}=\beta_{1}x_{i}+\epsilon_{i},\end{equation*}\)

which is called a Regression Through the Origin (or RTO) model. The estimate for \(\beta_{1}\) when using the regression through the origin model is:

\(b_{\textrm{RTO}}=\dfrac{\sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n}x_{i}^{2}}.\)

Thus, the estimated regression equation is

\(\begin{equation*} \hat{y}_{i}=b_{\textrm{RTO}}x_{i}\end{equation*}.\)

Note that we no longer have to center (or "adjust") the \(x_{i}\)'s and \(y_{i}\)'s by their sample means (compare this estimate for \(b_{1}\) to the estimate found for the simple linear regression model). Since there is no intercept, there is no correction factor and no adjustment for the mean (i.e., the regression line can only pivot about the point (0,0)).
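Here is a minimal Python sketch of this estimator on a small made-up data set (the x and y values are purely illustrative, not from any of the lesson's files):

```python
import numpy as np

# purely illustrative data in which y is roughly proportional to x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# regression through the origin: b_RTO = sum(x*y) / sum(x^2)
b_rto = np.sum(x * y) / np.sum(x ** 2)
y_hat = b_rto * x              # fitted line is forced through (0, 0)
residuals = y - y_hat          # note: these residuals need not sum to zero
print(b_rto, residuals.sum())
```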

Generally, regression through the origin is not recommended due to the following:

  1. Removal of \(\beta_{0}\) is a strong assumption that forces the line to go through the point (0,0). Imposing this restriction does not give ordinary least squares as much flexibility in finding the line of best fit for the data.
  2. In a simple linear regression model, \(\sum_{i=1}^{n}(y_{i}-\hat{y}_i)=\sum_{i=1}^{n}e_{i}=0\). However, in regression through the origin, generally \(\sum_{i=1}^{n}e_{i}\neq 0\). Because of this, the SSE could actually be larger than the SSTO, thus resulting in \(r^{2}<0\).
  3. Since \(r^{2}\) can be negative, the usual interpretation of this value as a measure of the strength of the linear component in the simple linear regression model cannot be used here.

If you strongly believe that a regression through the origin model is appropriate for your situation, then statistical testing can help justify your decision. Moreover, if data has not been collected near \(x=0\), then forcing the regression line through the origin is likely to make for a worse-fitting model. So again, this model is not usually recommended unless there is a strong belief that it is appropriate.

To fit a "regression through the origin model in Minitab click "Model" in the regular regression window and then uncheck the "Include the constant term in the model."


2.2 - Another Example of Slope Inference

2.2 - Another Example of Slope Inference

Example 2-1

Is there a positive relationship between sales of leaded gasoline and the lead burden in the bodies of newborn infants? Researchers (Rabinowitz et al., 1984) who were interested in answering this research question compiled data (Lead Cord data) on the monthly gasoline lead sales (in metric tons) in Massachusetts and the mean lead concentrations (µg/dl) in umbilical-cord blood of babies born at a major Boston hospital over 14 months in 1980-1981.

Analyzing their data, the researchers obtained the following Minitab fitted line plot:

Fitted line plot

and standard regression analysis output:

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 3.7783 3.7783 9.95 0.008
Residual Error 12 4.5560 0.3797    
Total 13 8.3343      

Model Summary

S R-sq R-sq(adj)
0.616170 45.3% 40.8%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 4.1082 0.6088 6.75 0.000
Sold 0.014885 0.004719 3.15 0.008

Regression Equation

Cord = 4.11 + 0.0149 Sold

Minitab reports that the P-value for testing \(H_{0} \colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\) is 0.008. Therefore, since the test statistic is positive, the P-value for testing \(H_{0} \colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A} \colon \beta_{1} > 0\) is 0.008 ÷ 2 = 0.004. The P-value is less than 0.05. There is sufficient statistical evidence, at the 0.05 level, to conclude that \(\beta_{1} > 0\).

Furthermore, since the 95% t-multiplier is \(t_{\left(0.025, 12 \right)} = 2.1788\), a 95% confidence interval for \(\beta_{1}\) is:

0.014885 ± 2.1788(0.004719) or (0.0046, 0.0252).

The researchers can be 95% confident that the mean lead concentration in the umbilical-cord blood of Massachusetts babies increases between 0.0046 and 0.0252 µg/dl for every one-metric-ton increase in monthly gasoline lead sales in Massachusetts. It is up to the researchers to debate whether or not this is a meaningful increase.
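The numbers above can be reproduced directly from the reported coefficient (0.014885), its standard error (0.004719), and the error degrees of freedom (12). Here is a short Python sketch of that arithmetic:

```python
from scipy import stats

b1, se_b1, df = 0.014885, 0.004719, 12     # from the Minitab output above

t_star = b1 / se_b1                        # reproduces T-Value = 3.15
p_two_tailed = 2 * stats.t.sf(abs(t_star), df)   # about 0.008
p_one_tailed = p_two_tailed / 2            # valid here because t_star > 0 agrees with H_A: beta_1 > 0

t_mult = stats.t.ppf(0.975, df)            # 2.1788
ci = (b1 - t_mult * se_b1, b1 + t_mult * se_b1)  # about (0.0046, 0.0252)
print(t_star, p_one_tailed, ci)
```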


2.3 - Sums of Squares


Let's return to the skin cancer mortality example (Skin Cancer data) and investigate the research question, "Is there a (linear) relationship between skin cancer mortality and latitude?"

Review the following scatter plot and estimated regression line. What does the plot suggest for answering the above research question? The linear relationship looks fairly strong. The estimated slope is negative, not equal to 0.

mortality vs latitude plot

We can answer the research question using the P-value of the t-test for testing:

  • the null hypothesis \(H_{0} \colon \beta_{1} = 0\)
  • against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).

As the Minitab output below suggests, the P-value of the t-test for "Lat" is less than 0.001. There is enough statistical evidence to conclude that the slope is not 0, that is, there is a linear relationship between skin cancer mortality and latitude.

There is an alternative method for answering the research question, which uses the analysis of variance F-test. Let's first look at what we are working towards understanding. The (standard) "analysis of variance" table for this data set is highlighted in the Minitab output below. There is a column labeled F-Value, which contains the F-test statistic, and there is a column labeled P-Value, which contains the P-value associated with the F-test. Notice that this P-value, 0.000, appears to be the same as the P-value, 0.000, for the t-test for the slope. The F-test similarly tells us that there is enough statistical evidence to conclude that there is a linear relationship between skin cancer mortality and latitude.

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 36464 36464 99.80 0.000
Residual Error 47 17173 365    
Total 48 53637      

Model Summary

S R-sq R-sq(adj)
19.12 68.0% 67.3%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 389.19 23.81 16.34 0.000
Lat -5.9776 0.5984 -9.99 0.000

Regression Equation

Mort = 389 - 5.98 Lat

Now, let's investigate what all the numbers in the table represent. Let's start with the column labeled SS for "sums of squares." We considered sums of squares in Lesson 1 when we defined the coefficient of determination, \(r^2\), but now we consider them again in the context of the analysis of variance table.

The scatter plot of mortality and latitude appears again below, but now it is adorned with three labels:

  • \(y_{i}\) denotes the observed mortality for state i
  • \(\hat{y}_i\) is the estimated regression line (solid line) and therefore denotes the estimated (or "fitted") mortality for the latitude of state i
  • \(\bar{y}\) represents what the line would look like if there were no relationship between mortality and latitude. That is, it denotes the "no relationship" line (dashed line). It is simply the average mortality of the sample.

If there is a linear relationship between mortality and latitude, then the estimated regression line should be "far" from the no relationship line. We just need a way of quantifying "far." The above three elements are useful in quantifying how far the estimated regression line is from the no relationship line. As illustrated by the plot, the two lines are quite far apart.

mortality vs latitude plot

\(\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 =36464\)

\(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 =17173\)

\(\sum_{i=1}^{n}(y_i-\bar{y})^2 =53637\)

Total Sum of Squares

The distance of each observed value \(y_{i}\) from the no relationship line \(\bar{y}\) is \(y_i - \bar{y}\). If you determine this distance for each data point, square each distance, and add up all of the squared distances, you get:

\(\sum_{i=1}^{n}(y_i-\bar{y})^2 =53637\)

Called the "total sum of squares," it quantifies how much the observed responses vary if you don't take into account their latitude.

Regression Sum of Squares

The distance of each fitted value \(\hat{y}_i\) from the no relationship line \(\bar{y}\) is \(\hat{y}_i - \bar{y}\). If you determine this distance for each data point, square each distance, and add up all of the squared distances, you get:

\(\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 =36464\)

Called the "regression sum of squares," it quantifies how far the estimated regression line is from the no relationship line.

Error Sum of Squares

The distance of each observed value \(y_{i}\) from the estimated regression line \(\hat{y}_i\) is \(y_i-\hat{y}_i\). If you determine this distance for each data point, square each distance, and add up all of the squared distances, you get:

\(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 =17173\)

Called the "error sum of squares," as you know, it quantifies how much the data points vary around the estimated regression line.

In short, we have illustrated that the total variation in observed mortality y (53637) is the sum of two parts — variation "due to" latitude (36464) and variation just due to random error (17173). (We are careful to put "due to" in quotes in order to emphasize that a change in latitude does not necessarily cause a change in mortality. All we could conclude is that latitude is "associated with" mortality.)
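The three sums of squares are easy to compute once you have the observed responses and the fitted values. Here is a minimal Python sketch using a small made-up data set (not the skin cancer data):

```python
import numpy as np

# made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 4.0, 7.0, 6.0])

# least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssto = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr  = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
sse  = np.sum((y - y_hat) ** 2)          # error sum of squares
print(ssto, ssr + sse)                   # equal (up to floating-point rounding)
```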


2.4 - Sums of Squares (continued)


Investigating Height and GPA Data

Now, let's do a similar analysis to investigate the research question, "Is there a (linear) relationship between height and grade point average?" (Height and GPA data)

Review the following scatterplot and estimated regression line. What does the plot suggest for answering the above research question? In this case, it appears as if there is almost no relationship whatsoever. The estimated slope is almost 0.

gpa vs height plot

Again, we can answer the research question using the P-value of the t-test for:

  • testing the null hypothesis \(H_{0} \colon \beta_{1} = 0\)
  • against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).

As the Minitab output below suggests, the P-value of the t-test for "height" is 0.761. There is not enough statistical evidence to conclude that the slope is not 0. We conclude that there is no linear relationship between height and grade point average.

The Minitab output also shows the analysis of variance table for this data set. Again, the P-value associated with the analysis of variance F-test, 0.761, appears to be the same as the P-value, 0.761, for the t-test for the slope. The F-test similarly tells us that there is insufficient statistical evidence to conclude that there is a linear relationship between height and grade point average.

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 0.0276 0.0276 0.09 0.761
Residual Error 33 9.7055 0.2941    
Total 34 9.7331      

Model Summary

S R-sq R-sq(adj)
0.5423 0.3% 0.0%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 3.410 1.435 2.38 0.023
height -0.00656 0.02143 -0.31 0.761

Regression Equation

gpa = 3.41 - 0.0066 height

The scatter plot of grade point average and height appears below, now adorned with the three labels:

  • \(y_{i}\) denotes the observed grade point average for student i
  • \(\hat{y}_i\) is the estimated regression line (solid line) and therefore denotes the estimated grade point average for the height of student i
  • \(\bar{y}\) represents the "no relationship" line (dashed line) between height and grade point average. It is simply the average grade point average of the sample.

For this data set, note that the estimated regression line and the "no relationship" line are very close together. Let's see how the sums of squares summarize this point.

gpa vs height plot

\(\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 =0.0276\)

\(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 =9.7055\)

\(\sum_{i=1}^{n}(y_i-\bar{y})^2 =9.7331\)

  • The "total sum of squares," which again quantifies how much the observed grade point averages vary if you don't take into account height, is \(\sum_{i=1}^{n}(y_i-\bar{y})^2 =9.7331\).
  • The "regression sum of squares," which again quantifies how far the estimated regression line is from the no relationship line, is \(\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 =0.0276\).
  • The "error sum of squares," which again quantifies how much the data points vary around the estimated regression line, is \(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 =9.7055\).

In short, we have illustrated that the total variation in the observed grade point averages y (9.7331) is the sum of two parts — variation "due to" height (0.0276) and variation due to random error (9.7055). Unlike the last example, most of the variation in the observed grade point averages is just due to random error. It appears as if very little of the variation can be attributed to the predictor height.

Try It!

Sums of Squares

Some researchers at UCLA conducted a study on cyanotic heart disease in children. They measured the age at which the child spoke his or her first word (x, in months) and the Gesell adaptive score (y) on a sample of 21 children. Upon analyzing the resulting data, they obtained the following analysis of variance table:

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 1604.08 1604.08 13.20 0.002
Residual Error 19 2308.59 121.50
Total 20 3912.67

  1. Which number quantifies how much the observed scores vary if you don't take into account the age at which the child first spoke?
  2. Which number quantifies how far the estimated regression line is from the "no trend" line?
  3. Which number quantifies how much the scores vary around the estimated regression line?

2.5 - Analysis of Variance: The Basic Idea


The basic idea of the analysis of variance is to break down the total variation in y (the "total sum of squares (SSTO)") into two components:

  • a component that is "due to" the change in x ("regression sum of squares (SSR)")
  • a component that is just due to random error ("error sum of squares (SSE)")

If the regression sum of squares is a "large" component of the total sum of squares, it suggests that there is a linear association between the predictor x and the response y.

Geometrically, the distance \(y_i-\bar{y}\) decomposes into the sum of two distances, \(\hat{y}_i-\bar{y}\) and \(y_i-\hat{y}_i\).

Although the derivation isn't as simple as it might seem, the decomposition holds for the sum of the squared distances, too:

\(\underbrace{\sum\limits_{i=1}^{n}(y_i-\bar{y})^2}_{\underset{\text{Total Sum of Squares}}{\text{SSTO}}} = \underbrace{\sum\limits_{i=1}^{n}(\hat{y}_{i} - \bar{y})^{2}}_{\underset{\text{Regression Sum of Squares}}{\text{SSR}}} + \underbrace{\sum\limits_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}_{\underset{\text{Error Sum of Squares}}{\text{SSE}}}\)

\(\text{SSTO} = \text{SSR} + \text{SSE}\)
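For completeness, here is a brief sketch of why the cross-product term drops out when you square and sum. Writing \(y_i-\bar{y}=(\hat{y}_i-\bar{y})+(y_i-\hat{y}_i)\) and squaring both sides gives

\(\sum\limits_{i=1}^{n}(y_i-\bar{y})^2=\sum\limits_{i=1}^{n}(\hat{y}_i-\bar{y})^2+\sum\limits_{i=1}^{n}(y_i-\hat{y}_i)^2+2\sum\limits_{i=1}^{n}(\hat{y}_i-\bar{y})(y_i-\hat{y}_i)\)

For a least squares fit, the residuals \(e_i=y_i-\hat{y}_i\) satisfy \(\sum e_i=0\) and \(\sum x_ie_i=0\), so the cross-product term \(\sum(\hat{y}_i-\bar{y})e_i=\sum(b_0+b_1x_i-\bar{y})e_i=0\), which leaves SSTO = SSR + SSE.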

The degrees of freedom associated with each of these sums of squares follow a similar decomposition.

  • You might recognize SSTO as being the numerator of the sample variance. Recall that the denominator of the sample variance is n-1. Therefore, n-1 is the number of degrees of freedom associated with SSTO.
  • Recall that the mean square error MSE is obtained by dividing SSE by n-2. Therefore, n-2 is the number of degrees of freedom associated with SSE.

Then, we obtain the following breakdown of the degrees of freedom:

\(\underset{\substack{\text{degrees of freedom}\\ \text{associated with SSTO}}}{\left(n-1\right)} = \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSR}}}{\left(1\right)} + \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSE}}}{\left(n-2\right)}\)


2.6 - The Analysis of Variance (ANOVA) table and the F-test


Analysis of Variance for Skin Cancer Data

We've covered quite a bit of ground. Let's review the analysis of variance table for the example concerning skin cancer mortality and latitude (Skin Cancer data).

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 36464 36464 99.80 0.000
Residual Error 47 17173 365    
Total 48 53637      

Model Summary

S R-sq R-sq(adj)
19.12 68.0% 67.3%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 389.19 23.81 16.34 0.000
Lat -5.9776 0.5984 -9.99 0.000

Regression Equation

Mort = 389 - 5.98 Lat

Recall that there were 49 states in the data set.

  • The degrees of freedom associated with SSR will always be 1 for the simple linear regression model. The degrees of freedom associated with SSTO is n-1 = 49-1 = 48. The degrees of freedom associated with SSE is n-2 = 49-2 = 47. And the degrees of freedom add up: 1 + 47 = 48.
  • The sums of squares add up: SSTO = SSR + SSE. That is, here: 53637 = 36464 + 17173.

Let's tackle a few more columns of the analysis of variance table, namely the "mean square" column, labeled MS, and the F-statistic column labeled F.

Definitions of mean squares

We already know the "mean square error (MSE)" is defined as:

\(MSE=\dfrac{\sum(y_i-\hat{y}_i)^2}{n-2}=\dfrac{SSE}{n-2}\)

That is, we obtain the mean square error by dividing the error sum of squares by its associated degrees of freedom n-2. Similarly, we obtain the "regression mean square (MSR)" by dividing the regression sum of squares by its degrees of freedom 1:

\(MSR=\dfrac{\sum(\hat{y}_i-\bar{y})^2}{1}=\dfrac{SSR}{1}\)

Of course, that means the regression sum of squares (SSR) and the regression mean square (MSR) are always identical for the simple linear regression model.

Now, why do we care about mean squares? Because their expected values suggest how to test the null hypothesis \(H_{0} \colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).

Expected mean squares

Imagine taking many, many random samples of size n from some population, estimating the regression line, and determining MSR and MSE for each data set obtained. It has been shown that the average (that is, the expected value) of all of the MSRs you can obtain equals:

\(E(MSR)=\sigma^2+\beta_{1}^{2}\sum_{i=1}^{n}(x_i-\bar{x})^2\)

Similarly, it has been shown that the average (that is, the expected value) of all of the MSEs you can obtain equals:

\(E(MSE)=\sigma^2\)

These expected values suggest how to test \(H_{0} \colon \beta_{1} = 0\) versus \(H_{A} \colon \beta_{1} ≠ 0\):

  • If \(\beta_{1} = 0\), then we'd expect the ratio MSR/MSE to equal 1.
  • If \(\beta_{1} ≠ 0\), then we'd expect the ratio MSR/MSE to be greater than 1.

These two facts suggest that we should use the ratio, MSR/MSE, to determine whether or not \(\beta_{1} = 0\).

Note! Because \(\beta_{1}\) is squared in E(MSR), we cannot use the ratio MSR/MSE:
  • to test \(H_{0} \colon \beta_{1} = 0\) versus \(H_{A} \colon \beta_{1} < 0\)
  • or to test \(H_{0} \colon \beta_{1} = 0\) versus \(H_{A} \colon \beta_{1} > 0\).
We can only use MSR/MSE to test \(H_{0} \colon \beta_{1} = 0\) versus \(H_{A} \colon \beta_{1} ≠ 0\).

We have now completed our investigation of all of the entries of a standard analysis of variance table. The formula for each entry is summarized for you in the following analysis of variance table:

Source of Variation DF SS MS F
Regression 1 \(SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2\) \(MSR=\dfrac{SSR}{1}\) \(F^*=\dfrac{MSR}{MSE}\)
Residual error n-2 \(SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2\) \(MSE=\dfrac{SSE}{n-2}\)  
Total n-1 \(SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2\)    

However, we will always let Minitab do the dirty work of calculating the values for us. Why is the ratio MSR/MSE labeled F* in the analysis of variance table? That's because the ratio is known to follow an F distribution with 1 numerator degree of freedom and n-2 denominator degrees of freedom. For this reason, it is often referred to as the analysis of variance F-test. The following section summarizes the formal F-test.

The formal F-test for the slope parameter \(\beta_{1}\)

The null hypothesis is \(H_{0} \colon \beta_{1} = 0\).

The alternative hypothesis is \(H_{A} \colon \beta_{1} ≠ 0\).

The test statistic is \(F^*=\dfrac{MSR}{MSE}\).

As always, the P-value is obtained by answering the question: "What is the probability that we’d get an F* statistic as large as we did if the null hypothesis is true?"

The P-value is determined by comparing F* to an F distribution with 1 numerator degree of freedom and n-2 denominator degrees of freedom.
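A minimal Python sketch of this F-test follows, taking the sums of squares and sample size as inputs; the numbers plugged in below are the skin cancer values from the analysis of variance table above, so F* should come out near 99.8.

```python
from scipy import stats

def anova_f_test(ssr, sse, n):
    """ANOVA F-test of H0: beta_1 = 0 in simple linear regression."""
    msr = ssr / 1                    # regression mean square (1 df)
    mse = sse / (n - 2)              # mean square error (n-2 df)
    f_star = msr / mse
    p_value = stats.f.sf(f_star, 1, n - 2)   # upper-tail area of F(1, n-2)
    return f_star, p_value

# skin cancer example: SSR = 36464, SSE = 17173, n = 49
print(anova_f_test(36464, 17173, 49))        # F* is about 99.8, P-value < 0.001
```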

In reality, we are going to let Minitab calculate the F* statistic and the P-value for us. Let's try it out on a new example!


2.7 - Example: Are Men Getting Faster?


Example 2-2: Men's 200m Data

The following data set (Men's 200m data) contains the winning times (in seconds) of the 22 men's 200-meter Olympic sprints held between 1900 and 1996. (Notice that the Olympics were not held during World War I and II years.) Is there a linear relationship between the year and the winning times? The plot of the estimated regression line sure makes it look so!

men's 200m

To answer the research question, let's conduct the formal F-test of the null hypothesis \(H_{0}\colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A}\colon \beta_{1} ≠ 0\).

The analysis of variance table, obtained in Minitab, appears below.

Analysis of Variance
Source DF SS MS F P
Regression 1 15.8 15.8 177.7 0.000
Residual Error 20 1.8 0.09    
Total 21 17.6      

From a scientific point of view, what we ultimately care about is the P-value, which Minitab indicates is 0.000 (to three decimal places). That is, the P-value is less than 0.001. The P-value is very small. It is unlikely that we would have obtained such a large F* statistic if the null hypothesis were true. Therefore, we reject the null hypothesis \(H_{0}\colon \beta_{1} = 0\) in favor of the alternative hypothesis \(H_{A}\colon \beta_{1} ≠ 0\). There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a linear relationship between year and winning time.

Equivalence of the analysis of variance F-test and the t-test

As we noted in the first two examples, the P-value associated with the t-test is the same as the P-value associated with the analysis of variance F-test. This will always be true for the simple linear regression model. It is illustrated in the year and winning time example also. Both P-values are 0.000 (to three decimal places):

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 76.153 4.152 18.34 0.000
Year -0.0284 0.00213 -13.33 0.000

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 15.796 15.796 177.7 0.000
Residual Error 20 1.778 0.089    
Total 21 17.574      

The P-values are the same because of a well-known relationship between a t random variable and an F random variable that has 1 numerator degree of freedom. Namely:

\((t^{*}_{(n-2)})^2=F^{*}_{(1,n-2)}\)

This will always hold for the simple linear regression model. This relationship is demonstrated in this example:

\(\left(-13.33\right)^{2} = 177.7\)
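You can check this numerically. The Python sketch below squares the t statistic reported above for Year and confirms that the t-test and F-test give the same two-tailed P-value.

```python
from scipy import stats

t_star, df = -13.33, 20                    # from the coefficients table above
f_star = t_star ** 2                       # about 177.7, matching the ANOVA table

p_from_t = 2 * stats.t.sf(abs(t_star), df)
p_from_f = stats.f.sf(f_star, 1, df)
print(f_star, p_from_t, p_from_f)          # the two P-values agree (both far below 0.001)
```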

In short:

  • For a given significance level \(\alpha\), the F-test of \(\beta_{1} = 0\) versus \(\beta_{1} ≠ 0\) is algebraically equivalent to the two-tailed t-test.
  • We will get exactly the same P-values, so…
    • If one test rejects \(H_{0}\), then so will the other.
    • If one test does not reject \(H_{0}\), then neither will the other.

The natural question then is ... when should we use the F-test and when should we use the t-test?

  • The F-test is only appropriate for testing that the slope differs from 0 (\(\beta_{1} ≠ 0\)).
  • Use the t-test to test that the slope is positive (\(\beta_{1} > 0\)) or negative (\(\beta_{1} < 0\)). Remember, though, that you will have to divide the P-value that Minitab reports by 2 to get the appropriate P-value.

The F-test is more useful for the multiple regression model when we want to test that more than one slope parameter is 0. We'll learn more about this later in the course!

Try it!

The ANOVA F-test

Height of white spruce trees

In forestry, the diameter of a tree at breast height (which is fairly easy to measure) is used to predict the height of a tree (a difficult measurement to obtain). Silviculturists working in British Columbia's boreal forest conducted a series of spacing trials to predict the heights of several species of trees. The data set White Spruce data contains the breast height diameters (in centimeters) and heights (in meters) for a sample of 36 white spruce trees.

  1. Is there sufficient evidence to conclude that there is a linear association between breast height diameter and tree height? Justify your response by looking at the fitted line plot and by conducting the analysis of variance F-test. In conducting the F-test, specify the null and alternative hypotheses, the significance level you used, and your final conclusion. (See Minitab Help: Creating a fitted line plot and Performing a basic regression analysis).
  2. Which value in the ANOVA table quantifies how far the estimated regression line is from the "no trend" line? That is, what is the particular value for this data set?
  3. Use the Minitab output to illustrate, for this example, the relationship between the t-test and the ANOVA F-test for testing \(H_{0} \colon \beta_{1} = 0\) against \(H_{A} \colon \beta_{1} ≠ 0\).

2.8 - Equivalent linear relationship tests


Investigating Husband and Wife Data

It should be noted that the three hypothesis tests we have learned for testing the existence of a linear relationship — the t-test for \(H_{0} \colon \beta_{1} = 0\), the ANOVA F-test for \(H_{0} \colon \beta_{1} = 0\), and the t-test for \(H_{0} \colon \rho = 0\) — will always yield the same results. For example, when evaluating whether or not a linear relationship exists between a husband's age and his wife's age, if we treat the husband's age ("HAge") as the response and the wife's age ("WAge") as the predictor, each test yields a P-value of 0.000, that is, less than 0.001 (Husband and Wife data):

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 20577 20577 1242.51 0.000
Error 168 2782 17    
Total 169 23359      

Model Summary

S R-sq R-sq(adj) R-sq(pred)
4.06946 88.09% 88.02% 87.84%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 3.590 1.159 3.10 0.002
WAge 0.96670 0.02742 35.25 0.000

Regression Equation

HAge = 3.59 + 0.967 WAge
*48 rows unused

Correlation: HAge, WAge

Pearson correlation 0.939
P-Value 0.000

And similarly, if we treat the wife's age ("WAge") as the response and the husband's age ("HAge") as the predictor, each test yields a P-value of 0.000, that is, less than 0.001:

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 19396 19396 1242.51 0.000
Error 168 2623 16    
Total 169 22019      

Model Summary

S R-sq R-sq(adj)
3.951 88.1% 88.0%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 1.574 1.150 1.37 0.173
HAge 0.91124 0.02585 35.25 0.000

Regression Equation

WAge = 1.57 + 0.911 HAge

*48 rows unused

Correlation: WAge, HAge

Pearson Correlation 0.939
P-Value 0.000

Technically, then, it doesn't matter what test you use to obtain the P-value. You will always get the same P-value. But, you should report the results of the test that make sense for your particular situation:

  • If one of the variables can be clearly identified as the response, report the results of the t-test or F-test for testing \(H_{0} \colon \beta_{1} = 0\). (Does it make sense to use x to predict y?)
  • If it is not obvious which variable is the response, report that you conducted a t-test for testing \(H_{0} \colon \rho = 0\). (Does it only make sense to look for an association between x and y?)
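For data analyzed outside Minitab, the same equivalence is easy to see in Python. The sketch below uses a small made-up set of paired ages (not the Husband and Wife data) and compares the slope test from scipy.stats.linregress with the correlation test from scipy.stats.pearsonr.

```python
import numpy as np
from scipy import stats

# made-up paired ages, for illustration only
wage = np.array([23.0, 31.0, 45.0, 52.0, 60.0, 38.0, 27.0, 49.0])   # wives' ages
hage = np.array([25.0, 33.0, 47.0, 55.0, 58.0, 40.0, 30.0, 50.0])   # husbands' ages

slope_test = stats.linregress(wage, hage)   # t-test of H0: beta_1 = 0
r, p_corr = stats.pearsonr(wage, hage)      # t-test of H0: rho = 0

print(slope_test.pvalue, p_corr)            # identical two-tailed P-values
```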

2.9 - Notation for the Lack of Fit test


To conclude this lesson, we'll digress slightly to consider the lack of fit test for linearity — the "L" in the "LINE" conditions. The reason we consider this here is that, like the ANOVA F-test above, this test is an F-test based on decomposing sums of squares.

However, before we "derive" the lack of fit F-test, it is important to note that the test requires repeat observations — called "replicates" — for at least one of the values of the predictor x. That is, if each x value in the data set is unique, then the lack of fit test can't be conducted on the data set. Even when we do have replicates, we typically need quite a few for the test to have any power. As such, this test generally only applies to specific types of datasets with plenty of replicates.

As is often the case before we learn a new hypothesis test, we have to get some new notation under our belt. In doing so, we'll look at some (contrived) data that purports to describe the relationship between the size of the minimum deposit required when opening a new checking account at a bank (x) and the number of new accounts at the bank (y) (New Accounts data). Suppose the trend in the data looks curved, but we fit a line through the data nonetheless.

For each of the distinct x values in the data set (75, 100, 125, 150, 175, and 200), we can define the standard notation used for the lack of fit F-test. Let's take the case where x = 75 dollars:

  • \(y_{11}\) denotes the first measurement (28) made at the first x-value (x = 75) in the data set
  • \(y_{12}\) denotes the second measurement (42) made at the first x-value (x = 75) in the data set
  • \(\bar{y}_{1}\) denotes the average (35) of all of the y values at the first x-value (x = 75)
  • \(\hat{y}_{11}\) denotes the predicted response (87.5) for the first measurement made at the first x-value (x = 75)
  • \(\hat{y}_{12}\) denotes the predicted response (87.5) for the second measurement made at the first x-value (x = 75)

The notation for the other x values (100, 125, and so on) follows the same pattern. In general:

  • \(y_{ij}\) denotes the \(j^{th}\) measurement made at the \(i^{th}\) x-value in the data set
  • \(\bar{y}_{i}\) denotes the average of all of the y values at the \(i^{th}\) x-value
  • \(\hat{y}_{ij}\) denotes the predicted response for the \(j^{th}\) measurement made at the \(i^{th}\) x-value

2.10 - Decomposing the Error


Example 2-3

If you think about it, there are two different explanations for why our data points might not fall right on the estimated regression line. One possibility is that our regression model doesn't describe the trend in the data well enough. That is, the model may exhibit a "lack of fit." The second possibility is that, as is often the case, there is just random variation in the data. This realization suggests that we should decompose the error into two components — one part due to the lack of fit of the model and the second part just due to random error. If most of the error is due to lack of fit, and not just random error, it suggests that we should scrap our model and try a different one.

Let's try decomposing the error in the checking account example, (New Accounts data). Recall that the prediction error for any data point is the distance of the observed response from the predicted response, i.e., \(y_{ij}-\hat{y}_{ij}\). (Can you identify these distances on the plot of the data below?) To quantify the total error of prediction, we determine this distance for each data point, square the distance, and add up all of the distances to get:

\(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2\)

Not surprisingly, this quantity is called the "error sum of squares" and is denoted SSE. The error sum of squares for our checking account example is \(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2=14742\).

If a line fits the data well, then the average of the observed responses at each x-value should be close to the predicted response for that x-value. Therefore, to determine how much of the total error is due to the lack of model fit, we determine how far the average observed response at each x-value is from the predicted response of each data point. That is, we calculate the distance \(\bar{y}_{i}-\hat{y}_{ij}\). To quantify the total lack of fit, we determine this distance for each data point, square it, and add up all of the squared distances to get:

\(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2\)

Not surprisingly, this quantity is called the "lack of fit sum of squares" and is denoted SSLF. The lack of fit sum of squares for our checking account example is \(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2=13594\).

To determine how much of the total error is due to just random error, we determine how far each observed response is from the average observed response at its x-value. That is, we calculate the distance \(y_{ij}-\bar{y}_{i}\). To quantify the total pure error, we determine this distance for each data point, square it, and add up all of the squared distances to get:

\(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2\)

Not surprisingly, this quantity is called the "pure error sum of squares" and is denoted SSPE. The pure error sum of squares for our checking account example is \(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2=1148\).

new accounts vs size of minimum deposit plot

\(\hat{y}=50.7+0.49x\)

\(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2=14742\)

\(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2=13594\)

\(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2=1148\)

In summary, we've shown in this checking account example that most of the error (SSE = 14742) is attributed to the lack of a linear fit (SSLF = 13594) and not just to random error (SSPE = 1148).
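
As a concrete check on these calculations, here is a minimal R sketch that computes the three sums of squares directly. It assumes the newaccounts.txt file, with columns New (y) and Size (x), that is used in the R Help page at the end of this lesson; the file path is a placeholder.

newaccounts <- read.table("~/path-to-folder/newaccounts.txt", header=T)
attach(newaccounts)

model <- lm(New ~ Size)

fits <- fitted(model)        # predicted responses, y-hat_ij
ybar.i <- ave(New, Size)     # average response at each x-value, repeated per observation

sum((New - fits)^2)          # SSE, approximately 14742
sum((ybar.i - fits)^2)       # SSLF, approximately 13594
sum((New - ybar.i)^2)        # SSPE, approximately 1148

detach(newaccounts)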

Example 2-4

Let's see how our decomposition of the error works with a different example — one in which a line fits the data well. Suppose the relationship between the size of the minimum deposit required when opening a new checking account at a bank (x) and the number of new accounts at the bank (y) instead looks like this:

new accounts vs size of minimum deposit plot

\(\hat{y}=48.7+0.50x\)

\(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2=45.1\)

\(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2=6.6\)

\(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2=38.5\)

In this case, as we would expect based on the plot, very little of the total error (SSE = 45.1) is due to a lack of a linear fit (SSLF = 6.6). Most of the error appears to be due to just random variation in the number of checking accounts (SSPE = 38.5).

In summary

The basic idea behind decomposing the total error is:

  • We break down the residual error ("error sum of squares" — denoted SSE) into two components:
    • a component that is due to a lack of model fit ("lack of fit sum of squares" — denoted SSLF)
    • a component that is due to pure random error ("pure error sum of squares" — denoted SSPE)
  • If the lack of fit sum of squares is a large component of the residual error, it suggests that a linear function is inadequate.

For each data point, the distance \(y_{ij}-\hat{y}_{ij}\) decomposes exactly into the sum of the two distances \(\bar{y}_{i}-\hat{y}_{ij}\) and \(y_{ij}-\bar{y}_{i}\), since \(y_{ij}-\hat{y}_{ij} = (\bar{y}_{i}-\hat{y}_{ij}) + (y_{ij}-\bar{y}_{i})\).

Although the algebra takes a little more work (the cross-product term vanishes when you square and sum), the decomposition also holds for the sums of squared distances:

\(\underbrace{\sum\limits_{i=1}^c \sum\limits_{j=1}^{n_i} \left(y_{ij} - \hat{y}_{ij}\right)^{2}}_{\underset{\text{Error Sum of Squares}}{\text{SSE}}} = \underbrace{\sum\limits_{i=1}^c \sum\limits_{j=1}^{n_i} \left(\overline{y}_{i} - \hat{y}_{ij}\right)^{2}}_{\underset{\text{Lack of Fit Sums of Squares}}{\text{SSLF}}} + \underbrace{\sum\limits_{i=1}^c \sum\limits_{j=1}^{n_i} \left(y_{ij} - \overline{y}_{i}\right)^{2}}_{\underset{\text{Pure Error Sum of Squares}}{\text{SSPE}}}\)

SSE = SSLF + SSPE

The degrees of freedom associated with each of these sums of squares follow a similar decomposition.

  • As before, the degrees of freedom associated with SSE is n-2. (The 2 comes from the fact that you estimate 2 parameters — the slope and the intercept — whenever you fit a line to a set of data.)
  • The degrees of freedom associated with SSLF is c-2, where c denotes the number of distinct x values you have.
  • The degrees of freedom associated with SSPE is n-c, where again c denotes the number of distinct x values you have.

You might notice that the degrees of freedom break down as:

\(\underset{\substack{\text{degrees of freedom}\\ \text{associated with SSE}}}{\left(n-2\right)} = \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSLF}}}{\left(c-2\right)} + \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSPE}}}{\left(n-c\right)}\)

where again c denotes the number of distinct x values you have.


2.11 - The Lack of Fit F-test

Investigating New Accounts Data

We're almost there! We just need to determine an objective way of deciding when too much of the error in our prediction is due to a lack of model fit. That's where the lack of fit F-test comes into play. Let's return to the first checking account example, (New Accounts data):

new accounts vs size of minimum deposit plot

Jumping ahead to the punchline, here's Minitab's output for the lack of fit F-test for this data set:

Analysis of Variance

Source           DF   Adj SS   Adj MS   F-Value   P-Value
Regression        1     5141     5141      3.14     0.110
Residual Error    9    14742     1638
Lack of Fit       4    13594     3398     14.80     0.006
Pure Error        5     1148      230
Total            10    19883

1 row with no replicates

As you can see, the lack of fit output appears as a portion of the analysis of variance table. In the Sum of Squares ("SS") column, we see — as we previously calculated — that SSLF = 13594 and SSPE = 1148 sum to SSE = 14742. We also see in the Degrees of Freedom ("DF") column that — since there are n = 11 data points and c = 6 distinct x values (75, 100, 125, 150, 175, and 200) — the lack of fit degrees of freedom, c - 2 = 4, and the pure error degrees of freedom, n - c = 5, sum to the error degrees of freedom, n - 2 = 9.

Just as is done for the sums of squares in the basic analysis of variance table, the lack of fit sum of squares and the error sum of squares are used to calculate "mean squares." They are even calculated similarly, namely by dividing the sum of squares by their associated degrees of freedom. Here are the formal definitions of the mean squares:

The "lack of fit mean square" is \(MSLF=\dfrac{\sum\sum(\bar{y}_i-\hat{y}_{ij})^2}{c-2}=\dfrac{SSLF}{c-2}\)
The "pure error mean square" is \(MSPE=\dfrac{\sum\sum(y_{ij}-\bar{y}_{i})^2}{n-c}=\dfrac{SSPE}{n-c}\)

In the Mean Squares ("MS") column, we see that the lack of fit mean square MSLF is 13594 divided by 4, or 3398. The pure error mean square MSPE is 1148 divided by 5, or 230:

Analysis of Variance

Source           DF   Adj SS   Adj MS   F-Value   P-Value
Regression        1     5141     5141      3.14     0.110
Residual Error    9    14742     1638
Lack of Fit       4    13594     3398     14.80     0.006
Pure Error        5     1148      230
Total            10    19883

You might notice that the lack of fit F-statistic is calculated by dividing the lack of fit mean square (MSLF = 3398) by the pure error mean square (MSPE = 230) to get 14.80. How do we know that this F-statistic helps us in testing the hypotheses:

  • \(H_{0 }\): The relationship assumed in the model is reasonable, i.e., there is no lack of fit.
  • \(H_{A }\): The relationship assumed in the model is not reasonable, i.e., there is a lack of fit.

The answer lies in the "expected mean squares." In our sample of n = 11 newly opened checking accounts, we obtained MSLF = 3398. If we had taken a different random sample of size n = 11, we would have obtained a different value for MSLF. Theory tells us that the average of all of the possible MSLF values we could obtain is:

\(E(MSLF) =\sigma^2+\dfrac{\sum n_i(\mu_i-(\beta_0+\beta_1X_i))^2}{c-2}\)

That is, we should expect MSLF, on average, to equal the above quantity — \(\sigma^{2}\) plus another messy-looking term. Think about that messy term. If the null hypothesis is true, i.e., if the relationship between the predictor x and the response y is linear, then \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\) and the messy term becomes 0 and goes away. That is, if there is no lack of fit, we should expect the lack of fit mean square MSLF to equal \(\sigma^{2}\).

What should we expect MSPE to equal? Theory tells us it should, on average, always equal \(\sigma^{2}\):

\(E(MSPE) =\sigma^2\)

Aha — there we go! The logic behind the calculation of the F-statistic is now clear:

  • If there is a linear relationship between x and y, then \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\). That is, there is no lack of fit in the simple linear regression model. We would expect the ratio MSLF/MSPE to be close to 1.
  • If there is not a linear relationship between x and y, then \(\mu_{i} ≠ \beta_{0} + \beta_{1}X_{i}\). That is, there is a lack of fit in the simple linear regression model. We would expect the ratio MSLF/MSPE to be large, i.e., a value greater than 1.

So, to conduct the lack of fit test, we calculate the value of the F-statistic:

\(F^*=\dfrac{MSLF}{MSPE}\)

and determine if it is large. To decide if it is large, we compare the F*-statistic to an F-distribution with c - 2 numerator degrees of freedom and n - c denominator degrees of freedom.
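
As a quick check, the F*-statistic and its p-value for the New Accounts example can be computed in R from the sums of squares above (a rough sketch using rounded values; the same F-statistic and p-value appear in the Minitab output for this example):

MSLF <- 13594 / 4                           # SSLF / (c - 2)
MSPE <- 1148 / 5                            # SSPE / (n - c)
Fstar <- MSLF / MSPE                        # about 14.8
pf(Fstar, df1=4, df2=5, lower.tail=FALSE)   # p-value, about 0.006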

In summary

We follow standard hypothesis test procedures in conducting the lack of fit F-test. First, we specify the null and alternative hypotheses:

  • \(H_{0}\): The relationship assumed in the model is reasonable, i.e., there is no lack of fit in the model \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\).
  • \(H_{A}\): The relationship assumed in the model is not reasonable, i.e., there is lack of fit in the model \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\).

Second, we calculate the value of the F-statistic:

\(F^*=\dfrac{MSLF}{MSPE}\)

To do so, we complete the analysis of variance table using the following formulas.

Analysis of Variance

Source DF SS MS F
Regression 1 \(SSR=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(\hat{y}_{ij}-\bar{y})^2\) \(MSR=\dfrac{SSR}{1}\) \(F=\dfrac{MSR}{MSE}\)
Residual Error n - 2 \(SSE=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\hat{y}_{ij})^2\) \(MSE=\dfrac{SSE}{n-2}\)  
Lack of Fit c - 2 \(SSLF=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(\bar{y}_{i}-\hat{y}_{ij})^2\) \(MSLF=\dfrac{SSLF}{c-2}\) \(F^*=\dfrac{MSLF}{MSPE}\)
Pure Error n - c \(SSPE=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i})^2\) \(MSPE=\dfrac{SSPE}{n-c}\)  
Total n - 1 \(SSTO=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\bar{y})^2\)    

In reality, we let statistical software, such as Minitab, determine the analysis of variance table for us.

Third, we use the resulting F*-statistic to calculate the P-value. As always, the P-value is the answer to the question "how likely is it that we’d get an F*-statistic as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to an F-distribution with c - 2 numerator degrees of freedom and n - c denominator degrees of freedom.

Finally, we make a decision:

  • If the P-value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. We conclude that "there is sufficient evidence at the \(\alpha\) level to conclude that there is a lack of fit in the simple linear regression model."
  • If the P-value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis. We conclude "there is not enough evidence at the \(\alpha\) level to conclude that there is a lack of fit in the simple linear regression model."

For our checking account example:

new accounts vs size of minimum deposit plot

in which we obtain:

Analysis of Variance

Source           DF   Adj SS   Adj MS   F-Value   P-Value
Regression        1     5141     5141      3.14     0.110
Residual Error    9    14742     1638
Lack of Fit       4    13594     3398     14.80     0.006
Pure Error        5     1148      230
Total            10    19883

the F*-statistic is 14.80 and the P-value is 0.006. The P-value is smaller than the significance level \(\alpha = 0.05\) — we reject the null hypothesis in favor of the alternative. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a lack of fit in the simple linear regression model. In light of the scatterplot, the lack of fit test provides the answer we expected.

Try it!

The lack of fit test

Fill in the missing numbers (??) in the following analysis of variance table resulting from a simple linear regression analysis.

Source           DF   Adj SS   Adj MS   F-Value   P-Value
Regression       ??   12.597       ??        ??     0.000
Residual Error   ??       ??       ??
Lack of Fit       3       ??       ??        ??        ??
Pure Error       ??    0.157       ??
Total            14   15.522

2.12 - Further Examples

Example 2-5: Highway Sign Reading Distance and Driver Age

The data are n = 30 observations on driver age and the maximum distance (feet) at which individuals can read a highway sign (Sign Distance data).

(Data source: Mind On Statistics, 3rd edition, Utts and Heckard)

The plot below gives a scatterplot of the highway sign data along with the least squares regression line.

scatterplot of highway sign data

Here is the accompanying Minitab output, which is found by performing Stat >> Regression >> Regression on the highway sign data.

Regression Analysis: Distance versus Age

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 576.68 23.47 24.57 0.000
Age -3.0068 0.4243 -7.09 0.000

Regression Equation

Distance = 577 - 3.01 Age

Hypothesis Test for the Intercept (\(\beta_{0}\))

This test is rarely a test of interest, but does show up when one is interested in performing a regression through the origin (which we touched on earlier in this lesson). In the Minitab output above, the row labeled Constant gives the information used to make inferences about the intercept. The null and alternative hypotheses for a hypothesis test about the intercept are written as:

\(H_{0} \colon \beta_{0} = 0\)
\(H_{A} \colon \beta_{0} \ne 0\)

In other words, the null hypothesis is testing if the population intercept is equal to 0 versus the alternative hypothesis that the population intercept is not equal to 0. In most problems, we are not particularly interested in hypotheses about the intercept. For instance, in our example, the intercept is the mean distance when the age is 0, a meaningless age. Also, the intercept does not give information about how the value of y changes when the value of x changes. Nevertheless, to test whether the population intercept is 0, the information from the Minitab output is used as follows:

  1. The sample intercept is \(b_{0}\) = 576.68, the value under Coef.
  2. The standard error (SE) of the sample intercept, written as se(\(b_{0}\)), is se(\(b_{0}\)) = 23.47, the value under SE Coef. The SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{0}\) gives, very roughly, the average difference between the sample \(b_{0}\) and the true population intercept \(\beta_{0}\), for random samples of this size (and with these x-values).
  3. The test statistic is t = \(b_{0}\)/se(\(b_{0}\)) = 576.68/23.47 = 24.57, the value under T-Value.
  4. The p-value for the test is p = 0.000 and is given under P-Value. The p-value is actually very small and not exactly 0.
  5. The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the population intercept is not equal to 0.

So how exactly is the p-value found? For simple regression, the p-value is determined using a t distribution with n − 2 degrees of freedom (df), which is written as \(t_{n−2}\), and is calculated as 2 × area past |t| under a \(t_{n−2}\) curve. In this example, df = 30 − 2 = 28. The p-value region is the type of region shown in the figure below. The negative and positive versions of the calculated t provide the interior boundaries of the two shaded regions. As the value of t increases, the p-value (area in the shaded regions) decreases.

t-distribution curve with the two tail regions beyond \(-\mid t \mid\) and \(+\mid t \mid\) shaded; the p-value is 2 x the area to the right of \(\mid t \mid\)

Hypothesis Test for the Slope (\(\beta_{1}\))

This test can be used to test whether or not x and y are linearly related. The row pertaining to the variable Age in the Minitab output from earlier gives the information used to make inferences about the slope. The slope directly tells us about the link between the mean y and x. When the true population slope does not equal 0, the variables y and x are linearly related. When the slope is 0, there is not a linear relationship because the mean y does not change when the value of x is changed. The null and alternative hypotheses for a hypothesis test about the slope are written as:

\(H_{0} \colon \beta_{1}\) = 0
\(H_{A} \colon \beta_{1}\) ≠ 0

In other words, the null hypothesis is testing if the population slope is equal to 0 versus the alternative hypothesis that the population slope is not equal to 0. To test whether the population slope is 0, the information from the Minitab output is used as follows:

  1. The sample slope is \(b_{1}\) = −3.0068, the value under Coef in the Age row of the output.
  2. The SE of the sample slope, written as se(\(b_{1}\)), is se(\(b_{1}\)) = 0.4243, the value under SE Coef. Again, the SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{1}\) gives, very roughly, the average difference between the sample \(b_{1}\) and the true population slope \(\beta_{1}\), for random samples of this size (and with these x-values).
  3. The test statistic is t = \(b_{1}\)/se(\(b_{1}\)) = −3.0068/0.4243 = −7.09, the value under T-Value.
  4. The p-value for the test is p = 0.000 and is given under P-Value.
  5. The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the variables of Distance and Age are linearly related.

As before, the p-value is the region illustrated in the figure above.
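
As a rough check, both two-sided p-values can be computed in R from the reported t-values, using a t-distribution with df = n − 2 = 28 (a sketch based on the rounded values in the output above):

2 * pt(abs(24.57), df=28, lower.tail=FALSE)   # intercept test: essentially 0
2 * pt(abs(-7.09), df=28, lower.tail=FALSE)   # slope test: about 1e-07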

Confidence Interval for the Slope (\(\beta_{1}\))

A confidence interval for the unknown value of the population slope \(\beta_{1}\) can be computed as

sample statistic ± multiplier × standard error of statistic

→ \(b_{1}\) ± t* × se(\(b_{1}\))

To find the t* multiplier, you can do one of the following:

  1. In simple regression, the t* multiplier is determined using a \(t_{n−2}\) distribution. The value of t* is such that the confidence level is the area (probability) between −t* and +t* under the t-curve.
  2. A table such as the one in the textbook can be used to look up the multiplier.
  3. Alternatively, software like Minitab can be used.

95% Confidence Interval

In our example, n = 30 and df = n − 2 = 28. For 95% confidence, t* = 2.05. A 95% confidence interval for \(\beta_{1}\), the true population slope, is:

−3.0068 ± (2.05 × 0.4243)
−3.0068 ± 0.870
or about −3.88 to −2.14.

Interpretation: With 95% confidence, we can say the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each one-year increase in age. It is incorrect to say that with 95% probability the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each one-year increase in age. Make sure you understand why!!!

99% Confidence Interval

For 99% confidence, t* = 2.76. A 99% confidence interval for \(\beta_{1}\), the true population slope, is:

−3.0068 ± (2.76 × 0.4243)
−3.0068 ± 1.1711
or about −4.18 to −1.84.

Interpretation: With 99% confidence, we can say the mean sign reading distance decreases somewhere between 1.84 and 4.18 feet per each one-year increase in age. Notice that as we increase our confidence, the interval becomes wider. So as we approach 100% confidence, our interval grows to become the whole real line.

As a final note, the above procedures can be used to calculate a confidence interval for the population intercept. Just use \(b_{0}\) (and its standard error) rather than \(b_{1}\).
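
As a rough check of these calculations, the multipliers and intervals can be computed directly in R from the estimates in the output above (a sketch; the confint function shown in the R Help page at the end of this lesson gives the same intervals with less rounding):

b1 <- -3.0068                     # estimated slope
se.b1 <- 0.4243                   # standard error of the slope
tstar.95 <- qt(0.975, df=28)      # about 2.05
tstar.99 <- qt(0.995, df=28)      # about 2.76

b1 + c(-1, 1) * tstar.95 * se.b1  # about (-3.88, -2.14)
b1 + c(-1, 1) * tstar.99 * se.b1  # about (-4.18, -1.84)

# Replacing b1 and se.b1 with the intercept estimate and its standard error
# gives the corresponding confidence interval for the population intercept.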

Example 2-6: Handspans Data

Stretched handspans and heights are measured in inches for n = 167 college students (Hand Height data). We’ll use y = height and x = stretched handspan. A scatterplot with a regression line superimposed is given below, together with results of a simple linear regression model fit to the data.

scatterplot with a regression line superimposed

Regression Analysis: Height versus HandSpan

Analysis of Variance

Source         DF    Adj SS    Adj MS   F-Value   P-Value
Regression      1   1500.06   1500.06    199.17     0.000
HandSpan        1   1500.06   1500.06    199.17     0.000
Error         165   1242.70      7.53
Lack-of-Fit    17     96.24      5.66      0.73     0.767
Pure Error    148   1146.46      7.75
Total         166   2742.76

Model Summary

S R-sq R-sq(adj) R-sq(pred)
2.74436 54.69% 54.42% 53.76%

Coefficients

Predictor Coef SE Coef T-Value P-Value VIF
Constant 35.53 2.32 15.34 0.000  
HandSpan 1.560 0.111 14.11 0.000 1.00

Regression Equation

Height = 35.53 + 1.560 HandSpan

Note! Some things to note are:

  • The residual standard deviation S is 2.744 and this estimates the standard deviation of the errors.
  • \(r^2\) = (SSTO-SSE) / SSTO = SSR / (SSR+SSE) = 1500.1 / (1500.1+1242.7) = 1500.1 / 2742.8 = 0.547 or 54.7%. The interpretation is that handspan differences explain 54.7% of the variation in heights (see the short R check after this list).
  • The value of the F statistic is F = 199.2 with 1 and 165 degrees of freedom, and the p-value for this F statistic is 0.000. Thus we reject the null hypothesis \(H_{0} \colon \beta_{1}\) = 0 in favor of \(H_A\colon\beta_1\neq 0\). In other words, the observed relationship is statistically significant.
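
Here is a short R check of the \(r^2\) calculation from the ANOVA sums of squares, assuming the handheight data and the same model used in the R Help page at the end of this lesson:

handheight <- read.table("~/path-to-folder/handheight.txt", header=T)
model <- lm(Height ~ HandSpan, data=handheight)

ss <- anova(model)[["Sum Sq"]]   # SSR and SSE: about 1500.1 and 1242.7
ss[1] / sum(ss)                  # r-squared = SSR / SSTO, about 0.547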

Example 2-7: Quality Data

You are a manufacturer who wants to obtain a quality measure on a product, but the procedure to obtain the measure is expensive. There is an indirect approach, which uses a different product score (Score 1) in place of the actual quality measure (Score 2). This approach is less costly but also is less precise. You can use regression to see if Score 1 explains a significant amount of the variance in Score 2 to determine if Score 1 is an acceptable substitute for Score 2. The results from a simple linear regression analysis are given below:

Regression Analysis: Score2 versus Score1

Analysis of Variance

Source           DF   Adj SS   Adj MS   F-Value   P-Value
Regression        1   2.5419   2.5419    156.56     0.000
Residual Error    7   0.1136   0.0162
Total             8   2.6556

Model Summary

S R-sq R-sq(adj)
0.127419 95.7% 95.1%

Coefficients

Predictor Coef SE Coef T-Value P-Value
Constant 1.1177 0.1093 10.23 0.000
Score1 0.21767 0.01740 12.51 0.000

Regression Equation

Score2 = 1.12 + 0.218 Score1

We are interested in testing the null hypothesis that Score 1 is not a significant predictor of Score 2 versus the alternative that Score 1 is a significant predictor of Score 2. More formally, we are testing:

\(H_{0} \colon\beta_{1}\) = 0
\(H_{A} \colon \beta_{1}\) ≠ 0

The p-value in the ANOVA table (0.000) indicates that the relationship between Score 1 and Score 2 is statistically significant at an α-level of 0.05. This is also shown by the p-value for the estimated coefficient of Score 1, which is 0.000.
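
Note that, as in any simple linear regression, the t-test for the slope and the analysis of variance F-test are equivalent here: the square of the t-value reported for Score 1 reproduces the F-value up to rounding,

\((12.51)^2 \approx 156.5 \approx F = 156.56\)

so the two reported p-values of 0.000 reflect the same result.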


Software Help 2


The next two pages cover the Minitab and R commands for the procedures in this lesson.

Below is a zip file that contains all the data sets used in this lesson:

STAT501_Lesson02.zip

  • couplesheight.txt
  • handheight.txt
  • heightgpa.txt
  • husbandwife.txt
  • leadcord.txt
  • mens200m.txt
  • newaccounts.txt
  • signdist.txt
  • skincancer.txt
  • solutions_conc.txt
  • whitespruce.txt

Minitab Help 2: SLR Model Evaluation

Minitab®

Skin cancer

Cord blood lead concentration

Skin cancer

Height and grade point average

Sprinters

Highway sign reading distance and driver age

  • Perform a basic regression analysis with y = Distance and x = Age.
  • Create a fitted line plot.
  • To display confidence intervals for the model parameters (regression coefficients) click "Results" in the Regression Dialog and select "Expanded tables" for "Display of results."
  • To change the confidence level for the intervals click "Options" in the Regression Dialog.

Handspan and height

Checking account deposits


R Help 2: SLR Model Evaluation

R Help

Skin cancer

  • Load the skin cancer data.
  • Fit a simple linear regression model with y = Mort and x = Lat.
  • Display a scatterplot of the data with the simple linear regression line.
  • Display model results.
  • Calculate confidence intervals for the model parameters (regression coefficients).
skincancer <- read.table("~/path-to-folder/skincancer.txt", header=T)
attach(skincancer)

model <- lm(Mort ~ Lat)

plot(x=Lat, y=Mort,
     xlab="Latitude (at center of state)", ylab="Mortality (deaths per 10 million)",
     panel.last = lines(sort(Lat), fitted(model)[order(Lat)]))

summary(model)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 389.1894    23.8123   16.34  < 2e-16 ***
# Lat          -5.9776     0.5984   -9.99 3.31e-13 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 19.12 on 47 degrees of freedom
# Multiple R-squared:  0.6798,  Adjusted R-squared:  0.673 
# F-statistic:  99.8 on 1 and 47 DF,  p-value: 3.309e-13

confint(model, level=0.95)
#                  2.5 %     97.5 %
# (Intercept) 341.285151 437.093552
# Lat          -7.181404  -4.773867

detach(skincancer)

Cord blood lead concentration

  • Load the cord blood lead concentration data.
  • Fit a simple linear regression model with y = Cord and x = Sold.
  • Display a scatterplot of the data with the simple linear regression line.
  • Display model results.
  • Calculate confidence intervals for the model parameters (regression coefficients).
cordblood <- read.table("~/path-to-folder/cordblood.txt", header=T)
attach(cordblood)

model <- lm(Cord ~ Sold)

plot(x=Sold, y=Cord,
     xlab="Monthly gasoline lead sales (metric tons)",
     ylab="Mean cord blood lead concentration",
     panel.last = lines(sort(Sold), fitted(model)[order(Sold)]))

summary(model)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 4.108182   0.608806   6.748 2.05e-05 ***
# Sold        0.014885   0.004719   3.155   0.0083 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.6162 on 12 degrees of freedom
# Multiple R-squared:  0.4533,  Adjusted R-squared:  0.4078 
# F-statistic: 9.952 on 1 and 12 DF,  p-value: 0.008303

confint(model, level=0.95)
#                   2.5 %     97.5 %
# (Intercept) 2.781707607 5.43465712
# Sold        0.004604418 0.02516608

detach(cordblood)

Skin cancer

  • Load the skin cancer data.
  • Fit a simple linear regression model with y = Mort and x = Lat.
  • Display analysis of variance table.
skincancer <- read.table("~/path-to-folder/skincancer.txt", header=T)
attach(skincancer)

model <- lm(Mort ~ Lat)

anova(model)
# Analysis of Variance Table
# Response: Mort
#           Df Sum Sq Mean Sq F value    Pr(>F)    
# Lat        1  36464   36464  99.797 3.309e-13 ***
# Residuals 47  17173     365                      
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Note: R anova function does not display the total sum of squares.
# Add regression and residual sums of squares to get total sum of squares. 
# SSR + SSE = SSTO, i.e., 36464 + 17173 = 53637.

detach(skincancer)

Height and grade point average

  • Load the height and grade point average data.
  • Fit a simple linear regression model with y = gpa and x = height.
  • Display a scatterplot of the data with the simple linear regression line.
  • Display model results.
  • Display analysis of variance table.
heightgpa <- read.table("~/path-to-folder/heightgpa.txt", header=T)
attach(heightgpa)

model <- lm(gpa ~ height)

plot(x=height, y=gpa,
     xlab="Height (inches)", ylab="Grade Point Average",
     panel.last = lines(sort(height), fitted(model)[order(height)]))

summary(model)
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)  
# (Intercept)  3.410214   1.434616   2.377   0.0234 *
# height      -0.006563   0.021428  -0.306   0.7613  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.5423 on 33 degrees of freedom
# Multiple R-squared:  0.002835,  Adjusted R-squared:  -0.02738 
# F-statistic: 0.09381 on 1 and 33 DF,  p-value: 0.7613

anova(model)
# Analysis of Variance Table
# Response: gpa
#           Df Sum Sq Mean Sq F value Pr(>F)
# height     1 0.0276 0.02759  0.0938 0.7613
# Residuals 33 9.7055 0.29411
# SSTO = SSR + SSE = 0.0276 + 9.7055 = 9.7331.

detach(heightgpa)

Sprinters

  • Load the sprinter's data.
  • Fit a simple linear regression model with y = Men200m and x = Year.
  • Display a scatterplot of the data with the simple linear regression line.
  • Display model results.
  • Display analysis of variance table.
sprinters <- read.table("~/path-to-folder/mens200m.txt", header=T)
attach(sprinters)

model <- lm(Men200m ~ Year)

plot(x=Year, y=Men200m,
     xlab="Year", ylab="Men's 200m time (secs)",
     panel.last = lines(sort(Year), fitted(model)[order(Year)]))

summary(model)
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 76.153369   4.152226   18.34 5.61e-14 ***
# Year        -0.028383   0.002129  -13.33 2.07e-11 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.2981 on 20 degrees of freedom
# Multiple R-squared:  0.8988,  Adjusted R-squared:  0.8938 
# F-statistic: 177.7 on 1 and 20 DF,  p-value: 2.074e-11

anova(model)
# Analysis of Variance Table
# Response: Men200m
#           Df  Sum Sq Mean Sq F value    Pr(>F)    
# Year       1 15.7964 15.7964  177.72 2.074e-11 ***
# Residuals 20  1.7777  0.0889                      
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# SSTO = SSR + SSE = 15.7964 + 1.7777 = 17.5741.

detach(sprinters)

Highway sign reading distance and driver age

  • Load the signdist data.
  • Fit a simple linear regression model with y = Distance and x = Age.
  • Display a scatterplot of the data with the simple linear regression line.
  • Display model results.
  • Calculate confidence intervals for the slope.
signdist <- read.table("~/path-to-folder/signdist.txt", header=T)
attach(signdist)

model <- lm(Distance ~ Age)

plot(x=Age, y=Distance,
     xlab="Age", ylab="Distance",
     panel.last = lines(sort(Age), fitted(model)[order(Age)]))

summary(model)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 576.6819    23.4709  24.570  < 2e-16 ***
# Age          -3.0068     0.4243  -7.086 1.04e-07 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 49.76 on 28 degrees of freedom
# Multiple R-squared:  0.642,  Adjusted R-squared:  0.6292 
# F-statistic: 50.21 on 1 and 28 DF,  p-value: 1.041e-07

confint(model, parm="Age", level=0.95)
#                  2.5 %    97.5 %
# Age          -3.876051  -2.13762

confint(model, parm="Age", level=0.99)
#                  0.5 %    99.5 %
# Age          -4.179391  -1.83428

detach(signdist)

Handspan and height

  • Load the handheight data.
  • Fit a simple linear regression model with y = Height and x = HandSpan
  • Display a scatterplot of the data with the simple linear regression line.
  • Display model results.
  • Display analysis of variance table.
handheight <- read.table("~/path-to-folder/handheight.txt", header=T)
attach(handheight)

model <- lm(Height ~ HandSpan)

plot(x=HandSpan, y=Height,
     xlab="HandSpan", ylab="Height",
     panel.last = lines(sort(HandSpan), fitted(model)[order(HandSpan)]))

summary(model)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  35.5250     2.3160   15.34   <2e-16 ***
# HandSpan      1.5601     0.1105   14.11   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 2.744 on 165 degrees of freedom
# Multiple R-squared:  0.5469,  Adjusted R-squared:  0.5442 
# F-statistic: 199.2 on 1 and 165 DF,  p-value: < 2.2e-16

anova(model)
# Analysis of Variance Table
# Response: Height
#            Df Sum Sq Mean Sq F value    Pr(>F)    
# HandSpan    1 1500.1 1500.06  199.17 < 2.2e-16 ***
# Residuals 165 1242.7    7.53                      
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# SSTO = SSR + SSE = 1500.1 + 1242.7 = 2742.8.

detach(handheight)

Checking account deposits

  • Load the newaccounts data.
  • Fit a simple linear regression model with y = New and x = Size
  • Display a scatterplot of the data with the simple linear regression line.
  • Display model results.
  • Display lack of fit analysis of variance table.
  • Display usual analysis of variance table.
newaccounts <- read.table("~/path-to-folder/newaccounts.txt", header=T)
attach(newaccounts)

model <- lm(New ~ Size)

plot(x=Size, y=New,
     xlab="Size of minimum deposit", ylab="Number of new accounts",
     panel.last = lines(sort(Size), fitted(model)[order(Size)]))

summary(model)
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  50.7225    39.3979   1.287     0.23
# Size          0.4867     0.2747   1.772     0.11
# 
# Residual standard error: 40.47 on 9 degrees of freedom
# Multiple R-squared:  0.2586,  Adjusted R-squared:  0.1762 
# F-statistic: 3.139 on 1 and 9 DF,  p-value: 0.1102

library(alr3) # alr3 package must be installed first
pureErrorAnova(model) # Lack of fit anova table
# Analysis of Variance Table
# Response: New
#              Df  Sum Sq Mean Sq F value   Pr(>F)   
# Size          1  5141.3  5141.3  22.393 0.005186 **
# Residuals     9 14741.6  1638.0                    
#  Lack of fit  4 13593.6  3398.4  14.801 0.005594 **
#  Pure Error   5  1148.0   229.6                    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# NOTE: The F value for Size uses MSPE in its denominator.
# So, F value for Size is 5141.3 / 229.6 = 22.393.
# Thus it differs from the F value for Size in the usual anova table: 

anova(model)
# Analysis of Variance Table
# Response: New
#           Df  Sum Sq Mean Sq F value Pr(>F)
# Size       1  5141.3  5141.3  3.1389 0.1102
# Residuals  9 14741.6  1638.0               
# NOTE: Here the F value for Size uses MSE in its denominator.
# So, F value for Size is 5141.3 / 1638.0 = 3.1389.

detach(newaccounts)
