Lesson 2: SLR Model Evaluation
Overview
This lesson presents two alternative methods for testing whether a linear association exists between the predictor x and the response y in a simple linear regression model:
\(H_{0}\): \(\beta_{1}\) = 0 versus \(H_{A}\): \(\beta_{1}\) ≠ 0.
One is the t-test for the slope, while the other is an analysis of variance (ANOVA) F-test.
As you know, one of the primary goals of this course is to be able to translate a research question into a statistical procedure. Here are two examples of research questions and the alternative statistical procedures that could be used to answer them:
 Is there a (linear) relationship between skin cancer mortality and latitude?
 What statistical procedure answers this research question? We could estimate the regression line and then use the t-test to determine if the slope, \(\beta_{1}\), of the population regression line, is 0.
 Alternatively, we could perform an (analysis of variance) F-test.
 Is there a (linear) relationship between height and grade point average?
 What statistical procedure answers this research question? We could estimate the regression line and then use the t-test to see if the slope, \(\beta_{1}\), of the population regression line, is 0.
 Again, we could alternatively perform an (analysis of variance) F-test.
We also learn a way to check for linearity — the "L" in the "LINE" conditions — using the linear lack of fit test. This test requires replicates, that is, multiple observations of y for at least one (and preferably more) of the x values, and concerns the following hypotheses:
 \(H_{0}\): There is no lack of linear fit.
 \(H_{A}\): There is a lack of linear fit.
Objectives
 Calculate confidence intervals and conduct hypothesis tests for the population intercept \(\beta_{0}\) and population slope \(\beta_{1}\) using Minitab's regression analysis output.
 Draw research conclusions about the population intercept \(\beta_{0}\) and population slope \(\beta_{1}\) using the above confidence intervals and hypothesis tests.
 Know the six possible outcomes about the slope \(\beta_{1}\) whenever we test whether there is a linear relationship between a predictor x and a response y.
 Understand the "derivation" of the analysis of variance Ftest for testing \(H_{0}\): \(\beta_{1} = 0\). That is, understand how the total variation in a response y is broken down into two parts — a component that is due to the predictor x and a component that is just due to random error. And, understand how the expected mean squares tell us to use the ratio MSR/MSE to conduct the test.
 Know how each element of the analysis of variance table is calculated.
 Know what scientific questions can be answered with the analysis of variance Ftest.
 Conduct the analysis of variance F-test to test \(H_{0}\): \(\beta_{1} = 0\) versus \(H_{A}\): \(\beta_{1} ≠ 0\).
 Know the similarities and distinctions between the t-test and the F-test for testing \(H_{0}\): \(\beta_{1} = 0\).
 Know that the t-test for testing \(\beta_{1}\) = 0, the F-test for testing \(\beta_{1}\) = 0, and the t-test for testing \(\rho = 0\) yield similar results, but understand when it makes sense to report the results of each one.
 Calculate all of the values in the lack of fit analysis of variance table.
 Conduct the F-test for lack of fit.
 Know that the (linear) lack of fit test only gives you evidence against linearity. If you reject the null and conclude a lack of linear fit, it doesn't tell you what (nonlinear) regression function would work.
 Understand the "derivation" of the linear lack of fit test. That is, understand the decomposition of the error sum of squares, and how the expected mean squares tell us to use the ratio MSLF/MSPE to test for lack of linear fit.
Lesson 2 Code Files
Below is a zip file that contains all the data sets used in this lesson:
 couplesheight.txt
 handheight.txt
 heightgpa.txt
 husbandwife.txt
 leadcord.txt
 mens200m.txt
 newaccounts.txt
 signdist.txt
 skincancer.txt
 solutions_conc.txt
 whitespruce.txt
2.1  Inference for the Population Intercept and Slope
Recall that we are ultimately always interested in drawing conclusions about the population, not the particular sample we observed. In the simple regression setting, we are often interested in learning about the population intercept \(\beta_{0}\) and the population slope \(\beta_{1}\). As you know, confidence intervals and hypothesis tests are two related, but different, ways of learning about the values of population parameters. Here, we will learn how to calculate confidence intervals and conduct hypothesis tests for both \(\beta_{0}\) and \(\beta_{1}\).
Let's revisit the example concerning the relationship between skin cancer mortality and state latitude (Skin Cancer data). The response variable y is the mortality rate (number of deaths per 10 million people) of white males due to malignant skin melanoma from 1950-1959. The predictor variable x is the latitude (degrees North) at the center of each of the 49 states in the United States. A subset of the data looks like this:
#  State  Latitude  Mortality
1  Alabama  33.0  219
2  Arizona  34.5  160
3  Arkansas  35.0  170
4  California  37.5  182
5  Colorado  39.0  149
\(\vdots\)  \(\vdots\)  \(\vdots\)  \(\vdots\)
49  Wyoming  43.0  134
and a plot of the data with the estimated regression equation looks like this:
Is there a relationship between state latitude and skin cancer mortality? Certainly, since the estimated slope of the line, b1, is -5.98, not 0, there is a relationship between state latitude and skin cancer mortality in the sample of 49 data points. But, we want to know if there is a relationship between the population of all the latitudes and skin cancer mortality rates. That is, we want to know if the population slope \(\beta_{1}\) is unlikely to be 0.
(1 - \(\alpha\))100% t-interval for the slope parameter \(\beta_{1}\)
 Confidence Interval for \(\beta_{1}\)

The formula for the confidence interval for \(\beta_{1}\), in words, is:
Sample estimate ± (t-multiplier × standard error)
and, in notation, is:
\(b_1 \pm t_{(\alpha/2, n-2)}\times \left( \dfrac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}} \right)\)
The resulting confidence interval not only gives us a range of values that is likely to contain the true unknown value \(\beta_{1}\). It also allows us to answer the research question "is the predictor x linearly related to the response y?" If the confidence interval for \(\beta_{1}\) contains 0, then we conclude that there is no evidence of a linear relationship between the predictor x and the response y in the population. On the other hand, if the confidence interval for \(\beta_{1}\) does not contain 0, then we conclude that there is evidence of a linear relationship between the predictor x and the response y in the population.
An \(\alpha\)level hypothesis test for the slope parameter \(\beta_{1}\)
We follow standard hypothesis test procedures in conducting a hypothesis test for the slope \(\beta_{1}\). First, we specify the null and alternative hypotheses:
 Null hypothesis \(H_{0} \colon \beta_{1}\)= some number \(\beta\)
 Alternative hypothesis \(H_{A} \colon \beta_{1}\)≠ some number \(\beta\)
The phrase "some number \(\beta\)" means that you can test whether or not the population slope takes on any value. Most often, however, we are interested in testing whether \(\beta_{1}\) is 0. By default, Minitab conducts the hypothesis test with the null hypothesis, \(\beta_{1}\) is equal to 0, and the alternative hypothesis, \(\beta_{1}\)is not equal to 0. However, we can test values other than 0 and the alternative hypothesis can also state that \(\beta_{1}\) is less than (<) some number \(\beta\) or greater than (>) some number \(\beta\).
Second, we calculate the value of the test statistic using the following formula:
\(t^*=\dfrac{b_1-\beta}{\dfrac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}}}=\dfrac{b_1-\beta}{se(b_1)}\)
Third, we use the resulting test statistic to calculate the P-value. As always, the P-value is the answer to the question "how likely is it that we'd get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n - 2 degrees of freedom.
Finally, we make a decision:
 If the P-value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. We conclude that "there is sufficient evidence at the \(\alpha\) level to conclude that there is a linear relationship in the population between the predictor x and response y."
 If the P-value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis. We conclude "there is not enough evidence at the \(\alpha\) level to conclude that there is a linear relationship in the population between the predictor x and response y."
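The four-step procedure above can be sketched numerically. The snippet below uses a tiny made-up data set (not the lesson's data) to compute the least-squares estimates, MSE, the standard error of the slope, and the test statistic t* for \(H_{0}\): \(\beta_{1} = 0\):

```python
# Minimal sketch of the slope t-test computations.
# (The x, y values are hypothetical, for illustration only.)
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates b1 and b0
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx                      # estimated slope
b0 = y_bar - b1 * x_bar             # estimated intercept

# MSE and the standard error of b1
fitted = [b0 + b1 * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
mse = sse / (n - 2)
se_b1 = math.sqrt(mse) / math.sqrt(sxx)

# Test statistic for H0: beta1 = 0
t_star = (b1 - 0) / se_b1
print(round(b1, 3), round(se_b1, 3), round(t_star, 3))  # -> 0.6 0.283 2.121
```

The resulting t* would then be referred to a t-distribution with n - 2 = 3 degrees of freedom to obtain the P-value.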
Minitab®
Drawing conclusions about the slope parameter \(\beta_{1}\) using Minitab
Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the slope \(\beta_{1}\). Minitab's regression analysis output for our skin cancer mortality and latitude example appears below.
The line pertaining to the latitude predictor, Lat, in the summary table of predictors has been bolded. It tells us that the estimated slope coefficient \(b_{1}\), under the column labeled Coef, is -5.9776. The estimated standard error of \(b_{1}\), denoted se(\(b_{1}\)), in the column labeled SE Coef for "standard error of the coefficient," is 0.5984.
Analysis of Variance

Source  DF  Adj SS  Adj MS  F-Value  P-Value
Regression  1  36464  36464  99.80  0.000
Residual Error  47  17173  365
Total  48  53637

Coefficients

Predictor  Coef  SE Coef  T-Value  P-Value
Constant  389.19  23.81  16.34  0.000
Lat  -5.9776  0.5984  -9.99  0.000

Model Summary

S  R-sq  R-sq(adj)
19.12  68.0%  67.3%

Regression Equation

Mort = 389 - 5.98 Lat
By default, the test statistic is calculated assuming the user wants to test that the slope is 0. Dividing the estimated coefficient of -5.9776 by the estimated standard error of 0.5984, Minitab reports that the test statistic T is -9.99.
By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to" hypothesis. Upon calculating the probability that a t random variable with n - 2 = 47 degrees of freedom would be smaller than -9.99, and multiplying the probability by 2, Minitab reports that P is 0.000 (to three decimal places). That is, the P-value is less than 0.001. (Note we multiply the probability by 2 since this is a two-tailed test.)
Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that \(\beta_{1}\) does not equal 0. There is sufficient evidence, at the \(\alpha\) = 0.05 level, to conclude that there is a linear relationship in the population between skin cancer mortality and latitude.
It's easy to calculate a 95% confidence interval for \(\beta_{1}\) using the information in the Minitab output. You just need to use Minitab to find the t-multiplier for you. It is \(t_{\left(0.025, 47\right)} = 2.0117\). Then, the 95% confidence interval for \(\beta_{1}\) is \(-5.9776 ± 2.0117(0.5984)\) or (-7.2, -4.8). (Alternatively, Minitab can display the interval directly if you click the "Results" tab in the Regression dialog box, select "Expanded Table" and check "Coefficients.")
We can be 95% confident that the population slope is between -7.2 and -4.8. That is, we can be 95% confident that for every additional one-degree increase in latitude, the mean skin cancer mortality rate decreases between 4.8 and 7.2 deaths per 10 million people.
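As a quick check, the interval can be reproduced directly from the three quantities reported in the output (estimated slope, its standard error, and the t-multiplier). A minimal sketch in Python:

```python
# Reproduce the 95% confidence interval for the slope from the
# values reported in the Minitab output above.
b1 = -5.9776        # Coef for Lat
se_b1 = 0.5984      # SE Coef for Lat
t_mult = 2.0117     # t(0.025, 47)

lower = b1 - t_mult * se_b1
upper = b1 + t_mult * se_b1
print(round(lower, 1), round(upper, 1))  # -> -7.2 -4.8
```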
Video: Using Minitab for the Slope Test
Factors affecting the width of a confidence interval for \(\beta_{1}\)
Recall that, in general, we want our confidence intervals to be as narrow as possible. If we know what factors affect the length of a confidence interval for the slope \(\beta_{1}\), we can control them to ensure that we obtain a narrow interval. The factors can be easily determined by studying the formula for the confidence interval:
\(b_1 \pm t_{(\alpha/2, n-2)}\times \left( \dfrac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}} \right)\)
First, subtracting the lower endpoint of the interval from the upper endpoint of the interval, we determine that the width of the interval is:
\(2\times t_{(\alpha/2, n-2)}\times \left( \dfrac{\sqrt{MSE}}{\sqrt{\sum(x_i-\bar{x})^2}} \right)\)
So, how can we affect the width of our resulting interval for \(\beta_{1}\)?

As the confidence level decreases, the width of the interval decreases. Therefore, if we decrease our confidence level, we decrease the width of our interval. Clearly, we don't want to decrease the confidence level too much. Typically, confidence levels are never set below 90%.

As MSE decreases, the width of the interval decreases. The value of MSE depends on only two factors — how much the responses vary naturally around the estimated regression line, and how well your regression function (line) fits the data. Clearly, you can't control the first factor all that much other than to ensure that you are not adding any unnecessary error in your measurement process. Throughout this course, we'll learn ways to make sure that the regression function fits the data as well as it can.

The more spread out the predictor x values, the narrower the interval. The quantity \(\sum(x_i-\bar{x})^2\) in the denominator summarizes the spread of the predictor x values. The more spread out the predictor values, the larger the denominator, and hence the narrower the interval. Therefore, we can decrease the width of our interval by ensuring that our predictor values are sufficiently spread out.

As the sample size increases, the width of the interval decreases. The sample size plays a role in two ways. First, recall that the t-multiplier depends on the sample size through n - 2. Therefore, as the sample size increases, the t-multiplier decreases, and the length of the interval decreases. Second, the denominator \(\sum(x_i-\bar{x})^2\) also depends on n. The larger the sample size, the more terms you add to this sum, the larger the denominator, and the narrower the interval. Therefore, in general, you can ensure that your interval is narrow by having a large enough sample.
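The effect of predictor spread can be illustrated with a quick computation. The x values below are hypothetical; the point is that, for the same sample size, more spread-out predictor values inflate \(\sum(x_i-\bar{x})^2\) and therefore shrink the interval width:

```python
# Same sample size, different spreads of the predictor values.
# (The x values are hypothetical, for illustration only.)
def sum_sq_dev(xs):
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)

narrow = [4, 5, 5, 5, 6]   # predictor values bunched together
spread = [1, 3, 5, 7, 9]   # predictor values spread out

print(sum_sq_dev(narrow), sum_sq_dev(spread))  # -> 2.0 40.0
```

Since the width is proportional to \(1/\sqrt{\sum(x_i-\bar{x})^2}\), the spread-out design here would give an interval roughly \(\sqrt{40/2} \approx 4.5\) times narrower, all else equal.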
Six possible outcomes concerning slope \(\beta_{1}\)
There are six possible outcomes whenever we test whether there is a linear relationship between the predictor x and the response y, that is, whenever we test the null hypothesis \(H_{0} \colon \beta_{1}\) = 0 against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).
When we don't reject the null hypothesis, \(H_{0} \colon \beta_{1} = 0\), any of the following three realities are possible:
 We committed a Type II error. That is, in reality \(\beta_{1} ≠ 0\) and our sample data just didn't provide enough evidence to conclude that \(\beta_{1}\)≠ 0.
 There really is not much of a linear relationship between x and y.
 There is a relationship between x and y — it is just not linear.
When we do reject the null hypothesis, \(H_{0} \colon \beta_{1}\)= 0 in favor of the alternative hypothesis \(H_{A} \colon \beta_{1}\)≠ 0, any of the following three realities are possible:
 We committed a Type I error. That is, in reality \(\beta_{1} = 0\), but we have an unusual sample that suggests that \(\beta_{1} ≠ 0\).
 The relationship between x and y is indeed linear.
 A linear function fits the data okay, but a curved ("curvilinear") function would fit the data even better.
(1 - \(\alpha\))100% t-interval for the intercept parameter \(\beta_{0}\)
Calculating confidence intervals and conducting hypothesis tests for the intercept parameter \(\beta_{0}\) is not done as often as it is for the slope parameter \(\beta_{1}\). The reason for this becomes clear upon reviewing the meaning of \(\beta_{0}\). The intercept parameter \(\beta_{0}\) is the mean of the responses at x = 0. If x = 0 is meaningless, as it would be, for example, if your predictor variable was height, then \(\beta_{0}\) is not meaningful. For the sake of completeness, we present the methods here for those situations in which \(\beta_{0}\) is meaningful.
 Confidence Interval for \(\beta_{0}\)

The formula for the confidence interval for \(\beta_{0}\), in words, is:
Sample estimate ± (t-multiplier × standard error)
and, in notation, is:
\(b_0 \pm t_{(\alpha/2, n-2)} \times \sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{\bar{x}^2}{\sum(x_i-\bar{x})^2}}\)
The resulting confidence interval gives us a range of values that is likely to contain the true unknown value \(\beta_{0}\). The factors affecting the length of a confidence interval for \(\beta_{0}\) are identical to the factors affecting the length of a confidence interval for \(\beta_{1}\).
An \(\alpha\)level hypothesis test for intercept parameter \(\beta_{0}\)
Again, we follow standard hypothesis test procedures. First, we specify the null and alternative hypotheses:
 Null hypothesis \(H_{0}\): \(\beta_{0}\) = some number \(\beta\)
 Alternative hypothesis \(H_{A}\): \(\beta_{0}\) ≠ some number \(\beta\)
The phrase "some number \(\beta\)" means that you can test whether or not the population intercept takes on any value. By default, Minitab conducts the hypothesis test for testing whether or not \(\beta_{0}\) is 0. But, the alternative hypothesis can also state that \(\beta_{0}\) is less than (<) some number \(\beta\) or greater than (>) some number \(\beta\).
Second, we calculate the value of the test statistic using the following formula:
\(t^*=\dfrac{b_0-\beta}{\sqrt{MSE} \sqrt{\dfrac{1}{n}+\dfrac{\bar{x}^2}{\sum(x_i-\bar{x})^2}}}=\dfrac{b_0-\beta}{se(b_0)}\)
Third, we use the resulting test statistic to calculate the P-value. Again, the P-value is the answer to the question "how likely is it that we'd get a test statistic t* as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to a t-distribution with n - 2 degrees of freedom.
Finally, we make a decision. If the P-value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. If we conduct a "two-tailed, not-equal-to-0" test, we conclude "there is sufficient evidence at the \(\alpha\) level to conclude that the mean of the responses is not 0 when x = 0." If the P-value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis.
Minitab®
Drawing conclusions about intercept parameter \(\beta_{0}\) using Minitab
Let's see how we can use Minitab to calculate confidence intervals and conduct hypothesis tests for the intercept \(\beta_{0}\). Minitab's regression analysis output for our skin cancer mortality and latitude example appears below. The work involved is very similar to that for the slope \(\beta_{1}\).
The line pertaining to the intercept, which Minitab always refers to as Constant, in the summary table of predictors has been bolded. It tells us that the estimated intercept coefficient \(b_{0}\), under the column labeled Coef, is 389.19. The estimated standard error of \(b_{0}\), denoted se(\(b_{0}\)), in the column labeled SE Coef is 23.81.
Analysis of Variance

Source  DF  Adj SS  Adj MS  F-Value  P-Value
Regression  1  36464  36464  99.80  0.000
Residual Error  47  17173  365
Total  48  53637

Model Summary

S  R-sq  R-sq(adj)
19.12  68.0%  67.3%

Coefficients

Predictor  Coef  SE Coef  T-Value  P-Value
Constant  389.19  23.81  16.34  0.000
Lat  -5.9776  0.5984  -9.99  0.000

Regression Equation

Mort = 389 - 5.98 Lat
By default, the test statistic is calculated assuming the user wants to test that the mean response is 0 when x = 0. Note that this is an ill-advised test here because the predictor values in the sample do not include a latitude of 0. That is, such a test involves extrapolating outside the scope of the model. Nonetheless, for the sake of illustration, let's proceed to assume that it is an okay thing to do.
Dividing the estimated coefficient of 389.19 by the estimated standard error of 23.81, Minitab reports that the test statistic T is 16.34. By default, the P-value is calculated assuming the alternative hypothesis is a "two-tailed, not-equal-to-0" hypothesis. Upon calculating the probability that a t random variable with n - 2 = 47 degrees of freedom would be larger than 16.34, and multiplying the probability by 2, Minitab reports that P is 0.000 (to three decimal places). That is, the P-value is less than 0.001.
Because the P-value is so small (less than 0.001), we can reject the null hypothesis and conclude that \(\beta_{0}\) does not equal 0. There is sufficient evidence, at the \(\alpha\) = 0.05 level, to conclude that the mean mortality rate at a latitude of 0 degrees North is not 0. (Again, note that we have to extrapolate in order to arrive at this conclusion, which in general is not advisable.)
Proceed as previously described to calculate a 95% confidence interval for \(\beta_{0}\). Use Minitab to find the t-multiplier for you. Again, it is \(t_{\left(0.025, 47\right)} = 2.0117 \). Then, the 95% confidence interval for \(\beta_{0}\) is \(389.19 ± 2.0117\left(23.81\right) = \left(341.3, 437.1\right) \). (Alternatively, Minitab can display the interval directly if you click the "Results" tab in the Regression dialog box, select "Expanded Table" and check "Coefficients.") We can be 95% confident that the population intercept is between 341.3 and 437.1. That is, we can be 95% confident that the mean mortality rate at a latitude of 0 degrees North is between 341.3 and 437.1 deaths per 10 million people. (Again, it is probably not a good idea to make this claim because of the severe extrapolation involved.)
Statistical inference conditions
We've made no mention yet of the conditions that must be true in order for it to be okay to use the above confidence interval formulas and hypothesis testing procedures for \(\beta_{0}\) and \(\beta_{1}\). In short, the "LINE" assumptions we discussed earlier — linearity, independence, normality, and equal variance — must hold. It is not a big deal if the error terms (and thus responses) are only approximately normal. If you have a large sample, then the error terms can even deviate somewhat far from normality.
Regression Through the Origin (RTO)
In rare circumstances, it may make sense to consider a simple linear regression model in which the intercept, \(\beta_{0}\), is assumed to be exactly 0. For example, suppose we have data on the number of items produced per hour along with the number of rejects in each of those time spans. If we have a period where no items were produced, then there are obviously 0 rejects. Such a situation may indicate deleting \(\beta_{0}\) from the model since \(\beta_{0}\) reflects the amount of the response (in this case, the number of rejects) when the predictor is assumed to be 0 (in this case, the number of items produced). Thus, the model to estimate becomes
\(\begin{equation*} y_{i}=\beta_{1}x_{i}+\epsilon_{i},\end{equation*}\)
which is called a Regression Through the Origin (or RTO) model. The estimate for \(\beta_{1}\) when using the regression through the origin model is:
\(b_{\textrm{RTO}}=\dfrac{\sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n}x_{i}^{2}}.\)
Thus, the estimated regression equation is
\(\begin{equation*} \hat{y}_{i}=b_{\textrm{RTO}}x_{i}\end{equation*}.\)
Note that we no longer have to center (or "adjust") the \(x_{i}\)'s and \(y_{i}\)'s by their sample means (compare this estimate for \(b_{1}\) to that of the estimate found for the simple linear regression model). Since there is no intercept, there is no correction factor and no adjustment for the mean (i.e., the regression line can only pivot about the point (0,0)).
Generally, regression through the origin is not recommended due to the following:
 Removal of \(\beta_{0}\) is a strong assumption that forces the line to go through the point (0,0). Imposing this restriction does not give ordinary least squares as much flexibility in finding the line of best fit for the data.
 In a simple linear regression model, \(\sum_{i=1}^{n}(y_{i}-\hat{y}_i)=\sum_{i=1}^{n}e_{i}=0\). However, in regression through the origin, generally \(\sum_{i=1}^{n}e_{i}\neq 0\). Because of this, the SSE could actually be larger than the SSTO, thus resulting in \(r^{2}<0\).
 Since \(r^{2}\) can be negative, the usual interpretation of this value as a measure of the strength of the linear component in the simple linear regression model cannot be used here.
If you strongly believe that a regression through the origin model is appropriate for your situation, then statistical testing can help justify your decision. Moreover, if data have not been collected near \(x=0\), then forcing the regression line through the origin is likely to produce a worse-fitting model. So again, this model is not usually recommended unless there is a strong belief that it is appropriate.
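The RTO estimator and the nonzero-residual-sum property noted above can be illustrated with toy numbers (the x and y values below are made up purely for illustration):

```python
# Sketch of the regression-through-the-origin estimator b_RTO and a
# check that its residuals no longer sum to zero. (Toy data only.)
x = [1, 2, 3]
y = [2, 3, 7]

# b_RTO = sum(x_i * y_i) / sum(x_i^2)
b_rto = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
residuals = [yi - b_rto * xi for xi, yi in zip(x, y)]

print(round(b_rto, 4))           # -> 2.0714
print(round(sum(residuals), 4))  # -> -0.4286, nonzero, unlike ordinary SLR
```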
To fit a "regression through the origin model in Minitab click "Model" in the regular regression window and then uncheck the "Include the constant term in the model."
2.2  Another Example of Slope Inference
Example 2-1
Is there a positive relationship between sales of leaded gasoline and the lead burden in the bodies of newborn infants? Researchers (Rabinowitz et al., 1984) who were interested in answering this research question compiled data (Lead Cord data) on the monthly gasoline lead sales (in metric tons) in Massachusetts and mean lead concentrations (µg/dl) in umbilical-cord blood of babies born at a major Boston hospital over 14 months in 1980-1981.
Analyzing their data, the researchers obtained the following Minitab fitted line plot:
and standard regression analysis output:
Analysis of Variance

Source  DF  Adj SS  Adj MS  F-Value  P-Value
Regression  1  3.7783  3.7783  9.95  0.008
Residual Error  12  4.5560  0.3797
Total  13  8.3343

Model Summary

S  R-sq  R-sq(adj)
0.616170  45.3%  40.8%

Coefficients

Predictor  Coef  SE Coef  T-Value  P-Value
Constant  4.1082  0.6088  6.75  0.000
Sold  0.014885  0.004719  3.15  0.008

Regression Equation

Cord = 4.11 + 0.0149 Sold
Minitab reports that the P-value for testing \(H_{0} \colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\) is 0.008. Therefore, since the test statistic is positive, the P-value for testing \(H_{0} \colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A} \colon \beta_{1} > 0\) is 0.008 ÷ 2 = 0.004. The P-value is less than 0.05. There is sufficient statistical evidence, at the 0.05 level, to conclude that \(\beta_{1} > 0\).
Furthermore, since the 95% t-multiplier is \(t_{\left(0.025, 12 \right)} = 2.1788\), a 95% confidence interval for \(\beta_{1}\) is:
0.014885 ± 2.1788(0.004719) or (0.0046, 0.0252).
The researchers can be 95% confident that the mean lead concentration in the umbilical-cord blood of Massachusetts babies increases between 0.0046 and 0.0252 µg/dl for every one-metric-ton increase in monthly gasoline lead sales in Massachusetts. It is up to the researchers to debate whether or not this is a meaningful increase.
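Both the one-sided P-value and the confidence interval above can be reproduced from the reported output values. A minimal sketch in Python:

```python
# Reproduce the one-sided P-value and the 95% CI for the slope
# from the values reported in the Minitab output above.
p_two_sided = 0.008
p_one_sided = p_two_sided / 2   # valid here because the t-statistic is positive

b1 = 0.014885        # Coef for Sold
se_b1 = 0.004719     # SE Coef for Sold
t_mult = 2.1788      # t(0.025, 12)

lower = b1 - t_mult * se_b1
upper = b1 + t_mult * se_b1
print(p_one_sided, round(lower, 4), round(upper, 4))  # -> 0.004 0.0046 0.0252
```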
2.3  Sums of Squares
Let's return to the skin cancer mortality example (Skin Cancer data) and investigate the research question, "Is there a (linear) relationship between skin cancer mortality and latitude?"
Review the following scatter plot and estimated regression line. What does the plot suggest for answering the above research question? The linear relationship looks fairly strong. The estimated slope is negative, not equal to 0.
We can answer the research question using the P-value of the t-test for testing:
 the null hypothesis \(H_{0} \colon \beta_{1} = 0\)
 against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).
As the Minitab output below suggests, the P-value of the t-test for "Lat" is less than 0.001. There is enough statistical evidence to conclude that the slope is not 0, that is, there is a linear relationship between skin cancer mortality and latitude.
There is an alternative method for answering the research question, which uses the analysis of variance F-test. Let's first look at what we are working towards understanding. The (standard) analysis of variance table for this data set is highlighted in the Minitab output below. There is a column labeled F-Value, which contains the F-test statistic, and a column labeled P-Value, which contains the P-value associated with the F-test. Notice that the P-value, 0.000, appears to be the same as the P-value, 0.000, for the t-test for the slope. The F-test similarly tells us that there is enough statistical evidence to conclude that there is a linear relationship between skin cancer mortality and latitude.
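In fact, for simple linear regression the two tests are formally linked: the F-statistic equals the square of the slope t-statistic. A quick check using the values from the output below:

```python
# For simple linear regression, F = t^2 for the slope test.
# Values taken from the Minitab output for the skin cancer data.
t_stat = -9.99   # T-Value for Lat
f_stat = 99.80   # F-Value in the ANOVA table

print(round(t_stat ** 2, 2))  # -> 99.8
```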
Analysis of Variance

Source  DF  Adj SS  Adj MS  F-Value  P-Value
Regression  1  36464  36464  99.80  0.000
Residual Error  47  17173  365
Total  48  53637

Model Summary

S  R-sq  R-sq(adj)
19.12  68.0%  67.3%

Coefficients

Predictor  Coef  SE Coef  T-Value  P-Value
Constant  389.19  23.81  16.34  0.000
Lat  -5.9776  0.5984  -9.99  0.000

Regression Equation

Mort = 389 - 5.98 Lat
Now, let's investigate what all the numbers in the table represent. Let's start with the column labeled Adj SS for "sums of squares." We considered sums of squares in Lesson 1 when we defined the coefficient of determination, \(r^2\), but now we consider them again in the context of the analysis of variance table.
The scatter plot of mortality and latitude appears again below, but now it is adorned with three labels:
 \(y_{i}\) denotes the observed mortality for the state i
 \(\hat{y}_i\) is the estimated regression line (solid line) and therefore denotes the estimated (or "fitted") mortality for the latitude of the state i
 \(\bar{y}\) represents what the line would look like if there were no relationship between mortality and latitude. That is, it denotes the "no relationship" line (dashed line). It is simply the average mortality of the sample.
If there is a linear relationship between mortality and latitude, then the estimated regression line should be "far" from the no relationship line. We just need a way of quantifying "far." The above three elements are useful in quantifying how far the estimated regression line is from the no relationship line. As illustrated by the plot, the two lines are quite far apart.
\(\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 =36464\)
\(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 =17173\)
\(\sum_{i=1}^{n}(y_i-\bar{y})^2 =53637\)
 Total Sum of Squares

The distance of each observed value \(y_{i}\) from the no relationship line \(\bar{y}\) is \(y_i - \bar{y}\). If you determine this distance for each data point, square each distance, and add up all of the squared distances, you get:
\(\sum_{i=1}^{n}(y_i-\bar{y})^2 =53637\)
Called the "total sum of squares," it quantifies how much the observed responses vary if you don't take into account their latitude.
 Regression Sum of Squares

The distance of each fitted value \(\hat{y}_i\) from the no relationship line \(\bar{y}\) is \(\hat{y}_i - \bar{y}\). If you determine this distance for each data point, square each distance, and add up all of the squared distances, you get:
\(\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 =36464\)
Called the "regression sum of squares," it quantifies how far the estimated regression line is from the no relationship line.
 Error Sum of Squares

The distance of each observed value \(y_{i}\) from the estimated regression line \(\hat{y}_i\) is \(y_i-\hat{y}_i\). If you determine this distance for each data point, square each distance, and add up all of the squared distances, you get:
\(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 =17173\)
Called the "error sum of squares," as you know, it quantifies how much the data points vary around the estimated regression line.
In short, we have illustrated that the total variation in observed mortality y (53637) is the sum of two parts — variation "due to" latitude (36464) and variation just due to random error (17173). (We are careful to put "due to" in quotes in order to emphasize that a change in latitude does not necessarily cause a change in mortality. All we could conclude is that latitude is "associated with" mortality.)
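The decomposition SSTO = SSR + SSE can be verified numerically. The sketch below uses a tiny made-up data set (not the skin cancer data) and checks that the three sums of squares add up:

```python
# Verify the decomposition SSTO = SSR + SSE on toy data.
# (The x, y values are hypothetical, for illustration only.)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

ssto = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # regression sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares

print(round(ssto, 6), round(ssr, 6), round(sse, 6))  # -> 6.0 3.6 2.4
```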
2.4  Sums of Squares (continued)
Investigating Height and GPA Data
Now, let's do a similar analysis to investigate the research question, "Is there a (linear) relationship between height and grade point average?" (Height and GPA data)
Review the following scatterplot and estimated regression line. What does the plot suggest for answering the above research question? In this case, it appears as if there is almost no relationship whatsoever. The estimated slope is almost 0.
Again, we can answer the research question using the P-value of the t-test for:
 testing the null hypothesis \(H_{0} \colon \beta_{1} = 0\)
 against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).
As the Minitab output below suggests, the P-value of the t-test for "height" is 0.761. There is not enough statistical evidence to conclude that the slope is not 0. We conclude that there is no linear relationship between height and grade point average.
The Minitab output also shows the analysis of variance table for this data set. Again, the P-value associated with the analysis of variance F-test, 0.761, appears to be the same as the P-value, 0.761, for the t-test for the slope. The F-test similarly tells us that there is insufficient statistical evidence to conclude that there is a linear relationship between height and grade point average.
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  0.0276  0.0276  0.09  0.761 
Residual Error  33  9.7055  0.2941  
Total  34  9.7331 
Model Summary
S = 0.5423, R-sq = 0.3%, R-sq(adj) = 0.0%
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value 

Constant  3.410  1.435  2.38  0.023 
height  -0.00656  0.02143  -0.31  0.761 
Regression Equation
gpa = 3.41 - 0.0066 height
The scatter plot of grade point average and height appears below, now adorned with the three labels:
 \(y_{i}\) denotes the observed grade point average for student i
 \(\hat{y}_i\) is the estimated regression line (solid line) and therefore denotes the estimated grade point average for the height of student i
 \(\bar{y}\) represents the "no relationship" line (dashed line) between height and grade point average. It is simply the average grade point average of the sample.
For this data set, note that the estimated regression line and the "no relationship" line are very close together. Let's see how the sums of squares summarize this point.
\(\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 =0.0276\)
\(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 =9.7055\)
\(\sum_{i=1}^{n}(y_i-\bar{y})^2 =9.7331\)
 The "total sum of squares," which again quantifies how much the observed grade point averages vary if you don't take into account height, is \(\sum_{i=1}^{n}(y_i\bar{y})^2 =9.7331\).
 The "regression sum of squares," which again quantifies how far the estimated regression line is from the no relationship line, is \(\sum_{i=1}^{n}(\hat{y}_i\bar{y})^2 =0.0276\).
 The "error sum of squares," which again quantifies how much the data points vary around the estimated regression line, is \(\sum_{i=1}^{n}(y_i\hat{y}_i)^2 =9.7055\).
In short, we have illustrated that the total variation in the observed grade point averages y (9.7331) is the sum of two parts — variation "due to" height (0.0276) and variation due to random error (9.7055). Unlike the last example, most of the variation in the observed grade point averages is just due to random error. It appears as if very little of the variation can be attributed to the predictor height.
Try It!
Sums of Squares
Some researchers at UCLA conducted a study on cyanotic heart disease in children. They measured the age at which the child spoke his or her first word (x, in months) and the Gesell adaptive score (y) on a sample of 21 children. Upon analyzing the resulting data, they obtained the following analysis of variance table:
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  1604.08  1604.08  13.20  0.002 
Residual Error  19  2308.59  121.50  
Total  20  3912.67 
2.5 - Analysis of Variance: The Basic Idea
Break down the total variation in y (the "total sum of squares (SSTO)") into two components:
 a component that is "due to" the change in x ("regression sum of squares (SSR)")
 a component that is just due to random error ("error sum of squares (SSE)")
If the regression sum of squares is a "large" component of the total sum of squares, it suggests that there is a linear association between the predictor x and the response y.
Here is a simple picture illustrating how the distance \(y_i-\bar{y}\) is decomposed into the sum of two distances, \(\hat{y}_i-\bar{y}\) and \(y_i-\hat{y}_i\).
Although the derivation isn't as simple as it seems, the decomposition holds for the sum of the squared distances, too:
\(\underbrace{\left(\sum\limits_{i=1}^{n}(y_i-\bar{y})^2\right)}_{\underset{\text{Total Sum of Squares}}{\text{SSTO}}} = \underbrace{\sum\limits_{i=1} ^{n} \left( \hat{y}_{i} - \overline{y} \right)^{2}}_{\underset{\text{Regression Sum of Squares}}{\text{SSR}}} + \underbrace{\sum\limits_{i=1} ^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}}_{\underset{\text{Error Sum of Squares}}{\text{SSE}}}\)
\(\text{SSTO} = \text{SSR} + \text{SSE}\)
The degrees of freedom associated with each of these sums of squares follow a similar decomposition.
 You might recognize SSTO as being the numerator of the sample variance. Recall that the denominator of the sample variance is n - 1. Therefore, n - 1 is the degrees of freedom associated with SSTO.
 Recall that the mean square error MSE is obtained by dividing SSE by n - 2. Therefore, n - 2 is the degrees of freedom associated with SSE.
Then, we obtain the following breakdown of the degrees of freedom:
\(\underset{\substack{\text{degrees of freedom}\\ \text{associated with SSTO}}}{\left(n-1\right)} = \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSR}}}{\left(1\right)} + \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSE}}}{\left(n-2\right)}\)
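To see the decomposition work numerically, here is a minimal sketch in Python using NumPy and a small made-up data set (not one of the data sets in this lesson); `np.polyfit` plays the role of the least squares fitter:

```python
import numpy as np

# Small illustrative data set (made up for this sketch)
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([12.0, 15.0, 14.0, 20.0, 22.0, 25.0])

# Fit the simple linear regression line by least squares
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
y_bar = y.mean()

ssto = np.sum((y - y_bar) ** 2)     # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)      # error sum of squares

# The decomposition SSTO = SSR + SSE holds (up to floating-point rounding)
assert np.isclose(ssto, ssr + sse)

# Degrees of freedom decompose the same way: (n - 1) = 1 + (n - 2)
n = len(x)
assert (n - 1) == 1 + (n - 2)
```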
2.6 - The Analysis of Variance (ANOVA) table and the F-test
Analysis of Variance for Skin Cancer Data
We've covered quite a bit of ground. Let's review the analysis of variance table for the example concerning skin cancer mortality and latitude (Skin Cancer data).
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  36464  36464  99.80  0.000 
Residual Error  47  17173  365  
Total  48  53637 
Model Summary
S  R-sq  R-sq(adj) 

19.12  68.0%  67.3% 
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value 

Constant  389.19  23.81  16.34  0.000 
Lat  -5.9776  0.5984  -9.99  0.000 
Regression Equation
Mort = 389 - 5.98 Lat
Recall that there were 49 states in the data set.
 The degrees of freedom associated with SSR will always be 1 for the simple linear regression model. The degrees of freedom associated with SSTO is n - 1 = 49 - 1 = 48. The degrees of freedom associated with SSE is n - 2 = 49 - 2 = 47. And the degrees of freedom add up: 1 + 47 = 48.
 The sums of squares add up: SSTO = SSR + SSE. That is, here: 53637 = 36464 + 17173.
Let's tackle a few more columns of the analysis of variance table, namely the "mean square" column, labeled MS, and the F-statistic column, labeled F.
Definitions of mean squares
We already know the "mean square error (MSE)" is defined as:
\(MSE=\dfrac{\sum(y_i-\hat{y}_i)^2}{n-2}=\dfrac{SSE}{n-2}\)
That is, we obtain the mean square error by dividing the error sum of squares by its associated degrees of freedom n - 2. Similarly, we obtain the "regression mean square (MSR)" by dividing the regression sum of squares by its degrees of freedom 1:
\(MSR=\dfrac{\sum(\hat{y}_i-\bar{y})^2}{1}=\dfrac{SSR}{1}\)
Of course, that means the regression sum of squares (SSR) and the regression mean square (MSR) are always identical for the simple linear regression model.
Now, why do we care about mean squares? Because their expected values suggest how to test the null hypothesis \(H_{0} \colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A} \colon \beta_{1} ≠ 0\).
Expected mean squares
Imagine taking many, many random samples of size n from some population, estimating the regression line, and determining MSR and MSE for each data set obtained. It has been shown that the average (that is, the expected value) of all of the MSRs you can obtain equals:
\(E(MSR)=\sigma^2+\beta_{1}^{2}\sum_{i=1}^{n}(X_i-\bar{X})^2\)
Similarly, it has been shown that the average (that is, the expected value) of all of the MSEs you can obtain equals:
\(E(MSE)=\sigma^2\)
These expected values suggest how to test \(H_{0} \colon \beta_{1} = 0\) versus \(H_{A} \colon \beta_{1} ≠ 0\):
 If \(\beta_{1} = 0\), then we'd expect the ratio MSR/MSE to equal 1.
 If \(\beta_{1} ≠ 0\), then we'd expect the ratio MSR/MSE to be greater than 1.
These two facts suggest that we should use the ratio, MSR/MSE, to determine whether or not \(\beta_{1} = 0\). Note, however, that the F-test cannot be used:
 to test \(H_{0} \colon \beta_{1} = 0\) versus \(H_{A} \colon \beta_{1} < 0\)
 or to test \(H_{0} \colon \beta_{1} = 0\) versus \(H_{A} \colon \beta_{1} > 0\).
For one-sided alternatives such as these, you must use the t-test for the slope instead.
We have now completed our investigation of all of the entries of a standard analysis of variance table. The formula for each entry is summarized for you in the following analysis of variance table:
Source of Variation  DF  SS  MS  F 

Regression  1  \(SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2\)  \(MSR=\dfrac{SSR}{1}\)  \(F^*=\dfrac{MSR}{MSE}\) 
Residual error  n - 2  \(SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2\)  \(MSE=\dfrac{SSE}{n-2}\)  
Total  n - 1  \(SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2\) 
However, we will always let Minitab do the dirty work of calculating the values for us. Why is the ratio MSR/MSE labeled F* in the analysis of variance table? That's because the ratio is known to follow an F distribution with 1 numerator degree of freedom and n - 2 denominator degrees of freedom. For this reason, it is often referred to as the analysis of variance F-test. The following section summarizes the formal F-test.
The formal F-test for the slope parameter \(\beta_{1}\)
The null hypothesis is \(H_{0} \colon \beta_{1} = 0\).
The alternative hypothesis is \(H_{A} \colon \beta_{1} ≠ 0\).
The test statistic is \(F^*=\dfrac{MSR}{MSE}\).
As always, the P-value is obtained by answering the question: "What is the probability that we'd get an F* statistic as large as we did if the null hypothesis is true?"
The P-value is determined by comparing F* to an F distribution with 1 numerator degree of freedom and n - 2 denominator degrees of freedom.
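If you want to reproduce the P-value outside of Minitab, it can be computed directly from the F distribution. A minimal sketch using SciPy with the skin cancer example's mean squares (the F-value differs slightly from Minitab's 99.80 only because the table's MSE is rounded to 365):

```python
from scipy import stats

# Skin cancer example: F* = MSR / MSE, compared against an
# F distribution with 1 and n - 2 = 47 degrees of freedom.
msr = 36464.0
mse = 365.0
f_star = msr / mse  # ≈ 99.9 (Minitab reports 99.80 from unrounded MSE)

p_value = stats.f.sf(f_star, 1, 47)  # upper-tail area beyond F*
# p_value is far below 0.001, which Minitab displays as 0.000
```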
In reality, we are going to let Minitab calculate the F* statistic and the P-value for us. Let's try it out on a new example!
2.7 - Example: Are Men Getting Faster?
Example 2-2: Men's 200m Data
The following data set (Men's 200m data) contains the winning times (in seconds) of the 22 men's 200-meter Olympic sprints held between 1900 and 1996. (Notice that the Olympics were not held during the World War I and II years.) Is there a linear relationship between the year and the winning times? The plot of the estimated regression line sure makes it look so!
To answer the research question, let's conduct the formal F-test of the null hypothesis \(H_{0}\colon \beta_{1} = 0\) against the alternative hypothesis \(H_{A}\colon \beta_{1} ≠ 0\).
The analysis of variance table below was obtained in Minitab.
Source  DF  SS  MS  F  P 

Regression  1  15.8  15.8  177.7  0.000 
Residual Error  20  1.8  0.09  
Total  21  17.6 
From a scientific point of view, what we ultimately care about is the P-value, which Minitab indicates is 0.000 (to three decimal places). That is, the P-value is less than 0.001. The P-value is very small. It is unlikely that we would have obtained such a large F* statistic if the null hypothesis were true. Therefore, we reject the null hypothesis \(H_{0}\colon \beta_{1} = 0\) in favor of the alternative hypothesis \(H_{A}\colon \beta_{1} ≠ 0\). There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a linear relationship between year and winning time.
Equivalence of the analysis of variance F-test and the t-test
As we noted in the first two examples, the P-value associated with the t-test is the same as the P-value associated with the analysis of variance F-test. This will always be true for the simple linear regression model. It is illustrated in the year and winning time example also. Both P-values are 0.000 (to three decimal places):
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value 

Constant  76.153  4.152  18.34  0.000 
Year  -0.0284  0.00213  -13.33  0.000 
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  15.796  15.796  177.7  0.000 
Residual Error  20  1.778  0.089  
Total  21  17.574 
The P-values are the same because of a well-known relationship between a t random variable and an F random variable that has 1 numerator degree of freedom. Namely:
\((t^{*}_{(n2)})^2=F^{*}_{(1,n2)}\)
This will always hold for the simple linear regression model. This relationship is demonstrated in this example:
\(\left(-13.33\right)^{2} = 177.7\)
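This relationship is easy to verify numerically. A small sketch using SciPy and the values from the Minitab output above:

```python
import math
from scipy import stats

t_star = -13.33  # t-statistic for Year from the Minitab output
df = 20          # n - 2 = 22 - 2

f_star = t_star ** 2  # ≈ 177.7, matching the ANOVA F-value

# The two-tailed t-test P-value equals the upper-tail F-test P-value
p_t = 2 * stats.t.sf(abs(t_star), df)
p_f = stats.f.sf(f_star, 1, df)
assert math.isclose(p_t, p_f, rel_tol=1e-6)
```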
In short:
 For a given significance level \(\alpha\), the F-test of \(\beta_{1} = 0\) versus \(\beta_{1} ≠ 0\) is algebraically equivalent to the two-tailed t-test.
 We will get exactly the same P-values, so…
 If one test rejects \(H_{0}\), then so will the other.
 If one test does not reject \(H_{0}\), then neither will the other.
The natural question then is ... when should we use the F-test and when should we use the t-test?
 The F-test is only appropriate for testing that the slope differs from 0 (\(\beta_{1} ≠ 0\)).
 Use the t-test to test that the slope is positive (\(\beta_{1} > 0\)) or negative (\(\beta_{1} < 0\)). Remember, though, that you will have to divide the P-value that Minitab reports by 2 to get the appropriate P-value.
The F-test is more useful for the multiple regression model when we want to test that more than one slope parameter is 0. We'll learn more about this later in the course!
Try it!
The ANOVA F-test
Height of white spruce trees
In forestry, the diameter of a tree at breast height (which is fairly easy to measure) is used to predict the height of a tree (a difficult measurement to obtain). Silviculturists working in British Columbia's boreal forest conducted a series of spacing trials to predict the heights of several species of trees. The data set White Spruce data contains the breast height diameters (in centimeters) and heights (in meters) for a sample of 36 white spruce trees.
 Is there sufficient evidence to conclude that there is a linear association between breast height diameter and tree height? Justify your response by looking at the fitted line plot and by conducting the analysis of variance F-test. In conducting the F-test, specify the null and alternative hypotheses, the significance level you used, and your final conclusion. (See Minitab Help: Creating a fitted line plot and Performing a basic regression analysis).
 Which value in the ANOVA table quantifies how far the estimated regression line is from the "no trend" line? That is, what is the particular value for this data set?
 Use the Minitab output to illustrate, for this example, the relationship between the t-test and the ANOVA F-test for testing \(H_{0} \colon \beta_{1} = 0\) against \(H_{A} \colon \beta_{1} ≠ 0\).
2.8 - Equivalent linear relationship tests
Investigating Husband and Wife Data
It should be noted that the three hypothesis tests we have learned for testing the existence of a linear relationship — the t-test for \(H_{0} \colon \beta_{1}= 0\), the ANOVA F-test for \(H_{0} \colon \beta_{1} = 0\), and the t-test for \(H_{0} \colon \rho = 0\) — will always yield the same results. For example, when evaluating whether or not a linear relationship exists between a husband's age and his wife's age, if we treat the husband's age ("HAge") as the response and the wife's age ("WAge") as the predictor, each test yields a P-value of 0.000... < 0.001 (Husband and Wife data):
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  20577  20577  1242.51  0.000 
Error  168  2782  17  
Total  169  23359 
Model Summary
S  R-sq  R-sq(adj)  R-sq(pred) 

4.06946  88.09%  88.02%  87.84% 
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value 

Constant  3.590  1.159  3.10  0.002 
WAge  0.96670  0.02742  35.25  0.000 
Regression Equation
HAge = 3.59 + 0.967 WAge
*48 rows unused
Correlation: HAge, WAge
Pearson correlation  0.939 

P-Value  0.000 
And similarly, if we treat the wife's age ("WAge") as the response and the husband's age ("HAge") as the predictor, each test yields a P-value of 0.000... < 0.001:
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  19396  19396  1242.51  0.000 
Error  168  2623  16  
Total  169  22019 
Model Summary
S  R-sq  R-sq(adj) 

3.951  88.1%  88.0% 
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value 

Constant  1.574  1.150  1.37  0.173 
HAge  0.91124  0.02585  35.25  0.000 
Regression Equation
WAge = 1.57 + 0.911 HAge
*48 rows unused
Correlation: WAge, HAge
Pearson Correlation  0.939 

P-Value  0.000 
Technically, then, it doesn't matter which test you use to obtain the P-value. You will always get the same P-value. But, you should report the results of the test that makes sense for your particular situation:
 If one of the variables can be clearly identified as the response, report the results of the t-test or F-test for testing \(H_{0} \colon \beta_{1} =0\). (Does it make sense to use x to predict y?)
 If it is not obvious which variable is the response, report the results of the t-test for testing \(H_{0} \colon \rho = 0\). (Does it only make sense to look for an association between x and y?)
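The equivalence of the slope t-test and the correlation t-test can be demonstrated in a few lines. A sketch using SciPy with a small made-up set of paired ages (not the actual Husband and Wife data):

```python
import math
from scipy import stats

# Made-up paired ages, for illustration only
w_age = [23, 25, 28, 31, 34, 38, 41, 45, 52, 60]
h_age = [25, 24, 30, 33, 33, 40, 44, 44, 55, 63]

# t-test for H0: beta1 = 0, from regressing husband's age on wife's age
slope_test = stats.linregress(w_age, h_age)

# t-test for H0: rho = 0, from the Pearson correlation
r, p_corr = stats.pearsonr(w_age, h_age)

# In simple linear regression the two tests give identical P-values
assert math.isclose(slope_test.pvalue, p_corr, rel_tol=1e-6)
```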
2.9 - Notation for the Lack of Fit test
To conclude this lesson we'll digress slightly to consider the lack of fit test for linearity — the "L" in the "LINE" conditions. The reason we consider this here is that, like the earlier ANOVA F-test, this test is an F-test based on decomposing sums of squares.
However, before we "derive" the lack of fit F-test, it is important to note that the test requires repeat observations — called "replicates" — for at least one of the values of the predictor x. That is, if each x value in the data set is unique, then the lack of fit test can't be conducted on the data set. Even when we do have replicates, we typically need quite a few for the test to have any power. As such, this test generally only applies to specific types of data sets with plenty of replicates.
As is often the case before we learn a new hypothesis test, we have to get some new notation under our belt. In doing so, we'll look at some (contrived) data that purports to describe the relationship between the size of the minimum deposit required when opening a new checking account at a bank (x) and the number of new accounts at the bank (y) (New Accounts data). Suppose the trend in the data looks curved, but we fit a line through the data nonetheless:
For each of the specific x values (75, 100, 125, 150, 175, and 200), there is standard notation used for the lack of fit F-test. Let's take the case where x = 75 dollars:
 \(y_{11}\) denotes the first measurement (28) made at the first x-value (x = 75) in the data set
 \(y_{12}\) denotes the second measurement (42) made at the first x-value (x = 75) in the data set
 \(\bar{y}_{1}\) denotes the average (35) of all of the y values at the first x-value (x = 75)
 \(\hat{y}_{11}\) denotes the predicted response (87.5) for the first measurement made at the first x-value (x = 75)
 \(\hat{y}_{12}\) denotes the predicted response (87.5) for the second measurement made at the first x-value (x = 75)
The notation for the other x values (100, 125, and so on) follows the same pattern. In general:
 \(y_{ij}\) denotes the \(j^{th}\) measurement made at the \(i^{th}\) x-value in the data set
 \(\bar{y}_{i}\) denotes the average of all of the y values at the \(i^{th}\) x-value
 \(\hat{y}_{ij}\) denotes the predicted response for the \(j^{th}\) measurement made at the \(i^{th}\) x-value
2.10 - Decomposing the Error
Example 2-3
If you think about it, there are two different explanations for why our data points might not fall right on the estimated regression line. One possibility is that our regression model doesn't describe the trend in the data well enough. That is, the model may exhibit a "lack of fit." The second possibility is that, as is often the case, there is just random variation in the data. This realization suggests that we should decompose the error into two components — one part due to the lack of fit of the model and the second part just due to random error. If most of the error is due to lack of fit, and not just random error, it suggests that we should scrap our model and try a different one.
Let's try decomposing the error in the checking account example (New Accounts data). Recall that the prediction error for any data point is the distance of the observed response from the predicted response, i.e., \(y_{ij}-\hat{y}_{ij}\). (Can you identify these distances on the plot of the data below?) To quantify the total error of prediction, we determine this distance for each data point, square the distance, and add up all of the squared distances to get:
\(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2\)
Not surprisingly, this quantity is called the "error sum of squares" and is denoted SSE. The error sum of squares for our checking account example is \(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2=14742\).
If a line fits the data well, then the average of the observed responses at each x-value should be close to the predicted response for that x-value. Therefore, to determine how much of the total error is due to the lack of model fit, we determine how far the average observed response at each x-value is from the predicted response of each data point. That is, we calculate the distance \(\bar{y}_{i}-\hat{y}_{ij}\). To quantify the total lack of fit, we determine this distance for each data point, square the distance, and add up all of the squared distances to get:
\(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2\)
Not surprisingly, this quantity is called the "lack of fit sum of squares" and is denoted SSLF. The lack of fit sum of squares for our checking account example is \(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2=13594\).
To determine how much of the total error is due to just random error, we determine how far each observed response is from the average observed response at each x-value. That is, we calculate the distance \(y_{ij}-\bar{y}_{i}\). To quantify the total pure error, we determine this distance for each data point, square the distance, and add up all of the squared distances to get:
\(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2\)
Not surprisingly, this quantity is called the "pure error sum of squares" and is denoted SSPE. The pure error sum of squares for our checking account example is \(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2=1148\).
\(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2=14742\)
\(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2=13594\)
\(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2=1148\)
In summary, we've shown in this checking account example that most of the error (SSE = 14742) is attributed to the lack of a linear fit (SSLF = 13594) and not just to random error (SSPE = 1148).
Example 2-4
Let's see how our decomposition of the error works with a different example — one in which a line fits the data well. Suppose the relationship between the size of the minimum deposit required when opening a new checking account at a bank (x) and the number of new accounts at the bank (y) instead looks like this:
\(\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2=45.1\)
\(\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2=6.6\)
\(\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2=38.5\)
In this case, as we would expect based on the plot, very little of the total error (SSE = 45.1) is due to a lack of a linear fit (SSLF = 6.6). Most of the error appears to be due to just random variation in the number of checking accounts (SSPE = 38.5).
In summary
The basic idea behind decomposing the total error is:
 We break down the residual error ("error sum of squares" — denoted SSE) into two components:
 a component that is due to a lack of model fit ("lack of fit sum of squares" — denoted SSLF)
 a component that is due to pure random error ("pure error sum of squares" — denoted SSPE)
 If the lack of fit sum of squares is a large component of the residual error, it suggests that a linear function is inadequate.
Here is a simple picture illustrating how the distance \(y_{ij}-\hat{y}_{ij}\) is decomposed into the sum of two distances, \(\bar{y}_{i}-\hat{y}_{ij}\) and \(y_{ij}-\bar{y}_{i}\).
Although the derivation isn't as simple as it seems, the decomposition holds for the sum of the squared distances as well:
\(\underbrace{\sum\limits_{i=1}^c \sum\limits_{j=1}^{n_i} \left(y_{ij} - \hat{y}_{ij}\right)^{2}}_{\underset{\text{Error Sum of Squares}}{\text{SSE}}} = \underbrace{\sum\limits_{i=1}^c \sum\limits_{j=1}^{n_i} \left(\overline{y}_{i} - \hat{y}_{ij}\right)^{2}}_{\underset{\text{Lack of Fit Sum of Squares}}{\text{SSLF}}} + \underbrace{\sum\limits_{i=1}^c \sum\limits_{j=1}^{n_i} \left(y_{ij} - \overline{y}_{i}\right)^{2}}_{\underset{\text{Pure Error Sum of Squares}}{\text{SSPE}}}\)
SSE = SSLF + SSPE
The degrees of freedom associated with each of these sums of squares follow a similar decomposition.
 As before, the degrees of freedom associated with SSE is n - 2. (The 2 comes from the fact that you estimate 2 parameters — the slope and the intercept — whenever you fit a line to a set of data.)
 The degrees of freedom associated with SSLF is c - 2, where c denotes the number of distinct x values you have.
 The degrees of freedom associated with SSPE is n - c, where again c denotes the number of distinct x values you have.
You might notice that the degrees of freedom breakdown as:
\(\underset{\substack{\text{degrees of freedom}\\ \text{associated with SSE}}}{\left(n-2\right)} = \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSLF}}}{\left(c-2\right)} + \underset{\substack{\text{degrees of freedom}\\ \text{associated with SSPE}}}{\left(n-c\right)}\)
where again c denotes the number of distinct x values you have.
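To make the error decomposition concrete, here is a sketch in Python using NumPy and a small made-up data set with replicates (not the New Accounts data); it verifies both SSE = SSLF + SSPE and the degrees-of-freedom breakdown:

```python
import numpy as np

# Made-up data with replicates: two y values at most distinct x values
x = np.array([75, 75, 100, 100, 125, 125, 150, 150, 175, 175, 200])
y = np.array([30, 40, 80, 95, 120, 130, 135, 150, 140, 150, 130], dtype=float)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# Group mean of y at each distinct x value (the y-bar_i's)
group_mean = {xi: y[x == xi].mean() for xi in np.unique(x)}
y_bar_i = np.array([group_mean[xi] for xi in x])

sse = np.sum((y - y_hat) ** 2)         # error sum of squares
sslf = np.sum((y_bar_i - y_hat) ** 2)  # lack of fit sum of squares
sspe = np.sum((y - y_bar_i) ** 2)      # pure error sum of squares

# SSE = SSLF + SSPE (exact, up to floating-point rounding)
assert np.isclose(sse, sslf + sspe)

# Degrees of freedom: (n - 2) = (c - 2) + (n - c)
n, c = len(x), len(np.unique(x))
assert n - 2 == (c - 2) + (n - c)
```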
2.11 - The Lack of Fit F-test
Investigating New Accounts Data
We're almost there! We just need to determine an objective way of deciding when too much of the error in our prediction is due to a lack of model fit. That's where the lack of fit F-test comes into play. Let's return to the first checking account example (New Accounts data):
Jumping ahead to the punchline, here's Minitab's output for the lack of fit F-test for this data set:
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  5141  5141  3.14  0.110 
Residual Error  9  14742  1638  
Lack of Fit  4  13594  3398  14.80  0.006 
Pure Error  5  1148  230  
Total  10  19883 
1 row with no replicates
As you can see, the lack of fit output appears as a portion of the analysis of variance table. In the Sum of Squares ("SS") column, we see — as we previously calculated — that SSLF = 13594 and SSPE = 1148 sum to SSE = 14742. We also see in the Degrees of Freedom ("DF") column that — since there are n = 11 data points and c = 6 distinct x values (75, 100, 125, 150, 175, and 200) — the lack of fit degrees of freedom c - 2 = 4 and the pure error degrees of freedom n - c = 5 sum to the error degrees of freedom n - 2 = 9.
Just as is done for the sums of squares in the basic analysis of variance table, the lack of fit sum of squares and the pure error sum of squares are used to calculate "mean squares." They are even calculated similarly, namely by dividing the sum of squares by their associated degrees of freedom. Here are the formal definitions of the mean squares:
\(MSLF=\dfrac{SSLF}{c-2} \qquad \text{and} \qquad MSPE=\dfrac{SSPE}{n-c}\)
In the Mean Squares ("MS") column, we see that the lack of fit mean square MSLF is 13594 divided by 4, or 3398. The pure error mean square MSPE is 1148 divided by 5, or 230:
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  5141  5141  3.14  0.110 
Residual Error  9  14742  1638  
Lack of Fit  4  13594  3398  14.80  0.006 
Pure Error  5  1148  230  
Total  10  19883 
You might notice that the lack of fit F-statistic is calculated by dividing the lack of fit mean square (MSLF = 3398) by the pure error mean square (MSPE = 230) to get 14.80. How do we know that this F-statistic helps us in testing the hypotheses:
 \(H_{0 }\): The relationship assumed in the model is reasonable, i.e., there is no lack of fit.
 \(H_{A }\): The relationship assumed in the model is not reasonable, i.e., there is a lack of fit.
The answer lies in the "expected mean squares." In our sample of n = 11 newly opened checking accounts, we obtained MSLF = 3398. If we had taken a different random sample of size n = 11, we would have obtained a different value for MSLF. Theory tells us that the average of all of the possible MSLF values we could obtain is:
\(E(MSLF) =\sigma^2+\dfrac{\sum n_i(\mu_i-(\beta_0+\beta_1X_i))^2}{c-2}\)
That is, we should expect MSLF, on average, to equal the above quantity — \(\sigma^{2}\) plus another messy-looking term. Think about that messy term. If the null hypothesis is true, i.e., if the relationship between the predictor x and the response y is linear, then \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\) and the messy term becomes 0 and goes away. That is, if there is no lack of fit, we should expect the lack of fit mean square MSLF to equal \(\sigma^{2}\).
What should we expect MSPE to equal? Theory tells us it should, on average, always equal \(\sigma^{2}\):
\(E(MSPE) =\sigma^2\)
Aha — there we go! The logic behind the calculation of the F-statistic is now clear:
 If there is a linear relationship between x and y, then \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\). That is, there is no lack of fit in the simple linear regression model. We would expect the ratio MSLF/MSPE to be close to 1.
 If there is not a linear relationship between x and y, then \(\mu_{i} ≠ \beta_{0} + \beta_{1}X_{i}\). That is, there is a lack of fit in the simple linear regression model. We would expect the ratio MSLF/MSPE to be large, i.e., a value greater than 1.
So, to conduct the lack of fit test, we calculate the value of the Fstatistic:
\(F^*=\dfrac{MSLF}{MSPE}\)
and determine if it is large. To decide if it is large, we compare the F*-statistic to an F-distribution with c - 2 numerator degrees of freedom and n - c denominator degrees of freedom.
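The resulting P-value can be reproduced directly from the F distribution. A minimal sketch using SciPy and the mean squares from the New Accounts output (the F-value differs slightly from Minitab's 14.80 only because the table's mean squares are rounded):

```python
from scipy import stats

# New Accounts example: MSLF = 3398, MSPE = 230,
# with c - 2 = 4 and n - c = 5 degrees of freedom.
mslf = 3398.0
mspe = 230.0
f_star = mslf / mspe  # ≈ 14.77 (Minitab reports 14.80 from unrounded values)

p_value = stats.f.sf(f_star, 4, 5)  # ≈ 0.006, matching Minitab's P-value
```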
In summary
We follow standard hypothesis test procedures in conducting the lack of fit F-test. First, we specify the null and alternative hypotheses:
 \(H_{0}\): The relationship assumed in the model is reasonable, i.e., there is no lack of fit in the model \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\).
 \(H_{A}\): The relationship assumed in the model is not reasonable, i.e., there is lack of fit in the model \(\mu_{i} = \beta_{0} + \beta_{1}X_{i}\).
Second, we calculate the value of the F-statistic:
\(F^*=\dfrac{MSLF}{MSPE}\)
To do so, we complete the analysis of variance table using the following formulas.
Analysis of Variance
Source  DF  SS  MS  F 

Regression  1  \(SSR=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(\hat{y}_{ij}-\bar{y})^2\)  \(MSR=\dfrac{SSR}{1}\)  \(F=\dfrac{MSR}{MSE}\) 
Residual Error  n - 2  \(SSE=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\hat{y}_{ij})^2\)  \(MSE=\dfrac{SSE}{n-2}\)  
Lack of Fit  c - 2  \(SSLF=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(\bar{y}_{i}-\hat{y}_{ij})^2\)  \(MSLF=\dfrac{SSLF}{c-2}\)  \(F^*=\dfrac{MSLF}{MSPE}\) 
Pure Error  n - c  \(SSPE=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i})^2\)  \(MSPE=\dfrac{SSPE}{n-c}\)  
Total  n - 1  \(SSTO=\sum_{i=1}^{c}\sum_{j=1}^{n_i}(y_{ij}-\bar{y})^2\) 
In reality, we let statistical software such as Minitab determine the analysis of variance table for us.
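To make the table's formulas concrete, here is a small Python sketch (the data values are made up for illustration; scipy/Minitab would do this for you) that computes the decomposition SSE = SSLF + SSPE for a dataset with replicates:

```python
import numpy as np

# Hypothetical replicated data: two observations of y at each of four x values.
x = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y = np.array([2.1, 2.3, 3.9, 4.2, 6.1, 5.8, 8.2, 7.9])

# Fit simple linear regression by least squares.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

levels = np.unique(x)                          # the c distinct x values
group_means = {v: y[x == v].mean() for v in levels}
ybar_i = np.array([group_means[v] for v in x])  # group mean for each observation

SSE = np.sum((y - fitted) ** 2)    # residual error
SSPE = np.sum((y - ybar_i) ** 2)   # pure error
SSLF = np.sum((ybar_i - fitted) ** 2)  # lack of fit; note SSE = SSLF + SSPE

n, c = len(y), len(levels)
MSLF, MSPE = SSLF / (c - 2), SSPE / (n - c)
F_star = MSLF / MSPE
```

The decomposition SSE = SSLF + SSPE holds exactly because, within a group sharing the same x value, the fitted value is constant and the deviations from the group mean sum to zero.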
Third, we use the resulting F*-statistic to calculate the P-value. As always, the P-value is the answer to the question "how likely is it that we’d get an F*-statistic as extreme as we did if the null hypothesis were true?" The P-value is determined by referring to an F-distribution with c − 2 numerator degrees of freedom and n − c denominator degrees of freedom.
Finally, we make a decision:
 If the P-value is smaller than the significance level \(\alpha\), we reject the null hypothesis in favor of the alternative. We conclude that "there is sufficient evidence at the \(\alpha\) level to conclude that there is a lack of fit in the simple linear regression model."
 If the P-value is larger than the significance level \(\alpha\), we fail to reject the null hypothesis. We conclude "there is not enough evidence at the \(\alpha\) level to conclude that there is a lack of fit in the simple linear regression model."
For our checking account example, we obtain:
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  5141  5141  3.14  0.110 
Residual Error  9  14742  1638  
Lack of Fit  4  13594  3398  14.80  0.006 
Pure Error  5  1148  230  
Total  10  19883 
the F*-statistic is 14.80 and the P-value is 0.006. The P-value is smaller than the significance level \(\alpha = 0.05\), so we reject the null hypothesis in favor of the alternative. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a lack of fit in the simple linear regression model. In light of the scatterplot, the lack of fit test provides the answer we expected.
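The table entries are easy to reproduce by hand. A Python sketch confirming the arithmetic (the sums of squares come from the table above; c = 6 distinct deposit sizes and n = 11 observations follow from the degrees of freedom; scipy is used only as a checking tool):

```python
from scipy.stats import f

SSLF, SSPE = 13594, 1148
c, n = 6, 11                           # since c - 2 = 4 and n - c = 5

MSLF = SSLF / (c - 2)                  # 3398.5, printed as 3398
MSPE = SSPE / (n - c)                  # 229.6, printed as 230
F_star = MSLF / MSPE                   # about 14.80
p_value = f.sf(F_star, c - 2, n - c)   # about 0.006
```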
Try it!
The lack of fit test
Fill in the missing numbers (??) in the following analysis of variance table resulting from a simple linear regression analysis.
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  ??  12.597  ??  ??  0.000 
Residual Error  ??  ??  ??  
Lack of Fit  3  ??  ??  ??  ?? 
Pure Error  ??  0.157  ??  
Total  14  15.522 
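One way to work through the exercise is to let the identities among the table entries do the work. A Python sketch (using only the numbers given in the table; scipy is an outside tool, not part of the lesson) that fills in the blanks:

```python
from scipy.stats import f

# Given entries from the table.
SSR, SSTO, SSPE = 12.597, 15.522, 0.157
df_total, df_lof = 14, 3

# Degrees of freedom: n - 1 = 14 and c - 2 = 3, so n = 15 and c = 5.
n, c = df_total + 1, df_lof + 2
df_reg, df_error, df_pe = 1, n - 2, n - c

# Sums of squares from SSTO = SSR + SSE and SSE = SSLF + SSPE.
SSE = SSTO - SSR
SSLF = SSE - SSPE

# Mean squares and F-statistics.
MSR, MSE = SSR / df_reg, SSE / df_error
MSLF, MSPE = SSLF / df_lof, SSPE / df_pe
F_reg = MSR / MSE
F_lof = MSLF / MSPE
p_lof = f.sf(F_lof, df_lof, df_pe)
```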
2.12  Further Examples
Example 2-5: Highway Sign Reading Distance and Driver Age
The data are n = 30 observations on driver age and the maximum distance (feet) at which individuals can read a highway sign (Sign Distance data).
(Data source: Mind On Statistics, 3rd edition, Utts and Heckard)
The plot below gives a scatterplot of the highway sign data along with the least squares regression line.
Here is the accompanying Minitab output, which is found by performing Stat >> Regression >> Regression on the highway sign data.
Regression Analysis: Distance, Age
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value 

Constant  576.68  23.47  24.57  0.000 
Age  -3.0068  0.4243  -7.09  0.000 
Regression Equation
Distance = 577 − 3.01 Age
Hypothesis Test for the Intercept (\(\beta_{0}\))
This test is rarely of interest, but it does show up when one is interested in performing a regression through the origin (which we touched on earlier in this lesson). In the Minitab output above, the row labeled Constant gives the information used to make inferences about the intercept. The null and alternative hypotheses for a hypothesis test about the intercept are written as:
\(H_{0} \colon \beta_{0} = 0\)
\(H_{A} \colon \beta_{0} \ne 0\)
In other words, the null hypothesis is testing if the population intercept is equal to 0 versus the alternative hypothesis that the population intercept is not equal to 0. In most problems, we are not particularly interested in hypotheses about the intercept. For instance, in our example, the intercept is the mean distance when the age is 0, a meaningless age. Also, the intercept does not give information about how the value of y changes when the value of x changes. Nevertheless, to test whether the population intercept is 0, the information from the Minitab output is used as follows:
 The sample intercept is \(b_{0}\) = 576.68, the value under Coef.
 The standard error (SE) of the sample intercept, written as se(\(b_{0}\)), is se(\(b_{0}\)) = 23.47, the value under SE Coef. The SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{0}\) gives, very roughly, the average difference between the sample \(b_{0}\) and the true population intercept \(\beta_{0}\), for random samples of this size (and with these x-values).
 The test statistic is t = \(b_{0}\)/se(\(b_{0}\)) = 576.68/23.47 = 24.57, the value under TValue.
 The p-value for the test is p = 0.000 and is given under P-Value. The p-value is actually very small and not exactly 0.
 The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the population intercept is not equal to 0.
So how exactly is the p-value found? For simple regression, the p-value is determined using a t distribution with n − 2 degrees of freedom (df), which is written as \(t_{n−2}\), and is calculated as 2 × the area past |t| under a \(t_{n−2}\) curve. In this example, df = 30 − 2 = 28. The p-value region is the type of region shown in the figure below. The negative and positive versions of the calculated t provide the interior boundaries of the two shaded regions. As the value of |t| increases, the p-value (the area in the shaded regions) decreases.
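As a check on the arithmetic, a short Python sketch (values taken from the Minitab output above; scipy.stats stands in for the t table and is not part of the lesson) reproduces the test statistic and the two-sided p-value:

```python
from scipy.stats import t

b0, se_b0, n = 576.68, 23.47, 30
t_stat = b0 / se_b0                      # about 24.57, matching T-Value
p_value = 2 * t.sf(abs(t_stat), n - 2)   # two-sided p-value, df = n - 2 = 28
```

The p-value comes back vanishingly small, which is why Minitab prints it as 0.000.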
Hypothesis Test for the Slope (\(\beta_{1}\))
This test can be used to test whether or not x and y are linearly related. The row pertaining to the variable Age in the Minitab output from earlier gives the information used to make inferences about the slope. The slope directly tells us about the link between the mean of y and x. When the true population slope does not equal 0, the variables y and x are linearly related. When the slope is 0, there is no linear relationship because the mean of y does not change when the value of x changes. The null and alternative hypotheses for a hypothesis test about the slope are written as:
\(H_{0} \colon \beta_{1}\) = 0
\(H_{A} \colon \beta_{1}\) ≠ 0
In other words, the null hypothesis is testing if the population slope is equal to 0 versus the alternative hypothesis that the population slope is not equal to 0. To test whether the population slope is 0, the information from the Minitab output is used as follows:
 The sample slope is \(b_{1}\) = −3.0068, the value under Coef in the Age row of the output.
 The SE of the sample slope, written as se(\(b_{1}\)), is se(\(b_{1}\)) = 0.4243, the value under SE Coef. Again, the SE of any statistic is a measure of its accuracy. In this case, the SE of \(b_{1}\) gives, very roughly, the average difference between the sample \(b_{1}\) and the true population slope \(\beta_{1}\), for random samples of this size (and with these x-values).
 The test statistic is t = \(b_{1}\)/se(\(b_{1}\)) = −3.0068/0.4243 = −7.09, the value under TValue.
 The p-value for the test is p = 0.000 and is given under P-Value.
 The decision rule at the 0.05 significance level is to reject the null hypothesis since our p < 0.05. Thus, we conclude that there is statistically significant evidence that the variables of Distance and Age are linearly related.
As before, the p-value is the area of the region illustrated in the figure above.
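The same calculation works for the slope. With the Age values from the Minitab output, a Python sketch (scipy is used here only as a checking tool) reproduces the t-statistic and shows that the p-value, printed as 0.000, is small but not exactly zero:

```python
from scipy.stats import t

b1, se_b1, n = -3.0068, 0.4243, 30
t_stat = b1 / se_b1                      # about -7.09, matching T-Value
p_value = 2 * t.sf(abs(t_stat), n - 2)   # two-sided p-value, df = 28
```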
Confidence Interval for the Slope (\(\beta_{1}\))
A confidence interval for the unknown value of the population slope \(\beta_{1}\) can be computed as
sample statistic ± multiplier × standard error of statistic
→ \(b_{1}\) ± t* × se(\(b_{1}\))
To find the t* multiplier, you can do one of the following:
 In simple regression, the t* multiplier is determined using a \(t_{n−2}\) distribution. The value of t* is such that the confidence level is the area (probability) between −t* and +t* under the tcurve.
 A table such as the one in the textbook can be used to look up the multiplier.
 Alternatively, software like Minitab can be used.
95% Confidence Interval
In our example, n = 30 and df = n − 2 = 28. For 95% confidence, t* = 2.05. A 95% confidence interval for \(\beta_{1}\), the true population slope, is:
−3.0068 ± (2.05 × 0.4243)
−3.0068 ± 0.870
or about −3.88 to −2.14.
Interpretation: With 95% confidence, we can say the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each oneyear increase in age. It is incorrect to say that with 95% probability the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet per each oneyear increase in age. Make sure you understand why!!!
99% Confidence Interval
For 99% confidence, t* = 2.76. A 99% confidence interval for \(\beta_{1}\), the true population slope, is:
−3.0068 ± (2.76 × 0.4243)
−3.0068 ± 1.1711
or about −4.18 to −1.84.
Interpretation: With 99% confidence, we can say the mean sign reading distance decreases somewhere between 1.84 and 4.18 feet per each oneyear increase in age. Notice that as we increase our confidence, the interval becomes wider. So as we approach 100% confidence, our interval grows to become the whole real line.
As a final note, the above procedures can be used to calculate a confidence interval for the population intercept. Just use \(b_{0}\) (and its standard error) rather than \(b_{1}\).
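Both intervals can be reproduced in a few lines of Python; the multipliers come from the \(t_{28}\) distribution, with scipy's `t.ppf` playing the role of the table lookup (scipy is not part of the lesson):

```python
from scipy.stats import t

b1, se_b1, df = -3.0068, 0.4243, 28

cis = {}
for level in (0.95, 0.99):
    # Two-sided multiplier: area between -t* and +t* equals the level.
    t_star = t.ppf(1 - (1 - level) / 2, df)   # about 2.05 and 2.76
    margin = t_star * se_b1
    cis[level] = (b1 - margin, b1 + margin)
```

The unrounded multipliers give intervals of roughly (−3.88, −2.14) and (−4.18, −1.83), matching the hand calculations above up to rounding.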
Example 2-6: Handspans Data
Stretched handspans and heights are measured in inches for n = 167 college students (Hand Height data). We’ll use y = height and x = stretched handspan. A scatterplot with a regression line superimposed is given below, together with results of a simple linear regression model fit to the data.
Regression Analysis: Height versus HandSpan
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  1500.06  1500.06  199.17  0.000 
HandSpan  1  1500.06  1500.06  199.17  0.000 
Error  165  1242.70  7.53  
Lack-of-Fit  17  96.24  5.66  0.73  0.767 
Pure Error  148  1146.46  7.75  
Total  166  2742.76 
Model Summary
S  R-sq  R-sq(adj)  R-sq(pred) 

2.74436  54.69%  54.42%  53.76% 
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value  VIF 

Constant  35.53  2.32  15.34  0.000  
HandSpan  1.560  0.111  14.11  0.000  1.00 
Regression Equation
Height = 35.53 + 1.560 HandSpan
Note! Some things to note are:
 The residual standard deviation S is 2.744 and this estimates the standard deviation of the errors.
 \(r^2\) = (SSTO − SSE) / SSTO = SSR / (SSR + SSE) = 1500.1 / (1500.1 + 1242.7) = 1500.1 / 2742.8 = 0.547 or 54.7%. The interpretation is that handspan differences explain 54.7% of the variation in heights.
 The value of the F-statistic is F = 199.2 with 1 and 165 degrees of freedom, and the p-value for this F-statistic is 0.000. Thus we reject the null hypothesis \(H_{0} \colon \beta_{1}\) = 0 in favor of \(H_A\colon\beta_1\neq 0\). In other words, the observed relationship is statistically significant.
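The quantities in these notes follow directly from the ANOVA table. A quick Python check (sums of squares taken from the output above):

```python
SSR, SSE = 1500.06, 1242.70
df_error = 165

SSTO = SSR + SSE          # 2742.76
r_sq = SSR / SSTO         # about 0.547, i.e., 54.7%
MSE = SSE / df_error      # about 7.53
F = (SSR / 1) / MSE       # about 199.17, matching the F-Value
```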
Example 2-7: Quality Data
You are a manufacturer who wants to obtain a quality measure on a product, but the procedure to obtain the measure is expensive. There is an indirect approach, which uses a different product score (Score 1) in place of the actual quality measure (Score 2). This approach is less costly but also is less precise. You can use regression to see if Score 1 explains a significant amount of the variance in Score 2 to determine if Score 1 is an acceptable substitute for Score 2. The results from a simple linear regression analysis are given below:
Regression Analysis: Score2 versus Score1
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value 

Regression  1  2.5419  2.5419  156.56  0.000 
Residual Error  7  0.1136  0.0162  
Total  8  2.6556 
Model Summary
S  R-sq  R-sq(adj) 

0.127419  95.7%  95.1% 
Coefficients
Predictor  Coef  SE Coef  T-Value  P-Value 

Constant  1.1177  0.1093  10.23  0.000 
Score1  0.21767  0.01740  12.51  0.000 
Regression Equation
Score2 = 1.12 + 0.218 Score1
We are interested in testing the null hypothesis that Score 1 is not a significant predictor of Score 2 versus the alternative that Score 1 is a significant predictor of Score 2. More formally, we are testing:
\(H_{0} \colon\beta_{1}\) = 0
\(H_{A} \colon \beta_{1}\) ≠ 0
The p-value in the ANOVA table (0.000) indicates that the relationship between Score 1 and Score 2 is statistically significant at an α-level of 0.05. This is also shown by the p-value for the estimated coefficient of Score 1, which is 0.000.
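As a final check, the ANOVA output above lets us verify R-sq and illustrate a useful fact: in simple regression, the F-statistic equals the square of the slope's t-statistic (up to rounding in the printed output). A Python sketch:

```python
SSR, SSE = 2.5419, 0.1136
df_error = 7

r_sq = SSR / (SSR + SSE)   # about 0.957, matching R-sq = 95.7%
MSE = SSE / df_error
F = SSR / MSE              # about 156.6, matching the F-Value up to rounding
t_slope = 12.51            # T-Value for the slope from the output
# F and t_slope**2 agree up to the rounding in the printed table.
```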
2.13  Software Help 2
The next two pages cover the Minitab and R commands for the procedures in this lesson.
Below is a zip file that contains all the data sets used in this lesson:
 couplesheight.txt
 handheight.txt
 heightgpa.txt
 husbandwife.txt
 leadcord.txt
 mens200m.txt
 newaccounts.txt
 signdist.txt
 skincancer.txt
 solutions_conc.txt
 whitespruce.txt
Minitab Help 2: SLR Model Evaluation
Minitab®
Skin cancer
 Perform a basic regression analysis with y = Mort and x = Lat.
 Create a fitted line plot.
 To display confidence intervals for the model parameters (regression coefficients) click "Results" in the Regression Dialog and select "Expanded tables" for "Display of results."
Cord blood lead concentration
 Perform a basic regression analysis with y = Cord and x = Sold.
 Create a fitted line plot.
 To display confidence intervals for the model parameters (regression coefficients) click "Results" in the Regression Dialog and select "Expanded tables" for "Display of results."
Skin cancer
 Perform a basic regression analysis with y = Mort and x = Lat.
Height and grade point average
 Perform a basic regression analysis with y = gpa and x = height.
 Create a fitted line plot.
Sprinters
 Perform a basic regression analysis with y = Men200m and x = Year.
 Create a fitted line plot.
Highway sign reading distance and driver age
 Perform a basic regression analysis with y = Distance and x = Age.
 Create a fitted line plot.
 To display confidence intervals for the model parameters (regression coefficients) click "Results" in the Regression Dialog and select "Expanded tables" for "Display of results."
 To change the confidence level for the intervals click "Options" in the Regression Dialog.
Handspan and height
 Perform a basic regression analysis with y = Height and x = Handspan.
 Create a fitted line plot.
Checking account deposits
 Perform a basic regression analysis with y = New and x = Size.
 Create a fitted line plot.
 Conduct a lack of fit test. (Note: Minitab v17 automatically recognizes replicates in the data and produces a lack of fit test with pure error by default.)
R Help 2: SLR Model Evaluation
R Help
Skin cancer
 Load the skin cancer data.
 Fit a simple linear regression model with y = Mort and x = Lat.
 Display a scatterplot of the data with the simple linear regression line.
 Display model results.
 Calculate confidence intervals for the model parameters (regression coefficients).
skincancer <- read.table("~/path-to-folder/skincancer.txt", header=T)
attach(skincancer)
model <- lm(Mort ~ Lat)
plot(x=Lat, y=Mort,
xlab="Latitude (at center of state)", ylab="Mortality (deaths per 10 million)",
panel.last = lines(sort(Lat), fitted(model)[order(Lat)]))
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
# Lat -5.9776 0.5984 -9.99 3.31e-13 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 19.12 on 47 degrees of freedom
# Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
# F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
confint(model, level=0.95)
# 2.5 % 97.5 %
# (Intercept) 341.285151 437.093552
# Lat -7.181404 -4.773867
detach(skincancer)
Cord blood lead concentration
 Load the cord blood lead concentration data.
 Fit a simple linear regression model with y = Cord and x = Sold.
 Display a scatterplot of the data with the simple linear regression line.
 Display model results.
 Calculate confidence intervals for the model parameters (regression coefficients).
cordblood <- read.table("~/path-to-folder/cordblood.txt", header=T)
attach(cordblood)
model <- lm(Cord ~ Sold)
plot(x=Sold, y=Cord,
xlab="Monthly gasoline lead sales (metric tons)",
ylab="Mean cord blood lead concentration",
panel.last = lines(sort(Sold), fitted(model)[order(Sold)]))
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 4.108182 0.608806 6.748 2.05e-05 ***
# Sold 0.014885 0.004719 3.155 0.0083 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6162 on 12 degrees of freedom
# Multiple R-squared: 0.4533, Adjusted R-squared: 0.4078
# F-statistic: 9.952 on 1 and 12 DF, p-value: 0.008303
confint(model, level=0.95)
# 2.5 % 97.5 %
# (Intercept) 2.781707607 5.43465712
# Sold 0.004604418 0.02516608
detach(cordblood)
Skin cancer
 Load the skin cancer data.
 Fit a simple linear regression model with y = Mort and x = Lat.
 Display analysis of variance table.
skincancer <- read.table("~/path-to-folder/skincancer.txt", header=T)
attach(skincancer)
model <- lm(Mort ~ Lat)
anova(model)
# Analysis of Variance Table
# Response: Mort
# Df Sum Sq Mean Sq F value Pr(>F)
# Lat 1 36464 36464 99.797 3.309e-13 ***
# Residuals 47 17173 365
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Note: R anova function does not display the total sum of squares.
# Add regression and residual sums of squares to get total sum of squares.
# SSR + SSE = SSTO, i.e., 36464 + 17173 = 53637.
detach(skincancer)
Height and grade point average
 Load the height and grade point average data.
 Fit a simple linear regression model with y = gpa and x = height.
 Display a scatterplot of the data with the simple linear regression line.
 Display model results.
 Display analysis of variance table.
heightgpa <- read.table("~/path-to-folder/heightgpa.txt", header=T)
attach(heightgpa)
model <- lm(gpa ~ height)
plot(x=height, y=gpa,
xlab="Height (inches)", ylab="Grade Point Average",
panel.last = lines(sort(height), fitted(model)[order(height)]))
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.410214 1.434616 2.377 0.0234 *
# height 0.006563 0.021428 0.306 0.7613
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.5423 on 33 degrees of freedom
# Multiple R-squared: 0.002835, Adjusted R-squared: -0.02738
# F-statistic: 0.09381 on 1 and 33 DF, p-value: 0.7613
anova(model)
# Analysis of Variance Table
# Response: gpa
# Df Sum Sq Mean Sq F value Pr(>F)
# height 1 0.0276 0.02759 0.0938 0.7613
# Residuals 33 9.7055 0.29411
# SSTO = SSR + SSE = 0.0276 + 9.7055 = 9.7331.
detach(heightgpa)
Sprinters
 Load the sprinter's data.
 Fit a simple linear regression model with y = Men200m and x = Year.
 Display a scatterplot of the data with the simple linear regression line.
 Display model results.
 Display analysis of variance table.
sprinters <- read.table("~/path-to-folder/mens200m.txt", header=T)
attach(sprinters)
model <- lm(Men200m ~ Year)
plot(x=Year, y=Men200m,
xlab="Year", ylab="Men's 200m time (secs)",
panel.last = lines(sort(Year), fitted(model)[order(Year)]))
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 76.153369 4.152226 18.34 5.61e-14 ***
# Year -0.028383 0.002129 -13.33 2.07e-11 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2981 on 20 degrees of freedom
# Multiple R-squared: 0.8988, Adjusted R-squared: 0.8938
# F-statistic: 177.7 on 1 and 20 DF, p-value: 2.074e-11
anova(model)
# Analysis of Variance Table
# Response: Men200m
# Df Sum Sq Mean Sq F value Pr(>F)
# Year 1 15.7964 15.7964 177.72 2.074e-11 ***
# Residuals 20 1.7777 0.0889
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# SSTO = SSR + SSE = 15.7964 + 1.7777 = 17.5741.
detach(sprinters)
Highway sign reading distance and driver age
 Load the signdist data.
 Fit a simple linear regression model with y = Distance and x = Age.
 Display a scatterplot of the data with the simple linear regression line.
 Display model results.
 Calculate confidence intervals for the slope.
signdist <- read.table("~/path-to-folder/signdist.txt", header=T)
attach(signdist)
model <- lm(Distance ~ Age)
plot(x=Age, y=Distance,
xlab="Age", ylab="Distance",
panel.last = lines(sort(Age), fitted(model)[order(Age)]))
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 576.6819 23.4709 24.570 < 2e-16 ***
# Age -3.0068 0.4243 -7.086 1.04e-07 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 49.76 on 28 degrees of freedom
# Multiple R-squared: 0.642, Adjusted R-squared: 0.6292
# F-statistic: 50.21 on 1 and 28 DF, p-value: 1.041e-07
confint(model, parm="Age", level=0.95)
# 2.5 % 97.5 %
# Age -3.876051 -2.13762
confint(model, parm="Age", level=0.99)
# 0.5 % 99.5 %
# Age -4.179391 -1.83428
detach(signdist)
Handspan and height
 Load the handheight data.
 Fit a simple linear regression model with y = Height and x = HandSpan.
 Display a scatterplot of the data with the simple linear regression line.
 Display model results.
 Display analysis of variance table.
handheight <- read.table("~/path-to-folder/handheight.txt", header=T)
attach(handheight)
model <- lm(Height ~ HandSpan)
plot(x=HandSpan, y=Height,
xlab="HandSpan", ylab="Height",
panel.last = lines(sort(HandSpan), fitted(model)[order(HandSpan)]))
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 35.5250 2.3160 15.34 <2e-16 ***
# HandSpan 1.5601 0.1105 14.11 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 2.744 on 165 degrees of freedom
# Multiple R-squared: 0.5469, Adjusted R-squared: 0.5442
# F-statistic: 199.2 on 1 and 165 DF, p-value: < 2.2e-16
anova(model)
# Analysis of Variance Table
# Response: Height
# Df Sum Sq Mean Sq F value Pr(>F)
# HandSpan 1 1500.1 1500.06 199.17 < 2.2e-16 ***
# Residuals 165 1242.7 7.53
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# SSTO = SSR + SSE = 1500.1 + 1242.7 = 2742.8.
detach(handheight)
Checking account deposits
 Load the newaccounts data.
 Fit a simple linear regression model with y = New and x = Size
 Display a scatterplot of the data with the simple linear regression line.
 Display model results.
 Display lack of fit analysis of variance table.
 Display usual analysis of variance table.
newaccounts <- read.table("~/path-to-folder/newaccounts.txt", header=T)
attach(newaccounts)
model <- lm(New ~ Size)
plot(x=Size, y=New,
xlab="Size of minimum deposit", ylab="Number of new accounts",
panel.last = lines(sort(Size), fitted(model)[order(Size)]))
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 50.7225 39.3979 1.287 0.23
# Size 0.4867 0.2747 1.772 0.11
#
# Residual standard error: 40.47 on 9 degrees of freedom
# Multiple R-squared: 0.2586, Adjusted R-squared: 0.1762
# F-statistic: 3.139 on 1 and 9 DF, p-value: 0.1102
library(alr3) # alr3 package must be installed first
pureErrorAnova(model) # Lack of fit anova table
# Analysis of Variance Table
# Response: New
# Df Sum Sq Mean Sq F value Pr(>F)
# Size 1 5141.3 5141.3 22.393 0.005186 **
# Residuals 9 14741.6 1638.0
# Lack of fit 4 13593.6 3398.4 14.801 0.005594 **
# Pure Error 5 1148.0 229.6
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# NOTE: The F value for Size uses MSPE in its denominator.
# So, F value for Size is 5141.3 / 229.6 = 22.393.
# Thus it differs from the F value for Size in the usual anova table:
anova(model)
# Analysis of Variance Table
# Response: New
# Df Sum Sq Mean Sq F value Pr(>F)
# Size 1 5141.3 5141.3 3.1389 0.1102
# Residuals 9 14741.6 1638.0
# NOTE: Here the F value for Size uses MSE in its denominator.
# So, F value for Size is 5141.3 / 1638.0 = 3.1389.
detach(newaccounts)