8.2 - Simple Linear Regression
For Bob’s simple linear regression example, he wants to see how changes in the number of critical areas (the predictor variable) affect the dollar amount for land development (the response variable). As the value of the predictor variable (number of critical areas) increases, does the response (cost) tend to increase, decrease, or stay constant?
We test this by using the characteristics of the linear relationship, particularly the slope as defined above. Remember from hypothesis testing that we often test the null hypothesis that a value is zero. We extend this principle to the slope, with a null hypothesis that the slope is equal to zero. A non-zero slope indicates that the predictor variable has an impact on the response variable, whereas a zero slope indicates that changes in the predictor variable do not impact the response.
Let’s take a closer look at the linear model producing our regression results.
8.2.1 - Assumptions for the SLR Model
Before we get started interpreting the output, it is critical that we address specific assumptions for regression. Not meeting these assumptions is a flag that the results are not valid (the results cannot be interpreted with any certainty because the model may not fit the data).
In this section, we will present the assumptions needed to perform the hypothesis test for the population slope:

$$H_0\colon \beta_1 = 0 \qquad H_a\colon \beta_1 \ne 0$$
We will also demonstrate how to verify if they are satisfied. To verify the assumptions, you must run the analysis in Minitab first.
Assumptions for Simple Linear Regression
- Linearity: The relationship between $x$ and $y$ must be linear.
Check this assumption by examining a scatterplot of x and y.
- Independence of errors: There is no relationship between the residuals and the Y variable; in other words, Y is independent of the errors.
Check this assumption by examining a scatterplot of “residuals versus fits”; the correlation should be approximately 0. In other words, the plot should show no apparent relationship.
- Normality of errors: The residuals must be approximately normally distributed.
Check this assumption by examining a normal probability plot; the observations should be near the line. You can also examine a histogram of the residuals; it should be approximately normally distributed.
- Equal variances: The variance of the residuals is the same for all values of $x$.
Check this assumption by examining the scatterplot of “residuals versus fits”; the variance of the residuals should be the same across all values of the x-axis. If the plot shows a pattern (e.g., bowtie or megaphone shape), then variances are not consistent, and this assumption has not been met.
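As a rough sketch of where the values on a “residuals versus fits” plot come from, the Python snippet below fits a line (the least-squares line defined later in this lesson) to a small made-up data set and lists each fitted value with its residual. The data and variable names are hypothetical and are not Bob’s observations; in practice Minitab produces these plots for you.

```python
from statistics import mean

# Hypothetical data for illustration only (not Bob's observations).
x = [1, 2, 3, 4, 5, 6]
y = [60.1, 70.3, 80.9, 91.2, 101.4, 112.3]

# Fit the least-squares line y = b0 + b1 * x.
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Fitted values (x-axis of the plot) and residuals (y-axis of the plot).
fits = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]

for fit, res in zip(fits, residuals):
    print(f"fit = {fit:8.3f}   residual = {res:7.3f}")
```

Plotting `residuals` against `fits` gives the scatterplot used to judge the independence and equal-variance assumptions; a histogram of `residuals` can be used for the normality check.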
8.2.2 - The SLR Model
The errors referred to in the assumptions are only one component of the linear model. As the basis of the model, the observations are considered as coordinates, $(x_i, y_i)$, for $i = 1, \dots, n$.
The graph below is an example of a scatter plot with height as the explanatory variable, annotated to explain the different parts of the scatterplot and the least-squares regression line.
The graph below summarizes the least-squares regression for Bob’s data. We will define what we mean by least-squares regression in more detail later in the Lesson; for now, focus on how the red line (the regression line) “fits” the blue dots (Bob’s data).
We combine the linear relationship along with the error in the simple linear regression model.
Simple Linear Regression Model
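The general form of the simple linear regression model is

$$Y = \beta_0 + \beta_1 X + \epsilon$$

For an individual observation,

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where $\beta_0$ is the population y-intercept, $\beta_1$ is the population slope, and $\epsilon_i$ is the error or deviation of $y_i$ from the line, $\beta_0 + \beta_1 x_i$.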
To make inferences about these unknown population parameters (namely the slope and intercept), we must find estimates for them. There are different ways to estimate the parameters from the sample. This is where the least-squares method comes in.
Least Squares Line
The least-squares line is the line for which the sum of squared errors of predictions for all sample points is the least.
Using the least-squares method, we can find estimates for the two parameters.
The formulas to calculate least squares estimates are:
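- Sample Slope

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

- Sample Intercept

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$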
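To make the least-squares estimates concrete, here is a minimal Python sketch that computes the sample slope and sample intercept using the standard formulas (sum of cross-deviations over sum of squared x-deviations for the slope, then the intercept from the two means). The five data points are invented for illustration and are not Bob’s data; Minitab performs the same calculation for you.

```python
from statistics import mean

# Hypothetical (x, y) pairs for illustration only (not Bob's data).
x = [2, 4, 5, 7, 9]
y = [70.0, 91.5, 101.0, 123.5, 143.0]

x_bar = mean(x)
y_bar = mean(y)

# Sample slope: sum of cross-deviations over sum of squared x-deviations.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx

# Sample intercept: the fitted line passes through (x_bar, y_bar).
intercept = y_bar - slope * x_bar


def sse(b0, b1):
    """Sum of squared prediction errors for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"SSE at the least-squares line: {sse(intercept, slope):.4f}")
print(f"SSE with the slope nudged by 0.1: {sse(intercept, slope + 0.1):.4f}")
```

Nudging either coefficient away from its least-squares value makes the sum of squared errors larger, which is what “least squares” refers to.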
The least squares line for Bob’s data is the red line on the scatterplot below.
Let’s jump ahead for a moment and generate the regression output; we will work through its content below. The regression output for Bob’s data looks like this:
Coefficients
Predictor | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 49.542 | 0.560 | 88.40 | 0.000 | |
Critical Areas | 10.417 | 0.115 | 90.92 | 0.000 | 1.00 |
Regression Equation
Cost = 49.542 + 10.417 Critical Areas
8.2.3 - Interpreting the Coefficients
Once we have the estimates for the slope and intercept, we need to interpret them. For Bob’s data, the estimate for the slope is 10.417 and the estimate for the intercept (constant) is 49.542. Recall from the beginning of the Lesson what the slope of a line means algebraically. If the slope is denoted as $m$, then

$$m = \frac{\text{change in } y}{\text{change in } x}$$
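Going back to algebra, the intercept is the value of y when $x = 0$.
Interpreting the intercept of the regression equation, $\hat{\beta}_0$ is the estimated value of y when $x = 0$. For Bob’s data, the estimated cost for land with zero critical areas is 49.542.
Note, however, when $x = 0$ is outside the range of the observed data, the intercept is usually not of practical interest.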
As we already noted, the slope of a line is the change in the y variable over the change in the x variable. If the change in the x variable is one, then the slope is:

$$\text{slope} = \frac{\text{change in } y}{1} = \text{change in } y$$
The slope is interpreted as the change in y for a one-unit increase in x. In Bob’s example, for every one-unit increase in critical areas, the predicted cost of development increases by 10.417.
Interpreting the slope of the regression equation, $\hat{\beta}_1$ is the estimated change in y for a one-unit increase in x.
Note that the change may be negative, which is reflected when $\hat{\beta}_1$ is negative.
If the slope of the line is positive, as it is in Bob’s example, then there is a positive linear relationship, i.e., as one increases, the other increases. If the slope is negative, then there is a negative linear relationship, i.e., as one increases the other variable decreases. If the slope is 0, then as one increases, the other remains constant, i.e., no predictive relationship.
Therefore, we are interested in testing the following hypotheses:

$$H_0\colon \beta_1 = 0 \qquad H_a\colon \beta_1 \ne 0$$
Let’s take a closer look at the hypothesis test for the estimate of the slope. A similar test can be carried out for the population intercept, $\beta_0$.
8.2.4 - Hypothesis Test for the Population Slope
As mentioned, the test for the slope follows the logic of a one-sample hypothesis test for the mean. Typically (as will be the case in this course), we test the null hypothesis that the slope is equal to zero. However, it is also possible to test the null hypothesis that the slope is zero or less than zero, or the null hypothesis that the slope is zero or greater than zero.
Research Question | Is there a linear relationship? | Is there a positive linear relationship? | Is there a negative linear relationship? |
---|---|---|---|
Null Hypothesis | $\beta_1 = 0$ | $\beta_1 \le 0$ | $\beta_1 \ge 0$ |
Alternative Hypothesis | $\beta_1 \ne 0$ | $\beta_1 > 0$ | $\beta_1 < 0$ |
Type of Test | Two-tailed, non-directional | Right-tailed, directional | Left-tailed, directional |
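The test statistic for the test of population slope is:

$$t^* = \frac{\hat{\beta}_1}{\hat{SE}(\hat{\beta}_1)}$$

where $\hat{SE}(\hat{\beta}_1)$ is the estimated standard error of the sample slope (the “SE Coef” column in the Minitab output). Under the null hypothesis, the test statistic follows a t-distribution with $n-2$ degrees of freedom.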
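The calculation behind the “SE Coef” and “T-Value” columns can be sketched in Python as follows. This assumes the usual textbook formula for the standard error of the slope (residual standard error divided by the square root of the sum of squared x-deviations) and uses invented data, not Bob’s observations. The last lines simply divide the rounded slope by the rounded standard error from Bob’s output; the result (about 90.58) differs slightly from the reported 90.92 because the printed coefficients are rounded.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for illustration only (not Bob's observations).
x = [2, 4, 5, 7, 9]
y = [70.0, 91.5, 101.0, 123.5, 143.0]
n = len(x)

# Least-squares estimates of the slope and intercept.
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual standard error, using n - 2 degrees of freedom.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))

# Estimated standard error of the slope and the test statistic for H0: slope = 0.
se_slope = s / sqrt(s_xx)
t_stat = slope / se_slope
print(f"slope = {slope:.3f}, SE = {se_slope:.3f}, t = {t_stat:.2f}")

# The same ratio using the rounded values printed in Bob's output.
t_from_output = 10.417 / 0.115
print(f"t from rounded output = {t_from_output:.2f}")
```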
Take another look at the output from Bob’s data.
Coefficients
Predictor | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 49.542 | 0.560 | 88.40 | 0.000 | |
Critical Areas | 10.417 | 0.115 | 90.92 | 0.000 | 1.00 |
Regression Equation
Cost = 49.542 + 10.417 Critical Areas
Here we can see that the “T-Value” is 90.92, a very large t value, indicating that the slope calculated by the least-squares method (10.417) is very far from the null value for the slope (zero). The resulting P-value is very small (less than .05); that is, a slope this far from zero would be very unlikely if the null hypothesis were true. Bob can therefore reject the null and conclude that the slope is not zero. In other words, the number of critical areas significantly predicts the cost of development.
He can be more specific and conclude that for every one-unit increase in critical areas, the predicted cost of development increases by 10.417.
As with most of our calculations, we need to allow some room for imprecision in our estimate. We return to the concept of confidence intervals to build in some error around the estimate of the slope.
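The $(1-\alpha)100\%$ confidence interval for $\beta_1$ is:

$$\hat{\beta}_1 \pm t_{\alpha/2}\,\hat{SE}(\hat{\beta}_1)$$

where $\hat{SE}(\hat{\beta}_1)$ is the estimated standard error of the slope and $t_{\alpha/2}$ comes from a t-distribution with $n-2$ degrees of freedom.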
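As an illustration of the interval for the slope, the sketch below builds a 95% confidence interval from a small invented data set (not Bob’s data). The critical value 3.182 is taken from a t-table for $n - 2 = 3$ degrees of freedom and is hard-coded here as an assumption of the example; with a different sample size a different critical value would be needed.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for illustration only (not Bob's observations).
x = [2, 4, 5, 7, 9]
y = [70.0, 91.5, 101.0, 123.5, 143.0]
n = len(x)

# Least-squares fit.
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Estimated standard error of the slope.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = sqrt(sse / (n - 2)) / sqrt(s_xx)

# t critical value for 95% confidence with n - 2 = 3 degrees of freedom
# (from a t-table; an assumption tied to this sample size).
t_crit = 3.182

lower = slope - t_crit * se_slope
upper = slope + t_crit * se_slope
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

If the interval does not contain zero, that agrees with rejecting the null hypothesis of a zero slope in the two-tailed test at the same significance level.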
The final piece of output from Minitab is the Least Squares Regression Equation. Remember that Bob is interested in being able to predict the development cost of land given the number of critical areas. Bob can use the equation to do this.
If a given piece of land has 10 critical areas, Bob can “plug in” the value of 10 for X. The resulting equation,

$$\widehat{\text{Cost}} = 49.542 + 10.417(10)$$

results in a predicted cost of:

$$\widehat{\text{Cost}} = 153.712$$

So, if Bob knows a piece of land has 10 critical areas, he can predict the development cost will be about 154 dollars!
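The plug-in calculation can be reproduced in a few lines of Python. The coefficients come from Bob’s regression equation; the function name `predict_cost` is only an illustrative choice.

```python
def predict_cost(critical_areas: float) -> float:
    """Predicted development cost from Bob's fitted regression equation."""
    return 49.542 + 10.417 * critical_areas


prediction = predict_cost(10)
print(f"{prediction:.3f}")  # prints 153.712
```

Predictions like this are most trustworthy for values of X within the range of the data used to fit the line.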
Using the 10 critical areas allowed Bob to predict the development cost, but there is an important distinction between predicting an average cost and predicting a specific cost. These are represented by confidence intervals and prediction intervals, respectively, for new observations. (Notice that here we are referring to a new observation, whereas above we used a confidence interval for the estimate of the slope!)
The mean response at a given X value is given by:

$$E(Y) = \beta_0 + \beta_1 X$$
Inferences about Outcome for New Observation
- The point estimate for the outcome at $X = x$ is provided above.
- The interval to estimate the mean response is called the confidence interval. Minitab calculates this for us.
- The interval used to estimate (or predict) an outcome is called the prediction interval.
For a given x value, the prediction interval and confidence interval have the same center, but the prediction interval is wider than the confidence interval. That makes good sense, since it is harder to estimate a value for a single subject (for example, a particular piece of land in Bob’s town that may have some unique features) than it is to estimate the average for all pieces of land. Again, Minitab will calculate this interval as well.
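To see why the prediction interval is wider, the sketch below computes both intervals at one x value using the standard textbook standard-error formulas, which this lesson leaves to Minitab. The data, the chosen value `x0 = 6`, and the hard-coded t critical value (3.182 for 3 degrees of freedom, from a t-table) are all assumptions made for illustration and are not taken from Bob’s data.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for illustration only (not Bob's observations).
x = [2, 4, 5, 7, 9]
y = [70.0, 91.5, 101.0, 123.5, 143.0]
n = len(x)

# Least-squares fit and residual standard error.
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))

# Point estimate at a new x value (shared center of both intervals).
x0 = 6
y_hat = intercept + slope * x0

# Standard errors: mean response versus a single new observation.
# The extra "1 +" term accounts for the variability of an individual outcome.
se_mean = s * sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
se_pred = s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)

t_crit = 3.182  # 95% confidence, n - 2 = 3 degrees of freedom (t-table value)

ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
print(f"point estimate:          {y_hat:.3f}")
print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% prediction interval: ({pi[0]:.3f}, {pi[1]:.3f})")
```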
8.2.5 - SLR with Minitab
Minitab®
Simple Linear Regression with Minitab
- Select Stat > Regression > Regression > Fit Regression Model
- In the box labeled "Response", specify the desired response variable.
- In the box labeled "Predictors", specify the desired predictor variable.
- Select OK. The basic regression analysis output will be displayed in the session window.
To check assumptions...
- Click Graphs.
- In 'Residual plots', choose 'Four in one'.
- Select OK.