8.2 - Simple Linear Regression

For Bob’s simple linear regression example, he wants to see how changes in the number of critical areas (the predictor variable) impact the dollar amount for land development (the response variable). In other words, as the number of critical areas increases, does the cost tend to increase, decrease, or stay the same?

We test this using the characteristics of linear relationships, particularly the slope as defined above. Remember from hypothesis testing that we test a null hypothesis that a value is zero. We extend this principle to the slope, with a null hypothesis that the slope is equal to zero. A non-zero slope indicates that the predictor variable has a significant impact on the response variable, whereas a zero slope indicates that changes in the predictor variable do not impact the response.

Let’s take a closer look at the linear model producing our regression results.


8.2.1 - Assumptions for the SLR Model

Before we get started interpreting the output, it is critical that we address specific assumptions for regression. Not meeting these assumptions is a flag that the results are not valid (the results cannot be interpreted with any certainty because the model may not fit the data).

In this section, we will present the assumptions needed to perform the hypothesis test for the population slope:

\(H_0\colon \ \beta_1=0\)

\(H_a\colon \ \beta_1\ne0\)

We will also demonstrate how to verify if they are satisfied. To verify the assumptions, you must run the analysis in Minitab first.

Assumptions for Simple Linear Regression

  1. Linearity: The relationship between \(X\) and \(Y\) must be linear.

    Check this assumption by examining a scatterplot of x and y.

  2. Independence of errors: There is not a relationship between the residuals and the \(Y\) variable; in other words, \(Y\) is independent of the errors.

    Check this assumption by examining a scatterplot of “residuals versus fits”; the correlation should be approximately 0. In other words, the plot should not show any apparent relationship.

  3. Normality of errors: The residuals must be approximately normally distributed.

    Check this assumption by examining a normal probability plot; the observations should be near the line. You can also examine a histogram of the residuals; it should be approximately normally distributed.

  4. Equal variances: The variance of the residuals is the same for all values of \(X\).

    Check this assumption by examining the scatterplot of “residuals versus fits”; the variance of the residuals should be the same across all values of the x-axis. If the plot shows a pattern (e.g., bowtie or megaphone shape), then variances are not consistent, and this assumption has not been met.
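
All four checks above rely on plots of the data and the residuals. If you want to reproduce them outside Minitab, here is a minimal Python sketch, assuming hypothetical arrays x and y standing in for Bob's data; it produces the "residuals versus fits" plot, a histogram of residuals, and a normal probability plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data standing in for Bob's sample.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([60, 71, 80, 92, 101, 112, 122, 133, 143, 154], dtype=float)

# Fit the least-squares line, then compute fitted values and residuals.
# Check 1 (linearity) is a plain scatterplot of x against y, e.g. plt.scatter(x, y).
slope, intercept = np.polyfit(x, y, deg=1)
fits = intercept + slope * x
residuals = y - fits

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Checks 2 and 4: residuals vs. fits should show no pattern
# and roughly constant spread across the x-axis.
axes[0].scatter(fits, residuals)
axes[0].axhline(0, linestyle="--")
axes[0].set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs. fits")

# Check 3: the histogram of residuals should look roughly bell-shaped.
axes[1].hist(residuals, bins="auto")
axes[1].set(xlabel="Residual", title="Histogram of residuals")

# Check 3 again: points on the normal probability plot should be near the line.
stats.probplot(residuals, dist="norm", plot=axes[2])

plt.tight_layout()
plt.show()
```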


8.2.2 - The SLR Model

The errors referred to in the assumptions are only one component of the linear model. As the basis of the model, the observations are treated as coordinates, \((x_i, y_i)\), for \(i=1, \dots, n\). The points, \(\left(x_1,y_1\right), \dots,\left(x_n,y_n\right)\), may not fall exactly on a line (like the cost and number of critical areas). This gap is the error!

The graph below is an example of a scatterplot showing height as the explanatory variable for weight, along with the least-squares regression line.

 

The graph below summarizes the least-squares regression for Bob's data. We will define what we mean by least-squares regression in more detail later in the lesson; for now, focus on how the red line (the regression line) "fits" the blue dots (Bob's data).

We combine the linear relationship along with the error in the simple linear regression model.

 

Simple Linear Regression Model

The general form of the simple linear regression model is...

\(Y=\beta_0+\beta_1X+\epsilon\)

For an individual observation,

\(y_i=\beta_0+\beta_1x_i+\epsilon_i\)

where,

  • \(\beta_0\) is the population y-intercept,
  • \(\beta_1\) is the population slope, and
  • \(\epsilon_i\) is the error or deviation of \(y_i\) from the line, \(\beta_0+\beta_1x_i\)
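
To make the model concrete, here is a small Python sketch, using made-up values for \(\beta_0\), \(\beta_1\), and the error standard deviation (none of these come from Bob's data), that generates observations from the model: each response is the line value plus a random error.

```python
import numpy as np

rng = np.random.default_rng(42)

beta0, beta1 = 49.5, 10.4   # made-up population intercept and slope
sigma = 1.5                 # made-up standard deviation of the errors

x = rng.uniform(1, 10, size=50)           # predictor values
epsilon = rng.normal(0, sigma, size=50)   # errors: deviations from the line
y = beta0 + beta1 * x + epsilon           # y_i = beta_0 + beta_1 * x_i + eps_i

print(y[:5])
```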

To make inferences about these unknown population parameters (namely the slope and intercept), we must find an estimate for them. There are different ways to estimate the parameters from the sample. This is where the least-squares method comes in.

Least Squares Line

The least-squares line is the line for which the sum of squared errors of predictions for all sample points is the least.

Using the least-squares method, we can find estimates for the two parameters.

The formulas to calculate least squares estimates are:

Sample Slope
\(\hat{\beta}_1=\dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\)
Sample Intercept
\(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\)
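
These formulas translate directly into code. As an illustration only (you will not be asked to do this by hand), here is a short Python sketch that computes the same estimates from hypothetical arrays x and y:

```python
import numpy as np

def least_squares(x, y):
    """Sample slope and intercept from the least-squares formulas above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    intercept = ybar - slope * xbar
    return slope, intercept

# Hypothetical example data:
b1, b0 = least_squares([1, 2, 3, 4, 5], [60, 71, 80, 92, 101])
print(f"slope = {b1:.3f}, intercept = {b0:.3f}")
```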

The least squares line for Bob’s data is the red line on the scatterplot below.

Note! You will not be expected to memorize these formulas or to find the estimates by hand. We will use Minitab to find these estimates for you. We estimate the population slope, \(\beta_1\), with the sample slope denoted \(\hat{\beta}_1\). The population intercept, \(\beta_0\), is estimated with the sample intercept denoted \(\hat{\beta}_0\). The intercept is often referred to as the constant or the constant term. Once the parameters are estimated, we have the least-squares regression equation (or the estimated regression line).

Let’s jump ahead for a moment and generate the regression output. Below we will work through the content of the output. The regression output for Bob’s data looks like this:

Coefficients

Predictor        Coef     SE Coef   T-Value   P-Value   VIF
Constant         49.542   0.560     88.40     0.000
Critical Areas   10.417   0.115     90.92     0.000     1.00
Regression Equation

Cost = 49.542 + 10.417 Critical Areas
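
If you'd like to reproduce this kind of output outside Minitab, the statsmodels package in Python prints a comparable coefficient table. A sketch with hypothetical placeholder data (not Bob's actual numbers):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical placeholder data, for illustration only.
critical_areas = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
cost = np.array([60.1, 70.3, 80.9, 91.2, 101.8, 112.1, 122.5, 132.9])

X = sm.add_constant(critical_areas)   # adds the intercept ("Constant") column
model = sm.OLS(cost, X).fit()

# The summary includes Coef, SE Coef ("std err"), T-Value ("t"),
# and P-Value ("P>|t|"), analogous to Minitab's coefficient table.
print(model.summary())
```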


8.2.3 - Interpreting the Coefficients

Once we have the estimates for the slope and intercept, we need to interpret them. For Bob’s data, the estimate for the slope is 10.417 and the estimate for the intercept (constant) is 49.542. Recall from the beginning of the Lesson what the slope of a line means algebraically. If the slope is denoted as \(m\), then

\(m=\dfrac{\text{change in y}}{\text{change in x}}\)

Going back to algebra, the intercept is the value of y when \(x = 0\). It has the same interpretation in statistics.

Interpreting the intercept of the regression equation: \(\hat{\beta}_0\) is the \(Y\)-intercept of the regression line. When \(X = 0\) is within the scope of observation, \(\hat{\beta}_0\) is the estimated value of \(Y\) when \(X = 0\).

Note, however, that when \(X = 0\) is not within the scope of the observations, the \(Y\)-intercept is usually not of interest. In Bob’s example, \(X = 0\) corresponds to a plot of land with no critical areas, with an estimated cost of 49.542. This might be of interest in establishing a baseline value, but since Bob is specifically looking at land that HAS critical areas, it might not be of much interest to him.

As we already noted, the slope of a line is the change in the y variable over the change in the x variable. If the change in the x variable is one, then the slope is:

\(m=\dfrac{\text{change in y}}{1}\)

The slope is interpreted as the change in y for a one-unit increase in x. In Bob’s example, for every one-unit increase in critical areas, the cost of development increases by 10.417.
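
A quick check with the fitted equation makes this concrete: increasing the number of critical areas from \(x\) to \(x+1\) changes the predicted cost by exactly the slope:

\(\left[49.542+10.417(x+1)\right]-\left[49.542+10.417x\right]=10.417\)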

Interpreting the slope of the regression equation, \(\hat{\beta}_1\)

\(\hat{\beta}_1\) represents the estimated change in Y per unit increase in X

Note that the change may be negative, which is reflected when \(\hat{\beta}_1\) is negative, or positive, when \(\hat{\beta}_1\) is positive.

If the slope of the line is positive, as it is in Bob’s example, then there is a positive linear relationship, i.e., as one increases, the other increases. If the slope is negative, then there is a negative linear relationship, i.e., as one increases the other variable decreases. If the slope is 0, then as one increases, the other remains constant, i.e., no predictive relationship.

Therefore, we are interested in testing the following hypotheses:

\(H_0\colon \beta_1=0\\H_a\colon \beta_1\ne0\)

Let’s take a closer look at the hypothesis test for the estimate of the slope. A similar test for the population intercept, \(\beta_0\), is not discussed in this class because it is not typically of interest.


8.2.4 - Hypothesis Test for the Population Slope

As mentioned, the test for the slope follows the logic of a one-sample hypothesis test for the mean. Typically (and this will be the case in this course), we test the null hypothesis that the slope is equal to zero. However, it is also possible to test the null hypothesis that the slope is zero or less, OR to test the null hypothesis that the slope is zero or greater.

 

Research Question        Is there a linear relationship?   Is there a positive linear relationship?   Is there a negative linear relationship?
Null Hypothesis          \(\beta_1=0\)                     \(\beta_1=0\)                               \(\beta_1=0\)
Alternative Hypothesis   \(\beta_1\ne0\)                   \(\beta_1>0\)                               \(\beta_1<0\)
Type of Test             Two-tailed, non-directional       Right-tailed, directional                   Left-tailed, directional

The test statistic for the test of population slope is:

\(t^*=\dfrac{\hat{\beta}_1}{\hat{SE}(\hat{\beta}_1)}\)

where \(\hat{SE}(\hat{\beta}_1)\) is the estimated standard error of the sample slope (found in Minitab output). Under the null hypothesis and with the assumptions shown in the previous section, \(t^*\) follows a \(t\)-distribution with \(n-2\) degrees of freedom.
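
In code, the test statistic and its two-tailed p-value follow directly from the estimate and its standard error. A sketch using the slope values from Bob's output, with an assumed sample size (the output excerpt does not show \(n\)):

```python
from scipy import stats

beta1_hat = 10.417   # sample slope (Coef from the output)
se_beta1 = 0.115     # SE Coef for the slope
n = 20               # assumed sample size, for illustration only

# t* = estimate / standard error, with n - 2 degrees of freedom
t_star = beta1_hat / se_beta1
p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)   # two-tailed p-value

# t_star is about 90.6 here; Minitab shows 90.92 because its displayed
# coefficients are rounded.
print(f"t* = {t_star:.2f}, p-value = {p_value:.4g}")
```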

Take another look at the output from Bob’s data.

Coefficients

Predictor        Coef     SE Coef   T-Value   P-Value   VIF
Constant         49.542   0.560     88.40     0.000
Critical Areas   10.417   0.115     90.92     0.000     1.00
Regression Equation

Cost = 49.542 + 10.417 Critical Areas

Here we can see that the “T-Value” is 90.92, a very large t-value indicating that the slope calculated by the least-squares method (10.417) is very different from the null value for the slope (zero). This results in a small probability of observing such a slope if the null were true (the P-Value is less than .05), so Bob can reject the null and conclude that the slope is not zero. Therefore, the number of critical areas significantly predicts the cost of development.

He can be more specific and conclude that for every one-unit increase in critical areas, the cost of development increases by 10.417.

Note! In this class, we will have Minitab perform the calculations for this test. Minitab's output gives the result for two-tailed tests for \(\beta_1\) and \(\beta_0\). If you wish to perform a one-sided test, you would have to adjust the p-value Minitab provides.

As with most of our calculations, we need to allow some room for imprecision in our estimate. We return to the concept of confidence intervals to build in some error around the estimate of the slope.

The \( (1-\alpha)100\%\) confidence interval for \(\beta_1\) is:

\(\hat{\beta}_1\pm t_{\alpha/2}\left(\hat{SE}(\hat{\beta}_1)\right)\)

where \(t\) has \(n-2\) degrees of freedom.

Note! The degrees of freedom of t depends on the number of independent variables. The degrees of freedom is \(n - 2\) when there is only one independent variable.
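
As a sketch, the interval can be computed directly from the output values, again assuming a hypothetical sample size \(n\) (the output excerpt does not show it):

```python
from scipy import stats

beta1_hat = 10.417   # sample slope (Coef from the output)
se_beta1 = 0.115     # estimated standard error of the slope
n = 20               # assumed sample size, for illustration only
alpha = 0.05         # for a 95% confidence interval

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t multiplier with n - 2 df
lower = beta1_hat - t_crit * se_beta1
upper = beta1_hat + t_crit * se_beta1

print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```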

The final piece of output from Minitab is the Least Squares Regression Equation. Remember that Bob is interested in being able to predict the development cost of land given the number of critical areas. Bob can use the equation to do this.

If a given piece of land has 10 critical areas, Bob can “plug in” the value of 10 for X in the regression equation:

\(Cost = 49.542 + 10.417 * 10\)

This results in a predicted cost of:

\(153.712 = 49.542 + 10.417 * 10\)

So, if Bob knows a piece of land has 10 critical areas, he can predict the development cost will be about 153 dollars!

Using the 10 critical features allowed Bob to predict the development cost, but there is an important distinction to make between predicting an “AVERAGE” cost and a “SPECIFIC” cost. These are represented by “CONFIDENCE INTERVALS” versus “PREDICTION INTERVALS” for new observations. (Notice the difference here: we are referring to a new observation, as opposed to above, where we used a confidence interval for the estimate of the slope!)

The mean response at a given X value is given by:

\(E(Y)=\beta_0+\beta_1X\)

Inferences about Outcome for New Observation

  • The point estimate for the outcome at \(X = x\) is provided above.
  • The interval used to estimate the mean response is called the confidence interval. Minitab calculates this for us.
  • The interval used to estimate (or predict) an individual outcome is called the prediction interval.

For a given x value, the prediction interval and confidence interval have the same center, but the prediction interval is wider than the confidence interval. That makes good sense since it is harder to estimate a value for a single subject (for example, a particular piece of land in Bob’s town that may have some unique features) than it would be to estimate the average for all pieces of land. Again, Minitab will calculate this interval as well.
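
For those curious how this looks in code, statsmodels reports both intervals for a new observation. A sketch, continuing the hypothetical placeholder data from the earlier example (not Bob's actual numbers):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical placeholder data, for illustration only.
critical_areas = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
cost = np.array([60.1, 70.3, 80.9, 91.2, 101.8, 112.1, 122.5, 132.9])

X = sm.add_constant(critical_areas)
res = sm.OLS(cost, X).fit()

# Predict at a new value, e.g. 10 critical areas. has_constant="add" forces
# the intercept column even though there is only one new row.
new_X = sm.add_constant(np.array([10.0]), has_constant="add")
pred = res.get_prediction(new_X).summary_frame(alpha=0.05)

# mean_ci_* is the confidence interval for the mean response;
# obs_ci_* is the (wider) prediction interval for a single new observation.
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])
```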


8.2.5 - SLR with Minitab

Minitab®

Simple Linear Regression with Minitab

  1. Select Stat > Regression > Regression > Fit Regression Model
  2. In the box labeled "Response", specify the desired response variable.
  3. In the box labeled "Predictors", specify the desired predictor variable.
  4. Select OK. The basic regression analysis output will be displayed in the session window.

To check assumptions...

  1. Click Graphs.
  2. Under 'Residual plots', choose 'Four in one.'
  3. Select OK.
