8: Regression (General Linear Models Part I)


Overview

Case-Study: Land development

Bob works in a local government and is responsible for approving development proposals. As part of his work, developers come to him with proposals for developing land in his town. His work involves assessing the amount of “critical areas” such as wetlands, rivers, streams, and landslide hazard areas in a given area of land. Lately he has noticed that the dollar amount of a proposal appears to increase as the number of critical areas in the land increases. However, he needs to test his theory; let’s see if we can help him.

The first step is for Bob to look at his data. He wants to use the number of critical areas to predict the dollar amount in the proposal. Both of these variables are quantitative. Below are the descriptive statistics for Bob’s data.

Descriptive Statistics: Critical Areas, Cost

Variable        N   N*  Mean    SE Mean  StDev   Minimum  Q1      Median  Q3      Maximum
Critical Areas  95  0   4.7990  0.0979   0.9539  2.1296   4.0708  4.8344  5.4800  6.7657
Cost            95  0   99.53   1.03     9.99    72.71    92.56   100.71  106.17  118.04

Bob is looking to predict a quantitative variable based on the known value of another quantitative variable. Because he wants to predict, he needs to consider a regression.

Before we get started thinking about regression, let’s take a step back. A regression is a simple linear model. If that sounds strange to you, you can think about a linear model as an equation where we consider Y to be some function of X. In other words, Y = f(X). Both the topic in this unit (regression) and that in the next unit (ANOVA) are linear models. With advances in both computing power and the complexity of designs, separating regression and ANOVA is more a matter of semantics than substance. That said, this unit will focus on regression, with ANOVA coming later in the course. Let’s get back to Bob.

Bob did a great job understanding correlations and scatterplots, so he creates a scatterplot of the data.

He recognizes that he has two quantitative variables, dollar amount and number of critical areas, and that they have a strong positive linear relationship. However, he learned that the limitation of correlation is that it cannot lead to insights about causality between variables. Now he needs a new statistical technique.

Regression analysis provides the evidence Bob is seeking: specifically, how a variable of interest is affected by one or more other variables. In Bob’s example, he is using the number of critical areas to predict the dollar amount.

Before we get started with regression, it is important to distinguish between the variable of interest and the variable(s) we will use to predict it.

Response Variable
Denoted Y; also called the variable of interest or dependent variable. In Bob's example, this is the dollar amount.
Predictor Variable
Denoted X; also called the explanatory variable or independent variable. In Bob’s example, this is the number of critical areas.

When there is only one predictor variable, we refer to the regression model as a simple linear regression model.

In statistics, we can describe how variables are related using a mathematical function, as we did with the linear model above. We refer to this model as the simple linear regression model.

Objectives

Upon completion of this lesson, you should be able to:

  • Identify the slope, intercept, and coefficient of determination
  • Calculate predicted and residual (error) values
  • Test the significance of the slope, including statement of the null hypothesis for the slope
  • State the assumptions for regression

8.1 - Linear Relationships


To define a useful model, we must investigate the relationship between the response and the predictor variables. As mentioned before, the focus of this Lesson is linear relationships. For a brief review of linear functions, recall that the equation of a line has the following form:

\(y=mx+b\)

where m is the slope and b is the y-intercept.

Given two points on a line, \(\left(x_1,y_1\right)\) and \(\left(x_2, y_2\right)\), the slope is calculated by:

\begin{align} m&=\dfrac{y_2-y_1}{x_2-x_1}\\&=\dfrac{\text{change in y}}{\text{change in x}}\\&=\frac{\text{rise}}{\text{run}} \end{align}
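
For example, for two hypothetical points (not from Bob’s data), say \(\left(1, 3\right)\) and \(\left(4, 9\right)\), the slope works out to:

\(m=\dfrac{9-3}{4-1}=\dfrac{6}{3}=2\)

so \(y\) rises by 2 units for every 1-unit increase in \(x\).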

The slope of a line describes a lot about the linear relationship between two variables. If the slope is positive, then there is a positive linear relationship, i.e., as one increases, the other increases. If the slope is negative, then there is a negative linear relationship, i.e., as one increases the other variable decreases. If the slope is 0, then as one increases, the other remains constant.


8.2 - Simple Linear Regression


For Bob’s simple linear regression example, he wants to see how changes in the number of critical areas (the predictor variable) impact the dollar amount for land development (the response variable). In other words, as the number of critical areas increases, does the cost tend to increase, decrease, or stay constant?

We test this by using the characteristics of linear relationships, particularly the slope as defined above. Remember from hypothesis testing that we test the null hypothesis that a value is zero. We extend this principle to the slope, with a null hypothesis that the slope is equal to zero. A non-zero slope indicates a significant impact of the predictor variable on the response variable, whereas a zero slope indicates that changes in the predictor variable do not impact the response.

Let’s take a closer look at the linear model producing our regression results.


8.2.1 - Assumptions for the SLR Model


Before we get started in interpreting the output it is critical that we address specific assumptions for regression.  Not meeting these assumptions is a flag that the results are not valid (the results cannot be interpreted with any certainty because the model may not fit the data).  

In this section, we will present the assumptions needed to perform the hypothesis test for the population slope:

\(H_0\colon \ \beta_1=0\)

\(H_a\colon \ \beta_1\ne0\)

We will also demonstrate how to verify that they are satisfied. To verify the assumptions, you must run the analysis in Minitab first. (A sketch of the same checks in Python appears after the list below.)

Assumptions for Simple Linear Regression

  1. Linearity: The relationship between \(X\) and \(Y\) must be linear.

    Check this assumption by examining a scatterplot of x and y.

  2. Independence of errors: There is no relationship between the residuals and the \(Y\) variable; in other words, \(Y\) is independent of the errors.

    Check this assumption by examining a scatterplot of “residuals versus fits”; the correlation should be approximately 0. In other words, the plot should not appear to show any relationship.

  3. Normality of errors: The residuals must be approximately normally distributed.

    Check this assumption by examining a normal probability plot; the observations should be near the line. You can also examine a histogram of the residuals; it should be approximately normally distributed.

  4. Equal variances: The variance of the residuals is the same for all values of \(X\).

    Check this assumption by examining the scatterplot of “residuals versus fits”; the variance of the residuals should be the same across all values of the x-axis. If the plot shows a pattern (e.g., bowtie or megaphone shape), then variances are not consistent, and this assumption has not been met.
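
For readers who want to see these checks outside Minitab, here is a minimal sketch in Python (statsmodels and matplotlib). The data are synthetic stand-ins shaped roughly like Bob’s, not his actual observations.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for Bob's data: 95 parcels, cost ~ 49.5 + 10.4*areas + noise
areas = rng.uniform(2, 7, 95)
cost = 49.5 + 10.4 * areas + rng.normal(0, 1.06, 95)

model = sm.OLS(cost, sm.add_constant(areas)).fit()

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(areas, cost)                      # 1. linearity: scatterplot of x and y
axes[0, 0].set_title("Y vs X")
axes[0, 1].scatter(model.fittedvalues, model.resid)  # 2. and 4. residuals versus fits
axes[0, 1].axhline(0)
axes[0, 1].set_title("Residuals vs fits")
sm.qqplot(model.resid, line="s", ax=axes[1, 0])      # 3. normal probability plot
axes[1, 0].set_title("Normal probability plot")
axes[1, 1].hist(model.resid)                         # 3. histogram of residuals
axes[1, 1].set_title("Histogram of residuals")
plt.tight_layout()
plt.show()

This mirrors the “Four in one” residual plot layout that Minitab produces (see Section 8.2.5).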


8.2.2 - The SLR Model


The errors referred to in the assumptions are only one component of the linear model. The basis of the model is the observations, considered as coordinate pairs \((x_i, y_i)\), for \(i=1, \dots, n\). The points, \(\left(x_1,y_1\right), \dots,\left(x_n,y_n\right)\), may not fall exactly on a line (like the cost and number of critical areas). This gap is the error!

The graph below is an example of a scatterplot with its least-squares regression line and the different parts of the plot labeled.

The graph below summarizes the least-squares regression for Bob's data. We will define what we mean by least-squares regression in more detail later in the Lesson; for now, focus on how the red line (the regression line) "fits" the blue dots (Bob's data).

We combine the linear relationship along with the error in the simple linear regression model.

 

Simple Linear Regression Model

The general form of the simple linear regression model is...

\(Y=\beta_0+\beta_1X+\epsilon\)

For an individual observation,

\(y_i=\beta_0+\beta_1x_i+\epsilon_i\)

where,

  • \(\beta_0\) is the population y-intercept,
  • \(\beta_1\) is the population slope, and
  • \(\epsilon_i\) is the error or deviation of \(y_i\) from the line, \(\beta_0+\beta_1x_i\)

To make inferences about these unknown population parameters (namely the slope and intercept), we must find estimates for them. There are different ways to estimate the parameters from a sample. This is where we get to the least-squares method.

Least Squares Line

The least-squares line is the line for which the sum of squared errors of predictions for all sample points is the least.

Using the least-squares method, we can find estimates for the two parameters.

The formulas to calculate least squares estimates are:

Sample Slope
\(\hat{\beta}_1=\dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\)
Sample Intercept
\(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\)
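
You will not need to compute these by hand (see the Note below), but as a minimal sketch the two formulas translate directly into Python. The toy data here are hypothetical, not Bob’s:

import numpy as np

def least_squares(x, y):
    # Return (intercept, slope) estimates for simple linear regression.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    intercept = ybar - slope * xbar
    return intercept, slope

b0, b1 = least_squares([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")   # intercept = 1.150, slope = 1.940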

The least squares line for Bob’s data is the red line on the scatterplot below.

Note! You will not be expected to memorize these formulas or to find the estimates by hand. We will use Minitab to find these estimates for you. We estimate the population slope, \(\beta_1\), with the sample slope denoted \(\hat{\beta_1}\). The population intercept, \(\beta_0\), is estimated with the sample intercept denoted \(\hat{\beta_0}\). The intercept is often referred to as the constant or the constant term. Once the parameters are estimated, we have the least-squares regression equation (also called the estimated regression line).

Let’s jump ahead for a moment and generate the regression output. Below we will work through the content of the output. The regression output for Bob’s data looks like this:

Coefficients

Predictor       Coef    SE Coef  T-Value  P-Value  VIF
Constant        49.542  0.560    88.40    0.000
Critical Areas  10.417  0.115    90.92    0.000    1.00

Regression Equation

Cost = 49.542 + 10.417 Critical Areas
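
As a hedged aside for readers following along in Python rather than Minitab, the same kind of output can be produced with statsmodels. The data below are synthetic stand-ins for Bob’s, so the estimates will be close to, but not exactly, 49.542 and 10.417:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
areas = rng.uniform(2, 7, 95)                         # synthetic predictor
cost = 49.5 + 10.4 * areas + rng.normal(0, 1.06, 95)  # synthetic response

fit = sm.OLS(cost, sm.add_constant(areas)).fit()
print(fit.summary())  # coefficient table: estimates, SEs, T-values, P-values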


8.2.3 - Interpreting the Coefficients


Once we have the estimates for the slope and intercept, we need to interpret them. For Bob’s data, the estimate for the slope is 10.417 and the estimate for the intercept (constant) is 49.542. Recall from the beginning of the Lesson what the slope of a line means algebraically. If the slope is denoted as \(m\), then

\(m=\dfrac{\text{change in y}}{\text{change in x}}\)

Going back to algebra, the intercept is the value of y when \(x = 0\). It has the same interpretation in statistics.

Interpreting the intercept of the regression equation, \(\hat{\beta}_0\)

\(\hat{\beta}_0\) is the \(Y\)-intercept of the regression line. When \(X = 0\) is within the scope of observation, \(\hat{\beta}_0\) is the estimated value of Y when \(X = 0\).

Note, however, that when \(X = 0\) is not within the scope of the observations, the Y-intercept is usually not of interest. In Bob’s example, \(X = 0\) corresponds to a plot of land with no critical areas, with a predicted cost of 49.542. This might be of interest in establishing a baseline value, but since Bob is specifically looking at land that HAS critical areas, the intercept might not be of much interest to him.

As we already noted, the slope of a line is the change in the y variable over the change in the x variable. If the change in the x variable is one, then the slope is:

\(m=\dfrac{\text{change in y}}{1}\)

The slope is interpreted as the change of y for a one unit increase in x. In Bob’s example, for every one unit change in critical areas, the cost of development increases by 10.417.

Interpreting the slope of the regression equation, \(\hat{\beta}_1\)

\(\hat{\beta}_1\) represents the estimated change in Y per unit increase in X

Note that the change may be negative, which is reflected when \(\hat{\beta}_1\) is negative, or positive, when \(\hat{\beta}_1\) is positive.

If the slope of the line is positive, as it is in Bob’s example, then there is a positive linear relationship, i.e., as one increases, the other increases. If the slope is negative, then there is a negative linear relationship, i.e., as one increases the other variable decreases. If the slope is 0, then as one increases, the other remains constant, i.e., no predictive relationship.

Therefore, we are interested in testing the following hypotheses:

\(H_0\colon \beta_1=0\\H_a\colon \beta_1\ne0\)

Let’s take a closer look at the hypothesis test for the estimate of the slope. A similar test for the population intercept, \(\beta_0\), is not discussed in this class because it is not typically of interest.


8.2.4 - Hypothesis Test for the Population Slope


As mentioned, the test for the slope follows the logic of a one-sample hypothesis test for the mean. Typically (and as will be the case in this course) we test the null hypothesis that the slope is equal to zero. However, it is also possible to test the null hypothesis that the slope is zero or less (against a positive alternative), or that the slope is zero or greater (against a negative alternative).

 

  • Is there a linear relationship? Null hypothesis: \(\beta_1=0\); alternative hypothesis: \(\beta_1\ne0\) (two-tailed, non-directional test)
  • Is there a positive linear relationship? Null hypothesis: \(\beta_1=0\); alternative hypothesis: \(\beta_1>0\) (right-tailed, directional test)
  • Is there a negative linear relationship? Null hypothesis: \(\beta_1=0\); alternative hypothesis: \(\beta_1<0\) (left-tailed, directional test)

The test statistic for the test of population slope is:

\(t^*=\dfrac{\hat{\beta}_1}{\hat{SE}(\hat{\beta}_1)}\)

where \(\hat{SE}(\hat{\beta}_1)\) is the estimated standard error of the sample slope (found in Minitab output). Under the null hypothesis and with the assumptions shown in the previous section, \(t^*\) follows a \(t\)-distribution with \(n-2\) degrees of freedom.
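
Using the rounded values from Bob’s output below (slope 10.417, standard error 0.115):

\(t^*=\dfrac{10.417}{0.115}\approx 90.6\)

which agrees with Minitab’s reported T-Value of 90.92 up to the rounding of the displayed coefficient and standard error.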

Take another look at the output from Bob’s data.

Coefficients

Predictor       Coef    SE Coef  T-Value  P-Value  VIF
Constant        49.542  0.560    88.40    0.000
Critical Areas  10.417  0.115    90.92    0.000    1.00

Regression Equation

Cost = 49.542 + 10.417 Critical Areas

Here we can see that the “T-Value” is 90.92, a very large t-value indicating that the value for the slope calculated by the least-squares method (10.417) is very different from the null value for the slope (zero). This results in a small probability that the null is true (the P-Value is less than .05), so Bob can reject the null and conclude that the slope is not zero. Therefore, the number of critical areas significantly predicts the cost of development.

He can be more specific and conclude that for every one unit change in critical areas, the cost of development increases by 10.417.

Note! In this class, we will have Minitab perform the calculations for this test. Minitab's output gives the result for two-tailed tests for \(\beta_1\) and \(\beta_0\). If you wish to perform a one-sided test, you would have to adjust the p-value Minitab provides.

As with most of our calculations, we need to allow some room for imprecision in our estimate. We return to the concept of confidence intervals to build in some error around the estimate of the slope.

The \( (1-\alpha)100\%\) confidence interval for \(\beta_1\) is:

\(\hat{\beta}_1\pm t_{\alpha/2}\left(\hat{SE}(\hat{\beta}_1)\right)\)

where \(t\) has \(n-2\) degrees of freedom.

Note! The degrees of freedom of t depends on the number of independent variables. The degrees of freedom is \(n - 2\) when there is only one independent variable.
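
As a minimal sketch, assuming Bob’s rounded output values (slope 10.417, standard error 0.115, \(n = 95\)), the 95% confidence interval can be computed in Python with scipy:

from scipy import stats

b1_hat, se, n = 10.417, 0.115, 95
t_crit = stats.t.ppf(0.975, df=n - 2)  # two-sided 95%, df = n - 2 = 93
lower, upper = b1_hat - t_crit * se, b1_hat + t_crit * se
print(f"t_crit = {t_crit:.3f}, CI = ({lower:.2f}, {upper:.2f})")
# roughly (10.19, 10.65)

So a plausible range for the population slope runs from about 10.19 to about 10.65.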

The final piece of output from Minitab is the Least Squares Regression Equation. Remember that Bob is interested in being able to predict the development cost of land given the number of critical areas. Bob can use the equation to do this.

If a given piece of land has 10 critical areas, Bob can “plug in” the value 10 for X. The resulting equation

\(Cost = 49.542 + 10.417 * 10\)

results in a predicted cost of:

\(153.712 = 49.542 + 10.417 * 10\)

So, if Bob knows a piece of land has 10 critical areas, he can predict the development cost will be about 153 dollars!

Using the 10 critical areas allowed Bob to predict the development cost, but there is an important distinction to make between predicting an “average” cost and a “specific” cost. These are represented by confidence intervals versus prediction intervals for new observations. (Notice the difference: here we are referring to a new observation, as opposed to above, when we used a confidence interval for the estimate of the slope!)

The mean response at a given X value is given by:

\(E(Y)=\beta_0+\beta_1X\)

Inferences about Outcome for New Observation

  • The point estimate for the outcome at \(X = x\) is provided above.
  • The interval to estimate the mean response is called the confidence interval. Minitab calculates this for us.
  • The interval used to estimate (or predict) an outcome for a single new observation is called the prediction interval.

For a given x value, the prediction interval and confidence interval have the same center, but the width of the prediction interval is wider than the width of the confidence interval. That makes good sense since it is harder to estimate a value for a single subject (for example a particular piece of land in Bob’s town that may have some unique features)  than it would be to estimate the average for all pieces of land. Again, Minitab will calculate this interval as well.
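
As a hedged sketch (Python/statsmodels rather than Minitab, again with synthetic stand-in data), both intervals at \(X = 10\) can be requested like this:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
areas = rng.uniform(2, 7, 95)
cost = 49.5 + 10.4 * areas + rng.normal(0, 1.06, 95)

fit = sm.OLS(cost, sm.add_constant(areas)).fit()
pred = fit.get_prediction([[1.0, 10.0]])  # [constant, X = 10]
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])  # CI for the mean response
print(frame[["obs_ci_lower", "obs_ci_upper"]])            # wider PI for one new parcel

The obs_ci (prediction) interval comes out wider than the mean_ci (confidence) interval, for exactly the reason given above.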


8.2.5 - SLR with Minitab


Minitab®

Simple Linear Regression with Minitab

  1. Select Stat > Regression > Regression > Fit Regression Model
  2. In the box labeled "Response", specify the desired response variable.
  3. In the box labeled "Predictors", specify the desired predictor variable.
  4. Select OK. The basic regression analysis output will be displayed in the session window.

To check assumptions...

  1. Click Graphs.
  2. In 'Residuals plots', choose 'Four in one'.
  3. Select OK.

8.3 - Cautions with Linear Regression


Extrapolation is applying a regression model to X-values outside the range of the sample X-values in order to predict values of the response variable Y. For example, Bob would not want to use a regression model built on data from an urban area to predict the dollar amounts of proposals in his town if his town is rural.

Second, if no linear relationship exists (i.e., the correlation is zero), it does not imply there is no relationship at all. A scatterplot will reveal whether other possible relationships may exist. A classic example is \(y = x^2\) over a range centered at zero: X and Y are perfectly related, but not linearly, and the correlation is zero, as the sketch below illustrates.
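
A minimal sketch in Python (hypothetical data, chosen only to make the point):

import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2                       # y is completely determined by x ...
print(np.corrcoef(x, y)[0, 1])   # ... yet the Pearson correlation is ~0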

Outliers and Influential Observations

Influential observations are points whose removal causes the regression equation to change considerably. They are flagged by Minitab in the unusual observations list and denoted by X. Outliers are points that lie outside the overall pattern of the data. Potential outliers are flagged by Minitab in the unusual observations list and denoted by R. The following is the Minitab output for the unusual observations within Bob’s study:

Fits and Diagnostics for Unusual Observations

Obs  Cost     Fit      Resid   Std Resid
  1   72.714   71.725   0.989   0.98        X
  2   78.825   75.829   2.996   2.93      R X
  6   81.967   84.507  -2.540  -2.44      R
  7   83.490   85.640  -2.150  -2.06      R
 85  113.540  111.440   2.100   2.01      R

R = Large Residual
X = Unusual X

Some observations may be both outliers and influential, and these are flagged by both R and X. Those observational points merit particular attention because they are not well “fit” by the model and may be influencing conclusions or indicate that an alternative model is needed.


8.4 - Estimating the Standard Deviation of the Error Term


Our simple linear regression model is:

\(Y=\beta_0+\beta_1X+\epsilon\)

The errors for the \(n\) observations are denoted as \(\epsilon_i\), for \(i=1, \dots, n\). One of our assumptions is that the errors have equal variance (or equal standard deviation). We can estimate the standard deviation of the error by finding the standard deviation of the residuals, \(\hat{\epsilon}_i=y_i-\hat{y}_i\). Minitab also provides the estimate for us, denoted as \(S\), under the Model Summary. We can also calculate it by:

\(s=\sqrt{\text{MSE}}\)

Find the MSE in the ANOVA table, under the Adj MS column and the Error row. The value of 1.12 represents the average squared error. This becomes the denominator for the F test.

Analysis of Variance

Source            DF  Adj SS  Adj MS   F-Value  P-Value
Regression         1  9281.7  9281.72  8267.30  0.000
  Critical Areas   1  9281.7  9281.72  8267.30  0.000
Error             93   104.4     1.12
Total             94  9386.1
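
Plugging in the MSE from the table above:

\(s=\sqrt{\text{MSE}}=\sqrt{1.12}\approx 1.06\)

which agrees with the value \(S = 1.05958\) that Minitab reports in the Model Summary (shown in the next section).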

8.5 - Coefficient of Determination


Now that we know how to estimate the coefficients and perform the hypothesis test, is there any way to tell how useful the model is?

One measure is the coefficient of determination, denoted \(R^2\).

Coefficient of Determination \(R^2\)
The coefficient of determination measures the percentage of variability within the \(y\)-values that can be explained by the regression model.

Therefore, a value close to 100% means that the model is useful and a value close to zero indicates that the model is not useful.

It can be shown by mathematical manipulation that:

\(\text{SST }=\text{ SSR }+\text{ SSE}\)

\(\sum (y_i-\bar{y})^2=\sum (\hat{y}_i-\bar{y})^2+\sum (y_i-\hat{y}_i)^2\)

Total variability in the y value = Variability explained by the model + Unexplained variability
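
For Bob’s data, these three sums of squares appear in the ANOVA table from the previous section, and indeed \(9281.7 + 104.4 = 9386.1\), i.e., \(\text{SSR}+\text{SSE}=\text{SST}\).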

To get the total, explained, and unexplained variability, first we need to calculate the corresponding deviances. The total deviance \((y_i-\bar{y})\) is split into the explained deviance \((\hat{y}_i-\bar{y})\) and the unexplained deviance \((y_i-\hat{y}_i)\).

The breakdown of variability in the above equation holds for the multiple regression model as well.

Coefficient of Determination \(R^2\) Formula

\(R^2=\dfrac{\text{variability explained by the model}}{\text{total variability in the y values}}\)

\(R^2\) represents the proportion of total variability of the \(y\)-value that is accounted for by the independent variable \(x\).

For the specific case when there is only one independent variable \(X\) (i.e., simple linear regression), one can show that \(R^2 = r^2\), where \(r\) is the correlation coefficient between \(X\) and \(Y\). For Bob’s data, the correlation of the two variables is 0.994 and the \(R^2\) value is 98.89%.
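
Both routes give the same answer up to rounding: from the ANOVA table, \(R^2=\dfrac{\text{SSR}}{\text{SST}}=\dfrac{9281.7}{9386.1}\approx 0.9889\), and from the correlation, \(r^2=(0.994)^2\approx 0.988\), matching the 98.89% reported below.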

Correlations

Pearson correlation  0.994
P-value              0.000

Model Summary

S        R-sq    R-sq(adj)  R-sq(pred)
1.05958  98.89%  98.88%     98.82%

Minitab®

Finding Correlation

  1. Select Stat > Basic statistics > Correlation
  2. Specify the two (or more) variables for which you want the correlation coefficient(s) calculated.
    • Pearson correlation is the default.  An optional Spearman rho method is also available. 
  3. If it isn't already checked, put a checkmark in the box labeled Display p-values by clicking once on the box.
  4. Select OK. The output will appear in the session window.

8.6 - Lesson Summary


Now that you have seen what the components of a regression are, you can see that a regression is a linear model (following the form \(y=b_0+b_1x+\text{error}\)). In our next lesson, we will learn that this same linear model can analyze a categorical variable’s impact on a quantitative (y) variable.

 

Let’s return to Bob. Bob can now state with confidence that the number of critical areas does significantly impact the dollar amount of a land development proposal. Hopefully his growing statistical skills will make his work with developers more effective and efficient!

