8: Regression (General Linear Models Part I)
Overview
Case Study: Land development
Bob works in a local government and is responsible for approving development proposals. As part of his work, developers come to him with proposals for developing land in his town. His work involves assessing the amount of “critical areas,” such as wetlands, rivers, streams, and landslide hazard areas, in a given area of land. Lately he has noticed that the dollar amount of a proposal appears to increase as the number of critical areas in the land increases. However, he needs to test his theory; let’s see if we can help him.

The first step is for Bob to look at his data. He wants to use the number of critical areas to predict the dollar amount in the proposal. Both of these variables are quantitative. Below are the descriptive statistics for Bob’s data.
Descriptive Statistics: Critical Areas, Cost
Variable        N   N*  Mean    SE Mean  StDev   Minimum  Q1      Median  Q3      Maximum
Critical Areas  95  0   4.7990  0.0979   0.9539  2.1296   4.0708  4.8344  5.4800  6.7657
Cost            95  0   99.53   1.03     9.99    72.71    92.56   100.71  106.17  118.04
Bob is looking to predict a quantitative variable based on the observed value of another quantitative variable. Because he wants to predict, he needs to consider a regression.
Before we get started thinking about regression, let’s take a step back. A regression is a simple linear model. If that sounds strange to you, you can think of a linear model as an equation where we consider Y to be some function of X. In other words, Y = f(X). Both the topic in this unit (regression) and that in the next unit (ANOVA) are linear models. With advances in both computing power and the complexity of designs, separating regression and ANOVA is really more a matter of semantics than substance. That said, this unit will focus on regression, with ANOVA coming later in the course. Let’s get back to Bob.
Bob did a great job understanding correlations and scatterplots, so he creates a scatterplot of the data.
He recognizes that he has two quantitative variables, dollar amount and number of critical areas, and that they have a strong positive linear relationship. However, he learned that the limitation of correlation is that the technique cannot lead to insights about causality between variables. Now he needs a new statistical technique.
Regression analysis provides the evidence that Bob is seeking: specifically, how a variable of interest is affected by one or more other variables. In Bob’s example, he is using the number of critical areas to predict the dollar amount.
Before we get started with regression, it is important to distinguish between the variable of interest and the variable(s) we will use to predict it.
 Response Variable
 Denoted \(Y\); also called the variable of interest or dependent variable. In Bob's example, this is the dollar amount.
 Predictor Variable
 Denoted \(X\); also called the explanatory variable or independent variable. In Bob’s example, this is the number of critical areas.
When there is only one predictor variable, we refer to the regression model as a simple linear regression model.
In statistics, we describe how the variables are related using a mathematical function of the kind we described above as a linear model. We refer to this model as the simple linear regression model.
Objectives
 Identify the slope, intercept, and coefficient of determination
 Calculate predicted and residual (error) values
 Test the significance of the slope, including statement of the null hypothesis for the slope
 State the assumptions for regression
8.1  Linear Relationships
To define a useful model, we must investigate the relationship between the response and the predictor variables. As mentioned before, the focus of this Lesson is linear relationships. For a brief review of linear functions, recall that the equation of a line has the following form:
\(y=mx+b\)
where m is the slope and b is the y-intercept.
Given two points on a line, \(\left(x_1,y_1\right)\) and \(\left(x_2, y_2\right)\), the slope is calculated by:
\begin{align} m&=\dfrac{y_2-y_1}{x_2-x_1}\\&=\dfrac{\text{change in y}}{\text{change in x}}\\&=\frac{\text{rise}}{\text{run}} \end{align}
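The slope formula translates directly into a few lines of Python; the points here are made up purely for illustration:

```python
def slope(p1, p2):
    """Slope of the line through two points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)   # rise over run

print(slope((1, 2), (3, 8)))   # rise of 6 over run of 2 gives 3.0
```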
The slope of a line describes a lot about the linear relationship between two variables. If the slope is positive, then there is a positive linear relationship, i.e., as one increases, the other increases. If the slope is negative, then there is a negative linear relationship, i.e., as one increases the other variable decreases. If the slope is 0, then as one increases, the other remains constant.
8.2  Simple Linear Regression
For Bob’s simple linear regression example, he wants to see how changes in the number of critical areas (the predictor variable) affect the dollar amount for land development (the response variable). As the number of critical areas increases, does the cost tend to increase, decrease, or stay constant?
We test this using the characteristics of linear relationships, particularly the slope as defined above. Recall from hypothesis testing that we test the null hypothesis that a value is zero. We extend this principle to the slope, with a null hypothesis that the slope is equal to zero. A nonzero slope indicates a significant impact of the predictor variable on the response variable, whereas a zero slope indicates that changes in the predictor variable do not impact changes in the response.
Let’s take a closer look at the linear model producing our regression results.
8.2.1  Assumptions for the SLR Model
Before we get started interpreting the output, it is critical that we address specific assumptions for regression. Not meeting these assumptions is a flag that the results are not valid (the results cannot be interpreted with any certainty because the model may not fit the data).
In this section, we will present the assumptions needed to perform the hypothesis test for the population slope:
\(H_0\colon \ \beta_1=0\)
\(H_a\colon \ \beta_1\ne0\)
We will also demonstrate how to verify if they are satisfied. To verify the assumptions, you must run the analysis in Minitab first.
Assumptions for Simple Linear Regression
 Linearity: The relationship between \(X\) and \(Y\) must be linear.
Check this assumption by examining a scatterplot of x and y.
 Independence of errors: There is not a relationship between the residuals and the Y variable; in other words, Y is independent of errors.
Check this assumption by examining a scatterplot of “residuals versus fits”; the correlation should be approximately 0. In other words, the plot should not look like there is a relationship.
 Normality of errors: The residuals must be approximately normally distributed.
Check this assumption by examining a normal probability plot; the observations should be near the line. You can also examine a histogram of the residuals; it should be approximately normally distributed.
 Equal variances: The variance of the residuals is the same for all values of \(X\).
Check this assumption by examining the scatterplot of “residuals versus fits”; the variance of the residuals should be the same across all values of the xaxis. If the plot shows a pattern (e.g., bowtie or megaphone shape), then variances are not consistent, and this assumption has not been met.
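One way to see why the “residuals versus fits” check works: for any least-squares fit with an intercept, the residuals are uncorrelated with the fitted values by construction, so any visible pattern signals a problem with the model rather than with the method. A small simulation (hypothetical data, not Bob's) illustrates this:

```python
import random

random.seed(0)
n = 50
x = [random.uniform(2, 7) for _ in range(n)]            # hypothetical predictor values
y = [50 + 10 * xi + random.gauss(0, 1) for xi in x]     # linear model plus noise

# Fit by least squares (formulas appear later in the Lesson).
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fits = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fits)]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)

print(corr(resid, fits))   # essentially 0, by construction of least squares
```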
8.2.2  The SLR Model
The errors referred to in the assumptions are only one component of the linear model. The basis of the model is the observations, considered as coordinates, \((x_i, y_i)\), for \(i=1, \dots, n\). The points, \(\left(x_1,y_1\right), \dots,\left(x_n,y_n\right)\), may not fall exactly on a line (like the cost and number of critical areas). This gap is the error!
The graph below summarizes the least-squares regression for Bob's data. We will define what we mean by least-squares regression in more detail later in the Lesson; for now, focus on how the red line (the regression line) "fits" the blue dots (Bob's data).
We combine the linear relationship along with the error in the simple linear regression model.
Simple Linear Regression Model

The general form of the simple linear regression model is...
\(Y=\beta_0+\beta_1X+\epsilon\)
For an individual observation,
\(y_i=\beta_0+\beta_1x_i+\epsilon_i\)
where,
 \(\beta_0\) is the population y-intercept,
 \(\beta_1\) is the population slope, and
 \(\epsilon_i\) is the error or deviation of \(y_i\) from the line, \(\beta_0+\beta_1x_i\)
To make inferences about these unknown population parameters (namely the slope and intercept), we must find estimates for them. There are different ways to estimate the parameters from the sample. This is where we get to the least-squares method.
Least Squares Line
 The least-squares line is the line for which the sum of squared errors of prediction for all sample points is the least.
Using the leastsquares method, we can find estimates for the two parameters.
The formulas to calculate least squares estimates are:
 Sample Slope
 \(\hat{\beta}_1=\dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\)
 Sample Intercept
 \(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\)
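The estimate formulas translate directly into code. Here is a minimal Python sketch using a small made-up sample (not Bob's data):

```python
# Least-squares estimates computed directly from the formulas above,
# using a tiny toy sample for illustration.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Sample slope: sum of cross-products over sum of squared x-deviations.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)

# Sample intercept: y-bar minus slope times x-bar.
b0 = ybar - b1 * xbar

print(b1, b0)   # approximately 1.9 and 0.0 for this toy sample
```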
The least squares line for Bob’s data is the red line on the scatterplot below.
Let’s jump ahead for a moment and generate the regression output. Below we will work through the content of the output. The regression output for Bob’s data looks like this:
Coefficients
Predictor       Coef    SE Coef  T-Value  P-Value  VIF
Constant        49.542  0.560    88.40    0.000
Critical Areas  10.417  0.115    90.92    0.000    1.00
Regression Equation
Cost = 49.542 + 10.417 Critical Areas
8.2.3  Interpreting the Coefficients
Once we have the estimates for the slope and intercept, we need to interpret them. For Bob’s data, the estimate for the slope is 10.417 and the estimate for the intercept (constant) is 49.542. Recall from the beginning of the Lesson what the slope of a line means algebraically. If the slope is denoted as \(m\), then
\(m=\dfrac{\text{change in y}}{\text{change in x}}\)
Going back to algebra, the intercept is the value of y when \(x = 0\). It has the same interpretation in statistics.
Interpreting the intercept of the regression equation, \(\hat{\beta}_0\) is the \(Y\)intercept of the regression line. When \(X = 0\) is within the scope of observation, \(\hat{\beta}_0\) is the estimated value of Y when \(X = 0\).
Note, however, when \(X = 0\) is not within the scope of the observation, the Y-intercept is usually not of interest. In Bob’s example, \(X = 0\) would be a plot of land with no critical areas, with an estimated cost of 49.542. This might be of interest in establishing a baseline value, but because Bob is specifically looking at land that HAS critical areas, it might not be of much interest to him.
As we already noted, the slope of a line is the change in the y variable over the change in the x variable. If the change in the x variable is one, then the slope is:
\(m=\dfrac{\text{change in y}}{1}\)
The slope is interpreted as the change of y for a one unit increase in x. In Bob’s example, for every one unit change in critical areas, the cost of development increases by 10.417.
Interpreting the slope of the regression equation, \(\hat{\beta}_1\)
\(\hat{\beta}_1\) represents the estimated change in Y per unit increase in X
Note that the change may be negative, which is reflected when \(\hat{\beta}_1\) is negative, or positive, when \(\hat{\beta}_1\) is positive.
If the slope of the line is positive, as it is in Bob’s example, then there is a positive linear relationship, i.e., as one increases, the other increases. If the slope is negative, then there is a negative linear relationship, i.e., as one increases the other variable decreases. If the slope is 0, then as one increases, the other remains constant, i.e., no predictive relationship.
Therefore, we are interested in testing the following hypotheses:
\(H_0\colon \beta_1=0\\H_a\colon \beta_1\ne0\)
Let’s take a closer look at the hypothesis test for the estimate of the slope. A similar test for the population intercept, \(\beta_0\), is not discussed in this class because it is not typically of interest.
8.2.4  Hypothesis Test for the Population Slope
As mentioned, the test for the slope follows the logic of a one-sample hypothesis test for the mean. Typically (as will be the case in this course), we test the null hypothesis that the slope is equal to zero against a two-sided alternative. However, it is also possible to test against the one-sided alternative that the slope is greater than zero, or that the slope is less than zero.
Research Question       Is there a linear relationship?  Is there a positive linear relationship?  Is there a negative linear relationship?
Null Hypothesis         \(\beta_1=0\)                    \(\beta_1=0\)                             \(\beta_1=0\)
Alternative Hypothesis  \(\beta_1\ne0\)                  \(\beta_1>0\)                             \(\beta_1<0\)
Type of Test            Two-tailed, non-directional      Right-tailed, directional                 Left-tailed, directional
The test statistic for the test of population slope is:
\(t^*=\dfrac{\hat{\beta}_1}{\hat{SE}(\hat{\beta}_1)}\)
where \(\hat{SE}(\hat{\beta}_1)\) is the estimated standard error of the sample slope (found in Minitab output). Under the null hypothesis and with the assumptions shown in the previous section, \(t^*\) follows a \(t\)distribution with \(n2\) degrees of freedom.
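Plugging the rounded values from Bob's coefficient table into this formula gives a quick check (the printed T-Value of 90.92 comes from Minitab's unrounded internal values, so our ratio differs slightly):

```python
# Reproducing the test statistic from the rounded output values.
coef = 10.417   # estimated slope from the Coefficients table
se = 0.115      # its estimated standard error

t_star = coef / se
print(round(t_star, 2))   # about 90.58; Minitab's 90.92 uses unrounded values
```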
Take another look at the output from Bob’s data.
Coefficients
Predictor       Coef    SE Coef  T-Value  P-Value  VIF
Constant        49.542  0.560    88.40    0.000
Critical Areas  10.417  0.115    90.92    0.000    1.00
Regression Equation
Cost = 49.542 + 10.417 Critical Areas
Here we can see that the T-Value is 90.92, a very large t-value indicating that the slope calculated by the least-squares method (10.417) is very different from the null value for the slope (zero). This results in a small probability of observing such a slope if the null were true (the P-Value is less than .05), so Bob can reject the null and conclude that the slope is not zero. Therefore, the number of critical areas significantly predicts the cost of development.
He can be more specific and conclude that for every one unit change in critical areas, the cost of development increases by 10.417.
As with most of our calculations, we need to allow some room for imprecision in our estimate. We return to the concept of confidence intervals to build in some error around the estimate of the slope.
The \( (1\alpha)100\%\) confidence interval for \(\beta_1\) is:
\(\hat{\beta}_1\pm t_{\alpha/2}\left(\hat{SE}(\hat{\beta}_1)\right)\)
where \(t\) has \(n2\) degrees of freedom.
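As a sketch, the interval can be computed from the output values. The critical value 1.986 below is our approximate \(t_{\alpha/2}\) for 93 degrees of freedom at 95% confidence (an assumption; in practice, look it up in a table or let software supply it):

```python
# Approximate 95% confidence interval for the slope from Bob's output.
coef, se = 10.417, 0.115
t_crit = 1.986   # approximate t critical value for 93 df (assumption)

lower = coef - t_crit * se
upper = coef + t_crit * se
print(round(lower, 3), round(upper, 3))   # roughly 10.189 to 10.645
```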
The final piece of output from Minitab is the Least Squares Regression Equation. Remember that Bob is interested in being able to predict the development cost of land given the number of critical areas. Bob can use the equation to do this.
If a given piece of land has 10 critical areas, Bob can “plug in” the value 10 for X in the regression equation:
\(Cost = 49.542 + 10.417 * 10\)
This results in a predicted cost of:
\(153.712 = 49.542 + 10.417 * 10\)
So, if Bob knows a piece of land has 10 critical areas, he can predict the development cost will be about 153 dollars!
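The plug-in calculation above generalizes to any number of critical areas. A minimal Python sketch (the function name is ours, not Minitab's):

```python
def predicted_cost(critical_areas):
    """Predicted development cost from Bob's fitted regression equation."""
    return 49.542 + 10.417 * critical_areas

print(predicted_cost(10))   # about 153.712, as in the worked example
```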
Using the 10 critical areas allowed Bob to predict the development cost, but there is an important distinction to make between predicting an “AVERAGE” cost and a “SPECIFIC” cost. These are represented by “CONFIDENCE INTERVALS” versus “PREDICTION INTERVALS” for new observations. (Notice the difference here is that we are referring to a new observation, as opposed to above, when we used a confidence interval for the estimate of the slope!)
The mean response at a given X value is given by:
\(E(Y)=\beta_0+\beta_1X\)
Inferences about Outcome for New Observation
 The point estimate for the outcome at \(X = x\) is provided above.
 The interval to estimate the mean response is called the confidence interval. Minitab calculates this for us.
 The interval used to estimate (or predict) an outcome is called prediction interval.
For a given x value, the prediction interval and confidence interval have the same center, but the width of the prediction interval is wider than the width of the confidence interval. That makes good sense since it is harder to estimate a value for a single subject (for example a particular piece of land in Bob’s town that may have some unique features) than it would be to estimate the average for all pieces of land. Again, Minitab will calculate this interval as well.
8.2.5  SLR with Minitab
Minitab®
Simple Linear Regression with Minitab
 Select Stat > Regression > Regression > Fit Regression Model
 In the box labeled "Response", specify the desired response variable.
 In the box labeled "Predictors", specify the desired predictor variable.
 Select OK. The basic regression analysis output will be displayed in the session window.
To check assumptions...
 Click Graphs.
 In 'Residual plots', choose 'Four in one'.
 Select OK.
8.3  Cautions with Linear Regression
Extrapolation is applying a regression model to X-values outside the range of sample X-values to predict values of the response variable Y. For example, Bob would not want to use the number of critical areas to predict dollar amount using a regression model based on an urban area if Bob’s town is rural.
Second, if no linear relationship exists (i.e., the correlation is zero), it does not imply there is no relationship. The scatterplot will reveal whether other possible relationships may exist. The figure below gives an example where X and Y are related, but not linearly related, i.e., the correlation is zero.
Outliers and Influential Observations
Influential observations are points whose removal causes the regression equation to change considerably. Minitab flags them in the unusual observations list and denotes them with an X. Outliers are points that lie outside the overall pattern of the data. Minitab flags potential outliers in the unusual observations list and denotes them with an R. The following is the Minitab output for the unusual observations within Bob’s study:
Fits and Diagnostics for Unusual Observations
Obs   Cost     Fit      Resid    Std Resid
 1    72.714   71.725    0.989    0.98       X
 2    78.825   75.829    2.996    2.93    R  X
 6    81.967   84.507   -2.540   -2.44    R
 7    83.490   85.640   -2.150   -2.06    R
85   113.540  111.440    2.100    2.01    R
R Large Residual
X Unusual X
Some observations may be both outliers and influential, and these are flagged with both an R and an X. Those observational points merit particular attention because they are not well “fit” by the model and may be influencing conclusions or indicating that an alternative model is needed.
8.4  Estimating the standard deviation of the error term
Our simple linear regression model is:
\(Y=\beta_0+\beta_1X+\epsilon\)
The errors for the \(n\) observations are denoted as \(\epsilon_i\), for \(i=1, \dots, n\). One of our assumptions is that the errors have equal variance (or equal standard deviation). We can estimate the standard deviation of the error by finding the standard deviation of the residuals, \(\hat{\epsilon}_i=y_i-\hat{y}_i\). Minitab also provides the estimate for us, denoted as \(S\), under the Model Summary. We can also calculate it by:
\(s=\sqrt{\text{MSE}}\)
Find the MSE in the ANOVA table, under the Adj MS column and the Error row. The value of 1.12 represents the average squared error. This becomes the denominator for the F test.
Analysis of Variance
Source          DF  Adj SS  Adj MS   F-Value  P-Value
Regression       1  9281.7  9281.72  8267.30  0.000
Critical Areas   1  9281.7  9281.72  8267.30  0.000
Error           93   104.4     1.12
Total           94  9386.1
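As a quick sanity check, \(s=\sqrt{\text{MSE}}\) can be verified from the rounded MSE in the ANOVA output; a minimal sketch:

```python
import math

mse = 1.12            # Adj MS for the Error row of the ANOVA table (rounded)
s = math.sqrt(mse)
print(round(s, 4))    # about 1.0583; Minitab's S = 1.05958 uses the unrounded MSE
```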
8.5  Coefficient of Determination
Now that we know how to estimate the coefficients and perform the hypothesis test, is there any way to tell how useful the model is?
One measure is the coefficient of determination, denoted \(R^2\).
 Coefficient of Determination \(R^2\)
 The coefficient of determination measures the percentage of variability within the \(y\)values that can be explained by the regression model.
Therefore, a value close to 100% means that the model is useful and a value close to zero indicates that the model is not useful.
It can be shown by mathematical manipulation that:
\(\text{SST }=\text{ SSR }+\text{ SSE}\)
\(\sum (y_i-\bar{y})^2=\sum (\hat{y}_i-\bar{y})^2+\sum (y_i-\hat{y}_i)^2\)
Total variability in the y value = Variability explained by the model + Unexplained variability
To get the total, explained, and unexplained variability, we first need to calculate the corresponding deviances: the total deviance \((y_i-\bar{y})\) is split into the explained deviance \((\hat{y}_i-\bar{y})\) and the unexplained deviance \((y_i-\hat{y}_i)\).
The breakdown of variability in the above equation holds for the multiple regression model also.
 Coefficient of Determination \(R^2\) Formula

\(R^2=\dfrac{\text{variability explained by the model}}{\text{total variability in the y values}}\)
\(R^2\) represents the proportion of total variability of the \(y\)value that is accounted for by the independent variable \(x\).
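Using the sums of squares from Bob's ANOVA table, we can verify both the variability decomposition and the reported \(R^2\) (the values below are the rounded ones printed in the output):

```python
# R-squared from the sums of squares in Bob's ANOVA table.
ssr = 9281.7   # variability explained by the model (Regression Adj SS)
sst = 9386.1   # total variability in the y-values (Total Adj SS)
sse = sst - ssr

r_sq = ssr / sst
print(round(sse, 1))          # about 104.4, matching the Error row
print(round(100 * r_sq, 2))   # about 98.89, matching R-sq in the Model Summary
```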
For the specific case when there is only one independent variable \(X\) (i.e., simple linear regression), one can show that \(R^2 = r^2\), where \(r\) is the correlation coefficient between \(X\) and \(Y\). For Bob’s data, the correlation of the two variables is 0.994 and the \(R^2\) value is 98.89%.
Correlations
Pearson correlation  0.994 
P-value              0.000
Model Summary
S        R-sq    R-sq(adj)  R-sq(pred)
1.05958  98.89%  98.88%     98.82%
Minitab®
Finding Correlation
 Select Stat > Basic statistics > Correlation
 Specify the two (or more) variables for which you want the correlation coefficient(s) calculated.
 Pearson correlation is the default. An optional Spearman rho method is also available.
 If it isn't already checked, put a checkmark in the box labeled Display pvalues by clicking once on the box.
 Select OK. The output will appear in the session window.
8.6  Lesson Summary
Now that you have seen what the components of a regression are, you can see that a regression is a linear model (following the form \(y=b_0+b_1x+\text{error}\)). In our next lesson, we will learn that this same linear model can analyze a categorical variable’s impact on a quantitative (y) variable.
Let’s return to Bob. Bob can now state with confidence that the number of critical areas does significantly impact the dollar amount of a land development proposal. Hopefully his growing statistical skills will make his work with developers more effective and efficient!