3.4.3  Simple Linear Regression
Regression uses one or more explanatory variables (\(x\)) to predict one response variable (\(y\)). In this course, we will be learning specifically about simple linear regression. The "simple" part is that we will be using only one explanatory variable. If there are two or more explanatory variables, then multiple linear regression is necessary. The "linear" part is that we will be using a straight line to predict the response variable using the explanatory variable. Unlike in correlation, in regression it does matter which variable is called \(x\) and which is called \(y\). In regression, the explanatory variable is always \(x\) and the response variable is always \(y\). Both \(x\) and \(y\) must be quantitative variables.
You may recall from an algebra class that the formula for a straight line is \(y=mx+b\), where \(m\) is the slope and \(b\) is the y-intercept. The slope is a measure of how steep the line is; in algebra, this is sometimes described as "change in y over change in x" (\(\frac{\Delta y}{\Delta x}\)), or "rise over run." A positive slope indicates a line moving from the bottom left to the top right. A negative slope indicates a line moving from the top left to the bottom right. The y-intercept is the location where the line passes through the y-axis. In other words, when \(x=0\), then \(y\) equals the y-intercept (\(b\)).
In statistics, we use similar formulas:
 Simple Linear Regression Line: Sample
 \(\widehat{y}=a+bx\)

\(\widehat{y}\) = predicted value of \(y\) for a given value of \(x\)
\(a\) = \(y\)-intercept
\(b\) = slope
In a population, the y-intercept is denoted as \(\beta_0\) ("beta sub 0") or \(\alpha\) ("alpha"). The slope is denoted as \(\beta_1\) ("beta sub 1") or just \(\beta\) ("beta").
 Simple Linear Regression Line: Population
 \(\widehat{y}=\alpha+\beta x\)
Simple linear regression uses data from a sample to construct the line of best fit. But what makes a line "best fit"? The most common method of constructing a simple linear regression line, and the only method that we will be using in this course, is the least squares method. The least squares method finds the values of the y-intercept and slope that make the sum of the squared residuals (also known as the sum of squared errors, or SSE) as small as possible.
 Residual
 The difference between an observed y value and the predicted y value. In other words, \(y-\widehat{y}\). On a scatterplot, this is the vertical distance between the line of best fit and the observation. In a sample this may be denoted as \(e\) or \(\widehat{\epsilon}\) ("epsilon-hat") and in a population this may be denoted as \(\epsilon\) ("epsilon").
 Residual
 \(e=y-\widehat{y}\)

\(y\) = actual value of \(y\)
\(\widehat{y}\) = predicted value of \(y\)
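As a concrete illustration of the least squares method, the intercept and slope can be computed directly from their standard formulas. Below is a quick sketch in Python using made-up data (the course itself uses Minitab; this is only to show what "minimizing the SSE" means):

```python
# Least-squares simple linear regression computed from scratch.
# The x and y data below are made up for illustration.

def least_squares_line(x, y):
    """Return (a, b) so that y-hat = a + b*x minimizes the SSE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # slope: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar  # the line always passes through (x_bar, y_bar)
    return a, b

def sse(x, y, a, b):
    """Sum of squared residuals (errors) for the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))  # 0.05 1.99

# Nudging either coefficient away from the least-squares values
# can only increase the sum of squared residuals.
assert sse(x, y, a, b) <= sse(x, y, a + 0.5, b)
assert sse(x, y, a, b) <= sse(x, y, a, b + 0.5)
```

The assertions at the end demonstrate the defining property of the least squares line: any other intercept or slope produces a larger SSE.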
Example
The plot below shows the line \(\widehat{y}=6.5-1.8x\)
Identify and interpret the y-intercept.
The y-intercept is 6.5. When \(x=0\), the predicted value of y is 6.5.
Identify and interpret the slope.
The slope is -1.8. For every one unit increase in x, the predicted value of y decreases by 1.8.
Compute and interpret the residual for the point (0.2, 5.1).
The observed x value is 0.2 and the observed y value is 5.1.
The formula for the residual is \(e=y-\widehat{y}\).
We can compute \(\widehat{y}\) using the regression equation that we have and \(x=0.2\):
\(\widehat{y}=6.5-1.8(0.2)=6.14\)
Given an x value of 0.2, we would predict this observation to have a y value of 6.14. In reality, it had a y value of 5.1. The residual is the difference between these two values.
\(e=y-\widehat{y}=5.1-6.14=-1.04\)
The residual for this observation is -1.04. This observation's y value is 1.04 less than predicted given its x value.
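The arithmetic in this example is simple enough to check by hand, but as a sanity check, the same computation can be done in a few lines of Python:

```python
# Checking the worked example: the fitted line has intercept 6.5 and
# slope -1.8, and the observed point is (0.2, 5.1).
a, b = 6.5, -1.8
x_obs, y_obs = 0.2, 5.1

y_hat = a + b * x_obs      # predicted y: 6.5 - 1.8(0.2) = 6.14
residual = y_obs - y_hat   # e = y - y-hat = 5.1 - 6.14 = -1.04

print(round(y_hat, 2))     # 6.14
print(round(residual, 2))  # -1.04
```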
Cautions
 Avoid extrapolation. A regression line should not be used to make predictions for \(x\) values outside of the range of \(x\) values observed in the sample, nor for individuals from a population different from the one that the sample used to build the model came from.
 Make a scatterplot of your data before running a regression model to confirm that a linear relationship is reasonable. Simple linear regression constructs a straight line. If the relationship between x and y is not linear, then a linear model is not the most appropriate.
 Outliers can heavily influence a regression model. Recall the plots that we looked at when learning about correlation. The addition of one outlier can greatly change the line of best fit. In addition to examining a scatterplot for linearity, you should also be looking for outliers.
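To see the influence of a single outlier concretely, here is a small Python sketch (with made-up data, not from the course) that fits a least-squares line with and without one unusual observation:

```python
# A sketch of how one outlier can change the least-squares slope.
# The data are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a perfect line with slope 1

b0, a0 = np.polyfit(x, y, 1)  # polyfit returns slope first for degree 1
print(round(b0, 3))           # 1.0 -- slope without the outlier

# Add a single outlier far below the pattern.
x_out = np.append(x, 6.0)
y_out = np.append(y, -5.0)
b1, a1 = np.polyfit(x_out, y_out, 1)
print(round(b1, 3))           # the slope is now negative
```

One added point is enough to flip the sign of the slope, which is why a scatterplot should always be examined for outliers before interpreting a regression model.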
Later in the course, we will devote a week to correlation and regression.
3.4.3.1  Minitab: SLR
Minitab® – Simple Linear Regression
We previously created a scatterplot of quiz averages and final exam scores and observed a linear relationship. Here, we will use quiz scores to predict final exam scores.
 Open the Minitab file: Exam.mwx (or Exam.csv)
 Select Stat > Regression > Regression > Fit Regression Model...
 Double click Final in the box on the left to insert it into the Responses (Y) box on the right
 Double click Quiz_Average in the box on the left to insert it into the Continuous Predictors (X) box on the right
 Click OK
This should result in the following output:
Regression Equation
Final = 12.1 + 0.751 Quiz_Average
Coefficients
Term  Coef  SE Coef  T-Value  P-Value  VIF
Constant  12.1  11.9  1.01  0.3153
Quiz_Average  0.751  0.141  5.31  0.000  1.00
Model Summary
S  R-sq  R-sq(adj)  R-sq(pred)
9.71152  37.04%  35.73%  29.82%
Analysis of Variance
Source  DF  Adj SS  Adj MS  F-Value  P-Value
Regression  1  2664  2663.66  28.24  0.000
Quiz_Average  1  2664  2663.66  28.24  0.000
Error  48  4527  94.31
Total  49  7191
Fits and Diagnostics for Unusual Observations
Obs  Final  Fit  Resid  Std Resid
11  49.00  70.50  -21.50  -2.25  R
40  80.00  61.22  18.78  2.03  R
47  37.00  59.51  -22.51  -2.46  R
R  Large residual
Interpretation
In the output in the above example, we are given the simple linear regression model Final = 12.1 + 0.751 Quiz_Average.
This means that the y-intercept is 12.1 and the slope is 0.751.
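The numbers in the Minitab output fit together, and the regression equation can be used to make predictions. The Python snippet below (a check outside of Minitab; the quiz average of 90 is a made-up input for illustration) verifies a few of these relationships using the printed values:

```python
# Checking that pieces of the Minitab output are consistent with one
# another, plus one prediction from the fitted regression equation.
import math

# Regression equation: Final = 12.1 + 0.751 * Quiz_Average
a, b = 12.1, 0.751
predicted_final = a + b * 90         # hypothetical quiz average of 90
print(round(predicted_final, 2))     # 79.69

# From the ANOVA table: R-sq = SS(Regression) / SS(Total)
ss_regression = 2663.66
ss_error = 4527.0
ss_total = 7191.0
r_sq = ss_regression / ss_total
print(round(100 * r_sq, 2))          # 37.04, matching R-sq = 37.04%

# S is the square root of the mean squared error (Adj MS for Error).
mse = ss_error / 48                  # 48 error degrees of freedom
s = math.sqrt(mse)
print(round(s, 2))                   # 9.71, matching S = 9.71152

# F = MS(Regression) / MS(Error)
f_value = ss_regression / mse
print(round(f_value, 2))             # 28.24, matching the F-Value
```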
3.4.3.2  Example: Interpreting Output
This example uses the "CAOSExam" dataset available from http://www.lock5stat.com/datapage.html.
CAOS stands for Comprehensive Assessment of Outcomes in a First Statistics course. It is a measure of students' statistical reasoning skills. Here we have data from 10 students who took the CAOS at the beginning (pretest) and end (posttest) of a statistics course.
Research question: How can we use students' pretest scores to predict their posttest scores?
Minitab was used to construct a simple linear regression model. The two pieces of output that we are going to interpret here are the regression equation and the scatterplot containing the regression line.
Let's work through a few common questions.
What is the regression model?
The "regression model" refers to the regression equation. This is \(\widehat{posttest}=21.43 + 0.8394(Pretest)\).
Identify and interpret the slope.
The slope is 0.8394. For every one point increase in a student's pretest score, their predicted posttest score increases by 0.8394 points.
Identify and interpret the y-intercept.
The y-intercept is 21.43. A student with a pretest score of 0 would have a predicted posttest score of 21.43. However, in this scenario, we should not actually use this model to predict the posttest score of someone who scored 0 on the pretest because that would be extrapolation. This model should only be used to predict the posttest score of students from a comparable population whose pretest scores were between approximately 35 and 65.
One student scored 60 on the pretest and 65 on the posttest. Calculate and interpret that student's residual.
This student's observed x value was 60 and their observed y value was 65.
\(e=y-\widehat{y}\)
We have y. We can compute \(\widehat y\) using the x value and regression equation that we have.
\(\widehat y = 21.43 + 0.8394(60) = 71.794\)
\(e=65-71.794=-6.794\)
This student's residual is -6.794. They scored 6.794 points lower on the posttest than we predicted given their pretest score.
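As with the earlier example, this calculation is easy to verify in a few lines of Python:

```python
# Verifying the residual for the student who scored 60 on the
# pretest and 65 on the posttest.
a, b = 21.43, 0.8394
pretest, posttest = 60, 65

predicted_posttest = a + b * pretest      # 21.43 + 0.8394(60) = 71.794
residual = posttest - predicted_posttest  # 65 - 71.794 = -6.794

print(round(predicted_posttest, 3))  # 71.794
print(round(residual, 3))            # -6.794
```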