Regression uses one or more explanatory variables (\(x\)) to predict one response variable (\(y\)). In this course, we will be learning specifically about simple linear regression. The "simple" part is that we will be using only one explanatory variable. If there are two or more explanatory variables, then multiple linear regression is necessary. The "linear" part is that we will be using a straight line to predict the response variable using the explanatory variable. Unlike in correlation, in regression it does matter which variable is called \(x\) and which is called \(y\). In regression, the explanatory variable is always \(x\) and the response variable is always \(y\). Both \(x\) and \(y\) must be quantitative variables.
You may recall from an algebra class that the formula for a straight line is \(y=mx+b\), where \(m\) is the slope and \(b\) is the y-intercept. The slope is a measure of how steep the line is; in algebra, this is sometimes described as "change in y over change in x" (\(\frac{\Delta y}{\Delta x}\)), or "rise over run." A positive slope indicates a line moving from the bottom left to the top right. A negative slope indicates a line moving from the top left to the bottom right. The y-intercept is the point where the line crosses the y-axis. In other words, when \(x=0\), \(y\) equals the y-intercept (i.e., \(y=b\)).
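As a quick illustration (the two points here are hypothetical, chosen only for the arithmetic), the short sketch below computes a slope as rise over run and then solves for the y-intercept:

```python
# Two hypothetical points on a line.
x1, y1 = 1.0, 3.0
x2, y2 = 3.0, 7.0

m = (y2 - y1) / (x2 - x1)  # rise over run: change in y / change in x
b = y1 - m * x1            # solve y = mx + b for b using point (x1, y1)

print(f"slope m = {m}, y-intercept b = {b}")  # slope m = 2.0, y-intercept b = 1.0
```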
In statistics, we use similar formulas:
- Simple Linear Regression Line: Sample
- \(\widehat{y}=a+bx\)
where:
\(\widehat{y}\) = predicted value of \(y\) for a given value of \(x\)
\(a\) = \(y\)-intercept
\(b\) = slope
In a population, the y-intercept is denoted as \(\beta_0\) ("beta sub 0") or \(\alpha\) ("alpha"). The slope is denoted as \(\beta_1\) ("beta sub 1") or just \(\beta\) ("beta").
- Simple Linear Regression Line: Population
- \(\widehat{y}=\alpha+\beta x\)
Simple linear regression uses data from a sample to construct the line of best fit. But what makes a line “best fit”? The most common method of constructing a simple linear regression line, and the only method that we will be using in this course, is the least squares method. The least squares method finds the values of the y-intercept and slope that make the sum of the squared residuals (also known as the sum of squared errors, or SSE) as small as possible.
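To make this concrete, here is a minimal sketch of the least squares computation using the closed-form formulas \(b=\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sum{(x-\bar{x})^2}}\) and \(a=\bar{y}-b\bar{x}\). The data are hypothetical, and NumPy is assumed only for convenience:

```python
import numpy as np

# Hypothetical sample data (x, y pairs).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares estimates.
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # y-intercept

y_hat = a + b * x               # predicted values
sse = np.sum((y - y_hat) ** 2)  # sum of squared residuals (SSE)

print(f"y-hat = {a:.3f} + {b:.3f}x,  SSE = {sse:.4f}")
```

No other choice of intercept and slope produces a smaller SSE for these data; `np.polyfit(x, y, 1)` would return the same two values, since it also fits by least squares.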
- Residual
- The difference between an observed \(y\) value and the predicted \(y\) value. In other words, \(y-\widehat{y}\). On a scatterplot, this is the vertical distance between the line of best fit and the observation. In a sample, this may be denoted as \(e\) or \(\widehat{\epsilon}\) ("epsilon-hat"); in a population, this may be denoted as \(\epsilon\) ("epsilon").
- Residual
- \(e=y-\widehat{y}\)
where:
\(y\) = actual value of \(y\)
\(\widehat{y}\) = predicted value of \(y\)
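As a small sketch (the line and points here are hypothetical), the residuals are just the observed \(y\) values minus the predicted ones:

```python
import numpy as np

# Hypothetical fitted line y-hat = 1.0 + 2.0x and three observed points.
a, b = 1.0, 2.0
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.4, 4.6, 7.1])

y_hat = a + b * x  # predicted y for each observed x
e = y - y_hat      # residuals: observed y minus predicted y

print(e)           # approximately [ 0.4 -0.4  0.1]
```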
Example Section
The plot below shows the line \(\widehat{y}=6.5+1.8x\).
Identify and interpret the y-intercept.
The y-intercept is 6.5. When \(x=0\), the predicted value of \(y\) is 6.5.
Identify and interpret the slope.
The slope is 1.8. For every one-unit increase in \(x\), the predicted value of \(y\) increases by 1.8.
Compute and interpret the residual for the point (-0.2, 5.1).
The observed \(x\) value is \(-0.2\) and the observed \(y\) value is 5.1.
The formula for the residual is \(e=y-\widehat{y}\).
We can compute \(\widehat{y}\) using the regression equation and \(x=-0.2\):
\(\widehat{y}=6.5+1.8(-0.2)=6.14\)
Given an \(x\) value of \(-0.2\), we would predict this observation to have a \(y\) value of 6.14. In reality, it had a \(y\) value of 5.1. The residual is the difference between these two values.
\(e=y-\widehat{y}=5.1-6.14=-1.04\)
The residual for this observation is \(-1.04\). This observation's \(y\) value is 1.04 less than predicted given its \(x\) value.
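If you want to check the arithmetic, the worked example above translates directly into a few lines of code:

```python
# Check the worked example: line y-hat = 6.5 + 1.8x and the point (-0.2, 5.1).
x_obs, y_obs = -0.2, 5.1

y_hat = 6.5 + 1.8 * x_obs  # predicted y: 6.14
e = y_obs - y_hat          # residual: 5.1 - 6.14 = -1.04

print(f"predicted = {y_hat:.2f}, residual = {e:.2f}")
```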
Cautions Section
- Avoid extrapolation. This means that a regression line should not be used to make predictions for values of \(x\) outside the range of the data used to construct the model, nor for individuals from a population different from the one that the sample used to define the model came from.
- Make a scatterplot of your data before running a regression model to confirm that a linear relationship is reasonable. Simple linear regression constructs a straight line. If the relationship between \(x\) and \(y\) is not linear, then a linear model is not the most appropriate.
- Outliers can heavily influence a regression model. Recall the plots that we looked at when learning about correlation. The addition of one outlier can greatly change the line of best fit. In addition to examining a scatterplot for linearity, you should also be looking for outliers.
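To see the last caution in action, the sketch below fits a line (via `np.polyfit`, which uses least squares) to hypothetical data with and without a single outlier; notice how much the slope changes:

```python
import numpy as np

# Hypothetical data with a clear linear pattern.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# np.polyfit(x, y, 1) fits by least squares; it returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Add a single outlier far from the pattern and refit.
x_out = np.append(x, 10.0)
y_out = np.append(y, 2.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"without outlier: y-hat = {intercept:.2f} + {slope:.2f}x")
print(f"with outlier:    y-hat = {intercept_out:.2f} + {slope_out:.2f}x")
```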
Later in the course, we will devote a week to correlation and regression.