5.4 - Regression

Regression is a descriptive method used with two different measurement variables to find the best straight line (equation) to fit the data points on the scatterplot. A key feature of the regression equation is that it can be used to make predictions. In order to carry out a regression analysis, each variable needs to be designated as one of the following:

Explanatory or Predictor Variable = x (on horizontal axis)

Response or Outcome Variable = y (vertical axis)

The explanatory variable can be used to predict (estimate) a typical value for the response variable. (Note: With correlation, it is not necessary to indicate which variable is the explanatory variable and which is the response variable.)

Review: Equation of a Line

Let's review the basics of the equation of a line:

\(y = a + bx\) where:

a = y-intercept (the value of y when x = 0)

b = slope of the line. The slope is the change in y as x increases by one unit. When b is positive there is a positive association; when b is negative there is a negative association.

In this graph, the line is drawn from the equation y equals a plus b times x. The intercept on the y-axis is a, and y changes by b units as x increases by one unit.

Example 5.5: Example of Regression Equation

Consider the following two variables for a sample of ten Stat 100 students.

x = quiz score
y = exam score

Figure 5.6 displays the scatterplot of these data, whose correlation is 0.883.

The scatterplot shows a positive relationship between quiz and exam scores: exam scores tend to increase as quiz scores increase.

Figure 5.6. Scatterplot of quiz versus exam scores

We would like to be able to predict the exam score based on the quiz score for students who come from this same population. To make that prediction, we notice that the points generally fall in a linear pattern, so we can use the equation of a line that lets us put in a specific value for x (quiz) and determine the best estimate of the corresponding y (exam). The line represents our best guess at the average value of y for a given x value, and the best line is the one that has the least variability of the points around it (i.e., we want the points to come as close to the line as possible). Remembering that the standard deviation measures the deviations of the numbers on a list about their average, we find the line that has the smallest standard deviation for the vertical distances from the points to the line. That line is called the regression line or the least squares line. Least squares essentially finds the line whose sum of squared vertical distances to the data points is smaller than that of any other possible line. Figure 5.7 displays the least squares regression line for the data in Example 5.5.
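To make the least squares idea concrete, here is a minimal Python sketch. The ten (quiz, exam) pairs are hypothetical values invented for illustration (the actual Stat 100 data are not listed in these notes), and the function names `least_squares` and `sum_squared_errors` are our own. The sketch computes the least squares intercept and slope from the usual closed-form formulas and then checks that nudging the line away from those values makes the sum of squared vertical distances larger.

```python
# Hypothetical (quiz, exam) scores, invented for illustration only.
# These are NOT the Stat 100 data behind Figures 5.6 and 5.7.
quiz = [56, 62, 68, 71, 75, 80, 84, 88, 92, 97]
exam = [58, 65, 70, 77, 76, 86, 85, 93, 95, 99]


def least_squares(x, y):
    """Return (a, b), the intercept and slope of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = least_squares(quiz, exam)
best = sum_squared_errors(quiz, exam, a, b)

# Any other line does worse: shift the intercept or tilt the slope a little.
shifted = sum_squared_errors(quiz, exam, a + 1.0, b)
tilted = sum_squared_errors(quiz, exam, a, b + 0.05)

print(f"least squares line: y = {a:.2f} + {b:.2f} x")
print(f"sum of squared errors: best={best:.2f}, shifted={shifted:.2f}, tilted={tilted:.2f}")
```

Whatever data are used, the least squares line should report the smallest of the three sums, because that is exactly the quantity it is chosen to minimize.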

In this scatterplot, a regression line has been added with the least squares method. The line slopes upward and passes through some data points.

Figure 5.7. Least Squares Regression Equation

As you look at the plot of the regression line in Figure 5.7, you find that some of the points lie above the line while other points lie below the line. In fact, the total vertical distance from the line to the points above it is exactly equal to the total vertical distance from the line to the points that fall below it.
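The balance between points above and below the line can be checked numerically. The sketch below again uses hypothetical scores invented for illustration (not the Stat 100 data), fits the least squares line, and adds up the vertical distances on each side of it.

```python
# Hypothetical (quiz, exam) scores, invented for illustration only.
quiz = [56, 62, 68, 71, 75, 80, 84, 88, 92, 97]
exam = [58, 65, 70, 77, 76, 86, 85, 93, 95, 99]

n = len(quiz)
x_bar = sum(quiz) / n
y_bar = sum(exam) / n

# Closed-form least squares slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(quiz, exam))
         / sum((x - x_bar) ** 2 for x in quiz))
intercept = y_bar - slope * x_bar

# Vertical distance from each point to the line (positive = point above the line).
residuals = [y - (intercept + slope * x) for x, y in zip(quiz, exam)]

above = sum(r for r in residuals if r > 0)    # total distance for points above
below = -sum(r for r in residuals if r < 0)   # total distance for points below

print(f"total vertical distance above the line: {above:.4f}")
print(f"total vertical distance below the line: {below:.4f}")
```

The two totals should agree up to floating-point rounding, which is another way of saying that the vertical deviations from a least squares line add up to zero.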

The least squares regression equation used to plot the line in Figure 5.7 is:

\begin{align} &y = 1.15 + 1.05 x \text{ or}   \\ &\text{predicted exam score = 1.15 + 1.05 Quiz}\end{align}


Interpretation of Y-Intercept

Y-Intercept = 1.15 points

Y-Intercept Interpretation: If a student has a quiz score of 0 points, one would expect that he or she would score 1.15 points on the exam.

However, this y-intercept does not offer any logical interpretation in the context of this problem, because x = 0 is far outside the range of quiz scores in the sample. If you look at the graph, you will find that the lowest quiz score is 56 points. So, while the y-intercept is a necessary part of the regression equation, by itself it provides no meaningful information about student performance on an exam when the quiz score is 0.


Interpretation of Slope

Slope = 1.05 = 1.05/1 = (change in exam score)/(1 unit change in quiz score)

Slope Interpretation: For every 1-point increase in quiz score, you can expect that a student will score, on average, 1.05 additional points on the exam.

In this example, the slope is a positive number, which is not surprising because the correlation is also positive. A positive correlation always leads to a positive slope and a negative correlation always leads to a negative slope.
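One way to see why the signs must agree is a standard identity for the least squares slope, which these notes do not state explicitly. The symbols r (the correlation) and \(s_x\), \(s_y\) (the sample standard deviations of x and y) are notation introduced here for illustration:

\begin{align} b = r \cdot \frac{s_y}{s_x} \end{align}

Because standard deviations are positive whenever the data vary, b always has the same sign as r. In Example 5.5, r = 0.883 is positive, so the slope of 1.05 is positive as well.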


Remember that we can also use this equation for prediction. So consider the following question:

 If a student has a quiz score of 85 points, what score would we expect the student to make on the exam? We can use the regression equation to predict the exam score for the student.

Exam = 1.15 + 1.05 Quiz
Exam = 1.15 + 1.05 (85) = 1.15 + 89.25 = 90.4 points
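The same arithmetic can be wrapped in a small Python function. This is only a sketch: the function name `predict_exam` is our own, and the intercept and slope are simply the values from the fitted equation above.

```python
def predict_exam(quiz_score, intercept=1.15, slope=1.05):
    """Predicted exam score from the regression equation Exam = 1.15 + 1.05 Quiz."""
    return intercept + slope * quiz_score


prediction = predict_exam(85)
print(f"Predicted exam score for a quiz score of 85: {prediction:.1f} points")
```

Keep in mind that the equation was fit to quiz scores of 56 points and higher, so predictions for quiz scores far below that range should not be trusted.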

Figure 5.8 verifies that when a quiz score is 85 points, the predicted exam score is about 90 points.

In this plot, the predicted data point (85, 90.4) lies on the regression line.

Figure 5.8. Prediction of Exam Score at a Quiz Score of 85 Points

Example 5.6

Let's return now to Example 4.8, the experiment examining the relationship between the number of beers you drink and your blood alcohol content (BAC) a half-hour later (scatterplot shown in Figure 4.8). Figure 5.9 below shows the scatterplot with the regression line included. The line is given by:

predicted Blood Alcohol Content = -0.0127 + 0.0180 (# of beers)


Figure 5.9. Regression line relating # of beers consumed and blood alcohol content

Notice that four different students taking part in this experiment drank exactly 5 beers. For that group we would expect their average blood alcohol content to come out around -0.0127 + 0.0180(5) = 0.077. The line works really well for this group as 0.077 falls extremely close to the average for those four participants.
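As a quick check of this calculation, the regression equation can be evaluated in Python. The function name `predict_bac` is our own, and the coefficients are taken directly from the equation given above.

```python
def predict_bac(beers, intercept=-0.0127, slope=0.0180):
    """Predicted blood alcohol content from the fitted line in Figure 5.9."""
    return intercept + slope * beers


bac_5 = predict_bac(5)
print(f"Predicted BAC after 5 beers: {bac_5:.3f}")
```

The result is the expected average BAC for people who drink 5 beers, not a guarantee for any one individual; each of the four students will fall somewhat above or below it.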