Lesson 11: Multiple Linear Regression

Learning objectives for this lesson

Upon completion of this lesson, you should be able to do the following:

  • Understand the differences between simple and multiple linear regression
  • Interpret the coefficients in a multiple linear regression model
  • Conduct t-tests for the individual slope estimates
  • Include an indicator, or dummy, variable in a regression model

Multiple Linear Regression

In simple linear regression we consider only one predictor variable. When we include more than one predictor variable, we have a multiple linear regression model. This model is an extension of the simple model: we now estimate a parameter (i.e., a slope coefficient) for each predictor variable in the model. As with simple linear regression, we have one Y or response variable (also called the dependent variable), but we now have more than one X variable, also called explanatory, independent, or predictor variables. The multiple linear regression model is as follows:

\(Y=\beta_0+\beta_1 X_1+\ldots+\beta_k X_k +\epsilon\)

where Y is the response variable and X1, ... , Xk are independent variables. β0, β1, ... , βk are fixed (but unknown) parameters, and ε is a random variable representing the error, or residual, that is normally distributed with mean 0 and variance \(\sigma^2_\epsilon\).
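The lesson uses Minitab and SPSS, but the model above can also be fit by ordinary least squares in a few lines of Python. The sketch below uses a small made-up, noise-free data set (the values and variable names are illustrative, not from the Exam data) so the estimated coefficients can be checked by eye.

```python
# A minimal sketch of fitting Y = b0 + b1*X1 + b2*X2 by least squares.
# The data are hypothetical and noise-free: Y = 3 + 2*X1 + 0.5*X2 exactly.
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([6.0, 7.5, 11.0, 12.5, 16.0, 17.5])

# Design matrix: a leading column of ones estimates the intercept b0
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates of (b0, b1, b2)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
b0, b1, b2 = beta
print(b0, b1, b2)  # recovers 3.0, 2.0, 0.5
```

With real (noisy) data the same call returns the estimated slopes, which is exactly what the Coef column in the software output reports.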


11.1 - Inferences About the Parameters

The inferences and interpretations made in multiple linear regression are similar to those made in simple linear regression. (In fact, the assumptions required for multiple linear regression are the same as those for simple linear regression, except that here they must hold for all predictors, not just the single predictor used in simple linear regression.) There are, however, four major differences:

  1. The t-tests for the slope coefficients are conditional tests. That is, each test assesses the significant predictive value that a variable adds to the model when the other variables are already included in the model.
  2. The value of a slope is interpreted as the change in Y that will occur for a one-unit increase in that particular X predictor variable, given that the other variables are held constant.
  3. The coefficient of determination, R2, still measures the amount of variation in the response variable Y that is explained by all of the predictor variables in the model. However, whereas before the square root of R2 could be interpreted as the correlation between X and Y, this result no longer holds in multiple linear regression. Since we now have more than one X, this square root no longer represents the linear relationship between two variables, which is what correlation measures.
  4. The ANOVA F-test is a test that all of the slopes in the model are equal to zero (the null hypothesis, H0) versus the alternative hypothesis, H1, that the slopes are not all equal to zero, i.e. at least one slope does not equal zero. This test is called the F-test for Overall Significance. The hypothesis statements are as follows:
  H0 : β1 = ... = βk = 0
  H1 : At least one of the βi's is non-zero

F statistic [MSR is Mean Square Regression and MSE is Mean Square Error]:

\(F=\frac{MSR}{MSE}\)
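As a quick check of this formula, the F statistic can be computed by hand from the sums of squares. The sketch below uses the ANOVA numbers from the Exam example in this lesson (SSR = 2683.79, SSE = 4506.93, n = 50 students, k = 3 predictors); any other sums of squares would work the same way.

```python
# Computing the overall F statistic from the ANOVA sums of squares.
# Numbers are from the Exam example in this lesson.
SSR, SSE = 2683.79, 4506.93
n, k = 50, 3  # 50 students, 3 predictors

MSR = SSR / k            # Mean Square Regression, df = k
MSE = SSE / (n - k - 1)  # Mean Square Error, df = n - k - 1
F = MSR / MSE
print(round(F, 2))       # 9.13, matching the ANOVA table
```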


11.2 - Example

We will refer to the Exam data set (Exam.MTW or Exam.XLS), which consists of a random sample of 50 students from a previous introduction to statistics course. We will use the variable Final as the response, and include the variables Quiz Average, Midterm, and Gender as the predictors.

Since Gender is recorded as text, we need to first code Gender into a dummy variable.

To recode a text variable to a dummy variable in Minitab:

  1. Import data into Minitab
  2. Select Data > Code > Text to Numeric
  3. Enter Gender into the text box for Code Data From Columns
  4. Type the word DummyGender into the text box for Store Coded Data In Columns
  5. In the first text box under Original Values type in Female and in the first text box under New type the number 0
  6. In the second text box under Original Values type in Male and in the second text box under New type the number 1
  7. Click OK

[Minitab output]

To recode a text variable to a dummy variable in SPSS:

  1. Import data into SPSS
  2. Since the variable Gender has text responses (i.e. Male, Female) we need to recode this variable into a numeric. We will use 1 to represent Male and 0 for Female.
  3. Go to Transform > Recode Into Different Variables
  4. Enter Gender into the Output Variable Window
  5. In the text box under Output Variable labeled Name: enter DummyGender
  6. Click Change
  7. Click the button Old and New Values
  8. Under Old Value click Value and type in Male
  9. Under New Value enter in the Value text box the value 1
  10. Under Old -> New click Add
  11. Repeat steps 8 through 10, typing in Female and 0
  12. Click Continue
  13. Click OK (you should now have a new column of ones and zeroes titled DummyGender)

[SPSS output]

NOTE: Software is case sensitive. This means that you have to enter the text values exactly as they appear in the worksheet. Here, for example, Female uses a capital F. If you used lower case for female, then DummyGender would include missing data (i.e. an *) wherever Gender was Female. Also, notice that DummyGender is entered as one word, not two.
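For readers working outside Minitab or SPSS, the same recoding can be sketched in plain Python (pandas and other packages offer equivalent mapping functions). The values below are illustrative; note that, just as in the software, the matching is case sensitive, so an unmatched value becomes missing.

```python
# Recoding a text variable to a dummy variable: Male -> 1, Female -> 0.
# dict.get returns None for unmatched values, analogous to the * that
# Minitab would show for missing data.
mapping = {"Male": 1, "Female": 0}
gender = ["Female", "Male", "male"]  # note the lower-case "male"
dummy_gender = [mapping.get(g) for g in gender]
print(dummy_gender)  # [0, 1, None] -- the lower-case "male" is missing
```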

With the data set open, do the following:

  1. From the menu bar select Stat > Regression > Regression
  2. With the cursor in the Response box double click on Final from the list of variables. After making this selection the cursor should automatically move to the Predictors window
  3. With the cursor in the Predictors window double click on Quiz Average, Midterm, and DummyGender. This should move these variables into the Predictors window.
  4. Click OK

The output is as follows:

Regression Analysis

The regression equation is
Final = 12.6 + 0.731 Quiz Average + 0.024 Midterm - 1.26 DummyGender

Predictor       Coef    StDev      T      P
Constant       12.64    15.26   0.83  0.412
Quiz Average  0.7307   0.1653   4.42  0.000
Midterm       0.0237   0.1934   0.12  0.903
DummyGender   -1.259    2.487  -0.44  0.661

S = 9.89832   R-Sq = 37.3%   R-Sq(adj) = 33.2%

Analysis of Variance

Source            DF        SS       MS     F      P
Regression         3   2683.79   894.60  9.13  0.000
Residual Error    46   4506.93    97.98
Total             49   7190.72

To perform a multiple linear regression analysis in SPSS:

  1. Open SPSS without data
  2. Click Analyze > Regression > Linear
  3. Click the variable Final and move to the text box for Dependent
  4. Click the variables Quiz_Average, Midterm, DummyGender and move to the text window for Independent(s)
  5. Click OK

The output is as follows:

[SPSS output]


11.3 - About the Output

Coefficient Interpretation

The coefficient values are interpreted as the change in Y that will occur for a one-unit increase in a particular X predictor variable, given that the other variables are held constant. For example, with DummyGender represented by 0 for Female and 1 for Male, if we hold Quiz Average and Midterm constant, then for Males we would expect a Final score about 1.26 points lower than for Females (since the slope is negative).
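This interpretation of the dummy variable can be seen directly by plugging both genders into the fitted equation from the regression output. In the sketch below the particular Quiz Average and Midterm values are arbitrary; any fixed pair gives the same difference.

```python
# The fitted equation from the regression output:
# Final = 12.6 + 0.731 QuizAverage + 0.024 Midterm - 1.26 DummyGender
def predict_final(quiz_avg, midterm, dummy_gender):
    return 12.6 + 0.731 * quiz_avg + 0.024 * midterm - 1.26 * dummy_gender

# Hold Quiz Average and Midterm fixed at arbitrary values
quiz_avg, midterm = 85.0, 80.0
female = predict_final(quiz_avg, midterm, 0)  # DummyGender = 0
male = predict_final(quiz_avg, midterm, 1)    # DummyGender = 1

# The predictions differ by exactly the DummyGender coefficient
print(round(male - female, 2))  # -1.26
```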

t−Test for Individual Coefficients

A t-test on an individual coefficient is a test of its significance in the presence of all of the other explanatory variables. For instance, the test of whether the slope for Quiz Average equals zero versus does not equal zero, given that Midterm and DummyGender are already present in the model, has a p-value of 0.000. If we use α of 0.05 to perform this test, then since the p-value is less than alpha, we reject the null hypothesis and decide that the slope for Quiz Average differs from zero. Our conclusion is that Quiz Average is a significant linear predictor of Final when Midterm and DummyGender are present in the model; we should include Quiz Average in a model containing the other two predictors.
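The t statistic itself is simply the coefficient estimate divided by its standard error, which can be verified by hand from the output above:

```python
# t statistic for an individual coefficient: estimate / standard error.
# Values for Quiz Average are taken from the regression output.
coef, stdev = 0.7307, 0.1653
t = coef / stdev
print(round(t, 2))  # 4.42, matching the T column in the output
```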

Notice that the p-values for the other two predictor variables are greater than our 0.05 level of significance. This indicates that neither variable is a significant linear predictor of Final when the other two variables are in the model. This does NOT imply that both variables should be dropped; it only means that when Quiz Average and Midterm are in the model, DummyGender does not offer significant predictive value, and when Quiz Average and DummyGender are in the model, Midterm does not provide significant predictive value.

F-Test of Overall Significance

From the ANOVA table output, the p-value of 0.000 shows that we would reject the null hypothesis that all the slopes equal 0 and conclude that at least one of the slopes differs significantly from zero. HOWEVER, this does not tell us how many slopes differ from zero or which one(s) differ.

Coefficient of Determination R2

The R2 (i.e. R-Sq in the output) of 37.3% is interpreted as "37.3 percent of the variation in Final scores is explained by Quiz Average, Midterm, and Gender." This is not a very high percentage, as roughly 63% of the variation is left unexplained.

Since Sum of Squares Total (SST) = Sum of Squares Regression (SSR) + Sum of Squares Error (SSE), we calculate R2 as follows:

R2 = SSR / SST, or equivalently 1 − SSE / SST. Since R-squared is typically reported as a percentage, we then multiply this value by 100%.
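Both formulas can be checked against the ANOVA table from the output above:

```python
# R-squared from the ANOVA sums of squares in the regression output.
SSR, SSE = 2683.79, 4506.93
SST = SSR + SSE          # 7190.72, the Total line of the ANOVA table

r_sq = SSR / SST         # proportion of variation explained
r_sq_alt = 1 - SSE / SST # equivalent form

print(round(100 * r_sq, 1))  # 37.3 (percent), matching R-Sq in the output
```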
