Lesson 11: Multiple Linear Regression
Learning objectives for this lesson
Upon completion of this lesson, you should be able to do the following:
- Understand the differences between simple and multiple linear regression
- Interpret the coefficients in a multiple linear regression model
- Conduct t-tests for the individual slope estimates
- Learn how to include an indicator, or dummy, variable in a regression model
Multiple Linear Regression
In simple linear regression we consider only one predictor variable. When we include more than one predictor variable, we have a multiple linear regression model. This model is an extension of the simple model: we now include a parameter (i.e. slope) estimate, or coefficient, for each predictor variable in the model. As with simple linear regression, we have one Y or response variable (also called the dependent variable), but now have more than one X variable, also called explanatory, independent, or predictor variables. The multiple linear regression model is as follows:
\(Y=\beta_0+\beta_1 X_1+\ldots+\beta_k X_k +\epsilon\)
Where Y is the response variable and X1, ... , Xk are the independent variables. β0 , β1 , ... , βk are fixed (but unknown) parameters and ε is a random variable representing the error, which is normally distributed with mean 0 and variance \(\sigma^2_\epsilon\) .
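The software steps in this lesson use Minitab and SPSS. Purely as an illustrative sketch of the same model in code, here is how a multiple linear regression could be fit in Python with statsmodels; the data below are simulated, not the Exam data used later in the lesson.

```python
import numpy as np
import statsmodels.api as sm

# Simulate n = 50 observations from Y = 3 + 1.5*X1 - 2.0*X2 + error
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))                      # columns X1 and X2
y = 3 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

X_design = sm.add_constant(X)                    # adds the intercept column for beta_0
fit = sm.OLS(y, X_design).fit()                  # ordinary least squares fit
print(fit.summary())                             # coefficients, t-tests, F-test, R-squared
```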
11.1 - Inferences About the Parameters
The inferences and interpretations made in multiple linear regression are similar to those made in simple linear regression (in fact, the assumptions required are the same as those for simple linear regression, except that they must now hold for all of the predictors rather than just one), with four major differences:
- The t-tests for the slope coefficients are conditional tests. That is, each test assesses the predictive value that a variable adds to the model when the other variables are already included in the model.
- Each slope is interpreted as the change in Y expected for a one-unit increase in that particular X predictor variable, given that the other variables are held constant.
- The coefficient of determination, R2, still measures the amount of variation in the response variable Y that is explained by all of the predictor variables in the model. However, where before the square root of R2 could be interpreted as the correlation between X and Y, this result no longer holds true in multiple linear regression. Since we now have more than one X, this square root is no longer representative of a linear relationship between two variables which is what correlation measures.
- The ANOVA F-test is a test that all of the slopes in the model are equal to zero (this is the null hypothesis, H0) versus the alternative hypothesis, H1, that the slopes are not all equal to zero; i.e. at least one slope does not equal zero. This test is called the F-test for Overall Significance. The hypotheses statements appear as follows:
H0 : β1 = ... = βk = 0
H1 : At least one of the βi's is non-zero
F statistic [MSR is Mean Square Regression and MSE is Mean Square Error]:
\(F=\frac{MSR}{MSE}\)
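Here k is the number of predictors and n is the number of observations, so the mean squares come from the ANOVA sums of squares as follows:

\(MSR=\dfrac{SSR}{k}, \qquad MSE=\dfrac{SSE}{n-k-1}\)

Under H0 the F statistic follows an F distribution with k numerator and n − k − 1 denominator degrees of freedom.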
11.2 - Example
We will refer to the Exam Data set (Exam.MTW or Exam.XLS), which consists of a random sample of 50 students from a previous introduction to statistics course. We will use the variable Final as the response, and include the variables Quiz Average, Midterm, and Gender as the predictors.
Since Gender is recorded as text, we need to first code Gender into a dummy variable.
To recode a text variable to a dummy variable in Minitab:
- Import data into Minitab
- Data > Code > Text to Numeric
- Enter Gender into the text box for Code Data From Columns
- Type the word DummyGender into the text box for Store Coded Data In Columns
- In the first text box under Original Values type in Female and in the first text box under New type the number 0
- In the second text box under Original Values type in Male and in the second text box under New type the number 1
- Click OK
To recode a text variable to a dummy variable in SPSS:
- Import data into SPSS
- Since the variable Gender has text responses (i.e. Male, Female) we need to recode this variable into a numeric. We will use 1 to represent Male and 0 for Female.
- Go to Transform > Recode Into Different Variables
- Enter Gender into the Output Variable Window
- In the text box under Output Variable labeled Name: enter DummyGender
- Click Change
- Click the button Old and New Values
- Under Old Value click Value and type in Male
- Under New Value enter in the Value text box the value 1
- Under Old → New click Add
- Repeat steps 8 through 11 typing in Female and 0
- Click Continue
- Click OK (you should now have a new column of ones and zeroes titled DummyGender)
NOTE: Software is case sensitive. This means that you have to enter the text value exactly as it appears in the worksheet. Here, for example, Female uses a capital F. If you used lower case female, the DummyGender column would include missing data (i.e. an *) wherever Gender was Female. Also, notice that DummyGender is entered as one word and not two words.
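For readers working outside Minitab and SPSS, the same recoding could be sketched in Python with pandas. This assumes the Exam data have been exported to a CSV file named Exam.csv with a Gender column; that file name is hypothetical and not one of the lesson's data files.

```python
import pandas as pd

# Hypothetical CSV export of the Exam data set (not one of the lesson's files)
exam = pd.read_csv("Exam.csv")

# Code Female as 0 and Male as 1; the mapping is case sensitive,
# just like the Minitab/SPSS recoding described above
exam["DummyGender"] = exam["Gender"].map({"Female": 0, "Male": 1})
```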
To perform the multiple linear regression analysis in Minitab, with the data set open, do the following:
- From the menu bar select Stat > Regression > Regression
- With the cursor in the Response box double click on Final from the list of variables. After making this selection the cursor should automatically move to the Predictors window
- With the cursor in the Predictors window double click on Quiz Average, Midterm, and DummyGender. This should move these variables into the Predictors window.
- Click OK
The output is as follows:
Regression Analysis
The regression equation is
Final = 12.6 + 0.731 Quiz Average + 0.024 Midterm − 1.26 DummyGender
| Predictor | Coef | StDev | T | P |
| --- | --- | --- | --- | --- |
| Constant | 12.64 | 15.26 | 0.83 | 0.412 |
| Quiz Average | 0.7307 | 0.1653 | 4.42 | 0.000 |
| Midterm | 0.0237 | 0.1934 | 0.12 | 0.903 |
| DummyGender | −1.259 | 2.487 | −0.44 | 0.661 |

S = 9.89832   R-Sq = 37.3%   R-Sq(adj) = 33.2%
Analysis of Variance
| Source | DF | SS | MS | F | P |
| --- | --- | --- | --- | --- | --- |
| Regression | 3 | 2683.79 | 894.60 | 9.13 | 0.000 |
| Residual Error | 46 | 4506.93 | 97.98 | | |
| Total | 49 | 7190.72 | | | |
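As a check of how the entries in the ANOVA table fit together (with k = 3 predictors and n = 50 students):

\(MSR=\frac{SSR}{k}=\frac{2683.79}{3}\approx 894.60, \qquad MSE=\frac{SSE}{n-k-1}=\frac{4506.93}{46}\approx 97.98, \qquad F=\frac{MSR}{MSE}\approx 9.13\)

Note also that \(S=\sqrt{MSE}=\sqrt{97.98}\approx 9.898\), which matches the S reported above.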
To perform a multiple linear regression analysis in SPSS:
- With the data set open in SPSS
- Click Analyze > Regression > Linear
- Click the variable Final and move it to the text box for Dependent
- Click the variables Quiz_Average, Midterm, and DummyGender and move them to the text box for Independent(s)
- Click OK
The output gives the same fitted equation and test results as the Minitab output shown above.
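For comparison only, the same regression could also be reproduced in Python with the statsmodels formula interface. This is a sketch that reuses the hypothetical Exam.csv export and DummyGender column from the pandas sketch earlier; the column Quiz Average is renamed because model formulas cannot contain spaces.

```python
import pandas as pd
import statsmodels.formula.api as smf

exam = pd.read_csv("Exam.csv")  # hypothetical CSV export of Exam.MTW / Exam.XLS
exam["DummyGender"] = exam["Gender"].map({"Female": 0, "Male": 1})
exam = exam.rename(columns={"Quiz Average": "Quiz_Average"})

# Final regressed on Quiz Average, Midterm, and the gender dummy
fit = smf.ols("Final ~ Quiz_Average + Midterm + DummyGender", data=exam).fit()
print(fit.summary())
```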
11.3 - About the Output
Coefficient Interpretation
Each coefficient value is interpreted as the change in Y we expect for a one-unit increase in that particular X predictor variable, given that the other variables are held constant. For example, with DummyGender represented by 0 for Female and 1 for Male, if we hold Quiz Average and Midterm constant, then we would expect Males to score about 1.26 points lower on the Final than Females (since the slope is negative!).
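As a quick check using the fitted regression equation above, for any fixed Quiz Average and Midterm the predicted difference between Males (DummyGender = 1) and Females (DummyGender = 0) is:

\(\widehat{Final}_{Male}-\widehat{Final}_{Female}=-1.26(1-0)=-1.26\)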
t-Test for Individual Coefficients
A t-test on an individual coefficient is a test of its significance in the presence of all other explanatory variables. For instance, the test of whether the slope for Quiz Average equals zero versus does not equal zero, given that Midterm and DummyGender are already present in the model, has a p-value of 0.000. If we use α of 0.05 to perform this test, then with the p-value less than alpha we would reject the null hypothesis and decide that the slope for Quiz Average differs from zero. Our conclusion is that Quiz Average is a significant linear predictor of Final when Midterm and DummyGender are present in the model; we should include Quiz Average in a model containing the other two predictors.
Notice that the p-values for the other two predictor variables are greater than our 0.05 level of significance. This indicates that neither variable is a significant linear predictor of Final when the other two variables are in the model. This does NOT imply that both variables should be dropped; only that when Quiz Average and Midterm are in the model, DummyGender does not offer significant predictive value, and when Quiz Average and DummyGender are in the model, Midterm does not provide significant predictive value.
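As a side check outside of Minitab and SPSS, the p-values in the coefficient table can be approximately reproduced from the reported t statistics and the error degrees of freedom (n − k − 1 = 46); small differences are due to rounding of the t statistics. A minimal Python sketch, assuming SciPy is available:

```python
from scipy import stats

df_error = 46  # n - k - 1 = 50 - 3 - 1

# Two-sided p-value for each t statistic from the coefficient table
for name, t in [("Quiz Average", 4.42), ("Midterm", 0.12), ("DummyGender", -0.44)]:
    p = 2 * stats.t.sf(abs(t), df_error)
    print(f"{name}: p = {p:.3f}")
```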
F-Test of Overall Significance
From the ANOVA table output, the p-value of 0.000 shows that we would reject the null hypothesis that all the slopes equal 0 and conclude that at least one of the slopes differs significantly from zero. HOWEVER, this does not tell us how many of the slopes differ or which one(s) differ.
Coefficient of Determination R2
The R2 (i.e. R-Sq in the output) of 37.3% is interpreted as "37.3 percent of the variation in Final scores is explained by Quiz Average, Midterm, and Gender." This is not a very high percentage, as roughly 63% of the variation is left unexplained.
Since Sum of Squares Total (SST) = Sum of Squares Regression (SSR) + Sum of Squares Error (SSE), we calculate R2 as follows:
R2 = SSR / SST, or equivalently 1 − SSE / SST. Since R-squared is typically reported as a percentage, we then multiply this value by 100%.
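Using the values from the ANOVA table above:

\(R^2=\frac{SSR}{SST}=\frac{2683.79}{7190.72}\approx 0.373 = 37.3\%\)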