Perhaps somewhere along the way in our most recent discussion, you thought "why not just fit two separate regression functions — one for the smokers and one for the non-smokers?" (If you didn't think of it, I thought of it for you!) Are there advantages to including both the binary and quantitative predictor variables within one multiple regression model? The answer is yes! In this section, we explore the two primary advantages.
The first advantage
An easy way of discovering the first advantage is to analyze the data three times: once using the data on all 32 subjects, once using the data on only the 16 non-smokers, and once using the data on only the 16 smokers. Then, we can investigate how the different analyses affect important quantities such as the standard errors of the coefficients and the widths of confidence intervals. Let's try it!
Here's the Minitab output for the analysis using a (0,1) indicator variable and the data on all 32 subjects. Let's just run through the output and collect information on various values obtained:
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -2390 | 349 | -6.84 | 0.000 | |
Gest | 143.10 | 9.13 | 15.68 | 0.000 | 1.06 |
Smoke | -244.5 | 42.0 | -5.83 | 0.000 | 1.06 |
Regression Equation
Wgt = -2390 + 143.10 Gest - 244.5 Smoke
The standard error of the Gest coefficient is 9.13. Recall that this value quantifies how much the estimated Gest coefficient would vary from sample to sample. And, the following output:
Variable | Setting |
---|---|
Gest | 38 |
Smoke | 1 |
Fit | SE Fit | 95% CI | 95% PI |
---|---|---|---|
2803.69 | 30.8496 | (2740.60, 2866.79) | (2559.13, 3048.26) |
Variable | Setting |
---|---|
Gest | 38 |
Smoke | 0 |
Fit | SE Fit | 95% CI | 95% PI |
---|---|---|---|
3048.24 | 28.9051 | (2989.12, 3107.36) | (2804.67, 3291.81) |
tells us that for mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 126.2 for smoking mothers and 118.2 for non-smoking mothers.
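As a quick check on these numbers, the sketch below recomputes the fitted values from the regression equation and the interval widths from the reported endpoints. The function names are illustrative only, and the coefficients are the rounded values shown in the output, so the fits come out slightly different from the reported 2803.69 and 3048.24 (presumably because Minitab works with unrounded coefficients).

```python
# Coefficients as displayed (rounded) in the regression equation above.
INTERCEPT, B_GEST, B_SMOKE = -2390.0, 143.10, -244.5


def fitted_weight(gest_weeks, smoke):
    """Estimated mean birth weight from the rounded regression equation."""
    return INTERCEPT + B_GEST * gest_weeks + B_SMOKE * smoke


def ci_width(interval):
    """Width of an interval given as (lower, upper)."""
    lower, upper = interval
    return upper - lower


# 95% confidence intervals copied from the output for Gest = 38.
ci_smokers = (2740.60, 2866.79)
ci_nonsmokers = (2989.12, 3107.36)

print(f"fit, smokers:          {fitted_weight(38, 1):.1f}")
print(f"fit, non-smokers:      {fitted_weight(38, 0):.1f}")
print(f"CI width, smokers:     {ci_width(ci_smokers):.1f}")
print(f"CI width, non-smokers: {ci_width(ci_nonsmokers):.1f}")
```

The widths print as 126.2 and 118.2, matching the values quoted above.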
Let's do that again, but this time for the Minitab output on just the 16 non-smoking mothers:
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -2546 | 457 | -5.57 | 0.000 | |
Gest_0 | 147.2 | 12.0 | 12.29 | 0.000 | 1.00 |
Regression Equation
Wgt_0 = -2546 + 147.2 Gest_0
The standard error of the Gest coefficient is 12.0. And:
Variable | Setting |
---|---|
Gest_0 | 38 |
Fit | SE Fit | 95% CI | 95% PI |
---|---|---|---|
3047.72 | 26.7748 | (2990.30, 3105.15) | (2811.30, 3284.15) |
For non-smoking mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 114.9.
And, let's do the same thing one more time for the Minitab output on just the 16 smoking mothers:
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -2475 | 554 | -4.47 | 0.001 | |
Gest_1 | 139.0 | 14.1 | 9.85 | 0.000 | 1.00 |
Regression Equation
Wgt_1 = -2475 + 139.0 Gest_1
The standard error of the Gest coefficient is 14.1. And:
Variable | Setting |
---|---|
Gest_1 | 38 |
Fit | SE Fit | 95% CI | 95% PI |
---|---|---|---|
2808.53 | 35.8088 | (2731.73, 2885.33) | (2526.39, 3090.67) |
For smoking mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 153.6.
Here's a summary of what we've gleaned from the three pieces of output:
Model estimated using… | SE(Gest) | Width of CI for \(\mu_Y\) |
---|---|---|
all 32 data points | 9.13 | (NS) 118.2, (S) 126.2 |
16 nonsmokers | 12.0 | 114.9 |
16 smokers | 14.1 | 153.6 |
Let's see what we learn from this investigation:
- The standard error of the Gest coefficient — SE(Gest) — is the smallest for the estimated model based on all 32 data points. Therefore, confidence intervals for the Gest coefficient will be narrower if calculated using the analysis based on all 32 data points. (This is a good thing!)
- The width of the confidence interval for the mean weight of babies born to smoking mothers is narrower for the estimated model based on all 32 data points (126.2 compared to 153.6), and not substantially different for non-smoking mothers (118.2 compared to 114.9). (Another good thing!)
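To put rough numbers on these two observations, here is a minimal sketch that expresses each pooled result as a percent change from the corresponding separate-group result. The values are copied from the summary table; the dictionary and function names are illustrative only.

```python
# Values copied from the summary table above.
se_gest = {"pooled": 9.13, "nonsmokers": 12.0, "smokers": 14.1}
ci_width_pooled = {"nonsmokers": 118.2, "smokers": 126.2}
ci_width_separate = {"nonsmokers": 114.9, "smokers": 153.6}


def percent_change(new, old):
    """Percent change from old to new; negative means new is smaller."""
    return 100.0 * (new - old) / old


for group in ("nonsmokers", "smokers"):
    se_change = percent_change(se_gest["pooled"], se_gest[group])
    width_change = percent_change(ci_width_pooled[group], ci_width_separate[group])
    print(f"{group}: SE(Gest) {se_change:+.1f}%, CI width {width_change:+.1f}%")
```

On these figures, the pooled SE(Gest) is about 24% smaller than the non-smokers-only value and about 35% smaller than the smokers-only value. The pooled interval is about 18% narrower for smokers and about 3% wider for non-smokers.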
In short, there appears to be an advantage in "pooling" and analyzing the data all at once rather than breaking it apart and conducting different analyses for each group. Our regression model assumes that the slope for the two groups is equal. It also assumes that the variances of the error terms are equal. Therefore, it makes sense to use as much data as possible to estimate these quantities.
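To make the equal-slope assumption concrete, the combined model can be written out separately for each group. The notation here (\(y_i\) for birth weight, \(x_{i1}\) for length of gestation, \(x_{i2}\) for the 0/1 smoking indicator) is an assumption about how the earlier discussion set up the model; the numbers are the rounded coefficients from the combined output.

\[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i
\]

\[
\text{non-smokers } (x_{i2} = 0): \quad \widehat{\text{Wgt}} = -2390 + 143.10\,\text{Gest}
\]

\[
\text{smokers } (x_{i2} = 1): \quad \widehat{\text{Wgt}} = (-2390 - 244.5) + 143.10\,\text{Gest} = -2634.5 + 143.10\,\text{Gest}
\]

The two estimated lines share the slope 143.10 and differ only in their intercepts, by 244.5, which is why all 32 observations contribute to estimating the single Gest coefficient.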
The second advantage
An easy way of discovering the second advantage of fitting one "combined" regression function using all of the data is to consider how you'd answer the research question if you broke apart the data and conducted two separate analyses obtaining:
Nonsmokers
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -2546 | 457 | -5.57 | 0.000 | |
Gest_0 | 147.2 | 12.0 | 12.29 | 0.000 | 1.00 |
Regression Equation
Wgt_0 = -2546 + 147.2 Gest_0
Smokers
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -2475 | 554 | -4.47 | 0.001 | |
Gest_1 | 139.0 | 14.1 | 9.85 | 0.000 | 1.00 |
Regression Equation
Wgt_1 = -2475 + 139.0 Gest_1
How could you use these results to determine if the mean birth weight of babies differs between smoking and non-smoking mothers, after taking into account the length of gestation? Not completely obvious, is it? It could be done, but only with much more (and more complicated) work than is needed if you analyze the data as a whole and fit one combined regression function:
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | -2390 | 349 | -6.84 | 0.000 | |
Gest | 143.10 | 9.13 | 15.68 | 0.000 | 1.06 |
Smoke | -244.5 | 42.0 | -5.83 | 0.000 | 1.06 |
Regression Equation
Wgt = -2390 + 143.10 Gest - 244.5 Smoke
As we previously discussed, answering the research question merely involves testing the null hypothesis \(H_0 \colon \beta_2 = 0\) against the alternative \(H_A \colon \beta_2 \ne 0\). The P-value is < 0.001. There is sufficient evidence to conclude that there is a statistically significant difference between the mean birth weight of all babies of smoking mothers and the mean birth weight of all babies of non-smoking mothers, after taking into account the length of gestation.
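The test on the Smoke coefficient can be retraced by hand from the combined output. The sketch below is only an illustration: it uses the rounded coefficient and standard error, so the t-statistic differs slightly from the reported -5.83, and the critical value 2.045 is taken from a standard t table for 29 degrees of freedom rather than computed. The confidence interval it prints is not part of the Minitab output shown here.

```python
# Values copied from the combined-model output above.
coef_smoke = -244.5
se_smoke = 42.0

# Error degrees of freedom: 32 subjects minus 3 estimated coefficients
# (Constant, Gest, Smoke).
n_subjects, n_coefficients = 32, 3
df_error = n_subjects - n_coefficients

# Assumption: two-sided 5% critical value of the t distribution with 29 df,
# looked up in a standard t table rather than computed here.
T_CRIT_29 = 2.045

t_stat = coef_smoke / se_smoke
margin = T_CRIT_29 * se_smoke
ci_smoke = (coef_smoke - margin, coef_smoke + margin)
reject_h0 = abs(t_stat) > T_CRIT_29

print(f"df = {df_error}, t = {t_stat:.2f}, reject H0 at the 5% level: {reject_h0}")
print(f"approx. 95% CI for the Smoke coefficient: ({ci_smoke[0]:.1f}, {ci_smoke[1]:.1f})")
```

Because the whole interval lies below zero, it points to the same conclusion as the P-value: after accounting for gestation, the estimated mean birth weight is lower for babies of smoking mothers.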
In summary, "pooling" your data and fitting one combined regression function allows you to easily and efficiently answer research questions concerning the binary predictor variable.