8.3 - Two Separate Advantages

Perhaps somewhere along the way in our most recent discussion, you thought "why not just fit two separate regression functions — one for the smokers and one for the non-smokers?" (If you didn't think of it, I thought of it for you!) Are there advantages to including both the binary and quantitative predictor variables within one multiple regression model? The answer is yes! In this section, we explore the two primary advantages.

The first advantage

An easy way of discovering the first advantage is to analyze the data three times — once using the data on all 32 subjects, once using the data on only the 16 non-smokers, and once using the data on only the 16 smokers. Then, we can investigate the effects of the different analyses on important things such as sizes of standard errors of the coefficients and the widths of confidence intervals. Let's try it! 

(Birth Smokers data)

Here's the Minitab output for the analysis using a (0,1) indicator variable and the data on all 32 subjects. Let's just run through the output and collect information on various values obtained:

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant -2390 349 -6.84 0.000  
Gest 143.10 9.13 15.68 0.000 1.06
Smoke -244.5 42.0 -5.83 0.000 1.06
Regression Equation
Wgt = -2390 + 143.10 Gest - 244.5 Smoke

The standard error of the Gest coefficient is 9.13. Recall that this value quantifies how much the estimated Gest coefficient would vary from sample to sample. And, the following output:

Variable Setting
Gest 38
Smoke 1
Fit SE Fit 95% CI 95% PI
2803.69 30.8496 (2740.60, 2866.79) (2559.13, 3048.26)
Variable Setting
Gest 38
Smoke 0
Fit SE Fit 95% CI 95% PI
3048.24 28.9051 (2989.12, 3107.36) (2804.67, 3291.81)

tells us that for mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 126.2 for smoking mothers and 118.2 for non-smoking mothers.

Let's do that again, but this time for the Minitab output on just the 16 non-smoking mothers:

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant -2546 457 -5.57 0.000  
Gest_0 147.2 12.0 12.29 0.000 1.00
Regression Equation
Wgt_0 = -2546 + 147.2 Gest_0

The standard error of the Gest coefficient is 12.0. And:

Variable Setting
Gest_0 38
Fit SE Fit 95% CI 95% PI
3047.72 26.7748 (2990.30, 3105.15) (2811.30, 3284.15)

for non-smoking mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 114.9.

And, let's do the same thing one more time for the Minitab output on just the 16 smoking mothers:

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant -2475 554 -4.447 0.001  
Gest_1 139.0 14.1 9.85 0.000 1.00
Regression Equation
Wgt_1 = -2475 + 139.2 Gest_1

The standard error of the Gest coefficient is 14.1. And:

Variable Setting
Gest_1 38
Fit SE Fit 95% CI 95% PI
2808.53 35.8088 (2731.73, 2885.33) (2526.39, 3090.67)

for smoking mothers with a 38-week gestation, the length of the confidence interval is 153.6.

Here's a summary of what we've gleaned from the three pieces of output:

Model estimated using…
SE(Gest)
Width of CI for \(\mu_Y\)
all 32 data points
9.13
(NS) 118.2
(S) 126.2
16 nonsmokers
12.0
114.9
16 smokers
14.1
153.6

Let's see what we learn from this investigation:

  • The standard error of the Gest coefficient — SE(Gest) — is smallest for the estimated model based on all 32 data points. Therefore, confidence intervals for the Gest coefficient will be narrower if calculated using the analysis based on all 32 data points. (This is a good thing!)
  • The width of the confidence interval for the mean weight of babies born to smoking mothers is narrower for the estimated model based on all 32 data points (126.2 compared to 153.6), and not substantially different for non-smoking mothers (118.2 compared to 114.9). (Another good thing!)

In short, there appears to be an advantage in "pooling" and analyzing the data all at once rather than breaking it apart and conducting different analyses for each group. Our regression model assumes that the slope for the two groups are equal. It also assumes that the variances of the error terms are equal. Therefore, it makes sense to use as much data as possible to estimate these quantities.

The second advantage

An easy way of discovering the second advantage of fitting one "combined" regression function using all of the data is to consider how you'd answer the research question if you broke apart the data and conducted two separate analyses obtaining:

Nonsmokers

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant -2546 457 -5.57 0.000  
Gest_0 147.2 12.0 12.29 0.000 1.00
Regression Equation
Wgt_0 = -2546 + 147.2 Gest_0

Smokers

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant -2475 554 -4.47 0.001  
Gest_1 139.0 14.1 9.85 0.000 1.00
Regression Equation
Wgt_1 = -2475 + 139.0 Gest_1

How could you use these results to determine if the mean birth weight of babies differs between smoking and non-smoking mothers, after taking into account length of gestation? Not completely obvious, is it?! It actually could be done with much more (complicated) work than would be necessary if you analyze the data as a whole and fit one combined regression function:

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant -2390 349 -6.84 0.000  
Gest 143.10 9.13 15.68 0.000 1.06
Smoke -244.5 42.0 -5.83 0.000 1.06
Regression Equation
Wgt = -2390 + 143.10 Gest - 244.5 Smoke

As we previously discussed, answering the research question merely involves testing the null hypothesis \(H_0 \colon \beta_2 = 0\) against the alternative \(H_0 \colon \beta_2 \ne 0\). The P-value is < 0.001. There is sufficient evidence to conclude that there is a statistically significant difference in the mean birth weight of all babies of smoking mothers and the mean birth weight of all babies of non-smoking mothers, after taking into account length of gestation.

In summary, "pooling" your data and fitting one combined regression function allows you to easily and efficiently answer research questions concerning the binary predictor variable.