8.3 - Two Separate Advantages

Perhaps somewhere along the way in our most recent discussion, you thought "why not just fit two separate regression functions — one for the smokers and one for the non-smokers?" (If you didn't think of it, I thought of it for you!) Are there advantages to including both the binary and quantitative predictor variables within one multiple regression model? The answer is yes! In this section, we explore the two primary advantages.

The first advantage

An easy way of discovering the first advantage is to analyze the data three times: once using the data on all 32 subjects, once using the data on only the 16 non-smokers, and once using the data on only the 16 smokers. Then, we can investigate the effects of the different analyses on important things such as the sizes of the standard errors of the coefficients and the widths of the confidence intervals. Let's try it!

(Birth Smokers data)

Here's the Minitab output for the analysis using a (0,1) indicator variable and the data on all 32 subjects. Let's just run through the output and collect information on various values obtained:

Coefficients

Term	Coef	SE Coef	T-Value	VIF
Constant	-2390	349	-6.84
Gest	143.10	9.13	15.68	1.06
Smoke	-244.5	42.0	-5.83	1.06

Regression Equation

Wgt = -2390 + 143.10 Gest - 244.5 Smoke

The standard error of the Gest coefficient is 9.13. Recall that this value quantifies how much the estimated Gest coefficient would vary from sample to sample. And, the following output:

Variable Setting

Gest	38
Smoke	1

Fit	SE Fit	95% CI	95% PI
2803.69	30.8496	(2740.60, 2866.79)	(2559.13, 3048.26)

Variable Setting

Gest	38
Smoke	0

Fit	SE Fit	95% CI	95% PI
3048.24	28.9051	(2989.12, 3107.36)	(2804.67, 3291.81)

tells us that for mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 126.2 for smoking mothers and 118.2 for non-smoking mothers.
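As a quick check on these widths, the intervals can be rebuilt from the Fit and SE Fit columns. The sketch below assumes the usual form Fit ± t × SE Fit with 32 - 3 = 29 error degrees of freedom, and takes the multiplier 2.045 from a standard t table. The variable names are ours, not Minitab's.

```python
# Rebuild the 95% CIs for the mean response from the Minitab "Fit" and "SE Fit" values.
# Assumption: t multiplier for 95% confidence with 29 error df, read from a t table.
T_MULT_29_DF = 2.045

fits = {
    "smoker (Smoke = 1)": (2803.69, 30.8496),
    "non-smoker (Smoke = 0)": (3048.24, 28.9051),
}

widths = {}
for group, (fit, se_fit) in fits.items():
    half_width = T_MULT_29_DF * se_fit
    lower, upper = fit - half_width, fit + half_width
    widths[group] = upper - lower
    print(f"{group}: ({lower:.1f}, {upper:.1f}), width = {widths[group]:.1f}")
```

The rebuilt widths agree with the 126.2 and 118.2 quoted above to within rounding.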

Let's do that again, but this time for the Minitab output on just the 16 non-smoking mothers:

Coefficients

Term	Coef	SE Coef	T-Value	P-Value	VIF
Constant	-2546	457	-5.57	0.000
Gest_0	147.2	12.0	12.29	0.000	1.00

Regression Equation

Wgt_0 = -2546 + 147.2 Gest_0

The standard error of the Gest coefficient is 12.0. And:

Variable	Setting
Gest_0	38

Fit	SE Fit	95% CI	95% PI
3047.72	26.7748	(2990.30, 3105.15)	(2811.30, 3284.15)

for non-smoking mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 114.9.

And, let's do the same thing one more time for the Minitab output on just the 16 smoking mothers:

Coefficients

Term	Coef	SE Coef	T-Value	P-Value	VIF
Constant	-2475	554	-4.47	0.001
Gest_1	139.0	14.1	9.85	0.000	1.00

Regression Equation

Wgt_1 = -2475 + 139.0 Gest_1

The standard error of the Gest coefficient is 14.1. And:

Variable	Setting
Gest_1	38

Fit	SE Fit	95% CI	95% PI
2808.53	35.8088	(2731.73, 2885.33)	(2526.39, 3090.67)

for smoking mothers with a 38-week gestation, the width of the confidence interval for the mean birth weight is 153.6.

Here's a summary of what we've gleaned from the three pieces of output:

Model estimated using…	SE(Gest)	Width of CI for \(\mu_Y\)
all 32 data points	9.13	(NS) 118.2 (S) 126.2
16 nonsmokers	12.0	114.9
16 smokers	14.1	153.6

Let's see what we learn from this investigation:

The standard error of the Gest coefficient — SE(Gest) — is the smallest for the estimated model based on all 32 data points. Therefore, confidence intervals for the Gest coefficient will be narrower if calculated using the analysis based on all 32 data points. (This is a good thing!)
At a 38-week gestation, the confidence interval for the mean weight of babies born to smoking mothers is narrower for the estimated model based on all 32 data points (126.2 compared to 153.6), and its width is not substantially different for non-smoking mothers (118.2 compared to 114.9). (Another good thing!)
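The first point can be made concrete by computing 95% confidence intervals for the Gest coefficient from each analysis. This is a sketch under stated assumptions: 29 error degrees of freedom for the combined model (32 - 3) and 14 for each separate model (16 - 2), with t multipliers 2.045 and 2.145 taken from a t table.

```python
# 95% CI for the Gest slope: Coef ± t × SE Coef, for each of the three analyses.
# Assumption: t multipliers come from a t table (29 df -> 2.045, 14 df -> 2.145).
analyses = {
    "all 32 data points": {"coef": 143.10, "se": 9.13, "t_mult": 2.045},
    "16 nonsmokers": {"coef": 147.2, "se": 12.0, "t_mult": 2.145},
    "16 smokers": {"coef": 139.0, "se": 14.1, "t_mult": 2.145},
}

slope_ci_widths = {}
for name, a in analyses.items():
    half = a["t_mult"] * a["se"]
    slope_ci_widths[name] = 2 * half
    print(f"{name}: {a['coef'] - half:.1f} to {a['coef'] + half:.1f} "
          f"(width {slope_ci_widths[name]:.1f})")
```

The combined analysis gives the narrowest interval for two reasons: its standard error is smaller, and its t multiplier is smaller because more error degrees of freedom are available.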

In short, there appears to be an advantage in "pooling" and analyzing the data all at once rather than breaking it apart and conducting different analyses for each group. Our regression model assumes that the slopes for the two groups are equal. It also assumes that the variances of the error terms are equal. Therefore, it makes sense to use as much data as possible to estimate these quantities.
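To see why pooling helps when estimating a common slope, recall that under the model's assumptions (equal slopes and a common error variance \(\sigma^2\)), the variance of the estimated slope is \(\sigma^2\) divided by the sum of the within-group sums of squares of the predictor. The sketch below uses made-up gestation values, not the Birth Smokers data, purely to illustrate that the combined denominator is larger than either group's alone.

```python
# Illustration with hypothetical gestation lengths (weeks); NOT the Birth Smokers data.
def sxx(values):
    """Sum of squared deviations of the predictor about its own group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

gest_nonsmokers = [34, 36, 37, 38, 39, 40, 41, 42]  # hypothetical
gest_smokers = [33, 35, 36, 38, 39, 40, 41, 42]     # hypothetical

sxx_ns = sxx(gest_nonsmokers)
sxx_s = sxx(gest_smokers)

# Var(slope) = sigma^2 * factor; a smaller factor means a more precise slope estimate.
factor_separate_ns = 1 / sxx_ns
factor_separate_s = 1 / sxx_s
factor_pooled = 1 / (sxx_ns + sxx_s)  # common-slope model with a group indicator

print(factor_separate_ns, factor_separate_s, factor_pooled)
```

In practice, each analysis also uses its own estimate of \(\sigma\), so the reported standard errors reflect both the design and the estimated error variance.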

The second advantage

An easy way of discovering the second advantage of fitting one "combined" regression function using all of the data is to consider how you'd answer the research question if you broke apart the data and conducted two separate analyses, obtaining:

Nonsmokers

Coefficients

Term	Coef	SE Coef	T-Value	P-Value	VIF
Constant	-2546	457	-5.57	0.000
Gest_0	147.2	12.0	12.29	0.000	1.00

Regression Equation

Wgt_0 = -2546 + 147.2 Gest_0

Smokers

Coefficients

Term	Coef	SE Coef	T-Value	P-Value	VIF
Constant	-2475	554	-4.47	0.001
Gest_1	139.0	14.1	9.85	0.000	1.00

Regression Equation

Wgt_1 = -2475 + 139.0 Gest_1

How could you use these results to determine if the mean birth weight of babies differs between smoking and non-smoking mothers, after taking into account the length of gestation? Not completely obvious, is it? It could be done, but with much more (and more complicated) work than is necessary if you analyze the data as a whole and fit one combined regression function:

Coefficients

Term	Coef	SE Coef	T-Value	VIF
Constant	-2390	349	-6.84
Gest	143.10	9.13	15.68	1.06
Smoke	-244.5	42.0	-5.83	1.06

Regression Equation

Wgt = -2390 + 143.10 Gest - 244.5 Smoke

As we previously discussed, answering the research question merely involves testing the null hypothesis \(H_0 \colon \beta_2 = 0\) against the alternative \(H_A \colon \beta_2 \ne 0\). The P-value is < 0.001. There is sufficient evidence to conclude that there is a statistically significant difference in the mean birth weight of all babies of smoking mothers and the mean birth weight of all babies of non-smoking mothers, after taking into account the length of gestation.
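The test can be reproduced approximately from the rounded output. This is a sketch, assuming 32 - 3 = 29 error degrees of freedom and a two-sided 5% critical value of 2.045 from a t table. Because the coefficient and its standard error are rounded in the output, the statistic differs slightly from Minitab's -5.83.

```python
# t-test of H0: beta_2 = 0 for the Smoke coefficient, using the rounded Minitab values.
coef_smoke = -244.5
se_smoke = 42.0
n_obs, n_params = 32, 3          # parameters: intercept, Gest, Smoke
df_error = n_obs - n_params      # 29

t_stat = coef_smoke / se_smoke
T_CRIT_29_DF = 2.045             # assumption: two-sided 5% critical value from a t table

reject_h0 = abs(t_stat) > T_CRIT_29_DF
print(f"t = {t_stat:.2f} on {df_error} df; reject H0 at the 5% level: {reject_h0}")
```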

In summary, "pooling" your data and fitting one combined regression function allows you to easily and efficiently answer research questions concerning the binary predictor variable.