8.1 - Example on Birth Weight and Smoking

Example: Is a baby's birth weight related to the mother's smoking during pregnancy?

Researchers (Daniel, 1999) interested in answering the above research question collected the following data (birthsmokers.txt) on a random sample of n = 32 births:

  • Response (y): birth weight (Weight) in grams of baby
  • Potential predictor (x1): length of gestation (Gest) in weeks
  • Potential predictor (x2): smoking status of mother (Smoking), yes or no

The distinguishing feature of this data set is that one of the predictor variables — Smoking — is a qualitative predictor. To be more precise, smoking is a "binary variable" with only two possible values (yes or no). The other predictor variable (Gest) is, of course, quantitative.

The scatter plot matrix:

scatter plot matrix

suggests, not surprisingly, that there is a positive linear relationship between length of gestation and birth weight. That is, as the length of gestation increases, the birth weight of babies tends to increase. It is hard to see if any kind of (marginal) relationship exists between birth weight and smoking status, or between length of gestation and smoking status.

The important question remains — after taking into account length of gestation, is there a significant difference in the average birth weights of babies born to smoking and non-smoking mothers? A first-order model with one binary predictor and one quantitative predictor that helps us answer the question is:

\[y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2})+\epsilon_i\] 

where:

  • \(y_i\) is the birth weight of baby i
  • \(x_{i1}\) is the length of gestation of baby i
  • \(x_{i2}\) is a binary variable coded as 1 if the baby's mother smoked during pregnancy and 0 if she did not

and the independent error terms \(\epsilon_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).

Notice that in order to include a qualitative variable in a regression model, we have to "code" the variable, that is, assign a unique number to each of the possible categories. We'll learn more about coding in the remainder of this lesson.
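As a minimal sketch of what "coding" means here, the following Python snippet maps a yes/no smoking status to the 0/1 indicator used in the model. The function name `code_smoking` and the raw values are hypothetical illustrations; they are not taken from birthsmokers.txt.

```python
def code_smoking(status: str) -> int:
    """Map a yes/no smoking status to the 0/1 indicator used in the model."""
    normalized = status.strip().lower()
    if normalized == "yes":
        return 1  # mother smoked during pregnancy
    if normalized == "no":
        return 0  # mother did not smoke
    raise ValueError(f"unexpected smoking status: {status!r}")


# Hypothetical raw values for illustration only (not the course data).
raw_status = ["yes", "no", "No", "YES"]
x2 = [code_smoking(s) for s in raw_status]
print(x2)
```

Raising an error on anything other than yes or no is a design choice in this sketch: it keeps a stray category from being coded silently as 0.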

Using the sample data on n = 32 births, the plot of the estimated regression function looks like:

plot of the estimated regression function

The blue circles represent the data on non-smoking mothers (\(x_{i2}=0\)), while the red circles represent the data on smoking mothers (\(x_{i2}=1\)). Likewise, the blue line represents the estimated linear relationship between length of gestation and birth weight for non-smoking mothers, while the red line represents the estimated linear relationship for smoking mothers.

At least in this sample of data, it appears that the average birth weight of babies born to non-smoking mothers is higher than that of babies born to smoking mothers, regardless of the length of gestation. A hypothesis test or confidence interval would allow us to see if this result extends to the larger population.

Did you expect the plot of the estimated regression equation to appear as two distinct lines? Let's consider this question. Statistical software tells us that the estimated regression function is:

Weight = - 2390 + 143 Gest - 245 Smoking

Therefore, the estimated regression equation for non-smoking mothers (smoking = 0) is:

Weight = - 2390 + 143 Gest

and the estimated regression equation for smoking mothers (when smoking = 1) is:

Weight = - 2635 + 143 Gest

That is, we obtain two parallel estimated lines; they are parallel because they share the same slope, 143. The vertical distance between the two lines, –245 grams (the difference between the intercepts –2635 and –2390), is the estimated difference in average birth weight between babies of smoking and non-smoking mothers at any fixed gestation length in the sample.
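The arithmetic behind the two parallel lines can be checked with a short Python sketch. It uses only the rounded coefficients quoted above (intercept -2390, gestation slope 143, and a shift of -245 for smoking). The function name `predicted_weight` and the gestation values are illustrative choices, not part of the original analysis.

```python
# Rounded estimates quoted in the text (birth weight in grams, gestation in weeks).
B0 = -2390       # intercept for non-smoking mothers
B_GEST = 143     # common slope for length of gestation
B_SMOKE = -245   # shift in the intercept when smoking = 1


def predicted_weight(gest_weeks, smoker):
    """Estimated mean birth weight; smoker is 1 for smoking mothers, 0 otherwise."""
    return B0 + B_GEST * gest_weeks + B_SMOKE * smoker


# Illustrative gestation lengths; the gap between the lines is the same at each.
for gest in (36, 38, 40):
    non_smoking = predicted_weight(gest, 0)
    smoking = predicted_weight(gest, 1)
    print(gest, non_smoking, smoking, smoking - non_smoking)
```

Because the smoking indicator only shifts the intercept, the last column is the same at every gestation length, which is what "parallel" means here.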

How would we answer the following set of research questions? (Do the procedures that appear in parentheses seem appropriate in answering the research question?)

  • Is a baby's birth weight related to the mother's smoking during pregnancy, after taking into account length of gestation? (Conduct a hypothesis test for testing whether the slope parameter for smoking is 0.)
  • How is birth weight related to gestation, after taking into account a mother's smoking status? (Calculate and interpret a confidence interval for the slope parameter for gestation.)
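For reference, the two procedures in parentheses can be written out as follows. This is a sketch under the first-order model above, which has three regression parameters, so the error degrees of freedom are n - 3 = 29. The symbols \(b_1\) and \(b_2\) (the estimated slopes for gestation and smoking) and \(\text{se}(\cdot)\) (their standard errors) are notation assumed here for illustration.

\[H_0\colon \beta_2 = 0 \quad \text{versus} \quad H_A\colon \beta_2 \neq 0, \qquad t^* = \frac{b_2}{\text{se}(b_2)}\]

\[b_1 \pm t_{(1-\alpha/2,\, n-3)} \times \text{se}(b_1)\]

The first line is the t-test for the smoking slope; the second is a \((1-\alpha)100\%\) confidence interval for the gestation slope.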

Upon analyzing the data, the software output:

Minitab output

tells us that:

  • A whopping 89.64% of the variation in the birth weights of babies is explained by the length of gestation and the smoking status of the mother.
  • The P-values for the t-tests appearing in the table of estimates suggest that the slope parameters for Gest (P < 0.001) and Smoking (P < 0.001) are significantly different from 0.
  • The P-value for the analysis of variance F-test (P < 0.001) suggests that the model containing length of gestation and smoking status is more useful in predicting birth weight than a model that ignores both predictors.
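The first and last bullets are connected: for a model with an intercept, the overall F-statistic can be recovered from the coefficient of determination. The Python sketch below uses the standard identity F = (R²/(p - 1)) / ((1 - R²)/(n - p)), where p counts the regression parameters including the intercept. The function name `overall_f_from_r2` is hypothetical, and the only inputs are figures quoted in this section (R² = 89.64%, n = 32, p = 3).

```python
def overall_f_from_r2(r2, n, p):
    """Overall F statistic implied by R^2 for a model with p regression
    parameters (intercept included) fit to n observations."""
    if not 0 <= r2 < 1:
        raise ValueError("r2 must be in [0, 1)")
    if n <= p or p < 2:
        raise ValueError("need n > p >= 2")
    numerator = r2 / (p - 1)          # explained variation per model degree of freedom
    denominator = (1 - r2) / (n - p)  # unexplained variation per error degree of freedom
    return numerator / denominator


# Values quoted in the text: R^2 = 0.8964, n = 32 births, p = 3 parameters.
f_stat = overall_f_from_r2(0.8964, n=32, p=3)
print(round(f_stat, 2))
```

A large F-statistic on 2 and 29 degrees of freedom is consistent with the very small P-value reported for the analysis of variance F-test.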