8.8 - Further Categorical Predictor Examples

Example 1: Muscle Mass Data

Suppose that we describe y = muscle mass as a function of x1 = age and x2 = gender for people in the 40- to 60-year-old age group. We could code the gender variable as x2 = 1 if the subject is female and x2 = 0 if the subject is male.

Consider the multiple regression equation

E(Y) = β0 + β1x1 + β2x2.

The usual slope interpretation works for β2, the coefficient that multiplies the gender indicator: increasing the indicator by one unit simply moves us from male to female. Thus β2 is the difference in average muscle mass between females and males of the same age (female minus male).
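As a minimal sketch of this interpretation, the snippet below evaluates the regression equation for a male and a female of the same age. The coefficient values are hypothetical placeholders chosen only for illustration (no fitted values are given for this example); the point is that the female-minus-male gap equals β2 at every age.

```python
# Hypothetical coefficients, for illustration only (not estimates from any dataset).
B0, B1, B2 = 120.0, -1.0, -20.0

def mean_muscle_mass(age, female):
    """E(Y) = b0 + b1*age + b2*female, where female is 1 for women and 0 for men."""
    return B0 + B1 * age + B2 * female

# Female-minus-male difference at several ages in the 40 to 60 range.
gaps = [mean_muscle_mass(age, 1) - mean_muscle_mass(age, 0) for age in (40, 50, 60)]
print(gaps)  # every entry equals B2
```

Because the model has no interaction term, the two fitted lines are parallel and the gap does not depend on age.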

Example 2: Real Estate Air Conditioning

Consider the real estate dataset: realestate.txt. Let us define

  • Y = sale price of home
  • X1 = square footage of home
  • X2 = whether home has air conditioning or not.

To put the air conditioning variable into a model, create a variable coded as either 1 or 0 to represent the presence or absence of air conditioning, respectively. With a 1, 0 coding for air conditioning and the model:

yi = β0 + β1 xi,1 + β2 xi,2 + εi ,

the beta coefficient that multiplies the air conditioning variable will estimate the difference in average sale prices of homes that have air conditioning versus homes that do not, given that the homes have the same square foot area.

Suppose we think that the effect of air conditioning (yes or no) depends upon the size of the home. In other words, suppose that there is interaction between the x-variables. To put an interaction into a model, we multiply the variables involved. The model here is

yi = β0 + β1 xi,1 + β2 xi,2 + β3 xi,1xi,2 + εi

The data are from n = 521 homes. Statistical software output follows. Notice that there is a statistically significant result for the interaction term.

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -3.218     30.085  -0.107 0.914871    
SqFeet       104.902     15.748   6.661 6.96e-11 ***
Air          -78.868     32.663  -2.415 0.016100 *  
SqFeet.Air    55.888     16.580   3.371 0.000805 ***

Software note: To fit this model, we calculate a new variable by multiplying the square feet and air conditioning variables. That variable is then used as a predictor, along with the original x-variables.
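The software note can be sketched in a few lines. The records below are made-up values used only to show the mechanics; they are not rows from realestate.txt. The column order (intercept, SqFeet, Air, SqFeet.Air) mirrors the output above.

```python
# Hypothetical (square feet, air conditioning) records, for illustration only.
homes = [(1.5, 1), (2.2, 0), (3.0, 1)]

def design_row(sq_feet, air):
    """Return [1, SqFeet, Air, SqFeet*Air] for one home (air is coded 1 or 0)."""
    return [1.0, sq_feet, float(air), sq_feet * air]

design_matrix = [design_row(sq_feet, air) for sq_feet, air in homes]
for row in design_matrix:
    print(row)
```

The product column is zero for every home without air conditioning, so the interaction coefficient only affects the fitted line for homes that have it.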

The regression equation is:

Average SalePrice = −3.218 + 104.902 × SqFeet − 78.868 × Air + 55.888 × SqFeet × Air.

Suppose that a home has air conditioning. Then Air = 1, so we substitute Air = 1 in both places where Air occurs in the estimated model. This gives us

Average SalePrice = −3.218 + 104.902 × SqFeet − 78.868(1) + 55.888 × SqFeet × 1
= −82.086 + 160.790 × SqFeet.

Suppose that a home does not have air conditioning. Then Air = 0, so we substitute Air = 0 in both places where Air occurs in the estimated model. This gives us

Average SalePrice = −3.218 + 104.902 × SqFeet − 78.868(0) + 55.888 × SqFeet × 0
= −3.218 + 104.902 × SqFeet.
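The two substitutions can be checked with a short sketch. The estimates are copied from the software output; the function name is just an illustrative choice.

```python
# Estimates copied from the software output above.
b0, b_sqfeet, b_air, b_interaction = -3.218, 104.902, -78.868, 55.888

def line_for(air):
    """Intercept and slope of the fitted sale-price line when Air is 0 or 1."""
    intercept = b0 + b_air * air
    slope = b_sqfeet + b_interaction * air
    return intercept, slope

for air in (1, 0):
    intercept, slope = line_for(air)
    print(f"Air = {air}: intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Setting Air = 1 shifts the intercept by the Air coefficient and the slope by the interaction coefficient, which is why the interaction model gives two non-parallel lines.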

The figure below is a graph of the relationship between sale price and square foot area for homes with air conditioning and homes without air conditioning. The equations of the two lines are the equations that we just derived above. The difference between the two lines increases as the square foot area increases. This means that the difference in average sale price between homes with and without air conditioning increases as the size of the home increases.
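To make the widening gap explicit, subtract the fitted line for homes without air conditioning from the fitted line for homes with air conditioning:

\[(-82.086 + 160.790 \times \text{SqFeet}) - (-3.218 + 104.902 \times \text{SqFeet}) = -78.868 + 55.888 \times \text{SqFeet}.\]

The gap grows by 55.888 for each one-unit increase in square feet, which is exactly the interaction coefficient. It is positive only when SqFeet exceeds 78.868/55.888 ≈ 1.41, in whatever units the dataset records square footage, so for homes smaller than that the fitted line for air-conditioned homes lies below the other line.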

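[Figure: sale price versus square foot area, with fitted lines for homes with and without air conditioning]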

There is an increasing variance problem apparent in the above plot, which is even more obvious from the megaphone pattern in the following residual plot:

[Figure: residual plot for the untransformed model]

To remedy this, we'll try using log transformations for sale price and square footage, both of which are quite highly skewed. Now, Y = log(sale price), X1 = log(home’s square foot area), and X2 = 1 if air conditioning is present and 0 if not. After fitting the model:

yi = β0 + β1 xi,1 + β2 xi,2 + β3 xi,1xi,2 + εi

the plot showing the regression lines is as follows:

[Figure: log(sale price) versus log(square foot area), with the fitted regression lines]

and the residual plot, which shows a vast improvement, is as follows:

[Figure: residual plot for the log-transformed model]

Example 3: Hospital Infection Risk Data

Consider the hospital infection risk data: infectionrisk.txt. For this example, the data are limited to observations with average length of stay ≤ 14 days. The overall sample size is n = 111. The variables we will analyze are the following:

Y = infection risk in hospital
X1 = average length of patient’s stay (in days)
X2 = a measure of frequency of giving X-rays
X3 = an indicator of which of four U.S. regions the hospital is located in (north-east, north-central, south, west).

The focus of the analysis will be on regional differences. Region is a categorical variable so we must use indicator variables to incorporate region information into the model. There are four regions. The full set of indicator variables for the four regions is as follows:

I1 = 1 if hospital is in region 1 (north-east) and 0 if not
I2 = 1 if hospital is in region 2 (north-central) and 0 if not
I3 = 1 if hospital is in region 3 (south) and 0 if not
I4 = 1 if hospital is in region 4 (west), 0 otherwise.

To avoid a linear dependency in the X matrix, we will leave out one of these indicators when forming the model. Using all but the first indicator to describe regional differences (so that "north-east" is the reference region), a possible multiple regression model for E(Y), the mean infection risk, is:

E(Y) = β0 + β1X1 + β2X2 + β3I2 + β4I3 + β5I4.
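The indicator coding and the reason for dropping one indicator can be sketched as follows. The function names are illustrative choices, not part of any particular software package.

```python
REGIONS = ["north-east", "north-central", "south", "west"]

def full_indicators(region):
    """All four 0/1 indicators (I1, I2, I3, I4) for one hospital."""
    return [1 if region == name else 0 for name in REGIONS]

def model_indicators(region):
    """Indicators used in the model: drop I1 so north-east is the reference."""
    return full_indicators(region)[1:]

for region in REGIONS:
    full = full_indicators(region)
    # The four indicators always sum to 1, duplicating the intercept column;
    # this is the linear dependency that dropping one indicator removes.
    print(region, full, "sum =", sum(full), "model:", model_indicators(region))
```

A north-east hospital is represented by zeros in all three retained indicators, which is what makes it the reference region.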

To understand the meanings of the beta coefficients, consider each region separately:

  • For hospitals in region 1 (north-east), I2 = 0, I3 = 0, and I4 = 0, so

E(Y) = β0 + β1X1 + β2X2 + β3(0) + β4(0) + β5(0)
= β0 + β1X1 + β2X2.

  • For hospitals in region 2 (north-central), I2 = 1, I3 = 0, and I4 = 0, so

E(Y) = β0 + β1X1 + β2X2 + β3(1) + β4(0) + β5(0)
= β0 + β1X1 + β2X2 + β3.

  • For hospitals in region 3 (south), I2 = 0, I3 = 1, and I4 = 0, so

E(Y) = β0 + β1X1 + β2X2 + β3(0) + β4(1) + β5(0)
= β0 + β1X1 + β2X2 + β4.

  • For hospitals in region 4 (west), I2 = 0, I3 = 0, and I4 = 1, so

E(Y) = β0 + β1X1 + β2X2 + β3(0) + β4(0) + β5(1)
= β0 + β1X1 + β2X2 + β5.

A comparison of the four equations just given provides these interpretations of the coefficients that multiply indicator variables:

  • β3 = difference in mean infection risk for region 2 (north-central) versus region 1 (north-east), assuming the same values for stay (X1) and X-rays (X2).
  • β4 = difference in mean infection risk for region 3 (south) versus region 1 (north-east), assuming the same values for stay (X1) and X-rays (X2).
  • β5 = difference in mean infection risk for region 4 (west) versus region 1 (north-east), assuming the same values for stay (X1) and X-rays (X2).

Statistical software output for this model follows.

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.134259   0.877347  -2.433  0.01668 *  
Stay         0.505394   0.081455   6.205 1.11e-08 ***
Xray         0.017587   0.005649   3.113  0.00238 ** 
i2           0.171284   0.281475   0.609  0.54416    
i3           0.095461   0.288852   0.330  0.74169    
i4           1.057835   0.378077   2.798  0.00612 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.036 on 105 degrees of freedom
Multiple R-squared:  0.4198,	Adjusted R-squared:  0.3922 
F-statistic: 15.19 on 5 and 105 DF,  p-value: 3.243e-11

Some interpretations of results for individual variables are:

  • We have statistical significance for the sample coefficient that multiplies I4 (p-value = 0.006). This is the sample coefficient that estimates β5, so we have evidence of a difference in mean infection risk between hospitals in region 4 (west) and hospitals in region 1 (north-east) with the same values for stay and X-rays. The positive coefficient indicates that the estimated mean infection risk is higher in the west.
  • The non-significance of the coefficients multiplying I2 and I3 means that we have no evidence of a difference between mean infection risks in region 2 (north-central) versus region 1 (north-east), nor between region 3 (south) versus region 1 (north-east).
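As a rough check on these interpretations, the sketch below plugs the estimates from the output into the fitted equation. The stay and X-ray values are arbitrary illustrative inputs, not observations from the data.

```python
# Estimates and standard errors copied from the software output above.
coef = {"intercept": -2.134259, "stay": 0.505394, "xray": 0.017587,
        "i2": 0.171284, "i3": 0.095461, "i4": 1.057835}
std_err = {"i2": 0.281475, "i3": 0.288852, "i4": 0.378077}

def fitted_risk(stay, xray, region):
    """Fitted mean infection risk; region is 1 (north-east) through 4 (west)."""
    value = coef["intercept"] + coef["stay"] * stay + coef["xray"] * xray
    if region in (2, 3, 4):
        value += coef[f"i{region}"]
    return value

stay, xray = 10.0, 80.0  # arbitrary illustrative inputs
baseline = fitted_risk(stay, xray, 1)
for region in (2, 3, 4):
    gap = fitted_risk(stay, xray, region) - baseline
    t_ratio = coef[f"i{region}"] / std_err[f"i{region}"]
    print(f"region {region} vs region 1: gap = {gap:.3f}, t = {t_ratio:.3f}")
```

Whatever stay and X-ray values are chosen, each gap equals the corresponding indicator coefficient, and each t value in the output is that coefficient divided by its standard error.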

Next, the finding of a difference between mean infection risks in the north-east and west seems to be strong, but for the sake of example, we’ll now consider an overall test of regional differences. There is, in fact, an argument for doing so beyond “for the sake of example.” To assess regional differences, we considered three significance tests (for the three indicator variables). When we carry out multiple inferences, the overall error rate is increased so we may be concerned about a “fluke” result for one of the comparisons. If there are no regional differences, we would not have any indicator variables for regions in the model.

  • The null hypothesis that makes this happen is H0 : β3 = β4 = β5 = 0.
  • The reduced model is simply E(Y) = β0 + β1X1 + β2X2. This model has SSE = 123.56 with error df = 108.
  • The full model is E(Y) = β0 + β1X1 + β2X2 + β3I2 + β4I3 + β5I4, the model that we have already estimated. This model has SSE = 112.71 with error df = 105.

The test statistic for H0 : β3 = β4 = β5 = 0 is the general linear F-statistic calculated as

\[F=\frac{\frac{\text{SSE(reduced) - SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\frac{\frac{123.56-112.71}{108-105}}{\frac{112.71}{105}}=3.369.\]

The degrees of freedom for this F-statistic are 3 and 105. The probability of getting an F-statistic as extreme or more extreme than 3.369 under an F distribution with 3 and 105 degrees of freedom is 0.021 (i.e., the p-value). At the 0.05 significance level, we reject the null hypothesis and conclude that at least one of β3, β4, and β5 is not 0. Our previous look at the tests for individual coefficients showed us that it is β5 (measuring the difference between west and north-east) that we conclude is different from 0.
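The calculation can be sketched with the Python standard library alone. The p-value here is obtained by numerically integrating the F density with Simpson's rule, which is a swapped-in approximation chosen to avoid external dependencies, not the method the statistical software uses; the helper names are illustrative.

```python
import math

def general_linear_f(sse_reduced, df_reduced, sse_full, df_full):
    """General linear F-statistic comparing a reduced model to a full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    return numerator / (sse_full / df_full)

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) df; this sketch assumes d1 >= 2."""
    if x == 0.0:
        return 1.0 if d1 == 2 else 0.0
    log_pdf = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
               + (d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log1p(d1 * x / d2))
    return math.exp(log_pdf)

def f_upper_tail(f_value, d1, d2, steps=2000):
    """Approximate P(F >= f_value) via Simpson's rule on [0, f_value]; steps must be even."""
    h = f_value / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f_value, d1, d2)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f_pdf(k * h, d1, d2)
    return 1.0 - total * h / 3

# SSE and error df values reported above for the reduced and full models.
f_stat = general_linear_f(123.56, 108, 112.71, 105)
p_value = f_upper_tail(f_stat, 108 - 105, 105)
print(f"F = {f_stat:.3f}, approximate p-value = {p_value:.3f}")
```

The result should agree with the reported F = 3.369 and p-value of about 0.021, up to rounding in the SSE values and the integration error.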

Finally, the results seem to indicate that the west is the only region that differs, with a higher mean infection risk than the other three regions. (If the north-central and south regions don’t differ from the north-east, it is reasonable to think that they don’t differ from each other either.) We can test this by considering a reduced model in which the only region indicator is I4 = 1 if west, and 0 otherwise. The model is

E(Y) = β0 + β1X1 + β2X2 + β5I4.

The null hypothesis leading to this reduced model is H0 : β3 = β4 = 0. This model has SSE = 113.11 with error df = 107.

The full model is still

E(Y) = β0 + β1X1 + β2X2 + β3I2 + β4I3 + β5I4,

which has SSE = 112.71 with error df = 105. Finally,

\[F=\frac{\frac{\text{SSE(reduced) - SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\frac{\frac{113.11-112.71}{107-105}}{\frac{112.71}{105}}=0.186.\]

The degrees of freedom for this F-statistic are 2 and 105. The probability of getting an F-statistic as extreme or more extreme than 0.186 under an F distribution with 2 and 105 degrees of freedom is 0.831 (i.e., the p-value). Thus, we cannot reject the null hypothesis, and the simpler model in which only the west differs from the other three regions seems reasonable.
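This second test can be reproduced with a short sketch. It relies on the fact that, when the numerator has exactly 2 degrees of freedom, the upper tail of the F distribution has a simple closed form; that shortcut does not apply to other numerator degrees of freedom.

```python
# SSE and error df values reported above for the reduced and full models.
sse_reduced, df_reduced = 113.11, 107
sse_full, df_full = 112.71, 105

numerator_df = df_reduced - df_full
f_stat = ((sse_reduced - sse_full) / numerator_df) / (sse_full / df_full)

# Closed form valid only for 2 numerator df:
# P(F >= f) = (1 + 2 f / d2) ** (-d2 / 2), where d2 is the denominator df.
if numerator_df != 2:
    raise ValueError("closed-form tail probability requires 2 numerator df")
p_value = (1 + 2 * f_stat / df_full) ** (-df_full / 2)
print(f"F = {f_stat:.3f}, p-value = {p_value:.3f}")
```

The output should be close to the reported F = 0.186 and p-value = 0.831; a small difference in the third decimal can arise because the SSE values above are rounded.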