8.8 - Further Categorical Predictor Examples

Example 1: Muscle Mass Data

Suppose that we describe y = muscle mass as a function of x₁ = age and x₂ = gender for people in the 40 to 60 year-old age group. We could code the gender variable as x₂ = 1 if the subject is female and x₂ = 0 if the subject is male.

Consider the multiple regression equation

E(Y) = β₀ + β₁x₁ + β₂x₂.

The usual slope interpretation works for β₂, the coefficient that multiplies the gender indicator. Increasing the indicator by one unit simply moves us from male to female. Thus, β₂ is the difference in mean muscle mass between females and males of the same age (female minus male).
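The interpretation of β₂ can be checked numerically. The sketch below uses hypothetical coefficient values (they are not estimates from any dataset) and shows that, at any fixed age, the female-minus-male difference in the mean response equals β₂.

```python
# Hypothetical coefficients, chosen only to illustrate the interpretation of beta2.
beta0, beta1, beta2 = 110.0, -0.8, -20.0


def mean_muscle_mass(age, female):
    """E(Y) = beta0 + beta1*age + beta2*female, with female coded 1 and male coded 0."""
    return beta0 + beta1 * age + beta2 * female


# At each fixed age, compute the female-minus-male difference in the mean.
gaps = [mean_muscle_mass(age, 1) - mean_muscle_mass(age, 0) for age in (40, 50, 60)]
print(gaps)  # every entry equals beta2, up to floating-point rounding
```

Because the model has no interaction term, the difference does not depend on the age at which it is evaluated.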

Example 2: Real Estate Air Conditioning

Consider the real estate dataset: realestate.txt. Let us define

Y = sale price of home
X₁ = square footage of home
X₂ = whether home has air conditioning or not.

To put the air conditioning variable into a model, create a variable coded as 1 or 0 to represent the presence or absence of air conditioning, respectively. With a 1, 0 coding for air conditioning and the model

\[y_i=\beta_0+\beta_1 x_{i,1}+\beta_2 x_{i,2}+\epsilon_i,\]

the beta coefficient that multiplies the air conditioning variable will estimate the difference in average sale prices of homes that have air conditioning versus homes that do not, given that the homes have the same square foot area.

Suppose we think that the effect of air conditioning (yes or no) depends upon the size of the home. In other words, suppose that there is interaction between the x-variables. To put an interaction into a model, we multiply the variables involved. The model here is

\[y_i=\beta_0+\beta_1 x_{i,1}+\beta_2 x_{i,2}+\beta_3 x_{i,1}x_{i,2}+\epsilon_i.\]

The data are from n = 521 homes. Statistical software output follows. Notice that there is a statistically significant result for the interaction term.

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -3.218     30.085  -0.107 0.914871    
SqFeet       104.902     15.748   6.661 6.96e-11 ***
Air          -78.868     32.663  -2.415 0.016100 *  
SqFeet.Air    55.888     16.580   3.371 0.000805 ***

Software note: We would calculate a new variable by multiplying the square feet size and air conditioning variables. That variable would then be used as a predictor variable, along with the original x-variables.
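A minimal sketch of the step described in the software note, using plain Python lists. The four rows are hypothetical illustration values, not rows from realestate.txt, and the variable names are assumptions chosen to mirror the labels in the output.

```python
# Hypothetical rows for illustration only; not taken from realestate.txt.
sqfeet = [1.5, 2.0, 2.5, 3.0]
air = [1, 0, 1, 0]  # 1 = air conditioning present, 0 = absent

# The interaction predictor is the elementwise product of the two original predictors.
sqfeet_air = [s * a for s, a in zip(sqfeet, air)]

# Each design-matrix row: intercept, SqFeet, Air, SqFeet.Air
design = [[1.0, s, a, sa] for s, a, sa in zip(sqfeet, air, sqfeet_air)]
for row in design:
    print(row)
```

The product column equals the square footage for homes with air conditioning and 0 for homes without it.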

The regression equation is:

Average SalePrice = −3.218 + 104.902 × SqFeet − 78.868 × Air + 55.888 × SqFeet × Air.

Suppose that a home has air conditioning. That means the variable Air = 1, so we’ll substitute Air = 1 in both places that Air occurs in the estimated model. This gives us

Average SalePrice = −3.218 + 104.902 × SqFeet − 78.868(1) + 55.888 × SqFeet × 1
= −82.086 + 160.790 × SqFeet.

Suppose that a home does not have air conditioning. That means the variable Air = 0, so we’ll substitute Air = 0 in both places that Air occurs in the estimated model. This gives us

Average SalePrice = −3.218 + 104.902 × SqFeet − 78.868(0) + 55.888 × SqFeet × 0
= −3.218 + 104.902 × SqFeet.
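The two substitutions can be reproduced with a few lines of arithmetic. This sketch copies the four estimated coefficients from the software output and returns the intercept and slope of the line for each value of Air. The function name is only illustrative.

```python
# Estimated coefficients copied from the software output above.
b0, b_sqfeet, b_air, b_inter = -3.218, 104.902, -78.868, 55.888


def line_for(air):
    """Return (intercept, slope) of the sale-price-versus-square-footage line for Air = 0 or 1."""
    intercept = b0 + b_air * air
    slope = b_sqfeet + b_inter * air
    return intercept, slope


with_ac = line_for(1)
without_ac = line_for(0)
print("Air = 1:", round(with_ac[0], 3), round(with_ac[1], 3))
print("Air = 0:", round(without_ac[0], 3), round(without_ac[1], 3))
```

The interaction coefficient changes the slope, and the air conditioning coefficient changes the intercept.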

The figure below graphs the relationship between sale price and square foot area for homes with and without air conditioning. The equations of the two lines are the ones we just derived. The difference between the two lines increases as the square foot area increases, which means that the difference in average sale price between homes with and without air conditioning grows as the size of the home increases.

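[Figure: sale price versus square foot area, with fitted lines for homes with and without air conditioning]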
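The widening gap between the two lines can also be seen directly from the coefficients: subtracting the no-air-conditioning line from the air-conditioning line leaves −78.868 + 55.888 × SqFeet. The sketch below evaluates that difference at a few hypothetical square footage values, in whatever units the dataset uses.

```python
# Coefficients for Air and the interaction term, copied from the software output.
b_air, b_inter = -78.868, 55.888


def ac_gap(sqfeet):
    """Estimated mean sale price with air conditioning minus without, at a given square footage."""
    return b_air + b_inter * sqfeet


# Hypothetical square footage values, chosen only for illustration.
for s in (1.0, 2.0, 3.0):
    print(s, round(ac_gap(s), 3))
```

Each one-unit increase in square footage raises the estimated gap by the interaction coefficient, 55.888.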

There is an increasing variance problem apparent in the above plot, which is even more obvious from the megaphone pattern in the following residual plot:

[Figure: residual plot for the model fit to the untransformed data]

To remedy this, we'll try using log transformations for sale price and square footage (which are quite highly skewed). Now, Y = log(sale price), X₁ = log(home’s square foot area), and X₂ = 1 if air conditioning present and 0 if not. After fitting the model:

\[y_i=\beta_0+\beta_1 x_{i,1}+\beta_2 x_{i,2}+\beta_3 x_{i,1}x_{i,2}+\epsilon_i,\]

the plot showing the regression lines is as follows:

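[Figure: log(sale price) versus log(square foot area), with fitted regression lines for homes with and without air conditioning]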

and the residual plot, which shows a vast improvement, is as follows:

[Figure: residual plot for the model fit to the log-transformed data]

Example 3: Hospital Infection Risk Data

Consider the hospital infection risk data: infectionrisk.txt. For this example, the data are limited to observations with average length of stay ≤ 14 days. The overall sample size is n = 111. The variables we will analyze are the following:

Y = infection risk in hospital
X₁ = average length of patient’s stay (in days)
X₂ = a measure of frequency of giving X-rays
X₃ = the U.S. region in which the hospital is located (one of four: north-east, north-central, south, west).

The focus of the analysis will be on regional differences. Region is a categorical variable so we must use indicator variables to incorporate region information into the model. There are four regions. The full set of indicator variables for the four regions is as follows:

I₁ = 1 if hospital is in region 1 (north-east) and 0 if not
I₂ = 1 if hospital is in region 2 (north-central) and 0 if not
I₃ = 1 if hospital is in region 3 (south) and 0 if not
I₄ = 1 if hospital is in region 4 (west) and 0 if not.

To avoid a linear dependency in the X matrix, we leave out one of these indicators when forming the model. Using all but the first indicator to describe regional differences (so that "north-east" is the reference region), a possible multiple regression model for E(Y), the mean infection risk, is:

E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃I₂ + β₄I₃ + β₅I₄.
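The indicator coding with north-east as the reference region can be sketched as follows. The function and constant names are illustrative assumptions. The final check shows why one indicator must be dropped: with all four indicators, each row sums to 1, which duplicates the intercept column.

```python
REGIONS = ("north-east", "north-central", "south", "west")


def region_indicators(region):
    """Return (I2, I3, I4) for a region, with north-east as the reference region."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return tuple(int(region == r) for r in REGIONS[1:])


for r in REGIONS:
    print(r, region_indicators(r))

# With the full set (I1, I2, I3, I4), every row sums to 1, the same as the intercept column.
full_set = [tuple(int(r == s) for s in REGIONS) for r in REGIONS]
assert all(sum(row) == 1 for row in full_set)
```

A hospital in the reference region is coded (0, 0, 0), so its mean is described by the intercept and the two quantitative predictors alone.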

To understand the meanings of the beta coefficients, consider each region separately:

For hospitals in region 1 (north-east), I₂ = 0, I₃ = 0, and I₄ = 0, so

E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃(0) + β₄(0) + β₅(0)
= β₀ + β₁X₁ + β₂X₂.

For hospitals in region 2 (north-central), I₂ = 1, I₃ = 0, and I₄ = 0, so

E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃(1) + β₄(0) + β₅(0)
= β₀ + β₁X₁ + β₂X₂ + β₃.

For hospitals in region 3 (south), I₂ = 0, I₃ = 1, and I₄ = 0, so

E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃(0) + β₄(1) + β₅(0)
= β₀ + β₁X₁ + β₂X₂ + β₄.

For hospitals in region 4 (west), I₂ = 0, I₃ = 0, and I₄ = 1, so

E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃(0) + β₄(0) + β₅(1)
= β₀ + β₁X₁ + β₂X₂ + β₅.

A comparison of the four equations just given provides these interpretations of the coefficients that multiply indicator variables:

β₃ = difference in mean infection risk for region 2 (north-central) versus region 1 (north-east), assuming the same values for stay (X₁) and X-rays (X₂).
β₄ = difference in mean infection risk for region 3 (south) versus region 1 (north-east), assuming the same values for stay (X₁) and X-rays (X₂).
β₅ = difference in mean infection risk for region 4 (west) versus region 1 (north-east), assuming the same values for stay (X₁) and X-rays (X₂).

Statistical software output for this model follows:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.134259   0.877347  -2.433  0.01668 *  
Stay         0.505394   0.081455   6.205 1.11e-08 ***
Xray         0.017587   0.005649   3.113  0.00238 ** 
i2           0.171284   0.281475   0.609  0.54416    
i3           0.095461   0.288852   0.330  0.74169    
i4           1.057835   0.378077   2.798  0.00612 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.036 on 105 degrees of freedom
Multiple R-squared:  0.4198,	Adjusted R-squared:  0.3922 
F-statistic: 15.19 on 5 and 105 DF,  p-value: 3.243e-11
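As a reading aid for the output, the sketch below recomputes each t value as the estimate divided by its standard error, and the error degrees of freedom as the sample size minus the number of estimated coefficients. The numbers are copied from the output; the dictionary layout is just one convenient way to hold them.

```python
# (estimate, standard error) pairs copied from the output above.
coefs = {
    "(Intercept)": (-2.134259, 0.877347),
    "Stay": (0.505394, 0.081455),
    "Xray": (0.017587, 0.005649),
    "i2": (0.171284, 0.281475),
    "i3": (0.095461, 0.288852),
    "i4": (1.057835, 0.378077),
}

# Each t value is the estimate divided by its standard error.
t_values = {name: est / se for name, (est, se) in coefs.items()}
for name, t in t_values.items():
    print(f"{name:12s} {t:7.3f}")

# Error degrees of freedom: n minus the number of estimated coefficients.
n = 111
error_df = n - len(coefs)
print("error df:", error_df)
```

The recomputed values agree with the "t value" column and with the 105 error degrees of freedom reported beneath the coefficient table.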

Some interpretations of results for individual variables are:

The sample coefficient that multiplies I₄ is statistically significant (p-value = 0.006). This sample coefficient estimates β₅, so we have evidence of a difference in mean infection risk between hospitals in region 4 (west) and hospitals in region 1 (north-east), for the same values of stay and X-rays. The positive coefficient indicates that the estimated infection risk is higher in the west.
The coefficients multiplying I₂ and I₃ are not statistically significant, so we have no evidence of a difference in mean infection risk between region 2 (north-central) and region 1 (north-east), nor between region 3 (south) and region 1 (north-east).
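To make these interpretations concrete, the sketch below evaluates the estimated regression equation for each region at the same values of stay and X-rays. The coefficients come from the output; the stay and X-ray values are hypothetical and chosen only for illustration.

```python
# Estimated coefficients copied from the output above.
b0, b_stay, b_xray = -2.134259, 0.505394, 0.017587
region_shift = {
    "north-east": 0.0,         # reference region: no indicator is switched on
    "north-central": 0.171284,
    "south": 0.095461,
    "west": 1.057835,
}


def estimated_mean_risk(stay, xray, region):
    """Estimated mean infection risk for a hospital with the given stay, X-ray value, and region."""
    return b0 + b_stay * stay + b_xray * xray + region_shift[region]


# Hypothetical predictor values, held fixed across regions.
stay, xray = 10.0, 80.0
reference = estimated_mean_risk(stay, xray, "north-east")
for region in region_shift:
    estimate = estimated_mean_risk(stay, xray, region)
    print(f"{region:14s} {estimate:.3f}  (vs north-east: {estimate - reference:+.3f})")
```

Whatever values of stay and X-rays are used, the difference between a region and the north-east is that region's indicator coefficient.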

Next, the finding of a difference between mean infection risks in the north-east and west seems to be strong, but for the sake of example, we’ll now consider an overall test of regional differences. There is, in fact, an argument for doing so beyond “for the sake of example.” To assess regional differences, we considered three significance tests (for the three indicator variables). When we carry out multiple inferences, the overall error rate is increased so we may be concerned about a “fluke” result for one of the comparisons. If there are no regional differences, we would not have any indicator variables for regions in the model.

The null hypothesis that makes this happen is H₀: β₃ = β₄ = β₅ = 0.
The reduced model is simply E(Y) = β₀ + β₁X₁ + β₂X₂. This model has SSE = 123.56 with error df = 108.
The full model is E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃I₂ + β₄I₃ + β₅I₄, the model that we have already estimated. This model has SSE = 112.71 with error df = 105.

The test statistic for H₀: β₃ = β₄ = β₅ = 0 is the general linear F-statistic calculated as

\[F=\frac{\frac{\text{SSE(reduced) - SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\frac{\frac{123.56-112.71}{108-105}}{\frac{112.71}{105}}=3.369.\]

The degrees of freedom for this F-statistic are 3 and 105. The p-value, the probability of getting an F-statistic at least as extreme as 3.369 under an F distribution with 3 and 105 degrees of freedom, is 0.021. We reject the null hypothesis and conclude that at least one of β₃, β₄, and β₅ is not 0. Our previous look at the tests for individual coefficients suggests that it is β₅ (measuring the difference between west and north-east) that differs from 0.
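The general linear F-statistic is simple enough to compute by hand or with a small helper. The sketch below is one way to write it; the function name is an assumption, and the inputs are the SSE and error df values quoted for the reduced and full models. Converting the statistic to a p-value needs an F-distribution routine from statistical software, which is not shown here.

```python
def general_linear_f(sse_reduced, df_reduced, sse_full, df_full):
    """General linear F-statistic for comparing a reduced model with a full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    mse_full = sse_full / df_full
    return numerator / mse_full


# Test of H0: beta3 = beta4 = beta5 = 0 (no regional differences).
f_regions = general_linear_f(123.56, 108, 112.71, 105)
print(round(f_regions, 3))
```

The numerator degrees of freedom, 108 − 105 = 3, equals the number of coefficients set to 0 under the null hypothesis.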

Finally, the results seem to indicate that the west is the only region that differs, with a higher infection risk than the other three regions. (If the north-central and south regions don’t differ from the north-east, it is reasonable to think that they don’t differ from each other as well.) We can test this by considering a reduced model in which the only region indicator is I₄ = 1 if west, and 0 otherwise. The model is

E(Y) = β₀ + β₁X₁ + β₂X₂ + β₅I₄.

The null hypothesis leading to this reduced model is H₀: β₃ = β₄ = 0. This model has SSE = 113.11 with error df = 107.

The full model is still

E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃I₂ + β₄I₃ + β₅I₄,

which has SSE = 112.71 with error df = 105. Finally,

\[F=\frac{\frac{\text{SSE(reduced) - SSE(full)}}{\text{error df for reduced - error df for full}}}{\text{MSE(full)}}=\frac{\frac{113.11-112.71}{107-105}}{\frac{112.71}{105}}=0.186.\]

The degrees of freedom for this F-statistic are 2 and 105. The p-value, the probability of getting an F-statistic at least as extreme as 0.186 under an F distribution with 2 and 105 degrees of freedom, is 0.831. Thus, we cannot reject the null hypothesis, and the reduced model, in which only the west differs from the other three regions, seems reasonable.
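When the numerator degrees of freedom equal 2, the upper-tail probability of the F distribution has a closed form, P(F > f) = (1 + 2f/d)^(−d/2), where d is the denominator degrees of freedom. The sketch below uses that special case to approximate the p-value without a statistics library. This is not necessarily how the software behind the reported value computed it, and the formula does not apply to other numerator degrees of freedom.

```python
def f_upper_tail_df1_is_2(f_value, df_denominator):
    """P(F > f_value) for an F distribution with 2 and df_denominator degrees of freedom.

    Valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_value / df_denominator) ** (-df_denominator / 2.0)


# F-statistic for H0: beta3 = beta4 = 0, from the SSE and error df values quoted above.
f_west_only = ((113.11 - 112.71) / (107 - 105)) / (112.71 / 105)
p_value = f_upper_tail_df1_is_2(f_west_only, 105)
print(round(f_west_only, 3), round(p_value, 2))
```

The result is close to the reported p-value of 0.831. A small discrepancy in the third decimal place is to be expected because the SSE values used here are rounded to two decimals.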
