8.2.2 - Example: Housing Satisfaction in R

Example: Housing Data (using R) Section

Apartment building

For R, there are a number of different packages and functions we can use. For example, vglm() from VGAM package, or multinom() from nnet package, or mlogit() from globaltest package from BIOCONDUCTOR; see the links at the first page of these lecture notes and the later examples. We will focus on vglm().

The data in "housing.dat" gives information on housing conditions from a survey in Copenhagen. Variables are satisfaction (low, medium, high), perceived influence on management (low, medium, high), type of housing (tower, atrium, apartment, terrace), contact with other residents (low, high), and the frequency count of individuals in that row.

The vglm function expects the response categories \(1,2,\ldots,r\) to appear in separate columns, with rows corresponding to predictor combinations, much like the glm function. When the data are stacked as they are in this example, we can use the "spread" function to separate them. The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the log-likelihood; as such, we can omit them if we wish.

library(MASS)
library(tidyr)
data(housing)
housing = spread(housing, Sat, Freq)

The result is below. Note also that there are 24 rows corresponding to the unique combinations of the predictors.

> housing
     Infl      Type Cont Low Medium High
1     Low     Tower  Low  21     21   28
2     Low     Tower High  14     19   37
3     Low Apartment  Low  61     23   17
4     Low Apartment High  78     46   43
5     Low    Atrium  Low  13      9   10
6     Low    Atrium High  20     23   20
7     Low   Terrace  Low  18      6    7
8     Low   Terrace High  57     23   13
9  Medium     Tower  Low  34     22   36
10 Medium     Tower High  17     23   40
11 Medium Apartment  Low  43     35   40
12 Medium Apartment High  48     45   86
13 Medium    Atrium  Low   8      8   12
14 Medium    Atrium High  10     22   24
15 Medium   Terrace  Low  15     13   13
16 Medium   Terrace High  31     21   13
17   High     Tower  Low  10     11   36
18   High     Tower High   3      5   23
19   High Apartment  Low  26     18   54
20   High Apartment High  15     25   62
21   High    Atrium  Low   6      7    9
22   High    Atrium High   7     10   21
23   High   Terrace  Low   7      5   11
24   High   Terrace High   5      6   13

Let's describe these data by a baseline-category model, with Satisfaction as the outcome (baseline chosen as "medium") and other variables are predictors. Note that with this model, we treat the satisfaction levels as nominal; we'll revisit this data set with satisfaction treated as ordinal in the next section.

We let \(\pi_1\) = probability of low satisfaction, \(\pi_2\) = probability of high satisfaction, and \(\pi_3\) = probability of medium satisfaction so that the equations are

\(\log\left(\dfrac{\pi_j}{\pi_3}\right)=\beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\cdots\)

for \(j = 1,2\). The \(x\)'s represent the predictor terms, including any interactions we wish to fit as well. Note that these predictor terms do not depend on \(j\). Also, treating the predictors as nominal requires

  • two indicators for perceived influence on management
  • three indicators for type of housing
  • one indicator for contact with other residents

With first-order terms and all two-way interactions, this gives us a total of \((3-1)(1+6+11)=36\) parameters (1 intercept, 6 first-order terms, and 11 i


  Specifying the Model

The syntax of the vglm function is very similar to that of the glm function, but note that the last of the response categories is taken as the baseline by default. If we want to use "Medium" as the baseline, we need to place that last. Also, note the family is "multinomial".

library(tidyr)
library(VGAM)
library(MASS)

data(housing)
# see also https://rdrr.io/cran/MASS/man/housing.html
# baseline category logit model for nominal response

# unstacking the satisfaction response
housing = spread(housing, Sat, Freq)

# default baseline in R is last category
# all two-way interactions
fit.full = vglm(cbind(Low,High,Medium) ~ (Infl+Type+Cont)^2, family=multinomial, data=housing)
summary(fit.full)

# deviance test for lack of fit
g2 = deviance(fit.full)
df = df.residual(fit.full)
1 - pchisq(g2, df)

# pearson test for lack of fit
e = residuals(fit.full, type='pearson')
x2 = sum(e^2)
1 - pchisq(x2, df)

# significance tests based on LR statistics
anova(fit.full, type=3, test="LR")

# additive model
fit.add = vglm(cbind(Low,High,Medium) ~ Infl+Type+Cont, family=multinomial, data=housing)
summary(fit.add)
deviance(fit.add)
exp(coef(fit.add))
exp(confint(fit.add))
anova(fit.add, type=3, test="LR")
e.add = residuals(fit.add, type='pearson')
x2.add = sum(e.add^2)
df.add = df.residual(fit.add)
1 - pchisq(x2.add, df.add)

# type and cont only
fit.tc = vglm(cbind(Low,High,Medium) ~ Type+Cont, family=multinomial, data=housing)
deviance(fit.tc)

# infl and cont only
fit.ic = vglm(cbind(Low,High,Medium) ~ Infl+Cont, family=multinomial, data=housing)
deviance(fit.ic)

# infl and type only
fit.it = vglm(cbind(Low,High,Medium) ~ Infl+Type, family=multinomial, data=housing)
deviance(fit.it)

# intercept only
fit0 = vglm(cbind(Low,High,Medium) ~ 1, family=multinomial, data=housing)
deviance(fit0)

# stepwise algorithm using AIC
step4vglm(fit.full, direction='backward')

With \(N = 24\) profiles (unique combinations of the predictors) for each of the two baseline logit equations, the saturated model has \(24\cdot2 = 48\) parameters. Compared with the current model, which has 36 parameters to include up to two-way interactions only, this gives 12 degrees of freedom for the goodness of fit test.

> # deviance test for lack of fit
g2 = deviance(fit.full)
df = df.residual(fit.full)
1 - pchisq(g2, df)

# pearson test for lack of fit
e = residuals(fit.full, type='pearson')
x2 = sum(e^2)
1 - pchisq(x2, df)
 Stop and Think!
Does this model fit well?
The model fits well because both the residual deviance and Pearson test statistics are small, relative to their chi-square distributions; both p-values are large and indicate very little evidence of lack of fit.

Output pertaining to the significance of the predictors:

> anova(fit.full, type=3, test="LR")

          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
Infl       4  13.9044        16     19.849 0.0076066 ** 
Type       6  26.2749        18     32.219 0.0001979 ***
Cont       2   5.8487        14     11.793 0.0536990 .  
Infl:Type 12  21.0581        24     27.002 0.0495363 *  
Infl:Cont  4   0.8841        16      6.828 0.9268361    
Type:Cont  6   9.2898        18     15.234 0.1579255 

These results are based on the likelihood-ratio tests and are very similar to SAS's Wald Type 3 statistics. They show the change in fit resulting from discarding any one of the predictors—influence, type, or contact—while keeping the others in the model. For example, the test for influence is equivalent to setting its two indicator coefficients equal to zero in each of the two logit equations; so the test for significance of influence has \(2\cdot2=4\) degrees of freedom. Judging from these tests, we see that

  • Perceived influence on management depends on the type of housing to a degree (interaction), but it's only borderline significant, given that other terms are already in the model. Other interaction terms are relatively weak.
  • Both influence and housing type are strongly related to satisfaction, while contact is borderline insignificant.

This suggests that we may get a more efficient model by removing one or more interaction terms. Below are the goodness of fit statistics for the additive model (no interaction terms). This seems to fit reasonably well and has considerably fewer parameters: \((3-1)(1+6)=14\). We'll retain this additive model and interpret the parameter estimates next.

  Interpreting Parameter Estimates

Let's look at the estimated coefficients of the current model that contains only the main effects:

 >summary(fit.add)

Call:
vglm(formula = cbind(Low, High, Medium) ~ Infl + Type + Cont, 
    family = multinomial, data = housing)

Coefficients: 
                Estimate Std. Error z value Pr(>|z|)    
(Intercept):1     0.4192     0.1729   2.424 0.015342 *  
(Intercept):2     0.2805     0.1662   1.687 0.091525 .  
InflMedium:1     -0.4464     0.1416  -3.153 0.001613 ** 
InflMedium:2      0.2885     0.1448   1.993 0.046306 *  
InflHigh:1       -0.6649     0.1863  -3.568 0.000359 ***
InflHigh:2        0.9477     0.1681   5.639 1.71e-08 ***
TypeApartment:1   0.4357     0.1725   2.525 0.011562 *  
TypeApartment:2  -0.2999     0.1563  -1.919 0.054955 .  
TypeAtrium:1     -0.1314     0.2231  -0.589 0.555980    
TypeAtrium:2     -0.5393     0.1996  -2.702 0.006883 ** 
TypeTerrace:1     0.6666     0.2063   3.232 0.001230 ** 
TypeTerrace:2    -0.7458     0.2105  -3.543 0.000396 ***
ContHigh:1       -0.3609     0.1324  -2.726 0.006420 ** 
ContHigh:2        0.1210     0.1293   0.936 0.349522    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Residual deviance: 38.6622 on 34 degrees of freedom
Log-likelihood: -118.8993 on 34 degrees of freedom  

How do we interpret this? If we jump to toward the lower part of the output, we see that that there are two logit models (i.e., linear predictors) being fitted and that this is a baseline logit model with the last category as the baseline level; this is "Medium" since we specified the response as cbind(Low,High,Medium) in the function call.

Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])

The two logit equations predict the log-odds of

  • low satisfaction versus medium (1 vs 3)
  • high satisfaction versus medium (2 vs 3)

This baseline reference for the response is in addition to any baselines involved in the predictors. For the two indicators used for perceived influence, the baseline corresponds to "low". So, from the output, the estimated coefficient for high influence is interpreted as follows: for those with high perceived influence on management, the odds of high satisfaction (versus medium) is estimated to be \(e^{.9477}=2.58\) times as high as that for those with low influence. As with other glms we've dealt with to this point, we may change the baselines arbitrarily, which changes the interpretations of the numerical values reported but not the significance of the predictor (all levels together) or the fit of the model as a whole.

The intercepts give the estimated log-odds for the baseline groups. For example, the estimated log-odds of high satisfaction (versus medium) is \(e^{.2805}=1.32\) for birds versus fish in this group is −2.4633 for those with low perceived influence, housing type tower, and low contact with other residents.

To get odds ratios, we exponentiate the fitted model coefficients. Note the (rounded) value 2.58 for InflHigh : 2 corresponds to the odds ratio we found above.

> exp(coef(fit.add))
  (Intercept):1   (Intercept):2 
      1.5207882       1.3237730 
   InflMedium:1    InflMedium:2 
      0.6399304       1.3343807 
     InflHigh:1      InflHigh:2 
      0.5143068       2.5797584 
TypeApartment:1 TypeApartment:2 
      1.5460274       0.7408604 
   TypeAtrium:1    TypeAtrium:2 
      0.8768930       0.5831281 
  TypeTerrace:1   TypeTerrace:2 
      1.9475467       0.4743750 
     ContHigh:1      ContHigh:2 
      0.6970822       1.1285968 

And to add 95% confidence limits, we can run

> exp(confint(fit.add))
                    2.5 %    97.5 %
(Intercept):1   1.0835940 2.1343757
(Intercept):2   0.9557058 1.8335924
InflMedium:1    0.4848852 0.8445523
InflMedium:2    1.0047350 1.7721807
InflHigh:1      0.3569534 0.7410253
InflHigh:2      1.8558064 3.5861247
TypeApartment:1 1.1024451 2.1680907
TypeApartment:2 0.5453913 1.0063859
TypeAtrium:1    0.5662886 1.3578612
TypeAtrium:2    0.3943523 0.8622705
TypeTerrace:1   1.2999442 2.9177698
TypeTerrace:2   0.3140002 0.7166607
ContHigh:1      0.5377582 0.9036100
ContHigh:2      0.8759248 1.4541555

The likelihood-ratio tests for main effects for the additive model can be found with the anova function:

> anova(fit.add, type=3, test="LR")

     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
Infl  4  109.117        38    147.780 < 2.2e-16 ***
Type  6   62.227        40    100.889 1.586e-11 ***
Cont  2   16.060        36     54.722 0.0003256 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Although the above output shows all three predictor main effects are significantly related to satisfaction, we may consider the same model selection techniques here as we used earlier for binary logistic regression.

Model Selection

In the main effects model, the one we fitted above, the LR statistics reported indicate that each of the three predictors is highly significant, given the others. For example, the current (additive) model has deviance \(G^2=38.6622\) with 34 degrees of freedom versus the saturated model as we noted above. If we remove the influence indicators, the deviance increases to \(G^2=147.7797\) with 38 degrees of freedom, and the LRT statistic for comparing these two models would be

\(147.7797 - 38.6622=109.1175\)

which can be compared against a chi-square distribution with \(38-34=4\) degrees of freedom (p-value approximately 0). So, the effect of influence is considered (highly) significant. Similarly, we can compare any two models in this way, provided one is a special case of the other. Repeating the model-fitting for various sets of predictors, we obtain the following analysis-of-deviance table:

Model \(G^2\) df
Saturated 0.00 0
Infl + Type + Cont 38.6622 34
Type + Cont 147.7797 38
Infl + Cont 100.8891 40
Infl + Type 54.7219 36
... ... ...
Null 217.4560 46

We can refine the model by removing terms if doing so does not significantly reduce the fit. 

To compare models that are not of the form full versus reduced, we can use an information criterion, such as AIC. For example, to compare the models with only Type and Cont against the model with only Infl and Cont, the AIC values would be 366.9161 and 316.0256, respectively, in favor of the latter (lower AIC). Note that these differ from the values reported by SAS because of how the software packages differently handle constants in the likelihoods; their ordering is the same, however.

Finally, an adjustment for overdispersion could also be implemented as it was with binary logistic regression. The assumption is that the multinomial variance assumed in the model is off by a scale parameter, which we can estimate from the Pearson goodness of fit test. For the additive model, this would be

\(\dfrac{X^2}{df}=\dfrac{38.9104}{34}=1.144\)

Since this is so close to 1, we would not opt to include such an adjustment in this case, but if we did, the effect would be to multiply all variance and standard error estimates by \(1.144\) and \(\sqrt{1.144}\), respectively.

> e.add = residuals(fit.add, type='pearson')
> x2.add = sum(e.add^2)
> df.add = df.residual(fit.add)
> x2.add/df.add
[1] 1.144424