8.2 - Baseline-Category Logit Model
8.2 - Baseline-Category Logit ModelGoal:
Give a simultaneous representation (summary) of the odds of being in one category relative to being in a designated category, called the baseline category, for all pairs of categories. This is an extension of binary logistic regression model, where we will consider \(r − 1\) non-redundant logits.
Variables:
Suppose that a response variable \(Y\),
\(y_i=(y_{i1},y_{i2},\ldots,y_{ir})^T \),
has a multinomial distribution with index \(n_i=\sum_{j=1}^r y_{ij}\) and parameter
\(\pi_i=(\pi_{i1},\pi_{i2},\ldots,\pi_{ir})^T \).
A set of explanatory variables \(x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})\) can be discrete, continuous, or a combination of both. When the explanatory/predictor variables are all categorical, the baseline category logit model has an equivalent loglinear model.
Model:
When the response categories \(1, 2,\ldots, r\) are unordered, the most popular way to relate \(\pi_i\) to covariates is through a set of \(r − 1\) baseline-category logits. Taking \(j^*\) as the baseline category, the model is
\(\log\left(\dfrac{\pi_{ij}}{\pi_{ij^\ast}}\right)=x_i^T \beta_j,\qquad j \neq j^\ast\)
Note here that \(x_i\), which has length \(p\), represents the vector of terms needed to include all the predictors and any interactions of interest. We may also take the intercept to represent a predictor \(x_{i1}=1\), so that may be included with this notation as well. The coefficient vector is then
\(\beta_j=(\beta_{1j},\beta_{2j},\ldots,\beta_{pj})^T,\qquad j \neq j^\ast\)
We can choose the baseline or let the software program do it automatically.
Questions: How many logits of the model are there for the study of the effects of the radiation exposure to mortality? Can you write these models down if the baseline level is "being alive"?
Interpretation of Parameter Estimates
- The \(k^{th}\) element of \(\beta_j\) can be interpreted as the increase in log-odds of falling into category \(j\) versus category \(j^*\) resulting from a one-unit increase in the \(k^{th}\) predictor term, holding the other terms constant.
- Removing the \(k^{th}\) term from the model is equivalent to simultaneously setting \(r − 1\) coefficients to zero.
- Any of the categories can be chosen to be the baseline. The model will fit equally well, achieving the same likelihood and producing the same fitted values. Only the values and interpretation of the coefficients will change. However to develop a sensible model a 'natural' baseline is chosen. If the response is ordinal, usually the highest or the lowest category in the ordinal scale is chosen.
To calculate the \(\pi_{ij}\) values, we have for the non-baseline categories (\(j \ne j^*\))
\(\pi_{ij}=\dfrac{\exp(x_i^T \beta_j)}{1+\sum_{k \neq j^\ast}\exp(x_i^T \beta_k)}\)
And for the baseline-category, the probability is
\(\pi_{ij^\ast}=\dfrac{1}{1+\sum_{k \neq j^\ast}\exp(x_i^T \beta_k)}\)
Note that as expected, we have \(\sum_{j=1}^r\pi_{ij}=1\).
Goodness of Fit
If the estimated expected counts \(\hat{\mu}_{ij}=n_i \hat{\pi}_{ij}\) are large enough, we can test the fit of our model versus a saturated model that estimates \(\pi\) independently for \(i = 1,\ldots, N\). The deviance for comparing this model to a saturated one is (if data are grouped)
\(G^2=2\sum\limits_{i=1}^N \sum\limits_{j=1}^r y_{ij}\log\dfrac{y_{ij}}{\hat{\mu}_{ij}}\)
The saturated model has \(N(r − 1)\) free parameters, and the current model (i.e., the model we are fitting) has \(p(r − 1)\), where \(p\) is the length of \(x_i\), so the degrees of freedom are
\(df=(N-p)(r-1)\)
The corresponding Pearson statistic is
\(X^2=\sum\limits_{i=1}^N \sum\limits_{j=1}^r r^2_{ij}\)
where
\(r_{ij}=\dfrac{y_{ij}-\hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}\)
is the Pearson residual.
Under the null hypothesis that the current model is correct, both statistics are approximately distributed as \(\chi^2_{df}\) provided that
- no more than 20% of the \(\hat{\mu}_{ij}\)'s are below 5.0, and
- none are below 1.0.
In practice, this is often not satisfied, so there may be no way to assess the overall fit of the model. However, we may still apply a \(\chi^2\) approximation to \(G^2\) when comparing nested models, provided that \((N − p)(r − 1)\) is large relative to the difference in the number of parameters between them.
Overdispersion
As before, overdispersion means that the actual covariance matrix of \(Y_i\) exceeds that specified by the multinomial model,
\(V(Y_i)=n_i \left[\text{Diag}(\pi_i)-\pi_i \pi_i^T\right]\)
The expression above utilizes the format for a multivariate dispersion matrix. \(\pi_i\) denotes the vector of probabilities and \(\pi_{i}^{T}\) denotes the transpose of the vector. It is simply another way of writing the dispersion matrix instead of writing it element-wise.
It is reasonable to consider overdispersion if
- the data are grouped (\(n_i\)'s are greater than 1),
- \(x_i\) already contains all covariates worth considering, and
- the overall \(X^2\) is substantially larger than its degrees of freedom \((N − p)(r − 1)\).
In this situation, it may be worthwhile to introduce a scale parameter \(\sigma^2\), so that
\(V(Y_i)=n_i \sigma^2 \left[\text{Diag}(\pi_i)-\pi_i \pi_i^T\right]\)
Recall that the usual estimate for \(\sigma^2\) is obtained from the maximal model (not saturated) by dividing the Pearson chi-square statistic with its degrees of freedom:
\(\hat{\sigma}^2=\dfrac{X^2}{(N-p)(r-1)}\)
which is approximately unbiased if \((N − p)(r − 1)\) is large. Introducing a scale parameter does not alter the estimate of \(\beta\) (which then becomes a quasi-likelihood estimate), but it does alter our estimate of the variability of \(\hat{\beta}\). If we estimate a scale parameter, we should
- multiply the estimated ML covariance matrix for \(\hat{\beta}\) by \(\hat{\sigma}^2\) (SAS does this automatically; in R, it depends which function we use);
- divide the usual Pearson residuals by \(\hat{\sigma}^2\); and
- divide the usual \(X^2\) and \(G^2\) statistics by \(\hat{\sigma}^2\) (SAS reports these as scaled statistics; in R, it depends which function we use.).
These adjustments will have little practical effect unless the estimated scale parameter is substantially greater than 1.0 (say, 1.2 or higher).
8.2.1 - Example: Housing Satisfaction in SAS
8.2.1 - Example: Housing Satisfaction in SASExample: Housing Data (using SAS)

In this section, we will fit the baseline-category logit model to the data below via PROC LOGISTIC with LINK=GLOGIT in SAS. The model can be also fitted by using PROC GENMOD (see the reference link provided at the top page of this lesson).
The data in "housing.dat" gives information on housing conditions from a survey in Copenhagen. Variables are satisfaction (low, medium, high), perceived influence on management (low, medium, high), type of housing (tower, atrium, apartment, terrace), contact with other residents (low, high), and the frequency count of individuals in that row.
When the data are grouped, as they are in this example, SAS expects the response categories \(1, 2, \ldots , r\) to appear in a single column of the dataset, with another column containing the frequency or count as they are in this case. The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the log-likelihood; as such, we can omit them if we wish.
Let's describe these data by a baseline-category model, with Satisfaction as the outcome (baseline chosen as "medium") and other variables are predictors. Note that with this model, we treat the satisfaction levels as nominal; we'll revisit this data set with satisfaction treated as ordinal in the next section.
We let \(\pi_1\) = probability of low satisfaction, \(\pi_2\) = probability of high satisfaction, and \(\pi_3\) = probability of medium satisfaction so that the equations are
\(\log\left(\dfrac{\pi_j}{\pi_3}\right)=\beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\cdots\)
for \(j = 1,2\). The \(x\)'s represent the predictor terms, including any interactions we wish to fit as well. Note that these predictor terms do not depend on \(j\). Also, treating the predictors as nominal requires
- two indicators for perceived influence on management
- three indicators for type of housing
- one indicator for contact with other residents
With first-order terms and all two-way interactions, this gives us a total of \((3-1)(1+6+11)=36\) parameters (1 intercept, 6 first-order terms, and 11 interaction terms for each of the two non-baseline satisfaction response categories)!
Specifying the Model
To get fit statistics, we include the options aggregate and scale=none. To adjust for overdispersion, we would use scale=pearson.
data housing;
input Sat $ Infl $ Type $ Cont $ Freq;
cards;
Low Low Tower Low 21
Medium Low Tower Low 21
High Low Tower Low 28
Low Medium Tower Low 34
Medium Medium Tower Low 22
High Medium Tower Low 36
Low High Tower Low 10
Medium High Tower Low 11
High High Tower Low 36
Low Low Apartment Low 61
Medium Low Apartment Low 23
High Low Apartment Low 17
Low Medium Apartment Low 43
Medium Medium Apartment Low 35
High Medium Apartment Low 40
Low High Apartment Low 26
Medium High Apartment Low 18
High High Apartment Low 54
Low Low Atrium Low 13
Medium Low Atrium Low 9
High Low Atrium Low 10
Low Medium Atrium Low 8
Medium Medium Atrium Low 8
High Medium Atrium Low 12
Low High Atrium Low 6
Medium High Atrium Low 7
High High Atrium Low 9
Low Low Terrace Low 18
Medium Low Terrace Low 6
High Low Terrace Low 7
Low Medium Terrace Low 15
Medium Medium Terrace Low 13
High Medium Terrace Low 13
Low High Terrace Low 7
Medium High Terrace Low 5
High High Terrace Low 11
Low Low Tower High 14
Medium Low Tower High 19
High Low Tower High 37
Low Medium Tower High 17
Medium Medium Tower High 23
High Medium Tower High 40
Low High Tower High 3
Medium High Tower High 5
High High Tower High 23
Low Low Apartment High 78
Medium Low Apartment High 46
High Low Apartment High 43
Low Medium Apartment High 48
Medium Medium Apartment High 45
High Medium Apartment High 86
Low High Apartment High 15
Medium High Apartment High 25
High High Apartment High 62
Low Low Atrium High 20
Medium Low Atrium High 23
High Low Atrium High 20
Low Medium Atrium High 10
Medium Medium Atrium High 22
High Medium Atrium High 24
Low High Atrium High 7
Medium High Atrium High 10
High High Atrium High 21
Low Low Terrace High 57
Medium Low Terrace High 23
High Low Terrace High 13
Low Medium Terrace High 31
Medium Medium Terrace High 21
High Medium Terrace High 13
Low High Terrace High 5
Medium High Terrace High 6
High High Terrace High 13
; run;
* two-way interactions;
proc logistic data=housing;
freq freq;
class infl (ref="Low") type cont / param=ref;
model sat (ref="Medium") = infl type cont
infl*type infl*cont type*cont /link=glogit
aggregate=(infl type cont) scale=none;
run;
* additive model;
proc logistic data=housing;
freq freq;
class infl (ref="Low") type cont / param=ref;
model sat (ref="Medium") = infl type cont /link=glogit
aggregate=(infl type cont) scale=none;
run;
* type and cont only;
proc logistic data=housing;
freq freq;
class type cont / param=ref;
model sat (ref="Medium") = type cont /link=glogit
aggregate=(infl type cont) scale=none;
run;
* infl and cont only;
proc logistic data=housing;
freq freq;
class infl (ref="Low") cont / param=ref;
model sat (ref="Medium") = infl cont /link=glogit
aggregate=(infl type cont) scale=none;
run;
* infl and type only;
proc logistic data=housing;
freq freq;
class infl (ref="Low") type / param=ref;
model sat (ref="Medium") = infl type /link=glogit
aggregate=(infl type cont) scale=none;
run;
* intercept only;
proc logistic data=housing;
freq freq;
class / param=ref;
model sat (ref="Medium") = /link=glogit
aggregate=(infl type cont) scale=none;
run;
Output
Here is the output pertaining to the "response profile" that gives us three levels of the satisfaction response and clarifying the baseline as "Medium". For example, there are 668 individuals who responded with "High" satisfaction.
Response Profile | ||
---|---|---|
Ordered Value |
Sat | Total Frequency |
1 | High | 668 |
2 | Low | 567 |
3 | Medium | 446 |
Here is the output pertaining to the goodness of fit.
Deviance and Pearson Goodness-of-Fit Statistics | ||||
---|---|---|---|---|
Criterion | Value | DF | Value/DF | Pr > ChiSq |
Deviance | 5.9443 | 12 | 0.4954 | 0.9189 |
Pearson | 5.9670 | 12 | 0.4973 | 0.9177 |
Number of unique profiles: 24
There are \(N = 24\) profiles (unique combinations of the predictors) in this dataset. Recall that this part of the output tests the fit of the current model versus the saturated model. The saturated model, which fits a separate multinomial distribution to each profile, has \(24\cdot2 = 48\) parameters. Compared with the current model, which has 36 parameters to include up to two-way interactions only, this gives 12 degrees of freedom for the test.
The model fits well because both the residual deviance and Pearson test statistics are small, relative to their chi-square distributions; both p-values are large and indicate very little evidence of lack of fit.
Output pertaining to the significance of the predictors:
Joint Tests | |||
---|---|---|---|
Effect | DF | Wald Chi-Square |
Pr > ChiSq |
Infl | 4 | 13.2983 | 0.0099 |
Type | 6 | 25.3474 | 0.0003 |
Cont | 2 | 5.7270 | 0.0571 |
Infl*Type | 12 | 20.8975 | 0.0519 |
Infl*Cont | 4 | 0.8772 | 0.9278 |
Type*Cont | 6 | 9.4081 | 0.1519 |
Under full-rank parameterizations, Type 3 effect tests are replaced by joint tests. The joint test for an effect is a test that all the parameters associated with that effect are zero. Such joint tests might not be equivalent to Type 3 effect tests under GLM parameterization.
The first section (global null hypothesis) tests the fit of the current model against the null or intercept-only model. The null model has two parameters (one-intercept for each non-baseline equation). Therefore, the comparison has \(36 - 2 = 34\) degrees of freedom. This test is highly significant, indicating that at least one of the predictors has an effect on satisfaction.
The next section (Type III analysis of effects) shows the change in fit resulting from discarding any one of the predictors—influence, type, or contact—while keeping the others in the model. For example, the test for influence is equivalent to setting its two indicator coefficients equal to zero in each of the two logit equations; so the test for significance of influence has \(2\cdot2=4\) degrees of freedom. Judging from these tests, we see that
- Perceived influence on management depends on the type of housing to a degree (interaction), but it's borderline insignificant, given that other terms are already in the model. Other interaction terms are relatively weak.
- Both influence and housing type are strongly related to satisfaction, while contact is borderline insignificant.
This suggests that we may get a more efficient model by removing one or more interaction terms. Below are the goodness of fit statistics for the additive model (no interaction terms). This seems to fit reasonably well and has considerably fewer parameters: \((3-1)(1+6)=14\). We'll retain this additive model and interpret the parameter estimates next.
Interpreting Parameter Estimates
Let's look at the estimated coefficients of the current model that contains only the main effects:
Analysis of Maximum Likelihood Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | Sat | DF | Estimate | Standard Error |
Wald Chi-Square |
Pr > ChiSq | |
Intercept | High | 1 | 0.2805 | 0.1662 | 2.8473 | 0.0915 | |
Intercept | Low | 1 | 0.4192 | 0.1729 | 5.8768 | 0.0153 | |
Infl | High | High | 1 | 0.9477 | 0.1681 | 31.8016 | <.0001 |
Infl | High | Low | 1 | -0.6649 | 0.1863 | 12.7338 | 0.0004 |
Infl | Medium | High | 1 | 0.2885 | 0.1448 | 3.9704 | 0.0463 |
Infl | Medium | Low | 1 | -0.4464 | 0.1416 | 9.9443 | 0.0016 |
Type | Apartmen | High | 1 | -0.2999 | 0.1563 | 3.6834 | 0.0550 |
Type | Apartmen | Low | 1 | 0.4357 | 0.1725 | 6.3769 | 0.0116 |
Type | Atrium | High | 1 | -0.5393 | 0.1996 | 7.3033 | 0.0069 |
Type | Atrium | Low | 1 | -0.1314 | 0.2231 | 0.3467 | 0.5560 |
Type | Terrace | High | 1 | -0.7458 | 0.2105 | 12.5494 | 0.0004 |
Type | Terrace | Low | 1 | 0.6666 | 0.2063 | 10.4445 | 0.0012 |
Cont | High | High | 1 | 0.1210 | 0.1293 | 0.8752 | 0.3495 |
Cont | High | Low | 1 | -0.3609 | 0.1324 | 7.4284 | 0.0064 |
The LOGISTIC Procedure
Model Information | |
---|---|
Data Set | WORK.HOUSING |
Response Variable | Sat |
Number of Response Levels | 3 |
Frequency Variable | Freq |
Model | generalized logit |
Optimization Technique | Newton-Raphson |
Odds Ratio Estimates | ||||
---|---|---|---|---|
Effect | Sat | Point Estimate | 95% Wald Confidence Limits |
|
Infl High vs Low | High | 2.580 | 1.856 | 3.586 |
Infl High vs Low | Low | 0.514 | 0.357 | 0.741 |
Infl Medium vs Low | High | 1.334 | 1.005 | 1.772 |
Infl Medium vs Low | Low | 0.640 | 0.485 | 0.845 |
Type Apartmen vs Tower | High | 0.741 | 0.545 | 1.006 |
Type Apartmen vs Tower | Low | 1.546 | 1.102 | 2.168 |
Type Atrium vs Tower | High | 0.583 | 0.394 | 0.862 |
Type Atrium vs Tower | Low | 0.877 | 0.566 | 1.358 |
Type Terrace vs Tower | High | 0.474 | 0.314 | 0.717 |
Type Terrace vs Tower | Low | 1.948 | 1.300 | 2.918 |
Cont High vs Low | High | 1.129 | 0.876 | 1.454 |
Cont High vs Low | Low | 0.697 | 0.538 | 0.904 |
How do we interpret them? Recall that there are two logit equations involved: 1) the log-odds of low satisfaction versus medium and 2) the log-odds of high satisfaction versus medium. This baseline reference is in addition to any baselines involved in the predictors. For the two indicators used for perceived influence, the baseline corresponds to "low". So, from the output, the estimated coefficient for high influence is interpreted as follows: for those with high perceived influence on management, the odds of high satisfaction (versus medium) is estimated to be \(e^{.9477}=2.58\) times as high as that for those with low influence. As with other glms we've dealt with to this point, we may change the baselines arbitrarily, which changes the interpretations of the numerical values reported but not the significance of the predictor (all levels together) or the fit of the model as a whole.
The intercepts give the estimated log-odds for the baseline groups. For example, the estimated log-odds of high satisfaction (versus medium) is \(e^{.2805}=1.32\) for those with low perceived influence, housing type tower, and low contact with other residents.
Although the above output shows all three predictor main effects are significantly related to satisfaction, we may consider the same model selection techniques here as we used earlier for binary logistic regression.
Model Selection
In the main effects model, the one we fitted above, the Wald statistics reported indicate that each of the three predictors is highly significant, given the others. We can also address this by comparing deviances, full model versus reduced model; we just have to be careful of the number of parameters since each effect we test applies to two (in general, \(r-1\)) logit models. The current (additive) model has deviance \(G^2=38.6622\) with 34 degrees of freedom versus the saturated model as we noted above. If we remove the influence indicators, the deviance increases to \(G^2=147.7797\) with 38 degrees of freedom, and the LRT statistic for comparing these two models would be
\(147.7797 - 38.6622=109.1175\)
which can be compared against a chi-square distribution with \(38-34=4\) degrees of freedom (p-value approximately 0). So, the effect of influence is considered (highly) significant. Similarly, we can compare any two models in this way, provided one is a special case of the other. Repeating the model-fitting for various sets of predictors, we obtain the following analysis-of-deviance table:
Model | \(G^2\) | df |
---|---|---|
Saturated | 0.00 | 0 |
Infl + Type + Cont | 38.6622 | 34 |
Type + Cont | 147.7797 | 38 |
Infl + Cont | 100.8891 | 40 |
Infl + Type | 54.7219 | 36 |
... | ... | ... |
Null | 217.4560 | 46 |
We can refine the model by removing terms if doing so does not significantly reduce the fit.
To compare models that are not of the form full versus reduced, we can use an information criterion, such as AIC. For example, to compare the models with only Type and Cont against the model with only Infl and Cont, the AIC values would be 3599.201 and 3548.311, respectively, in favor of the latter (lower AIC).
Finally, an adjustment for overdispersion could also be implemented as it was with binary logistic regression. The assumption is that the multinomial variance assumed in the model is off by a scale parameter, which we can estimate from the Pearson goodness of fit test. For the additive model, this would be
\(\dfrac{X^2}{df}=\dfrac{38.9104}{34}=1.144\)
Since this is so close to 1, we would not opt to include such an adjustment in this case, but if we did, the effect would be to multiply all variance and standard error estimates by \(1.144\) and \(\sqrt{1.144}\), respectively.
8.2.2 - Example: Housing Satisfaction in R
8.2.2 - Example: Housing Satisfaction in RExample: Housing Data (using R)

For R, there are a number of different packages and functions we can use. For example, vglm() from VGAM package, or multinom() from nnet package, or mlogit() from globaltest package from BIOCONDUCTOR; see the links at the first page of these lecture notes and the later examples. We will focus on vglm().
The data in "housing.dat" gives information on housing conditions from a survey in Copenhagen. Variables are satisfaction (low, medium, high), perceived influence on management (low, medium, high), type of housing (tower, atrium, apartment, terrace), contact with other residents (low, high), and the frequency count of individuals in that row.
The vglm function expects the response categories \(1,2,\ldots,r\) to appear in separate columns, with rows corresponding to predictor combinations, much like the glm function. When the data are stacked as they are in this example, we can use the "spread" function to separate them. The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the log-likelihood; as such, we can omit them if we wish.
library(MASS)
library(tidyr)
data(housing)
housing = spread(housing, Sat, Freq)
The result is below. Note also that there are 24 rows corresponding to the unique combinations of the predictors.
> housing
Infl Type Cont Low Medium High
1 Low Tower Low 21 21 28
2 Low Tower High 14 19 37
3 Low Apartment Low 61 23 17
4 Low Apartment High 78 46 43
5 Low Atrium Low 13 9 10
6 Low Atrium High 20 23 20
7 Low Terrace Low 18 6 7
8 Low Terrace High 57 23 13
9 Medium Tower Low 34 22 36
10 Medium Tower High 17 23 40
11 Medium Apartment Low 43 35 40
12 Medium Apartment High 48 45 86
13 Medium Atrium Low 8 8 12
14 Medium Atrium High 10 22 24
15 Medium Terrace Low 15 13 13
16 Medium Terrace High 31 21 13
17 High Tower Low 10 11 36
18 High Tower High 3 5 23
19 High Apartment Low 26 18 54
20 High Apartment High 15 25 62
21 High Atrium Low 6 7 9
22 High Atrium High 7 10 21
23 High Terrace Low 7 5 11
24 High Terrace High 5 6 13
Let's describe these data by a baseline-category model, with Satisfaction as the outcome (baseline chosen as "medium") and other variables are predictors. Note that with this model, we treat the satisfaction levels as nominal; we'll revisit this data set with satisfaction treated as ordinal in the next section.
We let \(\pi_1\) = probability of low satisfaction, \(\pi_2\) = probability of high satisfaction, and \(\pi_3\) = probability of medium satisfaction so that the equations are
\(\log\left(\dfrac{\pi_j}{\pi_3}\right)=\beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\cdots\)
for \(j = 1,2\). The \(x\)'s represent the predictor terms, including any interactions we wish to fit as well. Note that these predictor terms do not depend on \(j\). Also, treating the predictors as nominal requires
- two indicators for perceived influence on management
- three indicators for type of housing
- one indicator for contact with other residents
With first-order terms and all two-way interactions, this gives us a total of \((3-1)(1+6+11)=36\) parameters (1 intercept, 6 first-order terms, and 11 i
Specifying the Model
The syntax of the vglm function is very similar to that of the glm function, but note that the last of the response categories is taken as the baseline by default. If we want to use "Medium" as the baseline, we need to place that last. Also, note the family is "multinomial".
library(tidyr)
library(VGAM)
library(MASS)
data(housing)
# see also https://rdrr.io/cran/MASS/man/housing.html
# baseline category logit model for nominal response
# unstacking the satisfaction response
housing = spread(housing, Sat, Freq)
# default baseline in R is last category
# all two-way interactions
fit.full = vglm(cbind(Low,High,Medium) ~ (Infl+Type+Cont)^2, family=multinomial, data=housing)
summary(fit.full)
# deviance test for lack of fit
g2 = deviance(fit.full)
df = df.residual(fit.full)
1 - pchisq(g2, df)
# pearson test for lack of fit
e = residuals(fit.full, type='pearson')
x2 = sum(e^2)
1 - pchisq(x2, df)
# significance tests based on LR statistics
anova(fit.full, type=3, test="LR")
# additive model
fit.add = vglm(cbind(Low,High,Medium) ~ Infl+Type+Cont, family=multinomial, data=housing)
summary(fit.add)
deviance(fit.add)
exp(coef(fit.add))
exp(confint(fit.add))
anova(fit.add, type=3, test="LR")
e.add = residuals(fit.add, type='pearson')
x2.add = sum(e.add^2)
df.add = df.residual(fit.add)
1 - pchisq(x2.add, df.add)
# type and cont only
fit.tc = vglm(cbind(Low,High,Medium) ~ Type+Cont, family=multinomial, data=housing)
deviance(fit.tc)
# infl and cont only
fit.ic = vglm(cbind(Low,High,Medium) ~ Infl+Cont, family=multinomial, data=housing)
deviance(fit.ic)
# infl and type only
fit.it = vglm(cbind(Low,High,Medium) ~ Infl+Type, family=multinomial, data=housing)
deviance(fit.it)
# intercept only
fit0 = vglm(cbind(Low,High,Medium) ~ 1, family=multinomial, data=housing)
deviance(fit0)
# stepwise algorithm using AIC
step4vglm(fit.full, direction='backward')
With \(N = 24\) profiles (unique combinations of the predictors) for each of the two baseline logit equations, the saturated model has \(24\cdot2 = 48\) parameters. Compared with the current model, which has 36 parameters to include up to two-way interactions only, this gives 12 degrees of freedom for the goodness of fit test.
> # deviance test for lack of fit
g2 = deviance(fit.full)
df = df.residual(fit.full)
1 - pchisq(g2, df)
# pearson test for lack of fit
e = residuals(fit.full, type='pearson')
x2 = sum(e^2)
1 - pchisq(x2, df)
Output pertaining to the significance of the predictors:
> anova(fit.full, type=3, test="LR")
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
Infl 4 13.9044 16 19.849 0.0076066 **
Type 6 26.2749 18 32.219 0.0001979 ***
Cont 2 5.8487 14 11.793 0.0536990 .
Infl:Type 12 21.0581 24 27.002 0.0495363 *
Infl:Cont 4 0.8841 16 6.828 0.9268361
Type:Cont 6 9.2898 18 15.234 0.1579255
These results are based on the likelihood-ratio tests and are very similar to SAS's Wald Type 3 statistics. They show the change in fit resulting from discarding any one of the predictors—influence, type, or contact—while keeping the others in the model. For example, the test for influence is equivalent to setting its two indicator coefficients equal to zero in each of the two logit equations; so the test for significance of influence has \(2\cdot2=4\) degrees of freedom. Judging from these tests, we see that
- Perceived influence on management depends on the type of housing to a degree (interaction), but it's only borderline significant, given that other terms are already in the model. Other interaction terms are relatively weak.
- Both influence and housing type are strongly related to satisfaction, while contact is borderline insignificant.
This suggests that we may get a more efficient model by removing one or more interaction terms. Below are the goodness of fit statistics for the additive model (no interaction terms). This seems to fit reasonably well and has considerably fewer parameters: \((3-1)(1+6)=14\). We'll retain this additive model and interpret the parameter estimates next.
Interpreting Parameter Estimates
Let's look at the estimated coefficients of the current model that contains only the main effects:
>summary(fit.add)
Call:
vglm(formula = cbind(Low, High, Medium) ~ Infl + Type + Cont,
family = multinomial, data = housing)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 0.4192 0.1729 2.424 0.015342 *
(Intercept):2 0.2805 0.1662 1.687 0.091525 .
InflMedium:1 -0.4464 0.1416 -3.153 0.001613 **
InflMedium:2 0.2885 0.1448 1.993 0.046306 *
InflHigh:1 -0.6649 0.1863 -3.568 0.000359 ***
InflHigh:2 0.9477 0.1681 5.639 1.71e-08 ***
TypeApartment:1 0.4357 0.1725 2.525 0.011562 *
TypeApartment:2 -0.2999 0.1563 -1.919 0.054955 .
TypeAtrium:1 -0.1314 0.2231 -0.589 0.555980
TypeAtrium:2 -0.5393 0.1996 -2.702 0.006883 **
TypeTerrace:1 0.6666 0.2063 3.232 0.001230 **
TypeTerrace:2 -0.7458 0.2105 -3.543 0.000396 ***
ContHigh:1 -0.3609 0.1324 -2.726 0.006420 **
ContHigh:2 0.1210 0.1293 0.936 0.349522
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Residual deviance: 38.6622 on 34 degrees of freedom
Log-likelihood: -118.8993 on 34 degrees of freedom
How do we interpret this? If we jump to toward the lower part of the output, we see that that there are two logit models (i.e., linear predictors) being fitted and that this is a baseline logit model with the last category as the baseline level; this is "Medium" since we specified the response as cbind(Low,High,Medium)
in the function call.
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
The two logit equations predict the log-odds of
- low satisfaction versus medium (1 vs 3)
- high satisfaction versus medium (2 vs 3)
This baseline reference for the response is in addition to any baselines involved in the predictors. For the two indicators used for perceived influence, the baseline corresponds to "low". So, from the output, the estimated coefficient for high influence is interpreted as follows: for those with high perceived influence on management, the odds of high satisfaction (versus medium) is estimated to be \(e^{.9477}=2.58\) times as high as that for those with low influence. As with other glms we've dealt with to this point, we may change the baselines arbitrarily, which changes the interpretations of the numerical values reported but not the significance of the predictor (all levels together) or the fit of the model as a whole.
The intercepts give the estimated log-odds for the baseline groups. For example, the estimated log-odds of high satisfaction (versus medium) is \(e^{.2805}=1.32\) for birds versus fish in this group is −2.4633 for those with low perceived influence, housing type tower, and low contact with other residents.
To get odds ratios, we exponentiate the fitted model coefficients. Note the (rounded) value 2.58 for InflHigh : 2 corresponds to the odds ratio we found above.
> exp(coef(fit.add))
(Intercept):1 (Intercept):2
1.5207882 1.3237730
InflMedium:1 InflMedium:2
0.6399304 1.3343807
InflHigh:1 InflHigh:2
0.5143068 2.5797584
TypeApartment:1 TypeApartment:2
1.5460274 0.7408604
TypeAtrium:1 TypeAtrium:2
0.8768930 0.5831281
TypeTerrace:1 TypeTerrace:2
1.9475467 0.4743750
ContHigh:1 ContHigh:2
0.6970822 1.1285968
And to add 95% confidence limits, we can run
> exp(confint(fit.add))
2.5 % 97.5 %
(Intercept):1 1.0835940 2.1343757
(Intercept):2 0.9557058 1.8335924
InflMedium:1 0.4848852 0.8445523
InflMedium:2 1.0047350 1.7721807
InflHigh:1 0.3569534 0.7410253
InflHigh:2 1.8558064 3.5861247
TypeApartment:1 1.1024451 2.1680907
TypeApartment:2 0.5453913 1.0063859
TypeAtrium:1 0.5662886 1.3578612
TypeAtrium:2 0.3943523 0.8622705
TypeTerrace:1 1.2999442 2.9177698
TypeTerrace:2 0.3140002 0.7166607
ContHigh:1 0.5377582 0.9036100
ContHigh:2 0.8759248 1.4541555
The likelihood-ratio tests for main effects for the additive model can be found with the anova function:
> anova(fit.add, type=3, test="LR")
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
Infl 4 109.117 38 147.780 < 2.2e-16 ***
Type 6 62.227 40 100.889 1.586e-11 ***
Cont 2 16.060 36 54.722 0.0003256 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Although the above output shows all three predictor main effects are significantly related to satisfaction, we may consider the same model selection techniques here as we used earlier for binary logistic regression.
Model Selection
In the main effects model, the one we fitted above, the LR statistics reported indicate that each of the three predictors is highly significant, given the others. For example, the current (additive) model has deviance \(G^2=38.6622\) with 34 degrees of freedom versus the saturated model as we noted above. If we remove the influence indicators, the deviance increases to \(G^2=147.7797\) with 38 degrees of freedom, and the LRT statistic for comparing these two models would be
\(147.7797 - 38.6622=109.1175\)
which can be compared against a chi-square distribution with \(38-34=4\) degrees of freedom (p-value approximately 0). So, the effect of influence is considered (highly) significant. Similarly, we can compare any two models in this way, provided one is a special case of the other. Repeating the model-fitting for various sets of predictors, we obtain the following analysis-of-deviance table:
Model | \(G^2\) | df |
Saturated | 0.00 | 0 |
Infl + Type + Cont | 38.6622 | 34 |
Type + Cont | 147.7797 | 38 |
Infl + Cont | 100.8891 | 40 |
Infl + Type | 54.7219 | 36 |
... | ... | ... |
Null | 217.4560 | 46 |
We can refine the model by removing terms if doing so does not significantly reduce the fit.
To compare models that are not of the form full versus reduced, we can use an information criterion, such as AIC. For example, to compare the models with only Type and Cont against the model with only Infl and Cont, the AIC values would be 366.9161 and 316.0256, respectively, in favor of the latter (lower AIC). Note that these differ from the values reported by SAS because of how the software packages differently handle constants in the likelihoods; their ordering is the same, however.
Finally, an adjustment for overdispersion could also be implemented as it was with binary logistic regression. The assumption is that the multinomial variance assumed in the model is off by a scale parameter, which we can estimate from the Pearson goodness of fit test. For the additive model, this would be
\(\dfrac{X^2}{df}=\dfrac{38.9104}{34}=1.144\)
Since this is so close to 1, we would not opt to include such an adjustment in this case, but if we did, the effect would be to multiply all variance and standard error estimates by \(1.144\) and \(\sqrt{1.144}\), respectively.
> e.add = residuals(fit.add, type='pearson')
> x2.add = sum(e.add^2)
> df.add = df.residual(fit.add)
> x2.add/df.add
[1] 1.144424