8.1 - Polytomous (Multinomial) Logistic Regression

We have already learned about binary logistic regression, where the response is a binary variable with "success" and "failure" being only two categories. But logistic regression can be extended to handle responses, \(Y\), that are polytomous, i.e. taking \(r > 2\) categories. When \(r = 2\), \(Y\) is dichotomous, and we can model the log odds that an event occurs (versus not). For binary logistic regression, there is only one logit that we can form:

\(\text{logit}(\pi)=\log\left(\dfrac{\pi}{1-\pi}\right)\)

When \(r > 2\), we have a multi-category or polytomous response variable. There are \(\dfrac{r (r − 1)}{2}\) logits (odds) that we can form, but only \((r − 1)\) are non-redundant. There are different ways to form a set of \((r − 1)\) non-redundant logits, and these will lead to different polytomous (multinomial) logistic regression models.

Multinomial Logistic Regression models how a multinomial response variable \(Y\) depends on a set of \(k\) explanatory variables, \(x=(x_1, x_2, \dots, x_k)\). This is also a GLM where the random component assumes that the distribution of \(Y\) is multinomial(\(n,\pi\)), where \(\pi\) is a vector with probabilities of "success" for the categories. As with binary logistic regression, the systematic component consists of explanatory variables (can be continuous, discrete, or both) and are linear in the parameters. The link function is the generalized logit, the logit link for each pair of non-redundant logits as discussed above.

When analyzing a polytomous response, it's important to note whether the response is ordinal (consisting of ordered categories) or nominal (consisting of unordered categories). For the binary logistic model, this question does not arise. Some types of models are appropriate only for ordinal responses (e.g., cumulative logits model, adjacent categories model). Other models may be used whether the response is ordinal or nominal (e.g., baseline logit model).

If the response is ordinal, we do not necessarily have to take the ordering into account, but neglecting to do so may lead to sub-optimal models. Using the natural ordering can

lead to a simpler, more parsimonious model and
increase power to detect relationships with other variables.

If the response variable is polytomous and all the potential predictors are discrete as well, we could describe the multi-way contingency table with a log-linear model (see Lesson 10), but this approach views all variables on equal terms without a dedicated response. If one is to be treated as a response and others as explanatory, the (multinomial) logistic regression model is more appropriate.

Grouped versus ungrouped responses

We have already seen in our discussions of logistic regression, data can come in ungrouped (e.g., database form) or grouped format (e.g., tabular form). Consider a study that explores the effect of fat content on taste rating of ice cream. The response variable \(Y\) is a multi-category (Likert scale) response, ranging from 1 (lowest rating) to 9 (highest rating). The data could arrive in ungrouped form, with one record per subject (as below) where the first column indicates the fat content, and the second column the rating:

Or it could arrive in grouped form (e.g., table):

Fat	Rating category
Fat	1	2	3	4	5	6	7	8	9
0	4	17	8	16	5	6	4	2	1
4	1	1	5	6	7	9	21	12	0
...	...	...	...	...	...	...	...	...	...
28	4	6	9	11	5	9	7	8	3

Sampling Model

In ungrouped form, the response occupies a single column of the dataset, but in grouped form, the response occupies \(r\) columns. Most computer programs for polytomous logistic regression can handle grouped or ungrouped data.

Whether the data are grouped or ungrouped, we will imagine the response to be multinomial. That is, the "response" for row \(i\),

\(y_i=(y_{i1},y_{i2},\ldots,y_{ir})^T \),

is assumed to have a multinomial distribution with index \(n_i=\sum_{j=1}^r y_{ij}\) and parameter

\(\pi_i=(\pi_{i1},\pi_{i2},\ldots,\pi_{ir})^T \).

For example, for the first row, with fat=0, \(\pi_1 = (\pi_{11}, \pi_{12}, \dots , \pi_{19})^T\).

If the data are grouped, then \(n_i\) is the total number of "trials" in the \(i^{th}\) row of the dataset, and \(y_{ij}\) is the number of trials in which outcome \(j\) occurred. For example, for the first row, there were \(n_1 = 63\) people who tasted ice cream with fat=0, and \(y_{12}=17\) among them gave the rating of 2.
If the data are ungrouped, \(y_i = j\) implies that individual observation (subject, etc.) \(i\) produced outcome \(j\).

Describing polytomous responses by a sequence of binary models

In some cases, it makes sense to "factor" the response into a sequence of binary choices and model them with a sequence of ordinary logistic models. The number of binary logistic regressions needed is equal to the number of categories of the response minus 1, e.g., \(r-1\).

For example, consider a medical study to investigate the long-term effects of radiation exposure on mortality. The response variable has four levels (\(Y=1\) if alive, \(Y=2\) if death from cause other than cancer, \(Y=3\) if death from cancer other than leukemia, and \(Y=4\) if death from leukemia). The main predictor of interest is level of exposure (low, medium, high). The four-level response can be modeled via a single multinomial model, or as a sequence of binary choices in three stages:

The stage 1 model, which is fit for all subjects, describes the log-odds of death.
The stage 2 model, which is fit only to the subjects that die, describes the log-odds of death due to cancer versus death from other causes.
The stage 3 model, which is fit only to the subjects who die of cancer, describes the log-odds of death due to leukemia versus death due to other cancers.

Because the multinomial distribution can be factored into a sequence of conditional binomials, we can fit these three logistic models separately. The overall likelihood function factors into three independent likelihoods.

This approach is attractive when the response can be naturally arranged as a sequence of binary choices. But in situations where arranging such a sequence is unnatural, we should probably fit a single multinomial model to the entire response. The ice cream example above would not be a good example for the binary sequence approach since the taste ratings do not have such a hierarchy.

8.2 - Baseline-Category Logit Model

Goal:

Give a simultaneous representation (summary) of the odds of being in one category relative to being in a designated category, called the baseline category, for all pairs of categories. This is an extension of binary logistic regression model, where we will consider \(r − 1\) non-redundant logits.

Variables:

Suppose that a response variable \(Y\),

\(y_i=(y_{i1},y_{i2},\ldots,y_{ir})^T \),

has a multinomial distribution with index \(n_i=\sum_{j=1}^r y_{ij}\) and parameter

\(\pi_i=(\pi_{i1},\pi_{i2},\ldots,\pi_{ir})^T \).

A set of explanatory variables \(x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})\) can be discrete, continuous, or a combination of both. When the explanatory/predictor variables are all categorical, the baseline category logit model has an equivalent loglinear model.

Model:

When the response categories \(1, 2,\ldots, r\) are unordered, the most popular way to relate \(\pi_i\) to covariates is through a set of \(r − 1\) baseline-category logits. Taking \(j^*\) as the baseline category, the model is

\(\log\left(\dfrac{\pi_{ij}}{\pi_{ij^\ast}}\right)=x_i^T \beta_j,\qquad j \neq j^\ast\)

Note here that \(x_i\), which has length \(p\), represents the vector of terms needed to include all the predictors and any interactions of interest. We may also take the intercept to represent a predictor \(x_{i1}=1\), so that may be included with this notation as well. The coefficient vector is then

\(\beta_j=(\beta_{1j},\beta_{2j},\ldots,\beta_{pj})^T,\qquad j \neq j^\ast\)

We can choose the baseline or let the software program do it automatically.

Questions: How many logits of the model are there for the study of the effects of the radiation exposure to mortality? Can you write these models down if the baseline level is "being alive"?

Interpretation of Parameter Estimates

The \(k^{th}\) element of \(\beta_j\) can be interpreted as the increase in log-odds of falling into category \(j\) versus category \(j^*\) resulting from a one-unit increase in the \(k^{th}\) predictor term, holding the other terms constant.

Removing the \(k^{th}\) term from the model is equivalent to simultaneously setting \(r − 1\) coefficients to zero.

Any of the categories can be chosen to be the baseline. The model will fit equally well, achieving the same likelihood and producing the same fitted values. Only the values and interpretation of the coefficients will change. However to develop a sensible model a 'natural' baseline is chosen. If the response is ordinal, usually the highest or the lowest category in the ordinal scale is chosen.

To calculate the \(\pi_{ij}\) values, we have for the non-baseline categories (\(j \ne j^*\))

\(\pi_{ij}=\dfrac{\exp(x_i^T \beta_j)}{1+\sum_{k \neq j^\ast}\exp(x_i^T \beta_k)}\)

And for the baseline-category, the probability is

\(\pi_{ij^\ast}=\dfrac{1}{1+\sum_{k \neq j^\ast}\exp(x_i^T \beta_k)}\)

Note that as expected, we have \(\sum_{j=1}^r\pi_{ij}=1\).

Goodness of Fit

If the estimated expected counts \(\hat{\mu}_{ij}=n_i \hat{\pi}_{ij}\) are large enough, we can test the fit of our model versus a saturated model that estimates \(\pi\) independently for \(i = 1,\ldots, N\). The deviance for comparing this model to a saturated one is (if data are grouped)

\(G^2=2\sum\limits_{i=1}^N \sum\limits_{j=1}^r y_{ij}\log\dfrac{y_{ij}}{\hat{\mu}_{ij}}\)

The saturated model has \(N(r − 1)\) free parameters, and the current model (i.e., the model we are fitting) has \(p(r − 1)\), where \(p\) is the length of \(x_i\), so the degrees of freedom are

\(df=(N-p)(r-1)\)

The corresponding Pearson statistic is

\(X^2=\sum\limits_{i=1}^N \sum\limits_{j=1}^r r^2_{ij}\)

where

\(r_{ij}=\dfrac{y_{ij}-\hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}\)

is the Pearson residual.

Under the null hypothesis that the current model is correct, both statistics are approximately distributed as \(\chi^2_{df}\) provided that

no more than 20% of the \(\hat{\mu}_{ij}\)'s are below 5.0, and
none are below 1.0.

In practice, this is often not satisfied, so there may be no way to assess the overall fit of the model. However, we may still apply a \(\chi^2\) approximation to \(G^2\) when comparing nested models, provided that \((N − p)(r − 1)\) is large relative to the difference in the number of parameters between them.

Overdispersion

As before, overdispersion means that the actual covariance matrix of \(Y_i\) exceeds that specified by the multinomial model,

\(V(Y_i)=n_i \left[\text{Diag}(\pi_i)-\pi_i \pi_i^T\right]\)

The expression above utilizes the format for a multivariate dispersion matrix. \(\pi_i\) denotes the vector of probabilities and \(\pi_{i}^{T}\) denotes the transpose of the vector. It is simply another way of writing the dispersion matrix instead of writing it element-wise.

It is reasonable to consider overdispersion if

the data are grouped (\(n_i\)'s are greater than 1),
\(x_i\) already contains all covariates worth considering, and
the overall \(X^2\) is substantially larger than its degrees of freedom \((N − p)(r − 1)\).

In this situation, it may be worthwhile to introduce a scale parameter \(\sigma^2\), so that

\(V(Y_i)=n_i \sigma^2 \left[\text{Diag}(\pi_i)-\pi_i \pi_i^T\right]\)

Recall that the usual estimate for \(\sigma^2\) is obtained from the maximal model (not saturated) by dividing the Pearson chi-square statistic with its degrees of freedom:

\(\hat{\sigma}^2=\dfrac{X^2}{(N-p)(r-1)}\)

which is approximately unbiased if \((N − p)(r − 1)\) is large. Introducing a scale parameter does not alter the estimate of \(\beta\) (which then becomes a quasi-likelihood estimate), but it does alter our estimate of the variability of \(\hat{\beta}\). If we estimate a scale parameter, we should

multiply the estimated ML covariance matrix for \(\hat{\beta}\) by \(\hat{\sigma}^2\) (SAS does this automatically; in R, it depends which function we use);
divide the usual Pearson residuals by \(\hat{\sigma}^2\); and
divide the usual \(X^2\) and \(G^2\) statistics by \(\hat{\sigma}^2\) (SAS reports these as scaled statistics; in R, it depends which function we use.).

These adjustments will have little practical effect unless the estimated scale parameter is substantially greater than 1.0 (say, 1.2 or higher).

8.2.1 - Example: Housing Satisfaction in SAS

Example: Housing Data (using SAS)

In this section, we will fit the baseline-category logit model to the data below via PROC LOGISTIC with LINK=GLOGIT in SAS. The model can be also fitted by using PROC GENMOD (see the reference link provided at the top page of this lesson).

The data in "housing.dat" gives information on housing conditions from a survey in Copenhagen. Variables are satisfaction (low, medium, high), perceived influence on management (low, medium, high), type of housing (tower, atrium, apartment, terrace), contact with other residents (low, high), and the frequency count of individuals in that row.

When the data are grouped, as they are in this example, SAS expects the response categories \(1, 2, \ldots , r\) to appear in a single column of the dataset, with another column containing the frequency or count as they are in this case. The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the log-likelihood; as such, we can omit them if we wish.

Let's describe these data by a baseline-category model, with Satisfaction as the outcome (baseline chosen as "medium") and other variables are predictors. Note that with this model, we treat the satisfaction levels as nominal; we'll revisit this data set with satisfaction treated as ordinal in the next section.

We let \(\pi_1\) = probability of low satisfaction, \(\pi_2\) = probability of high satisfaction, and \(\pi_3\) = probability of medium satisfaction so that the equations are

\(\log\left(\dfrac{\pi_j}{\pi_3}\right)=\beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\cdots\)

for \(j = 1,2\). The \(x\)'s represent the predictor terms, including any interactions we wish to fit as well. Note that these predictor terms do not depend on \(j\). Also, treating the predictors as nominal requires

two indicators for perceived influence on management
three indicators for type of housing
one indicator for contact with other residents

With first-order terms and all two-way interactions, this gives us a total of \((3-1)(1+6+11)=36\) parameters (1 intercept, 6 first-order terms, and 11 interaction terms for each of the two non-baseline satisfaction response categories)!

Specifying the Model

Note! In the model statement, we need to tell SAS about the existence of a count or frequency variable; otherwise, SAS will assume that the data are ungrouped, with each line representing a single observation. We also need to specify which of the categories is the baseline, e.g., model sat (ref="Medium") sets medium satisfaction as the baseline. The link function is glogit, for generalized logit.

To get fit statistics, we include the options aggregate and scale=none. To adjust for overdispersion, we would use scale=pearson.

data housing;
input Sat $ Infl $ Type $ Cont $ Freq;
cards;
     Low    Low     Tower  Low   21
  Medium    Low     Tower  Low   21
    High    Low     Tower  Low   28
     Low Medium     Tower  Low   34
  Medium Medium     Tower  Low   22
    High Medium     Tower  Low   36
     Low   High     Tower  Low   10
  Medium   High     Tower  Low   11
    High   High     Tower  Low   36
    Low    Low Apartment  Low   61
 Medium    Low Apartment  Low   23
   High    Low Apartment  Low   17
    Low Medium Apartment  Low   43
 Medium Medium Apartment  Low   35
   High Medium Apartment  Low   40
    Low   High Apartment  Low   26
 Medium   High Apartment  Low   18
   High   High Apartment  Low   54
    Low    Low    Atrium  Low   13
 Medium    Low    Atrium  Low    9
   High    Low    Atrium  Low   10
    Low Medium    Atrium  Low    8
 Medium Medium    Atrium  Low    8
   High Medium    Atrium  Low   12
    Low   High    Atrium  Low    6
 Medium   High    Atrium  Low    7
   High   High    Atrium  Low    9
    Low    Low   Terrace  Low   18
 Medium    Low   Terrace  Low    6
   High    Low   Terrace  Low    7
    Low Medium   Terrace  Low   15
 Medium Medium   Terrace  Low   13
   High Medium   Terrace  Low   13
    Low   High   Terrace  Low    7
 Medium   High   Terrace  Low    5
   High   High   Terrace  Low   11
    Low    Low     Tower High   14
 Medium    Low     Tower High   19
   High    Low     Tower High   37
    Low Medium     Tower High   17
 Medium Medium     Tower High   23
   High Medium     Tower High   40
    Low   High     Tower High    3
 Medium   High     Tower High    5
   High   High     Tower High   23
    Low    Low Apartment High   78
 Medium    Low Apartment High   46
   High    Low Apartment High   43
    Low Medium Apartment High   48
 Medium Medium Apartment High   45
   High Medium Apartment High   86
    Low   High Apartment High   15
 Medium   High Apartment High   25
   High   High Apartment High   62
    Low    Low    Atrium High   20
 Medium    Low    Atrium High   23
   High    Low    Atrium High   20
    Low Medium    Atrium High   10
 Medium Medium    Atrium High   22
   High Medium    Atrium High   24
    Low   High    Atrium High    7
 Medium   High    Atrium High   10
   High   High    Atrium High   21
    Low    Low   Terrace High   57
 Medium    Low   Terrace High   23
   High    Low   Terrace High   13
    Low Medium   Terrace High   31
 Medium Medium   Terrace High   21
   High Medium   Terrace High   13
    Low   High   Terrace High    5
 Medium   High   Terrace High    6
   High   High   Terrace High   13
; run;

* two-way interactions;
proc logistic data=housing;
freq freq;
class infl (ref="Low") type cont / param=ref;
model sat (ref="Medium") =  infl type cont 
 infl*type infl*cont type*cont /link=glogit 
 aggregate=(infl type cont) scale=none;
run;

* additive model;
proc logistic data=housing;
freq freq;
class infl (ref="Low") type cont / param=ref;
model sat (ref="Medium") =  infl type cont /link=glogit 
 aggregate=(infl type cont) scale=none;
run;

* type and cont only;
proc logistic data=housing;
freq freq;
class type cont / param=ref;
model sat (ref="Medium") = type cont /link=glogit 
 aggregate=(infl type cont) scale=none;
run;

* infl and cont only;
proc logistic data=housing;
freq freq;
class infl (ref="Low") cont / param=ref;
model sat (ref="Medium") = infl cont /link=glogit 
 aggregate=(infl type cont) scale=none;
run;

* infl and type only;
proc logistic data=housing;
freq freq;
class infl (ref="Low") type / param=ref;
model sat (ref="Medium") = infl type /link=glogit 
 aggregate=(infl type cont) scale=none;
run;

* intercept only;
proc logistic data=housing;
freq freq;
class / param=ref;
model sat (ref="Medium") = /link=glogit 
 aggregate=(infl type cont) scale=none;
run;

Output

Here is the output pertaining to the "response profile" that gives us three levels of the satisfaction response and clarifying the baseline as "Medium". For example, there are 668 individuals who responded with "High" satisfaction.


Response Profile
Ordered Value	Sat	Total Frequency
1	High	668
2	Low	567
3	Medium	446

Here is the output pertaining to the goodness of fit.


Deviance and Pearson Goodness-of-Fit Statistics
Criterion	Value	DF	Value/DF	Pr > ChiSq
Deviance	5.9443	12	0.4954	0.9189
Pearson	5.9670	12	0.4973	0.9177

Number of unique profiles: 24

There are \(N = 24\) profiles (unique combinations of the predictors) in this dataset. Recall that this part of the output tests the fit of the current model versus the saturated model. The saturated model, which fits a separate multinomial distribution to each profile, has \(24\cdot2 = 48\) parameters. Compared with the current model, which has 36 parameters to include up to two-way interactions only, this gives 12 degrees of freedom for the test.

Stop and Think!

Does this model fit well?

The model fits well because both the residual deviance and Pearson test statistics are small, relative to their chi-square distributions; both p-values are large and indicate very little evidence of lack of fit.

Output pertaining to the significance of the predictors:


Joint Tests
Effect	DF	Wald Chi-Square	Pr > ChiSq
Infl	4	13.2983	0.0099
Type	6	25.3474	0.0003
Cont	2	5.7270	0.0571
Infl*Type	12	20.8975	0.0519
Infl*Cont	4	0.8772	0.9278
Type*Cont	6	9.4081	0.1519

Note:Under full-rank parameterizations, Type 3 effect tests are replaced by joint tests. The joint test for an effect is a test that all the parameters associated with that effect are zero. Such joint tests might not be equivalent to Type 3 effect tests under GLM parameterization.

The first section (global null hypothesis) tests the fit of the current model against the null or intercept-only model. The null model has two parameters (one-intercept for each non-baseline equation). Therefore, the comparison has \(36 - 2 = 34\) degrees of freedom. This test is highly significant, indicating that at least one of the predictors has an effect on satisfaction.

The next section (Type III analysis of effects) shows the change in fit resulting from discarding any one of the predictors—influence, type, or contact—while keeping the others in the model. For example, the test for influence is equivalent to setting its two indicator coefficients equal to zero in each of the two logit equations; so the test for significance of influence has \(2\cdot2=4\) degrees of freedom. Judging from these tests, we see that

Perceived influence on management depends on the type of housing to a degree (interaction), but it's borderline insignificant, given that other terms are already in the model. Other interaction terms are relatively weak.
Both influence and housing type are strongly related to satisfaction, while contact is borderline insignificant.

This suggests that we may get a more efficient model by removing one or more interaction terms. Below are the goodness of fit statistics for the additive model (no interaction terms). This seems to fit reasonably well and has considerably fewer parameters: \((3-1)(1+6)=14\). We'll retain this additive model and interpret the parameter estimates next.

Interpreting Parameter Estimates

Let's look at the estimated coefficients of the current model that contains only the main effects:


Analysis of Maximum Likelihood Estimates
Parameter		Sat	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		High	1	0.2805	0.1662	2.8473	0.0915
Intercept		Low	1	0.4192	0.1729	5.8768	0.0153
Infl	High	High	1	0.9477	0.1681	31.8016	<.0001
Infl	High	Low	1	-0.6649	0.1863	12.7338	0.0004
Infl	Medium	High	1	0.2885	0.1448	3.9704	0.0463
Infl	Medium	Low	1	-0.4464	0.1416	9.9443	0.0016
Type	Apartmen	High	1	-0.2999	0.1563	3.6834	0.0550
Type	Apartmen	Low	1	0.4357	0.1725	6.3769	0.0116
Type	Atrium	High	1	-0.5393	0.1996	7.3033	0.0069
Type	Atrium	Low	1	-0.1314	0.2231	0.3467	0.5560
Type	Terrace	High	1	-0.7458	0.2105	12.5494	0.0004
Type	Terrace	Low	1	0.6666	0.2063	10.4445	0.0012
Cont	High	High	1	0.1210	0.1293	0.8752	0.3495
Cont	High	Low	1	-0.3609	0.1324	7.4284	0.0064

The SAS System

The LOGISTIC Procedure


Model Information
Data Set	WORK.HOUSING
Response Variable	Sat
Number of Response Levels	3
Frequency Variable	Freq
Model	generalized logit
Optimization Technique	Newton-Raphson


Odds Ratio Estimates
Effect	Sat	Point Estimate	95% Wald Confidence Limits
Infl High vs Low	High	2.580	1.856	3.586
Infl High vs Low	Low	0.514	0.357	0.741
Infl Medium vs Low	High	1.334	1.005	1.772
Infl Medium vs Low	Low	0.640	0.485	0.845
Type Apartmen vs Tower	High	0.741	0.545	1.006
Type Apartmen vs Tower	Low	1.546	1.102	2.168
Type Atrium vs Tower	High	0.583	0.394	0.862
Type Atrium vs Tower	Low	0.877	0.566	1.358
Type Terrace vs Tower	High	0.474	0.314	0.717
Type Terrace vs Tower	Low	1.948	1.300	2.918
Cont High vs Low	High	1.129	0.876	1.454
Cont High vs Low	Low	0.697	0.538	0.904

How do we interpret them? Recall that there are two logit equations involved: 1) the log-odds of low satisfaction versus medium and 2) the log-odds of high satisfaction versus medium. This baseline reference is in addition to any baselines involved in the predictors. For the two indicators used for perceived influence, the baseline corresponds to "low". So, from the output, the estimated coefficient for high influence is interpreted as follows: for those with high perceived influence on management, the odds of high satisfaction (versus medium) is estimated to be \(e^{.9477}=2.58\) times as high as that for those with low influence. As with other glms we've dealt with to this point, we may change the baselines arbitrarily, which changes the interpretations of the numerical values reported but not the significance of the predictor (all levels together) or the fit of the model as a whole.

The intercepts give the estimated log-odds for the baseline groups. For example, the estimated log-odds of high satisfaction (versus medium) is \(e^{.2805}=1.32\) for those with low perceived influence, housing type tower, and low contact with other residents.

Although the above output shows all three predictor main effects are significantly related to satisfaction, we may consider the same model selection techniques here as we used earlier for binary logistic regression.

Model Selection

In the main effects model, the one we fitted above, the Wald statistics reported indicate that each of the three predictors is highly significant, given the others. We can also address this by comparing deviances, full model versus reduced model; we just have to be careful of the number of parameters since each effect we test applies to two (in general, \(r-1\)) logit models. The current (additive) model has deviance \(G^2=38.6622\) with 34 degrees of freedom versus the saturated model as we noted above. If we remove the influence indicators, the deviance increases to \(G^2=147.7797\) with 38 degrees of freedom, and the LRT statistic for comparing these two models would be

\(147.7797 - 38.6622=109.1175\)

which can be compared against a chi-square distribution with \(38-34=4\) degrees of freedom (p-value approximately 0). So, the effect of influence is considered (highly) significant. Similarly, we can compare any two models in this way, provided one is a special case of the other. Repeating the model-fitting for various sets of predictors, we obtain the following analysis-of-deviance table:

Model	\(G^2\)	df
Saturated	0.00	0
Infl + Type + Cont	38.6622	34
Type + Cont	147.7797	38
Infl + Cont	100.8891	40
Infl + Type	54.7219	36
...	...	...
Null	217.4560	46

We can refine the model by removing terms if doing so does not significantly reduce the fit.

To compare models that are not of the form full versus reduced, we can use an information criterion, such as AIC. For example, to compare the models with only Type and Cont against the model with only Infl and Cont, the AIC values would be 3599.201 and 3548.311, respectively, in favor of the latter (lower AIC).

Finally, an adjustment for overdispersion could also be implemented as it was with binary logistic regression. The assumption is that the multinomial variance assumed in the model is off by a scale parameter, which we can estimate from the Pearson goodness of fit test. For the additive model, this would be

\(\dfrac{X^2}{df}=\dfrac{38.9104}{34}=1.144\)

Since this is so close to 1, we would not opt to include such an adjustment in this case, but if we did, the effect would be to multiply all variance and standard error estimates by \(1.144\) and \(\sqrt{1.144}\), respectively.

8.2.2 - Example: Housing Satisfaction in R

Example: Housing Data (using R)

For R, there are a number of different packages and functions we can use. For example, vglm() from VGAM package, or multinom() from nnet package, or mlogit() from globaltest package from BIOCONDUCTOR; see the links at the first page of these lecture notes and the later examples. We will focus on vglm().

The data in "housing.dat" gives information on housing conditions from a survey in Copenhagen. Variables are satisfaction (low, medium, high), perceived influence on management (low, medium, high), type of housing (tower, atrium, apartment, terrace), contact with other residents (low, high), and the frequency count of individuals in that row.

The vglm function expects the response categories \(1,2,\ldots,r\) to appear in separate columns, with rows corresponding to predictor combinations, much like the glm function. When the data are stacked as they are in this example, we can use the "spread" function to separate them. The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the log-likelihood; as such, we can omit them if we wish.

library(MASS)
library(tidyr)
data(housing)
housing = spread(housing, Sat, Freq)

The result is below. Note also that there are 24 rows corresponding to the unique combinations of the predictors.

> housing
     Infl      Type Cont Low Medium High
1     Low     Tower  Low  21     21   28
2     Low     Tower High  14     19   37
3     Low Apartment  Low  61     23   17
4     Low Apartment High  78     46   43
5     Low    Atrium  Low  13      9   10
6     Low    Atrium High  20     23   20
7     Low   Terrace  Low  18      6    7
8     Low   Terrace High  57     23   13
9  Medium     Tower  Low  34     22   36
10 Medium     Tower High  17     23   40
11 Medium Apartment  Low  43     35   40
12 Medium Apartment High  48     45   86
13 Medium    Atrium  Low   8      8   12
14 Medium    Atrium High  10     22   24
15 Medium   Terrace  Low  15     13   13
16 Medium   Terrace High  31     21   13
17   High     Tower  Low  10     11   36
18   High     Tower High   3      5   23
19   High Apartment  Low  26     18   54
20   High Apartment High  15     25   62
21   High    Atrium  Low   6      7    9
22   High    Atrium High   7     10   21
23   High   Terrace  Low   7      5   11
24   High   Terrace High   5      6   13

Let's describe these data by a baseline-category model, with Satisfaction as the outcome (baseline chosen as "medium") and other variables are predictors. Note that with this model, we treat the satisfaction levels as nominal; we'll revisit this data set with satisfaction treated as ordinal in the next section.

We let \(\pi_1\) = probability of low satisfaction, \(\pi_2\) = probability of high satisfaction, and \(\pi_3\) = probability of medium satisfaction so that the equations are

\(\log\left(\dfrac{\pi_j}{\pi_3}\right)=\beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\cdots\)

for \(j = 1,2\). The \(x\)'s represent the predictor terms, including any interactions we wish to fit as well. Note that these predictor terms do not depend on \(j\). Also, treating the predictors as nominal requires

two indicators for perceived influence on management
three indicators for type of housing
one indicator for contact with other residents

With first-order terms and all two-way interactions, this gives us a total of \((3-1)(1+6+11)=36\) parameters (1 intercept, 6 first-order terms, and 11 i

Specifying the Model

The syntax of the vglm function is very similar to that of the glm function, but note that the last of the response categories is taken as the baseline by default. If we want to use "Medium" as the baseline, we need to place that last. Also, note the family is "multinomial".

library(tidyr)
library(VGAM)
library(MASS)

data(housing)
# see also https://rdrr.io/cran/MASS/man/housing.html
# baseline category logit model for nominal response

# unstacking the satisfaction response
housing = spread(housing, Sat, Freq)

# default baseline in R is last category
# all two-way interactions
fit.full = vglm(cbind(Low,High,Medium) ~ (Infl+Type+Cont)^2, family=multinomial, data=housing)
summary(fit.full)

# deviance test for lack of fit
g2 = deviance(fit.full)
df = df.residual(fit.full)
1 - pchisq(g2, df)

# pearson test for lack of fit
e = residuals(fit.full, type='pearson')
x2 = sum(e^2)
1 - pchisq(x2, df)

# significance tests based on LR statistics
anova(fit.full, type=3, test="LR")

# additive model
fit.add = vglm(cbind(Low,High,Medium) ~ Infl+Type+Cont, family=multinomial, data=housing)
summary(fit.add)
deviance(fit.add)
exp(coef(fit.add))
exp(confint(fit.add))
anova(fit.add, type=3, test="LR")
e.add = residuals(fit.add, type='pearson')
x2.add = sum(e.add^2)
df.add = df.residual(fit.add)
1 - pchisq(x2.add, df.add)

# type and cont only
fit.tc = vglm(cbind(Low,High,Medium) ~ Type+Cont, family=multinomial, data=housing)
deviance(fit.tc)

# infl and cont only
fit.ic = vglm(cbind(Low,High,Medium) ~ Infl+Cont, family=multinomial, data=housing)
deviance(fit.ic)

# infl and type only
fit.it = vglm(cbind(Low,High,Medium) ~ Infl+Type, family=multinomial, data=housing)
deviance(fit.it)

# intercept only
fit0 = vglm(cbind(Low,High,Medium) ~ 1, family=multinomial, data=housing)
deviance(fit0)

# stepwise algorithm using AIC
step4vglm(fit.full, direction='backward')

With \(N = 24\) profiles (unique combinations of the predictors) for each of the two baseline logit equations, the saturated model has \(24\cdot2 = 48\) parameters. Compared with the current model, which has 36 parameters to include up to two-way interactions only, this gives 12 degrees of freedom for the goodness of fit test.

> # deviance test for lack of fit
g2 = deviance(fit.full)
df = df.residual(fit.full)
1 - pchisq(g2, df)

# pearson test for lack of fit
e = residuals(fit.full, type='pearson')
x2 = sum(e^2)
1 - pchisq(x2, df)

Stop and Think!

Does this model fit well?

The model fits well because both the residual deviance and Pearson test statistics are small, relative to their chi-square distributions; both p-values are large and indicate very little evidence of lack of fit.

Output pertaining to the significance of the predictors:

> anova(fit.full, type=3, test="LR")

          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
Infl       4  13.9044        16     19.849 0.0076066 ** 
Type       6  26.2749        18     32.219 0.0001979 ***
Cont       2   5.8487        14     11.793 0.0536990 .  
Infl:Type 12  21.0581        24     27.002 0.0495363 *  
Infl:Cont  4   0.8841        16      6.828 0.9268361    
Type:Cont  6   9.2898        18     15.234 0.1579255

These results are based on the likelihood-ratio tests and are very similar to SAS's Wald Type 3 statistics. They show the change in fit resulting from discarding any one of the predictors—influence, type, or contact—while keeping the others in the model. For example, the test for influence is equivalent to setting its two indicator coefficients equal to zero in each of the two logit equations; so the test for significance of influence has \(2\cdot2=4\) degrees of freedom. Judging from these tests, we see that

Perceived influence on management depends on the type of housing to a degree (interaction), but it's only borderline significant, given that other terms are already in the model. Other interaction terms are relatively weak.
Both influence and housing type are strongly related to satisfaction, while contact is borderline insignificant.

This suggests that we may get a more efficient model by removing one or more interaction terms. Below are the goodness of fit statistics for the additive model (no interaction terms). This seems to fit reasonably well and has considerably fewer parameters: \((3-1)(1+6)=14\). We'll retain this additive model and interpret the parameter estimates next.

Interpreting Parameter Estimates

Let's look at the estimated coefficients of the current model that contains only the main effects:

 >summary(fit.add)

Call:
vglm(formula = cbind(Low, High, Medium) ~ Infl + Type + Cont, 
    family = multinomial, data = housing)

Coefficients: 
                Estimate Std. Error z value Pr(>|z|)    
(Intercept):1     0.4192     0.1729   2.424 0.015342 *  
(Intercept):2     0.2805     0.1662   1.687 0.091525 .  
InflMedium:1     -0.4464     0.1416  -3.153 0.001613 ** 
InflMedium:2      0.2885     0.1448   1.993 0.046306 *  
InflHigh:1       -0.6649     0.1863  -3.568 0.000359 ***
InflHigh:2        0.9477     0.1681   5.639 1.71e-08 ***
TypeApartment:1   0.4357     0.1725   2.525 0.011562 *  
TypeApartment:2  -0.2999     0.1563  -1.919 0.054955 .  
TypeAtrium:1     -0.1314     0.2231  -0.589 0.555980    
TypeAtrium:2     -0.5393     0.1996  -2.702 0.006883 ** 
TypeTerrace:1     0.6666     0.2063   3.232 0.001230 ** 
TypeTerrace:2    -0.7458     0.2105  -3.543 0.000396 ***
ContHigh:1       -0.3609     0.1324  -2.726 0.006420 ** 
ContHigh:2        0.1210     0.1293   0.936 0.349522    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Residual deviance: 38.6622 on 34 degrees of freedom
Log-likelihood: -118.8993 on 34 degrees of freedom

How do we interpret this? If we jump to toward the lower part of the output, we see that that there are two logit models (i.e., linear predictors) being fitted and that this is a baseline logit model with the last category as the baseline level; this is "Medium" since we specified the response as cbind(Low,High,Medium) in the function call.

Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])

The two logit equations predict the log-odds of

low satisfaction versus medium (1 vs 3)
high satisfaction versus medium (2 vs 3)

This baseline reference for the response is in addition to any baselines involved in the predictors. For the two indicators used for perceived influence, the baseline corresponds to "low". So, from the output, the estimated coefficient for high influence is interpreted as follows: for those with high perceived influence on management, the odds of high satisfaction (versus medium) is estimated to be \(e^{.9477}=2.58\) times as high as that for those with low influence. As with other glms we've dealt with to this point, we may change the baselines arbitrarily, which changes the interpretations of the numerical values reported but not the significance of the predictor (all levels together) or the fit of the model as a whole.

The intercepts give the estimated log-odds for the baseline groups. For example, the estimated log-odds of high satisfaction (versus medium) is \(e^{.2805}=1.32\) for birds versus fish in this group is −2.4633 for those with low perceived influence, housing type tower, and low contact with other residents.

To get odds ratios, we exponentiate the fitted model coefficients. Note the (rounded) value 2.58 for InflHigh : 2 corresponds to the odds ratio we found above.

> exp(coef(fit.add))
  (Intercept):1   (Intercept):2 
      1.5207882       1.3237730 
   InflMedium:1    InflMedium:2 
      0.6399304       1.3343807 
     InflHigh:1      InflHigh:2 
      0.5143068       2.5797584 
TypeApartment:1 TypeApartment:2 
      1.5460274       0.7408604 
   TypeAtrium:1    TypeAtrium:2 
      0.8768930       0.5831281 
  TypeTerrace:1   TypeTerrace:2 
      1.9475467       0.4743750 
     ContHigh:1      ContHigh:2 
      0.6970822       1.1285968

And to add 95% confidence limits, we can run

> exp(confint(fit.add))
                    2.5 %    97.5 %
(Intercept):1   1.0835940 2.1343757
(Intercept):2   0.9557058 1.8335924
InflMedium:1    0.4848852 0.8445523
InflMedium:2    1.0047350 1.7721807
InflHigh:1      0.3569534 0.7410253
InflHigh:2      1.8558064 3.5861247
TypeApartment:1 1.1024451 2.1680907
TypeApartment:2 0.5453913 1.0063859
TypeAtrium:1    0.5662886 1.3578612
TypeAtrium:2    0.3943523 0.8622705
TypeTerrace:1   1.2999442 2.9177698
TypeTerrace:2   0.3140002 0.7166607
ContHigh:1      0.5377582 0.9036100
ContHigh:2      0.8759248 1.4541555

The likelihood-ratio tests for main effects for the additive model can be found with the anova function:

> anova(fit.add, type=3, test="LR")

     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
Infl  4  109.117        38    147.780 < 2.2e-16 ***
Type  6   62.227        40    100.889 1.586e-11 ***
Cont  2   16.060        36     54.722 0.0003256 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Although the above output shows all three predictor main effects are significantly related to satisfaction, we may consider the same model selection techniques here as we used earlier for binary logistic regression.

Model Selection

In the main effects model, the one we fitted above, the LR statistics reported indicate that each of the three predictors is highly significant, given the others. For example, the current (additive) model has deviance \(G^2=38.6622\) with 34 degrees of freedom versus the saturated model as we noted above. If we remove the influence indicators, the deviance increases to \(G^2=147.7797\) with 38 degrees of freedom, and the LRT statistic for comparing these two models would be

\(147.7797 - 38.6622=109.1175\)

which can be compared against a chi-square distribution with \(38-34=4\) degrees of freedom (p-value approximately 0). So, the effect of influence is considered (highly) significant. Similarly, we can compare any two models in this way, provided one is a special case of the other. Repeating the model-fitting for various sets of predictors, we obtain the following analysis-of-deviance table:

Model	\(G^2\)	df
Saturated	0.00	0
Infl + Type + Cont	38.6622	34
Type + Cont	147.7797	38
Infl + Cont	100.8891	40
Infl + Type	54.7219	36
...	...	...
Null	217.4560	46

We can refine the model by removing terms if doing so does not significantly reduce the fit.

To compare models that are not of the form full versus reduced, we can use an information criterion, such as AIC. For example, to compare the models with only Type and Cont against the model with only Infl and Cont, the AIC values would be 366.9161 and 316.0256, respectively, in favor of the latter (lower AIC). Note that these differ from the values reported by SAS because of how the software packages differently handle constants in the likelihoods; their ordering is the same, however.

Finally, an adjustment for overdispersion could also be implemented as it was with binary logistic regression. The assumption is that the multinomial variance assumed in the model is off by a scale parameter, which we can estimate from the Pearson goodness of fit test. For the additive model, this would be

\(\dfrac{X^2}{df}=\dfrac{38.9104}{34}=1.144\)

Since this is so close to 1, we would not opt to include such an adjustment in this case, but if we did, the effect would be to multiply all variance and standard error estimates by \(1.144\) and \(\sqrt{1.144}\), respectively.

> e.add = residuals(fit.add, type='pearson')
> x2.add = sum(e.add^2)
> df.add = df.residual(fit.add)
> x2.add/df.add
[1] 1.144424

8.3 - Adjacent-Category Logits

Let us suppose that the response categories \(1, 2,\ldots , r\) are ordered, (e.g., nine responses in ice cream rating). Rather than considering the probability of each category versus a baseline, it now makes sense to consider the probability of

outcome 1 versus 2,
outcome 2 versus 3,
outcome 3 versus 4,
...
outcome \(r − 1\) versus \(r\).

This comparison of adjacent-categories will make more sense for the mortality data example. For the mortality data, consider the logits of "alive vs. dead", "cancer death vs. non-cancer death", etc.

The adjacent-category logits are defined as:

\begin{array}{rcl}
L_1 &=& \log \left(\dfrac{\pi_1}{\pi_2}\right)\\
L_2 &=& \log \left(\dfrac{\pi_2}{\pi_3}\right)\\
& \vdots & \\
L_{r-1} &=& \log \left(\dfrac{\pi_{r-1}}{\pi_r}\right)
\end{array}

This is similar to a baseline-category logit model, but the baseline changes from one category to the next. Suppose we introduce covariates to the model:

\begin{array}{rcl}
L_1 &=& \beta_{10}+\beta_{11}x_1+\cdots+\beta_{1p}x_p\\
L_2 &=& \beta_{20}+\beta_{21}x_1+\cdots+\beta_{2p}x_p\\
& \vdots & \\
L_{r-1} &=& \beta_{r-1,0}+\beta_{r-1,1}x_1+\cdots+\beta_{r-1,p}x_p\\
\end{array}

It is easy to see that the \(\beta\)-coefficients from this model are linear transformations of the \(\beta\)s from the baseline-category model. To see this, suppose that we create a model in which category 1 is the baseline.

Then

\begin{array}{rcl}
\log \left(\dfrac{\pi_2}{\pi_1}\right)&=& -L_1,\\
\log \left(\dfrac{\pi_3}{\pi_1}\right)&=& -L_2-L_1,\\
& \vdots & \\
\log \left(\dfrac{\pi_r}{\pi_1}\right)&=& -L_{r-1}-\cdots-L_2-L_1
\end{array}

Without further structure, the adjacent-category model is just a reparametrization of the baseline-category model. But now, let's suppose that the effect of a covariate in each of the adjacent-category equations is the same:

\begin{array}{rcl}
L_1 &=& \alpha_1+\beta_1x_1+\cdots+\beta_p x_p\\
L_2 &=& \alpha_2+\beta_1x_1+\cdots+\beta_p x_p\\
& \vdots & \\
L_{r-1} &=& \alpha_{r-1}+\beta_1x_1+\cdots+\beta_p x_p
\end{array}

What does this model mean? Let us consider the interpretation of \(\beta_1\), the coefficient for \(x_1\). Suppose that we hold all the other \(x\)'s constant and change the value of \(x_1\). Think about the \(2\times r\) table that shows the probabilities for the outcomes \(1, 2, \ldots , r\) at a given value of \(x_1=x\) and at a new value \(x_1=x+1\):

The relationship between \(x_1\) and the response, holding all the other x-variables constant, can be described by a set of \(r -1\) odds ratios for each pair of adjacent response categories. The adjacent-category logit model says that each of these adjacent-category odds ratios is equal to \(\exp(\beta_1)\). That is, \(\beta_1\) is the change in the log-odds of falling into category \(j + 1\) versus category \(j\) when \(x_1\) increases by one unit, holding all the other \(x\)-variables constant.

This adjacent-category logit model can be fit using software for Poisson log-linear regression using a specially coded design matrix, or a log-linear model when all data are categorical. In this model, the association between the \(r\)-category response variable and \(x_1\) would be included by

the product of \(r - 1\) indicators for the response variable with
a linear contrast for \(x_1\).

This model can be fit in SAS using PROC CATMOD or PROC GENMOD and in R using the vgam() package, for example. However, we will not discuss this model further, because it is not nearly as popular as the proportional-odds cumulative-logit model, for an ordinal response, which we discuss next.

8.4 - The Proportional-Odds Cumulative Logit Model

Cumulative-logit Models for Ordinal Responses

Proportional-odds cumulative logit model is possibly the most popular model for ordinal data. This model uses cumulative probabilities up to a threshold, thereby making the whole range of ordinal categories binary at that threshold. Let the response be \(Y=1,2,\ldots, J\) where the ordering is natural. The associated probabilities are \((\pi_1,\pi_2,\ldots,\pi_J)\), and a cumulative probability of a response less than equal to \(j\) is

\(P(Y \leq j)=\pi_1+\ldots+\pi_j\)

Then, a cumulative logit is defined as

\(\log\left(\dfrac{P(Y \leq j)}{P(Y > j)}\right)=\log\left(\dfrac{P(Y \leq j)}{1-P(Y \leq j)}\right)=\log\left(\dfrac{\pi_1+\ldots+\pi_j}{\pi_{j+1}+\ldots+\pi_J}\right) \)

This is the log-odds of the event that \(Y\le j\) and measures how likely the response is to be in category \(j\) or below versus in a category higher than \(j\).

The sequence of cumulative logits may be defined as:

\begin{array}{rcl}
L_1 &=& \log \left(\dfrac{\pi_1}{\pi_2+\pi_3+\cdots+\pi_J}\right)\\
L_2 &=& \log \left(\dfrac{\pi_1+\pi_2}{\pi_3+\pi_4+\cdots+\pi_J}\right)\\
& \vdots & \\
L_{J-1} &=& \log \left(\dfrac{\pi_1+\pi_2+\cdots+\pi_{r-1}}{\pi_J}\right)
\end{array}

And with predictors incorporated into the model, we have

\begin{array}{rcl}
L_1 &=& \beta_{01}+\beta_{11}x_1+\cdots+\beta_{p1}x_p\\
L_2 &=& \beta_{02}+\beta_{12}x_1+\cdots+\beta_{p2}x_p\\
& \vdots & \\
L_{J-1} &=& \beta_{0,J-1}+\beta_{1,J-1}x_1+\cdots+\beta_{p,J-1}x_p\\
\end{array}

Notice that (unlike the adjacent-category logit model) this is not a linear reparameterization of the baseline-category model. The cumulative logits are not simple differences between the baseline-category logits. Therefore, the above model will not give a fit equivalent to that of the baseline-category model.

Now suppose that we simplify the model by requiring the coefficient of each \(x\)-variable to be identical across the \(J -1\) logit equations. Then, changing the names of the intercepts to \(\alpha\)'s, the model becomes

\begin{array}{rcl}
L_1 &=& \alpha_1+\beta_1x_1+\cdots+\beta_p X_p\\
L_2 &=& \alpha_2+\beta_1x_1+\cdots+\beta_p X_p\\
& \vdots & \\
L_{J-1} &=& \alpha_{J-1}+\beta_1x_1+\cdots+\beta_p X_p
\end{array}

This model, called the proportional-odds cumulative logit model, has \((J - 1)\) intercepts plus \(p\) slopes, for a total of \(J + p - 1\) parameters to be estimated.

Notice that intercepts can differ, but that slope for each variable stays the same across different equations! One may think of this as a set of parallel lines (or hyperplanes) with different intercepts. The proportional-odds condition forces the lines corresponding to each cumulative logit to be parallel.

Interpretation

In this model, intercept \(\alpha_j\) is the log-odds of falling into or below category \(j\) when \(x_1 = x_2 = \dots = 0\).
A single parameter \(\beta_k\) describes the effect of \(x_k\) on \(Y\) such that \(\beta_k\) is the increase in log-odds of falling into or below any category associated with a one-unit increase in \(x_k\), holding all the other predictors constant; compare this to the baseline logit model where there are \(J-1\) parameters for a single explanatory variable. Therefore, a positive slope indicates a tendency for the response level to decrease as the variable decreases.

Constant sloped \(\beta_k\): The effect of \(x_k\), is the same for all \(J-1\) ways to collapse \(Y\) into dichotomous outcomes.

For simplicity, let's consider only one predictor: \(\text{logit}[P(Y \leq j)]=\alpha_j+\beta x\)

Then the cumulative probabilities are given by: \(P(Y \leq j)=\exp(\alpha_j+\beta x)/(1+\exp(\alpha_j+\beta x))\), and since \(\beta\) is constant, the curves of cumulative probabilities plotted against \(x\) are parallel.

The odds-ratio is proportional to the difference between \(x_1\) and \(x_2\) where \(\beta\) is the constant of proportionality: \(\exp[\beta(x_1-x_2)]\) and thus the name "proportional odds model".

Question: Do you see how we get the above measure of odds-ratio?

Continuous Latent Response

One reason for the proportional-odds cumulative-logit model's popularity is its connection to the idea of a continuous latent response. Suppose that the categorical outcome is actually a categorized version of an unobservable (latent) continuous variable.

For example, it is reasonable to think that a 5-point Likert scale (1 = strongly disagree, 2 = agree, 3 = neutral, 4 = agree, 5 = strongly agree) is a coarsened version of a continuous variable Z indicating degree of approval. The continuous scale is divided into five regions by four cut-points \(c_1, c_2, c_3, c_4\) which are determined by nature (not by the investigator). If \(Z \le c_1\), we observe \(Y = 1\); if \(c1< Z \le c_2\), we observe \(Y = 2\); and so on. Here is the connection. Suppose that the \(Z\) is related to the \(x\)'s through a homoscedastic linear regression. For example, with a single \(x\), the relationship looks like this:

If the regression of \(Z\) on the \(x\)'s has the form

\(Z=\gamma_0+\gamma_1 x_1+\gamma_2 x_2+\cdots+\gamma_p x_p+\epsilon \),

where \(\epsilon\) is a random error from a logistic distribution with mean zero and constant variance, then the coarsened version \(Y\) will be related to the \(x\)'s by a proportional-odds cumulative logit model. (The logistic distribution has a bell-shaped density similar to a normal curve. If we were to have normal errors rather than logistic errors, the cumulative logit equations would change to have a probit link. In most cases, the fit of a logit and probit model are quite similar.)

If the regression of \(Z\) on the \(x\)'s is heteroscedastic—for example, if the variance increases with the mean—then the logit equations will "fan out" and not have constant slope. A model with non-constant slopes is somewhat undesirable because the lines are not parallel; the logit lines will eventually cross each other, implying negative probabilities for some categories.

CASE STUDY: The Penn State Ice Cream Study [Optional]

If time permits, you should also read and listen to the Penn State Ice Cream Case Study where Dr. Bill Harkness unravels the "mystery" of the polytomous logistic regression (through SAS, although we do provide the R code too). While doing this, please try to note:

Why does Dr. Harkness say the ordinary chi-square test is not sufficient for this type of data?
Why is this particular example quadratic in nature?
Can we use the ordinary regression model for this data? And at which point would it be OK to approximate a categorical response variable as continuous; e.g. with how many levels?
Does the proportional odds model for the ice cream data fit well?

8.5 - Fitting the Cumulative Logit Model

Example: Housing Satisfaction Revisited

Suppose we revisit the housing satisfaction example but now treat the response levels "Low", "Medium", and "High" as ordered from least to greatest satisfaction (indexed 1, 2, and 3) so that it \(\pi_1+\pi_2\) is the cumulative probability of medium satisfaction---that is, the probability of medium or less satisfaction, which makes sense now that the categories have a defined order. The cumulative logit model is then

\(\log\left(\dfrac{P(Y \leq j)}{P(Y > j)}\right)= \beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\cdots \quad\mbox{for }j=1,2\)

where the predictor set is identical to that for the baseline model. Although the odds are defined differently, this model is very similar to the baseline model in that each of \(3-1=2\) response categories gets its own logit equation (each of the \(\beta\) coefficients has indices for both its predictor and the \(j^{th}\) category), which yields the same total number of parameters and degrees of freedom for goodness of fit. In fact, if we compute the deviance statistic for the additive model (14 total parameters) against the saturated model (48 parameters), we get \(G^2=39.1570\) on 34 degrees of freedom, which is nearly identical to the test we carried out earlier with the baseline logit model (\(G^2=38.6622\)). However, if we have ordered response categories, we would prefer to work with the cumulative logit model because it allows us to consider the proportional-odds assumption, which simplifies the model as we've seen.

This model can be fit in SAS using PROC LOGISTIC as with the baseline model; we just need to change the response as ordered with the order=data option. By default, SAS will assume proportional odds, but we can temporarily disable that with the unequalslopes option.

* ordinal, not proportional odds;
* additive model;
proc logistic data=housing;
freq freq; 
class infl (ref="Low") type cont / param=ref; 
model sat (order=data) = infl type cont / link=logit 
 aggregate=(infl type cont) scale=none unequalslopes;
run;

The order=data option tells SAS to arrange the response categories from lowest to highest in the order that they arise in the dataset. There is also a descending option that tells SAS to reverse the ordering of the categories, so that the last one becomes the lowest.

This model can be fit in R using the vglm function again, but we specify family=cumulative(parallel=FALSE) to treat the response as ordinal and use the cumulative logit. This is also where we can specify the proportional odds (or parallel slopes) assumption.

fit.add.o = vglm(cbind(Low,Medium,High) ~ Infl+Type+Cont, family=cumulative(parallel=FALSE), data=housing)
g2.add.o = deviance(fit.add.o)
df.add.o = df.residual(fit.add.o)
1 - pchisq(g2.add.o, df.add.o)

Other functions in R can be used as well, but they don't all parameterize the model the same way. The functionpolr(), for example, fits the proportional odds model but with negative coefficients (similar to SAS's "decreasing" option). So, we have to be careful!

Having just observed that the additive cumulative logit model fits the data well, let's see if we can reduce it further with the proportional odds assumption. The logit equations would be

\begin{array}{rcl} L_1 &=& \log \dfrac{P(Y \leq 1)}{P(Y > 1)}=\alpha_1+\beta_1I_1+\beta_2I_2+\beta_3T_1+\beta_4T_2+\beta_5T_3+\beta_6C\\ L_2 &=& \log \dfrac{P(Y \leq 2)}{P(Y > 2)}=\alpha_2+\beta_1I_1+\beta_2I_2+\beta_3T_1+\beta_4T_2+\beta_5T_3+\beta_6C \\ \end{array}

where \(I_1\) and \(I_2\) are the indicators for medium and high influence, \(T_1\), \(T_2\), and \(T_3\) are the indicators for apartment, atrium, and terrace types, and \(C\) is the indicator for high contact with residents.

Each \(\beta\) coefficient represents the increase in log odds of satisfaction \(\le j\) when going from the baseline to the group corresponding to that coefficient's indicator, given other groups are fixed. And if we exponentiate this, we have the odds ratio. So, for example, \(\exp(\beta_1)\) represents the odds ratio for satisfaction \(\le j\) when comparing those with medium perceived influence against those with low perceived influence on management.

The defining property of the proportional odds assumption is that these coefficients don't depend on \(j\). But that's not to say that the probability or odds of all response categories are assumed to be equal. The intercepts \(\alpha_j\) take care of that.

One tricky part of interpreting these cumulative logit model parameters lies in the fact that the cumulative probability applies to response \(j\) or less. If a \(\beta\) coefficient is positive, it means that increasing its predictor (or going 0 to 1 if the predictor is an indicator) is associated with an increase in the probability of \(j\) or less, which means that the response is more likely to be small if the predictor is large (the odds of the response being small is higher). This seems counterintuitive at first but is inherent to the way the cumulative probability is defined.

When we fit the proportional odds model in SAS, we first see this output section:


Score Test for the Proportional Odds Assumption
Chi-Square	DF	Pr > ChiSq
8.2188	6	0.2225

This reports a test of the proportional odds assumption, versus the same model without proportional odds. The proportional odds assumption means each of the 6 predictor indicators has only one \(\beta\) coefficient for both cumulative logit expressions, giving a difference of 6 parameters for this test. The chi-square test statistic of 8.2188 is not significant evidence to reject the proportional odds assumption, and so we retain it for simplicity's sake.

At this point, we can look into the significance of the predictors to see if any can be removed; the "Type 3" output table is particularly convenient for this.


Type 3 Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
Infl	2	104.4617	<.0001
Type	3	54.6096	<.0001
Cont	1	14.2770	0.0002

Since all are significant, we will not be able to reduce the model further. We'll next look at the individual coefficient estimates.


Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept	Low	1	-0.4961	0.1245	15.8683	<.0001
Intercept	Medium	1	0.6907	0.1252	30.4317	<.0001
Infl	High	1	-1.2888	0.1267	103.4616	<.0001
Infl	Medium	1	-0.5664	0.1050	29.1180	<.0001
Type	Apartmen	1	0.5723	0.1187	23.2307	<.0001
Type	Atrium	1	0.3662	0.1568	5.4558	0.0195
Type	Terrace	1	1.0910	0.1515	51.8504	<.0001
Cont	High	1	-0.3603	0.0954	14.2770	0.0002

The first intercept value is the estimated log-odds of low satisfaction (versus medium or high) when all predictors are zero (at baseline levels). So, for individuals with low perceived influence, housing type tower, and low contact with other residents, the odds of low satisfaction is estimated to be \(\exp(-0.4961)=0.6089\). Likewise, the estimated log odds of medium or less satisfaction (versus high) is \(\exp(0.69071)=1.995\) for this same (baseline) group of individuals.

For the \(\beta\) coefficients, we can say that for those with high perceived influence, the odds of low satisfaction (versus medium or high) is \(\exp(-1.2888)=0.2756\), compared with those with low perceived influence. Moreover, by the proportional odds assumption, this is also the estimated odds ratio for low or medium satisfaction (versus high), when comparing those two influence groups. In other words, having a higher perceived influence on management is associated with higher satisfaction because it has a lower odds of being in a small satisfaction category.

As with the baseline-logit model, we can add 95% confidence limits to the odds ratio estimates for further inference. None of these include the value 1, which indicates that these predictors are all related to the satisfaction of the individuals.


Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
Infl High vs Low	0.276	0.215	0.353
Infl Medium vs Low	0.568	0.462	0.697
Type Apartmen vs Tower	1.772	1.404	2.237
Type Atrium vs Tower	1.442	1.061	1.961
Type Terrace vs Tower	2.977	2.212	4.007
Cont High vs Low	0.697	0.579	0.841

To fit the cumulative logit model in R with the proportional odds assumption, we specify parallel=TRUE.

# ordinal, proportional odds
fit.add.op = vglm(cbind(Low,Medium,High) ~ Infl+Type+Cont, 
 family=cumulative(parallel=TRUE), data=housing)

# testing proportional odds assumption
g2.prop = 2*(logLik(fit.add.o) - logLik(fit.add.op))
df.prop = df.residual(fit.add.op) - df.residual(fit.add.o)
1 - pchisq(g2.prop, df.prop)

This code below the model fit calculates a test of the proportional odds assumption, versus the same model without proportional odds, which was fit earlier. The proportional odds assumption means each of the 6 predictor indicators has only one \(\beta\) coefficient for both cumulative logit expressions, giving a difference of 6 parameters for this test. The LRT statistic of 8.57 is not significant evidence to reject the proportional odds assumption, and so we retain it for simplicity's sake.

At this point, we can look into the significance of the predictors to see if any can be removed; the "anova" output table is particularly convenient for this.

> anova(fit.add.op, type=3, test="LR")
Analysis of Deviance Table (Type III tests: each term added last)

Model: 'cumulative', 'VGAMordinal', 'VGAMcategorical'
Links: 'logitlink'
Response: cbind(Low, Medium, High)

     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
Infl  2  108.239        42    155.967 < 2.2e-16 ***
Type  3   55.910        43    103.638 4.391e-12 ***
Cont  1   14.306        41     62.034 0.0001554 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since all are significant, we will not be able to reduce the model further. We'll next look at the individual coefficient estimates.

> summary(fit.add.op)

Call:
vglm(formula = cbind(Low, Medium, High) ~ Infl + Type + Cont, 
    family = cumulative(parallel = TRUE), data = housing)

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)    
(Intercept):1 -0.49614    0.12454  -3.984 6.78e-05 ***
(Intercept):2  0.69071    0.12521   5.516 3.46e-08 ***
InflMedium    -0.56639    0.10496  -5.396 6.81e-08 ***
InflHigh      -1.28882    0.12670 -10.172  < 2e-16 ***
TypeApartment  0.57235    0.11875   4.820 1.44e-06 ***
TypeAtrium     0.36619    0.15677   2.336 0.019498 *  
TypeTerrace    1.09101    0.15151   7.201 5.99e-13 ***
ContHigh      -0.36028    0.09536  -3.778 0.000158 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])

Residual deviance: 47.7276 on 40 degrees of freedom

Log-likelihood: -123.432 on 40 degrees of freedom

The first intercept value is the estimated log-odds of low satisfaction (versus medium or high) when all predictors are zero (at baseline levels). So, for individuals with low perceived influence, housing type tower, and low contact with other residents, the odds of low satisfaction is estimated to be \(\exp(-0.4961)=0.6089\). Likewise, the estimated log odds of medium or less satisfaction (versus high) is \(\exp(0.69071)=1.995\) for this same (baseline) group of individuals.

For the \(\beta\) coefficients, we can say that for those with high perceived influence, the odds of low satisfaction (versus medium or high) is \(\exp(-1.2888)=0.2756\), compared with those with low perceived influence. Moreover, by the proportional odds assumption, this is also the estimated odds ratio for low or medium satisfaction (versus high), when comparing those same two influence groups. In other words, having a higher perceived influence on management is associated with higher satisfaction because it has a lower odds of being in a small satisfaction category.

As with the baseline-logit model, we can add 95% confidence limits to the odds ratio estimates for further inference. None of these include the value 1, which indicates that these predictors are all related to the satisfaction of the individuals.

> exp(coef(fit.add.op))
(Intercept):1 (Intercept):2    InflMedium      InflHigh 
    0.6088794     1.9951283     0.5675685     0.2755961 
TypeApartment    TypeAtrium   TypeTerrace      ContHigh 
    1.7724273     1.4422234     2.9772934     0.6974781 
> exp(confint(fit.add.op))
                  2.5 %    97.5 %
(Intercept):1 0.4770040 0.7772139
(Intercept):2 1.5609539 2.5500670
InflMedium    0.4620337 0.6972090
InflHigh      0.2149917 0.3532843
TypeApartment 1.4043990 2.2368989
TypeAtrium    1.0607016 1.9609740
TypeTerrace   2.2123455 4.0067322
ContHigh      0.5785784 0.8408122

8.6 - Lesson 8 Summary

In this lesson, we extended the binary logistic regression model to account for response variables with more than two levels. These may be nominal or ordinal, and the proportional odds model allows us to utilize that ordinal nature to reduce the number of parameters involved and to simplify the model. Each reduces to the binary logistic regression model we had seen earlier in the event of two response categories.

One challenge that comes with the increased generality of multinomial responses is the way that odds can be defined and interpreted. In the case of nominal categories, we choose one category, in particular, to serve as the baseline, and other categories have odds defined relative to that. In the case of ordinal data, we can either define the odds of an event relative to an adjacent event or as cumulative odds, which effectively combines all events equal to or less than one, relative to all events greater. Keeping track of these events is no easy feat!

In the next lesson, we move away from categorical response variables that fall necessarily within a certain range and consider counting the number of events that occur, as a quantitative, whole number. For this, we'll need the Poisson distribution. Fortunately, much of the basic principles that we've dealt with in binary and multinomial logistic regression will carry over.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility

8: Multinomial Logistic Regression Models

Overview

Lesson 8 Code Files

R Files

SAS Files

Useful Links

Reading

8.1 - Polytomous (Multinomial) Logistic Regression

Grouped versus ungrouped responses

Sampling Model

Describing polytomous responses by a sequence of binary models

8.2 - Baseline-Category Logit Model

Goal:

Variables:

Model:

Interpretation of Parameter Estimates

Goodness of Fit

Overdispersion

8.2.1 - Example: Housing Satisfaction in SAS

Example: Housing Data (using SAS)

Specifying the Model

Output

Interpreting Parameter Estimates

Model Selection

8.2.2 - Example: Housing Satisfaction in R

Example: Housing Data (using R)

Specifying the Model

Interpreting Parameter Estimates

Model Selection

8.3 - Adjacent-Category Logits

8.4 - The Proportional-Odds Cumulative Logit Model

Cumulative-logit Models for Ordinal Responses

Interpretation

Continuous Latent Response

CASE STUDY: The Penn State Ice Cream Study [Optional]

8.5 - Fitting the Cumulative Logit Model

Example: Housing Satisfaction Revisited

8.6 - Lesson 8 Summary

Fat	Rating category
Fat	1	2	3	4	5	6	7	8	9
0	4	17	8	16	5	6	4	2	1
4	1	1	5	6	7	9	21	12	0
...	...	...	...	...	...	...	...	...	...
28	4	6	9	11	5	9	7	8	3

Fat	Rating category
Fat	1	2	3	4	5	6	7	8	9
0	4	17	8	16	5	6	4	2	1
4	1	1	5	6	7	9	21	12	0
...	...	...	...	...	...	...	...	...	...
28	4	6	9	11	5	9	7	8	3

Fat	Rating category
Fat	1	2	3	4	5	6	7	8	9
0	4	17	8	16	5	6	4	2	1
4	1	1	5	6	7	9	21	12	0
...	...	...	...	...	...	...	...	...	...
28	4	6	9	11	5	9	7	8	3