8.2 - Baseline-Category Logit Model

Goal:

Give a simultaneous representation (summary) of the odds of being in one category relative to being in a designated category, called the baseline category, for all pairs of categories. This is an extension of the binary logistic regression model, in which we consider \(r − 1\) non-redundant logits.

Variables:

Suppose that the response vector for observation \(i\),

\(y_i=(y_{i1},y_{i2},\ldots,y_{ir})^T \),

has a multinomial distribution with index \(n_i=\sum_{j=1}^r y_{ij}\) and parameter

\(\pi_i=(\pi_{i1},\pi_{i2},\ldots,\pi_{ir})^T \).

A set of explanatory variables \(x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})\) can be discrete, continuous, or a combination of both. When the explanatory/predictor variables are all categorical, the baseline category logit model has an equivalent loglinear model.

Model:

When the response categories \(1, 2,\ldots, r\) are unordered, the most popular way to relate \(\pi_i\) to covariates is through a set of \(r − 1\) baseline-category logits. Taking \(j^*\) as the baseline category, the model is

\(\log\left(\dfrac{\pi_{ij}}{\pi_{ij^\ast}}\right)=x_i^T \beta_j,\qquad j \neq j^\ast\)

Note here that \(x_i\), which has length \(p\), represents the vector of terms needed to include all the predictors and any interactions of interest. We may also take the first element to be \(x_{i1}=1\), so that an intercept is included in this notation as well. The coefficient vector is then

\(\beta_j=(\beta_{1j},\beta_{2j},\ldots,\beta_{pj})^T,\qquad j \neq j^\ast\)

We can choose the baseline category ourselves or let the software choose it automatically.
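A minimal sketch in R, using multinom() from the nnet package (one common choice for fitting this model). The data frame dat and the variables outcome, dose, and age are hypothetical names used only for illustration:

    library(nnet)

    # multinom() treats the first level of the response factor as the
    # baseline category; relevel() lets us choose it explicitly.
    dat$outcome <- relevel(factor(dat$outcome), ref = "alive")

    # Fit the baseline-category logit model: one coefficient vector
    # beta_j is estimated for each non-baseline category j.
    fit <- multinom(outcome ~ dose + age, data = dat)
    summary(fit)  # one row of coefficients for each logit (j != j*)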

Questions: How many logits does the model have for the study of the effect of radiation exposure on mortality? Can you write these models down if the baseline level is "being alive"?

 

Interpretation of Parameter Estimates

  • The \(k^{th}\) element of \(\beta_j\) can be interpreted as the increase in log-odds of falling into category \(j\) versus category \(j^*\) resulting from a one-unit increase in the \(k^{th}\) predictor term, holding the other terms constant (see the sketch after this list).
  • Removing the \(k^{th}\) term from the model is equivalent to simultaneously setting \(r − 1\) coefficients to zero.
  • Any of the categories can be chosen to be the baseline. The model will fit equally well, achieving the same likelihood and producing the same fitted values; only the values and interpretation of the coefficients will change. However, to develop a sensible model, a 'natural' baseline is usually chosen. If the response is ordinal, the highest or the lowest category in the ordinal scale is usually chosen.
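Continuing the hypothetical fit above, exponentiating the coefficients turns the log-odds interpretation into odds ratios against the baseline:

    # exp(beta_kj) is the multiplicative change in the odds of category j
    # versus the baseline per one-unit increase in the k-th term,
    # holding the other terms constant.
    exp(coef(fit))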

To calculate the \(\pi_{ij}\) values, we have for the non-baseline categories (\(j \ne j^*\))

\(\pi_{ij}=\dfrac{\exp(x_i^T \beta_j)}{1+\sum_{k \neq j^\ast}\exp(x_i^T \beta_k)}\)

For the baseline category, the probability is

\(\pi_{ij^\ast}=\dfrac{1}{1+\sum_{k \neq j^\ast}\exp(x_i^T \beta_k)}\)

Note that as expected, we have \(\sum_{j=1}^r\pi_{ij}=1\).
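A small sketch of these two formulas in R, assuming B is a \(p \times (r-1)\) matrix whose columns are the \(\beta_j\) for the non-baseline categories and x is the covariate vector \(x_i\) (both hypothetical names):

    eta   <- drop(t(B) %*% x)    # x_i^T beta_j for each j != j*
    denom <- 1 + sum(exp(eta))
    pi_nb <- exp(eta) / denom    # non-baseline probabilities
    pi_b  <- 1 / denom           # baseline probability
    sum(pi_nb) + pi_b            # equals 1, as expected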

Goodness of Fit

If the estimated expected counts \(\hat{\mu}_{ij}=n_i \hat{\pi}_{ij}\) are large enough, we can test the fit of our model against a saturated model that estimates \(\pi_i\) separately for each \(i = 1,\ldots, N\). If the data are grouped, the deviance for comparing this model to the saturated one is

\(G^2=2\sum\limits_{i=1}^N \sum\limits_{j=1}^r y_{ij}\log\dfrac{y_{ij}}{\hat{\mu}_{ij}}\)

The saturated model has \(N(r − 1)\) free parameters, and the current model (i.e., the model we are fitting) has \(p(r − 1)\), where \(p\) is the length of \(x_i\), so the degrees of freedom are

\(df=(N-p)(r-1)\)

The corresponding Pearson statistic is

\(X^2=\sum\limits_{i=1}^N \sum\limits_{j=1}^r r^2_{ij}\)

where

\(r_{ij}=\dfrac{y_{ij}-\hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}\)

is the Pearson residual.
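Both statistics are easy to compute directly; a sketch in R, assuming y is the \(N \times r\) matrix of grouped counts, pihat is the matching matrix of fitted probabilities (for the hypothetical multinom fit above, predict(fit, type = "probs")), and p is the length of \(x_i\):

    muhat <- rowSums(y) * pihat    # expected counts n_i * pihat_ij
    G2 <- 2 * sum(y * log(y / muhat), na.rm = TRUE)  # 0*log(0) cells drop out
    X2 <- sum((y - muhat)^2 / muhat)                 # sum of squared Pearson residuals
    df <- (nrow(y) - p) * (ncol(y) - 1)              # (N - p)(r - 1)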

Under the null hypothesis that the current model is correct, both statistics are approximately distributed as \(\chi^2_{df}\) provided that

  • no more than 20% of the \(\hat{\mu}_{ij}\)'s are below 5.0, and
  • none are below 1.0.

In practice, this is often not satisfied, so there may be no way to assess the overall fit of the model. However, we may still apply a \(\chi^2\) approximation to \(G^2\) when comparing nested models, provided that \((N − p)(r − 1)\) is large relative to the difference in the number of parameters between them.
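For example, continuing the hypothetical sketch above, two nested fits can be compared through the change in deviance:

    fit0 <- multinom(outcome ~ dose, data = dat)        # reduced model
    fit1 <- multinom(outcome ~ dose + age, data = dat)  # full model
    anova(fit0, fit1)  # LR statistic, approx. chi-square on the df difference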

Overdispersion

As before, overdispersion means that the actual covariance matrix of \(Y_i\) exceeds that specified by the multinomial model,

\(V(Y_i)=n_i \left[\text{Diag}(\pi_i)-\pi_i \pi_i^T\right]\)

The expression above writes the multinomial dispersion matrix in vector form rather than element-wise: \(\text{Diag}(\pi_i)\) is the diagonal matrix with the probabilities \(\pi_{ij}\) on its diagonal, and \(\pi_i \pi_i^T\) is the outer product of the probability vector \(\pi_i\) with its transpose.
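In R, this matrix is a one-liner, assuming pi_i holds the probability vector and n_i the multinomial index for observation \(i\):

    V_i <- n_i * (diag(pi_i) - tcrossprod(pi_i))  # tcrossprod(pi_i) = pi_i %*% t(pi_i)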

It is reasonable to consider overdispersion if

  • the data are grouped (\(n_i\)'s are greater than 1),
  • \(x_i\) already contains all covariates worth considering, and
  • the overall \(X^2\) is substantially larger than its degrees of freedom \((N − p)(r − 1)\).

In this situation, it may be worthwhile to introduce a scale parameter \(\sigma^2\), so that

\(V(Y_i)=n_i \sigma^2 \left[\text{Diag}(\pi_i)-\pi_i \pi_i^T\right]\)

Recall that the usual estimate for \(\sigma^2\) is obtained from the maximal model (not saturated) by dividing the Pearson chi-square statistic by its degrees of freedom:

\(\hat{\sigma}^2=\dfrac{X^2}{(N-p)(r-1)}\)

which is approximately unbiased if \((N − p)(r − 1)\) is large. Introducing a scale parameter does not alter the estimate of \(\beta\) (which then becomes a quasi-likelihood estimate), but it does alter our estimate of the variability of \(\hat{\beta}\). If we estimate a scale parameter, we should

  • multiply the estimated ML covariance matrix for \(\hat{\beta}\) by \(\hat{\sigma}^2\) (SAS does this automatically; in R, it depends which function we use);
  • divide the usual Pearson residuals by \(\hat{\sigma}=\sqrt{\hat{\sigma}^2}\); and
  • divide the usual \(X^2\) and \(G^2\) statistics by \(\hat{\sigma}^2\) (SAS reports these as scaled statistics; in R, it depends which function we use).

These adjustments will have little practical effect unless the estimated scale parameter is substantially greater than 1.0 (say, 1.2 or higher).
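Putting the adjustments together in R, reusing X2, G2, df, muhat, and y from the goodness-of-fit sketch and fit from the model sketch above:

    sigma2hat <- X2 / df                # estimated scale parameter
    vcov_adj  <- sigma2hat * vcov(fit)  # inflated covariance matrix for beta-hat
    r_scaled  <- (y - muhat) / sqrt(muhat) / sqrt(sigma2hat)  # scaled Pearson residuals
    X2_scaled <- X2 / sigma2hat         # scaled fit statistics
    G2_scaled <- G2 / sigma2hat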