
For R, there are a number of different packages and functions we can use: for example, vglm() from the VGAM package, multinom() from the nnet package, or mlogit() from the mlogit package; see the links on the first page of these lecture notes and the later examples. We will focus on vglm().

The table below (discussed in Agresti (2007), Sec 6.1.2 or Agresti (2013), Sec 8.1.2), reported by Delany and Moore (1987), comes from a study of the primary food choices of alligators in four Florida lakes. Researchers classified the stomach contents of 219 captured alligators into five categories: Fish (the most common primary food choice), Invertebrate (snails, insects, crayfish, etc.), Reptile (turtles, alligators), Bird, and Other (amphibians, plants, household pets, stones, and other debris).

Let's describe these data by a baseline-category logit model, with Primary Food Choice as the outcome and Lake (Hancock, Oklawaha, Trafford, George), Sex (male (m) and female (f)), and Size (small (<2.3 meters) and large (>2.3 meters)) as covariates.

 

[Table: counts of primary food choice (Fish, Invertebrate, Reptile, Bird, Other) cross-classified by lake, sex, and size; the same counts appear in the data listing below.]

Because the usual primary food choice of alligators appears to be fish, we'll use fish as the baseline category; the four logit equations will then describe the log-odds that alligators select other primary food types instead of fish.

We let

π1 = prob. of fish,
π2 = prob. of invertebrates,
π3 = prob. of reptiles,
π4 = prob. of birds,
π5 = prob. of other,

and take "fish" as the baseline category. The logit equations are

\(\log\left(\dfrac{\pi_j}{\pi_1}\right)=\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{5j}X_5\)

for j = 2, 3, 4, 5. The X's included

  • three dummy indicators for lake,
  • a dummy for sex, and
  • a dummy for size.

Therefore, each logit equation had six coefficients to be estimated, so the number of free parameters in this model is 4 × 6 = 24.

Entering the Data

When the data are grouped, they could be entered as we did in SAS in the previous section, where the response categories 1, 2, . . . , r appear in a single column of the dataset, with another column containing the frequency or count. The lines that have a frequency of zero are not actually used in the modeling, because they contribute nothing to the loglikelihood. Alternatively, the data can be entered as a flat table, as we see above and in gator.txt, which is more convenient for the vglm() function from the VGAM package in R:

 profile Gender Size     Lake Fish Invertebrate Reptile Bird Other
1        1      f <2.3   george    3            9       1    0     1
2        2      m <2.3   george   13           10       0    2     2
3        3      f >2.3   george    8            1       0    0     1
4        4      m >2.3   george    9            0       0    1     2
5        5      f <2.3  hancock   16            3       2    2     3
6        6      m <2.3  hancock    7            1       0    0     5
7        7      f >2.3  hancock    3            0       1    2     3
8        8      m >2.3  hancock    4            0       0    1     2
9        9      f <2.3 oklawaha    3            9       1    0     2
10      10      m <2.3 oklawaha    2            2       0    0     1
11      11      f >2.3 oklawaha    0            1       0    1     0
12      12      m >2.3 oklawaha   13            7       6    0     0
13      13      f <2.3 trafford    2            4       1    1     4
14      14      m <2.3 trafford    3            7       1    0     1
15      15      f >2.3 trafford    0            1       0    0     0
16      16      m >2.3 trafford    8            6       6    3     5

There are 16 rows here, one for each of the N = 16 unique profiles, that is, the possible combinations of the predictor variables (4 × 2 × 2 = 16).
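As a quick sanity check (a sketch, assuming the flat table has been read into a data frame called gator, as in the next section), the counts over all 16 profiles should add up to the 219 alligators mentioned earlier:

counts = as.matrix(gator[, c("Fish", "Invertebrate", "Reptile", "Bird", "Other")])
sum(counts)        # 219 alligators in total
rowSums(counts)    # total stomach-content count for each of the 16 profiles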

Specifying the Model

IMPORTANT: Install the VGAM package in order to use the vglm() function to fit multinomial logit models; there are other packages in R that could be used to fit these types of models.

install.packages("VGAM")
library(VGAM)
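The data are read in the same way as always; here is a minimal sketch (assuming gator.txt has a header row and sits in the working directory):

gator = read.table("gator.txt", header = TRUE, stringsAsFactors = TRUE)
str(gator)   # Lake, Size, and Gender should be factors; note their levels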

A good practice is to check that the categorical predictors are actually factors and what their baseline levels are. In the following code, we set up the same baseline levels as in the SAS analysis of the previous section, so that the analysis here corresponds to the one we did in SAS (although SAS provides many more details than R in this case). For example, to set Hancock as the baseline level for the variable Lake (notice that the Hancock row of the resulting contrast matrix is all zeros), we can use the contrasts() function:

##set the ref levels so that R output matches the SAS code
##sets Hancock as the baseline level
contrasts(gator$Lake)=contr.treatment(levels(gator$Lake),base=2)
contrasts(gator$Lake)

         george oklawaha trafford
george        1        0        0
hancock       0        0        0
oklawaha      0        1        0
trafford      0        0        1

To set up "fish" as the baseline, this can be done in the vglm() call. By default vglm() will use the last category to create the logits. Let see how this is done if we want to fit a model with Lake, Size and Gender as predictors and 'Fish' as the baseline for logits: 

fit5=vglm(cbind(Bird,Invertebrate,Reptile,Other,Fish)~Lake+Size+Gender, data=gator, family=multinomial)
summary(fit5)

To specify the response function, use the family=multinomial option. The above code will fit 4 logit models. See the R program alligator.R and the data file gator.txt (and possibly alligator.dat for any additional explorations on your own).

Here is the output indicating that there are 4 logit models (i.e., linear predictors) being fitted and that this is a baseline-category logit model with the last category (category 5) as the baseline level; this is "Fish" from cbind(Bird,Invertebrate,Reptile,Other,Fish) in the above call.

Number of linear predictors:  4

Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])

The four logit equations predict the log-odds of

  • birds versus fish (and from the R output: 1 vs 5),
  • invertebrates versus fish (2 vs 5),
  • other versus fish (4 vs 5), and
  • reptiles versus fish (3 vs 5).

Here is the output pertaining to the "goodness of fit".

Dispersion Parameter for multinomial family:   1

Residual deviance: 50.26369 on 40 degrees of freedom

Log-likelihood: -73.32208 on 40 degrees of freedom


 There are N = 16 profiles (unique combinations of lake, sex and size) in this dataset. Recall that Residual deviance tests the fit of the current model versus the saturated model. The saturated model, which fits a separate multinomial distribution to each profile, has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.

Think about the following question:

Does the model fit well?
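One way to answer this is to compare the residual deviance to a chi-square distribution with the residual degrees of freedom (a quick sketch, using the values printed above):

1 - pchisq(50.26369, df = 40)   # p-value roughly 0.13, so no strong evidence of lack of fit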

Output pertaining to the significance of covariates along with the estimated coefficients and their standard errors is given:

Coefficients:
               Estimate Std. Error  z value
(Intercept):1  -2.46327    0.77390 -3.18294
(Intercept):2  -2.07445    0.61165 -3.39156
(Intercept):3  -2.91414    0.88564 -3.29043
(Intercept):4  -0.91673    0.47817 -1.91716
Lakegeorge:1   -0.57527    0.79522 -0.72341
Lakegeorge:2    1.78051    0.62321  2.85700
Lakegeorge:3   -1.12946    1.19280 -0.94690
Lakegeorge:4   -0.76658    0.56855 -1.34830
Lakeoklawaha:1 -1.12562    1.19230 -0.94407
Lakeoklawaha:2  2.69369    0.66925  4.02493
Lakeoklawaha:3  1.40080    0.81048  1.72835
Lakeoklawaha:4 -0.74052    0.74214 -0.99781
Laketrafford:1  0.66172    0.84607  0.78212
Laketrafford:2  2.93633    0.68740  4.27165
Laketrafford:3  1.93159    0.82532  2.34041
Laketrafford:4  0.79119    0.58794  1.34570
Size>2.3:1      0.73024    0.65228  1.11952
Size>2.3:2     -1.33626    0.41119 -3.24971
Size>2.3:3      0.55704    0.64661  0.86147
Size>2.3:4     -0.29058    0.45993 -0.63180
Genderf:1       0.60643    0.68884  0.88036
Genderf:2       0.46296    0.39552  1.17051
Genderf:3       0.62756    0.68528  0.91578
Genderf:4       0.25257    0.46635  0.54159

From the z-values, that is, the Wald statistics equal to (estimate − 0)/se(estimate), we can see that, for example, none of the coefficients for Gender are significant. Recall that the squared z-value is a Wald chi-square statistic with 1 df; e.g., for Genderf:4, which is the parameter for Gender at level "female" and food "other", the Wald statistic is 0.5416^2 = 0.2933 with 1 df, and the p-value is 0.588 (see the short check after the list below). Considering the other covariates in a similar way, and judging from the above results, it seems that

  • lake has an effect on food choice
  • size has an effect on food choice; and
  • gender does not have a discernible effect.
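As a quick check of the p-value mentioned above (a sketch using the z value printed in the summary):

z = 0.54159                 # z value for Genderf:4 from summary(fit5)
2 * (1 - pnorm(abs(z)))     # two-sided p-value, about 0.588
1 - pchisq(z^2, df = 1)     # equivalently, z^2 treated as a Wald chi-square on 1 df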

This suggests that we should probably remove sex from the model. We also may want to look for interactions between lake and size.

But before we do that, let's look more carefully at the estimated coefficients of the current model and their associated estimated odds ratios:

> exp(coefficients(fit5))
 (Intercept):1  (Intercept):2  (Intercept):3  (Intercept):4   Lakegeorge:1   Lakegeorge:2   Lakegeorge:3
    0.08515564     0.12562535     0.05425079     0.39982590     0.56255501     5.93289534     0.32320675
  Lakegeorge:4 Lakeoklawaha:1 Lakeoklawaha:2 Lakeoklawaha:3 Lakeoklawaha:4 Laketrafford:1 Laketrafford:2
    0.46460153     0.32445217    14.78619873     4.05843171     0.47686707     1.93813088    18.84663158
Laketrafford:3 Laketrafford:4     Size>2.3:1     Size>2.3:2     Size>2.3:3     Size>2.3:4      Genderf:1
    6.90044926     2.20601430     2.07557742     0.26282653     1.74549121     0.74782762     1.83387028
     Genderf:2      Genderf:3      Genderf:4
    1.58877439     1.87303229     1.28732894

How do we interpret them? Recall that there are four logit equations to predict the log-odds of

  • birds versus fish (and from the R output: 1 vs 5),
  • invertebrates versus fish (2 vs 5),
  • other versus fish (4 vs 5), and
  • reptiles versus fish (3 vs 5).

That is, the equation for the log-odds of food type food ∈ {Bird, Invertebrate, Other, Reptile} relative to Fish (F), for lake (l), size (s), and sex (g), is

\[\log\frac{\pi_{food,lsg}}{\pi_{F,lsg}}=\alpha_{food} +\beta_{food,l}^L+\beta_{food,s}^S+\beta_{food,g}^G\]

where \(\alpha_{Fish}=0\) and \(\beta_{food,Hancock}^L=\beta_{food,small}^S=\beta_{food,male}^G=0\) for food = Fish, Bird, Invertebrate, Other, Reptile, indicating the baseline category and the reference levels of the predictors.

For example, the estimated prediction equation for the log-odds of birds relative to fish is \[\log\frac{\hat{\pi}_{Bird,lsg}}{\hat{\pi}_{Fish,lsg}}=-2.4633-1.1256\,\text{Oklawaha}+0.6617\,\text{Trafford}-0.5753\,\text{George}+0.7302\,\text{large}+0.6064\,\text{female}\]

The intercepts give the estimated log-odds for the reference group lake = Hancock, size = small, sex = male. For example, the estimated log-odds of birds versus fish in this group is −2.4633; the estimated log-odds of invertebrates versus fish is −2.0744; and so on.

The lake effect is characterized by three dummy coefficients in each of the four logit equations. The estimated coefficient for the Lake Oklawaha dummy in the bird-versus-fish equation is −1.1256 with standard error 1.1923. This means that alligators in Lake Oklawaha are less likely to choose birds over fish than their colleagues in Lake Hancock are; in other words, relative to fish, birds appear to be a less common food choice in Lake Oklawaha than in Lake Hancock. The estimated odds ratio exp(−1.1256) = 0.32 is the same for alligators of all sexes and sizes, because this is a model with main effects but no interactions; see the entry "Lakeoklawaha:1" in the table from the R command exp(coefficients(fit5)). However, this coefficient is not statistically significant: the Wald X2 = 0.8913 (p-value = 0.345), which agrees with the 95% CI for the odds-ratio estimate, (0.031, 3.358), containing 1. This CI can be computed as:

> exp(-1.1256+1.96*1.19230)
[1] 3.357874
> exp(-1.1256-1.96*1.19230)
[1] 0.03135103
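The same kind of Wald interval can be computed for every coefficient at once from the fitted model's covariance matrix (a sketch, assuming fit5 from above):

est = coef(fit5)              # estimated coefficients on the log-odds scale
se = sqrt(diag(vcov(fit5)))   # Wald standard errors
round(exp(cbind(lower = est - 1.96*se, estimate = est, upper = est + 1.96*se)), 3)   # odds-ratio scale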

On the other hand, the estimated coefficient for the Lake Oklawaha dummy in the invertebrates-versus-fish equation is 2.6937 (standard error 0.6692, p-value < 0.001) and highly significant. The estimated odds ratio of 14.786 (with 95% CI [3.983, 54.893]) implies that alligators in Lake Oklawaha are more likely to choose invertebrates over fish than their colleagues in Lake Hancock are.

The above output also confirms that there is no significant sex effect and that size only matters for invertebrates versus fish food choice.

Now, let's turn to finding the "best" model.

Model Selection

Let's adopt an analysis-of-deviance approach to compare various models, and test for overall significance of each variable.

First, let's find the deviance G2 for the null (intercept-only) model, a model with just four parameters. Because there are N = 4 × 2 × 2 = 16 unique covariate patterns, the saturated model will have 16 × (5 − 1) = 64 parameters, so the G2 statistic for the null model should have 64 − 4 = 60 degrees of freedom. Let's fit the null model in R, like this (e.g., see alligator.R):

# intercept only
fit0=vglm(cbind(Bird,Invertebrate,Reptile,Other,Fish)~1, data=gator, family=multinomial)

The fit statistics are:

> fit0
Degrees of Freedom: 64 Total; 60 Residual
Residual deviance: 116.7611
Log-likelihood: -106.5708

We can also use

> deviance(fit0)
[1] 116.7611

Repeating the model-fitting for various sets of predictors, we obtain the following analysis-of-deviance table:

Model                          G2      df
Saturated                       0.00    0
Lake + Size + Lake × Size**    35.40   32
Lake + Size + Sex              50.26   40
Lake + Size                    52.48   44
Lake                           73.57   48
Size                          101.61   56
Sex                           114.66   56
Null                          116.76   60
**Note: this fit did not converge properly; see the discussion of quasi-separation below.

As before, we can use the table above for model selection by comparing nested models via ΔG2 and Δdf. Our "final" model will have main effects for lake and size but no effect for sex, because comparing "Lake + Size" with "Lake + Size + Sex" gives ΔG2 = 52.48 − 50.26 = 2.22 on 44 − 40 = 4 df, which is not significant.
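A sketch of that comparison in R (fit4 is the "Lake + Size" model, the same fit used in the Diagnostics section below; fit5 is the model with Gender fitted earlier):

fit4=vglm(cbind(Bird,Invertebrate,Reptile,Other,Fish)~Lake+Size, data=gator, family=multinomial)
dG2 = deviance(fit4) - deviance(fit5)   # 52.48 - 50.26 = 2.22
1 - pchisq(dG2, df = 4)                 # df = 44 - 40 = 4; p is about 0.70, so Gender can be dropped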

We will see more of the "Lake + Size" model, but before we do that, notice that we ran into trouble when we included the lake × size interaction. SAS will give you a warning (see the SAS analysis on the previous page) that the validity of the model is questionable, but R with vglm() will NOT. Here is the output from the model with the lake × size interaction:

Coefficients:
                         Estimate Std. Error    z value
(Intercept):1            -2.44235    0.73721 -3.3129607
(Intercept):2            -1.74920    0.54174 -3.2288767
(Intercept):3            -2.44235    0.73721 -3.3129607
(Intercept):4            -1.05605    0.41046 -2.5728451
Lakegeorge:1              0.36291    1.05166  0.3450804
Lakegeorge:2              1.92105    0.63923  3.0052754
Lakegeorge:3             -0.33024    1.26727 -0.2605926
Lakegeorge:4             -0.61792    0.75121 -0.8225748
Lakeoklawaha:1          -15.00060 1663.62182 -0.0090168
Lakeoklawaha:2            2.53766    0.76445  3.3195755
Lakeoklawaha:3            0.83291    1.32041  0.6307966
Lakeoklawaha:4            0.54523    0.83774  0.6508293
Laketrafford:1            0.83291    1.32041  0.6307966
Laketrafford:2            2.53766    0.76445  3.3195755
Laketrafford:3            1.52606    1.11511  1.3685210
Laketrafford:4            1.05605    0.75397  1.4006469
Size>2.3:1                1.59505    1.00979  1.5795905
Size>2.3:2              -16.15361 1769.49896 -0.0091289
Size>2.3:3                0.49644    1.29859  0.3822892
Size>2.3:4                0.71958    0.71508  1.0062975
Lakegeorge:Size>2.3:1    -2.34882    1.62511 -1.4453278
Lakegeorge:Size>2.3:2    13.14855 1769.49929  0.0074307
Lakegeorge:Size>2.3:3   -16.51990 1774.75824 -0.0093083
Lakegeorge:Size>2.3:4    -0.78021    1.13988 -0.6844627
Lakeoklawaha:Size>2.3:1  13.28294 1663.62229  0.0079844
Lakeoklawaha:Size>2.3:2  14.87965 1769.49910  0.0084090
Lakeoklawaha:Size>2.3:3   0.33981    1.76916  0.1920749
Lakeoklawaha:Size>2.3:4 -18.38843 1491.24317 -0.0123309
Laketrafford:Size>2.3:1  -0.96644    1.63646 -0.5905677
Laketrafford:Size>2.3:2  15.23162 1769.49912  0.0086079
Laketrafford:Size>2.3:3   0.13217    1.63646  0.0807669
Laketrafford:Size>2.3:4  -1.18958    1.11191 -1.0698601


Number of linear predictors:  4

Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])

Dispersion Parameter for multinomial family:   1

Residual deviance: 35.39866 on 32 degrees of freedom

Log-likelihood: -65.88956 on 32 degrees of freedom

What's happening is so-called "quasi-separation", which means that the model effectively includes dummy indicators for groups with observed frequencies of zero, so that the ML estimates for certain coefficients are running off to ±∞. Notice in the table of ML estimates that certain coefficients are very large in magnitude and their standard errors are huge. For example:

Lakeoklawaha:Size>2.3:4 -18.38843 1491.24317 -0.0123309

When zero counts are present in a dataset, the ML estimates for certain main effects and/or interactions may approach infinity. The problem is that we are trying to estimate effects about which we have very little information. (Similar to problems of multicollinearity in linear regression, when the data provide little or no evidence about the effect of X1 when X2 is held constant.) Because sometimes we can't really estimate the interactions—and the interactions make the model more difficult to interpret—it's a good idea to eliminate them.
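One way to see where the problem comes from is to tabulate the counts by lake and size, collapsing over sex (a sketch, assuming the gator data frame from above); the zero cells in this table are what push the corresponding interaction estimates toward plus or minus infinity:

# total counts for each food type within each lake-by-size cell
aggregate(cbind(Bird, Invertebrate, Reptile, Other, Fish) ~ Lake + Size,
          data = gator, FUN = sum)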

Diagnostics

Based on the analysis-of-deviance table, our "final" model will have main effects for lake and size but no effect for sex. Let's perform some diagnostics. First, look at the overall fit:

> fit4
Call:
vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~
    Lake + Size, family = multinomial, data = gator)

Coefficients:
 (Intercept):1  (Intercept):2  (Intercept):3  (Intercept):4   Lakegeorge:1   Lakegeorge:2   Lakegeorge:3
    -2.0286189     -1.7491726     -2.4230189     -0.7465252     -0.6951176      1.6583586     -1.2427766
  Lakegeorge:4 Lakeoklawaha:1 Lakeoklawaha:2 Lakeoklawaha:3 Lakeoklawaha:4 Laketrafford:1 Laketrafford:2
    -0.8261962     -1.3483253      2.5955779      1.2160953     -0.8205431      0.3926492      2.7803434
Laketrafford:3 Laketrafford:4     Size>2.3:1     Size>2.3:2     Size>2.3:3     Size>2.3:4
     1.6924767      0.6901725      0.6306597     -1.4582046      0.3512628     -0.3315503

Degrees of Freedom: 64 Total; 44 Residual
Residual deviance: 52.47849
Log-likelihood: -74.42948

The fit is OK but not great. We might want to include a scale parameter for overdispersion. The estimated scale parameter is

\(\hat{\sigma}=\sqrt{X^2/df}=\sqrt{1.3185}=1.148\)
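Here X2 refers to the Pearson goodness-of-fit statistic (an assumption consistent with the number above, since the deviance would give 52.48/44 = 1.19). A sketch of computing it from the fitted model (assuming fit4 from above; the residual df is 44, as reported above):

X2 = sum(resid(fit4, type = "pearson")^2)   # Pearson statistic: sum of squared Pearson residuals
X2 / 44                                     # estimated dispersion X2/df, about 1.3185
sqrt(X2 / 44)                               # estimated scale, about 1.148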

In R, the rescaled summary can be requested by passing the estimated dispersion (the squared scale, X2/df) to summary():

> summary(fit4, dispersion=1.3185)

Coefficients:
               Estimate Std. Error  z value
(Intercept):1  -2.02862    0.67890 -2.98810
(Intercept):2  -1.74917    0.65595 -2.66664
(Intercept):3  -2.42302    0.78299 -3.09459
(Intercept):4  -0.74653    0.42820 -1.74339
Lakegeorge:1   -0.69512    0.95045 -0.73136
Lakegeorge:2    1.65836    0.74560  2.22420
Lakegeorge:3   -1.24278    1.44212 -0.86177
Lakegeorge:4   -0.82620    0.67828 -1.21808
Lakeoklawaha:1 -1.34833    1.41527 -0.95270
Lakeoklawaha:2  2.59558    0.80257  3.23408
Lakeoklawaha:3  1.21610    0.95623  1.27176
Lakeoklawaha:4 -0.82054    0.88762 -0.92443
Laketrafford:1  0.39265    0.95106  0.41285
Laketrafford:2  2.78034    0.81658  3.40487
Laketrafford:3  1.69248    0.94945  1.78258
Laketrafford:4  0.69017    0.68087  1.01366
Size>2.3:1      0.63066    0.78160  0.80688
Size>2.3:2     -1.45820    0.48169 -3.02728
Size>2.3:3      0.35126    0.70564  0.49779
Size>2.3:4     -0.33155    0.54532 -0.60799

This has no effect on the estimates; but the standard errors become slightly larger.

We can also look at residuals. In R, we can use the usual residuals() function to get the Pearson and deviance residuals, as we have seen before.
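For example (a sketch, assuming the fit4 object from above):

res.p = resid(fit4, type = "pearson")    # Pearson residuals, one column per logit
res.d = resid(fit4, type = "deviance")   # deviance residuals
head(res.p)                              # one row per profile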

What about other diagnostics, such as checking the link function? In principle, one could check the validity of the link function by saving the linear predictors (as we did in binary logistic regression) and trying to include the square of the linear predictor. But in practice this is often not done, because PROC LOGISTIC in SAS and most other programs for polytomous regression provide no alternatives to the baseline-category logistic model. The VGAM package in R offers some more flexibility, but this is left for you to explore on your own.

Another thing we could do, since Gender is not significant, is to collapse over Gender and re-fit our models to the resulting three-way table.

What are some other multinomial logit models? We will see some in the next lesson.