8.2.2  Example: Alligator Food Choices in R
For R, there are several packages and functions we can use: for example, vglm() from the VGAM package, multinom() from the nnet package, or mlogit() from the mlogit package; see the links on the first page of these lecture notes and the later examples. We will focus on vglm().
The table below (discussed in Agresti (2007), Sec 6.1.2 or Agresti (2013), Sec 8.1.2), reported by Delany and Moore (1987), comes from a study of the primary food choices of alligators in four Florida lakes. Researchers classified the stomach contents of 219 captured alligators into five categories: Fish (the most common primary food choice), Invertebrate (snails, insects, crayfish, etc.), Reptile (turtles, alligators), Bird, and Other (amphibians, plants, household pets, stones, and other debris).
Let's describe these data with a baseline-category logit model, with primary food choice as the outcome and Lake (4 lakes: Hancock, Oklawaha, Trafford, George), Sex (male (m) and female (f)), and Size (small (<2.3) and large (>2.3)) as covariates.
Because the usual primary food choice of alligators appears to be fish, we'll use fish as the baseline category; the four logit equations will then describe the log-odds that alligators select other primary food types instead of fish.
We let
π_{1} = prob. of fish,
π_{2} = prob. of invertebrates,
π_{3} = prob. of reptiles,
π_{4} = prob. of birds,
π_{5} = prob. of other,
and make "fish" the baseline category. The logit equations are
\(\log\left(\dfrac{\pi_j}{\pi_1}\right)=\beta_{0j}+\beta_{1j}X_1+\cdots+\beta_{5j}X_5\)
for j = 2, 3, 4, 5, where each equation has its own set of coefficients. The X's include
 three dummy indicators for lake,
 a dummy for sex, and
 a dummy for size.
Therefore, each logit equation has six coefficients to estimate (an intercept and five slopes), so the number of free parameters in this model is 4 × 6 = 24.
Entering the Data
When the data are grouped, they could be entered as we did in SAS in the previous section, where the response categories 1, 2, . . . , r appear in a single column of the dataset, with another column containing the frequency or count. Lines with a frequency of zero are not actually used in the modeling, because they contribute nothing to the log-likelihood. Alternatively, the data can be entered as a flat table, as shown below and in gator.txt, which is more convenient for the vglm() function from the VGAM package in R:
profile Gender Size Lake Fish Invertebrate Reptile Bird Other
1 f <2.3 george 3 9 1 0 1
2 m <2.3 george 13 10 0 2 2
3 f >2.3 george 8 1 0 0 1
4 m >2.3 george 9 0 0 1 2
5 f <2.3 hancock 16 3 2 2 3
6 m <2.3 hancock 7 1 0 0 5
7 f >2.3 hancock 3 0 1 2 3
8 m >2.3 hancock 4 0 0 1 2
9 f <2.3 oklawaha 3 9 1 0 2
10 m <2.3 oklawaha 2 2 0 0 1
11 f >2.3 oklawaha 0 1 0 1 0
12 m >2.3 oklawaha 13 7 6 0 0
13 f <2.3 trafford 2 4 1 1 4
14 m <2.3 trafford 3 7 1 0 1
15 f >2.3 trafford 0 1 0 0 0
16 m >2.3 trafford 8 6 6 3 5
There are N = 16 rows here, one for each unique profile, that is, each possible combination of the predictor variables (4 lakes × 2 sizes × 2 genders).
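For example, the flat table above can be read in with read.table(); the file name gator.txt is the one used in these notes, and header = TRUE picks up the column names:

```r
# Read the grouped (flat-table) alligator data
gator <- read.table("gator.txt", header = TRUE)
str(gator)   # check that Gender, Size, and Lake are factors
```

Note that in R 4.0 and later, read.table() no longer converts character columns to factors by default, so you may need an explicit conversion such as gator$Lake <- factor(gator$Lake) before working with contrasts.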
Specifying the Model
IMPORTANT: Install the VGAM package in order to use the vglm() function to fit multinomial logit models; there are other packages in R that could be used to fit these types of models.
install.packages("VGAM")
library(VGAM)
The data are read in the usual way. A good practice is to check that the categorical predictors are actually factors and what their baseline levels are. In the following code, we set the baseline levels so that the analysis here corresponds to the SAS analysis in the previous section, although SAS provides many more details than R in this case. For example, to set Hancock as the baseline level for the variable Lake (notice that the Hancock row of the contrast matrix below is all zeros), we can use the contrasts() function:
##set the ref levels so that R output matches the SAS code
##sets Hancock as the baseline level
contrasts(gator$Lake)=contr.treatment(levels(gator$Lake),base=2)
contrasts(gator$Lake)
george oklawaha trafford
george 1 0 0
hancock 0 0 0
oklawaha 0 1 0
trafford 0 0 1
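An equivalent and perhaps more familiar way to pick the reference level of a factor is relevel(), which moves the chosen level to the first position (the default treatment-contrast baseline). This is a sketch of that alternative, not the code the notes use:

```r
# Make "hancock" the reference level of Lake (level names as in gator.txt)
gator$Lake <- relevel(gator$Lake, ref = "hancock")
contrasts(gator$Lake)   # the hancock row is now all zeros
```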
To set up "fish" as the baseline, this can be done in the vglm() call: by default, vglm() uses the last category of the response to create the logits. Let's see how this is done if we want to fit a model with Lake, Size, and Gender as predictors and "Fish" as the baseline for the logits:
fit5=vglm(cbind(Bird,Invertebrate,Reptile,Other,Fish)~Lake+Size+Gender, data=gator, family=multinomial)
summary(fit5)
The option family=multinomial specifies the baseline-category multinomial response; the above code fits 4 logit models simultaneously. See the R program alligator.R and the data file gator.txt (and possibly alligator.dat for any additional explorations on your own).
Here is the output indicating that there are 4 logit models (i.e., linear predictors) being fitted and that this is a baseline logit model with the last category, i.e., category 5, as the baseline level; this is "Fish" from cbind(Bird, Invertebrate, Reptile, Other, Fish) in the above call.
Number of linear predictors: 4
Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
The four logit equations predict the log-odds of
 birds versus fish (from the R output: 1 vs 5),
 invertebrates versus fish (2 vs 5),
 reptiles versus fish (3 vs 5), and
 other versus fish (4 vs 5).
Here is the output pertaining to the "goodness of fit".
Dispersion Parameter for multinomial family: 1
Residual deviance: 50.26369 on 40 degrees of freedom
Log-likelihood: -73.32208 on 40 degrees of freedom
There are N = 16 profiles (unique combinations of lake, sex and size) in this dataset. Recall that Residual deviance tests the fit of the current model versus the saturated model. The saturated model, which fits a separate multinomial distribution to each profile, has 16 × 4 = 64 parameters. The current model has an intercept, three lake coefficients, one sex coefficient and one size coefficient for each of the four logit equations, for a total of 24 parameters. Therefore, the overall fit statistics have 64 − 24 = 40 degrees of freedom.
Think about the following question before reading on: does the model fit well?
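The deviance-based goodness-of-fit p-value can be computed directly from the output above:

```r
# Compare residual deviance to its chi-square reference distribution
1 - pchisq(50.26369, df = 40)   # roughly 0.13
```

A p-value well above 0.05 indicates no significant lack of fit relative to the saturated model.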
Output pertaining to the significance of covariates along with the estimated coefficients and their standard errors is given:
Coefficients:
Estimate Std. Error z value
(Intercept):1 -2.46327 0.77390 -3.18294
(Intercept):2 -2.07445 0.61165 -3.39156
(Intercept):3 -2.91414 0.88564 -3.29043
(Intercept):4 -0.91673 0.47817 -1.91716
Lakegeorge:1 -0.57527 0.79522 -0.72341
Lakegeorge:2 1.78051 0.62321 2.85700
Lakegeorge:3 -1.12946 1.19280 -0.94690
Lakegeorge:4 -0.76658 0.56855 -1.34830
Lakeoklawaha:1 -1.12562 1.19230 -0.94407
Lakeoklawaha:2 2.69369 0.66925 4.02493
Lakeoklawaha:3 1.40080 0.81048 1.72835
Lakeoklawaha:4 -0.74052 0.74214 -0.99781
Laketrafford:1 0.66172 0.84607 0.78212
Laketrafford:2 2.93633 0.68740 4.27165
Laketrafford:3 1.93159 0.82532 2.34041
Laketrafford:4 0.79119 0.58794 1.34570
Size>2.3:1 0.73024 0.65228 1.11952
Size>2.3:2 -1.33626 0.41119 -3.24971
Size>2.3:3 0.55704 0.64661 0.86147
Size>2.3:4 -0.29058 0.45993 -0.63180
Genderf:1 0.60643 0.68884 0.88036
Genderf:2 0.46296 0.39552 1.17051
Genderf:3 0.62756 0.68528 0.91578
Genderf:4 0.25257 0.46635 0.54159
From the z-values, that is, the Wald statistics (estimate − 0)/se(estimate), we can see, for example, that none of the coefficients for Gender is significant. Recall that z² = Wald chi-square with 1 df; e.g., for Genderf:4, which is the parameter for Gender level "female" and food "other", z² = 0.54159² = 0.2933 with 1 df, p-value 0.588. Considering the other covariates in a similar way, it seems from the above results that
 lake has an effect on food choice
 size has an effect on food choice; and
 gender does not have a discernible effect.
This suggests that we should probably remove sex from the model. We also may want to look for interactions between lake and size.
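The Wald p-value quoted above can be reproduced directly in R; the z-value 0.54159 is taken from the summary output:

```r
# Wald test for Genderf:4 (Gender = female, food = other)
z <- 0.54159
1 - pchisq(z^2, df = 1)   # about 0.588, clearly not significant
```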
But before we do that, let's look more carefully at the estimated coefficients of the current model and their associated estimated odds ratios:
> exp(coefficients(fit5))
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4 Lakegeorge:1 Lakegeorge:2 Lakegeorge:3
0.08515564 0.12562535 0.05425079 0.39982590 0.56255501 5.93289534 0.32320675
Lakegeorge:4 Lakeoklawaha:1 Lakeoklawaha:2 Lakeoklawaha:3 Lakeoklawaha:4 Laketrafford:1 Laketrafford:2
0.46460153 0.32445217 14.78619873 4.05843171 0.47686707 1.93813088 18.84663158
Laketrafford:3 Laketrafford:4 Size>2.3:1 Size>2.3:2 Size>2.3:3 Size>2.3:4 Genderf:1
6.90044926 2.20601430 2.07557742 0.26282653 1.74549121 0.74782762 1.83387028
Genderf:2 Genderf:3 Genderf:4
1.58877439 1.87303229 1.28732894
How do we interpret them? Recall that there are four logit equations to predict the log-odds of
 birds versus fish (from the R output: 1 vs 5),
 invertebrates versus fish (2 vs 5),
 reptiles versus fish (3 vs 5), and
 other versus fish (4 vs 5).
That is, the equation for the log-odds of food type food = {Bird, Inve, Rept, Other} relative to fish (F), for lake (l), size (s), and sex (g), is
\begin{equation} \log\frac{\pi_{food,lsg}}{\pi_{F,lsg}}=\alpha_{food} +\beta_{food,l}^L+\beta_{food,s}^S+\beta_{food,g}^G \end{equation}
where $\alpha_F=0$ and $\beta_{food,Hancock}^L=\beta_{food,small}^S=\beta_{food,male}^G=0$ for every food, indicating the baseline category of the response and the reference levels of the predictors.
For example, the estimated prediction equation for the log-odds of birds relative to fish is: \[\log\frac{\hat{\pi}_{Bird,lsg}}{\hat{\pi}_{Fish,lsg}}=-2.4633 - 1.1256\,Oklawaha+0.6617\,Trafford-0.5753\,George+0.7302\,large+0.6064\,female\]
The intercepts give the estimated logodds for the reference group lake = Hancock, size = small, sex = male. For example, the estimated logodds of birds versus fish in this group is −2.4633; the estimated logodds of invertebrates versus fish is −2.0744; and so on.
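To see what these intercepts imply on the probability scale, we can invert the baseline-category logit transformation. A minimal sketch, using the four intercept estimates from the output above (bird, invertebrate, reptile, other, in that order):

```r
# Estimated log-odds vs. fish for the reference group
# (lake = hancock, size = small, sex = male)
eta <- c(bird = -2.46327, invert = -2.07445, rept = -2.91414, other = -0.91673)

pi_fish  <- 1 / (1 + sum(exp(eta)))   # baseline-category probability
pi_other <- exp(eta) * pi_fish        # the remaining four probabilities

round(c(fish = pi_fish, pi_other), 3)
```

For this reference group, the estimated probability of fish is about 0.60, consistent with fish being the most common primary food choice.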
The lake effect is characterized by three dummy coefficients in each of the four logit equations. The estimated coefficient for the Lake Oklawaha dummy in the bird-versus-fish equation is −1.1256, with std. error 1.1923. This means that alligators in Lake Oklawaha are less likely to choose birds over fish than their colleagues in Lake Hancock are. In other words, fish appear to be less common in Lake Oklawaha than in Lake Hancock. The estimated odds ratio exp(−1.1256) = 0.32 is the same for alligators of all sexes and sizes, because this is a main-effects model with no interactions; see the entry "Lakeoklawaha:1" in the table produced by the R command exp(coefficients(fit5)). However, this coefficient is not statistically significant: the Wald X² = 0.8913 (p-value = 0.3452), which agrees with the 95% CI for the odds-ratio estimate, (0.031, 3.358), containing 1. This CI can be computed as:
> exp(-1.1256 + 1.96*1.19230)
[1] 3.357874
> exp(-1.1256 - 1.96*1.19230)
[1] 0.03135103
On the other hand, the estimated coefficient for the Lake Oklawaha dummy in the invertebrates-versus-fish equation is 2.6937 (std. error = 0.6692, p-value < 0.001) and highly significant. The estimated odds ratio of 14.786 (with 95% CI [3.983, 54.893]) implies that alligators in Lake Oklawaha are more likely to choose invertebrates over fish than their colleagues in Lake Hancock are.
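That interval can be reproduced the same way as before, plugging in the estimate and standard error for Lakeoklawaha:2 from the summary output:

```r
# Wald 95% CI for the Oklawaha odds ratio (invertebrates vs. fish)
est <- 2.69369
se  <- 0.66925
round(exp(est + c(-1, 1) * 1.96 * se), 2)   # roughly 3.98 to 54.89
```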
The above output also confirms that there is no significant sex effect and that size only matters for invertebrates versus fish food choice.
Now, let's turn to finding the "best" model.
Model Selection
Let's adopt an analysisofdeviance approach to compare various models, and test for overall significance of each variable.
First, let's find the deviance G^{2} for the null (interceptonly) model, a model with just four parameters. Because there are N = 4 × 2 × 2 = 16 unique covariate patterns, the saturated model will have 16 × (5 − 1) = 64 parameters, so the G^{2} statistic for the null model should have 64 − 4 = 60 degrees of freedom. Let's fit the null model in R, like this (e.g., see alligator.R):
# intercept only
fit0=vglm(cbind(Bird,Invertebrate,Reptile,Other,Fish)~1, data=gator, family=multinomial)
The fit statistics are:
> fit0
Degrees of Freedom: 64 Total; 60 Residual
Residual deviance: 116.7611
Log-likelihood: -106.5708
We can also use
> deviance(fit0)
[1] 116.7611
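Deviances of nested vglm() fits can be compared directly. A sketch of the likelihood-ratio test of the null model against the Lake + Size + Sex model, assuming fit0 and fit5 are the vglm objects fitted earlier:

```r
# LR test: null model vs. Lake + Size + Sex
G2 <- deviance(fit0) - deviance(fit5)        # 116.7611 - 50.2637 = 66.4974
df <- df.residual(fit0) - df.residual(fit5)  # 60 - 40 = 20
1 - pchisq(G2, df)                           # very small p: the predictors matter
```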
Repeating the model-fitting for various sets of predictors, we obtain the following analysis-of-deviance table:

Model                          G^2       df
Saturated                      0.00      0
Lake + Size + Lake × Size**    35.40     32
Lake + Size + Sex              50.26     40
Lake + Size                    52.48     44
Lake                           73.57     48
Size                           101.61    56
Sex                            114.66    56
Null                           116.76    60

**Note: did not converge, or issue with the fit
As before, we can use the table above for model selection by computing ΔG² and Δdf between nested models. Our "final" model will have main effects for lake and size but no effect for sex, because comparing "Lake + Size" with "Lake + Size + Sex" gives ΔG² = 52.48 − 50.26 = 2.22 with 44 − 40 = 4 df, which is not significant.
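The sex comparison in the table can be checked with a quick chi-square calculation:

```r
# Likelihood-ratio test for dropping Sex from Lake + Size + Sex
1 - pchisq(52.48 - 50.26, df = 44 - 40)   # about 0.70, so sex can be dropped
```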
We will look more closely at the "Lake + Size" model, but before we do that, notice that we ran into trouble when we included the lake × size interaction. SAS gives a warning (see the SAS analysis on the previous page) that the validity of the model is questionable, but R with vglm() will NOT.
Coefficients:
Estimate Std. Error z value
(Intercept):1 -2.44235 0.73721 -3.3129607
(Intercept):2 -1.74920 0.54174 -3.2288767
(Intercept):3 -2.44235 0.73721 -3.3129607
(Intercept):4 -1.05605 0.41046 -2.5728451
Lakegeorge:1 0.36291 1.05166 0.3450804
Lakegeorge:2 1.92105 0.63923 3.0052754
Lakegeorge:3 -0.33024 1.26727 -0.2605926
Lakegeorge:4 -0.61792 0.75121 -0.8225748
Lakeoklawaha:1 -15.00060 1663.62182 -0.0090168
Lakeoklawaha:2 2.53766 0.76445 3.3195755
Lakeoklawaha:3 0.83291 1.32041 0.6307966
Lakeoklawaha:4 0.54523 0.83774 0.6508293
Laketrafford:1 0.83291 1.32041 0.6307966
Laketrafford:2 2.53766 0.76445 3.3195755
Laketrafford:3 1.52606 1.11511 1.3685210
Laketrafford:4 1.05605 0.75397 1.4006469
Size>2.3:1 1.59505 1.00979 1.5795905
Size>2.3:2 -16.15361 1769.49896 -0.0091289
Size>2.3:3 0.49644 1.29859 0.3822892
Size>2.3:4 0.71958 0.71508 1.0062975
Lakegeorge:Size>2.3:1 -2.34882 1.62511 -1.4453278
Lakegeorge:Size>2.3:2 13.14855 1769.49929 0.0074307
Lakegeorge:Size>2.3:3 -16.51990 1774.75824 -0.0093083
Lakegeorge:Size>2.3:4 -0.78021 1.13988 -0.6844627
Lakeoklawaha:Size>2.3:1 13.28294 1663.62229 0.0079844
Lakeoklawaha:Size>2.3:2 14.87965 1769.49910 0.0084090
Lakeoklawaha:Size>2.3:3 0.33981 1.76916 0.1920749
Lakeoklawaha:Size>2.3:4 -18.38843 1491.24317 -0.0123309
Laketrafford:Size>2.3:1 -0.96644 1.63646 -0.5905677
Laketrafford:Size>2.3:2 15.23162 1769.49912 0.0086079
Laketrafford:Size>2.3:3 0.13217 1.63646 0.0807669
Laketrafford:Size>2.3:4 -1.18958 1.11191 -1.0698601
Number of linear predictors: 4
Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
Dispersion Parameter for multinomial family: 1
Residual deviance: 35.39866 on 32 degrees of freedom
Log-likelihood: -65.88956 on 32 degrees of freedom
What's happening is so-called "quasi-separation," which means that the model effectively includes dummy indicators for groups with observed frequencies of zero, so that the ML estimates of certain coefficients run off to ±∞. Notice in the table of ML estimates that certain coefficients are very large in magnitude, and their standard errors are huge. For example:
Lakeoklawaha:Size>2.3:4 -18.38843 1491.24317 -0.0123309
When zero counts are present in a dataset, the ML estimates for certain main effects and/or interactions may diverge to ±∞. The problem is that we are trying to estimate effects about which we have very little information. (This is similar to multicollinearity in linear regression, when the data provide little or no evidence about the effect of X_1 when X_2 is held constant.) Because sometimes we can't really estimate the interactions, and the interactions make the model more difficult to interpret, it's a good idea to eliminate them.
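One way to see where the zeros come from is to sum the counts over gender and inspect the lake-by-size table; for example, large Oklawaha alligators have no observations in the Other category, which is exactly the cell behind the diverging Lakeoklawaha:Size>2.3:4 estimate. A sketch, assuming the gator data frame read from gator.txt:

```r
# Total counts for each food type within each lake-by-size cell
aggregate(cbind(Fish, Invertebrate, Reptile, Bird, Other) ~ Lake + Size,
          data = gator, FUN = sum)
```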
Diagnostics
Based on the analysis-of-deviance table, our "final" model has main effects for lake and size but no effect for sex. Let's perform some diagnostics, starting with the overall fit:
> fit4
Call:
vglm(formula = cbind(Bird, Invertebrate, Reptile, Other, Fish) ~
Lake + Size, family = multinomial, data = gator)
Coefficients:
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4 Lakegeorge:1 Lakegeorge:2 Lakegeorge:3
-2.0286189 -1.7491726 -2.4230189 -0.7465252 -0.6951176 1.6583586 -1.2427766
Lakegeorge:4 Lakeoklawaha:1 Lakeoklawaha:2 Lakeoklawaha:3 Lakeoklawaha:4 Laketrafford:1 Laketrafford:2
-0.8261962 -1.3483253 2.5955779 1.2160953 -0.8205431 0.3926492 2.7803434
Laketrafford:3 Laketrafford:4 Size>2.3:1 Size>2.3:2 Size>2.3:3 Size>2.3:4
1.6924767 0.6901725 0.6306597 -1.4582046 0.3512628 -0.3315503
Degrees of Freedom: 64 Total; 44 Residual
Residual deviance: 52.47849
Log-likelihood: -74.42948
The fit is OK but not great. We might want to include a scale parameter to adjust for overdispersion. Based on the Pearson statistic X², the estimated scale parameter is
\(\hat{\sigma}=\sqrt{X^2/df}=\sqrt{1.3185}=1.148\)
The R program becomes:
> summary(fit4, dispersion = 1.148)
Coefficients:
Estimate Std. Error z value
(Intercept):1 -2.02862 0.67890 -2.98810
(Intercept):2 -1.74917 0.65595 -2.66664
(Intercept):3 -2.42302 0.78299 -3.09459
(Intercept):4 -0.74653 0.42820 -1.74339
Lakegeorge:1 -0.69512 0.95045 -0.73136
Lakegeorge:2 1.65836 0.74560 2.22420
Lakegeorge:3 -1.24278 1.44212 -0.86177
Lakegeorge:4 -0.82620 0.67828 -1.21808
Lakeoklawaha:1 -1.34833 1.41527 -0.95270
Lakeoklawaha:2 2.59558 0.80257 3.23408
Lakeoklawaha:3 1.21610 0.95623 1.27176
Lakeoklawaha:4 -0.82054 0.88762 -0.92443
Laketrafford:1 0.39265 0.95106 0.41285
Laketrafford:2 2.78034 0.81658 3.40487
Laketrafford:3 1.69248 0.94945 1.78258
Laketrafford:4 0.69017 0.68087 1.01366
Size>2.3:1 0.63066 0.78160 0.80688
Size>2.3:2 -1.45820 0.48169 -3.02728
Size>2.3:3 0.35126 0.70564 0.49779
Size>2.3:4 -0.33155 0.54532 -0.60799
This has no effect on the estimates, but the standard errors become slightly larger.
We can also look at residuals. In R, we can use the usual residuals() function to get the Pearson and deviance residuals, as we have seen before.
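For vglm objects, the residual type is selected with the type argument; the Pearson residuals also give the X² statistic behind the scale estimate above. A sketch, assuming fit4 is the Lake + Size fit and that the installed VGAM version supports these residual types:

```r
r.pear <- resid(fit4, type = "pearson")   # 16 x 4 matrix of Pearson residuals
r.dev  <- resid(fit4, type = "deviance")  # deviance residuals

X2 <- sum(r.pear^2)                       # Pearson chi-square statistic
sqrt(X2 / df.residual(fit4))              # estimated scale parameter, about 1.148
```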
What about other diagnostics, such as checking the link function? In principle, one could check the validity of the link function by saving the linear predictors (as we did in binary logistic regression) and trying to include the square of the linear predictor. In practice, this is often not done, because PROC LOGISTIC in SAS and most other programs for polytomous regression provide no alternatives to the baseline-category logistic model. The VGAM package in R offers more flexibility, but this is left for you to explore on your own.
Another thing we could do, since Gender is not significant, is to collapse over Gender and refit our models to the resulting three-way table.
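A sketch of that collapse-and-refit step, assuming the gator data frame from before (aggregate() sums the five count columns within each lake-by-size cell):

```r
# Collapse the four-way table over Gender, then refit the Lake + Size model
gator2 <- aggregate(cbind(Bird, Invertebrate, Reptile, Other, Fish) ~ Lake + Size,
                    data = gator, FUN = sum)

fit4b <- vglm(cbind(Bird, Invertebrate, Reptile, Other, Fish) ~ Lake + Size,
              data = gator2, family = multinomial)
summary(fit4b)
```

The estimated coefficients are unchanged (the model never used Gender), but the saturated model, and hence the residual deviance and its degrees of freedom, change because there are now only 8 profiles.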
What are some other multinomial logit models? We will see some in the next lesson.