7.1.1  Example  The Donner Party
In 1846, the Donner party (Donner and Reed families) left Springfield, Illinois for California in covered wagons. After reaching Fort Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains at a place now called Donner Pass (right) when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.
Three variables:
Y_{i} = 1 if person i survived
Y_{i} = 0 if person i diedX_{1i} age of person i
X_{2i} = 1 if person i is male
X_{2i} = 0 if person i is female.
Objectives
 What is the relationship between survival and gender?
 Predict the probability of survival as a function of age
 After taking into account age, are women more likely to survive harsh conditions than men? Does age affect the survival rate of men and women differently? As we saw in Lesson 5, there are more hypotheses to consider and more potential unexpected behavior with multiple predictors.
The analyses may be carried with donner.sas or donner.R with the dataset donner.dat (or donner.txt) which contains N = 45 observations, and the following columns: 1) Age, 2) Sex (1 = male, 0 = female), 3) Survivorship (1 = survived, 0 = dead).
Q1: What is the relationship between survival and gender?
Analysis: How about Oneway ANOVA?
Problems: The fit of the oneway ANOVA model F_{1,43} = 5.35, pvalue = 0.03432, R^{2} = 10%, and:
 Violation of model assumptions, e.g., errors are not normally distributed, nonconstant variance
 Problems with prediction: can give values between infinity and +infinity, that is outside [0,1] range, and thus not give probabilities of survival.
Correct Analysis: Analyze the twoway contingency table. You can do this via chisquare test of independence or a binary logistic regression model.
Survivorship


Sex

Survived

Died

Total

Female 
10

5

15

Male 
10

20

30

Total 
20

25

45

Test:
H_{0 }: Survivorship is independent of gender
H_{a }: Survivorship depends on gender
Conclusion: The null hypothesis is rejected (X^{2} = 4.50, df = 1, pvalue = 0.0339). We may conclude that survivorship does depend on gender. Among females survivorship is 67% while among males it is 33%. Apparently, females’ chance of survival is twice that of males.
Problem: By now you should be suspicious of simply ignoring a variable by marginalization without good justification, and wary of Simpson's Paradox. The above analysis does not consider any dependency between survivorship and age. The table above is created by collapsing over age, which may lead to a wrong conclusion.
Q2: Predict the probability of survival as a function of age
Naive model: Model the probability of survival given age:
\(P(Y=1X_1)=\beta_0+\beta_1 X_1\)
Linear regression model: Model expected survival given age:
\(E(Y=1X_1)=\beta_0+\beta_1 X_1\)
Problems:
 Nonlinearity – a linear model may give predicted values outside (0, 1) range
 Heteroscedasticity – the variance nπ(1 π) is not constant. This is due to the fact that Y follows a binomial distribution with n = 1, that is the ‘Bernoulli’ distribution (after the famous statistician Bernoulli, of a family of mathematicians by the same name).
Above is a plot of the fitted linear regression model with the observed data. The estimated model is: $E(survival  age)=0.86920.01336age$ Consider predicting survival for a 70 years old person: P(surv=1  age=70)=0.86920.01336 × 70 =  0.0658 !!! This model predicts a negative probability of survival.
Correct Analysis (or one proper solution amongst others: Logistic regression model )
\(\pi_i=\dfrac{\text{exp}(\beta_0+\beta_1 X_{1i})}{1+\text{exp}(\beta_0+\beta_1 X_{1i})}\)
\(\text{log}\dfrac{\pi_i}{1\pi_i}=\beta_0+\beta_1 X_{1i}\)
where $\pi_i$=P(survival=1) probability of survival for person i, and X_{1}_{i} is age of a person i.
Below is the donner.sas SAS program. It reads a datafile stored on your computer. For SAS, you need to adjust the infile line of code depending on the platform, version of SAS, etc.
We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with influence iplots options in SAS.
Fitted model and Interpretation of parameter estimates
Let’s look at a part of the output from donner.lst. Just as in R we get the same model estimates and the only difference is that R report zvalue instead of Wald ChiSquare, but recall that z^{2} ≈ Wald X^{2} (i.e., for the age: (2.063)^{2} ≈ 4.2553).
Below is the part of R code from donner.R that corresponds to donner.sas. For R, you need to adjust read.table() line depending where you stored the file.
We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with lm.influence() function and many others in R. See the part labeled "Model Diagnostics".
Fitted model and Interpretation of parameter estimates
We can get the same model estimates as from the SAS output and the only difference is that R reports a zvalue instead of Wald ChiSquare, but recall that z^{2} ≈ Wald X^{2} (i.e., for the age: (2.063)^{2} ≈ 4.2553).
Here is this part of the R output:
The fitted model is
\(\text{log}\dfrac{\pi_i}{1\pi_i}=1.81830.0665X_{1i}\)
The expected probability of survival of a 70 years old individual (regardless of gender since it's not included in the model):
\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(1.81830.0665\times 70)}{1+\text{exp}(1.81830.0665\times 70)}=0.055\)
\(\text{exp}(\beta_0)=\text{exp}(1.8183)=3.26\)
is the estimate of the odds of survival when age = 0, that is the probability of survival is 3.26 / (1 + 3.26) = 0.77.
The exponentiated slope is
\(\text{exp}(0.0665)=0.94\)
which means that every year increase in age multiplies the odds of survival by 0.94. So the odds of survival decreases with age. This is an estimated odds ratio. In SAS this value and its 95% Wald Confidence Interval is given in the output under "Odds Ratio Estimates":
Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits age 0.936 0.878 0.997
In R, the oddsratio estimate and its 95% Wald Confidence Interval (i.e., the asymptotic CI) if not directly provided but can be obtain as follows:
> exp(coefficients(result)[2])
age
0.9356907
> exp(confint(result)) ## exponentiate to get on the oddsscale
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.9940306 54.0635442
age 0.8695861 0.9898905
In general this the usual 95% Wald CI, i.e., $\hat{\beta}\pm 1.96\times SE(\beta)$, and then we need to exponentiate the ends to get the CI on the oddsscale; for this example $exp(0.0665\pm 1.96 \times 0.0322)=[0.878, 0.997]$.
Assuming the Type I error of 0.05, we reject the null hypothesis that $\beta_1=0$, so there is a weak evidence of a statistically significant relationship between the person's age and probability of survival. But how well does this model fit? Can we trust this inference?
Overall fit of the model
Let's look at the values listed for AGE. While some observations share the same AGE value, most of these are unique. Crosstabulation of AGE vs. SURVIVAL gives 21 × 2 table:
Issue: The table is sparse (e.g. small observed counts and we will get small fitted values), and the sample size requirement for the use of Pearson chisquare goodnessoffit test statistic and likelihood ratio goodnessoffit test statistic is not met. In addition, when more data are collected there could be more ages included so the number of cells is not fixed. This is almost always the case when you are dealing with continuous predictor(s) such as age. Thus, the X^{2} and G^{2} statistics for logistic regression models with continuous predictors do not have approximate chisquared distributions, and are not the best statistics to use to evaluate the overall fit of the model.
There are several alternatives. The most common are
 Fit the desired model, fit an appropriate larger model (e.g. include other predictors, or a quadratic term), and look at the differences in the loglikelihood ratio statistics (e.g., difference in deviances).
 Use HosmerLemeshow (HL) statistic (Agresti (2007), pg. 147, or Agresti (2013), pg. 173 and see the earlier notes in Lesson 6.)
When explanatory variables are continuous, it’s hard to analyze the lack of fit without some grouping. Recall that’s what the HosmerLemeshow test does; it groups by the predicted probabilities such that there is roughly an equal number of observations per group. Below are SAS and R outputs.
In SAS:
In the model statement, the option lackfit tells SAS to compute HL statistics and print the partitioning. In this example, there are 8 groups with roughly equal number of observations per group. For example, group 1 has 5 observations and group 2 has 7. The HL ChiSquare statistic of 8.5165, with df=number of groups  2=6, has a pvalue of 0.2027 indicating that there is no issue with the lackoffit. There are some differences between SAS and R output because of difference in the cut points used to group the data, but the inferences are the same.
Conclusion: The model fits moderately well.
For R, there is no builtin function for the HosmerLemeshow goodnessoffit test. Instead use the hosmerlem() function from ROCandHL.R code provided for the class. Run the following for the Donner example, where "result" is the saved model and g is the number of groups for the HL statistic:
hosmerlem(survive,fitted(result), g=8)
X^2 Df P(>Chi)
6.3528592 6.0000000 0.3848447
The HL ChiSquare statistic of 6.353, with df=number of groups  2=6, has a pvalue of 0.385 indicating that there is no issue with the lackoffit. There are some differences between SAS and R output because of difference in the cut points used to group the data, but the inferences are the same.
Conclusion: The model fits reasonably well.
Q3: After taking into account age, are women more likely to survive harsh conditions than men?
This will involve fitting different logistic regression models, some including both continuous and categorical predictors, and comparing them via likelihoodratio test. First, we review the general setup, where
H_{0 }: The probability of the characteristic does not depend on the pth variable; i.e., reduced model fits the data
H_{a }: The probability of the characteristic depends on the pth variable; i.e., full model fits the data
This is accomplished by comparing the two models as given below:
Reduced Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\ldots+\beta_{p1} X_{(p1)i}\)
which contains all variables except the variable of interest p.
Full Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\ldots+\beta_{p1} X_{(p1)i}+\beta_p X_{pi}\)
which contains all variables including the variable of interest p.
Recall, the fit of each model to the data is assessed by the likelihood or, equivalently, the loglikelihood. The larger the log likelihood, the better the model fits the data. Adding parameters to model can only result in an increase in the log likelihood. So, the full model will always have a larger log likelihood than the reduced model and thus fit better. But the question is, is that extra cost of adding an additional variable really needed or can we capture the variability in the response well enough with the simpler model.
Question: Is the loglikelihood for the full model significantly better than the loglikelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable p.
For example, to test for the effect of gender in Donner Party data, compare:
Reduced Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}\)
where X_{1}_{i} is age of a person i.
Full Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}\)
where X_{1}_{i} is age of a person i, and X_{2}_{i} the person’s gender, with female=0.
If we are testing only for a single parameter then this test should lead to the same inference as if we are testing: $H_0: \beta_2=0$ vs. $H_a: \beta_2\neq 0$.
In SAS: From the Model Fit Statistics in SAS get the likelihoods:
Model

2 log likelihood

Reduced (age) 
56.291

Full (age and sex) 
51.256

The likelihood ratio test statistic is
\(\Delta G^2 =2 \text{log} \Lambda=56.29151.256=5.035\)
Since \(\Delta G^2 =2 \text{log} \Lambda=5.035 > 5.02= \chi^2_1(0.975)\) we can reject the null hypothesis that sex has no effect on the survivorship probability at the α = 0.025 level.
If we study the table with parameter estimates and consider $H_0: \beta_{sex}=0$ vs. $H_a: \beta_{sex}\neq=0$, based on the Wald X^{2}, and pvalue of 0.0345 we can reject the null hypothesis, and thus gender does have an effect on survival.
Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error ChiSquare Pr > ChiSq Intercept 1 3.2304 1.3870 5.4248 0.0199 age 1 0.0782 0.0373 4.3988 0.0360 sex 1 1.5973 0.7555 4.4699 0.0345
Fitted Model:
\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(3.23040.0782 X_11.5973 X_2)}{1+\text{exp}(3.23040.0782 X_11.5973 X_2)}\)
Also, from SAS, the HL = 9.323, df = 7, pvalue = 0.2305 indicates a moderate good fit.
For R, users notice that this is the usual difference in the deviance statistics for the two models, e.g., residual deviance for age only model G^{2}= 56.291 and for age and sex model, G^{2}=51.256. Thus, ΔG^{2}=2 log Λ = 56.291  51.256 = 5.035 and we can reject the null hypothesis that sex has no effect on the survivorship probability.
If we study the table with parameter estimates and consider $H_0: \beta_{sex}=0$ vs. $H_a: \beta_{sex}\neq=0$, based on the Wald X^{2}, and pvalue of 0.0345 we can reject the null hypothesis, and thus gender does have an effect on survival.
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 3.23041 1.38686 2.329 0.0198 *
age 0.07820 0.03728 2.097 0.0359 *
sex 1.59729 0.75547 2.114 0.0345 *
Fitted Model:
\(\hat{\pi}=Pr(survival)=\dfrac{\text{exp}(3.23040.0782 X_11.5973 X_2)}{1+\text{exp}(3.23040.0782 X_11.5973 X_2)}\)
From R, hosmerlem(survive, fitted(result), g=10)
X^2 Df P(>Chi)
11.1771583 8.0000000 0.1918618
the HL = 11.177, df = 8, pvalue = 0.192 indicates a moderate good fit.
Conclusions:
 After taking into account the effects of age, women had higher survival probabilities than men (2 log Λ = 5.035, df = 1; p = 0.025).
 The odds that a male survives are estimated to be exp(1.5973) = 0.202 times the odds that a female survives.
 Moreover, by Walds test, the survivorship probability decreases with increasing age (X_{2} = 4.47; df. = 1; p = 0.0345).
 The odds of surviving decline by a multiplicative factor of exp(0.0782) = 0.925 per year of age.
Computing Survivorship Probabilities
Q: What is the probability of survival for a 24 year old male? What about 24 year old female?
For females:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(3.23040.0782 \times 241.5973 \times 0)}{1+\text{exp}(3.23040.0782 \times 241.5973 \times 0)}\\
&= 0.806\\
\end{align}
For males:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(3.23040.0782 \times 241.5973 \times 1)}{1+\text{exp}(3.23040.0782 \times 241.5973 \times 1)}\\
&= 0.439\\
\end{align}
As before, you can also calculate confidence intervals by using computed estimates of the standard errors and the fact that the estimates are Wald statistics.
Graphical Summary of Model Results
Interaction between Age and Sex
The above analyses assume that the effect of gender on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for interaction compare the following models:
Reduced Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}\)
where X_{1}_{i} is age of traveller i, and X_{2}_{i} the traveller’s gender.
Full Model:
\(\text{log}\left(\dfrac{\pi_i}{1\pi_i}\right)=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\beta_3 X_{1i} X_{2i}\)
where X_{1}_{i} is age of traveller i, and X_{2}_{i} the traveller’s gender.
In SAS, the interaction model is fitted via
proc logistic descending; model survive = age sex age*sex /lackfit influence iplots; output out=a p=pi; run;
Model

2 log likelihood

Reduced (age, sex) 
51.256

Full (age, sex, interaction) 
47.346

From the output, we see that the likelihood ratio test statistic is \(\Delta G^2 =2 \text{log} \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)
Conclusion: At the $\alpha=0.05$ level, we can reject the hypothesis that there is no significant interaction between age and gender. But this is weak evidence in support of the significant interaction term, however it does support the what we observed in the above figure. From the table of of the analysis of MLE below, however, notice that we would fail to reject that $\beta_{age*sex}$ is 0. This happens because Wald chisquare statistic is sensitive to sparseness in the data.
Q: What is the fitted model and what are the probabilities of survival for 24 years old males and females?
For females:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}\\
&= 0.930\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.194) = 0.824 per additional year of age
For males:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}\\
&= \dfrac{\text{exp}(0.31830.0324 \times 24)}{1+\text{exp}(0.31830.0324 \times 24)}\\
&= 0.387\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.0324) = .968 per additional year of age.
Note that the odds of survival decline more rapidly for females than for males.
In R, the interaction model is fitted via
result=glm(survive~age+sex+age*sex,family=binomial("logit"))
Model

2 log likelihood

Reduced (age, sex) 
51.256

Full (age, sex, interaction) 
47.346

From the output, we see that the likelihood ratio test statistic is \(\Delta G^2 =2 \text{log} \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)
Conclusion: At the $\alpha=0.05$ level, we can reject the hypothesis that there is no significant interaction between age and gender. But this is weak evidence in support of the significant interaction term, however it does support the what we observed in the above figure. From the table of of the analysis of MLE below, however, notice that we would fail to reject that $\beta_{age*sex}$ is significant. This happens because Wald chisquare statistic is sensitive to sparseness in the data.
Q: What is the fitted model and what are the probabilities of survival for 24 years old males and females?
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 7.24638 3.20517 2.261 0.0238 *
age 0.19407 0.08742 2.220 0.0264 *
sex 6.92805 3.39887 2.038 0.0415 *
age:sex 0.16160 0.09426 1.714 0.0865 .
For females:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 0+0.1616\times 0)}\\
&= 0.930\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.194) = 0.824 per additional year of age
For males:
\begin{align}
\hat{\pi} &= Pr(survival)\\
&= \dfrac{\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}{1+\text{exp}(7.24500.1940 \times 246.9267 \times 1+0.1616\times (24\times 1))}\\
&= \dfrac{\text{exp}(0.31830.0324 \times 24)}{1+\text{exp}(0.31830.0324 \times 24)}\\
&= 0.387\\
\end{align}
The odds of surviving are multiplied by a factor of exp(0.0324) = .968 per additional year of age.
Note that the odds of survival decline more rapidly for females than for males.
More on Model Diagnostics
While the goodnessoffit statistics can tell you how well a particular model fits the data, they don’t tell you much about the lackoffit; i.e. where the model fails. To assess the lack of fit we need to look at regression diagnostics. Regression diagnostics tell you how influential each observation is to the fit of the logistic regression model.
We can evaluate the numerical values of these statistics and/or consider their graphical representation (e.g. like residual plots in linear regression).
Some measures of influence:
 Pearson and Deviance Residuals identify observations not well explained by the model.
 Hat Matrix Diagonal detects extreme large points in the design space. These are often labeled as "leverage" or "h_{i}" and are related to standardized residuals. A general rule says that if h_{i} > 2*p/n or > 3*p/n the points is influential. Here "p" is the number of parameters in the model and "n" the number of observations. For donner data example with the main effects model, 2*3/45=0.13. There are two points possibly influential.
 The regression diagnostics work is very similar to linear regression.
[Enlarge]
 DFBETA assesses the effect of an individual observation on the estimated parameter of the fitted model. It’s calculated for each observation for each parameter estimate. It’s a difference in the parameter estimated due to deleting the corresponding observation. For example, for the main effects model there are 3 such plots (one for intercept, one for age and one for sex). Here is the one for AGE, showing that observation #35 is potentially influential.
[Enlarge]
 C and CBAR provide scalar measures of the influence of the individual observations on the parameter estimates; similar to the Cook distance in linear regression. The same rules in linear regression hold here too. For example, observation #35 again seems influential (see the rest of donner.lst).
 DIFDEV and DIFCHISQ detect observations that heavily add to the lack of fit, i.e. there is a large disagreement between the fitted and observed values. DIFDEV measures the change in the deviance (G^{2}) as an individual observation is deleted. DIFCHISQ measures the change in X^{2}.
In SAS, under PROC LOGISTIC, INFLUENCE option and IPLOTS option will output these diagnostics. In R see, influence() function.
As pointed out before, the standardized residual values are considered to be indicative of lack of fit if their absolute value exceeds 3 (some sources are more conservative and take 2 as the threshold value).
Index plot, in which the residuals are plotted against the indices of the observations, can help us determine outliers, and/or systematic patterns in variation, and thus assess the lackoffit. In our example, all Pearson and Deviance residuals fall within 2 and +2; thus there does not seem to be any outliers or particular heavy influential points; see donner.lst for the IPLOTS.
For our example, IPLOT of Deviance residuals vs. the Case Number.
For analysis of other diagnostics see Agresti and/or SAS/R documentation, or other software you are using. More will be discussed in the next section.
For an example of a larger dataset, and more on Model Selection, see handouts relevant for the Water Level Study data below:
Go to Case Studies: The Water Level Study. The relevant SAS and R files are on the SAS supplemental pages (e.g., water_level1.sas, water_level2.sas). For corresponding R files see R supplemental pages (e.g., water.R)