Example: Donner Party Section
In 1846, the Donner party (Donner and Reed families) left Springfield, Illinois for California in covered wagons. After reaching Fort Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains at a place now called Donner Pass (right) when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.
Three variables:
\(Y_i = 1\) if the \(i\)th person survived
\(Y_i = 0\) if \(i\)th person died
\(x_{1i} = \) age of \(i\)th person.
\(x_{2i} = 1\) if \(i\)th person is male
\(x_{2i} = 0\) if \(i\)th person is female.
Questions of primary interest are:
- What is the relationship between survival and sex? Are males more or less likely to survive than females?
- What is the relationship between survival and age? Can the probability of survival be predicted as a function of age?
- Does age affect the survival rate of males and females differently?
The data are available in the file donner.dat and contain \( N = 45\) observations in the following columns: 1) Age, 2) Sex (1 = male, 0 = female), 3) Survived (1 = yes, 0 = no).
Analysis
Marginal Analysis: The two-way contingency table below summarizes the counts between sex and survival, ignoring age. We can use a chi-square test of independence or a binary logistic regression model.
Survivorship | |||
---|---|---|---|
Sex | Survived | Died | Total |
Female | 10 | 5 | 15 |
Male | 10 | 20 | 30 |
Total | 20 | 25 | 45 |
Test:
\(H_a\): Survivorship is associated with sex
Conclusion: The null hypothesis is rejected (\(X^2 = 4.50\), df = 1, p-value = 0.0339). We may conclude that survivorship is (marginally) associated with sex. About 67% of females survived, whereas about 33% of males survived. Apparently, females' chance of survival is twice that of males.
Naive model: The linear regression model assumes the probability of survival, given age, has the following form:
\(\pi=P(Y=1|x_1)=\beta_0+\beta_1 x_1\)
- Non-linearity – Modeling the mean of \(Y\) as a linear function of predictors may result in estimates that fall outside the (0, 1) range.
- Heteroscedasticity – The variance of \(Y\) depends on \(\pi\), which depends on \(x_1\) and is therefore not constant.
Above is a plot of the fitted linear regression model with the observed data. The estimated model is \(P(\mbox{survival} | \mbox{age})=0.8692-0.01336\mbox{age}\). Consider predicting survival for a 70 year-old person: \(0.8692-0.01336( 70) = - 0.0658\)! This model predicts a negative probability of survival.
Logistic regression model:
\(\pi_i=\dfrac{\exp(\beta_0+\beta_1 x_{1i})}{1+\exp(\beta_0+\beta_1 x_{1i})}\)
\(\log\dfrac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 x_{1i}\)
where \( \pi_i\) is the probability of survival for the \(i\)th person, and \(x_{1i}\) is the age of the \(i\)th person.
Below is the donner.sas SAS program. It reads the datafile from a given path (make sure to update the "infile" statement to point where you have the data saved).
options ls=78;
title "Donner Party Data - Logistic Regression";
/********read into data***************************/
data donner;
infile "C:\Temp\donner.txt";
input age sex survive;
run;
proc freq;
tables sex*survive / chisq expected;
run;
/************one way ANOVA*************/
ods graphics on;
proc mixed data=donner plot=all;
class sex;
model survive=sex;
run;
ods graphics off;
/*************OLS Regression************/
proc sgplot data=donner;
xaxis label= "age";
yaxis label="survived";
scatter x=age y=survive;
reg x=age y=survive;
title "Fitted Line Plot";
run;
quit;
title;
proc reg data=donner;
model survive = age;
run;
/***************Logistics Regression****************/
proc logistic descending;
model survive = age / lackfit influence iplots;
run;
proc freq;
tables age*survive /chisq expected;
run;
proc logistic descending;
model survive = age sex /lackfit influence iplots;
run;
proc logistic descending;
model survive = age sex age*sex /lackfit influence iplots;
output out=a p=pi;
run;
proc format;
value sexfmt 1= 'male' 0='female';
run;
proc sgscatter data=a ;
plot pi*age / group= sex pbspline;
format sex sexfmt.;
run;
We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with influence iplots options in SAS.
Fitted model and Interpretation of parameter estimates
Let's look at a part of the output from donner.lst. Just as in R, we get the same model estimates, and the only difference is that R report a z-value instead of the Wald Chi-Square, but recall that \(z^2\approx \) Wald \(X^2\) (e.g., for age, \((-2.063)^2 \approx 4.2553)\).
Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard | Wald | Pr > ChiSq |
Intercept | 1 | 1.8183 | 0.9993 | 3.3106 | 0.0688 |
age | 1 | -0.0665 | 0.0322 | 4.2553 | 0.0391 |
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate | 95% Wald | |
age | 0.936 | 0.878 | 0.997 |
Below is the part of R code from donner.R that corresponds to donner.sas. For R, we need to adjust the read.table() line depending on where the file is saved.
donner=read.table("donner.txt")
survive=donner[,3]
age=donner[,1]
sex=donner[,2]
### Table sex*survive
table=as.matrix(table(sex,survive))
contingency_table=list(Frequency=table,Expected=chisq.test(table)$expected)
contingency_table
chisq.test(table, correct=FALSE) ## chisq. test of independence
LM=lm(survive~age) ## linear regression
summary(LM)
LMANOVA=anova(LM) ## anova
LMANOVA
### Plot Survive*Age
plot(age,survive,xlim=c(15,70),ylim=c(-1.5,2.0),main="survive v.s. age")
abline(LM,col="red")
abline(confint(LM)[1],confint(LM)[2],col="green")
abline(confint(LM)[3],confint(LM)[4],col="purple")
### Plot Predicted*Age
plot(age,fitted(LM),main="Predicted v.s. Age")
### Q-Q plot
qqnorm(residuals(LM),main="Q-Q Plot")
### Studentized residuals v.s Observation
plot(rstudent(LM),main="Studentized residual v.s. observation")
abline(h=0)
###----------fitting logistic regression survive~age
result=glm(survive~age,family=binomial("logit"))
summary(result)
## to get the specific coefficient, this command will produce a vector
coefficients(result)
## to access the estimated slope and turn it into the odds-ratio
exp(coefficients(result)[2])
confint(result) ## confidence interval for parameters
exp(confint(result)) ## exponentiate to get on the odds-scale
### Diagnostics Measures
lm.influence(result)
### Pearson Residuals v.s. observation
plot(residuals(result,type="pearson"),main="pearson residual plot")
### Deviance Residuals v.s. observation
plot(residuals(result,type="deviance"),main="deviance residual plot")
### Hat Diagonal Plot
plot(hatvalues(result),ylab="H",xlab="Case Number Index")
### Intercept DfBeta Plot
plot(dfbetas(result)[,1],ylab="DFBETA0",xlab="Case Number Index")
### Intercept DfBeta Plot
plot(dfbetas(result)[,2],ylab="DFBETA1",xlab="Case Number Index")
### Table age*survive
table=as.matrix(table(age,survive))
contigency_table=list(Frequency=table,Expected=chisq.test(table)$expected,Percent=prop.table(table),RowPct=prop.table(table,1),ColPct=prop.table(table,2))
contigency_table
chisq.test(table)
###----------fitting logistic regression survive~age+sex
result=glm(survive~age+sex,family=binomial("logit"))
summary(result)
confint(result) ## confidence interval for the parameters
### Diagnostics Measures
lm.influence(result)
### Pearson Residuals v.s. observation
plot(residuals(result,type="pearson"),main="pearson residual plot")
### Deviance Residuals v.s. observation
plot(residuals(result,type="deviance"),main="deviance residual plot")
### Hat Diagonal Plot
plot(hatvalues(result),ylab="H",xlab="Case Number Index")
### Intercept DfBeta Plot
plot(dfbetas(result)[,1],ylab="DFBETA0",xlab="Case Number Index")
### Intercept DfBeta Plot
plot(dfbetas(result)[,2],ylab="DFBETA1",xlab="Case Number Index")
###----------fitting logistic regression survive~age+sex+age*sex
donner=as.data.frame(donner)
sort(donner,c(donner$V1,donner$V2))
result=glm(survive~age+sex+age*sex,family=binomial("logit"))
out=data.frame(survive,age,sex,pi=result$fitted)
out
### Sort by sex age and Plot by sex group
group1=(out[which(sex==0),])[sort.int(age[which(sex==0)],index=TRUE)$ix,]
group2=(out[which(sex==1),])[sort.int(age[which(sex==1)],index=TRUE)$ix,]
plot(group1$age,group1$pi,col="red",type="l",xlim=c(10,70),ylim=c(0,1),ylab="Estimated Probability",xlab="age")
lines(group2$age,group2$pi,col="blue",type="c")
##what if we have two categorical predictors and want to add an interaction term?
We will get additional diagnostics needed for evaluating the overall fit of the model when having a continuous predictor(s) with lm.influence() function and many others in R. See the part labeled "Model Diagnostics".
Fitted model and Interpretation of parameter estimates
We can get the same model estimates as from the SAS output and the only difference is that R reports a z-value instead of Wald Chi-Square, but recall that \(z^2\approx \) Wald \(X^2\) (e.g., for age, \((-2.063)^2 \approx 4.2553)\).
Here is part of the R output:
> summary(result)
Call:
glm(formula = survive ~ age, family = binomial("logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5401 -1.1594 -0.4651 1.0842 1.7283
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.81852 0.99937 1.820 0.0688 .
age -0.06647 0.03222 -2.063 0.0391 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The fitted model is
\(\log\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}=1.8183-0.0665x_{1i}\)
The expected probability of survival of a 70 year-old individual (regardless of sex since it's not included in the model):
\(\hat{\pi}=Pr(\mbox{survival})=\dfrac{\exp(1.8183-0.0665(70))}{1+\exp(1.8183-0.0665(70))}=0.055\),
and
\(\exp(\hat{\beta}_0)=\exp(1.8183)=3.26\)
is the estimate of the odds of survival when age = 0; that is, the probability of survival is \(3.26 / (1 + 3.26) = 0.77\). The exponentiated slope is
\(\exp(-0.0665)=0.94\)
which means that every year increase in age multiplies the estimated odds of survival by 0.94. So the odds of survival decreases with age.
In SAS, this value and its 95% Wald Confidence Interval are given in the output under "Odds Ratio Estimates":
Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits age 0.936 0.878 0.997
In R, the odds-ratio estimate and its 95% Wald Confidence Interval (i.e., the asymptotic CI) if not directly provided but can be obtained as follows:
> exp(coefficients(result)[2])
age
0.9356907
> exp(confint.default(result)) ## exponentiate to get off the log scale
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.8691709 43.6958039
age 0.8784288 0.9966853
In general, this the usual 95% Wald CI, i.e., \(\hat{\beta}\pm 1.96\cdot SE(\beta)\), and then we need to exponentiate the ends to convert the CI to the odds-scale; for this example \(\exp(-0.0665\pm 1.96 \cdot 0.0322)=[0.878, 0.997]\).
Assuming the Type I error of 0.05, we reject the null hypothesis that \(\beta_1=0\), so there is weak evidence of a statistically significant relationship between the person's age and probability of survival. But how well does this model fit? Can we trust this inference?
Overall fit of the model
Let's look at the values listed for AGE. While some observations share the same AGE value, most of these are unique. Cross-tabulation of AGE vs. SURVIVAL gives a \(21\times 2\) table:
> table(donner[,1], donner[,2])
0 1
15 1 1
18 0 1
20 1 1
21 1 0
22 1 0
23 1 3
24 1 1
25 2 6
28 0 4
30 0 4
32 2 1
35 0 1
40 1 2
45 2 0
46 0 1
47 1 0
50 1 0
57 0 1
60 0 1
62 0 1
65 0 1
There are several alternatives. The most common are to:
- fit the desired model, fit an appropriate larger model (include other predictors or a quadratic term), and look at the differences in the log-likelihood ratio statistics (i.e., the difference in deviances).
- use the Hosmer-Lemeshow (HL) statistic.
When explanatory variables are continuous, it's difficult to analyze the lack of fit without some grouping. Recall that's what the Hosmer-Lemeshow test does; it groups by the predicted probabilities such that there is roughly an equal number of observations per group. Below are SAS and R outputs.
In SAS:
Partition for the Hosmer and Lemeshow Test | |||||
---|---|---|---|---|---|
Group | Total | survive = 1 | survive = 0 | ||
Observed | Expected | Observed | Expected | ||
1 | 5 | 0 | 0.57 | 5 | 4.43 |
2 | 7 | 3 | 1.82 | 4 | 5.18 |
3 | 4 | 3 | 1.65 | 1 | 2.35 |
4 | 4 | 1 | 1.82 | 3 | 2.18 |
5 | 4 | 2 | 1.96 | 2 | 2.04 |
6 | 8 | 2 | 4.31 | 6 | 3.69 |
7 | 6 | 3 | 3.40 | 3 | 2.60 |
8 | 7 | 6 | 4.47 | 1 | 2.53 |
Hosmer and Lemeshow Goodness-of-Fit Test | ||
---|---|---|
Chi-Square | DF | Pr > ChiSq |
8.5165 | 6 | 0.202 |
In the model statement, the option lackfit tells SAS to compute HL statistics and print the partitioning. In this example, there are 8 groups with roughly equal number of observations per group. For example, group 1 has 5 observations, and group 2 has 7. The HL Chi-Square statistic of 8.5165, with df=number of groups - 2=6, has a p-value of 0.2027 indicating that there is no issue with the lack-of-fit. There are some differences between SAS and R output because of differences in the cut points used to group the data, but the inferences are the same.
Conclusion: The model fits moderately well.
For R, there is no built-in function for the Hosmer-Lemeshow goodness-of-fit test. Instead, we can use the hosmerlem() function from the HosmerLemeshow.R file. If we run the following for the Donner example, where "result" is the saved model and g is the number of groups for the HL statistic, we obtain
> hosmerlem(survive,fitted(result), g=8)
X^2 Df P(>Chi)
6.3528592 6.0000000 0.3848447
The HL Chi-Square statistic of 6.353, with df=number of groups - 2=6, has a p-value of 0.385 indicating that there is no issue with the lack-of-fit. There are some differences between SAS and R output because of differences in the cut points used to group the data, but the inferences are the same.
Conclusion: The model fits reasonably well.
This will involve fitting different logistic regression models, some including both continuous and categorical predictors, and comparing them via likelihood-ratio test. First, we review the general set-up, where
\(H_0\): The probability of the characteristic does not depend on the \(p^{th}\) variable; that is, the reduced model fits the data.
\(H_A\): The probability of the characteristic depends on the \(p^{th}\) variable; that is, the full model fits the data.
This is accomplished by comparing the two models as given below:
Reduced Model
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\ldots+\beta_{p-1} x_{(p-1)i}\)
which contains all variables except the variable of interest p.
Full Model
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\ldots+\beta_{p-1} x_{(p-1)i}+\beta_p x_{pi}\)
which contains all variables including the variable of interest p.
Recall that the fit of each model is assessed by the likelihood or, equivalently, the log-likelihood. The larger the log-likelihood, the better the model fits the data. Adding parameters to the model can only result in an increase in the log-likelihood. So, the full model will always have a larger log-likelihood than the reduced model and thus fit better. But the question is whether that additional variable is really needed, or can we capture the variability in the response well enough with the simpler model?
Question: Is the log-likelihood for the full model significantly better than the log-likelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable p. For example, to test for the effect of sex in the Donner Party data, we compare
Reduced Model
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}\)
where \(x_{1i}\) is the age of a person \(i\).
Full Model
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)
where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the person's sex (with female corresponding to 0). If we are testing only for a single parameter, then this test should lead to the same inference as that for the test of /(H_0: \beta_2=0/) versus /(H_A: \beta_2\neq 0/).
In SAS: From the Model Fit Statistics in SAS get the likelihoods:
Model | -2 log likelihood |
---|---|
Reduced (age) | 56.291 |
Full (age and sex) | 51.256 |
The likelihood ratio test statistic is
\(G^2 =-2 \log \Lambda=56.291-51.256=5.035\)
Since \(G^2 =-2 \log \Lambda=5.035 > 5.02= \chi^2_1(0.975)\), we can reject the null hypothesis that sex has no effect on the survivorship probability at the \(\alpha = 0.025\) level.
If we study the table with parameter estimates and consider \(H_0\colon \beta_{sex}=0\) vs. \(H_a\colon \beta_{sex}\neq=0\), based on the Wald \(X^2\) and p-value of 0.0345, we can reject the null hypothesis and conclude that sex does have an effect on survivorship.
Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard | Wald | Pr > ChiSq |
Intercept | 1 | 3.2304 | 1.3870 | 5.4248 | 0.0199 |
age | 1 | -0.0782 | 0.0373 | 4.3988 | 0.0360 |
sex | 1 | -1.5973 | 0.7555 | 4.4699 | 0.0345 |
Fitted Model:
\(\hat{\pi}=Pr(survival)=\dfrac{\exp(3.2304-0.0782 x_1-1.5973 x_2)}{1+\exp(3.2304-0.0782 x_1-1.5973 x_2)}\)
Also, from SAS, the HL = 9.323, df = 7, p-value = 0.2305 indicates a moderate good fit.
For R, notice that this is the usual difference in the deviance statistics for the two models: the residual deviance for the age-only model (\(G^2= 56.291\)) and the age and sex model (\(G^2=51.256\)). Thus, \(G^2=-2 \log\Lambda = 56.291 - 51.256 = 5.035\), and we can reject the null hypothesis that sex has no effect on the survivorship probability.
If we study the table with parameter estimates and consider /(H_0: \beta_{sex}=0/) versus /(H_A: \beta_{sex}\neq=0/), based on the Wald \(X^2\) and p-value of 0.0345, we can reject the null hypothesis and conclude that sex does have an effect on survivorship.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.23041 1.38686 2.329 0.0198 *
age -0.07820 0.03728 -2.097 0.0359 *
sex -1.59729 0.75547 -2.114 0.0345 *
Fitted Model
\(\hat{\pi}=Pr(survival)=\dfrac{\exp(3.2304-0.0782 x_1-1.5973 x_2)}{1+\exp(3.2304-0.0782 x_1-1.5973 x_2)}\)
From R, hosmerlem(survive, fitted(result), g=10)
X^2 Df P(>Chi)
11.1771583 8.0000000 0.1918618
The HL = 11.177, df = 8, p-value = 0.192 indicates a moderate good fit.
Conclusions
- After taking into account the effects of age, females had higher survival probabilities than males (\(2 \log \Lambda = 5.035\), df = 1; p = 0.025).
- The odds that a male survives are estimated to be \(\exp(-1.5973) = 0.202\) times the odds that a female survives.
- By the Wald test, the survivorship probability decreases with increasing age (\(X_2 = 4.47\); df. = 1; p = 0.0345).
- The odds of surviving decline by a multiplicative factor of \(\exp(-0.0782) = 0.925\) per year of age.
Odds Ratio Estimates | |||
---|---|---|---|
Effect | Point Estimate | 95% Wald | |
age | 0.925 | 0.860 | 0.995 |
sex | 0.202 | 0.046 | 0.890 |
Computing Survivorship Probabilities
Question: What is the probability of survival for a 24 year old male? What about a 24 year old female?
For females:
\begin{align}\hat{\pi} &= \dfrac{\exp(3.2304-0.0782 \times 24-1.5973 \times 0)}{1+\exp(3.2304-0.0782 \times 24-1.5973 \times 0)}\\&= 0.806\end{align}
For males:
\begin{align}\hat{\pi} &= \dfrac{\exp(3.2304-0.0782 \times 24-1.5973 \times 1)}{1+\exp(3.2304-0.0782 \times 24-1.5973 \times 1)}\\&= 0.439\end{align}
As before, we can also calculate confidence intervals by using computed estimates of the standard errors and the fact that the estimates are approximately normal.
Graphical Summary of Model Results
Interaction between Age and Sex
The above analyses assume that the effect of sex on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for interaction compare the following models:
Reduced Model
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)
where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the sex.
Full Model
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\beta_3 x_{1i} x_{2i}\)
where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the sex.
In SAS, the interaction model is fitted via
proc logistic descending; model survive = age sex age*sex /lackfit influence iplots; output out=a p=pi; run;
Model | -2 log likelihood |
---|---|
Reduced (age, sex) | 51.256 |
Full (age, sex, interaction) | 47.346 |
From the output, we see that the likelihood ratio test statistic is \( G^2 =-2 \log \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)
Conclusion: At the \(\alpha=0.05\) level, we can reject the hypothesis that there is no interaction between age and sex. This is weak evidence in support of interaction, but it does support what we observed in the above figure. From the table of the analysis of the MLEs below, however, notice that we would fail to reject that \(\beta_{age*sex}\) is 0. This happens because the Wald chi-square statistic is sensitive to sparseness in the data.
Question: What is the fitted model, and what are the probabilities of survival for 24-year-old males and females?
Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard | Wald | Pr > ChiSq |
Intercept | 1 | 7.2450 | 3.2046 | 5.1114 | 0.0238 |
age | 1 | -0.1940 | 0.0874 | 4.9289 | 0.0264 |
sex | 1 | -6.9267 | 3.3983 | 4.1546 | 0.0415 |
age*sex | 1 | 0.1616 | 0.0942 | 2.9385 | 0.0865 |
For females
\begin{align}\hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}\\&= 0.930\end{align}
The odds of surviving are multiplied by a factor of \(\exp(-0.194) = 0.824\) per additional year of age.
For males
\begin{align} \hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\&= \dfrac{\exp(0.3183-0.0324 \times 24)}{1+\exp(0.3183-0.0324 \times 24)}\\&= 0.387\end{align}
The odds of surviving are multiplied by a factor of \(\exp(-0.0324) = .968\) per additional year of age. Note that the odds of survival decline more rapidly for females than for males.
In R, the interaction model is fitted via
result=glm(survive~age+sex+age*sex,family=binomial("logit"))
Model | -2 log likelihood |
---|---|
Reduced (age, sex) | 51.256 |
Full (age, sex, interaction) | 47.346 |
From the output, we see that the likelihood ratio test statistic is \( G^2 =-2 \log \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)
Conclusion: At the \(\alpha=0.05\) level, we can reject the hypothesis that there is no interaction between age and sex. This is weak evidence in support of interaction, but it does support what we observed in the above figure. From the table of the analysis of the MLEs below, however, notice that we would fail to reject that \(\beta_{age*sex}\) is 0. This happens because the Wald chi-square statistic is sensitive to sparseness in the data.
Question: What is the fitted model, and what are the probabilities of survival for 24 year old males and females?
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.24638 3.20517 2.261 0.0238 *
age -0.19407 0.08742 -2.220 0.0264 *
sex -6.92805 3.39887 -2.038 0.0415 *
age:sex 0.16160 0.09426 1.714 0.0865 .
For females:
\begin{align} \hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)} \\ &= 0.930 \end{align}
The odds of surviving are multiplied by a factor of \(\exp(-0.194) = 0.824\) per additional year of age.
For males:
\begin{align}\hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\&= \dfrac{\exp(0.3183-0.0324 \times 24)}{1+\exp(0.3183-0.0324 \times 24)}\\&= 0.387\end{align}
The odds of surviving are multiplied by a factor of \(\exp(-0.0324) = .968\) per additional year of age. Note that the odds of survival decline more rapidly for females than for males.