7.1 - Logistic Regression with Continuous Covariates

Example: Donner Party

In 1846, the Donner party (the Donner and Reed families) left Springfield, Illinois, for California in covered wagons. After reaching Fort Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains, at a place now called Donner Pass, when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 of the 87 members had died.

Three variables:

\(Y_i = 1\) if the \(i\)th person survived
\(Y_i = 0\) if \(i\)th person died

\(x_{1i} = \) age of \(i\)th person.

\(x_{2i} = 1\) if \(i\)th person is male
\(x_{2i} = 0\) if \(i\)th person is female.

Questions of primary interest are:

  1. What is the relationship between survival and sex? Are males more or less likely to survive than females?
  2. What is the relationship between survival and age? Can the probability of survival be predicted as a function of age?
  3. Does age affect the survival rate of males and females differently?

The data are available in the file donner.dat and contain \( N = 45\) observations in the following columns: 1) Age, 2) Sex (1 = male, 0 = female), 3) Survived (1 = yes, 0 = no).


Analysis

Marginal Analysis: The two-way contingency table below summarizes the counts between sex and survival, ignoring age.  We can use a chi-square test of independence or a binary logistic regression model.

Survivorship by Sex

Sex       Survived   Died   Total
Female        10        5      15
Male          10       20      30
Total         20       25      45

Test:

\(H_0\): Survivorship is independent of sex
\(H_a\): Survivorship is associated with sex
 

Conclusion: The null hypothesis is rejected (\(X^2 = 4.50\), df = 1, p-value = 0.0339). We may conclude that survivorship is (marginally) associated with sex. About 67% of females survived, whereas about 33% of males survived. Apparently, females' chance of survival is twice that of males.
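As a quick check, the marginal chi-square test can be reproduced in R directly from the 2×2 counts above (the table is hard-coded here rather than read from the data file):

```r
# Counts from the table above: rows = sex, columns = fate
tab <- matrix(c(10,  5,
                10, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("female", "male"),
                              fate = c("survived", "died")))
chisq.test(tab, correct = FALSE)  # X-squared = 4.5, df = 1, p-value = 0.034
```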

Problem! However, from what we've seen with Simpson's Paradox, we know the potential effect a third variable, such as age, may have. The above analysis does not consider any connection between survivorship and age. We would prefer to take this into account.

Naive model: The linear regression model assumes the probability of survival, given age, has the following form:

\(\pi=P(Y=1|x_1)=\beta_0+\beta_1 x_1\)

Problem! Two issues arise, both stemming from the fact that \(Y\) is binomial with mean \(\pi\) and variance \(\pi(1-\pi)\):
  • Non-linearity – Modeling the mean of \(Y\) as a linear function of the predictors may produce estimates that fall outside the (0, 1) range.
  • Heteroscedasticity – The variance of \(Y\) depends on \(\pi\), which depends on \(x_1\) and is therefore not constant.
[Figure: Fitted line plot of survived versus age, with the estimated linear regression line, from PROC SGPLOT.]

 

Above is a plot of the fitted linear regression model with the observed data. The estimated model is \(P(\mbox{survival} | \mbox{age})=0.8692-0.01336\,\mbox{age}\). Consider predicting survival for a 70-year-old person: \(0.8692-0.01336(70) = -0.0658\)! This model predicts a negative probability of survival.

Logistic regression model:

\(\pi_i=\dfrac{\exp(\beta_0+\beta_1 x_{1i})}{1+\exp(\beta_0+\beta_1 x_{1i})}\)

\(\log\dfrac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 x_{1i}\)

where \( \pi_i\) is the probability of survival for the \(i\)th person, and \(x_{1i}\) is the age of the \(i\)th person.
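The two displayed forms are equivalent: applying the inverse logit to the linear predictor returns \(\pi_i\). A small R sketch (the coefficient values below are made up for illustration, not the fitted estimates):

```r
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))  # same as base R's plogis()
logit     <- function(p) log(p / (1 - p))             # same as qlogis()

# With hypothetical coefficients b0, b1, the two forms agree:
b0 <- 1.0; b1 <- -0.05; age <- 40
pi_hat <- inv_logit(b0 + b1 * age)
all.equal(logit(pi_hat), b0 + b1 * age)  # TRUE
```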

Below is the donner.sas SAS program. It reads the data file from a given path (be sure to update the "infile" statement to point to where you have saved the data).

options ls=78;
title "Donner Party Data - Logistic Regression";
/********read in the data*************************/
data donner;
  infile "C:\Temp\donner.txt";
  input age sex survive;
run;
proc freq;
  tables sex*survive / chisq expected;
run;
/************one-way ANOVA*************/
ods graphics on;
proc mixed data=donner plots=all;
  class sex;
  model survive=sex;
run;
ods graphics off;
/*************OLS Regression************/
proc sgplot data=donner;
  xaxis label="age";
  yaxis label="survived";
  scatter x=age y=survive;
  reg x=age y=survive;
  title "Fitted Line Plot";
run;
quit;
title;
proc reg data=donner;
  model survive = age;
run;
/***************Logistic Regression****************/
proc logistic descending;
  model survive = age / lackfit influence iplots;
run;
proc freq;
  tables age*survive / chisq expected;
run;
proc logistic descending;
  model survive = age sex / lackfit influence iplots;
run;
proc logistic descending;
  model survive = age sex age*sex / lackfit influence iplots;
  output out=a p=pi;
run;
proc format;
  value sexfmt 1='male' 0='female';
run;
proc sgscatter data=a;
  plot pi*age / group=sex pbspline;
  format sex sexfmt.;
run;

The influence and iplots options in SAS provide additional diagnostics that are needed to evaluate the overall fit of the model when the model has continuous predictors.

Fitted model and Interpretation of parameter estimates

Let's look at part of the output from donner.lst. SAS and R give the same model estimates; the only difference is that R reports a z-value instead of the Wald Chi-Square, but recall that \(z^2\approx\) Wald \(X^2\) (e.g., for age, \((-2.063)^2 \approx 4.2553\)).

 

Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1      1.8183            0.9993             3.3106        0.0688
age           1     -0.0665            0.0322             4.2553        0.0391

 

Odds Ratio Estimates

Effect    Point Estimate    95% Wald Confidence Limits
age                0.936         0.878        0.997

 

Below is the part of the R code from donner.R that corresponds to donner.sas. For R, we need to adjust the read.table() line depending on where the file is saved.

donner=read.table("donner.txt")
survive=donner[,3]
age=donner[,1]
sex=donner[,2]
### Table sex*survive
table=as.matrix(table(sex,survive))
contingency_table=list(Frequency=table,Expected=chisq.test(table)$expected)
contingency_table
chisq.test(table, correct=FALSE) ## chi-square test of independence
LM=lm(survive~age) ## linear regression
summary(LM)
LMANOVA=anova(LM) ## anova
LMANOVA
### Plot Survive*Age
plot(age,survive,xlim=c(15,70),ylim=c(-1.5,2.0),main="survive vs. age")
abline(LM,col="red")
abline(confint(LM)[1],confint(LM)[2],col="green")
abline(confint(LM)[3],confint(LM)[4],col="purple")
### Plot Predicted*Age
plot(age,fitted(LM),main="Predicted vs. Age")
### Q-Q plot
qqnorm(residuals(LM),main="Q-Q Plot")
### Studentized residuals vs. observation
plot(rstudent(LM),main="Studentized residual vs. observation")
abline(h=0)
###----------fitting logistic regression survive~age
result=glm(survive~age,family=binomial("logit"))
summary(result)
## to get the specific coefficients, this command produces a vector
coefficients(result)
## to access the estimated slope and turn it into the odds ratio
exp(coefficients(result)[2])
confint(result) ## confidence intervals for the parameters
exp(confint(result)) ## exponentiate to get on the odds scale
### Diagnostic Measures
lm.influence(result)
### Pearson Residuals vs. observation
plot(residuals(result,type="pearson"),main="pearson residual plot")
### Deviance Residuals vs. observation
plot(residuals(result,type="deviance"),main="deviance residual plot")
### Hat Diagonal Plot
plot(hatvalues(result),ylab="H",xlab="Case Number Index")
### Intercept DfBeta Plot
plot(dfbetas(result)[,1],ylab="DFBETA0",xlab="Case Number Index")
### Slope DfBeta Plot
plot(dfbetas(result)[,2],ylab="DFBETA1",xlab="Case Number Index")
### Table age*survive
table=as.matrix(table(age,survive))
contingency_table=list(Frequency=table,Expected=chisq.test(table)$expected,Percent=prop.table(table),RowPct=prop.table(table,1),ColPct=prop.table(table,2))
contingency_table
chisq.test(table)
###----------fitting logistic regression survive~age+sex
result=glm(survive~age+sex,family=binomial("logit"))
summary(result)
confint(result) ## confidence intervals for the parameters
### Diagnostic Measures
lm.influence(result)
### Pearson Residuals vs. observation
plot(residuals(result,type="pearson"),main="pearson residual plot")
### Deviance Residuals vs. observation
plot(residuals(result,type="deviance"),main="deviance residual plot")
### Hat Diagonal Plot
plot(hatvalues(result),ylab="H",xlab="Case Number Index")
### Intercept DfBeta Plot
plot(dfbetas(result)[,1],ylab="DFBETA0",xlab="Case Number Index")
### Slope DfBeta Plot
plot(dfbetas(result)[,2],ylab="DFBETA1",xlab="Case Number Index")
###----------fitting logistic regression survive~age+sex+age*sex
donner=as.data.frame(donner)
donner[order(donner$V1,donner$V2),] ## sort by age, then sex
result=glm(survive~age+sex+age*sex,family=binomial("logit"))
out=data.frame(survive,age,sex,pi=result$fitted)
out
### Sort by sex and age, and plot by sex group
group1=(out[which(sex==0),])[sort.int(age[which(sex==0)],index=TRUE)$ix,]
group2=(out[which(sex==1),])[sort.int(age[which(sex==1)],index=TRUE)$ix,]
plot(group1$age,group1$pi,col="red",type="l",xlim=c(10,70),ylim=c(0,1),ylab="Estimated Probability",xlab="age")
lines(group2$age,group2$pi,col="blue",type="c")
## what if we have two categorical predictors and want to add an interaction term?

The lm.influence() function, among others in R, provides additional diagnostics needed to evaluate the overall fit of the model when the model has continuous predictors. See the part labeled "Model Diagnostics".

Fitted model and Interpretation of parameter estimates

We get the same model estimates as from the SAS output; the only difference is that R reports a z-value instead of the Wald Chi-Square, but recall that \(z^2\approx\) Wald \(X^2\) (e.g., for age, \((-2.063)^2 \approx 4.2553\)).

Here is part of the R output:

> summary(result)

Call:
glm(formula = survive ~ age, family = binomial("logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5401  -1.1594  -0.4651   1.0842   1.7283  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.81852    0.99937   1.820   0.0688 .
age         -0.06647    0.03222  -2.063   0.0391 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The fitted model is

\(\log\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}=1.8183-0.0665x_{1i}\)

The expected probability of survival for a 70-year-old individual (regardless of sex, since sex is not included in this model) is:

\(\hat{\pi}=Pr(\mbox{survival})=\dfrac{\exp(1.8183-0.0665(70))}{1+\exp(1.8183-0.0665(70))}=0.055\),

and

\(\exp(\hat{\beta}_0)=\exp(1.8183)=6.16\)

is the estimate of the odds of survival when age = 0; that is, the probability of survival at age 0 is \(6.16 / (1 + 6.16) = 0.86\). The exponentiated slope is

\(\exp(-0.0665)=0.94\)

which means that each additional year of age multiplies the estimated odds of survival by 0.94, so the odds of survival decrease with age.
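These plug-in calculations are easy to reproduce from the coefficient estimates (hard-coded below from the fitted age-only model):

```r
b0 <- 1.8183; b1 <- -0.0665   # estimates from the age-only fit

# Predicted survival probability at age 70 (plogis is R's inverse logit)
plogis(b0 + b1 * 70)          # about 0.055

# Estimated odds of survival at age = 0, and the corresponding probability
exp(b0)                       # about 6.16
plogis(b0)                    # 6.16 / (1 + 6.16), about 0.86

# Odds multiplier per additional year of age
exp(b1)                       # about 0.94
```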

In SAS, this value and its 95% Wald Confidence Interval are given in the output under "Odds Ratio Estimates":

                             Odds Ratio Estimates
                                           Point          95% Wald
                              Effect    Estimate      Confidence Limits
                              age          0.936       0.878       0.997

In R, the odds-ratio estimate and its 95% Wald confidence interval (i.e., the asymptotic CI) are not directly provided but can be obtained as follows:

> exp(coefficients(result)[2])
      age 
0.9356907 

> exp(confint.default(result)) ## exponentiate to get off the log scale
                2.5 %     97.5 %
(Intercept) 0.8691709 43.6958039
age         0.8784288  0.9966853

In general, this is the usual 95% Wald CI, \(\hat{\beta}\pm 1.96\cdot SE(\hat{\beta})\), computed on the log-odds scale; we then exponentiate the endpoints to convert the CI to the odds scale. For this example, \(\exp(-0.0665\pm 1.96 \cdot 0.0322)=[0.878, 0.997]\).
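A quick R check of that arithmetic (estimate and standard error hard-coded from the output above):

```r
est <- -0.0665; se <- 0.0322           # slope for age and its standard error
ci_log <- est + c(-1, 1) * 1.96 * se   # Wald CI on the log-odds scale
exp(ci_log)                            # about 0.878 and 0.997
```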

Assuming a Type I error rate of 0.05, we reject the null hypothesis that \(\beta_1=0\): there is evidence of a statistically significant relationship between a person's age and the probability of survival. But how well does this model fit? Can we trust this inference?

Overall fit of the model

Let's look at the values listed for AGE. While some observations share the same AGE value, most are unique. Cross-tabulating AGE with SURVIVAL gives a \(21\times 2\) table:

> table(donner[,1], donner[,2])
    
     0 1
  15 1 1
  18 0 1
  20 1 1
  21 1 0
  22 1 0
  23 1 3
  24 1 1
  25 2 6
  28 0 4
  30 0 4
  32 2 1
  35 0 1
  40 1 2
  45 2 0
  46 0 1
  47 1 0
  50 1 0
  57 0 1
  60 0 1
  62 0 1
  65 0 1
            
Issue! The table is sparse (small observed counts, which also produce small fitted values), so the sample-size requirement for the Pearson chi-square and likelihood-ratio goodness-of-fit test statistics is not met. In addition, if more data were collected, more ages could appear, so the number of cells is not fixed. This is almost always the case when dealing with continuous predictors such as age. Thus, the \(X^2\) and \(G^2\) statistics for logistic regression models with continuous predictors do not have approximate chi-square distributions and are not the best statistics for evaluating the overall fit of the model.

There are several alternatives. The most common are to:

  • fit the desired model, fit an appropriate larger model (include other predictors or a quadratic term), and look at the differences in the log-likelihood ratio statistics (i.e., the difference in deviances).
  • use the Hosmer-Lemeshow (HL) statistic.

When explanatory variables are continuous, it's difficult to analyze the lack of fit without some grouping. Recall that's what the Hosmer-Lemeshow test does; it groups by the predicted probabilities such that there is roughly an equal number of observations per group. Below are SAS and R outputs.

In SAS:

 

Partition for the Hosmer and Lemeshow Test

                     survive = 1             survive = 0
Group    Total    Observed    Expected    Observed    Expected
  1        5          0          0.57         5          4.43
  2        7          3          1.82         4          5.18
  3        4          3          1.65         1          2.35
  4        4          1          1.82         3          2.18
  5        4          2          1.96         2          2.04
  6        8          2          4.31         6          3.69
  7        6          3          3.40         3          2.60
  8        7          6          4.47         1          2.53
 

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    8.5165     6        0.2027

In the model statement, the lackfit option tells SAS to compute the HL statistic and print the partitioning. In this example, there are 8 groups with roughly equal numbers of observations per group; for example, group 1 has 5 observations and group 2 has 7. The HL chi-square statistic of 8.5165, with df = (number of groups − 2) = 6, has a p-value of 0.2027, indicating no issue with lack of fit. There are some differences between the SAS and R output because of differences in the cut points used to group the data, but the inferences are the same.

Conclusion: The model fits moderately well.

For R, there is no built-in function for the Hosmer-Lemeshow goodness-of-fit test. Instead, we can use the hosmerlem() function from the HosmerLemeshow.R file. If we run the following for the Donner example, where "result" is the saved model and g is the number of groups for the HL statistic, we obtain

> hosmerlem(survive, fitted(result), g=8) 
      X^2        Df   P(>Chi) 
6.3528592 6.0000000 0.3848447 

The HL chi-square statistic of 6.353, with df = (number of groups − 2) = 6, has a p-value of 0.385, indicating no issue with lack of fit. There are some differences between the SAS and R output because of differences in the cut points used to group the data, but the inferences are the same.

Conclusion: The model fits reasonably well.
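The HosmerLemeshow.R file itself is not reproduced here, but a common implementation of the statistic follows the sketch below. This is an assumption about what hosmerlem() does (group by quantiles of the fitted probabilities, then compare observed and expected counts per group), not the course file verbatim:

```r
# Sketch of a Hosmer-Lemeshow test: y is the 0/1 response,
# yhat the fitted probabilities, g the number of groups.
hosmerlem <- function(y, yhat, g = 10) {
  # Cut the fitted probabilities into g groups of roughly equal size
  # (quantile() can produce duplicate breaks if yhat has many ties)
  grp <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
             include.lowest = TRUE)
  obs    <- xtabs(cbind(1 - y,    y)    ~ grp)  # observed deaths / survivals
  expect <- xtabs(cbind(1 - yhat, yhat) ~ grp)  # expected counts under the model
  chisq  <- sum((obs - expect)^2 / expect)
  c(`X^2` = chisq, Df = g - 2, `P(>Chi)` = 1 - pchisq(chisq, g - 2))
}
```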

The other alternative, comparing nested models, involves fitting different logistic regression models, some including both continuous and categorical predictors, and comparing them via a likelihood-ratio test. First, we review the general set-up, where

\(H_0\): The probability of the characteristic does not depend on the \(p^{th}\) variable; that is, the reduced model fits the data.

\(H_A\): The probability of the characteristic depends on the \(p^{th}\) variable; that is, the full model fits the data.

This is accomplished by comparing the two models as given below:

Reduced Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\ldots+\beta_{p-1} x_{(p-1)i}\)

which contains all variables except the variable of interest p.

Full Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\ldots+\beta_{p-1} x_{(p-1)i}+\beta_p x_{pi}\)

which contains all variables including the variable of interest p.

Recall that the fit of each model is assessed by the likelihood or, equivalently, the log-likelihood. The larger the log-likelihood, the better the model fits the data. Adding parameters to the model can only result in an increase in the log-likelihood. So, the full model will always have a larger log-likelihood than the reduced model and thus fit better. But the question is whether that additional variable is really needed, or can we capture the variability in the response well enough with the simpler model?

 Question: Is the log-likelihood for the full model significantly better than the log-likelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable p. For example, to test for the effect of sex in the Donner Party data, we compare

Reduced Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}\)

where \(x_{1i}\) is the age of a person \(i\).

Full Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)

where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the person's sex (with female corresponding to 0). If we are testing only a single parameter, then this test should lead to the same inference as the test of \(H_0\colon \beta_2=0\) versus \(H_A\colon \beta_2\neq 0\).

In SAS, we obtain the log-likelihoods from the Model Fit Statistics table:

Model                 -2 log likelihood
Reduced (age)                    56.291
Full (age and sex)               51.256

The likelihood ratio test statistic is

\(G^2 =-2 \log \Lambda=56.291-51.256=5.035\)

Since \(G^2 =-2 \log \Lambda=5.035 > 5.02= \chi^2_1(0.975)\), we can reject the null hypothesis that sex has no effect on the survivorship probability at the \(\alpha = 0.025\) level.
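The same comparison can be expressed as a p-value (deviances hard-coded from the table above):

```r
G2 <- 56.291 - 51.256    # difference in -2 log likelihood, G2 = 5.035
1 - pchisq(G2, df = 1)   # about 0.025, so reject at the 0.05 level
```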

If we study the table of parameter estimates and consider \(H_0\colon \beta_{sex}=0\) vs. \(H_a\colon \beta_{sex}\neq 0\), then based on the Wald \(X^2\) and p-value of 0.0345, we reject the null hypothesis and conclude that sex does have an effect on survivorship.

 

Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1      3.2304            1.3870             5.4248        0.0199
age           1     -0.0782            0.0373             4.3988        0.0360
sex           1     -1.5973            0.7555             4.4699        0.0345

                             

Fitted Model:

\(\hat{\pi}=Pr(survival)=\dfrac{\exp(3.2304-0.0782 x_1-1.5973 x_2)}{1+\exp(3.2304-0.0782 x_1-1.5973 x_2)}\)

Also, from SAS, HL = 9.323, df = 7, p-value = 0.2305 indicates a moderately good fit.

For R, notice that this is the usual difference in the deviance statistics for the two models: the residual deviance for the age-only model (\(G^2= 56.291\)) and the age and sex model (\(G^2=51.256\)). Thus, \(G^2=-2 \log\Lambda = 56.291 - 51.256 = 5.035\), and we can reject the null hypothesis that sex has no effect on the survivorship probability.
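In R, this comparison is conveniently done with anova(). The sketch below uses simulated stand-in data so that it runs on its own; the real analysis would use survive, age, and sex as read in donner.R:

```r
# Simulated stand-in data (hypothetical; the real analysis uses donner.txt)
set.seed(1)
age <- sample(15:65, 45, replace = TRUE)
sex <- rbinom(45, 1, 0.5)
survive <- rbinom(45, 1, plogis(3.2 - 0.08 * age - 1.6 * sex))

reduced <- glm(survive ~ age,       family = binomial("logit"))
full    <- glm(survive ~ age + sex, family = binomial("logit"))

# The "Deviance" column is G^2 = (residual deviance of reduced)
#                               - (residual deviance of full), on 1 df
anova(reduced, full, test = "Chisq")
```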

If we study the table of parameter estimates and consider \(H_0\colon \beta_{sex}=0\) versus \(H_A\colon \beta_{sex}\neq 0\), then based on the Wald \(X^2\) and p-value of 0.0345, we reject the null hypothesis and conclude that sex does have an effect on survivorship.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  3.23041    1.38686   2.329   0.0198 *
age         -0.07820    0.03728  -2.097   0.0359 *
sex         -1.59729    0.75547  -2.114   0.0345 *

Fitted Model

\(\hat{\pi}=Pr(survival)=\dfrac{\exp(3.2304-0.0782 x_1-1.5973 x_2)}{1+\exp(3.2304-0.0782 x_1-1.5973 x_2)}\)

From R:

> hosmerlem(survive, fitted(result), g=10)
       X^2         Df    P(>Chi)
11.1771583  8.0000000  0.1918618

The HL = 11.177, df = 8, p-value = 0.192 indicates a moderately good fit.

Conclusions

  • After taking into account the effect of age, females had higher survival probabilities than males (\(-2 \log \Lambda = 5.035\), df = 1, p = 0.025).
  • The odds that a male survives are estimated to be \(\exp(-1.5973) = 0.202\) times the odds that a female survives.
  • By the Wald test, the survivorship probability decreases with increasing age (\(X^2 = 4.40\), df = 1, p = 0.0360).
  • The odds of surviving decline by a multiplicative factor of \(\exp(-0.0782) = 0.925\) per year of age.
 

Odds Ratio Estimates

Effect    Point Estimate    95% Wald Confidence Limits
age                0.925         0.860        0.995
sex                0.202         0.046        0.890

Computing Survivorship Probabilities

 Question: What is the probability of survival for a 24-year-old male? What about a 24-year-old female?

For females:

\begin{align}\hat{\pi} &= \dfrac{\exp(3.2304-0.0782 \times 24-1.5973 \times 0)}{1+\exp(3.2304-0.0782 \times 24-1.5973 \times 0)}\\&= 0.795\end{align}

For males:

\begin{align}\hat{\pi} &=  \dfrac{\exp(3.2304-0.0782 \times 24-1.5973 \times 1)}{1+\exp(3.2304-0.0782 \times 24-1.5973 \times 1)}\\&= 0.439\end{align}
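Both plug-in values can be checked with plogis(), R's inverse logit (coefficients hard-coded from the age + sex fit):

```r
b <- c(3.2304, -0.0782, -1.5973)     # intercept, age, sex (male = 1)

plogis(b[1] + b[2] * 24 + b[3] * 0)  # female, age 24: about 0.795
plogis(b[1] + b[2] * 24 + b[3] * 1)  # male,   age 24: about 0.439
```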

As before, we can also calculate confidence intervals by using computed estimates of the standard errors and the fact that the estimates are approximately normal.

Graphical Summary of Model Results

[Figure: Estimated survival probability versus age, plotted separately for males and females, from PROC SGSCATTER.]

Interaction between Age and Sex

The above analyses assume that the effect of sex on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for an interaction, compare the following models:

Reduced Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)

where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the sex.

Full Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\beta_3 x_{1i} x_{2i}\)

where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the sex.

In SAS, the interaction model is fitted via

proc logistic descending;
  model survive = age sex age*sex / lackfit influence iplots;
  output out=a p=pi;
run;

Model                           -2 log likelihood
Reduced (age, sex)                         51.256
Full (age, sex, interaction)               47.346

From the output, we see that the likelihood ratio test statistic is \( G^2 =-2 \log \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)

Conclusion: At the \(\alpha=0.05\) level, we can reject the hypothesis that there is no interaction between age and sex. This is weak evidence in support of interaction, but it does support what we observed in the above figure. From the table of the analysis of the MLEs below, however, notice that we would fail to reject that \(\beta_{age*sex}\) is 0. This happens because the Wald chi-square statistic is sensitive to sparseness in the data.

 Question: What is the fitted model, and what are the probabilities of survival for 24-year-old males and females?

 

Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1      7.2450            3.2046             5.1114        0.0238
age           1     -0.1940            0.0874             4.9289        0.0264
sex           1     -6.9267            3.3983             4.1546        0.0415
age*sex       1      0.1616            0.0942             2.9385        0.0865

 

For females

\begin{align}\hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}\\&= 0.930\end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.194) = 0.824\) per additional year of age.

For males

\begin{align} \hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\&= \dfrac{\exp(0.3183-0.0324 \times 24)}{1+\exp(0.3183-0.0324 \times 24)}\\&= 0.387\end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.0324) = 0.968\) per additional year of age. Note that the odds of survival decline more rapidly for females than for males.
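The same style of check works for the interaction model (coefficients hard-coded from the table above):

```r
b <- c(7.2450, -0.1940, -6.9267, 0.1616)  # intercept, age, sex, age*sex

# Females (sex = 0): intercept 7.2450, age slope -0.1940
plogis(b[1] + b[2] * 24)                  # age 24: about 0.930
# Males (sex = 1): intercept 7.2450 - 6.9267 = 0.3183,
# age slope -0.1940 + 0.1616 = -0.0324
plogis(b[1] + b[3] + (b[2] + b[4]) * 24)  # age 24: about 0.387
```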

In R, the interaction model is fitted via


result=glm(survive~age+sex+age*sex,family=binomial("logit"))

Model                           -2 log likelihood
Reduced (age, sex)                         51.256
Full (age, sex, interaction)               47.346

From the output, we see that the likelihood ratio test statistic is \( G^2 =-2 \log \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)

Conclusion: At the \(\alpha=0.05\) level, we can reject the hypothesis that there is no interaction between age and sex. This is weak evidence in support of interaction, but it does support what we observed in the above figure. From the table of the analysis of the MLEs below, however, notice that we would fail to reject that \(\beta_{age*sex}\) is 0. This happens because the Wald chi-square statistic is sensitive to sparseness in the data.

 Question: What is the fitted model, and what are the probabilities of survival for 24-year-old males and females?

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  7.24638    3.20517   2.261   0.0238 *
age         -0.19407    0.08742  -2.220   0.0264 *
sex         -6.92805    3.39887  -2.038   0.0415 *
age:sex      0.16160    0.09426   1.714   0.0865 .

For females:

\begin{align} \hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)} \\ &= 0.930 \end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.194) = 0.824\) per additional year of age.

For males:

\begin{align}\hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\&= \dfrac{\exp(0.3183-0.0324 \times 24)}{1+\exp(0.3183-0.0324 \times 24)}\\&= 0.387\end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.0324) = 0.968\) per additional year of age. Note that the odds of survival decline more rapidly for females than for males.