7: Further Topics on Logistic Regression

7: Further Topics on Logistic Regression

Overview

In the previous lesson, we dealt with basic topics of logistic regression. Recall that logistic regression is a special type of regression where the probability of "success" is modeled through a set of predictors. The predictors may be categorical, nominal or ordinal, or continuous. Logit or log odds of success is assumed to be a linear function of the predictors and such a model is fitted. This lesson deals with continuous covariates and model diagnostics for logistic regression.

Objectives
Upon completion of this lesson, you should be able to:

No objectives have been defined for this lesson yet.

 Lesson 7 Code Files

Data File: donner.dat


7.1 - Logistic Regression with Continuous Covariates

7.1 - Logistic Regression with Continuous Covariates

Example: Donner Party

Image of the Donner PassIn 1846, the Donner party (Donner and Reed families) left Springfield, Illinois for California in covered wagons. After reaching Fort Bridger, Wyoming, the leaders decided to find a new route to Sacramento. They became stranded in the eastern Sierra Nevada mountains at a place now called Donner Pass (right) when the region was hit by heavy snows in late October. By the time the survivors were rescued on April 21, 1847, 40 out of 87 had died.

Three variables:

\(Y_i = 1\) if the \(i\)th person survived
\(Y_i = 0\) if \(i\)th person died

\(x_{1i} = \) age of \(i\)th person.

\(x_{2i} = 1\) if \(i\)th person is male
\(x_{2i} = 0\) if \(i\)th person is female.

Questions of primary interest are:

  1. What is the relationship between survival and sex? Are males more or less likely to survive than females?
  2. What is the relationship between survival and age? Can the probability of survival be predicted as a function of age?
  3. Does age affect the survival rate of males and females differently?

The data are available in the file donner.dat and contain \( N = 45\) observations in the following columns: 1) Age, 2) Sex (1 = male, 0 = female), 3) Survived (1 = yes, 0 = no).


Analysis

Marginal Analysis: The two-way contingency table below summarizes the counts between sex and survival, ignoring age.  We can use a chi-square test of independence or a binary logistic regression model.

Survivorship
Sex Survived Died Total
Female 10 5 15
Male 10 20 30
Total 20 25 45

Test:

\(H_0\): Survivorship is independent of sex
\(H_a\): Survivorship is associated with sex
 

Conclusion: The null hypothesis is rejected (\(X^2 = 4.50\), df = 1, p-value = 0.0339). We may conclude that survivorship is (marginally) associated with sex. About 67% of females survived, whereas about 33% of males survived. Apparently, females' chance of survival is twice that of males.

Problem! However, from what we've seen with Simpson's Paradox, we know the potential effect a third variable, such as age, may have. The above analysis does not consider any connection between survivorship and age. We would prefer to take this into account.

Naive model: The linear regression model assumes the probability of survival, given age, has the following form:

\(\pi=P(Y=1|x_1)=\beta_0+\beta_1 x_1\)

Problem! Both of these issues stem from the fact that \(Y\) is binomial with mean \(\pi\) and variance \(\pi(1-\pi)\).
  • Non-linearity – Modeling the mean of \(Y\) as a linear function of predictors may result in estimates that fall outside the (0, 1) range.
  • Heteroscedasticity – The variance of \(Y\) depends on \(\pi\), which depends on \(x_1\) and is therefore not constant.
The SGPlot Procedure 20 30 40 50 60 age 0.0 0.2 0.4 0.6 0.8 1.0 survived Regression survive Fitted Line Plot

 

Above is a plot of the fitted linear regression model with the observed data. The estimated model is \(P(\mbox{survival} | \mbox{age})=0.8692-0.01336\mbox{age}\). Consider predicting survival for a 70 year-old person: \(0.8692-0.01336( 70) = - 0.0658\)! This model predicts a negative probability of survival.

Logistic regression model:

\(\pi_i=\dfrac{\exp(\beta_0+\beta_1 x_{1i})}{1+\exp(\beta_0+\beta_1 x_{1i})}\)

\(\log\dfrac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 x_{1i}\)

where \( \pi_i\) is the probability of survival for the \(i\)th person, and \(x_{1i}\) is the age of the \(i\)th person.

Below is the donner.sas SAS program. It reads the datafile from a given path (make sure to update the "infile" statement to point where you have the data saved).

options ls=78;
title "Donner Party Data - Logistic Regression";

/********read into data***************************/

data donner;
  infile "C:\Temp\donner.txt";
  input age sex survive;
  run;
proc freq;
  tables sex*survive / chisq expected;
  run;

/************one way ANOVA*************/

ods graphics on;
proc mixed data=donner plot=all;
class sex;
model survive=sex;
run;
ods graphics off;

/*************OLS Regression************/

proc sgplot data=donner;
xaxis label= "age";
yaxis label="survived";
scatter x=age y=survive;
reg x=age y=survive;
title "Fitted Line Plot";
run;
quit;

title;

proc reg data=donner;
model survive = age;
run;


/***************Logistics Regression****************/


proc logistic descending;
  model survive = age / lackfit influence iplots;
  run;

proc freq;
  tables age*survive /chisq expected;
  run;

proc logistic descending;
  model survive = age sex /lackfit influence iplots;
  run;

proc logistic descending;
  model survive = age sex age*sex /lackfit influence iplots;
  output out=a p=pi;
  run;


proc format;
  value sexfmt 1= 'male' 0='female';
  run;

proc sgscatter data=a ;
  plot pi*age / group= sex pbspline;
  format sex sexfmt.;
run;

We will get additional diagnostics needed for evaluating overall fit of the model when having a continuous predictor(s) with influence iplots options in SAS.

Fitted model and Interpretation of parameter estimates

Let's look at a part of the output from donner.lst. Just as in R, we get the same model estimates, and the only difference is that R report a z-value instead of the Wald Chi-Square, but recall that \(z^2\approx \) Wald \(X^2\) (e.g., for age, \((-2.063)^2 \approx 4.2553)\).

 
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 1.8183 0.9993 3.3106 0.0688
age 1 -0.0665 0.0322 4.2553 0.0391
 
Odds Ratio Estimates
Effect Point Estimate 95% Wald
Confidence Limits
age 0.936 0.878 0.997
 

Below is the part of R code from donner.R that corresponds to donner.sas. For R, we need to adjust the read.table() line depending on where the file is saved.

donner=read.table("donner.txt")
survive=donner[,3]
age=donner[,1]
sex=donner[,2]

### Table sex*survive
table=as.matrix(table(sex,survive))
contingency_table=list(Frequency=table,Expected=chisq.test(table)$expected)
contingency_table
chisq.test(table, correct=FALSE) ## chisq. test of independence 
LM=lm(survive~age) ## linear regression
summary(LM)
LMANOVA=anova(LM) ## anova
LMANOVA

### Plot Survive*Age

plot(age,survive,xlim=c(15,70),ylim=c(-1.5,2.0),main="survive v.s. age")
abline(LM,col="red")
abline(confint(LM)[1],confint(LM)[2],col="green")
abline(confint(LM)[3],confint(LM)[4],col="purple")
### Plot Predicted*Age
plot(age,fitted(LM),main="Predicted v.s. Age")

### Q-Q plot
qqnorm(residuals(LM),main="Q-Q Plot")

### Studentized residuals v.s Observation
plot(rstudent(LM),main="Studentized residual v.s. observation")
abline(h=0)

###----------fitting logistic regression survive~age
result=glm(survive~age,family=binomial("logit"))
summary(result)

## to get the specific coefficient, this command will produce a vector
coefficients(result)
## to access the estimated slope and turn it into the odds-ratio
exp(coefficients(result)[2])

confint(result) ## confidence interval for parameters
exp(confint(result)) ## exponentiate to get on the odds-scale

### Diagnostics Measures

lm.influence(result)

### Pearson Residuals v.s. observation 
plot(residuals(result,type="pearson"),main="pearson residual plot")

### Deviance Residuals v.s. observation
plot(residuals(result,type="deviance"),main="deviance residual plot")

### Hat Diagonal Plot
plot(hatvalues(result),ylab="H",xlab="Case Number Index")

### Intercept DfBeta Plot
plot(dfbetas(result)[,1],ylab="DFBETA0",xlab="Case Number Index")

### Intercept DfBeta Plot
plot(dfbetas(result)[,2],ylab="DFBETA1",xlab="Case Number Index")

### Table age*survive
table=as.matrix(table(age,survive))
contigency_table=list(Frequency=table,Expected=chisq.test(table)$expected,Percent=prop.table(table),RowPct=prop.table(table,1),ColPct=prop.table(table,2))
contigency_table
chisq.test(table)

###----------fitting logistic regression survive~age+sex
result=glm(survive~age+sex,family=binomial("logit"))
summary(result)
confint(result) ## confidence interval for the parameters 

### Diagnostics Measures

lm.influence(result)

### Pearson Residuals v.s. observation 
plot(residuals(result,type="pearson"),main="pearson residual plot")

### Deviance Residuals v.s. observation
plot(residuals(result,type="deviance"),main="deviance residual plot")

### Hat Diagonal Plot
plot(hatvalues(result),ylab="H",xlab="Case Number Index")

### Intercept DfBeta Plot
plot(dfbetas(result)[,1],ylab="DFBETA0",xlab="Case Number Index")

### Intercept DfBeta Plot
plot(dfbetas(result)[,2],ylab="DFBETA1",xlab="Case Number Index")

###----------fitting logistic regression survive~age+sex+age*sex
donner=as.data.frame(donner)
sort(donner,c(donner$V1,donner$V2))
result=glm(survive~age+sex+age*sex,family=binomial("logit"))
out=data.frame(survive,age,sex,pi=result$fitted)
out

### Sort by sex age and Plot by sex group
group1=(out[which(sex==0),])[sort.int(age[which(sex==0)],index=TRUE)$ix,]
group2=(out[which(sex==1),])[sort.int(age[which(sex==1)],index=TRUE)$ix,]
plot(group1$age,group1$pi,col="red",type="l",xlim=c(10,70),ylim=c(0,1),ylab="Estimated Probability",xlab="age")
lines(group2$age,group2$pi,col="blue",type="c")

##what if we have two categorical predictors and want to add an interaction term? 

We will get additional diagnostics needed for evaluating the overall fit of the model when having a continuous predictor(s) with lm.influence() function and many others in R. See the part labeled "Model Diagnostics".

Fitted model and Interpretation of parameter estimates

We can get the same model estimates as from the SAS output and the only difference is that R reports a z-value instead of Wald Chi-Square, but recall that \(z^2\approx \) Wald \(X^2\) (e.g., for age, \((-2.063)^2 \approx 4.2553)\).

Here is part of the R output:

> summary(result)
Call:
glm(formula = survive ~ age, family = binomial("logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5401  -1.1594  -0.4651   1.0842   1.7283  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.81852    0.99937   1.820   0.0688 .
age         -0.06647    0.03222  -2.063   0.0391 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The fitted model is

\(\log\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}=1.8183-0.0665x_{1i}\)

The expected probability of survival of a 70 year-old individual (regardless of sex since it's not included in the model):

\(\hat{\pi}=Pr(\mbox{survival})=\dfrac{\exp(1.8183-0.0665(70))}{1+\exp(1.8183-0.0665(70))}=0.055\),

and

\(\exp(\hat{\beta}_0)=\exp(1.8183)=3.26\)

is the estimate of the odds of survival when age = 0; that is, the probability of survival is \(3.26 / (1 + 3.26) = 0.77\). The exponentiated slope is

\(\exp(-0.0665)=0.94\)

which means that every year increase in age multiplies the estimated odds of survival by 0.94. So the odds of survival decreases with age.

In SAS, this value and its 95% Wald Confidence Interval are given in the output under "Odds Ratio Estimates":

                             Odds Ratio Estimates

                               Point          95% Wald
                  Effect    Estimate      Confidence Limits

                  age          0.936       0.878       0.997

In R, the odds-ratio estimate and its 95% Wald Confidence Interval (i.e., the asymptotic CI) if not directly provided but can be obtained as follows:

> exp(coefficients(result)[2])
age
0.9356907
> exp(confint(result)) ## exponentiate to get on the odds-scale
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.9940306 54.0635442
age 0.8695861 0.9898905

In general, this the usual 95% Wald CI, i.e., \(\hat{\beta}\pm 1.96\cdot SE(\beta)\), and then we need to exponentiate the ends to convert the CI to the odds-scale; for this example \(\exp(-0.0665\pm 1.96 \cdot 0.0322)=[0.878, 0.997]\).

Assuming the Type I error of 0.05, we reject the null hypothesis that \(\beta_1=0\), so there is weak evidence of a statistically significant relationship between the person's age and probability of survival. But how well does this model fit? Can we trust this inference?

Overall fit of the model

Let's look at the values listed for AGE. While some observations share the same AGE value, most of these are unique. Cross-tabulation of AGE vs. SURVIVAL gives a \(21\times 2\) table:

> table(donner[,1], donner[,2])

     0 1
  15 1 1
  18 0 1
  20 1 1
  21 1 0
  22 1 0
  23 1 3
  24 1 1
  25 2 6
  28 0 4
  30 0 4
  32 2 1
  35 0 1
  40 1 2
  45 2 0
  46 0 1
  47 1 0
  50 1 0
  57 0 1
  60 0 1
  62 0 1
  65 0 1
Issue! The table is sparse (e.g. small observed counts and we will get small fitted values), and the sample size requirement for the use of Pearson chi-square goodness-of-fit test statistic and likelihood ratio goodness-of-fit test statistic is not met. In addition, when more data are collected there could be more ages included so the number of cells is not fixed. This is almost always the case when you are dealing with continuous predictor(s) such as age. Thus, the \(X^2\) and \(G^2\) statistics for logistic regression models with continuous predictors do not have approximate chi-squared distributions and are not the best statistics to use to evaluate the overall fit of the model.

There are several alternatives. The most common are to:

  • fit the desired model, fit an appropriate larger model (include other predictors or a quadratic term), and look at the differences in the log-likelihood ratio statistics (i.e., the difference in deviances).
  • use the Hosmer-Lemeshow (HL) statistic.

When explanatory variables are continuous, it's difficult to analyze the lack of fit without some grouping. Recall that's what the Hosmer-Lemeshow test does; it groups by the predicted probabilities such that there is roughly an equal number of observations per group. Below are SAS and R outputs.

In SAS:

 
Partition for the Hosmer and Lemeshow Test
Group Total survive = 1 survive = 0
Observed Expected Observed Expected
1 5 0 0.57 5 4.43
2 7 3 1.82 4 5.18
3 4 3 1.65 1 2.35
4 4 1 1.82 3 2.18
5 4 2 1.96 2 2.04
6 8 2 4.31 6 3.69
7 6 3 3.40 3 2.60
8 7 6 4.47 1 2.53
 
Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square DF Pr > ChiSq
8.5165 6 0.202

In the model statement, the option lackfit tells SAS to compute HL statistics and print the partitioning. In this example, there are 8 groups with roughly equal number of observations per group. For example, group 1 has 5 observations, and group 2 has 7. The HL Chi-Square statistic of 8.5165, with df=number of groups - 2=6, has a p-value of 0.2027 indicating that there is no issue with the lack-of-fit. There are some differences between SAS and R output because of differences in the cut points used to group the data, but the inferences are the same.

Conclusion: The model fits moderately well.

For R, there is no built-in function for the Hosmer-Lemeshow goodness-of-fit test. Instead, we can use the hosmerlem() function from the HosmerLemeshow.R file. If we run the following for the Donner example, where "result" is the saved model and g is the number of groups for the HL statistic, we obtain

> hosmerlem(survive,fitted(result), g=8) 
X^2        Df   P(>Chi) 
6.3528592 6.0000000 0.3848447 

The HL Chi-Square statistic of 6.353, with df=number of groups - 2=6, has a p-value of 0.385 indicating that there is no issue with the lack-of-fit. There are some differences between SAS and R output because of differences in the cut points used to group the data, but the inferences are the same.

Conclusion: The model fits reasonably well.

This will involve fitting different logistic regression models, some including both continuous and categorical predictors, and comparing them via likelihood-ratio test. First, we review the general set-up, where

\(H_0\): The probability of the characteristic does not depend on the \(p^{th}\) variable; that is, the reduced model fits the data.

\(H_A\): The probability of the characteristic depends on the \(p^{th}\) variable; that is, the full model fits the data.

This is accomplished by comparing the two models as given below:

Reduced Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\ldots+\beta_{p-1} x_{(p-1)i}\)

which contains all variables except the variable of interest p.

Full Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\ldots+\beta_{p-1} x_{(p-1)i}+\beta_p x_{pi}\)

which contains all variables including the variable of interest p.

Recall that the fit of each model is assessed by the likelihood or, equivalently, the log-likelihood. The larger the log-likelihood, the better the model fits the data. Adding parameters to the model can only result in an increase in the log-likelihood. So, the full model will always have a larger log-likelihood than the reduced model and thus fit better. But the question is whether that additional variable is really needed, or can we capture the variability in the response well enough with the simpler model?

 Question: Is the log-likelihood for the full model significantly better than the log-likelihood for the reduced model? If so, we conclude that the probability that the trait is present depends on variable p. For example, to test for the effect of sex in the Donner Party data, we compare

Reduced Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}\)

where \(x_{1i}\) is the age of a person \(i\).

Full Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)

where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the person's sex (with female corresponding to 0). If we are testing only for a single parameter, then this test should lead to the same inference as that for the test of /(H_0: \beta_2=0/) versus /(H_A: \beta_2\neq 0/).

In SAS: From the Model Fit Statistics in SAS get the likelihoods:

Model -2 log likelihood
Reduced (age) 56.291
Full (age and sex) 51.256

The likelihood ratio test statistic is

\(G^2 =-2 \log \Lambda=56.291-51.256=5.035\)

Since \(G^2 =-2 \log \Lambda=5.035 > 5.02= \chi^2_1(0.975)\), we can reject the null hypothesis that sex has no effect on the survivorship probability at the \(\alpha = 0.025\) level.

If we study the table with parameter estimates and consider \(H_0\colon \beta_{sex}=0\) vs. \(H_a\colon \beta_{sex}\neq=0\), based on the Wald \(X^2\) and p-value of 0.0345, we can reject the null hypothesis and conclude that sex does have an effect on survivorship.

 
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 3.2304 1.3870 5.4248 0.0199
age 1 -0.0782 0.0373 4.3988 0.0360
sex 1 -1.5973 0.7555 4.4699 0.0345

 

Fitted Model:

\(\hat{\pi}=Pr(survival)=\dfrac{\exp(3.2304-0.0782 x_1-1.5973 x_2)}{1+\exp(3.2304-0.0782 x_1-1.5973 x_2)}\)

Also, from SAS, the HL = 9.323, df = 7, p-value = 0.2305 indicates a moderate good fit.

For R, notice that this is the usual difference in the deviance statistics for the two models: the residual deviance for the age-only model (\(G^2= 56.291\)) and the age and sex model (\(G^2=51.256\)). Thus, \(G^2=-2 \log\Lambda = 56.291 - 51.256 = 5.035\), and we can reject the null hypothesis that sex has no effect on the survivorship probability.

If we study the table with parameter estimates and consider /(H_0: \beta_{sex}=0/) versus /(H_A: \beta_{sex}\neq=0/), based on the Wald \(X^2\) and p-value of 0.0345, we can reject the null hypothesis and conclude that sex does have an effect on survivorship.

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.23041 1.38686 2.329 0.0198 *
age -0.07820 0.03728 -2.097 0.0359 *
sex -1.59729 0.75547 -2.114 0.0345 *

Fitted Model

\(\hat{\pi}=Pr(survival)=\dfrac{\exp(3.2304-0.0782 x_1-1.5973 x_2)}{1+\exp(3.2304-0.0782 x_1-1.5973 x_2)}\)

From R, hosmerlem(survive, fitted(result), g=10)
X^2 Df P(>Chi)
11.1771583 8.0000000 0.1918618

The HL = 11.177, df = 8, p-value = 0.192 indicates a moderate good fit.

Conclusions

  • After taking into account the effects of age, females had higher survival probabilities than males (\(2 \log \Lambda = 5.035\), df = 1; p = 0.025).
  • The odds that a male survives are estimated to be \(\exp(-1.5973) = 0.202\) times the odds that a female survives.
  • By the Wald test, the survivorship probability decreases with increasing age (\(X_2 = 4.47\); df. = 1; p = 0.0345).
  • The odds of surviving decline by a multiplicative factor of \(\exp(-0.0782) = 0.925\) per year of age.
 
Odds Ratio Estimates
Effect Point Estimate 95% Wald
Confidence Limits
age 0.925 0.860 0.995
sex 0.202 0.046 0.890

Computing Survivorship Probabilities

 Question: What is the probability of survival for a 24 year old male? What about a 24 year old female?

For females:

\begin{align}\hat{\pi} &= \dfrac{\exp(3.2304-0.0782 \times 24-1.5973 \times 0)}{1+\exp(3.2304-0.0782 \times 24-1.5973 \times 0)}\\&= 0.806\end{align}

For males:

\begin{align}\hat{\pi} &=  \dfrac{\exp(3.2304-0.0782 \times 24-1.5973 \times 1)}{1+\exp(3.2304-0.0782 \times 24-1.5973 \times 1)}\\&= 0.439\end{align}

As before, we can also calculate confidence intervals by using computed estimates of the standard errors and the fact that the estimates are approximately normal.

Graphical Summary of Model Results

The SGScatter Procedure male female sex 20 30 40 50 60 age 0.2 0.4 0.6 0.8 1.0 Estimated Probability

Interaction between Age and Sex

The above analyses assume that the effect of sex on survivorship does not depend on age; that is, there is no interaction between age and sex. To test for interaction compare the following models:

Reduced Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)

where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the sex.

Full Model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\beta_3 x_{1i} x_{2i}\)

where \(x_{1i}\) is the age of person \(i\), and \(x_{2i}\) is the sex.

In SAS, the interaction model is fitted via

proc logistic descending;
  model survive = age sex age*sex /lackfit influence iplots;
  output out=a p=pi;
  run;
Model -2 log likelihood
Reduced (age, sex) 51.256
Full (age, sex, interaction) 47.346

From the output, we see that the likelihood ratio test statistic is \( G^2 =-2 \log \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)

Conclusion: At the \(\alpha=0.05\) level, we can reject the hypothesis that there is no interaction between age and sex. This is weak evidence in support of interaction, but it does support what we observed in the above figure. From the table of the analysis of the MLEs below, however, notice that we would fail to reject that \(\beta_{age*sex}\) is 0. This happens because the Wald chi-square statistic is sensitive to sparseness in the data.

 Question: What is the fitted model, and what are the probabilities of survival for 24-year-old males and females?

 
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 7.2450 3.2046 5.1114 0.0238
age 1 -0.1940 0.0874 4.9289 0.0264
sex 1 -6.9267 3.3983 4.1546 0.0415
age*sex 1 0.1616 0.0942 2.9385 0.0865

 

For females

\begin{align}\hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}\\&= 0.930\end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.194) = 0.824\) per additional year of age.

For males

\begin{align} \hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\&= \dfrac{\exp(0.3183-0.0324 \times 24)}{1+\exp(0.3183-0.0324 \times 24)}\\&= 0.387\end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.0324) = .968\) per additional year of age. Note that the odds of survival decline more rapidly for females than for males.

In R, the interaction model is fitted via


result=glm(survive~age+sex+age*sex,family=binomial("logit"))
Model -2 log likelihood
Reduced (age, sex) 51.256
Full (age, sex, interaction) 47.346

From the output, we see that the likelihood ratio test statistic is \( G^2 =-2 \log \Lambda=3.91 > 3.84= \chi^2_1(0.95)\)

Conclusion: At the \(\alpha=0.05\) level, we can reject the hypothesis that there is no interaction between age and sex. This is weak evidence in support of interaction, but it does support what we observed in the above figure. From the table of the analysis of the MLEs below, however, notice that we would fail to reject that \(\beta_{age*sex}\) is 0. This happens because the Wald chi-square statistic is sensitive to sparseness in the data.

 Question: What is the fitted model, and what are the probabilities of survival for 24 year old males and females?

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.24638 3.20517 2.261 0.0238 *
age -0.19407 0.08742 -2.220 0.0264 *
sex -6.92805 3.39887 -2.038 0.0415 *
age:sex 0.16160 0.09426 1.714 0.0865 .

For females:

\begin{align} \hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 0+0.1616\times 0)} \\ &= 0.930 \end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.194) = 0.824\) per additional year of age.

For males:

\begin{align}\hat{\pi} &= \dfrac{\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}{1+\exp(7.2450-0.1940 \times 24-6.9267 \times 1+0.1616\times (24\times 1))}\\&= \dfrac{\exp(0.3183-0.0324 \times 24)}{1+\exp(0.3183-0.0324 \times 24)}\\&= 0.387\end{align}

The odds of surviving are multiplied by a factor of \(\exp(-0.0324) = .968\) per additional year of age. Note that the odds of survival decline more rapidly for females than for males.


7.2 - Model Diagnostics

7.2 - Model Diagnostics

While the goodness-of-fit statistics can tell us how well a particular model fits the data, they don't tell us much about why a model may fit poorly. Do the model assumptions hold? To assess the lack of fit we need to look at regression diagnostics.

Let us begin with a bit of a review of Linear Regression diagnostics. The standard linear regression model is given by:

\(y_i \sim N(\mu_i,\sigma^2)\)

\(\mu_i =x_i^T \beta\)

The two crucial features of this model are

  1. the assumed mean structure, \(\mu_i =x_i^T \beta\) , and
  2. the assumed constant variance \(\sigma^2\) (homoscedasticity).

The most common diagnostic tool is the residuals, the difference between the estimated and observed values of the dependent variable. There are other useful regression diagnostics, e.g. measures of leverage and influence, but for now, our focus will be on the estimated residuals.

The most common way to check these assumptions is to fit the model and then plot the residuals versus the fitted values \(\hat{y}_i=x_i^T \hat{\beta}\) .

  • If the model assumptions are correct, the residuals should fall within an area representing a horizontal band, like this,
  • If the residuals have been standardized in some fashion (i.e., scaled by an estimate of \(\sigma\)), then we would expect most of them to have values within \(\pm2\) or \(\pm3\); residuals outside of this range are potential outliers.
  • If the plot reveals some kind of curvature—for example, like this,

    it suggests a failure of the mean model; the true relationship between \(\mu_i\), and the covariates might not be linear.

  • If the variability in the residuals is not constant as we move from left to right—for example, if the plot looks like this,

    or like this,

    then the variance \(V(Y_i)\) is not constant but changes as the mean \(\mu_i\) changes.

Analogous plots for logistic regression.

The logistic regression model says that the mean of \(Y_i\) is

\(\mu_i=n_i \pi_i\)

where

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\)

and that the variance of \(Y_i\) is

\(V(Y_i)=n_i \pi_i(1-\pi_i)\).

After fitting the model, we can calculate the Pearson residuals:

\(r_i=\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{V}(Y_i)}}=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\)

If the \(n_i\)s are relatively large, these act like standardized residuals in linear regression. To see what's happening, we can plot them against the linear predictors:

\(\hat{\eta}_i=\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \hat{\beta}_i\)

which are the estimated log-odds of success, for cases \(i = 1, \ldots , N\).

  • If the fitted logistic regression model was true, we would expect to see a horizontal band with most of the residuals falling within \(\pm 3\):
  • If the \(n_i\)s are small, then the plot might not have such a nice pattern, even if the model is true. In the extreme case of ungrouped data (with all \(n_i\)s equal to 1), this plot becomes uninformative. From now on, we will suppose that the \(n_i\)s are not too small so that the plot is at least somewhat meaningful.
  • If outliers are present—that is, if a few residuals or even one residual is substantially larger than \(\pm 3\), then \(X^2\) and \(G^2\) may be much larger than the degrees of freedom. In that situation, the lack of fit can be attributed to outliers, and the large residuals will be easy to find in the plot.

Curvature

  • If the plot of Pearson residuals versus the linear predictors reveals curvature—for example, like this,

then it suggests that the mean model

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\) (2)

has been misspecified in some fashion. That is, it could be that one or more important covariates do not influence the log-odds of success in a linear fashion. For example, it could be that a covariate \(x\) ought to be replaced by a transformation such as \(\sqrt{x}\), \(\log x\), \(x^2\), etc.

To see whether individual covariates ought to enter in the logit model in a non-linear fashion, we could plot the empirical logits

\(\log\left(\dfrac{y_i+1/2}{n_i-y_i+1/2}\right)\) (3)

versus each covariate in the model.

  • If one of the plots looks substantially non-linear, we might want to transform that particular covariate.
  • If many of them are nonlinear, however, it may suggest that the link function has been mis-specified—i.e., that the left-hand side of (2) should not involve a logit transformation but some other function.

Changing the link function will change the interpretation of the coefficients entirely; the \(\beta\)s will no longer be log-odds ratios. But, depending on what the link function is, they might still have a nice interpretation. For example, in a model with a log link, \(\log\pi_i=x_i^T \beta\), an exponentiated coefficient \(\exp(\beta_j)\) becomes a relative risk.

Test for correctness of the link function

Hinkley (1985) suggests a nice, easy test to see whether the link function is plausible:

\(\hat{\eta}_i=x_i^T \hat{\beta}\)

  • Fit the model and save the estimated linear predictors
  • Add \(\eta^2_i\) to the model as a new predictor and see if it's significant.

A significant result indicates that the link function is misspecified. A nice feature of this test is that it applies even to ungrouped data (\(n_i\)s equal to one), for which residual plots are uninformative.

Non-constant Variance

Suppose that the residual plot shows non-constant variance as we move from left to right:

Another way to detect non-constancy of variance is to plot the absolute values of the residuals versus the linear predictors and look for a non-zero slope:

Non-constant variance in the Pearson residuals means that the assumed form of the variance function,

\(V(y_i)=n_i \pi_i (1-\pi_i)\)

is wrong and cannot be corrected by simply introducing a scale factor for overdispersion. Overdispersion and changes to the variance function will be discussed later.

Example

The SAS online help documentation provides the following quantal assay dataset. In this table, \(x_i\) refers to the log-dose, \(n_i\) is the number of subjects exposed, and \(y_i\) is the number who responded.

\(x_i\) \(y_i\) \(n_i\)
2.68 10 31
2.76 17 30
2.82 12 31
2.90 7 27
3.02 23 26
3.04 22 30
3.13 29 31
3.20 29 30
3.21 23 30

If we fit a simple logistic regression model, we will find that the coefficient for \(x_i\) is highly significant, but the model doesn't fit. The plot of Pearson residuals versus the fitted values resembles a horizontal band, with no obvious curvature or trends in the variance. This seems to be a classic example of overdispersion. Since there's only a single covariate, a good place to start is to plot the empirical logits as defined in equation (3) above versus \(x\).

This is basically the same thing as a scatterplot of \(Y\) versus \(x\) in the context of ordinary linear regression. This plot becomes more meaningful as the \(n_i\)s grow. With ungrouped data (all \(n_i\) = 1), the empirical logits will only take two possible values—\(\log(1/3)\) and \(\log(3/1\)\)—and the plot will not be very useful.

Here is part of the SAS program file assay.sas:

options nocenter nodate nonumber linesize=72;
data assay; 
   input logconc y n; 
 	emplogit=log((y+.5)/(n-y+.5));
   cards; 
   2.68  10  31 
   2.76  17  30 
   2.82  12  31 
   2.90   7  27 
   3.02  23  26 
   3.04  22  30 
   3.13  29  31 
   3.20  29  30 
   3.21  23  30 
   ; 
    run; 
   axis1 label=('Log-concentration');
   axis2 label=('Empirical Logit');
   proc gplot data=assay; 
      title 'Empirical logits vs. x'; 
      plot emplogit * logconc / haxis=axis1 vaxis=axis2; 
   run; 
Plot of emplogit by logconc Empirical Logit -2 -1 0 1 2 3 Log-concentration 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 Empirical logits vs. x

The relationship between the logits and \(x\) seems linear. Let's fit the logistic regression and see what happens.

*no adjustment for overdispersion (scale=none);
   proc logistic data=assay; 
      model y/n= logconc / scale=none; 
      output out=out1 xbeta=xb reschi=reschi; 
   run; 
 
   axis1 label=('Linear predictor');
   axis2 label=('Pearson Residual');
   proc gplot data=out1; 
      title 'Residual plot'; 
      plot reschi * xb / haxis=axis1 vaxis=axis2; 
   run;  

In plot reschi * xb, reschi are the Pearson residuals, and xb the linear predictor.

The output reveals that the coefficient of \(x\) is highly significant, but the model does not fit.

 
Deviance and Pearson Goodness-of-Fit Statistics
Criterion Value DF Value/DF Pr > ChiSq
Deviance 29.3462 7 4.1923 0.0001
Pearson 28.5630 7 4.0804 0.0002

Number of events/trials observations: 9

 
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 -15.8331 2.4371 42.2072 <.0001
logconc 1 5.5776 0.8319 44.9521 <.0001
Plot of reschi by xb Pearson Residual -4 -3 -2 -1 0 1 2 Linear predictor -1 0 1 2 3 Residual plot

Here is the R code assay.R that corresponds to the SAS program assay.sas:


#### Logistic regression 

r=c(10,17,12,7,23,22,29,29,23)
n=c(31,30,31,27,26,30,31,30,30)
logconc=c(2.68,2.76,2.82,2.90,3.02,3.04,3.13,3.20,3.21)
counts=cbind(r,n-r)
result=glm(counts~logconc,family=binomial("logit"))
summary(result,correlation=TRUE,symbolic.cor = TRUE)
result$coefficients

#### plot residuals vs. linear predictor 

plot(residuals(result, type="pearson"),result$linear.predictors)

#### plot logconc vs. empirical logits 

emplogit=log((r+0.5)/(n-r+0.5))
plot(logconc,emplogit)
plot(result)

 

With plot(result), R will produce four diagnostic plots, including a residual plot, a QQ plot, a scale-location plot, and a residual vs leverage plot as well.

The output reveals that the coefficient of \(x\) is highly significant, but the model does not fit.


Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -15.8339     2.4371  -6.497 8.20e-11 ***
logconc       5.5778     0.8319   6.705 2.02e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83.631  on 8  degrees of freedom
Residual deviance: 29.346  on 7  degrees of freedom
AIC: 62.886

Here is the residual plot from R output:

The residuals plots in both SAS and R above do not show any obvious curvature or trends in the variance. And there are no other predictors that are good candidates for inclusion in the model. (It can be shown that a quadratic term for log-concentration will not improve the fit.) So it is quite reasonable to attribute the problem to overdispersion.

Influence

Regression diagnostics can also tell us how influential each observation is to the fit of the logistic regression model. We can evaluate the numerical values of these statistics and/or consider their graphical representation (like residual plots in linear regression).

Some measures of influence:

  • Pearson and Deviance Residuals identify observations not well explained by the model.
  • Hat Matrix Diagonal detects extremely large points in the design space. These are often labeled as "leverage" or "\(h_i\)" and are related to standardized residuals. A general rule says that if \(h_i > 2p/n\) or \(> 3p/n\), the points is influential. Here "p" is the number of parameters in the model and "n" is the number of observations. For the donner data example with the main effects model, \(2(3)/45=0.13\). There are two points possibly influential.
  • The regression diagnostics work is very similar to linear regression.

The LOGISTIC Procedure

      „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ†
H 0.2 ˆ                                                                                                                            ˆ
a     ‚                                                                                                                            ‚
t     ‚                                                                                *                                           ‚
    H ‚                                                      *                                                                     ‚
D     ‚                              *     *                                                                                       ‚
i     ‚                                                                                                                            ‚
a 0.1 ˆ                    *                                                         *                                             ˆ
g     ‚                                      *       * *   *     *             * *       *     *                 *                 ‚
o     ‚                                * *                     *   *   *   *                 *                                     ‚
n     ‚                  *   *     *               *                 *   *                 *       * * *   * * *                   ‚
a     ‚                        * *             * *       *                   *     *             *       *                         ‚
l     ‚                                                                                                                            ‚
  0.0 ˆ                                                                                                                            ˆ
      ŠƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒŒ
                       0         5        10        15        20        25        30        35        40        45                  
                                                                                                                                    
                                                           Case Number   INDEX                                                      
  • DFBETA assesses the effect of an individual observation on the estimated parameter of the fitted model. It’s calculated for each observation for each parameter estimate. It's a difference in the parameter estimated due to deleting the corresponding observation. For example, for the main effects model there are 3 such plots (one for intercept, one for age, and one for sex). Here is the one for AGE, showing that observation #35 is potentially influential.

The LOGISTIC Procedure

          „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ†
      0.5 ˆ                                                                                                                        ˆ
a         ‚                                                                                    *                                   ‚
g         ‚                                                                            *                                           ‚
e DFBETA1 ‚                  * *                                                                                                   ‚
          ‚                                                                                                                        ‚
D         ‚                *                   *     *                 *       * *           *   *   * * *       *                 ‚
f     0.0 ˆ                      * *             * *       *                         *             *       *                       ˆ
B         ‚                          *   * *           * *       * *     * * *     *       *                 * *   *               ‚
e         ‚                            *     *               * *     *                   *                                         ‚
t         ‚                                                                                                                        ‚
a         ‚                                                                                                                        ‚
          ‚                                                                                                                        ‚
     -0.5 ˆ                                                                                                                        ˆ
          ŠƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒŒ
                         0         5        10        15        20        25        30        35        40        45                
                                                                                                                                    
                                                             Case Number   INDEX                                                    
  • C and CBAR provide scalar measures of the influence of the individual observations on the parameter estimates, similar to the Cook distance in linear regression. The same rules in linear regression hold here too. For example, observation #35 again seems influential (see the rest of donner.lst).
  • DIFDEV and DIFCHISQ detect observations that heavily add to the lack of fit, i.e. there is a large disagreement between the fitted and observed values. DIFDEV measures the change in the deviance (\(G^2\)) as an individual observation is deleted. DIFCHISQ measures the change in \(X^2\).

In SAS, under PROC LOGISTIC, INFLUENCE option and IPLOTS option will output these diagnostics. In R see, influence() function.

As pointed out before, the standardized residual values are considered to be indicative of lack of fit if their absolute value exceeds 3 (some sources are more conservative and take 2 as the threshold value).

The Index plot, in which the residuals are plotted against the indices of the observations, can help us determine outliers, and/or systematic patterns in variation and thus assess the lack of fit. In our example, all Pearson and Deviance residuals fall within \(-2\) and \(+2\); thus, there does not seem to be any outliers or particular heavy influential points; see donner.html for the IPLOTS.

For our example, here is the IPLOT of the deviance residuals vs. the case number.

The LOGISTIC Procedure

         „ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ†
D      2 ˆ                     *                                                               *                                   ˆ
e        ‚                                                                     *                                                   ‚
v        ‚                                       *         *               *         *                         *                   ‚
i RESDEV ‚                   *                                       *       *   *               *                                 ‚
a        ‚                                             * *         *               *       *                       *               ‚
n        ‚                                                   *                                                                     ‚
c      0 ˆ                                                                                                                         ˆ
e        ‚                               * *                     *       *                                                         ‚
         ‚                           *                                                                       *                     ‚
R        ‚                 *     * *   *     *     * *         *       *                 *   *     * * * * *     *                 ‚
e        ‚                                                                             *                                           ‚
s        ‚                                     *                                                                                   ‚
i     -2 ˆ                                                                                                                         ˆ
d        ŠƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒŒ
u                        0         5        10        15        20        25        30        35        40        45                
a                                                                                                                                   
                                                             Case Number   INDEX                                                    

 


7.3 - Overdispersion

7.3 - Overdispersion

Overdispersion is an important concept in the analysis of discrete data. Many times data admit more variability than expected under the assumed distribution. The extra variability not predicted by the generalized linear model random component reflects overdispersion. Overdispersion occurs because the mean and variance components of a GLM are related and depend on the same parameter that is being predicted through the predictor set.

Overdispersion is not an issue in ordinary linear regression. In a linear regression model

\(y_i \sim N(x_i^T \beta, \sigma^2)\)

the variance \(σ^2\) is estimated independently of the mean function \(x_i^T \beta\). With discrete response variables, however, the possibility for overdispersion exists because the commonly used distributions specify particular relationships between the variance and the mean; we will see the same holds for Poisson.

For the binomial response, if \(Y_i\sim Bin(n_i, \pi_i)\), the mean is \(\mu_i=n_i\pi_i\), and the variance is \(\mu_i(n_i - \mu_i) / n_i\).

  • Overdispersion means that the variance of the response \(Y_i\) is greater than what's assumed by the model.
  • Underdispersion is also theoretically possible but rare in practice. More often than not, if the model's variance doesn't match what's observed in the response, it's because the latter is greater.

In the context of logistic regression, overdispersion occurs when the discrepancies between the observed responses \(y_i\) and their predicted values \(\hat{\mu}_i=n_i \hat{\pi}_i\) are larger than what the binomial model would predict. Overdispersion arises when the \(n_i\) Bernoulli trials that are summarized in a line of the dataset are

  • not identically distributed (i.e., the success probabilities vary from one trial to the next), or
  • not independent (i.e., the outcome of one trial influences the outcomes of other trials).

In practice, it is impossible to distinguish non-identically distributed trials from non-independence; the two phenomena are intertwined.

Issue!

If overdispersion is present in a dataset, the estimated standard errors and test statistics the overall goodness-of-fit will be distorted and adjustments must be made. When a logistic model fitted to n binomial proportions is satisfactory, the residual deviance has an approximate \(\chi^2\) distribution with \((n – p)\) degrees of freedom, where \(p\) is the number of unknown parameters in the fitted model. Since the expected value of a \(chi^2\) distribution is equal to its degree of freedom, it follows that the residual deviance for a well-fitting model should be approximately equal to its degrees of freedom. Equivalently, we may say that the mean deviance (deviance/df) should be close to one. Similarly, if the variance of the data is greater than that under binomial sampling, the residual mean deviance is likely to be greater than 1.

The problem of overdispersion may also be confounded with the problem of omitted covariates. If we have included all the available covariates related to \(Y_i\) in our model and it still does not fit, it could be because our regression function \(x_i^T \beta\) is incomplete. Or it could be due to overdispersion. Unless we collect more data, we cannot do anything about omitted covariates. But we can adjust for overdispersion.

Adjusting for Overdispersion

The most popular method for adjusting for overdispersion comes from the theory of quasi-likelihood. Quasilikelihood has come to play a very important role in modern statistics. It is the foundation of many methods that are thought to be "robust" (e.g. Generalized Estimating Equations (GEE) for longitudinal data) because they do not require the specification of a full parametric model.

In the quasilikelihood approach, we must first specify the "mean function" which determines how \(\mu_i=E(Y_i)\) is related to the covariates. In the context of logistic regression, the mean function is

\(\mu_i=n_i \exp(x_i^T \beta)\),

which implies

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\)

Then we must specify the "variance function," which determines the relationship between the variance of the response variable and its mean. For a binomial model, the variance function is \(\mu_i(n_i-\mu_i)/n_i\). But to account for overdispersion, we will include another factor \(\sigma^2\) called the "scale parameter," so that

\(V(Y_i)=\sigma^2 \mu_i (n_i-\mu_i)/n_i\).

  • If \(\sigma^2 \ne 1\) then the model is not binomial; \(\sigma^2 > 1\) corresponds to "overdispersion", and \(\sigma^2 < 1\) corresponds to "underdispersion."
  • If \(\sigma^2\) were known, we could obtain a consistent, asymptotically normal and efficient estimate for \(\beta\) by a quasi-scoring procedure, sometimes called "estimating equations." For the variance function shown above, the quasi-scoring procedure reduces to the Fisher scoring algorithm that we mentioned as a way to iteratively find ML estimates.

Note that no matter what \(\sigma^2\) is assumed to be, we get the same estimate for \(\beta\). Therefore, this method for overdispersion does not change the estimate for \(\beta\) at all. However, the estimated covariance for \(\hat{\beta}\) changes from

\(\hat{V}(\hat{\beta})=(x^T W x)^{-1}\)

to

\(\hat{V}(\hat{\beta})=\sigma^2 (x^T W x)^{-1}\)

That is, the estimated standard errors must be multiplied by the factor \(\sigma=\sqrt{\sigma^2}\).

 How do we estimate \(\sigma^2\)?

One recommendation is to use

\(\hat{\sigma}^2=X^2/(N-p)\)

where \(X^2\) is the usual Pearson goodness-of-fit statistic, \(N\) is the number of sample cases (number of rows in the dataset we are modeling), and \(p\) is the number of parameters in the current model (suffering from overdispersion).

  • If the model holds, then \(X^2/(N - p)\) is a consistent estimate for \(\sigma^2\) in the asymptotic sequence \(N\rightarrow\infty\) for fixed \(n_i\)s.
  • The deviance-based estimate \(G^2/(N - p)\) does not have this consistency property and should not be used.

This is a reasonable way to estimate \(\sigma^2\) if the mean model \(\mu_i=g(x_i^T \beta)\) holds. But if important covariates are omitted, then \(X^2\) tends to grow and the estimate for \(\sigma^2\) can be too large. For this reason, we will estimate \(\sigma^2\) under a maximal model, a model that includes all of the covariates we wish to consider.

The best way to estimate \(\sigma^2\) is to identify a rich model for \(\mu_i\) and designate it to be the most complicated one that we are willing to consider. For example, if we have a large pool of potential covariates, we may take the maximal model to be the model that has every covariate included as a main effect. Or, if we have a smaller number of potential covariates, we decide to include all main effects along with two-way and perhaps even three-way interactions. But we must omit at least a few higher-order interactions, otherwise, we will end up with a model that is saturated.

In an overdispersed model, we must also adjust our test statistics. The statistics \(X^2\) and \(G^2\) are adjusted by dividing them by \(\sigma^2\). That is, tests of nested models are carried out by comparing differences in the scaled Pearson statistic, \(\Delta X^2/\sigma^2\), or the scaled deviance, \(G^2/\sigma^2\) to a chi-square distribution with degrees of freedom equal to the difference in the numbers of parameters for the two models.

If the data are overdispersed — that is, if

\(V(Y_i) \approx \sigma^2 n_i \pi_i (1-\pi_i)\)

for a scale factor \(\sigma^2 > 1\), then the residual plot may still resemble a horizontal band, but many of the residuals will tend to fall outside the \(\pm 3\) limits. In this case, the denominator of the Pearson residual will tend to understate the true variance of the \(Y_i\), making the residuals larger. If the plot looks like a horizontal band but \(X^2\) and \(G^2\) indicate lack of fit, an adjustment for overdispersion might be warranted. A warning about this, however: If the residuals tend to be too large, it doesn't necessarily mean that overdispersion is the cause. Large residuals may also be caused by omitted covariates. If some important covariates are omitted from \(x_i\), then the true \(\mu_i\)s will depart from what your model predicts, causing the numerator of the Pearson residual to be larger than usual. That is, apparent overdispersion could also be an indication that your mean model needs additional covariates. If these additional covariates are not available in the dataset, however, then there's not much we can do about it; we may need to attribute it to overdispersion.

Note, there is no overdispersion for ungrouped data. McCullagh and Nelder (1989) point out that overdispersion is not possible if \(n_i=1\). If \(y_i\) only takes values 0 and 1, then it must be distributed as Bernoulli(\(\pi\)), and its variance must be \(\pi_i(1-\pi_i)\). There is no other distribution with support {0,1}. Therefore, with ungrouped data, we should always assume scale=1 and not try to estimate a scale parameter and adjust for overdispersion.

Summary of Adjusting for Overdispersion in Binary Logistic Regression

The usual way to correct for overdispersion in a logit model is to assume that:

\(E(y_i)=n_i \pi_i\)
\(V(y_i)=\sigma^2 n_i \pi_i (1-\pi_i)\)

where \(\sigma^2\) is a scale parameter. Under this modification, the Fisher-scoring procedure for estimating \(\beta\) does not change, but its estimated covariance matrix becomes \(\sigma^2(x^TWx)^{-1}\)—that is, the usual standard errors are multiplied by the square root of \(\sigma^2\). This will make the confidence intervals wider.

Let's get back to our example and refit the model, making an adjustment for overdispersion.

We can refit the model, making an adjustment for overdispersion in SAS by changing the model statement to

model y/n= logconc / scale=pearson; 

In SAS, including the option scale=Pearson in the model statement will perform the adjustment. It will not change the estimated coefficients \(\beta_j\), but it will adjust the standard errors.

Here's the output:

 
Deviance and Pearson Goodness-of-Fit Statistics
Criterion Value DF Value/DF Pr > ChiSq
Deviance 29.3462 7 4.1923 0.0001
Pearson 28.5630 7 4.0804 0.0002

Number of events/trials observations: 9

Note! The covariance matrix has been multiplied by the heterogeneity factor (Pearson Chi-Square / DF) 4.08043.
 
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 13.3037 1 0.0003
Score 12.6609 1 0.0004
Wald 11.0165 1 0.0009
 
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 -15.8331 4.9230 10.3438 0.0013
logconc 1 5.5776 1.6804 11.0165 0.0009

The estimated scale parameter is \(\hat{\sigma}^2=X^2/df=4.08\). SAS automatically scales the covariance matrix by this factor, which means that

  • the standard errors in the table of coefficients are multiplied by \(\sqrt{4.08} \approx 2\), and
  • the Wald statistics are divided by 4.08.

If you are using glm() in R, and want to refit the model adjusting for overdispersion one way of doing it is to use summary.glm() function. For example, fit the model using glm() and save the object as RESULT. By default, dispersion is equal to 1. This will perform the adjustment. It will not change the estimated coefficients \(\beta_j\), but it will adjust the standard errors.

Estimate from the MAXIMAL model dispersion value as \(X^2/df\). Then we can call

summary(RESULT, dispersion=4.08,correlation=TRUE,symbolic.cor = TRUE).

This should give the same model but with an adjusted covariance matrix---that is, adjusted standard errors (SEs) for the \(\beta\)s (estimated logistic regression coefficients) and also changed z-values. Notice it will not adjust overall fit statistics. For that try the package "dispmod" (see assay.R).

R output after adjusting for overdispersion:


Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -15.834      4.923  -3.216 0.001298 ** 
logconc        5.578      1.680   3.319 0.000902 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 4.08)

    Null deviance: 83.631  on 8  degrees of freedom
Residual deviance: 29.346  on 7  degrees of freedom
AIC: 62.886

There are other corrections that we could make. If we were constructing an analysis-of-deviance table, we would want to divide \(G^2\) and \(X^2\) by \(\hat{\sigma}^2\) and use these scaled versions for comparing nested models. Moreover, in reporting residuals, it would be appropriate to modify the Pearson residuals to

\( r_i^\ast=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{\hat{\sigma}^2n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\);

that is, we should divide the Pearson residuals (or the deviance residuals, for that matter) by \(\sqrt{\hat{\sigma}^2}\).


7.4 - Receiver Operating Characteristic Curve (ROC)

7.4 - Receiver Operating Characteristic Curve (ROC)

A Receiver Operating Characteristic Curve (ROC) is a standard technique for summarizing classifier performance over a range of trade-offs between true positive (TP) and false positive (FP) error rates (Sweets, 1988). ROC curve is a plot of sensitivity (the ability of the model to predict an event correctly) versus 1-specificity for the possible cut-off classification probability values \(\pi_0\).

For logistic regression we can create a \(2\times 2\) classification table of predicted values from your model for the response if \(\hat{y}=0\) or 1 versus the true value of \(y = 0\) or 1. The prediction if \(\hat{y}=1\) depends on some cut-off probability, \(\pi_0\). For example, \(\hat{y}=1\) if \(\hat{\pi}_i>\pi_0\) and \(\hat{y}=0\) if \(\hat{\pi}_i \leq \pi_0\). The most common value for \(\pi_0 = 0.5\). Then \(sensitivity=P(\hat{y}=1|y=1)\) and \(specificity=P(\hat{y}=0|y=0)\).

The ROC curve is more informative than the classification table since it summarizes the predictive power for all possible \(\pi_0\).

The position of the ROC on the graph reflects the accuracy of the diagnostic test. It covers all possible thresholds (cut-off points). The ROC of random guessing lies on the diagonal line. The ROC of a perfect diagnostic technique is a point at the upper left corner of the graph, where the TP proportion is 1.0 and the FP proportion is 0.

The Area Under the Curve (AUC), also referred to as index of accuracy (A), or concordance index, \(c\), in SAS, and it is an accepted traditional performance metric for a ROC curve. The higher the area under the curve the better prediction power the model has. \(c = 0.8 \) can be interpreted to mean that a randomly selected individual from the positive group has a test value larger than that for a randomly chosen individual from the negative group 80 percent of the time.

The following is taken from the SAS program assay.sas.

options nocenter nodate nonumber linesize=72;

data assay; 
   input logconc y n; 
   cards; 
   2.68  10  31 
   2.76  17  30 
   2.82  12  31 
   2.90   7  27 
   3.02  23  26 
   3.04  22  30 
   3.13  29  31 
   3.20  29  30 
   3.21  23  30 
   ; 
    run; 
 
   proc logistic data=assay; 
      model y/n= logconc / scale=pearson outroc=roc1; 
      output out=out1 xbeta=xb reschi=reschi; 
   run; 
 
   axis1 label=('Linear predictor');
   axis2 label=('Pearson Residual');
   proc gplot data=out1; 
      title 'Residual plot'; 
      plot reschi * xb / haxis=axis1 vaxis=axis2; 
   run; 

 symbol1 i=join v=none c=blue;
   proc gplot data=roc1;
   title 'ROC plot';
   plot  _sensit_*_1mspec_=1 / vaxis=0 to 1 by .1 cframe=ligr ;
   run;

Here is the resulting ROC graph.

ROC Curve for Model 0.00 0.25 0.50 0.75 1.00 Sensitivity 0.00 0.25 0.50 0.75 1.00 1 - Specificity ROC Curve for Model Area Under the Curve = 0.7462

Area under the curve is \(c = 0.746\) indicates good predictive power of the model.

 
Association of Predicted Probabilities and Observed Responses
Percent Concordant 70.6 Somers' D 0.492
Percent Discordant 21.4 Gamma 0.535
Percent Tied 8.0 Tau-a 0.226
Pairs 16168 c 0.746

Option ctable prints the classification tables for various cut-off points. Each row of this output is a classification table for the specified Prob Level, \(\pi_0\).

 
Classification Table
Prob
Level
Correct Incorrect Percentages
Event Non-
Event
Event Non-
Event
Correct Sensi-
tivity
Speci-
ficity
Pos
Pred
Neg
Pred
0.280 172 0 94 0 64.7 100.0 0.0 64.7 .
0.300 162 21 73 10 68.8 94.2 22.3 68.9 67.7
0.320 162 21 73 10 68.8 94.2 22.3 68.9 67.7
0.340 162 21 73 10 68.8 94.2 22.3 68.9 67.7
0.360 162 21 73 10 68.8 94.2 22.3 68.9 67.7
0.380 162 21 73 10 68.8 94.2 22.3 68.9 67.7
0.400 145 34 60 27 67.3 84.3 36.2 70.7 55.7
0.420 145 34 60 27 67.3 84.3 36.2 70.7 55.7
0.440 145 34 60 27 67.3 84.3 36.2 70.7 55.7
0.460 145 34 60 27 67.3 84.3 36.2 70.7 55.7
0.480 133 53 41 39 69.9 77.3 56.4 76.4 57.6
0.500 133 53 41 39 69.9 77.3 56.4 76.4 57.6
0.520 133 53 41 39 69.9 77.3 56.4 76.4 57.6
0.540 133 53 41 39 69.9 77.3 56.4 76.4 57.6
0.560 133 53 41 39 69.9 77.3 56.4 76.4 57.6
0.580 133 53 41 39 69.9 77.3 56.4 76.4 57.6
0.600 126 73 21 46 74.8 73.3 77.7 85.7 61.3
0.620 126 73 21 46 74.8 73.3 77.7 85.7 61.3
0.640 126 73 21 46 74.8 73.3 77.7 85.7 61.3
0.660 126 73 21 46 74.8 73.3 77.7 85.7 61.3
0.680 126 73 21 46 74.8 73.3 77.7 85.7 61.3
0.700 126 73 21 46 74.8 73.3 77.7 85.7 61.3
0.720 126 73 21 46 74.8 73.3 77.7 85.7 61.3
0.740 103 76 18 69 67.3 59.9 80.9 85.1 52.4
0.760 81 84 10 91 62.0 47.1 89.4 89.0 48.0
0.780 81 84 10 91 62.0 47.1 89.4 89.0 48.0
0.800 81 84 10 91 62.0 47.1 89.4 89.0 48.0
0.820 81 84 10 91 62.0 47.1 89.4 89.0 48.0
0.840 52 84 10 120 51.1 30.2 89.4 83.9 41.2
0.860 52 86 8 120 51.9 30.2 91.5 86.7 41.7
0.880 52 86 8 120 51.9 30.2 91.5 86.7 41.7
0.900 0 94 0 172 35.3 0.0 100.0 . 35.3

Here is part of the R program assay.R that plots the ROC curve.


#### ROC curve
#### sensitivity vs 1-specificity

lp = result$linear.predictors
p = exp(lp)/(1+exp(lp))
cbind(yes,no,p)
p0 = 0
sens = 1
spec = 0
total = 100
for (i in (1:total)/total)
{ 
  yy = sum(r*(p>=i))
  yn = sum(r*(p=i))
  nn = sum(n*(p))>))>

Here is the ROC graph from R output:


7.5 - Lesson 7 Summary

7.5 - Lesson 7 Summary

In the past two lessons, we've explored ways to fit and evaluate logistic regression models for a binary response. These results generalize what we saw earlier for two and three-way tables to explain associations between two variables while controlling for additional variables and allowing for both categorical and continuous types. One shortcoming, however, would be in accommodating response variables with more than two levels (something more general than "success" or "failure"). For this, we can utilize a multinomial random component; the link and systematic components need not change. The challenge, as we'll see, is in defining and interpreting odds when more than one complementary category is involved.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility