7.3 - Overdispersion

Overdispersion is an important concept in the analysis of discrete data. Many times data admit more variability than expected under the assumed distribution. The extra variability not predicted by the generalized linear model random component reflects overdispersion. Overdispersion occurs because the mean and variance components of a GLM are related and depend on the same parameter that is being predicted through the predictor set.

Overdispersion is not an issue in ordinary linear regression. In a linear regression model

\(y_i \sim N(x_i^T \beta, \sigma^2)\)

the variance \(σ^2\) is estimated independently of the mean function \(x_i^T \beta\). With discrete response variables, however, the possibility for overdispersion exists because the commonly used distributions specify particular relationships between the variance and the mean; we will see the same holds for Poisson.

For the binomial response, if \(Y_i\sim Bin(n_i, \pi_i)\), the mean is \(\mu_i=n_i\pi_i\), and the variance is \(\mu_i(n_i - \mu_i) / n_i\).

Overdispersion means that the variance of the response \(Y_i\) is greater than what's assumed by the model.
Underdispersion is also theoretically possible but rare in practice. More often than not, if the model's variance doesn't match what's observed in the response, it's because the latter is greater.

In the context of logistic regression, overdispersion occurs when the discrepancies between the observed responses \(y_i\) and their predicted values \(\hat{\mu}_i=n_i \hat{\pi}_i\) are larger than what the binomial model would predict. Overdispersion arises when the \(n_i\) Bernoulli trials that are summarized in a line of the dataset are

not identically distributed (i.e., the success probabilities vary from one trial to the next), or
not independent (i.e., the outcome of one trial influences the outcomes of other trials).

In practice, it is impossible to distinguish non-identically distributed trials from non-independence; the two phenomena are intertwined.

Issue!

If overdispersion is present in a dataset, the estimated standard errors and test statistics the overall goodness-of-fit will be distorted and adjustments must be made. When a logistic model fitted to n binomial proportions is satisfactory, the residual deviance has an approximate \(\chi^2\) distribution with \((n – p)\) degrees of freedom, where \(p\) is the number of unknown parameters in the fitted model. Since the expected value of a \(chi^2\) distribution is equal to its degree of freedom, it follows that the residual deviance for a well-fitting model should be approximately equal to its degrees of freedom. Equivalently, we may say that the mean deviance (deviance/df) should be close to one. Similarly, if the variance of the data is greater than that under binomial sampling, the residual mean deviance is likely to be greater than 1.

The problem of overdispersion may also be confounded with the problem of omitted covariates. If we have included all the available covariates related to \(Y_i\) in our model and it still does not fit, it could be because our regression function \(x_i^T \beta\) is incomplete. Or it could be due to overdispersion. Unless we collect more data, we cannot do anything about omitted covariates. But we can adjust for overdispersion.

Adjusting for Overdispersion

The most popular method for adjusting for overdispersion comes from the theory of quasi-likelihood. Quasilikelihood has come to play a very important role in modern statistics. It is the foundation of many methods that are thought to be "robust" (e.g. Generalized Estimating Equations (GEE) for longitudinal data) because they do not require the specification of a full parametric model.

In the quasilikelihood approach, we must first specify the "mean function" which determines how \(\mu_i=E(Y_i)\) is related to the covariates. In the context of logistic regression, the mean function is

\(\mu_i=n_i \exp(x_i^T \beta)\),

which implies

\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=x_i^T \beta\)

Then we must specify the "variance function," which determines the relationship between the variance of the response variable and its mean. For a binomial model, the variance function is \(\mu_i(n_i-\mu_i)/n_i\). But to account for overdispersion, we will include another factor \(\sigma^2\) called the "scale parameter," so that

\(V(Y_i)=\sigma^2 \mu_i (n_i-\mu_i)/n_i\).

If \(\sigma^2 \ne 1\) then the model is not binomial; \(\sigma^2 > 1\) corresponds to "overdispersion", and \(\sigma^2 < 1\) corresponds to "underdispersion."

If \(\sigma^2\) were known, we could obtain a consistent, asymptotically normal and efficient estimate for \(\beta\) by a quasi-scoring procedure, sometimes called "estimating equations." For the variance function shown above, the quasi-scoring procedure reduces to the Fisher scoring algorithm that we mentioned as a way to iteratively find ML estimates.

Note that no matter what \(\sigma^2\) is assumed to be, we get the same estimate for \(\beta\). Therefore, this method for overdispersion does not change the estimate for \(\beta\) at all. However, the estimated covariance for \(\hat{\beta}\) changes from

\(\hat{V}(\hat{\beta})=(x^T W x)^{-1}\)

\(\hat{V}(\hat{\beta})=\sigma^2 (x^T W x)^{-1}\)

That is, the estimated standard errors must be multiplied by the factor \(\sigma=\sqrt{\sigma^2}\).

How do we estimate \(\sigma^2\)?

One recommendation is to use

\(\hat{\sigma}^2=X^2/(N-p)\)

where \(X^2\) is the usual Pearson goodness-of-fit statistic, \(N\) is the number of sample cases (number of rows in the dataset we are modeling), and \(p\) is the number of parameters in the current model (suffering from overdispersion).

If the model holds, then \(X^2/(N - p)\) is a consistent estimate for \(\sigma^2\) in the asymptotic sequence \(N\rightarrow\infty\) for fixed \(n_i\)s.

The deviance-based estimate \(G^2/(N - p)\) does not have this consistency property and should not be used.

This is a reasonable way to estimate \(\sigma^2\) if the mean model \(\mu_i=g(x_i^T \beta)\) holds. But if important covariates are omitted, then \(X^2\) tends to grow and the estimate for \(\sigma^2\) can be too large. For this reason, we will estimate \(\sigma^2\) under a maximal model, a model that includes all of the covariates we wish to consider.

The best way to estimate \(\sigma^2\) is to identify a rich model for \(\mu_i\) and designate it to be the most complicated one that we are willing to consider. For example, if we have a large pool of potential covariates, we may take the maximal model to be the model that has every covariate included as a main effect. Or, if we have a smaller number of potential covariates, we decide to include all main effects along with two-way and perhaps even three-way interactions. But we must omit at least a few higher-order interactions, otherwise, we will end up with a model that is saturated.

In an overdispersed model, we must also adjust our test statistics. The statistics \(X^2\) and \(G^2\) are adjusted by dividing them by \(\sigma^2\). That is, tests of nested models are carried out by comparing differences in the scaled Pearson statistic, \(\Delta X^2/\sigma^2\), or the scaled deviance, \(G^2/\sigma^2\) to a chi-square distribution with degrees of freedom equal to the difference in the numbers of parameters for the two models.

If the data are overdispersed — that is, if

\(V(Y_i) \approx \sigma^2 n_i \pi_i (1-\pi_i)\)

for a scale factor \(\sigma^2 > 1\), then the residual plot may still resemble a horizontal band, but many of the residuals will tend to fall outside the \(\pm 3\) limits. In this case, the denominator of the Pearson residual will tend to understate the true variance of the \(Y_i\), making the residuals larger. If the plot looks like a horizontal band but \(X^2\) and \(G^2\) indicate lack of fit, an adjustment for overdispersion might be warranted. A warning about this, however: If the residuals tend to be too large, it doesn't necessarily mean that overdispersion is the cause. Large residuals may also be caused by omitted covariates. If some important covariates are omitted from \(x_i\), then the true \(\mu_i\)s will depart from what your model predicts, causing the numerator of the Pearson residual to be larger than usual. That is, apparent overdispersion could also be an indication that your mean model needs additional covariates. If these additional covariates are not available in the dataset, however, then there's not much we can do about it; we may need to attribute it to overdispersion.

Note, there is no overdispersion for ungrouped data. McCullagh and Nelder (1989) point out that overdispersion is not possible if \(n_i=1\). If \(y_i\) only takes values 0 and 1, then it must be distributed as Bernoulli(\(\pi\)), and its variance must be \(\pi_i(1-\pi_i)\). There is no other distribution with support {0,1}. Therefore, with ungrouped data, we should always assume scale=1 and not try to estimate a scale parameter and adjust for overdispersion.

Summary of Adjusting for Overdispersion in Binary Logistic Regression

The usual way to correct for overdispersion in a logit model is to assume that:

\(E(y_i)=n_i \pi_i\)
\(V(y_i)=\sigma^2 n_i \pi_i (1-\pi_i)\)

where \(\sigma^2\) is a scale parameter. Under this modification, the Fisher-scoring procedure for estimating \(\beta\) does not change, but its estimated covariance matrix becomes \(\sigma^2(x^TWx)^{-1}\)—that is, the usual standard errors are multiplied by the square root of \(\sigma^2\). This will make the confidence intervals wider.

Let's get back to our example and refit the model, making an adjustment for overdispersion.

We can refit the model, making an adjustment for overdispersion in SAS by changing the model statement to

model y/n= logconc / scale=pearson;

In SAS, including the option scale=Pearson in the model statement will perform the adjustment. It will not change the estimated coefficients \(\beta_j\), but it will adjust the standard errors.

Here's the output:


Deviance and Pearson Goodness-of-Fit Statistics
Criterion	Value	DF	Value/DF	Pr > ChiSq
Deviance	29.3462	7	4.1923	0.0001
Pearson	28.5630	7	4.0804	0.0002

Number of events/trials observations: 9

Note! The covariance matrix has been multiplied by the heterogeneity factor (Pearson Chi-Square / DF) 4.08043.


Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	13.3037	1	0.0003
Score	12.6609	1	0.0004
Wald	11.0165	1	0.0009


Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept	1	-15.8331	4.9230	10.3438	0.0013
logconc	1	5.5776	1.6804	11.0165	0.0009

The estimated scale parameter is \(\hat{\sigma}^2=X^2/df=4.08\). SAS automatically scales the covariance matrix by this factor, which means that

the standard errors in the table of coefficients are multiplied by \(\sqrt{4.08} \approx 2\), and
the Wald statistics are divided by 4.08.

If you are using glm() in R, and want to refit the model adjusting for overdispersion one way of doing it is to use summary.glm() function. For example, fit the model using glm() and save the object as RESULT. By default, dispersion is equal to 1. This will perform the adjustment. It will not change the estimated coefficients \(\beta_j\), but it will adjust the standard errors.

Estimate from the MAXIMAL model dispersion value as \(X^2/df\). Then we can call

summary(RESULT, dispersion=4.08,correlation=TRUE,symbolic.cor = TRUE).

This should give the same model but with an adjusted covariance matrix---that is, adjusted standard errors (SEs) for the \(\beta\)s (estimated logistic regression coefficients) and also changed z-values. Notice it will not adjust overall fit statistics. For that try the package "dispmod" (see assay.R).

R output after adjusting for overdispersion:


Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -15.834      4.923  -3.216 0.001298 ** 
logconc        5.578      1.680   3.319 0.000902 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 4.08)

    Null deviance: 83.631  on 8  degrees of freedom
Residual deviance: 29.346  on 7  degrees of freedom
AIC: 62.886

There are other corrections that we could make. If we were constructing an analysis-of-deviance table, we would want to divide \(G^2\) and \(X^2\) by \(\hat{\sigma}^2\) and use these scaled versions for comparing nested models. Moreover, in reporting residuals, it would be appropriate to modify the Pearson residuals to

\( r_i^\ast=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{\hat{\sigma}^2n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\);

that is, we should divide the Pearson residuals (or the deviance residuals, for that matter) by \(\sqrt{\hat{\sigma}^2}\).

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility