12: Advanced Topics II
Overview
Having introduced dependent samples, where two responses are matched by subject or "cluster" (e.g., two ratings provided for each movie in a sample), and ways to measure agreement between them (e.g., marginal homogeneity or symmetry) for two-way tables, we now extend these ideas to three or more responses clustered together. This gives us the categorical version of repeated measures, along with the additional challenge of choosing an appropriate structure for modeling the association among the responses.
The generalized estimating equation (GEE) approach is an extension of GLMs that provides a semi-parametric approach to repeated measures of a categorical response; it can also be used for continuous measurements.
12.1  Introduction to Generalized Estimating Equations
The idea behind GEEs is to produce reasonable estimates of model parameters, along with standard errors, without specifying a likelihood function in its entirety, which can be quite difficult with a multivariate categorical response. We have to account for the correlation among the multiple responses that arise from a single subject, but we can largely estimate these correlations from the data without having to specify the form of the correlation structure exactly.
In the General Social Survey (GSS) of 2018, respondents answered questions regarding their confidence in various institutions in the U.S., including education, medicine, and the scientific community. Among these three institutions, are there differences in the proportions of people having "a great deal" of confidence? Responses are clustered by subject because each subject provided answers to all three questions.
A GLM for Clustered Responses
The response vector for subject \(i\) is \(Y_i=(Y_{i1},\ldots,Y_{in_i})\), where \(Y_{ij}\) is the categorical response for subject \(i\) (\(1,\dots,n\)) and measurement \(j\) (\(1,\ldots,n_i\)). With expected value \(E(Y_{i})=\mu_i'=(\mu_{i1},\ldots,\mu_{in_i})\) and covariate/predictor vector \(x_{ij}\), we have the general model expression:
\(g(\mu_{ij})=x_{ij}'\beta\)
In our example above, \(Y_{ij}\) is binomial with mean \(\mu_{ij}=\pi_{ij}\), and the logit link would be used for \(g\). If the institution indicators, say \(Med_{ij}=1\) for medicine and \(Sci_{ij}=1\) for scientific community (each would be 0 otherwise), are added, along with a covariate for age, we have \(x_{ij}'=(1,Med_{ij},Sci_{ij},Age_i)\) and
\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2+Age_i\beta_3\)
Note that the age of the subject does not vary within subject and so does not require a \(j\) subscript. Although subjects are assumed to be independently collected, the multiple responses within a subject are allowed to have a correlation structure. Here are some potential structures we may consider. In each matrix below (in general, these would be \(n_i\times n_i\)), the value in the \(j^{th}\) row and \(k^{th}\) column represents the correlation between the \(j^{th}\) and \(k^{th}\) measurement for the same subject.
Independence  (no correlation within subjects)
\( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
Exchangeable or Compound Symmetry
\( \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}\)
Autoregressive Order 1, AR(1)  (correlation decays with the "distance" in time or order between measurements)
\( \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix}\)
Unstructured
\( \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{bmatrix}\)
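For concreteness, these working correlation matrices can be generated for any cluster size. Below is a small Python/NumPy helper (illustrative only; the function name and interface are ours, not part of any GEE software):

```python
import numpy as np

def working_corr(structure, n, rho=0.0, R=None):
    """Build an n x n working correlation matrix.

    structure: 'independence', 'exchangeable', 'ar1', or 'unstructured'
    rho:       common correlation (exchangeable) or lag-1 correlation (AR(1))
    R:         user-supplied full matrix for 'unstructured'
    """
    if structure == "independence":
        return np.eye(n)
    if structure == "exchangeable":
        # rho everywhere off the diagonal, 1 on the diagonal
        return np.full((n, n), rho) + (1 - rho) * np.eye(n)
    if structure == "ar1":
        # correlation rho^|j - k| between measurements j and k
        idx = np.arange(n)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    if structure == "unstructured":
        return np.asarray(R)
    raise ValueError(structure)

# Example: AR(1) with rho = 0.5 for three repeated measures
print(working_corr("ar1", 3, rho=0.5))
```

For three measurements, the AR(1) call above reproduces the \(\rho, \rho^2\) pattern shown in the matrix displayed earlier.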
Maximum-likelihood estimation of these parameters from the data would be identical to that for the familiar logistic model, except that the correlation structure would be incorporated into the likelihood for responses from the same subject. The GEE alternative avoids this.
Interpretation of Parameter Estimates:
The interpretation depends on the chosen link function. For example, with the logit link, \(\exp\left(\beta_0\right)\) is the odds that the characteristic is present in observation \(i\) when \(x_i = 0\), whereas with the identity link, \(\beta_0\) is the value of the response when \(x_i = 0\).
Model Fit:
 We don't test for the model fit of the GEE, because this is really an estimating procedure; there is no likelihood function.
 We look at the empirical estimates of the standard errors and the covariance.
 Compare the empirical estimates with the model-based estimates.
Advantages:
 Computationally simpler than MLE for categorical data
 Does not require specification of a joint multivariate distribution
 Consistent estimation even with misspecified correlation structure
Limitations:
 There is no likelihood function, since the GEE does not completely specify the joint distribution; thus some consider it not a model but just a method of estimation.
 Likelihood-based methods are NOT available for testing fit, comparing models, and conducting inferences about parameters.
 Empirical-based standard errors underestimate the true ones unless the sample size is very large.
 Empirical estimators are more variable than the parametric ones.
First we examine the method of "independence estimating equations," which incorrectly assumes that the observations within a subject are independent.
GEE Approach to Estimation
Starting with \(E(y_{i})=\mu_{i}\), the vector of means for subject \(i\) connected with the predictors via \(g(\mu_i)=x_i'\beta\), we let \(\Delta_i\) be the diagonal matrix of variances
\(\Delta_i=\text{Diag}[\text{Var}(y_{ij})]= \begin{bmatrix}
\text{Var}(y_{i1}) & 0 & \cdots & 0 \\
0 & \text{Var}(y_{i2}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \text{Var}(y_{in_i})
\end{bmatrix}\).
In terms of the correlation structure, this would be:
\( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
Unless the observations within a subject are independent,
\(\Delta_i \neq \text{Cov}(y_i)\)
But if \(\Delta_i\) were correct, we could stack up the \(y_i\)'s and estimate \(\beta\) using techniques for generalized linear models.
How bad is it to pretend that \(\Delta_i\) is correct? Let \(\hat{\beta}\) be the estimate that assumes observations within a subject are independent (e.g., as found in ordinary linear regression, logistic regression, etc.).
 If \(\Delta_i\) is correct, then \(\hat{\beta}\) is asymptotically unbiased and efficient. Recall that unbiased means \(E(\hat{\beta})=\beta\), and efficient means it has the smallest variance among all possible estimators.
 If \(\Delta_i\) is not correct, then \(\hat{\beta}\) is still asymptotically unbiased but no longer efficient
 The 'naive' standard error for \(\hat{\beta}\), obtained from the naive estimate of \(\text{Cov}(\hat{\beta})\),
\(\hat{\sigma}^2 \left[X^T \hat{W} X \right]^{-1}\),
may be very misleading (here, \(X\) is the matrix of stacked \(X_i\)'s and \(\hat{W}\) is the diagonal matrix of final weights, if any)
 Consistent standard errors for \(\hat{\beta}\) are still possible using the sandwich estimator (sometimes called the 'robust' or 'empirical' estimator)
Sandwich Estimator
The sandwich estimator was first proposed by Huber (1967) and White (1980); Liang and Zeger (1986) applied it to longitudinal data:
\(\left[X^T \hat{W} X \right]^{-1} \left[\sum\limits_i X_i^T (y_i-\hat{\mu}_i)(y_i-\hat{\mu}_i)^T X_i \right]\left[X^T \hat{W} X \right]^{-1}\)
 provides a good estimate of \(\text{Cov}(\hat{\beta})\) in large samples (several hundred subjects or more) regardless of the true form of \(\text{Cov}\left(y_i\right)\)
 in smaller samples it could be rather noisy, so 95% intervals obtained by ± 2 SE's could suffer from undercoverage
When withinsubject correlations are not strong, Zeger (1988) suggests that the use of IEE with the sandwich estimator is highly efficient.
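To make the IEE-plus-sandwich recipe concrete, here is a small simulation sketch in Python/NumPy (illustrative only; the cluster count, true parameters, and random-effect setup are invented for the demo). It fits an ordinary logistic model to correlated clustered binary data by Newton-Raphson and then computes both the naive and the sandwich covariance estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate clustered binary data: each subject (cluster) answers T questions;
# a shared subject effect induces positive within-subject correlation.
n_subj, T = 500, 3
beta_true = np.array([-1.0, 0.5, 0.9])                  # intercept + 2 dummies
X = np.tile(np.column_stack([np.ones(T), np.eye(T)[:, 1:]]), (n_subj, 1))
subj = np.repeat(np.arange(n_subj), T)
b = rng.normal(0.0, 1.0, n_subj)                        # shared subject effect
eta = X @ beta_true + b[subj]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Independence estimating equations = ordinary logistic MLE (Newton-Raphson)
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
p = 1 / (1 + np.exp(-X @ beta))                         # refresh at convergence
W = p * (1 - p)

bread = np.linalg.inv(X.T @ (W[:, None] * X))           # naive Cov estimate
meat = np.zeros((3, 3))                                 # sum of per-cluster
for i in range(n_subj):                                 # score outer products
    rows = subj == i
    u = X[rows].T @ (y[rows] - p[rows])
    meat += np.outer(u, u)
sandwich = bread @ meat @ bread

print("naive SEs:   ", np.sqrt(np.diag(bread)))
print("sandwich SEs:", np.sqrt(np.diag(sandwich)))
```

The point estimates are exactly those of ordinary logistic regression; only the standard errors change when the empirical cluster-level scores enter through the "meat" of the sandwich.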
12.2  Modeling Binary Clustered Responses
Example: Confidence in U.S. Institutions
For the three GSS questions about confidence in education, medicine, and the scientific community, the original responses were divided into three categories: "A great deal", "Hardly any", and "Only some".
                       A great deal   Hardly any   Only some
Education                   370           276          821
Medicine                    525           194          748
Scientific Community        665            95          707
For this particular analysis, we focus on the proportion answering "A great deal" and combine the other two categories. Thus, we have binary responses repeatedly measured on each of 1467 subjects or clusters. With \(Y_{ij}\) representing the binomial response for the \(j^{th}\) question and the \(i^{th}\) subject, each subject's three responses form a cluster, within which we expect some correlation.
For reference, if we temporarily ignore the correlation among responses within subjects and instead treat all \(Y_{ij}\) as independent, we can calculate estimates of the odds of "A great deal" of confidence in education, medicine, and scientific community, respectively, as
\(\dfrac{370}{276+821}=0.337,\mbox{ }\dfrac{525}{194+748}=0.557,\mbox{ and }\dfrac{665}{95+707}=0.829\)
And from Lesson 3, we can calculate the standard error for differences in these (log) odds. For example, the standard error for comparing the log odds of "A great deal" for education versus medicine would be
\(\sqrt{\dfrac{1}{370}+\dfrac{1}{(276+821)}+\dfrac{1}{525}+\dfrac{1}{(194+748)}}=0.0811\)
But we would expect this to be smaller if the correlation between these responses is positive and taken into account. This is what we now explore with the software analysis.
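The naive odds above and the independence-based standard error for the education-versus-medicine comparison can be checked directly. A quick Python sketch (used here only for illustration; the arithmetic is software-agnostic):

```python
import math

# "A great deal" vs. the other two categories combined, from the GSS table
counts = {
    "education": (370, 276 + 821),
    "medicine":  (525, 194 + 748),
    "science":   (665,  95 + 707),
}

odds = {k: a / b for k, (a, b) in counts.items()}
print({k: round(v, 3) for k, v in odds.items()})   # 0.337, 0.557, 0.829

# Log odds ratio (medicine vs. education) and its independence-based SE
log_or = math.log(odds["medicine"] / odds["education"])
se = math.sqrt(1/370 + 1/1097 + 1/525 + 1/942)
print(round(log_or, 4), round(se, 4))              # 0.5023 0.0811
```

The log odds ratio of about 0.502 and standard error 0.0811 match the naive logistic fit shown in the output further below.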
Fitting the GEE Model
The model for the clustered responses as a function of only the question type would look like this.
\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2\)
where \(\pi_{ij}\) is the probability that the \(i^{th}\) subject answers "A great deal" to the \(j^{th}\) question. The slope \(\beta_1\) is interpreted as the log odds ratio for "A great deal" of confidence in medicine, relative to education.
The naive approach would assume all the responses from all subjects are independent and estimate the parameters and standard errors with the usual logistic model, which can be done with SAS's LOGISTIC procedure.
data gss;
  infile 'x:\gss.confidence.csv' delimiter=',' firstobs=2;
  input id age sex $ question $ conf $ greatly;
run;

proc logistic data=gss;
  class question(ref='coneduc') / param=ref;
  model greatly(event='1') = question / scale=none;
run;
This coding will treat the question about education as the baseline. Compare the output below to the estimates and standard errors obtained from the calculations above.
Analysis of Maximum Likelihood Estimates

Parameter            DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept             1   -1.0868    0.0601           326.8155          <.0001
question  conmedic    1    0.5022    0.0811            38.3292          <.0001
question  consci      1    0.8995    0.0798           127.1214          <.0001
If instead, we estimate the parameters and standard errors according to the GEE approach, we use the GENMOD procedure and have several choices for the working correlation structure (within subjects). Note that even with the working correlation structure specified as independent, as it is below with the TYPE=IND option, the sample correlations among the three responses within subjects are still being incorporated into the standard error estimates.
* gee with independent corr structure;
proc genmod data=gss descending;
  class id question(ref='coneduc');
  model greatly = question / dist=bin link=logit type3 wald;
  repeated subject=id / type=ind corrw;
  title 'gee with independent corr structure';
run;
The SUBJECT=ID option is where we specify the clusters within which repeated observations may be correlated. Compared with the LOGISTIC output above, the estimates are identical, but the standard errors are slightly smaller. The TYPE3 and WALD options can also be used to output an ANOVA table for an overall effect due to question type. Against the reference chi-square distribution with two degrees of freedom, the test statistic of 149 is strong evidence that subjects' confidence differs among the three institutions, with the highest proportion of "A great deal" of confidence applying to the scientific community.
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter            Estimate   Standard Error   95% Confidence Limits      Z      Pr > |Z|
Intercept            -1.0868    0.0601           -1.2047   -0.9690       -18.08    <.0001
question  conmedic    0.5022    0.0697            0.3657    0.6388         7.21    <.0001
question  consci      0.8995    0.0737            0.7552    1.0439        12.21    <.0001
question  coneduc     0.0000    0.0000            0.0000    0.0000          .       .
Wald Statistics For Type 3 GEE Analysis

Source     DF   Chi-Square   Pr > ChiSq
question    2   149.21       <.0001
Other options for the working correlation structure can produce slightly different results, but in general, the results are not very sensitive to the specified working correlation structure because the empirical correlations among the responses in the data largely dominate the GEE calculations. With this in mind, a good compromise between model fit and parsimony is the exchangeable structure, which allows a single correlation parameter for all pairwise responses within a subject. The results below are for the exchangeable structure; these are particularly similar to those assuming the independent correlation structure because the estimated correlation in the data is so small.
* gee with exchangeable corr structure;
proc genmod data=gss descending;
  class id question(ref='coneduc');
  model greatly = question / dist=bin link=logit type3 wald;
  repeated subject=id / type=exch corrw;
  title 'gee with exchangeable corr structure';
run;
Working Correlation Matrix

         Col1     Col2     Col3
Row1   1.0000   0.2348   0.2348
Row2   0.2348   1.0000   0.2348
Row3   0.2348   0.2348   1.0000
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter            Estimate   Standard Error   95% Confidence Limits      Z      Pr > |Z|
Intercept            -1.0868    0.0601           -1.2047   -0.9690       -18.08    <.0001
question  conmedic    0.5022    0.0697            0.3657    0.6388         7.21    <.0001
question  consci      0.8995    0.0737            0.7552    1.0439        12.21    <.0001
question  coneduc     0.0000    0.0000            0.0000    0.0000          .       .
The naive approach would assume all the responses from all subjects are independent and estimate the parameters and standard errors with the usual logistic model, which can be done with R's glm function.
library(geepack)
gss = read.csv(file.choose(), header=T) # "gss.confidence.csv"
# all observations independent
fit.glm = glm(greatly~question, data=gss, family=binomial(link='logit'))
By default, R will treat the question about education (first alphabetically among the question types) as the baseline. Compare the output below to the estimates and standard errors obtained from the calculations above.
> summary(fit.glm)

Call:
glm(formula = greatly ~ question, family = binomial(link = "logit"),
    data = gss)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.099  -0.941  -0.762   1.258   1.660

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.0868     0.0601  -18.08   <2e-16 ***
questionconmedic   0.5022     0.0811    6.19    6e-10 ***
questionconsci     0.8995     0.0798   11.27   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5722.8 on 4400 degrees of freedom
Residual deviance: 5591.4 on 4398 degrees of freedom
AIC: 5597
If instead, we estimate the parameters and standard errors according to the GEE approach, we use the geeglm function and have several choices for the working correlation structure (within subjects). Note that even with the working correlation structure specified as independent, as it is below with the corstr option, the sample correlations among the three responses within subjects are still being incorporated into the standard error estimates.
# gee with independent corr structure
fit.ind = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="independence", scale.fix=T)
The id= argument is where we specify the clusters within which we have repeated observations that may be correlated, and the scale.fix=T option avoids R's default of introducing an overdispersion scale parameter. Compared with the glm output above, the estimates are identical, but the standard errors are slightly smaller. The anova function can also be used to output an ANOVA table for an overall effect due to question type. Against the reference chi-square distribution with two degrees of freedom, the test statistic of 149 is strong evidence that subjects' confidence differs among the three institutions, with the highest proportion of "A great deal" of confidence applying to the scientific community.
> summary(fit.ind)

Call:
geeglm(formula = greatly ~ question, family = binomial, data = gss,
    id = id, corstr = "independence", scale.fix = TRUE)

Coefficients:
                 Estimate Std.err Wald Pr(>|W|)
(Intercept)       -1.0868  0.0601  327  < 2e-16 ***
questionconmedic   0.5022  0.0697   52  5.6e-13 ***
questionconsci     0.8995  0.0737  149  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = independence
Scale is fixed.

Number of clusters: 1467   Maximum cluster size: 3
> anova(fit.ind)
Analysis of 'Wald statistic' Table

Model: binomial, link: logit
Response: greatly
Terms added sequentially (first to last)

         Df  X2 P(>|Chi|)
question  2 149    <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Other options for the working correlation structure can produce slightly different results, but in general, the results are not very sensitive to the specified working correlation structure because the empirical correlations among the responses in the data largely dominate the GEE calculations. With this in mind, a good compromise between model fit and parsimony is the exchangeable structure, which allows a single correlation parameter for all pairwise responses within a subject. The results below are for the exchangeable structure; these are particularly similar to those assuming the independent correlation structure in this example because the estimated correlation in the data is so small.
# gee with exchangeable corr structure
fit.exch = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="exchangeable", scale.fix=T)
# gee with unstructured corr structure
fit.un = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="unstructured", scale.fix=T)
> summary(fit.exch)

Call:
geeglm(formula = greatly ~ question, family = binomial, data = gss,
    id = id, corstr = "exchangeable", scale.fix = TRUE)

Coefficients:
                 Estimate Std.err Wald Pr(>|W|)
(Intercept)       -1.0868  0.0601  327  < 2e-16 ***
questionconmedic   0.5022  0.0697   52  5.6e-13 ***
questionconsci     0.8995  0.0737  149  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable
Scale is fixed.

  Link = identity

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.235  0.0173

Number of clusters: 1467   Maximum cluster size: 3
Adding Additional Predictors
With age (centered around its mean) also potentially involved, the model can be extended to
\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2+C.age_i\beta_3\)
In the case of age, or any predictor that is measured at the subject level, an index for question (institution type) is not needed.
Adding additional predictors, including interactions, does not change the GEE approach to estimation, but the parameter interpretations are adjusted for other effects. For example, \(\beta_0\) now represents the log-odds of having "A great deal" of confidence in the (baseline) education system for a subject at the mean age in this survey, which happens to be 48.1. The coefficient \(\beta_3\) is the change in log-odds of having "A great deal" of confidence in the education system for each one-year increase in the subject's age. Moreover, this effect is common to the other institutions as well unless an interaction is included.
* gee with age as a covariate;
data gss;
  set gss;
  c_age = age - 48.1;
run;

proc genmod data=gss descending;
  class id question(ref='coneduc');
  model greatly = question c_age / dist=bin link=logit type3 wald;
  repeated subject=id / type=exch corrw;
  title 'gee with age as a covariate';
run;
We see that, in addition to the institution in question, a person's confidence is also related to age. For each additional year of age, the estimated odds of having "A great deal" of confidence in the education system (or either of the others) is multiplied by
\( e^{-0.0050}=0.995\)
So, the proportions of people having "A great deal" of confidence in these institutions tend to be smaller for older ages, but this effect is very small in practical terms. The statistical significance is likely due to the large sample size.
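To see just how small this effect is in practical terms, we can convert the estimated coefficient to odds multipliers over longer age spans; a quick Python check (the one-decade multiplier is our own extrapolation from the reported estimate):

```python
import math

beta_age = -0.0050                         # GEE estimate for centered age
print(round(math.exp(beta_age), 3))        # 0.995  multiplier per extra year
print(round(math.exp(10 * beta_age), 3))   # 0.951  multiplier per extra decade
```

Even across a decade of age, the odds of "A great deal" of confidence drop by only about 5%.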
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter            Estimate   Standard Error   95% Confidence Limits      Z      Pr > |Z|
Intercept            -1.0889    0.0603           -1.2070   -0.9708       -18.07    <.0001
question  conmedic    0.5031    0.0698            0.3663    0.6399         7.21    <.0001
question  consci      0.9011    0.0738            0.7564    1.0457        12.21    <.0001
question  coneduc     0.0000    0.0000            0.0000    0.0000          .       .
c_age                -0.0050    0.0023           -0.0095   -0.0006        -2.22    0.0266
# gee with age covariate
gss$age = gss$age - mean(gss$age)
fit.age = geeglm(greatly~question+age, data=gss, id=id, family=binomial, corstr="exchangeable", scale.fix=T)
We see that, in addition to the institution in question, a person's confidence is also related to age. For each additional year of age, the estimated odds of having "A great deal" of confidence in the education system (or either of the others) is multiplied by
\( e^{-0.00504}=0.995\)
So, the proportions of people having "A great deal" of confidence in these institutions tend to be smaller for older ages, but this effect is very small in practical terms. The statistical significance is likely due to the large sample size.
> summary(fit.age)

Call:
geeglm(formula = greatly ~ question + age, family = binomial,
    data = gss, id = id, corstr = "exchangeable", scale.fix = T)

Coefficients:
                 Estimate  Std.err   Wald Pr(>|W|)
(Intercept)      -1.08903  0.06026 326.65  < 2e-16 ***
questionconmedic  0.50306  0.06979  51.95  5.7e-13 ***
questionconsci    0.90109  0.07381 149.06  < 2e-16 ***
age              -0.00504  0.00228   4.91    0.027 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable
Scale is fixed.

  Link = identity

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.233  0.0174

Number of clusters: 1467   Maximum cluster size: 3
Changing Covariance Structure
With PROC GENMOD, we can also try alternative assumptions about the within-subject correlation structure. The type=exch or type=cs option specifies an "exchangeable" or "compound symmetry" assumption, in which the observations within a subject are assumed to be equally correlated:
\(Corr(y_i)=\begin{bmatrix}
1 & \rho & \rho & \cdots & \rho\\
\rho & 1 & \rho & \cdots & \rho\\
\rho & \rho & 1 & \cdots & \rho\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\rho & \rho & \rho & \cdots & 1
\end{bmatrix}\)
This is the same kind of structure that arises in a normal linear model with random intercepts. It acknowledges that correlations exist within a subject, but it assumes that the correlations are constant over time.
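The random-intercept connection can be verified by simulation. In the sketch below (Python/NumPy; the variance components are arbitrary choices for the demo), any two measurements on the same subject share the intercept \(b_i\), so their correlation is \(\sigma^2_b/(\sigma^2_b+\sigma^2_e)\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random-intercept model: y_ij = b_i + e_ij, b_i ~ N(0, s2b), e_ij ~ N(0, s2e).
# Any two measurements on the same subject then have correlation s2b/(s2b+s2e).
s2b, s2e = 1.0, 3.0
n_subj, T = 20000, 3
b = rng.normal(0, np.sqrt(s2b), (n_subj, 1))
y = b + rng.normal(0, np.sqrt(s2e), (n_subj, T))

print(np.round(np.corrcoef(y, rowvar=False), 2))  # off-diagonals near 0.25
print(s2b / (s2b + s2e))                          # implied correlation: 0.25
```

All off-diagonal entries of the empirical correlation matrix hover around the same value, which is exactly the exchangeable (compound symmetry) pattern.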
An autoregressive AR(1) structure, specified by type=ar(1), allows the correlations to diminish over time as
\(Corr(y_{ij},y_{ij'})=\rho^{|t_{ij}-t_{ij'}|}\),
where \(t_{ij}\) and \(t_{ij'}\) are the observation times for \(y_{ij}\) and \(y_{ij'}\).
The exchangeable and the autoregressive structures both express the withinsubject correlations in terms of a single parameter \(\rho\). If the subjects are measured at a relatively small common set of occasions, we may be able to estimate an arbitrary correlation matrix. With \(n_i = 4\) measurement times per subject, the unstructured matrix would have six correlations to estimate. An unstructured matrix is obtained by the option type=un.
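The count of free correlations in an unstructured matrix is just the number of distinct pairs of measurement occasions, \(n_i(n_i-1)/2\); a one-line check in Python:

```python
from math import comb

# Free correlation parameters in an unstructured working matrix: one per pair
for n in (3, 4, 5):
    print(n, comb(n, 2))
```

With \(n_i=4\) this gives the six correlations mentioned above, and the count grows quickly as the number of occasions increases.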
Which correlation structure is best for any given problem?
In principle, it makes sense to think that the one that is most nearly "correct" would be best. For example, if the autoregressive or exchangeable assumptions are wrong, one might think that the unstructured form would provide the most efficient answer. But that is not necessarily true, because moving to an unstructured matrix introduces many more unknown parameters which could destabilize the model fit. In certain cases, a wrong structure with a small number of parameters could perform better than a correct structure with many parameters. Unfortunately, the literature does not provide much guidance on how to choose a good working correlation structure for GEE.
12.3  Addendum: Estimating Equations and the Sandwich
Quasi-likelihood
Suppose that we observe responses \(Y_1, Y_2, \dots, Y_N\) that we want to relate to covariates. The ordinary linear regression model is
\(Y_i \sim N(x_i^T\beta,\sigma^2)\quad\) (1)
where \(x_i\) is a vector of covariates, \(\beta\) is a vector of coefficients to be estimated, and \(\sigma^2\) is the error variance. Now let's generalize this model in two ways:
Introduce a link function for the mean \(E ( Y_i ) = \mu_i\),
\(g(\mu_i)=x_i^T\beta\).
We could also write this as
\(E(Y_i)=\mu_i(\beta)\quad\) (2)
where the covariates and the link become part of the function \(\mu_i\). For example, in a loglinear model we would have \(\mu_i=\exp(x_i^T\beta)\).
Allow for heteroscedasticity, so that
\(\text{Var}(Y_i)=V_i\quad\) (3)
In many cases \(V_i\) will depend on the mean \(\mu_i\), and therefore on \(\beta\), and possibly on additional unknown parameters (e.g., a scale factor). For example, in a traditional overdispersed loglinear model, we would have \(V_i=\sigma^2 \mu_i=\sigma^2 \exp(x_i^T\beta)\).
A maximum-likelihood estimate for \(\beta\) under this model could be computed by a Fisher scoring procedure. ML estimates have two nice theoretical properties: they are approximately unbiased and highly efficient. Interestingly, the asymptotic theory underlying these properties does not really depend on the normality of \(Y_i\), but only on the first two moments. That is, if the mean function (2) and the variance function (3) are correct, but the distribution of \(Y_i\) is not normal, the estimate of \(\beta\) obtained by maximizing the normal log-likelihood from
\(Y_i \sim N(\mu_i(\beta), V_i(\beta))\)
is still asymptotically unbiased and efficient.
Quasi-scoring
If we maximize the normality-based log-likelihood without assuming that the response is normally distributed, the resulting estimate of \(\beta\) is called a quasi-likelihood estimate. The iterative procedure for computing the quasi-likelihood estimate, called quasi-scoring, proceeds as follows.
First, let's collect the responses and their means into vectors of length \(N\),
\(Y=
\begin{bmatrix}
Y_1\\
Y_2\\
\vdots\\
Y_N
\end{bmatrix},
\mu=
\begin{bmatrix}
\mu_1\\
\mu_2\\
\vdots\\
\mu_N
\end{bmatrix} \)
Also, let \(V\) be the \(N \times N\) matrix with \(V_1 , \dots , V_N\) on the diagonal and zeros elsewhere. Finally, let
\(D=\dfrac{\partial \mu}{\partial \beta}\)
be the \(N \times p\) matrix containing the partial derivatives of \(\mu\) with respect to \(\beta\). That is, the \(\left(i, j\right)^{th}\) element of \(D\) is \(\dfrac{\partial\mu_i }{\partial\beta_j}\). In a simple linear model with an identity link, \(\mu_i=x_i^T\beta\), \(D\) is simply the \(N \times p\) matrix of covariates
\(X=
\begin{bmatrix}
x_1^T\\
x_2^T\\
\vdots\\
x_N^T
\end{bmatrix}\)
The iterative quasi-scoring procedure is
\(\beta_{\mbox{new}}-\beta_{\mbox{old}}= \left( D^T V^{-1} D\right)^{-1} D^T V^{-1}(Y-\mu)\quad\) (4)
where on the right-hand side of (4) the quantities \(D\), \(V\), and \(\mu\) are evaluated at \(\beta = \beta_{\mbox{old}}\).
It is easy to see that in the case of simple linear regression with homoscedastic response (\(\mu_i=x_i^T\beta\) and \(V_i=\sigma^2\) ), the quasiscoring procedure converges to the OLS estimate
\(\hat{\beta}=\left(X^T X\right)^{-1}X^T Y\)
in a single step regardless of \(\beta_{\mbox{old}}\). In other cases (e.g. loglinear regression where \(\mu_i=\exp(x_i^T\beta)\), \(V_i=\sigma^2 \mu_i\) ) it produces the same estimate for \(\beta\) that we obtained earlier by fitting the generalized linear model.
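The one-step convergence claim is easy to verify numerically. The sketch below (Python/NumPy, with invented data) implements a single quasi-scoring update (4) under the identity link and constant variance, and confirms that it lands on the OLS estimate from an arbitrary starting value:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data for the demo: N observations, p coefficients
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
beta_star = np.array([1.0, -2.0, 0.5])
Y = X @ beta_star + rng.normal(size=N)

def quasi_score_step(beta_old, mu_fn, D_fn, V_fn):
    """One iteration of the quasi-scoring update (4); V is diagonal (a vector)."""
    mu, D, V = mu_fn(beta_old), D_fn(beta_old), V_fn(beta_old)
    Vinv = 1.0 / V
    A = D.T @ (Vinv[:, None] * D)                 # D' V^{-1} D
    return beta_old + np.linalg.solve(A, D.T @ (Vinv * (Y - mu)))

# Identity link, homoscedastic variance: one step from anywhere lands on OLS
beta0 = rng.normal(size=p)                        # arbitrary starting point
beta1 = quasi_score_step(beta0,
                         mu_fn=lambda b: X @ b,   # mu = X beta
                         D_fn=lambda b: X,        # D = X for the identity link
                         V_fn=lambda b: np.ones(N))
ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(beta1, ols))                    # True
```

For a non-identity link (e.g., \(\mu_i=\exp(x_i^T\beta)\)), the same `quasi_score_step` would simply be iterated to convergence with the appropriate `mu_fn`, `D_fn`, and `V_fn`.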
Quasi-likelihood estimation is really the same thing as generalized linear modeling, except that we no longer have a full parametric model for \(Y_i\).
If we let \(U\) be the \(p \times 1\) vector
\(U=D^TV^{-1}(Y-\mu)\)
the final estimate from the quasi-scoring procedure satisfies the condition \(U = 0\). \(U\) can be regarded as the first derivative of the quasi-log-likelihood function. The first derivative of a log-likelihood is called a score, so \(U\) is called a quasi-score. \(U = 0\) is often called the set of estimating equations, and the final estimate for \(\beta\) is the solution to the estimating equations.
Estimating equations with a working variance function. We'll suppose that the mean regression function \(\mu_i \left(\beta\right)\) has been correctly specified but the variance function has not. That is, the data analyst incorrectly supposes that the variance function for \(Y_i\) is \(\tilde{V}_i\) rather than \(V_i\), where \(\tilde{V}_i\) is another function of \(\beta\). The analyst then estimates \(\beta\) by solving the quasi-score equations
\(U=D^T \tilde{V}^{-1}(Y-\mu)=0\)
where \(\tilde{V}\) is the diagonal matrix of the \(\tilde{V}_i\)'s. Obviously it would be better to perform the scoring procedure using the true variance function \(V = \text{Diag}\left(V_1 , \dots , V_N\right)\) rather than \(\tilde{V}\). What are the properties of this procedure when \(\tilde{V}\) has been used instead of \(V\)?
 The estimate \(\hat{\beta}\) is a consistent and asymptotically unbiased estimate of \(\beta\), even if \(\tilde{V} \neq V\). It's also asymptotically normal.
 If \(\tilde{V} \neq V\) then \(\hat{\beta}\) is not efficient; that is, the asymptotic variance of \(\hat{\beta}\) is lowest when \(\tilde{V} = V\).
 If \(\tilde{V}= V\) then the final value of the matrix \(\left(D^T \tilde{V}^{-1} D\right)^{-1}\) from the scoring procedure (4) (i.e., the value of this matrix with \(\hat{\beta}\) substituted for \(\beta\)) is a consistent estimate of \(\text{Var}(\hat{\beta})\).
 If \(\tilde{V} \neq V\) then the final value of the matrix \(\left(D^{T} \tilde{V}^{-1} D\right)^{-1}\) from (4) is not a consistent estimate of \(\text{Var}(\hat{\beta})\).
Because of these properties, \(\hat{\beta}\) may still be a reasonable estimate of \(\beta\) if \(\tilde{V} \neq V\), but the final value of \((D^T \tilde{V}^{-1}D)^{-1}\)  often called the "model-based" or "naive" estimator  will not give accurate standard errors for the elements of \(\hat{\beta}\). However, this problem can be corrected by using the "robust" or "sandwich" estimator, defined as
\(\left(D^T \tilde{V}^{-1}D\right)^{-1} \left(D^T \tilde{V}^{-1} E \tilde{V}^{-1} D\right) \left(D^T \tilde{V}^{-1}D\right)^{-1}\quad\) (5)
where,
\(E=\text{Diag}\left((Y_1-\mu_1)^2,\ldots, (Y_N-\mu_N)^2\right)\quad\) (6)
and all quantities in (5) and (6) are evaluated with \(\hat{\beta}\) substituted for \(\beta\). The sandwich estimator is a consistent estimate of \(\text{Var}(\hat{\beta})\) even when \(\tilde{V} \neq V\).
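A quick numerical illustration of (5) and (6) in Python/NumPy with simulated heteroscedastic data (the variance function and sample size are invented for the demo): the analyst fits with the wrong working variance \(\tilde{V}=I\), yet the sandwich estimator recovers the correct covariance of \(\hat{\beta}\) while the naive matrix does not:

```python
import numpy as np

rng = np.random.default_rng(3)

# Heteroscedastic linear model: the true Var(Y_i) grows with x_i, but the
# analyst uses the working variance V-tilde = I (i.e., ordinary least squares).
N = 100000
x = rng.uniform(1, 5, N)
D = np.column_stack([np.ones(N), x])       # identity link, so D = X
beta = np.array([2.0, 1.0])
V_true = x ** 2                            # true (unknown) variances
Y = D @ beta + rng.normal(0, np.sqrt(V_true))

bhat = np.linalg.solve(D.T @ D, D.T @ Y)   # solves U = D'(Y - mu) = 0
mu = D @ bhat

bread = np.linalg.inv(D.T @ D)             # naive (D' Vtilde^{-1} D)^{-1}
E = (Y - mu) ** 2                          # diagonal of E in (6)
sandwich = bread @ (D.T @ (E[:, None] * D)) @ bread   # formula (5)

# For comparison: the exact asymptotic covariance under the true variances
truth = bread @ (D.T @ (V_true[:, None] * D)) @ bread
print(np.sqrt(np.diag(sandwich)))          # close to the true SEs
print(np.sqrt(np.diag(truth)))
```

With a large sample, the sandwich standard errors track the true ones closely even though \(\tilde{V}\) is badly wrong, while the unscaled naive matrix understates the variability here.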
The sandwich estimator was first proposed by Huber (1967) and White (1980), but was popularized in the late 1980s when Liang and Zeger extended it to multivariate or longitudinal responses.
12.4  Inference for Loglinear Models: Sparse Data
As we consider higher-dimensional contingency tables, we are more likely to encounter sparseness. A sparse table is one with many cells having small counts and/or zeros. How many is 'many' and how small is 'small' are relative to
 the sample size \(n\)
 the table size \(N\)
We can have a small sample size, or a large sample size with a large number of cells. The large number of cells can be due either to a large number of classification variables or to a small number of variables with many levels.
Types of zeros (or empty cells)
Usually, when we talk about sparseness, we are talking about zeros, a problem that is frequently encountered. There are two types of zeros, and we need to differentiate between them.
Sampling Zeros occur when no observation falls in a cell even though the cell has positive probability. That is, \(n_{ij} = 0\) but \(\pi_{ij} > 0\) for some \(i\) and \(j\); increasing the sample size will eventually produce an observation there.
Structural Zeros are theoretically impossible to observe. That is, both \(n_{ij} = 0\) and \(\pi_{ij} = 0\). Tables with structural zeros are structurally incomplete; they are also known as incomplete tables. This is different from a partial classification, where an incomplete table results from not being able to completely cross-classify all individuals or units.
Example of a structurally incomplete table
Sample of adults on whether they got an infection once or twice:
                 Reinfection
                 Yes     No
Infection  Yes    87     16
           No      -    116
Total             87    132
The explanation here is that one cannot get an infection the second time until one gets it the first time. Thus, \(n_{21}\) is a structural zero.
Hypothetical Example
If there is a single sampling zero in a table, the MLE estimates of the cell counts are still positive. If the margins are nonzero but the pattern of zeros is unfortunate, some MLE estimates of the cell counts may not exist.
If there is a zero margin, we cannot estimate the corresponding term in the model, and the estimated partial odds ratios involving that margin will equal 0 or \(\infty\). For example, if \(n_{+11} = 0\), there is no MLE estimate for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly. We can force \(\lambda_{11}^{YZ} = 0\), but then we must adjust the degrees of freedom of the model, and software generally will not do this automatically.
A general formula for adjusting the degrees of freedom:
\(df = \left(T_e - Z_e\right) - \left(T_p - Z_p\right)\)
where \(T_e \) is the number of cells fitted, \(Z_e\) is the number of zero cells, \(T_p\) is the number of parameters in the model, and \(Z_p \) is the number of parameters that cannot be estimated because of zero margins.
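Since the adjustment is just arithmetic on cell and parameter counts, it can be wrapped in a small helper; `adj_df` below is our own illustrative function, not part of any package.

```r
# Adjusted df for a loglinear model fit to a table with zeros:
#   Te = number of cells fitted, Ze = number of zero cells,
#   Tp = number of parameters,   Zp = parameters lost to zero margins
adj_df <- function(Te, Ze, Tp, Zp) (Te - Ze) - (Tp - Zp)

adj_df(Te = 8, Ze = 2, Tp = 7, Zp = 1)  # -> 0
adj_df(Te = 8, Ze = 2, Tp = 4, Zp = 0)  # -> 2
```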
Let's see how this works in the example below. The data, also available in Zeros.sas and Zeros.R, are from Fienberg (1980) and contain a zero marginal total when we sum over X:
Z=1:
       Y=1   Y=2
X=1      0     5
X=2      0    16
Z=2:
       Y=1   Y=2
X=1      6     9
X=2      5     7
The three marginal tables are
XY:
       Y=1   Y=2
X=1      6    14
X=2      5    23
XZ:
       Z=1   Z=2
X=1      5    15
X=2     16    12
YZ:
       Z=1   Z=2
Y=1      0    11
Y=2     21    16
Here, \(n_{+11} = 0\), so there is no MLE estimate for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly. Let's take a look at these examples in R and SAS.
The following code fits the model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable conditioned on.
data zero_margin;
input X $ Y $ Z $ count @@;
datalines;
x1 y1 z1 0 x1 y1 z2 6 x1 y2 z1 5 x1 y2 z2 9
x2 y1 z1 0 x2 y1 z2 5 x2 y2 z1 16 x2 y2 z2 7
;
/* homogeneous associations */
proc genmod data=zero_margin order=data ;
class X Y Z;
model count = X Y Z X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zero margin';
run;
Notice the large standard errors for some of the terms, such as Y*Z y1 z1, below.
Homogeneous Association with zero margin
The GENMOD Procedure
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Wald
Parameter DF Estimate Error Confidence Limits Chi-Square Pr > ChiSq
Intercept 1 1.9459 0.3780 1.2051 2.6867 26.51 <.0001
X x1 1 0.2513 0.5040 -0.7364 1.2390 0.25 0.6180
X x2 0 0.0000 0.0000 0.0000 0.0000 . .
Y y1 1 -0.3365 0.5855 -1.4841 0.8112 0.33 0.5655
Y y2 0 0.0000 0.0000 0.0000 0.0000 . .
Z z1 1 0.8267 0.4532 -0.0615 1.7149 3.33 0.0681
Z z2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x1 y1 1 -0.0690 0.7878 -1.6131 1.4751 0.01 0.9302
X*Y x1 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x1 z1 1 -1.4145 0.7187 -2.8230 -0.0059 3.87 0.0490
X*Z x1 z2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y1 z1 1 -25.5664 92704.17 -181722 181671.3 0.00 0.9998
Y*Z y1 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Scale 0 1.0000 0.0000 1.0000 1.0000
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 1 0.0000 0.0000
Scaled Deviance 1 0.0000 0.0000
Pearson Chi-Square 1 0.0000 0.0000
Scaled Pearson X2 1 0.0000 0.0000
As expected, the overall deviance df is reported to be 1, since exactly one term is omitted from this model: the three-way interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get
\(df = \left(T_e - Z_e\right) - \left(T_p - Z_p\right) = \left(8 - 2\right) - \left(7 - 1\right) = 6 - 6 = 0\)
As it is, the estimates and the model overall are not very useful. If, however, we remove the cells with zero counts and refit the model, we get the correct degrees of freedom. Thus, we have
/*delete zero values*/
data zero_del;
set zero_margin;
if count=0 then delete;
run;
/* homogeneous associations */
proc genmod data=zero_del order=data ;
class X Y Z;
model count = X Y Z X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zero margin';
run;
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 0 0.0000 .
Scaled Deviance 0 0.0000 .
Pearson Chi-Square . 0.0000 .
Scaled Pearson X2 . 0.0000 .
Log Likelihood 56.6027
Full Log Likelihood -11.5503
AIC (smaller is better) 35.1007
AICC (smaller is better) .
BIC (smaller is better) 33.8512
Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be
\((8 - 2) - (4 - 0) = 2\)
/* complete independence */
proc genmod data=zero_del order=data;
class X Y Z;
model count = X Y Z /link=log dist=poi;
title 'Complete independence with zeros deleted';
run;
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 2 5.0616 2.5308
Scaled Deviance 2 5.0616 2.5308
Pearson Chi-Square 2 4.9058 2.4529
Scaled Pearson X2 2 4.9058 2.4529
Log Likelihood 54.0720
Full Log Likelihood -14.0811
AIC (smaller is better) 36.1622
AICC (smaller is better) 76.1622
BIC (smaller is better) 35.3293
The following code fits the model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable conditioned on.
# Zero Margin
X=factor(c(rep("x1",4),rep("x2",4)))
Y=factor(c(rep(c("y1","y1","y2","y2"),2)))
Z=factor(c(rep(c("z1","z2"),4)))
count=c(0,6,5,9,0,5,16,7)
table=xtabs(count~X+Y+Z)
# Homogeneous Association (XY,XZ,YZ) with zero margin
model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log))
Notice the large standard errors for several of the estimates below.
> summary(model)
Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))
Deviance Residuals:
1 2 3 4 5 6
-8.988e-06 0.000e+00 0.000e+00 0.000e+00 -1.664e-05 0.000e+00
7 8
5.960e-08 1.490e-08
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.393e+01 4.535e+04 -0.001 1.000
Xx2 1.232e+00 9.397e-01 1.311 0.190
Yy2 2.554e+01 4.535e+04 0.001 1.000
Zz2 2.572e+01 4.535e+04 0.001 1.000
Xx2:Yy2 -6.899e-02 7.878e-01 -0.088 0.930
Xx2:Zz2 -1.414e+00 7.187e-01 -1.968 0.049 *
Yy2:Zz2 -2.514e+01 4.535e+04 -0.001 1.000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3.7197e+01 on 7 degrees of freedom
Residual deviance: 3.5779e-10 on 1 degrees of freedom
AIC: 37.101
As expected, the overall deviance df is reported to be 1, since exactly one term is omitted from this model: the three-way interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get
\(df=(T_e−Z_e)−(T_p−Z_p)=(8−2)−(7−1)=6−6=0\)
As it is, the estimates and the model overall are not very useful. If, however, we remove the cells with zero counts and refit the model, we get the correct degrees of freedom. Thus, we have
# Delete Zero Values, "subset" used to remove 0s
# Homogeneous Association with zero margin: (XY,XZ,YZ)
model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log), subset=count>0)
And the output looks like:
> summary(model)
Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log),
subset = count > 0)
Deviance Residuals:
[1] 0 0 0 0 0 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.20397 0.69121 1.742 0.0815 .
Xx2 1.23214 0.93975 1.311 0.1898
Yy2 0.40547 0.52705 0.769 0.4417
Zz2 0.58779 0.55777 1.054 0.2920
Xx2:Yy2 -0.06899 0.78780 -0.088 0.9302
Xx2:Zz2 -1.41447 0.71866 -1.968 0.0490 *
Yy2:Zz2 NA NA NA NA

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 9.5791e+00 on 5 degrees of freedom
Residual deviance: 5.3291e-15 on 0 degrees of freedom
AIC: 35.101
Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be
\((8 - 2) - (4 - 0) = 2\)
# Delete Zero Values, "subset" used to remove 0s
# Complete Independence with zero margin: (X,Y,Z)
model=glm(count~X+Y+Z,family=poisson(link=log), subset=count>0)
> summary(model)
Call:
glm(formula = count ~ X + Y + Z, family = poisson(link = log),
subset = count > 0)
Deviance Residuals:
2 3 4 6 7 8
0.6314 -1.3798 0.8575 -0.5820 1.0228 -0.7994
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7944 0.4798 3.740 0.000184 ***
Xx2 0.3365 0.2928 1.149 0.250443
Yy2 0.3747 0.3917 0.957 0.338746
Zz2 -0.2719 0.3318 -0.819 0.412519

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 9.5791 on 5 degrees of freedom
Residual deviance: 5.0616 on 2 degrees of freedom
AIC: 36.162
Effects of sampling zeros
There are two major effects of sampling zeros on modeling contingency tables:
- Maximum likelihood estimates (MLEs) may not exist for loglinear/logit models.
- If the MLEs exist, they may be biased.
Recall that terms in a loglinear model correspond to margins of the table. Whether the MLEs exist depends on the positions of the zeros in the table and on which effects are included in the model.
If all \(n_{ij} > 0\), then the MLEs of the parameters are finite. If a table has a zero marginal frequency and the model contains a term corresponding to that margin, the MLEs are infinite.
The bias problem due to sparseness
As mentioned above, the ML estimates of odds ratios can be severely biased, and the sampling distributions of fit statistics may be poorly approximated by the chi-squared distribution; the extremely large standard errors seen earlier are one symptom of this sparseness.
One suggested ad hoc solution is to add 0.5 to each cell of the table. Doing so shrinks estimated odds ratios that would otherwise be infinite down to finite values and pulls estimates that would otherwise be 0 away from zero.
Other suggested solutions:
- When a model does not converge, try adding a tiny number, such as 0.000000001, to all zero cells in the table. Some argue you should add this value to all of the cells. Adding very small values is essentially what programs such as SAS and R do when they fit these types of models.
- Extend this idea into a sensitivity analysis: add numbers of varying sizes to the cells and examine whether the fit statistics and parameter estimates change very much.
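As a sketch of such a sensitivity analysis (our own illustration, reusing the zero-margin example from this lesson), we can add constants of different sizes to every cell, refit the homogeneous association model, and watch how the estimates respond:

```r
# Sensitivity analysis for the zero-margin table: add a small constant to
# every cell, refit (XY, XZ, YZ), and compare estimates across choices.
# Non-integer counts will draw warnings from glm(), which is expected here.
X <- factor(c(rep("x1", 4), rep("x2", 4)))
Y <- factor(rep(c("y1", "y1", "y2", "y2"), 2))
Z <- factor(rep(c("z1", "z2"), 4))
count <- c(0, 6, 5, 9, 0, 5, 16, 7)

for (delta in c(0.5, 0.1, 0.001)) {
  fit <- glm((count + delta) ~ X + Y + Z + X*Y + X*Z + Z*Y,
             family = poisson(link = log))
  cat("delta =", delta,
      " YZ estimate =", round(coef(fit)["Yy2:Zz2"], 3),
      " XZ estimate =", round(coef(fit)["Xx2:Zz2"], 3), "\n")
}
```

The YZ estimate is the one driven by the zero margin: its magnitude grows without bound as delta shrinks toward 0, while the other estimates stay comparatively stable.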
Effect of Sparseness on \(X^2\) and \(G^2\)
Unfortunately, there is no single rule that covers all situations, so we have to be careful when applying the following situation-specific guidelines.
While the \(G^2\) statistic for overall model fit may not be well approximated by the chi-squared distribution, the difference between the \(G^2\) statistics for two nested models is closer to chi-squared than the \(G^2\) comparing either model to the saturated one. Chi-squared comparison tests depend more on the sizes of the marginal totals than on the cell counts in the joint table, so if the margins have counts exceeding 5, the chi-squared approximation for \(G^{2}\left(\mbox{reduced model}\right) - G^{2}\left(\mbox{full model}\right)\) should be reasonable.
- When \(df > 1\), it is OK for \(\hat{\mu}\) to be as small as 1, as long as fewer than 20% of the cells have \(\hat{\mu} < 5\).
- The permissible size of \(\hat{\mu}\) decreases as the table size \(N\) increases.
- The chi-squared approximation for \(X^{2}\) and \(G^{2}\) can be poor for sparse tables with both very small and very large \(\hat{\mu}\)'s (relative to \(n/N\)).
- \(G^2\) is usually poorly approximated by the chi-squared distribution when \(n/N < 5\); the p-values for \(G^2\) may be too large or too small, depending on \(n/N\).
- For fixed \(n\) and \(N\), chi-squared approximations are better for smaller df than for larger df.
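To illustrate the nested-model comparison with the running example (our own base-R sketch), we can compare the independence model to the homogeneous association model by the drop in \(G^2\), keeping in mind that the zero YZ margin still makes the larger model's own fit degenerate:

```r
# Difference-in-G^2 test for nested loglinear models on the example table
X <- factor(c(rep("x1", 4), rep("x2", 4)))
Y <- factor(rep(c("y1", "y1", "y2", "y2"), 2))
Z <- factor(rep(c("z1", "z2"), 4))
count <- c(0, 6, 5, 9, 0, 5, 16, 7)

fit0 <- glm(count ~ X + Y + Z, family = poisson(link = log))      # (X, Y, Z)
fit1 <- glm(count ~ (X + Y + Z)^2, family = poisson(link = log))  # (XY, XZ, YZ)

dG2 <- deviance(fit0) - deviance(fit1)        # G^2(reduced) - G^2(full)
ddf <- df.residual(fit0) - df.residual(fit1)  # nominal df for the comparison
p_value <- pchisq(dG2, df = ddf, lower.tail = FALSE)
c(dG2 = dG2, df = ddf, p = p_value)
```

Note that `ddf` here is the nominal value (3); by the adjustment formula above, the degenerate YZ term means the effective df for this comparison is smaller.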
12.5  Summary
In this lesson, we saw an alternative way to estimate parameters and their standard errors for a GLM that doesn't rely on full specification of the likelihood function. This "generalized estimating equation" approach needs only the marginal distribution of the responses, along with a correlation structure to describe pairwise associations between observations within the same subject.
Additionally, we saw different ways in which 0s can appear in categorical data, depending on whether an observation could have been observed but wasn't or whether an observation could not have been observed at all due to its nature.