12: Advanced Topics II

Overview

Having introduced dependent samples, where two responses are matched by subject or "cluster" (e.g., two ratings provided for each movie in a sample), and ways to measure agreement between them (e.g., marginal homogeneity or symmetry) for two-way tables, we now extend these ideas to three or more responses clustered together. This gives us the categorical version of repeated measures, along with the additional challenge of choosing an appropriate structure for modeling the association among the responses.

The GEE approach is an extension of GLMs that provides a semi-parametric approach to repeated measures of a categorical response; it can also be used for continuous measurements.


12.1 - Introduction to Generalized Estimating Equations

The idea behind GEEs is to produce reasonable estimates of model parameters, along with standard errors, without specifying a likelihood function in its entirety, which can be quite difficult with a multivariate categorical response. We have to account for the correlation among the multiple responses that arise from a single subject, but we can largely estimate these correlations from the data without having to accurately specify the form of the correlation structure.

In the General Social Survey (GSS) of 2018, respondents answered questions regarding their confidence in various institutions in the U.S., including education, medicine, and the scientific community. Among these three institutions, are there differences in the proportions of people having "a great deal" of confidence? Responses are clustered by subject because each subject provided answers to all three questions.

A GLM for Clustered Responses

The response vector for subject \(i\) is \(Y_i=(Y_{i1},\ldots,Y_{in_i})\), where \(Y_{ij}\) is the categorical response for subject \(i\) (\(1,\dots,n\)) and measurement \(j\) (\(1,\ldots,n_i\)). With expected value \(E(Y_{i})=\mu_i'=(\mu_{i1},\ldots,\mu_{in_i})\) and covariate/predictor vector \(x_{ij}\), we have the general model expression:

\(g(\mu_{ij})=x_{ij}'\beta\)

In our example above, \(Y_{ij}\) is binomial with mean \(\mu_{ij}=\pi_{ij}\), and the logit link would be used for \(g\). If the institution indicators, say \(Med_{ij}=1\) for medicine and \(Sci_{ij}=1\) for scientific community (each would be 0 otherwise), are added, along with a covariate for age, we have \(x_{ij}'=(1,Med_{ij},Sci_{ij},Age_i)\) and

\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2+Age_i\beta_3\)

Note that the age of the subject does not vary within subject and so does not require a \(j\) subscript. Although subjects are assumed to be independently collected, the multiple responses within a subject are allowed to have a correlation structure. Here are some potential structures we may consider. In each matrix below (in general, these would be \(n_i\times n_i\)), the value in the \(j^{th}\) row and \(k^{th}\) column represents the correlation between the \(j^{th}\) and \(k^{th}\) measurement for the same subject.

Independence - (no correlation within subjects)

\( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)

Exchangeable or Compound Symmetry

\( \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}\)

Auto-Regressive Order 1 - (correlation depends on spatial "distance" between measurements)

\( \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix}\)

Unstructured

\( \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{bmatrix}\)
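For concreteness, these matrices are easy to construct numerically. Here is a quick R sketch with an illustrative value \(\rho=0.3\) (not estimated from any data):

# working correlation structures for n_i = 3, with an illustrative rho
rho <- 0.3
indep <- diag(3)                             # independence
exch  <- matrix(rho, 3, 3); diag(exch) <- 1  # exchangeable
ar1   <- rho^abs(outer(1:3, 1:3, "-"))       # AR(1): rho^|j-k|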

Maximum-likelihood estimation of these parameters would be identical to that for the familiar logistic model, except that the correlation structure would be incorporated into the likelihood for responses from the same subject. The GEE alternative avoids this.

Interpretation of Parameter Estimates:

The interpretation will depend on the chosen link function. For example, if it is the logit, \(\exp\left(\beta_0\right)\) is the odds that the characteristic is present in observation \(i\) when \(x_i = 0\), but if it is the identity, \(\beta_0\) is the value of the response when \(x_i = 0\).

Model Fit:

  • We don't test for the model fit of the GEE, because this is really an estimating procedure; there is no likelihood function.
  • We look at the empirical estimates of the standard errors and the covariance.
  • We compare the empirical estimates with the model-based estimates.

Advantages:

  • Computationally simpler than maximum-likelihood estimation for categorical data
  • Does not require specification of a multivariate distribution
  • Consistent estimation even with mis-specified correlation structure

Limitations:

  • There is no likelihood function since the GEE does not completely specify the joint distribution; thus some do not consider it a model but just a method of estimation.
  • Likelihood-based methods are NOT available for testing fit, comparing models, and conducting inferences about parameters.
  • Empirically based standard errors underestimate the true ones unless the sample size is very large.
  • Empirical estimators are more variable than parametric ones.

First we examine the method of "independence estimating equations," which incorrectly assumes that the observations within a subject are independent.

GEE Approach to Estimation

Starting with \(E(y_{i})=\mu_{i}\), the vector of means for subject \(i\) connected with the predictors via \(g(\mu_i)=x_i'\beta\), we let \(\Delta_i\) be the diagonal matrix of variances

\(\Delta_i=\text{Diag}[\text{Var}(y_{ij})]= \begin{bmatrix}
\text{Var}(y_{i1}) & 0 & \cdots & 0 \\
0 & \text{Var}(y_{i2}) & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \text{Var}(y_{in_i})
\end{bmatrix}\).

In terms of the correlation structure, this would be:

\( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)

Unless the observations within a subject are independent,

\(\Delta_i \neq \text{Cov}(y_i)\)

But if \(\Delta_i\) were correct, we could stack up the \(y_i\)'s and estimate \(\beta\) using techniques for generalized linear models.

How bad is it to pretend that \(\Delta_i\) is correct? Let \(\hat{\beta}\) be the estimate that assumes observations within a subject are independent (e.g., as found in ordinary linear regression, logistic regression, etc.)

  • If \(\Delta_i\) is correct, then \(\hat{\beta}\) is asymptotically unbiased and efficient. Recall that unbiased means \(E(\hat{\beta})=\beta\), and efficient means it has the smallest variance among all possible estimators.
  • If \(\Delta_i\) is not correct, then \(\hat{\beta}\) is still asymptotically unbiased but no longer efficient, and the 'naive' standard error for \(\hat{\beta}\), obtained from the naive estimate of \(\text{Cov}(\hat{\beta})\),

    \(\hat{\sigma}^2 \left[X^T \hat{W} X \right]^{-1}\)

    may be very misleading (here, \(X\) is the matrix of stacked \(X_i\)'s and \(\hat{W}\) is the diagonal matrix of final weights, if any).
  • Consistent standard errors for \(\hat{\beta}\) are still possible using the sandwich estimator (sometimes called the 'robust' or 'empirical' estimator).

Sandwich Estimator

The sandwich estimator was first proposed by Huber (1967) and White (1980); Liang and Zeger (1986) applied it to longitudinal data:

\(\left[X^T \hat{W} X \right]^{-1} \left[\sum\limits_i X_i^T (y_i-\hat{\mu}_i)(y_i-\hat{\mu}_i)^T X_i \right]\left[X^T \hat{W} X \right]^{-1}\)

  • provides a good estimate of \(\text{Cov}(\hat{\beta})\) in large samples (several hundred subjects or more) regardless of the true form of \(\text{Cov}\left(y_i\right)\)
  • in smaller samples it could be rather noisy, so 95% intervals obtained by ± 2 SE's could suffer from under-coverage

When within-subject correlations are not strong, Zeger (1988) suggests that the use of IEE with the sandwich estimator is highly efficient.
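To make the sandwich concrete, here is a rough R sketch (not part of the lesson's official code) that fits the ordinary logistic model and then computes cluster-robust standard errors by hand. It assumes a data frame gss with columns greatly, question, and id, as in the example of the next section.

# IEE fit via ordinary logistic regression, then the sandwich correction by hand
fit <- glm(greatly ~ question, data = gss, family = binomial)
X  <- model.matrix(fit)
mu <- fitted(fit)
W  <- mu * (1 - mu)                   # binomial variances (the GLM weights)
bread  <- solve(t(X) %*% (W * X))     # naive piece: [X' W X]^{-1}
r      <- gss$greatly - mu            # raw residuals y - mu-hat
scores <- rowsum(X * r, gss$id)       # per-cluster score sums X_i'(y_i - mu_i)
meat   <- t(scores) %*% scores        # middle of the sandwich
V.robust <- bread %*% meat %*% bread  # empirical Cov(beta-hat)
sqrt(diag(V.robust))                  # robust SEs; compare with the GEE output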


12.2 - Modeling Binary Clustered Responses

Example - Confidence in U.S. Institutions

For the three GSS questions about confidence in education, medicine, and the scientific community, the original responses were divided into three categories: "A great deal", "Hardly any", and "Only some".

                       A great deal   Hardly any   Only some
Education                       370          276         821
Medicine                        525          194         748
Scientific Community            665           95         707

For this particular analysis, we focus on the proportions of "A great deal" and combine the other two categories. Thus, we have binary responses repeatedly measured on each of 1467 subjects or clusters. With \(Y_{ij}\) representing the binomial response for the \(j^{th}\) question and the \(i^{th}\) subject, the following image illustrates the clusters, within which we expect some correlation.

 

[Figure: n = 1467 subject clusters, each contributing 3 responses \((Y_{i1}, Y_{i2}, Y_{i3})\)]

 

For reference, if we temporarily ignore the correlation among responses within subjects and instead treat all \(Y_{ij}\) as independent, we can calculate estimates of the odds of "A great deal" of confidence in education, medicine, and scientific community, respectively, as

\(\dfrac{370}{276+821}=0.337,\mbox{ }\dfrac{525}{194+748}=0.557,\mbox{ and }\dfrac{665}{95+707}=0.829\)

And from Lesson 3, we can calculate the standard error for differences in these (log) odds. For example, the standard error for comparing the log odds of "A great deal" for education versus medicine would be

\(\sqrt{\dfrac{1}{370}+\dfrac{1}{(276+821)}+\dfrac{1}{525}+\dfrac{1}{(194+748)}}=0.0811\)
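This hand calculation, and the odds above, are easy to verify in R:

# verify the odds and the SE of the log odds difference computed above
c(370, 525, 665) / c(276 + 821, 194 + 748, 95 + 707)  # 0.337, 0.557, 0.829
sqrt(1/370 + 1/(276 + 821) + 1/525 + 1/(194 + 748))   # 0.0811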

But we would expect this to be smaller if the correlation between these responses is positive and taken into account. This is what we now explore with the software analysis.

Fitting the GEE Model

The model for the clustered responses as a function of only the question type would look like this.

\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2\)

where \(\pi_{ij}\) is the probability that the \(i^{th}\) subject answers "A great deal" to the \(j^{th}\) question. The slope \(\beta_1\) is interpreted as the log odds ratio for "A great deal" of confidence in medicine, relative to education.

The naive approach would assume all the responses from all subjects are independent and estimate the parameters and standard errors with the usual logistic model, which can be done with SAS's LOGISTIC procedure.

data gss;
infile 'x:\gss.confidence.csv' delimiter=',' firstobs=2;
input id age sex $ question $ conf $ greatly;
run;
proc logistic data=gss; 
class question(ref='coneduc') / param=ref; 
model greatly(event='1') = question / scale=none; 
run; 

This coding will treat the question about education as the baseline. Compare the output below to the estimates and standard errors obtained from the calculations above.

 
Analysis of Maximum Likelihood Estimates

Parameter            DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept             1   -1.0868          0.0601         326.8155      <.0001
question  conmedic    1    0.5022          0.0811          38.3292      <.0001
question  consci      1    0.8995          0.0798         127.1214      <.0001

If instead, we estimate the parameters and standard errors according to the GEE approach, we use the GENMOD procedure and have several choices for the working correlation structure (within-subjects). Note that even with the working correlation structure specified as independent, as it is below with the TYPE=IND option, the sample correlations among the three responses within subjects are still being incorporated into the standard error estimates. 

* gee with independent corr structure;
proc genmod data=gss descending; 
class id question(ref='coneduc'); 
model greatly = question / dist = bin link=logit type3 wald; 
repeated subject=id / type=ind corrw; 
title 'gee with independent corr structure';
run;   

The SUBJECT=ID is where we specify the clusters within which we have repeated observations that may be correlated. Compared with the LOGISTIC model output above, the estimates are identical, but the standard errors are slightly smaller. The TYPE3 and WALD options can also be used to output an ANOVA table for an overall effect due to question type. Versus the reference chi-square distribution with two degrees of freedom, the test statistic of 149 is strong evidence that subjects' confidence differs among the three institutions, with the highest proportion of "A great deal" of confidence applying to the scientific community.

 
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter            Estimate  Standard Error  95% Confidence Limits       Z  Pr > |Z|
Intercept             -1.0868          0.0601  -1.2047  -0.9690      -18.08    <.0001
question  conmedic     0.5022          0.0697   0.3657   0.6388        7.21    <.0001
question  consci       0.8995          0.0737   0.7552   1.0439       12.21    <.0001
question  coneduc      0.0000          0.0000   0.0000   0.0000         .         .

Wald Statistics For Type 3 GEE Analysis

Source    DF  Chi-Square  Pr > ChiSq
question   2      149.21      <.0001

Other options for the working correlation structure can produce slightly different results, but in general, the results are not that sensitive to the specified working correlation structure because the empirical correlations among the responses from the data are largely dominant in the GEE calculations. With this in mind, a good compromise between model fit and parsimony is the exchangeable structure, which allows a single correlation parameter for all pairwise responses within a subject. The results below are for the exchangeable structure; these are particularly similar to those assuming the independent correlation structure because the estimated correlation in the data is so small.

* gee with exchangeable corr structure;
proc genmod data=gss descending; 
class id question(ref='coneduc'); 
model greatly = question / dist = bin link=logit type3 wald; 
repeated subject=id / type=exch corrw; 
title 'gee with exchangeable corr structure';
run;

 

 
Working Correlation Matrix

        Col1    Col2    Col3
Row1  1.0000  0.2348  0.2348
Row2  0.2348  1.0000  0.2348
Row3  0.2348  0.2348  1.0000

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter            Estimate  Standard Error  95% Confidence Limits       Z  Pr > |Z|
Intercept             -1.0868          0.0601  -1.2047  -0.9690      -18.08    <.0001
question  conmedic     0.5022          0.0697   0.3657   0.6388        7.21    <.0001
question  consci       0.8995          0.0737   0.7552   1.0439       12.21    <.0001
question  coneduc      0.0000          0.0000   0.0000   0.0000         .         .

The naive approach would assume all the responses from all subjects are independent and estimate the parameters and standard errors with the usual logistic model, which can be done with R's glm function.

library(geepack)
gss = read.csv(file.choose(), header=T) # "gss.confidence.csv"
# all observations independent
fit.glm = glm(greatly~question, data=gss, family=binomial(link='logit'))

By default, R will treat the question about education (first alphabetically among the question types) as the baseline. Compare the output below to the estimates and standard errors obtained from the calculations above.

> summary(fit.glm)

Call:
glm(formula = greatly ~ question, family = binomial(link = "logit"), 
    data = gss)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.099  -0.941  -0.762   1.258   1.660  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -1.0868     0.0601  -18.08   <2e-16 ***
questionconmedic   0.5022     0.0811    6.19    6e-10 ***
questionconsci     0.8995     0.0798   11.27   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5722.8  on 4400  degrees of freedom
Residual deviance: 5591.4  on 4398  degrees of freedom
AIC: 5597

If instead, we estimate the parameters and standard errors according to the GEE approach, we use the geeglm function and have several choices for the working correlation structure (within-subjects). Note that even with the working correlation structure specified as independent, as it is below with the corstr option, the sample correlations among the three responses within subjects are still being incorporated into the standard error estimates. 

# gee with independent corr structure 
fit.ind = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="independence", scale.fix=T)

The id= option is where we specify the clusters within which we have repeated observations that may be correlated, and the scale.fix=T option is to avoid R's default approach of introducing an overdispersion scale parameter. Compared with the glm function output above, the estimates are identical, but the standard errors are slightly smaller. The anova function can also be used to output an ANOVA table for an overall effect due to question type. Versus the reference chi-square distribution with two degrees of freedom, the test statistic of 149 is strong evidence that subjects' confidence differs among the three institutions, with the highest proportion of "A great deal" of confidence applying to the scientific community.

> summary(fit.ind)

Call:
geeglm(formula = greatly ~ question, family = binomial, data = gss, 
    id = id, corstr = "independence", scale.fix = TRUE)

 Coefficients:
                 Estimate Std.err Wald Pr(>|W|)    
(Intercept)       -1.0868  0.0601  327  < 2e-16 ***
questionconmedic   0.5022  0.0697   52  5.6e-13 ***
questionconsci     0.8995  0.0737  149  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = independence 
Scale is fixed.

Number of clusters:   1467  Maximum cluster size: 3 

> anova(fit.ind)
Analysis of 'Wald statistic' Table
Model: binomial, link: logit
Response: greatly
Terms added sequentially (first to last)

         Df  X2 P(>|Chi|)    
question  2 149    <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Other options for the working correlation structure can produce slightly different results, but in general, the results are not that sensitive to the specified working correlation structure because the empirical correlations among the responses from the data are largely dominant in the GEE calculations. With this in mind, a good compromise between model fit and parsimony is the exchangeable structure, which allows a single correlation parameter for all pairwise responses within a subject. The results below are for the exchangeable structure; these are particularly similar to those assuming the independent correlation structure in this example because the estimated correlation in the data is so small.

# gee with exchangeable corr structure
fit.exch = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="exchangeable", scale.fix=T)
# gee with unstructured corr structure
fit.un = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="unstructured", scale.fix=T)
> summary(fit.exch)

Call:
geeglm(formula = greatly ~ question, family = binomial, data = gss, 
    id = id, corstr = "exchangeable", scale.fix = TRUE)

 Coefficients:
                 Estimate Std.err Wald Pr(>|W|)    
(Intercept)       -1.0868  0.0601  327  < 2e-16 ***
questionconmedic   0.5022  0.0697   52  5.6e-13 ***
questionconsci     0.8995  0.0737  149  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable 
Scale is fixed.

  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.235  0.0173
Number of clusters:   1467  Maximum cluster size: 3  

 

Adding Additional Predictors

With age (centered around its mean) also potentially involved, the model can be extended to

\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2+C.age_i\beta_3\)

In the case of age, or any predictor that is measured at the subject level, an index for question (institution type) is not needed.

Adding additional predictors, including interactions, does not change the GEE approach to estimation, but the parameter interpretations are adjusted for other effects. For example, \(\beta_0\) now represents the log-odds of having "A great deal" of confidence in the (baseline) education system, for a subject at the mean age of this survey data, which happens to be 48.1. The coefficient \(\beta_3\) is the change in log-odds of having "A great deal" of confidence in the education system for each 1-year increase in the subject's age. Moreover, this effect is common for the other institutions as well unless interaction is taken into account.

* gee with age as a covariate;
data gss;
set gss;
c_age = age-48.1;
run;
proc genmod data=gss descending; 
class id question(ref='coneduc'); 
model greatly = question c_age / dist = bin link=logit type3 wald; 
repeated subject=id / type=exch corrw; 
title 'gee with age as a covariate';
run;  

We see that, in addition to the institution in question, a person's confidence is also related to age. For each additional year in age, the estimated odds of having "A great deal" of confidence in the education system (or either of the others) is multiplied by

\( e^{-0.0050}=0.995\)

So, the proportions of people having "A great deal" of confidence in these institutions tend to be smaller for older ages, but this effect is very small in practical terms. The statistical significance is likely due to the large sample size.


                    Analysis Of GEE Parameter Estimates
                  Empirical Standard Error Estimates

                            Standard   95% Confidence
Parameter          Estimate    Error       Limits            Z Pr > |Z|

Intercept           -1.0889   0.0603  -1.2070  -0.9708  -18.07   <.0001
question  conmedic   0.5031   0.0698   0.3663   0.6399    7.21   <.0001
question  consci     0.9011   0.0738   0.7564   1.0457   12.21   <.0001
question  coneduc    0.0000   0.0000   0.0000   0.0000     .      .
c_age               -0.0050   0.0023  -0.0095  -0.0006   -2.22   0.0266
  
# gee with age covariate
gss$age = gss$age - mean(gss$age)
fit.age = geeglm(greatly~question+age, data=gss, id=id, family=binomial, corstr="exchangeable", scale.fix=T)

We see that, in addition to the institution in question, a person's confidence is also related to age. For each additional year in age, the estimated odds of having "A great deal" of confidence in the education system (or either of the others) is multiplied by

\( e^{-0.00504}=0.995\)

So, the proportions of people having "A great deal" of confidence in these institutions tend to be smaller for older ages, but this effect is very small in practical terms. The statistical significance is likely due to the large sample size.

> summary(fit.age)

Call:
geeglm(formula = greatly ~ question + age, family = binomial, 
    data = gss, id = id, corstr = "exchangeable", scale.fix = T)

 Coefficients:
                 Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)      -1.08903  0.06026 326.65  < 2e-16 ***
questionconmedic  0.50306  0.06979  51.95  5.7e-13 ***
questionconsci    0.90109  0.07381 149.06  < 2e-16 ***
age              -0.00504  0.00228   4.91    0.027 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable 
Scale is fixed.

  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.233  0.0174
Number of clusters:   1467  Maximum cluster size: 3 

Changing Covariance Structure

With PROC GENMOD, we can also try alternative assumptions about the within-subject correlation structure.

The type=exch or type=cs option specifies an "exchangeable" or "compound symmetry" assumption, in which the observations within a subject are assumed to be equally correlated:

\(Corr(y_i)=\begin{bmatrix}
1 & \rho & \rho & \cdots & \rho\\
\rho & 1 & \rho & \cdots & \rho\\
\rho & \rho & 1 & \cdots & \rho\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\rho & \rho & \rho & \cdots & 1
\end{bmatrix}\)

This is the same kind of structure that arises in a normal linear model with random intercepts. It acknowledges that correlations exist within a subject, but it assumes that the correlations are constant over time.

An autoregressive AR(1) structure, specified by type=ar(1), allows the correlations to diminish over time as

\(Corr(y_{ij},y_{ij'})=\rho ^{|t_{ij}-t_{ij'}|}\),

where \(t_{ij}\) and \(t_{ij'}\) are the observation times for \(y_{ij}\) and \(y_{ij'}\).

The exchangeable and the autoregressive structures both express the within-subject correlations in terms of a single parameter \(\rho\). If the subjects are measured at a relatively small common set of occasions, we may be able to estimate an arbitrary correlation matrix. With \(n_i = 4\) measurement times per subject, the unstructured matrix would have six correlations to estimate. An unstructured matrix is obtained by the option type=un.
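The analogous choices in R's geepack are made through the corstr argument ("exchangeable", "ar1", "unstructured"). As a sketch, an AR(1) fit for the running example (assuming the gss data from earlier) would look like the following; with only three questions per subject and no natural time ordering, this serves mainly as illustration.

# gee with AR(1) working correlation structure
fit.ar1 <- geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="ar1", scale.fix=T)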

 Which correlation structure is best for any given problem?

In principle, it makes sense to think that the one that is most nearly "correct" would be best. For example, if the autoregressive or exchangeable assumptions are wrong, one might think that the unstructured form would provide the most efficient answer. But that is not necessarily true, because moving to an unstructured matrix introduces many more unknown parameters which could destabilize the model fit. In certain cases, a wrong structure with a small number of parameters could perform better than a correct structure with many parameters. Unfortunately, the literature does not provide much guidance on how to choose a good working correlation structure for GEE.


12.3 - Addendum: Estimating Equations and the Sandwich

Quasi-likelihood

Suppose that we observe responses \(Y_1, Y_2, \dots, Y_N\) which we want to relate to covariates. The ordinary linear regression model is

\(Y_i \sim N(x_i^T\beta,\sigma^2)\)

where \(x_i\) is a vector of covariates, \(\beta\) is a vector of coefficients to be estimated, and \(\sigma^2\) is the error variance. Now let's generalize this model in two ways:

Introduce a link function for the mean \(E ( Y_i ) = \mu_i\),

\(g(\mu_i)=x_i^T\beta\).

We could also write this as

\(E(Y_i)=\mu_i(\beta)\quad\) (2)

where the covariates and the link become part of the function \(\mu_i\). For example, in a loglinear model we would have \(\mu_i=\exp(x_i^T\beta)\).

Allow for heteroscedasticity, so that

\(\text{Var}(Y_i)=V_i\quad\) (3) 

In many cases \(V_i\) will depend on the mean \(\mu_i\) and therefore on \(\beta\), and possibly on additional unknown parameters (e.g., a scale factor). For example, in a traditional overdispersed loglinear model, we would have \(V_i=\sigma^2 \mu_i=\sigma^2 \exp(x_i^T\beta)\).

A maximum-likelihood estimate for \(\beta\) under this model could be computed by a Fisher scoring procedure. ML estimates have two nice theoretical properties: they are approximately unbiased and highly efficient. Interestingly, the asymptotic theory underlying these properties does not really depend on the normality of \(Y_i\), but only on the first two moments. That is, if the mean function (2) and the variance function (3) are correct, but the distribution of \(Y_i\) is not normal, the estimate of \(\beta\) obtained by maximizing the normal loglikelihood from

\(Y_i \sim N(\mu_i(\beta), V_i(\beta))\)

is still asymptotically unbiased and efficient.

Quasi-scoring

If we maximize the normality-based loglikelihood without assuming that the response is normally distributed, the resulting estimate of \(\beta\) is called a quasi-likelihood estimate. The iterative procedure for computing the quasi-likelihood estimate, called quasi-scoring, proceeds as follows.

First, let's collect the responses and their means into vectors of length \(N\),

\(Y=
\begin{bmatrix}
Y_1\\
Y_2\\
\vdots\\
Y_N
\end{bmatrix},
\mu=
\begin{bmatrix}
\mu_1\\
\mu_2\\
\vdots\\
\mu_N
\end{bmatrix} \)

Also, let \(V\) be the \(N \times N\) matrix with \(V_1 , \dots , V_N\) on the diagonal and zeros elsewhere. Finally, let

\(D=\dfrac{\partial \mu}{\partial \beta}\)

be the \(N \times p\) matrix containing the partial derivatives of \(\mu\) with respect to \(\beta\). That is, the \(\left(i, j\right)^{th}\) element of \(D\) is \(\dfrac{\partial\mu_i}{\partial\beta_j}\). In a simple linear model with an identity link, \(\mu_i=x_i^T\beta\), \(D\) is simply the \(N \times p\) matrix of covariates

\(X=
\begin{bmatrix}
x_1^T\\
x_2^T\\
\vdots\\
x_N^T
\end{bmatrix}\)

The iterative quasi-scoring procedure is

\(\beta_{\mbox{new}}-\beta_{\mbox{old}}= \left( D^T V^{-1} D\right)^{-1} D^T V^{-1}(Y-\mu)\quad\) (4)

where on the right-hand side of (4) the quantities \(D\), \(V\), and \(\mu\) are evaluated at \(\beta = \beta_{\mbox{old}}\).

It is easy to see that in the case of simple linear regression with homoscedastic response (\(\mu_i=x_i^T\beta\) and \(V_i=\sigma^2\) ), the quasi-scoring procedure converges to the OLS estimate

\(\hat{\beta}=\left(X^T X\right)^{-1}X^T Y\)

in a single step regardless of \(\beta_{\mbox{old}}\). In other cases (e.g. loglinear regression where \(\mu_i=\exp(x_i^T\beta)\), \(V_i=\sigma^2 \mu_i\) ) it produces the same estimate for \(\beta\) that we obtained earlier by fitting the generalized linear model.
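The scoring update (4) is simple enough to code directly. Below is a minimal R sketch for the loglinear case \(\mu_i=\exp(x_i^T\beta)\) with working variance \(V_i=\mu_i\); the function name and usage are illustrative, not part of the lesson's code.

# quasi-scoring iteration (4) for a loglinear model with V_i = mu_i
quasi_score_fit <- function(X, y, beta = rep(0, ncol(X)), tol = 1e-8) {
  repeat {
    mu <- as.vector(exp(X %*% beta))   # mu(beta)
    D  <- X * mu                       # D = d mu / d beta (N x p)
    U  <- t(D) %*% ((y - mu) / mu)     # quasi-score U = D' V^{-1} (y - mu)
    Info <- t(D) %*% (D / mu)          # D' V^{-1} D
    step <- solve(Info, U)             # the update in (4)
    beta <- beta + step
    if (max(abs(step)) < tol) return(drop(beta))
  }
}
# at convergence this should agree with glm(y ~ X - 1, family = poisson)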

Quasi-likelihood estimation is really the same thing as generalized linear modeling, except that we no longer have a full parametric model for \(Y_i\).

If we let \(U\) be the \(p \times 1\) vector

\(U=D^TV^{-1}(Y-\mu)\)

the final estimate from the quasi-scoring procedure satisfies the condition \(U = 0\). \(U\) can be regarded as the first derivative of the quasi-loglikelihood function. The first derivative of a loglikelihood is called a score, so \(U\) is called a quasi-score. \(U = 0\) is often called the set of estimating equations, and the final estimate for \(\beta\) is the solution to the estimating equations.

Estimating equations with a working variance function. We'll suppose that the mean regression function \(\mu_i \left(\beta\right)\) has been correctly specified but the variance function has not. That is, the data analyst incorrectly supposes that the variance function for \(Y_i\) is \(\tilde{V}_i\) rather than \(V_i\) , where \(\tilde{V}_i\) is another function of \(\beta\). The analyst then estimates \(\beta\) by solving the quasi-score equations

\(U=D^T \tilde{V}^{-1}(Y-\mu)=0\)

where \(\tilde{V}\) is the diagonal matrix of the \(\tilde{V}_i\)'s. Obviously it would be better to perform the scoring procedure using the true variance function \(V = \text{Diag}\left(V_1 , \dots , V_N\right)\) rather than \(\tilde{V}\). What are the properties of this procedure when \(\tilde{V}\) has been used instead of \(V\)?

  1. The estimate \(\hat{\beta}\) is a consistent and asymptotically unbiased estimate of \(\beta\), even if \(\tilde{V} \neq V\). It's also asymptotically normal.
  2. If \(\tilde{V} \neq V\) then \(\hat{\beta}\) is not efficient; that is, the asymptotic variance of \(\hat{\beta}\) is lowest when \(\tilde{V} = V\).
  3. If \(\tilde{V}= V\) then the final value of the matrix \(\left(D^T V^{-1} D\right)^{-1}\) from the scoring procedure (4) (i.e. the value of this matrix with \(\hat{\beta}\) substituted for \(\beta\)) is a consistent estimate of \(\text{Var}(\hat{\beta})\).
  4. If \(\tilde{V} \neq V\) then the final value of the matrix \(\left(D^{T} \tilde{V}^{-1} D\right)^{-1}\) from (4) is not a consistent estimate of \(\text{Var}(\hat{\beta})\).

Because of these properties, \(\hat{\beta}\) may still be a reasonable estimate of \(\beta\) if \(\tilde{V} \neq V\), but the final value of \((D^T \tilde{V}^{-1}D)^{-1}\) --- often called the "model-based" or "naive" estimator --- will not give accurate standard errors for the elements of \(\hat{\beta}\). However, this problem can be corrected by using the "robust" or "sandwich estimator," defined as

\(\left(D^T \tilde{V}^{-1}D\right)^{-1} \left(D^T \tilde{V}^{-1} E \tilde{V}^{-1} D\right) \left(D^T \tilde{V}^{-1}D\right)^{-1}\quad\), (5)

where,

\(E=\text{Diag}\left((Y_1-\mu_1)^2,\ldots, (Y_N-\mu_N)^2\right)\quad\) , (6)

and all quantities in (5) and (6) are evaluated with \(\hat{\beta}\) substituted for \(\beta\). The sandwich estimator is a consistent estimate of \(\text{Var}(\hat{\beta})\) even when \(\tilde{V} \neq V\).
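As a small numeric illustration, the following R sketch evaluates (5) and (6) for a Poisson loglinear fit with working variance \(\tilde{V}_i=\hat{\mu}_i\); the data are simulated purely for demonstration.

# the sandwich estimator (5)-(6) computed by hand for a Poisson fit
set.seed(1)
x <- runif(200)
y <- rpois(200, exp(0.5 + x))         # simulated data for illustration only
fit <- glm(y ~ x, family = poisson)
X  <- model.matrix(fit)
mu <- fitted(fit)
D  <- X * mu                          # d mu / d beta for the log link
DV <- D / mu                          # V-tilde^{-1} D, with V-tilde_i = mu_i
bread  <- solve(t(D) %*% DV)          # model-based piece (D' V^{-1} D)^{-1}
meat   <- t(DV) %*% ((y - mu)^2 * DV) # D' V^{-1} E V^{-1} D, with E from (6)
robust <- bread %*% meat %*% bread    # the sandwich (5)
sqrt(diag(robust))                    # robust SEs vs. sqrt(diag(bread))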

The sandwich estimator was first proposed by Huber (1967) and White (1980), but was popularized in the late 1980s when Liang and Zeger extended it to multivariate or longitudinal responses.


12.4 - Inference for Log-linear Models: Sparse Data

As we consider higher-dimensional contingency tables, we are more likely to encounter sparseness. A sparse table is one where there are many cells with small counts and/or zeros. How many is 'many' and how small is 'small' are relative to

  • the sample size \(n\)
  • the table size \(N\)

We can have a small sample size, or a large sample size with a large number of cells. A large number of cells can be due either to a large number of classification variables or to a small number of variables with many levels.

Types of zeros (or empty cells)

Usually, when we talk about sparseness, we are talking about zeros. This is a problem that is frequently encountered. There are two different types of zeros, and we need to differentiate between them.

Sampling Zeros occur when there is no observation in a cell, but there is a positive probability of observing one. That is, \(n_{ij} = 0\) but \(\pi_{ij} > 0\) for some \(i\) and \(j\); increasing the sample size will eventually lead to an observation there.

Structural Zeros are theoretically impossible to observe. That is, both \(n_{ij} = 0\) and \(\pi_{ij} = 0\). Tables with structural zeros are structurally incomplete; they are also known as incomplete tables. This is different from a partial classification, where an incomplete table results from not being able to completely cross-classify all individuals or units.

Example of a structurally incomplete table

Sample of adults on whether they got an infection once or twice:

 
                Reinfection
Infection      Yes       No
Yes             87       16
No               -      116
Total           87      132

The explanation here is that one cannot get an infection the second time until one gets it the first time. Thus, \(n_{21}\) is a structural zero.

Hypothetical Example

If there is a single sampling zero in a table, the MLE estimates are always positive. If there are non-zero margins but a bad pattern of zeros, some MLE estimates of the cell counts can be negative.

If there are zero margins, we cannot estimate the corresponding term in the model, and the partial odds ratios involving this cell will equal 0 or \(\infty\). For example, if \(n_{+11} = 0\), there is no MLE estimate for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly. We can force \(\lambda_{11}^{YZ} = 0\), but then we must adjust the degrees of freedom of the model, and no software will do this automatically.

A general formula for adjusting the degrees of freedom:

\(df = \left(T_e - Z_e\right) - \left(T_p - Z_p\right)\)

where \(T_e \) is the number of cells fitted, \(Z_e\) is the number of zero cells, \(T_p\) is the number of parameters in the model, and \(Z_p \) is the number of parameters that cannot be estimated because of zero margins.

Let's see how this works in the example below. The data, also available in Zeros.sas and Zeros.R, are from Fienberg (1980) and contain a zero marginal total if we sum over X:

Z=1:

  Y=1 Y=2
X=1 0 5
X=2 0 16

Z=2

  Y=1 Y=2
X=1 6 9
X=2 5 7

The three marginal tables are

XY:

  Y=1 Y=2
X=1 6 14
X=2 5 23

XZ:

  Z=1 Z=2
X=1 5 15
X=2 16 12

YZ:

  Z=1 Z=2
Y=1 0 11
Y=2 21 16

Here, \(n_{+11} = 0\), so there is no MLE estimate for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly. Let's take a look at these examples in R and SAS.

The following code fits the model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable conditioned on.

data zero_margin;
input X $ Y $ Z $ count @@;
datalines;
x1 y1 z1 0 x1 y1 z2 6 x1 y2 z1 5 x1 y2 z2 9
x2 y1 z1 0 x2 y1 z2 5 x2 y2 z1 16 x2 y2 z2 7
;
/* homogeneous associations */ 
proc genmod data=zero_margin order=data ;
class X Y Z;
model count = X Y Z  X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zero margin';
run;

Notice the large standard errors for some of the terms such as Y*Z, y1z1 below.

Homogeneous Association with zero margin

The GENMOD Procedure

                 Analysis Of Maximum Likelihood Parameter Estimates

                                 Standard       Wald 95%             Wald
Parameter          DF  Estimate     Error   Confidence Limits  Chi-Square  Pr > ChiSq
Intercept           1    1.9459    0.3780    1.2051    2.6867       26.51      <.0001
X          x1       1    0.2513    0.5040   -0.7364    1.2390        0.25      0.6180
X          x2       0    0.0000    0.0000    0.0000    0.0000         .         .
Y          y1       1   -0.3365    0.5855   -1.4841    0.8112        0.33      0.5655
Y          y2       0    0.0000    0.0000    0.0000    0.0000         .         .
Z          z1       1    0.8267    0.4532   -0.0615    1.7149        3.33      0.0681
Z          z2       0    0.0000    0.0000    0.0000    0.0000         .         .
X*Y        x1  y1   1   -0.0690    0.7878   -1.6131    1.4751        0.01      0.9302
X*Y        x1  y2   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Y        x2  y1   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Y        x2  y2   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Z        x1  z1   1   -1.4145    0.7187   -2.8230   -0.0059        3.87      0.0490
X*Z        x1  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Z        x2  z1   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Z        x2  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
Y*Z        y1  z1   1  -25.5664  92704.17   -181722  181671.3        0.00      0.9998
Y*Z        y1  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
Y*Z        y2  z1   0    0.0000    0.0000    0.0000    0.0000         .         .
Y*Z        y2  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
Scale               0    1.0000    0.0000    1.0000    1.0000

As expected, the overall deviance df is reported to be 1, since exactly one term is omitted from this model: the three-way interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get

             Criteria For Assessing Goodness Of Fit

Criterion                     DF           Value        Value/DF
Deviance                       1          0.0000          0.0000
Scaled Deviance                1          0.0000          0.0000
Pearson Chi-Square             1          0.0000          0.0000
Scaled Pearson X2              1          0.0000          0.0000

\(df = \left(T_e - Z_e\right) - \left(T_p - Z_p\right) = \left(8 - 2\right) - \left(7 - 1\right) = 6 - 6 = 0\)

As it is, the estimates and the model overall are not very useful. If, however, we remove the cells with zero counts and refit the model, we should get the correct degrees of freedom. Thus, we have

/*delete zero values*/
data zero_del;
set zero_margin;
if count=0 then delete;
run;
/* homogeneous associations */ 
proc genmod data=zero_del order=data ;
class X Y Z;
model count = X Y Z  X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zero margin';
run;
             Criteria For Assessing Goodness Of Fit

Criterion                     DF           Value        Value/DF
Deviance                       0          0.0000           .
Scaled Deviance                0          0.0000           .
Pearson Chi-Square             .          0.0000           .
Scaled Pearson X2              .          0.0000           .
Log Likelihood                           56.6027
Full Log Likelihood                     -11.5503
AIC (smaller is better)                  35.1007
AICC (smaller is better)                   .
BIC (smaller is better)                  33.8512

Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be

\((8 - 2) - (4 - 0) = 2\)

/* complete independence */
proc genmod data=zero_del order=data;
class X Y Z;
model count = X Y Z /link=log dist=poi;
title 'Complete independence with zeros deleted';
run;
             Criteria For Assessing Goodness Of Fit
Criterion                     DF           Value        Value/DF
Deviance                       2          5.0616          2.5308
Scaled Deviance                2          5.0616          2.5308
Pearson Chi-Square             2          4.9058          2.4529
Scaled Pearson X2              2          4.9058          2.4529
Log Likelihood                           54.0720
Full Log Likelihood                     -14.0811
AIC (smaller is better)                  36.1622
AICC (smaller is better)                 76.1622
BIC (smaller is better)                  35.3293

The following code fits the model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable conditioned on.

# Zero Margin
X=factor(c(rep("x1",4),rep("x2",4)))
Y=factor(c(rep(c("y1","y1","y2","y2"),2)))
Z=factor(c(rep(c("z1","z2"),4)))
count=c(0,6,5,9,0,5,16,7)
table=xtabs(count~X+Y+Z)
# Homogeneous Association (XY,XZ,YZ) with zero margin
model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log))

Notice the large standard errors for several of the estimates below.

> summary(model)

Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))

Deviance Residuals: 
         1           2           3           4           5           6  
-8.988e-06   0.000e+00   0.000e+00   0.000e+00  -1.664e-05   0.000e+00  
         7           8  
 5.960e-08  -1.490e-08  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -2.393e+01  4.535e+04  -0.001    1.000  
Xx2          1.232e+00  9.397e-01   1.311    0.190  
Yy2          2.554e+01  4.535e+04   0.001    1.000  
Zz2          2.572e+01  4.535e+04   0.001    1.000  
Xx2:Yy2     -6.899e-02  7.878e-01  -0.088    0.930  
Xx2:Zz2     -1.414e+00  7.187e-01  -1.968    0.049 *
Yy2:Zz2     -2.514e+01  4.535e+04  -0.001    1.000  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3.7197e+01  on 7  degrees of freedom
Residual deviance: 3.5779e-10  on 1  degrees of freedom
AIC: 37.101

As expected, the overall deviance df is reported to be 1, since exactly one term is omitted from this model: the three-way interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get

\(df=(T_e−Z_e)−(T_p−Z_p)=(8−2)−(7−1)=6−6=0\)

As it is, the estimates and the model overall are not very useful. If, however, we remove the cells with zero counts and refit the model, we should get the correct degrees of freedom. Thus, we have

# Delete Zero Values, "subset" used to remove 0s
# Homogeneous association (XY,XZ,YZ) with zero cells removed
model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log), subset=count>0)

And the output looks like:

> summary(model)

Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log), 
    subset = count > 0)

Deviance Residuals: 
[1]  0  0  0  0  0  0

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.20397    0.69121   1.742   0.0815 .
Xx2          1.23214    0.93975   1.311   0.1898  
Yy2          0.40547    0.52705   0.769   0.4417  
Zz2          0.58779    0.55777   1.054   0.2920  
Xx2:Yy2     -0.06899    0.78780  -0.088   0.9302  
Xx2:Zz2     -1.41447    0.71866  -1.968   0.0490 *
Yy2:Zz2           NA         NA      NA       NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9.5791e+00  on 5  degrees of freedom
Residual deviance: 5.3291e-15  on 0  degrees of freedom
AIC: 35.101

Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be

\((8 - 2) - (4 - 0) = 2\)

# Delete Zero Values, "subset" used to remove 0s
# Complete independence (X,Y,Z) with zero cells removed
model=glm(count~X+Y+Z,family=poisson(link=log), subset=count>0)

 

> summary(model)

Call:
glm(formula = count ~ X + Y + Z, family = poisson(link = log), 
    subset = count > 0)

Deviance Residuals: 
      2        3        4        6        7        8  
 0.6314  -1.3798   0.8575  -0.5820   1.0228  -0.7994  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.7944     0.4798   3.740 0.000184 ***
Xx2           0.3365     0.2928   1.149 0.250443    
Yy2           0.3747     0.3917   0.957 0.338746    
Zz2          -0.2719     0.3318  -0.819 0.412519    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9.5791  on 5  degrees of freedom
Residual deviance: 5.0616  on 2  degrees of freedom
AIC: 36.162

Effects of sampling zeros

There are two major effects on modeling contingency tables with sampling zeros:

  • Maximum likelihood estimates (MLEs) may not exist for loglinear/logit models
  • If MLE estimates exist, they may be biased

Recall that terms in a log-linear model correspond to the margins of the tables. Existence of MLE estimates depends on the positions of zeros in the table and what effects are included in a model.

If all \(n_{ij} > 0\), then the MLE estimates of the parameters are finite. If a table has a zero marginal frequency and there is a term in the model corresponding to that margin, the MLE estimates are infinite.

The bias problem due to sparseness

As mentioned above, the ML estimates of odds ratios can be severely biased, and the sampling distributions of fit statistics may be poorly approximated by the chi-squared distribution. (The extremely large standard errors in the output above are a symptom of estimates diverging to infinity.)

One suggested ad hoc solution is to add 0.5 to each cell in the table. Doing so shrinks estimated odds ratios that would otherwise be infinite to finite values and increases estimates that would otherwise be 0.
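As a quick sketch of this fix, using the R objects from the zero-margin example above, we can refit the homogeneous association model with 0.5 added to every cell (glm will warn about non-integer counts, which is expected here):

# ad hoc fix: add 0.5 to every cell and refit
model.half <- glm((count + 0.5) ~ X + Y + Z + X:Y + X:Z + Y:Z, family=poisson(link=log))
summary(model.half)  # estimates and standard errors are now finite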

Other suggested solutions:

  • When a model does not converge, try adding a tiny number to all zero cells in the table, such as 0.000000001. Some argue you should add this value to all of the cells. Essentially, adding very small values is what the software programs like SAS and R are doing when they fit these types of models.
  • Extend this by doing a sensitivity analysis by adding different numbers of varying sizes to the cells. Examine fit statistics and parameter estimates to see if they change very much.

Effect of Sparseness on \(X^2\) and \(G^2\)

Unfortunately, there is no single rule that covers all situations, and we have to be careful when applying the situation-specific guidelines below.

While the \(G^2\) statistic for overall model fit may not be well approximated by the chi-squared distribution, the difference between \(G^2\) values for two nested models will be closer to chi-squared than the \(G^2\) comparing either model to the saturated one. Chi-squared comparison tests depend more on the size of the marginals than on the cell sizes in the joint table, so if the margin cells exceed 5, the chi-squared approximation of \(G^{2}\left(\mbox{reduced model}\right) - G^{2}\left(\mbox{full model}\right)\) should be reasonable.
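For instance, using the zero-margin data from above and comparing two nested models that avoid the problematic YZ margin (a sketch, not part of the lesson's code):

# difference in G^2 between nested models, referred to chi-squared on 1 df
m.ind <- glm(count~X+Y+Z, family=poisson(link=log))
m.xz  <- glm(count~X+Y+Z+X:Z, family=poisson(link=log))
anova(m.ind, m.xz, test="Chisq")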

  • When \(df > 1\), it is OK to have \(\hat{\mu}\) as small as 1, as long as fewer than 20% of the cells have \(\hat{\mu} < 5\).
  • The permissible size of \(\hat{\mu}\) decreases as the size of the table \(N\) increases.
  • The chi-squared approximation for \(X^{2}\) and \(G^{2}\) can be poor for sparse tables with both very small and very large \(\hat{\mu}\)'s (relative to \(n/N\)).
  • \(G^2\) is usually poorly approximated by the chi-squared distribution when \(n/N < 5\). The p-values for \(G^2\) may be too large or too small (it depends on \(n/N\)).
  • For fixed \(n\) and \(N\), chi-squared approximations are better for smaller df than for larger df.

12.5 - Summary

In this lesson, we saw an alternative way to estimate parameters and their standard errors for a GLM that doesn't rely on full specification of the likelihood function. This "generalized estimating equation" approach needs only the marginal distribution of the responses, along with a correlation structure to describe pairwise associations between observations within the same subject.

Additionally, we saw different ways in which 0s can appear in categorical data, depending on whether an observation could have been observed but wasn't or whether an observation could not have been observed at all due to its nature.

