Key Concepts
In Lesson 4 [5], we introduced the idea of dependent samples, i.e., repeated measures on two variables or at two points in time, matched data, and square tables. We described ways to perform significance tests for models of marginal homogeneity, symmetry, and agreement. In Lessons 10 and 11, we learned how to answer the same questions (and more) via loglinear models.
In this lesson we will introduce models for repeated categorical response data, and thus generalize models for matched pairs.
So far we have learned to model count data with various generalized linear models under a key assumption of independence among the responses. The GEE approach is an extension of GLMs. It provides a semi-parametric approach to longitudinal analysis of categorical responses; it can also be used for continuous measurements.
We will focus only on the basic ideas of GEE; for more details, see the references cited at the beginning of the lecture. GEEs were first introduced by Liang and Zeger (1986); see also Diggle, Liang and Zeger (1994). The crux of GEE is that, instead of attempting to model the within-subject covariance structure, we treat it as a nuisance and simply model the mean response.
In this framework, the covariance structure doesn't need to be specified correctly for us to get reasonable estimates of regression coefficients and standard errors.
Objective:
Fit a model to repeated categorical responses, that is, correlated and clustered responses, by the GEE methodology.
Variables:
Model:
Its form is like that of a GLM, but full specification of the joint distribution is not required, and thus there is no likelihood function:
\(g(\mu_i)=x_i^T \beta\)
Random component:
Any distribution of the response that we can use for GLM, e.g., binomial, multinomial, normal, etc.
Systematic component:
A linear predictor of any combination of continuous and discrete variables.
Link function:
Can be any g(μ_{i}), e.g., identity, log, logit, etc.
Covariance structure:
Correlated data are modeled using the same link function and linear predictor setup (systematic component) as in the case of independent responses. The random component is described by the same variance functions as in the independence case, but the covariance structure of the correlated responses must also be specified and modeled now!
Assumptions:
Independence (observations at different time points are assumed uncorrelated)
\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
Exchangeable (or Compound Symmetry)
\[ \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}\]
Autoregressive Order 1 (AR-1)
\[ \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix}\]
Unstructured
\[ \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{bmatrix}\]
\(\rho_{jk} =\text{corr}(Y_{ij}, Y_{ik})\) for the \(i\)th subject at times \(j\) and \(k\).
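These structures are easy to generate programmatically. The following numpy sketch (illustrative only, not part of the lesson's SAS/R materials) builds the exchangeable and AR-1 working correlation matrices for three occasions:

```python
import numpy as np

def exchangeable(n, rho):
    """Exchangeable (compound symmetry): constant correlation rho off the diagonal."""
    R = np.full((n, n), rho)
    np.fill_diagonal(R, 1.0)
    return R

def ar1(n, rho):
    """AR(1): correlation rho**|j-k| decays as occasions j and k grow farther apart."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

R_exch = exchangeable(3, 0.5)
R_ar1 = ar1(3, 0.5)
# In R_exch every off-diagonal entry is 0.5; in R_ar1 the (1,3) entry
# has dropped to 0.5**2 = 0.25 because those occasions are two steps apart.
```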
Parameter Estimation:
The estimates are solutions of quasi-likelihood equations, which are called generalized estimating equations. A quasi-likelihood estimate of β arises from maximization of a normality-based log-likelihood [6] without assuming that the response is normally distributed. Recall that we briefly discussed quasi-likelihood when we introduced overdispersion [7] in Lesson 6.
In general, there are no closed-form solutions, so the GEE estimates are obtained by an iterative algorithm, namely an iterative quasi-scoring procedure.
GEE estimates of model parameters are valid even if the covariance is misspecified (because they depend only on the first moment, i.e., the mean). However, if the correlation structure is misspecified, the standard errors are not appropriate, and some adjustment based on the data (an empirical adjustment) is needed to obtain more appropriate standard errors. Agresti (2013) points out that a chosen model in practice is never exactly correct, but carefully choosing a working correlation (covariance) structure can improve the efficiency of the estimates.
For more technical details, see Agresti (2013), Sections 4.7 and 11.3, and Agresti (2007), Ch. 9. Also, the end of this lecture has a somewhat more technical addendum based on Dr. Schafer's notes.
Interpretation of Parameter Estimates:
The interpretation will depend on the chosen link function. For example, if it is the logit, exp(β_{0}) is the odds that the characteristic is present in an observation i when X_{i} = 0; if it is the identity, β_{0} is the value of the response when X_{i} = 0.
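For instance, with made-up coefficients from a logit-link fit (purely illustrative values, not from any model in this lesson):

```python
import numpy as np

# Hypothetical fitted coefficients from a GEE with a logit link
beta0, beta1 = -1.2, 0.8

odds_at_x0 = np.exp(beta0)   # odds that the characteristic is present when X_i = 0
odds_ratio = np.exp(beta1)   # multiplicative change in the odds per unit increase in X_i
prob_at_x0 = odds_at_x0 / (1 + odds_at_x0)  # back-transform the odds to a probability
```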
Inferences:
Wald-statistic-based confidence intervals [8] and hypothesis tests for parameters; recall that these rely on asymptotic normality of the estimators and their estimated covariance matrix.
Model Fit:
Advantages:
Limitations:
First, we examine the method of "independence estimating equations" (IEE), which incorrectly assumes that the observations within a subject are independent.
The data for a single subject i measured at occasions j = 1, 2, ... , n_{i}:
\(y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})^T\)
y_{ij} = discrete or continuous response
X_{i} = n_{i} × p matrix of covariates
Let us suppose that the mean response
\(E(y_{ij})=\mu_{ij}\)
is related to the covariates by a link function,
\(g(\mu_i)=X_i \beta\)
and let Δ_{i} be the diagonal matrix of variances
\(\Delta_i=\text{Diag}[\text{Var}(y_{ij})]= \begin{bmatrix}
\text{Var}(y_{i1}) & 0 & \cdots & 0 \\
0 & \text{Var}(y_{i2}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \text{Var}(y_{in_i})
\end{bmatrix}\).
In terms of the correlation structure, this would be:
\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
Unless the observations within a subject are independent,
\(\Delta_i \neq \text{Cov}(y_i)\)
But if Δ_{i} were correct, we could stack up the y_{i}'s and estimate β using techniques for generalized linear models.
How bad is it to pretend that Δ_{i} is correct? Let \(\hat{\beta}\) be the estimate that assumes observations within a subject are independent (e.g., as found in ordinary linear regression, logistic regression, etc.). The naive covariance estimate
\(\hat{\sigma}^2 \left[X^T \hat{W} X \right]^{-1}\)
may be very misleading (here, X is the matrix of stacked X_{i}'s and \(\hat{W}\) is the diagonal matrix of final weights, if any).
The sandwich estimator was first proposed by Huber (1967) and White (1980); Liang and Zeger (1986) applied it to longitudinal data:
\(\left[X^T \hat{W} X \right]^{-1} \left[\sum\limits_i X_i^T (y_i-\hat{\mu}_i)(y_i-\hat{\mu}_i)^T X_i \right]\left[X^T \hat{W} X \right]^{-1}\)
When within-subject correlations are not strong, Zeger (1988) suggests that the use of IEE with the sandwich estimator is highly efficient.
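A numpy sketch of this comparison on simulated clustered data (everything below is illustrative, not the schizophrenia example; IEE with an identity link and no weights reduces to OLS, so \(\hat{W}\) drops out):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_occ = 200, 4

# Simulate within-subject correlation via random intercepts
t = np.tile(np.arange(n_occ, dtype=float), n_subj)
X = np.column_stack([np.ones(n_subj * n_occ), t])          # stacked X_i's
b = np.repeat(rng.normal(0, 1.0, n_subj), n_occ)           # subject effects
y = 2.0 - 0.5 * t + b + rng.normal(0, 1.0, n_subj * n_occ)

# IEE for the identity link is just OLS
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Naive covariance: sigma^2 (X'X)^{-1}, pretending observations are independent
sigma2 = resid @ resid / (len(y) - X.shape[1])
naive_cov = sigma2 * XtX_inv

# Sandwich: (X'X)^{-1} [sum_i X_i'(y_i - mu_i)(y_i - mu_i)' X_i] (X'X)^{-1}
meat = np.zeros((2, 2))
for i in range(n_subj):
    rows = slice(i * n_occ, (i + 1) * n_occ)
    u = X[rows].T @ resid[rows]
    meat += np.outer(u, u)
sandwich_cov = XtX_inv @ meat @ XtX_inv

# Here the naive slope variance overstates the uncertainty: the pooled residual
# variance absorbs the between-subject spread, which does not actually affect
# the within-subject slope.
```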
Example: data analyzed by Hedeker and Gibbons (1997), from a randomized trial for schizophrenia where:
"Spaghetti plot" of response curves for all subjects
Responses for drug patients:
Responses for placebo patients:
Average for each group at each time point:
Same plot versus \( \sqrt{\text{week}}\) :
As shown by the second plot, the average trajectories for the placebo and drug groups appear to be approximately linear when plotted against the square root of week.
At baseline (week 0), the two groups have very similar averages. This makes sense: in a randomized trial, the groups are initially just a random division of the subjects, so there should be no "treatment" effect because the treatment hasn't yet started. If there were a difference at baseline, it would lead us to believe that the randomization was not carried out properly.
Let's fit a model for mean response with
This allows the two groups to have different intercepts and slopes. Because the intercepts are defined as the average responses at week 0, we expect that the main effect for group (i.e., the difference in intercepts) will be small.
How can we fit this model, taking into account the fact that the multiple observations for a subject are correlated?
Let's try GEE, starting with IEE. We will use a normal rather than a multinomial model just to demonstrate IEE. However, you should investigate the given SAS code, change the parameters, specify the multinomial distribution, and compare your results.
Data: The data from the schizophrenia trial (schiz.dat [9]).
SAS program (schiz.sas [10]) for fitting model by independence estimating equations: NO R CODE!
What's going on?
Together, these two statements specify an estimation procedure equivalent to ML under an ordinary linear regression model; in other words, the resulting estimates are simply OLS.
What is the advantage of using PROC GENMOD here?
The advantage is that the standard errors are computed using the sandwich; if the sample is sufficiently large, then the SE's are going to be reasonable even if the assumptions of independence and constant variance are wrong.
Let's look at some results.
From the "Model information" section, we see that GENMOD is fitting a normal model with an identity link. As we learned, however, the normality doesn't matter; the only part of the normal model being relied upon is the assumption of constant variance, Var(y_{ij} )= σ^{2} .
Notice also that out of 413 × 4 = 1652 patient-occasions, only 1500 of them contributed to the model fit; the other 152 had missing values. The exact same results would have been obtained if we had omitted the rows with missing responses from the data file. Now examine the model fit (not the GEE):
The unscaled Pearson and Deviance statistics assume that the scale parameter σ^{2} is equal to 1.00. In practice, we would never assume σ^{2} =1.00 for a normal model; we would always estimate it. The estimated scale parameter is equal to Pearson's X^{2} divided by the degrees of freedom,
\(X^2/df=2221.4/1496 \approx 1.485\).
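A quick numpy check of this calculation on simulated data (the design below is made up; only the degrees of freedom, n − p = 1496, echo the fit above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1500, 4  # n - p = 1496 matches the df in the fit above

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([5.4, 0.0, -0.4, -0.5]) + rng.normal(0, 1.2, n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# For a normal model, Pearson's X^2 is the residual sum of squares,
# and the estimated scale parameter is X^2 / df with df = n - p
pearson_x2 = resid @ resid
scale_hat = pearson_x2 / (n - p)
# scale_hat should be close to the true error variance 1.2**2 = 1.44
```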
Now the table of coefficients:
The heading "Empirical standard error estimates" means that the SE's in this table are based on the sandwich.
As we expected, the coefficient for drug, the estimated difference in intercepts, is very small.
The estimated difference in slopes, −0.5243, is highly significant, indicating that the responses are declining over time more quickly for the drug group than for the placebo group.
Because we specified the modelse option, SAS also prints out model-based standard errors. These assume that the working covariance assumption (constant variance and uncorrelated errors within subjects) is correct:
In this case, the model-based standard errors are somewhat larger than their sandwich counterparts.
With PROC GENMOD, we can also try alternative assumptions about the withinsubject correlation structure.
The type=exch or type=cs option specifies an "exchangeable" or "compound symmetry assumption," in which the observations within a subject are assumed to be equally correlated:
\(Corr(y_i)=\begin{bmatrix}
1 & \rho & \rho & \cdots & \rho\\
\rho & 1 & \rho & \cdots & \rho\\
\rho & \rho & 1 & \cdots & \rho\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\rho & \rho & \rho & \cdots & 1
\end{bmatrix}\)
This is the same kind of structure that arises in a normal linear model with random intercepts. It acknowledges that correlations exist within a subject, but it assumes that the correlations are constant over time.
An autoregressive (AR-1) structure, specified by type=ar, allows the correlations to diminish over time as
\(\text{Corr}(y_{ij},y_{ij'})=\rho^{|t_{ij}-t_{ij'}|}\),
where t_{ij} and t_{ij′} are the observation times for y_{ij} and y_{ij′} .
The exchangeable and the autoregressive structures both express the intrasubject correlations in terms of a single parameter ρ. If the subjects are measured at a relatively small common set of occasions, we may be able to estimate an arbitrary correlation matrix.
With n_{i} = 4 measurement times per subject, the unstructured matrix would have six correlations to estimate. An unstructured matrix is obtained by the option type=un.
Which correlation structure is best for any given problem?
In principle, it makes sense to think that the one that is most nearly "correct" would be best.
For example, if the autoregressive or exchangeable assumptions are wrong, one might think that the unstructured form would provide the most efficient answer. But that is not necessarily true, because moving to an unstructured matrix introduces many more unknown parameters which could destabilize the model fit.
In certain cases, a wrong structure with a small number of parameters could perform better than a correct structure with many parameters. Unfortunately, the literature does not provide much guidance on how to choose a good working correlation structure for GEE.
Here is a summary of the results from different working correlation structures applied to the data from the schizophrenia trial:
Parameter      | type=ind         | type=exch        | type=ar          | type=un
               | Est      SE      | Est      SE      | Est      SE      | Est      SE
---------------|------------------|------------------|------------------|-----------------
Intercept      | 5.3961   0.0882  | 5.3768   0.0862  | 5.3829   0.0808  | 5.3762   0.0912
drug           | 0.0098   0.1024  | 0.0184   0.1006  | 0.0158   0.0947  | 0.0485   0.1061
sqrtweek       | -0.4048  0.0657  | -0.3561  0.0614  | -0.3672  0.0604  | -0.3788  0.0838
drug*sqrtweek  | -0.5243  0.0763  | -0.5932  0.0725  | -0.6000  0.0718  | -0.6562  0.0959
Moving from an independence structure to exchangeable and AR-1, the estimated coefficient for the effect of interest (drug*sqrtweek) becomes larger in magnitude. The movement from −.52 to −.60 is a change of about one standard error, which is not trivial but not huge. The SEs under independence, exchangeable, and AR-1 are similar, but under the unstructured correlation they are larger, presumably because of the extra parameters being estimated.
If I had to pick a single answer here, I would probably report the results from exchangeable or AR-1. The independence structure is not plausible, and the unstructured one may be unnecessarily general.
Next, you should play with this problem and change options in GENMOD.
For example, specify DIST=multinomial and LINK=cumlogit for polytomous (cumulative logit) logistic regression.
For more examples of GEE with binomial and polytomous responses, see the references in Agresti (2013, 2007) and the SAS online examples.
To do the same analysis in R, we need to use either the gee [3] package or the geepack [4] package. The main difference between the two is that the latter contains an ANOVA method that allows for fit comparisons.
Suppose that we observe responses y_{1}, y_{2}, ... , y_{N} which we want to relate to covariates. The ordinary linear regression model is:
\(y_i \sim N(x_i^T\beta,\sigma^2)\)
where x_{i} is a vector of covariates, β is a vector of coefficients to be estimated, and σ^{2} is the error variance. Now let's generalize this model in two ways:
Introduce a link function for the mean E ( y_{i} ) = μ_{i},
\(g(\mu_i)=x_i^T\beta\).
We could also write this as:
\(E(y_i)=\mu_i(\beta)\) (2)
where the covariates and the link become part of the function μ_{i}. For example, in a loglinear model we would have \(\mu_i=\text{exp}(x_i^T\beta)\).
Allow for heteroscedasticity, so that
\(\text{Var}(y_i)=V_i\). (3)
In many cases V_{i} will depend on the mean μ_{i} and therefore on β, and possibly on additional unknown parameters (e.g., a scale factor). For example, in a traditional overdispersed loglinear model, we would have \(V_i=\sigma^2 \mu_i=\sigma^2 \text{exp}(x_i^T\beta)\).
A maximum-likelihood estimate for β under this model could be computed by a Fisher scoring procedure. ML estimates have two nice theoretical properties: they are approximately unbiased and highly efficient. Interestingly, the asymptotic theory underlying these properties does not really depend on the normality of y_{i}, but only on the first two moments. That is, if the mean function (2) and the variance function (3) are correct, but the distribution of y_{i} is not normal, the estimate of β obtained by maximizing the normal loglikelihood from
\(y_i \sim N(\mu_i(\beta), V_i(\beta))\)
is still asymptotically unbiased and efficient.
If we maximize the normality-based loglikelihood without assuming that the response is normally distributed, the resulting estimate of β is called a quasi-likelihood estimate. The iterative procedure for computing the quasi-likelihood estimate, called quasi-scoring, proceeds as follows.
First, let's collect the responses and their means into vectors of length N,
\(y=
\begin{bmatrix}
y_1\\
y_2\\
\vdots\\
y_N
\end{bmatrix},
\mu=
\begin{bmatrix}
\mu_1\\
\mu_2\\
\vdots\\
\mu_N
\end{bmatrix} \)
Also, let V be the N × N matrix with V_{1} , ... , V_{N} on the diagonal and zeros elsewhere. Finally, let
\(D=\dfrac{\partial \mu}{\partial \beta}\)
be the N × p matrix containing the partial derivatives of μ with respect to β. That is, the (i, j)th element of D is ∂μ_{i}/∂β_{j}. In a simple linear model with an identity link, \(\mu_i=x_i^T\beta\), D is simply the N × p matrix of covariates
\(X=
\begin{bmatrix}
x_1^T\\
x_2^T\\
\vdots\\
x_N^T
\end{bmatrix}\)
The iterative quasi-scoring procedure is
\(\beta_{new}-\beta_{old}=\left( D^T V^{-1} D\right)^{-1} D^T V^{-1}(y-\mu)\) (4)
where on the right-hand side of (4) the quantities D, V, and μ are evaluated at β = β_{old}.
It is easy to see that in the case of simple linear regression with homoscedastic response (\(\mu_i=x_i^T\beta\) and \(V_i=\sigma^2\)), the quasi-scoring procedure converges to the OLS estimate
\(\hat{\beta}=\left(X^T X\right)^{-1}X^T y\)
in a single step, regardless of β_{old}. In other cases (e.g., loglinear regression where \(\mu_i=\text{exp}(x_i^T\beta)\) and \(V_i=\sigma^2 \mu_i\)), it produces the same estimate for β that we obtained earlier by fitting the generalized linear model.
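A minimal numpy sketch of the quasi-scoring iteration (4) for the loglinear case, with working variance proportional to μ_i (the scale factor σ² cancels out of the update, so we can take V = Diag(μ)):

```python
import numpy as np

def quasi_score_loglinear(X, y, n_iter=25):
    """Iterative quasi-scoring (4) for a loglinear mean mu_i = exp(x_i' beta)
    with working variance V_i proportional to mu_i (scale factor cancels)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        D = mu[:, None] * X                 # d mu / d beta for the log link
        V_inv = np.diag(1.0 / mu)
        U = D.T @ V_inv @ (y - mu)          # the quasi-score
        beta = beta + np.linalg.solve(D.T @ V_inv @ D, U)
    return beta, U

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(0.5 + 0.3 * X[:, 1])).astype(float)

beta_hat, U = quasi_score_loglinear(X, y)
# At convergence the estimating equations U = 0 are satisfied, and beta_hat
# matches the Poisson GLM (ML) estimate, as the text notes.
```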
Quasi-likelihood estimation is really the same thing as generalized linear modeling, except that we no longer have a full parametric model for y_{i}.
If we let U be the p × 1 vector
\(U=D^TV^{-1}(y-\mu)\),
the final estimate from the quasi-scoring procedure satisfies the condition U = 0. U can be regarded as the first derivative of the quasi-loglikelihood function. The first derivative of a loglikelihood is called a score, so U is called a quasi-score. U = 0 is often called the set of estimating equations, and the final estimate for β is the solution to the estimating equations.
Estimating equations with a working variance function. We'll suppose that the mean regression function μ_{i}(β) has been correctly specified but the variance function has not. That is, the data analyst incorrectly supposes that the variance function for y_{i} is \(\tilde{V}_i\) rather than V_{i}, where \(\tilde{V}_i\) is another function of β. The analyst then estimates β by solving the quasi-score equations
\(U=D^T \tilde{V}^{-1}(y-\mu)=0\),
where \(\tilde{V}\) is the diagonal matrix of the \(\tilde{V}_i\)'s. Obviously it would be better to perform the scoring procedure using the true variance function V = Diag(V_{1}, ... , V_{N}) rather than \(\tilde{V}\). What are the properties of this procedure when \(\tilde{V}\) has been used instead of V?
Because of these properties, \(\hat{\beta}\) may still be a reasonable estimate of β if \(\tilde{V} \neq V\), but the final value of \((D^T \tilde{V}^{-1}D)^{-1}\), often called the "model-based" or "naive" estimator, will not give accurate standard errors for the elements of \(\hat{\beta}\). However, this problem can be corrected by using the "robust" or "sandwich" estimator, defined as
\(\left(D^T \tilde{V}^{-1}D\right)^{-1} \left(D^T \tilde{V}^{-1} E \tilde{V}^{-1} D\right) \left(D^T \tilde{V}^{-1}D\right)^{-1}\), (5)
where,
\(E=\text{Diag}\left((y_1-\mu_1)^2,\ldots, (y_N-\mu_N)^2\right)\), (6)
and all quantities in (5) and (6) are evaluated with \(\hat{\beta}\) substituted for β. The sandwich estimator is a consistent estimate of \(\text{Var}(\hat{\beta})\) even when \(\tilde{V} \neq V\).
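Here is a numpy sketch of (5) and (6): data are generated with true variance V_i = μ_i, but β is estimated with a deliberately wrong working variance \(\tilde{V}_i = 1\) (constant), and the sandwich is then assembled around the model-based "bread". The setup is illustrative, not taken from the lesson's data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(0.3 + 0.4 * X[:, 1])
y = rng.poisson(mu_true).astype(float)   # true variance is V_i = mu_i

# Solve the quasi-score equations with the wrong working variance
# V~_i = 1; for the log link this is Gauss-Newton nonlinear least squares
beta = np.array([np.log(y.mean()), 0.0])
for _ in range(50):
    mu = np.exp(X @ beta)
    D = mu[:, None] * X                  # d mu / d beta
    beta = beta + np.linalg.solve(D.T @ D, D.T @ (y - mu))

mu = np.exp(X @ beta)
D = mu[:, None] * X
bread = np.linalg.inv(D.T @ D)           # model-based: (D' V~^{-1} D)^{-1} with V~ = I
E = (y - mu) ** 2                        # diagonal of E in (6)
meat = D.T @ (E[:, None] * D)            # D' V~^{-1} E V~^{-1} D with V~ = I
sandwich = bread @ meat @ bread          # formula (5)

# beta remains consistent despite the wrong working variance;
# the sandwich, not the bread alone, gives valid standard errors.
```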
The sandwich estimator was first proposed by Huber (1967) and White (1980), but was popularized in the late 1980s when Liang and Zeger extended it to multivariate or longitudinal responses.
Links:
[1] https://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/genmod_sect39.htm
[2] https://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/genmod_sect7.htm
[3] https://cran.rproject.org/web/packages/gee/
[4] https://cran.rproject.org/web/packages/geepack/
[5] https://onlinecourses.science.psu.edu/stat504/node/215
[6] https://onlinecourses.science.psu.edu/stat504/node/216
[7] https://onlinecourses.science.psu.edu/stat504/node/151
[8] https://onlinecourses.science.psu.edu/stat504/node/58
[9] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson09/schiz.dat
[10] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson09/schiz.sas