12.2 - Modeling Binary Clustered Responses
Example - Confidence in U.S. Institutions
For the three GSS questions about confidence in education, medicine, and the scientific community, the original responses were divided into three categories: "A great deal", "Hardly any", and "Only some".
| | A great deal | Hardly any | Only some |
|---|---|---|---|
| Education | 370 | 276 | 821 |
| Medicine | 525 | 194 | 748 |
| Scientific Community | 665 | 95 | 707 |
For this particular analysis, we focus on the proportions of "A great deal" and combine the other two categories. Thus, we have binary responses repeatedly measured on each of 1467 subjects or clusters. With \(Y_{ij}\) representing the binary response of the \(i^{th}\) subject to the \(j^{th}\) question, each subject forms a cluster of three responses, within which we expect some correlation.
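The data are assumed to be organized in long format, with one row per subject-question pair, so that each subject contributes a cluster of three rows. The following R sketch is illustrative only; the values are hypothetical and only the layout matters (the variable names match those used in the analyses below).
# illustrative only: hypothetical rows showing the long-format layout;
# greatly = 1 means the subject answered "A great deal"
data.frame(id       = c(1, 1, 1, 2, 2, 2),
           question = rep(c("coneduc", "conmedic", "consci"), 2),
           greatly  = c(1, 0, 1, 0, 0, 1))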
For reference, if we temporarily ignore the correlation among responses within subjects and instead treat all \(Y_{ij}\) as independent, we can calculate estimates of the odds of "A great deal" of confidence in education, medicine, and scientific community, respectively, as
\(\dfrac{370}{276+821}=0.337,\mbox{ }\dfrac{525}{194+748}=0.557,\mbox{ and }\dfrac{665}{95+707}=0.829\)
And from Lesson 3, we can calculate the standard error for differences in these (log) odds. For example, the standard error for comparing the log odds of "A great deal" for education versus medicine would be
\(\sqrt{\dfrac{1}{370}+\dfrac{1}{(276+821)}+\dfrac{1}{525}+\dfrac{1}{(194+748)}}=0.0811\)
But we would expect this to be smaller if the correlation between these responses is positive and taken into account. This is what we now explore with the software analysis.
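As a quick check, the naive odds and standard error above can be reproduced in R directly from the table counts (a minimal sketch using base R only):
# naive calculations, ignoring the within-subject correlation
great <- c(educ = 370, medic = 525, sci = 665)                     # "A great deal"
other <- c(educ = 276 + 821, medic = 194 + 748, sci = 95 + 707)    # other two categories combined
great / other                                                      # odds: 0.337, 0.557, 0.829
# naive standard error for the log odds ratio, education versus medicine
sqrt(1/great["educ"] + 1/other["educ"] + 1/great["medic"] + 1/other["medic"])  # about 0.0811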
Fitting the GEE Model
The model for the clustered responses as a function of only the question type would look like this.
\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2\)
where \(\pi_{ij}\) is the probability that the \(i^{th}\) subject answers "A great deal" to the \(j^{th}\) question, and \(Med_{ij}\) and \(Sci_{ij}\) are indicators for the medicine and scientific community questions. The slope \(\beta_1\) is interpreted as the log odds ratio for "A great deal" of confidence in medicine, relative to education, and \(\beta_2\) is the corresponding log odds ratio for the scientific community, relative to education.
The naive approach would assume all the responses from all subjects are independent and estimate the parameters and standard errors with the usual logistic model, which can be done with SAS's LOGISTIC procedure.
* read the long-format GSS data;
data gss;
infile 'x:\gss.confidence.csv' delimiter=',' firstobs=2;
input id age sex $ question $ conf $ greatly;
run;
* naive logistic fit, treating all responses as independent;
proc logistic data=gss;
class question(ref='coneduc') / param=ref;
model greatly(event='1') = question / scale=none;
run;
This coding will treat the question about education as the baseline. Compare the output below to the estimates and standard errors obtained from the calculations above.
Analysis of Maximum Likelihood Estimates

| Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept | | 1 | -1.0868 | 0.0601 | 326.8155 | <.0001 |
| question | conmedic | 1 | 0.5022 | 0.0811 | 38.3292 | <.0001 |
| question | consci | 1 | 0.8995 | 0.0798 | 127.1214 | <.0001 |
If instead, we estimate the parameters and standard errors according to the GEE approach, we use the GENMOD procedure and have several choices for the working correlation structure (within-subjects). Note that even with the working correlation structure specified as independent, as it is below with the TYPE=IND option, the sample correlations among the three responses within subjects are still being incorporated into the standard error estimates.
* gee with independent corr structure;
proc genmod data=gss descending;
class id question(ref='coneduc');
model greatly = question / dist = bin link=logit type3 wald;
repeated subject=id / type=ind corrw;
title 'gee with independent corr structure';
run;
The SUBJECT=ID option is where we specify the clusters within which we have repeated observations that may be correlated. Compared with the LOGISTIC output above, the estimates are identical, but the standard errors are slightly smaller. The TYPE3 and WALD options can also be used to output an ANOVA table for an overall effect due to question type. Compared against the reference chi-square distribution with two degrees of freedom, the test statistic of 149.21 is strong evidence that subjects' confidence differs among the three institutions, with the highest proportion of "A great deal" of confidence applying to the scientific community.
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

| Parameter | | Estimate | Standard Error | Lower 95% CL | Upper 95% CL | Z | Pr > \|Z\| |
|---|---|---|---|---|---|---|---|
| Intercept | | -1.0868 | 0.0601 | -1.2047 | -0.9690 | -18.08 | <.0001 |
| question | conmedic | 0.5022 | 0.0697 | 0.3657 | 0.6388 | 7.21 | <.0001 |
| question | consci | 0.8995 | 0.0737 | 0.7552 | 1.0439 | 12.21 | <.0001 |
| question | coneduc | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
Wald Statistics For Type 3 GEE Analysis

| Source | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| question | 2 | 149.21 | <.0001 |
Other options for the working correlation structure can produce slightly different results, but in general, the results are not very sensitive to the specified working correlation structure because the empirical correlations among the responses dominate the GEE standard error calculations. With this in mind, a good compromise between model fit and parameter parsimony is the exchangeable structure, which allows a single correlation parameter for all pairwise responses within a subject. The results below are for the exchangeable structure; they are nearly identical to those from the independence structure because the estimated within-subject correlation is so small.
* gee with exchangeable corr structure;
proc genmod data=gss descending;
class id question(ref='coneduc');
model greatly = question / dist = bin link=logit type3 wald;
repeated subject=id / type=exch corrw;
title 'gee with exchangeable corr structure';
run;
Working Correlation Matrix

| | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 1.0000 | 0.2348 | 0.2348 |
| Row2 | 0.2348 | 1.0000 | 0.2348 |
| Row3 | 0.2348 | 0.2348 | 1.0000 |
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

| Parameter | | Estimate | Standard Error | Lower 95% CL | Upper 95% CL | Z | Pr > \|Z\| |
|---|---|---|---|---|---|---|---|
| Intercept | | -1.0868 | 0.0601 | -1.2047 | -0.9690 | -18.08 | <.0001 |
| question | conmedic | 0.5022 | 0.0697 | 0.3657 | 0.6388 | 7.21 | <.0001 |
| question | consci | 0.8995 | 0.0737 | 0.7552 | 1.0439 | 12.21 | <.0001 |
| question | coneduc | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
The naive approach would assume all the responses from all subjects are independent and estimate the parameters and standard errors with the usual logistic model, which can be done with R's glm function.
library(geepack)
gss = read.csv(file.choose(), header=T) # "gss.confidence.csv"
# all observations independent
fit.glm = glm(greatly~question, data=gss, family=binomial(link='logit'))
By default, R will treat the question about education (first alphabetically among the question types) as the baseline. Compare the output below to the estimates and standard errors obtained from the calculations above.
> summary(fit.glm)
Call:
glm(formula = greatly ~ question, family = binomial(link = "logit"),
data = gss)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.099 -0.941 -0.762 1.258 1.660
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0868 0.0601 -18.08 <2e-16 ***
questionconmedic 0.5022 0.0811 6.19 6e-10 ***
questionconsci 0.8995 0.0798 11.27 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5722.8 on 4400 degrees of freedom
Residual deviance: 5591.4 on 4398 degrees of freedom
AIC: 5597
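As a quick sanity check (a sketch assuming the fit.glm object above), exponentiating the fitted coefficients recovers the naive odds and odds ratios calculated by hand earlier:
# odds for education and odds ratios relative to education
exp(coef(fit.glm))
# (Intercept)      ~ 0.337  (odds of "A great deal" for education)
# questionconmedic ~ 1.65   (odds ratio, medicine vs education = 0.557/0.337)
# questionconsci   ~ 2.46   (odds ratio, science vs education = 0.829/0.337)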
If instead, we estimate the parameters and standard errors according to the GEE approach, we use the geeglm function and have several choices for the working correlation structure (within-subjects). Note that even with the working correlation structure specified as independent, as it is below with the corstr option, the sample correlations among the three responses within subjects are still being incorporated into the standard error estimates.
# gee with independent corr structure
fit.ind = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="independence", scale.fix=T)
The id= option is where we specify the clusters within which we have repeated observations that may be correlated, and the scale.fix=T option avoids R's default of introducing an overdispersion scale parameter. Compared with the glm function output above, the estimates are identical, but the standard errors are slightly smaller. The anova function can also be used to output an ANOVA table for an overall effect due to question type. Compared against the reference chi-square distribution with two degrees of freedom, the test statistic of 149 is strong evidence that subjects' confidence differs among the three institutions, with the highest proportion of "A great deal" of confidence applying to the scientific community.
> summary(fit.ind)
Call:
geeglm(formula = greatly ~ question, family = binomial, data = gss,
id = id, corstr = "independence", scale.fix = TRUE)
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -1.0868 0.0601 327 < 2e-16 ***
questionconmedic 0.5022 0.0697 52 5.6e-13 ***
questionconsci 0.8995 0.0737 149 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation structure = independence
Scale is fixed.
Number of clusters: 1467 Maximum cluster size: 3
> anova(fit.ind)
Analysis of 'Wald statistic' Table
Model: binomial, link: logit
Response: greatly
Terms added sequentially (first to last)
Df X2 P(>|Chi|)
question 2 149 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Other options for the working correlation structure can produce slightly different results, but in general, the results are not very sensitive to the specified working correlation structure because the empirical correlations among the responses dominate the GEE standard error calculations. With this in mind, a good compromise between model fit and parameter parsimony is the exchangeable structure, which allows a single correlation parameter for all pairwise responses within a subject. The results below are for the exchangeable structure; they are nearly identical to those from the independence structure in this example because the estimated within-subject correlation is so small.
# gee with exchangeable corr structure
fit.exch = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="exchangeable", scale.fix=T)
# gee with unstructured corr structure
fit.un = geeglm(greatly~question, data=gss, id=id, family=binomial, corstr="unstructured", scale.fix=T)
> summary(fit.exch)
Call:
geeglm(formula = greatly ~ question, family = binomial, data = gss,
id = id, corstr = "exchangeable", scale.fix = TRUE)
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -1.0868 0.0601 327 < 2e-16 ***
questionconmedic 0.5022 0.0697 52 5.6e-13 ***
questionconsci 0.8995 0.0737 149 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation structure = exchangeable
Scale is fixed.
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.235 0.0173
Number of clusters: 1467 Maximum cluster size: 3
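The unstructured fit (fit.un) was created above but not printed. A short sketch that places the coefficient estimates and robust standard errors from the three working structures side by side makes the insensitivity concrete:
# compare GEE estimates and robust standard errors across working structures
round(cbind(ind = coef(fit.ind), exch = coef(fit.exch), un = coef(fit.un)), 4)
round(cbind(ind  = summary(fit.ind)$coefficients[, "Std.err"],
            exch = summary(fit.exch)$coefficients[, "Std.err"],
            un   = summary(fit.un)$coefficients[, "Std.err"]), 4)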
Adding Additional Predictors
With age (centered around its mean) also potentially involved, the model can be extended to
\(\log\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right)=\beta_0+Med_{ij}\beta_1+Sci_{ij}\beta_2+C.age_i\beta_3\)
Because age is measured at the subject level, it carries only the subject index \(i\); like any subject-level predictor, it does not vary with the question (institution type) index \(j\).
Adding additional predictors, including interactions, does not change the GEE approach to estimation, but the parameter interpretations are now adjusted for the other effects. For example, \(\beta_0\) now represents the log-odds of having "A great deal" of confidence in the (baseline) education system for a subject at the mean age of this survey data, which happens to be 48.1. The coefficient \(\beta_3\) is the change in log-odds of having "A great deal" of confidence in the education system for each 1-year increase in the subject's age. Moreover, this effect is assumed to be the same for the other institutions as well, unless an interaction is included.
* gee with age as a covariate;
data gss;
set gss;
c_age = age-48.1;
run;
proc genmod data=gss descending;
class id question(ref='coneduc');
model greatly = question c_age / dist = bin link=logit type3 wald;
repeated subject=id / type=exch corrw;
title 'gee with age as a covariate';
run;
We see that, in addition to the institution in question, a person's confidence is also related with age. For each additional year in age, the estimated odds of having "A great deal" of confidence in the education system (or either of the others) is multiplied by
\( e^{-0.0050}=0.995\)
So, the proportions of people having "A great deal" of confidence in these institutions tend to be smaller for older ages, but this effect is very small in practical terms. The statistical significance is likely due to the large sample size.
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

| Parameter | | Estimate | Standard Error | Lower 95% CL | Upper 95% CL | Z | Pr > \|Z\| |
|---|---|---|---|---|---|---|---|
| Intercept | | -1.0889 | 0.0603 | -1.2070 | -0.9708 | -18.07 | <.0001 |
| question | conmedic | 0.5031 | 0.0698 | 0.3663 | 0.6399 | 7.21 | <.0001 |
| question | consci | 0.9011 | 0.0738 | 0.7564 | 1.0457 | 12.21 | <.0001 |
| question | coneduc | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | . |
| c_age | | -0.0050 | 0.0023 | -0.0095 | -0.0006 | -2.22 | 0.0266 |
# gee with age covariate
gss$age = gss$age - mean(gss$age)
fit.age = geeglm(greatly~question+age, data=gss, id=id, family=binomial, corstr="exchangeable", scale.fix=T)
We see that, in addition to the institution in question, a person's confidence is also related with age. For each additional year in age, the estimated odds of having "A great deal" of confidence in the education system (or either of the others) is multiplied by
\( e^{-0.00504}=0.995\)
So, the proportions of people having "A great deal" of confidence in these institutions tend to be smaller for older ages, but this effect is very small in practical terms. The statistical significance is likely due to the large sample size.
> summary(fit.age)
Call:
geeglm(formula = greatly ~ question + age, family = binomial,
data = gss, id = id, corstr = "exchangeable", scale.fix = T)
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -1.08903 0.06026 326.65 < 2e-16 ***
questionconmedic 0.50306 0.06979 51.95 5.7e-13 ***
questionconsci 0.90109 0.07381 149.06 < 2e-16 ***
age -0.00504 0.00228 4.91 0.027 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation structure = exchangeable
Scale is fixed.
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.233 0.0174
Number of clusters: 1467 Maximum cluster size: 3
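To summarize the age effect on the odds scale, a brief sketch (assuming the fit.age object above) computes the estimated odds ratio per year of age along with an approximate 95% Wald interval:
# odds ratio per additional year of age, with an approximate 95% Wald CI
cf <- summary(fit.age)$coefficients
exp(cf["age", "Estimate"])                                           # about 0.995
exp(cf["age", "Estimate"] + c(-1, 1) * 1.96 * cf["age", "Std.err"])  # roughly (0.991, 0.999)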
Changing Covariance Structure
With PROC GENMOD, we can also try alternative assumptions about the within-subject correlation structure. The type=exch or type=cs option specifies an "exchangeable" or "compound symmetry" assumption, in which the observations within a subject are assumed to be equally correlated:
\(Corr(y_i)=\begin{bmatrix}
1 & \rho & \rho & \cdots & \rho\\
\rho & 1 & \rho & \cdots & \rho\\
\rho & \rho & 1 & \cdots & \rho\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\rho & \rho & \rho & \cdots & 1
\end{bmatrix}\)
This is the same kind of structure that arises in a normal linear model with random intercepts. It acknowledges that correlations exist within a subject, but it assumes that the correlations are constant over time.
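To make the structure concrete, here is a small R sketch (illustrative only, with a hypothetical correlation of 0.25 and a cluster size of 4) that builds an exchangeable working correlation matrix:
# exchangeable (compound symmetry) working correlation, hypothetical rho and cluster size
rho <- 0.25
n   <- 4
R.exch <- matrix(rho, n, n)
diag(R.exch) <- 1
R.exch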
An autoregressive (AR-1) structure, specified by type=ar, allows the correlations to diminish over time as
\(Corr(y_{ij},y_{ij'})=\rho^{|t_{ij}-t_{ij'}|}\),
where \(t_{ij}\) and \(t_{ij'}\) are the observation times for \(y_{ij}\) and \(y_{ij'}\).
The exchangeable and the autoregressive structures both express the within-subject correlations in terms of a single parameter \(\rho\). If the subjects are measured at a relatively small common set of occasions, we may be able to estimate an arbitrary correlation matrix. With \(n_i = 4\) measurement times per subject, the unstructured matrix would have six correlations to estimate. An unstructured matrix is obtained by the option type=un.
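Similarly, a hypothetical AR-1 matrix for four equally spaced observation times, along with a count of the free correlations in an unstructured matrix, can be sketched as follows (in geepack, the corresponding corstr values are "exchangeable", "ar1", and "unstructured"):
# AR-1 working correlation for equally spaced times, hypothetical rho = 0.25
rho   <- 0.25
times <- 1:4
rho ^ abs(outer(times, times, "-"))
# an unstructured matrix for n = 4 occasions has choose(4, 2) = 6 free correlations
choose(4, 2)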
Which correlation structure is best for any given problem?
In principle, it makes sense to think that the one that is most nearly "correct" would be best. For example, if the autoregressive or exchangeable assumptions are wrong, one might think that the unstructured form would provide the most efficient answer. But that is not necessarily true, because moving to an unstructured matrix introduces many more unknown parameters which could destabilize the model fit. In certain cases, a wrong structure with a small number of parameters could perform better than a correct structure with many parameters. Unfortunately, the literature does not provide much guidance on how to choose a good working correlation structure for GEE.