11: Advanced Topics I


Overview

In the previous lesson, we covered the basic applications of log-linear models: independence and various types of association. In this lesson, we continue with further applications involving ordinal variables and dependent data. Ordinal variables allow for a linear relationship, which we can accommodate in the log-linear model by including a slope parameter interpreted in much the same way as the slope in a regression model.

The dependent data consideration is the categorical version of matched pairs or repeated measures, where an association between variables is taken for granted, and the questions of interest focus more on the type of agreement between the variables.

Objectives
Upon completion of this lesson, you should be able to:

No objectives have been defined for this lesson yet.

11.1 - Modeling Ordinal Data with Log-linear Models


Earlier, in Lesson 4, we described ways to perform significance tests for independence and conditional independence and to measure (linear) associations with ordinal categorical variables. For example, we focused on the CMH statistic and correlation measures for testing independence and linear association in the example comparing job security and happiness from the General Social Survey data. We concluded that those who agree more with the statement "job security is good" have greater odds of being happier.

                        Happiness
Job security is good    Not too happy   Pretty happy   Very happy
Not at all true         15              25             5
Not too true            21              47             21
Somewhat true           64              248            100
Very true               73              474            311

 Can we answer the same questions (and more) via log-linear models?

Modeling Ordinal Data in 2-way Tables

Log-linear models for contingency tables, by default, treat all variables as nominal. If there is an ordering of the categories of a variable, it is not taken into account. That means we could rearrange the rows and/or columns of a table and get the same fitted odds ratios as we would with the original ordering of the rows and/or columns.

To model ordinal data with log-linear models, we can apply some of the general ideas we saw in the analysis of ordinal data earlier in the course. That is, we typically

  • assign scores to the levels of our categorical variables, and
  • include additional parameters (which represent these scores) into a log-linear model to model the dependency between two variables.

Linear by Linear Association Model

This is the most common log-linear model when we have ordinal data.

Objective

Modeling the log counts by accounting for the ordering of the categories of discrete variables.

Suppose we assign numeric scores for the categories of the row variable, \(u_1\le u_2\le \cdots\le u_I\), and for the categories of the column variable, \(v_1\le \cdots\le v_J\). Then, we can model the association between the two variables.

Model Structure

\(\log(\mu_{ij})=\lambda+\lambda_i^C+\lambda_j^S+\beta u_i v_j\)

For each row \(i\), the log fitted values are a linear function of the columns. For each column \(j\), the log fitted values are a linear function of the rows.

Parameter estimates and interpretation

This model only has one more parameter than the independence model (i.e., \(\beta\)), and is in between the independence and the saturated models in terms of its complexity. We are trying to say something about the "linear association" between these two variables based on the scores that we have assigned.

  • If \(\beta > 0\), then the two variables are positively associated (i.e., happiness tends to go up as job security goes up).
  • If \(\beta<0\), then the two variables are negatively associated.
  • The log odds ratio for any \(2 \times 2\) sub-table is a direct function of the row and column scores and \(\beta\):

\begin{align}
\log\theta(ij,i'j') &= \log(\mu_{ij})+\log(\mu_{i'j'})-\log(\mu_{i'j})-\log(\mu_{ij'})\\
&= \beta(u_i-u_{i'})(v_j-v_{j'})\\
\end{align}
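For instance, with integer scores and the slope estimate \(\hat{\beta}=0.4179\) obtained for our example below, the odds ratio for the \(2\times 2\) sub-table formed by the extreme rows and columns can be computed directly. This is a minimal R sketch; the only inputs are the assigned scores and the reported estimate.

    beta.hat = 0.4179        # slope estimate reported in the output below
    u = c(1, 2, 3, 4)        # row scores (job security)
    v = c(1, 2, 3)           # column scores (happiness)
    # odds ratio for the sub-table formed by rows 1 and 4, columns 1 and 3
    exp(beta.hat*(u[4] - u[1])*(v[3] - v[1]))   # about 12.3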

Model Fit

We can use the likelihood-ratio (deviance) statistic \(G^2\) as with any other log-linear model.

For our example (see ordinalLoglin.sas and the resulting output), we create two new numerical variables as seen below in the datalines statement:

    data gss;
    input jobsecok $ x happy $ y count;
    datalines;
      Notatall 1  Nottoo 1    15
        Nottoo 2  Nottoo 1    21
      Somewhat 3  Nottoo 1    64
          Very 4  Nottoo 1    73
      Notatall 1  Pretty 2    25
        Nottoo 2  Pretty 2    47
      Somewhat 3  Pretty 2   248
          Very 4  Pretty 2   474
      Notatall 1    Very 3     5
        Nottoo 2    Very 3    21
      Somewhat 3    Very 3   100
          Very 4    Very 3   311
    ;
    proc genmod data=gss order=data;
    class jobsecok happy;
    model count=jobsecok happy/link=log dist=poisson lrci type3 obstats;
    title 'Independence Model';
    run;
    proc genmod data=gss order=data;
    class jobsecok happy;
    model count=jobsecok happy x*y/link=log dist=poisson lrci type3 obstats;
    title 'Linear by Linear Association Model';
    run;

We observe \(G^2=3.6075\) with five degrees of freedom, which indicates that the linear-by-linear association model fits well and significantly better than the independence model (\(G^2=57.2404\) with six degrees of freedom). Note that these tests are relative to the saturated model, which treats both variables as entirely nominal and allows for any type of association between them, not just linear. In the case of these variables, which have four and three categories, respectively, there would be \((4-1)(3-1)=6\) parameters needed to describe their nominal association. The linear-by-linear association model replaces them all with a single slope parameter, and the independence model removes them all entirely.

Next, since we're satisfied with the fit of the linear-by-linear association model, we can look at the test for the significance of the linear association. This tests whether the slope parameter can be dropped and therefore involves a single degree of freedom. We can use either a likelihood ratio test or the Wald test reported in the summary of parameter estimates; they should be very similar. Moreover, we expect these results to agree with those of the CMH test we performed earlier in Lesson 4. The LRT statistic is the difference of the two deviance values:

\(G^2=57.2404-3.6075=53.633\)

The squared Wald statistic for \(\beta\) is \(z^2=7.131^2=50.85\), and finally, the CMH statistic is 47.97 from Lesson 4; all of these are referred to a chi-square distribution with one degree of freedom and demonstrate significant evidence of a linear association between job security and happiness, or more precisely, between the numeric scores that we assigned to the levels of these variables.

 
Criteria For Assessing Goodness Of Fit
Criterion                  DF   Value       Value/DF
Deviance                   5    3.6075      0.7215
Scaled Deviance            5    3.6075      0.7215
Pearson Chi-Square         5    3.7879      0.7576
Scaled Pearson X2          5    3.7879      0.7576
Log Likelihood                  6144.8702
Full Log Likelihood             -36.8426
AIC (smaller is better)         87.6851
AICC (smaller is better)        115.6851
BIC (smaller is better)         91.0795
 
Algorithm converged.
 
Analysis Of Maximum Likelihood Parameter Estimates
Parameter            DF   Estimate   Standard Error   LR 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept            1    0.7240     0.6874           -0.6394   2.0571           1.11              0.2922
jobsecok  Notatall   1    -0.4067    0.3643           -1.1283   0.3012           1.25              0.2642
jobsecok  Nottoo     1    -0.5030    0.2636           -1.0205   0.0137           3.64              0.0564
jobsecok  Somewhat   1    0.1822     0.1421           -0.0943   0.4630           1.64              0.1997
jobsecok  Very       0    0.0000     0.0000           0.0000    0.0000           .                 .
happy     Nottoo     1    1.9504     0.4096           1.1528    2.7597           22.67             <.0001
happy     Pretty     1    2.0866     0.2219           1.6577    2.5281           88.41             <.0001
happy     Very       0    0.0000     0.0000           0.0000    0.0000           .                 .
x*y                  1    0.4179     0.0586           0.3041    0.5339           50.85             <.0001
Scale                0    1.0000     0.0000           1.0000    1.0000

Note: The scale parameter was held fixed.

For our example (see ordinalLoglin.R and the resulting output), we create two new numerical variables as seen below:

    jobsecok = factor(rep(c("Not at all", "Not too", "Somewhat", "Very"), times=3))
    happy = factor(rep(c("Not too", "Pretty", "Very"), each=4))
    x = rep(c(1,2,3,4), times=3)
    y = rep(c(1,2,3), each=4)
    count = c(15,21,64,73,25,47,248,474,5,21,100,311)
    
    # setting references to last category (same as SAS)
    jobsecok = relevel(jobsecok, ref='Very')
    happy = relevel(happy, ref='Very')
    tbl.all = data.frame(jobsecok, x, happy, y, count)
    
    # independence Model
    model.ind = glm(count ~ jobsecok+happy, family=poisson(link=log), data=tbl.all)
    summary(model.ind)
    
    # linear by linear association model
    model.lin = glm(count ~ jobsecok+happy+x:y, family=poisson(link=log), data=tbl.all)
    summary(model.lin)
    
We observe \(G^2=3.6075\) with five degrees of freedom, which indicates that the linear-by-linear association model fits well and significantly better than the independence model (\(G^2=57.2404\) with six degrees of freedom). Note that these tests are relative to the saturated model, which treats both variables as entirely nominal and allows for any type of association between them, not just linear. In the case of these variables, which have four and three categories, respectively, there would be \((4-1)(3-1)=6\) parameters needed to describe their nominal association. The linear-by-linear association model replaces them all with a single slope parameter, and the independence model removes them all entirely.

 

Next, since we're satisfied with the fit of the linear-by-linear association model, we can look at the test for the significance of the linear association. This tests whether the slope parameter can be dropped and therefore involves a single degree of freedom. We can use either a likelihood ratio test or the Wald test reported in the summary of parameter estimates; they should be very similar. Moreover, we expect these results to agree with those of the CMH test we performed earlier in Lesson 4. The LRT statistic is the difference of the two deviance values:

\(G^2=57.2404-3.6075=53.633\)

The squared Wald statistic for \(\beta\) is \(z^2=7.131^2=50.85\), and finally, the CMH statistic is 47.97 from Lesson 4; all of these are referred to a chi-square distribution with one degree of freedom and demonstrate significant evidence of a linear association between job security and happiness, or more precisely, between the numeric scores that we assigned to the levels of these variables.

    > summary(model.lin)
    ...
    Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
    (Intercept)         0.72400    0.68739   1.053   0.2922    
    jobsecokNot at all -0.40671    0.36428  -1.116   0.2642    
    jobsecokNot too    -0.50296    0.26364  -1.908   0.0564 .  
    jobsecokSomewhat    0.18222    0.14208   1.282   0.1997    
    happyNot too        1.95036    0.40959   4.762 1.92e-06 ***
    happyPretty         2.08661    0.22192   9.403  < 2e-16 ***
    x:y                 0.41785    0.05859   7.131 9.95e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for poisson family taken to be 1)
    
        Null deviance: 1729.1636  on 11  degrees of freedom
    Residual deviance:    3.6075  on  5  degrees of freedom
    AIC: 87.685
    
    > lrtest(model.ind, model.lin)
    Likelihood ratio test
    
    Model 1: count ~ jobsecok + happy
    Model 2: count ~ jobsecok + happy + x:y
      #Df  LogLik Df  Chisq Pr(>Chisq)    
    1   6 -63.659                         
    2   7 -36.843  1 53.633  2.417e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
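The squared Wald statistic quoted above can also be extracted directly from the fitted model object; a small sketch, assuming model.lin from ordinalLoglin.R is available:

    z = coef(summary(model.lin))["x:y", "z value"]
    z^2   # about 50.85, close to the LRT value of 53.633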

Choice of Scores

There are many options for assigning the score values, and these can be equally or unequally spaced. The most common choice of scores is consecutive integers, that is, \(u_1=1\), \(u_2=2\), etc. (which is what we used in the example above). The model with such scores is a special case of the linear-by-linear association model and is known as the Uniform Association Model. It is called the uniform association model because the odds ratio for any two adjacent rows and any two adjacent columns is the same and equal to

\(\theta=\exp(\beta(u_i-u_{i-1})(v_j-v_{j-1}))=\exp(\beta)\)

In other words, the Local Odds Ratio equals \(\exp(\beta)\) and is the same for adjacent rows and columns.

Also, sets of scores with the same spacing between them will lead to the same goodness-of-fit statistics, fitted counts, odds ratios, and \(\hat{\beta}\). For example,

\(v_1=1,v_2=2,v_3=3,v_4=4\) and \(v_1=8,v_2=9,v_3=10,v_4=11\)

will yield the same results. However, note that while sets of scores with the same spacing (or, more generally, scores related by a linear transformation) lead to the same goodness-of-fit statistics, fitted counts, and odds ratios (with \(\hat{\beta}\) rescaled accordingly), ultimately the choice of scores should reflect the nature of the data and the context of the questions involved.
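As a quick check of this invariance, we can shift the row scores by a constant and refit the model; a minimal sketch, assuming the objects from ordinalLoglin.R above are still available:

    # same spacing, different origin: scores 8,9,10,11 instead of 1,2,3,4
    tbl.shift = transform(tbl.all, x = x + 7)
    model.shift = glm(count ~ jobsecok+happy+x:y, family=poisson(link=log), data=tbl.shift)
    deviance(model.lin); deviance(model.shift)         # both 3.6075
    coef(model.lin)["x:y"]; coef(model.shift)["x:y"]   # same slope estimate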


11.2 - Two-Way Tables - Dependent Samples


Recall that for a quantitative response, when responses are recorded as matched pairs, the question of interest is not usually whether there's a relationship or correlation between the two responses but rather whether the means of the responses are equal. Likewise, when categorical responses are paired, such as when subjects respond to the same question at two different time points, we could carry out the usual test of independence for categorical variables, but we may be more interested in whether the proportions are equal or in the extent to which the subjects' opinions agree over the two time points.


11.2.1 - Dependent Samples - Introduction


Suppose you were asked shortly after his inauguration in 2021, "Do you think Joe Biden will be effective as President of the United States?" Then, after four years, suppose you are asked, "Do you think Joe Biden has been effective as President of the United States?"

Whatever your responses are over the two time points, they will not be independent because they come from the same person. This is not the same situation as being chosen by chance in two random samples at those time points. If the same people are necessarily asked again four years later, then we really have only one sample with pairs of (dependent) responses.

In dependent samples, each observation in one sample can be paired with an observation in the other sample. Examples may include responses for essentially the same question at two points in time, such as in the example above, or even responses for different individuals if they are related in some way. Consider the examples below, the first of which has responses paired by their sibling relationship.

Example: Siblings and Puzzle Solving


Suppose we conduct an experiment to see whether a two-year difference in age among children leads to a difference in the ability to solve a puzzle quickly. Thirty-seven pairs of siblings, one aged six (younger) and one aged eight (older), were sampled; each child was given the puzzle, and the time taken to solve it (<1 minute, >1 minute) was recorded.

                 Younger sibling
Older sibling    <1 min   >1 min
<1 min           15       7
>1 min           5        10

 

Do older siblings tend to be more (or less) likely than younger siblings to solve the puzzle in less than one minute?

When studying matched pairs data we might be interested in:

  • Comparing the margins of the table (i.e., a row proportion versus a column proportion). In the example here, this would be comparing the proportion of younger siblings solving the puzzle in less than one minute against the proportion of older siblings solving the puzzle in less than one minute.
  • Measuring agreement between two groups. In this case, a high agreement would mean puzzles that tend to be solved in less than one minute by younger siblings also tend to be solved in less than one minute by older siblings. Higher agreement also corresponds to larger values in the diagonal cells of the table.

We will focus on single summary statistics and tests in this lesson. A log-linear, model-based approach to matched data will be taken up in the next lesson.

Example: Movie Critiques


The data below are on Siskel's and Ebert's opinions about 160 movies from April 1995 through September 1996. Reference: Agresti and Winner (1997) "Evaluating agreement and disagreement among movie reviewers." Chance, 10, pg. 10-14.

                  Ebert
Siskel      con    mixed   pro    total
con         24     8       13     45
mixed       8      13      11     32
pro         10     9       64     83
total       42     30      88     160

 Do Siskel and Ebert really agree on the ratings of the same movies? If so, where is the dependence coming from?


11.2.2 - Comparing Dependent Proportions


Let us now look at the example involving the siblings in more detail. Both siblings solve the same puzzle, and the response for each is whether it took less than or longer than one minute. It is sensible to assume that these two responses should be related because siblings likely inherit similar problem-solving skills from their common parents, and indeed if we test for independence between siblings, we have \(X^2=4.3612\) with one degree of freedom and p-value 0.03677. If we view solving the puzzle in less than one minute as "success", then this is equivalent to testing for equal row proportions.

However, the question of primary interest in this study is whether the older siblings tend to have a  higher proportion of success, compared with that of the younger siblings, which is a comparison of the first-row proportion against the first column proportion. Such a test does not require the use of siblings, and two samples of six-year-olds and eight-year-olds could have been independently chosen for this purpose. But using siblings allows for matched pairs of responses and controls for confounding factors that may be introduced with children from different parents.

The estimate for the difference in success proportions between ages is

\((15+7)/37-(15+5)/37=0.5946-0.5405=0.0541\)

 Stop and Think!
Why can't we apply the hypothesis test for two independent proportions here? What specific part of that approach is not appropriate?

Recall from Lesson 3 the variance for the estimated difference in proportions was

\(\displaystyle V(d)=\left[\frac{ \frac{\pi_{11}}{\pi_{1+}} (1-\frac{\pi_{11}}{\pi_{1+}})} {n_{1+}} + \frac{\frac{\pi_{21}}{\pi_{2+}} (1-\frac{\pi_{21}}{\pi_{2+}})} {n_{2+}} \right] \)

The variance of the difference is the sum of the individual variances only under independence; otherwise, we would need to take into account the covariance.

Another way to frame the question of no age effect is to ask whether the margins of the table (row versus column) are the same, which can be addressed with the test of marginal homogeneity, or McNemar's test, which we look at next.

Test of Marginal Homogeneity

The notation needed for this test is the same as what we've seen earlier for the test of independence, but instead of comparing row proportions, we compare the row margin with the column margin.

For older siblings, the probability of solving the puzzle in less than one minute (success) is

\(\pi_{1+} = \pi_{11} + \pi_{12}\)

And for younger siblings, the probability of solving the puzzle in less than one minute is

\(\pi_{+1} = \pi_{11} + \pi_{21}\)

The null hypothesis of no difference (marginal homogeneity) in a \(2 \times 2\) table is

\(H_0 \colon \pi_{1+} = \pi_{+1} \)

and is equivalent to the hypothesis that the off-diagonal probabilities are equal:

\(H_0 \colon \pi_{12} = \pi_{21} \)

The second of these is also known as the hypothesis of symmetry and generalizes to equal off-diagonal cell counts for larger tables as well. Note that the diagonal elements are not important here; they correspond to pairs in which the two siblings performed equally (either both less than or both greater than one minute). All the information required to determine whether there is a difference due to age is contained in the off-diagonal elements.

For general square \(I \times I \) tables, the hypothesis of marginal homogeneity is different from the hypothesis of symmetry, and the latter is a stronger hypothesis; symmetry introduces more structure in the square table. In a \(2 \times 2\) table, however, these two are the same test.

McNemar’s test for \(2 \times 2\) tables

This is the usual test of marginal homogeneity (and symmetry) for a \(2 \times 2\) table.

\(H_0 : \pi_{1+} = \pi_{+1}\) or equivalently \(\pi_{12} = \pi_{21}\)

Suppose that we treat the total number of observations in the off-diagonal as fixed:

\(n^\ast =n_{12}+n_{21}\)

Under the null hypothesis above, each of \(n_{12}\) and \(n_{21}\) is assumed to follow a \(Bin (n^\ast , 0.5)\) distribution. The rationale is that, under the null hypothesis, we have \(n_{12}+n_{21}\) total "trials" that can either result in cell \((1,2)\) or \((2,1)\) with probability 0.5. And provided that \(n^*\) is sufficiently large, we can use the usual normal approximation to the binomial:

\(z=\dfrac{n_{12}-0.5n^\ast}{\sqrt{0.5(1-0.5)n^\ast}}=\dfrac{n_{12}-n_{21}}{\sqrt{n_{12}+n_{21}}}\)

where \(0.5n^*\) and \(0.5(1-0.5)n^\ast\) are the expected count and variance for \(n_{12}\) under the \(H_0\). Under \(H_0\), \(z\) is approximately standard normal. This approximation works well provided that \(n^* \ge 10\). The p-value would depend on the alternative hypothesis. If \(H_a\) is that older siblings have a greater success probability, then the p-value would be

\(P(Z\ge z)\)

where \(Z\) is standard normal, and \(z\) is the observed value of the test statistic. A lower-tailed alternative would correspondingly use \(P(Z\le z)\), and a two-sided alternative would double the one-sided probability. Alternatively, for a two-sided alternative we may compare

\(z^2=\dfrac{(n_{12}-n_{21})^2}{n_{12}+n_{21}}\)

to a chi-square distribution with one degree of freedom. This test is valid under general multinomial sampling when \(n^\ast\) is not fixed, but the grand total \(n\) is. When the sample size is small, we can compute exact probabilities (p-values) using the binomial probability distribution.

Applying this to our example data gives

\(z=\dfrac{7-5}{\sqrt{7+5}}=0.577\)

The p-value is \(P(Z\ge 0.577)=0.2820\), which does not provide evidence of a difference in success probabilities (solving the puzzle in less than one minute) between the two age groups.
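These calculations are easy to reproduce by hand in R. The sketch below uses only the two off-diagonal counts; the last line gives the exact (binomial) version of the one-sided test.

n12 = 7; n21 = 5
z = (n12 - n21)/sqrt(n12 + n21)            # 0.577
1 - pnorm(z)                               # one-sided p-value, about 0.282
binom.test(n12, n12 + n21, p=0.5, alternative="greater")   # exact one-sided p, about 0.387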

Point Estimation and Confidence Interval

A sensible effect-size measure associated with McNemar’s test is the difference between the marginal proportions,

\(d=\pi_{1+}-\pi_{+1}=\pi_{12}-\pi_{21}\)

In large samples, the estimate of \(d\),

\(\hat{d}=\dfrac{n_{12}}{n}-\dfrac{n_{21}}{n}\)

is unbiased and approximately normal with variance

\begin{align}
V(\hat{d}) &= n^{-2} V(n_{12}-n_{21})\\
&= n^{-2}[V(n_{12})+ V(n_{21})-2Cov(n_{12},n_{21})]\\
&= n^{-1} [\pi_{12}(1-\pi_{12})+\pi_{21}(1-\pi_{21})+2\pi_{12} \pi_{21}]\\
\end{align}

An estimate of the variance is

\(\hat{V}(\hat{d})=n^{-1}\left[\dfrac{n_{12}}{n}(1-\dfrac{n_{12}}{n})+\dfrac{n_{21}}{n}(1-\dfrac{n_{21}}{n})+2\dfrac{n_{12}n_{21}}{n^2}\right]\)

and an approximate 95% confidence interval is

\(\hat{d}\pm 1.96\sqrt{\hat{V}(\hat{d})}\)

In our example, we get an estimated effect of \(\hat{d} = 0.0541\) and its standard error of \(\sqrt{\hat{V}(\hat{d})}=0.0932\), giving 95% confidence interval

\(0.0541\pm 1.96(0.0932)=(-0.1286,  0.2368)\)

Thus, although the older siblings had a higher proportion of success, it was not statistically significant. We cannot conclude that the two-year age difference is associated with faster puzzle-solving times.
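The estimate and interval above can also be reproduced directly from the cell counts; here is a minimal R sketch using the sibling table.

n = 37; n12 = 7; n21 = 5
d.hat = (n12 - n21)/n                                              # 0.0541
v.hat = (n12/n*(1 - n12/n) + n21/n*(1 - n21/n) + 2*n12*n21/n^2)/n
sqrt(v.hat)                                                        # 0.0932
d.hat + c(-1, 1)*1.96*sqrt(v.hat)                                  # about (-0.129, 0.237)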

Next, we do this analysis in SAS and R.

Example: Siblings and Puzzle Solving


McNemar Test in SAS - Sibling Data

In SAS, under PROC FREQ, the AGREE option gives the normal approximation version of McNemar's test, while EXACT MCNEM gives the exact version based on binomial probabilities. Here is a sample of what the SAS code would look like:

data siblings;
input older younger count ;
datalines;
 1 1 15
 1 2 7
 2 1 5
 2 2 10
; run;

/* normal approximation and exact McNemar test */
proc freq data=siblings; 
weight count;
tables older*younger / agree;
exact mcnem;
run;

Compare the value from the output below to the squared \(z\) value we computed above. The difference in the p-value (aside from some slight rounding) is due to the software using a two-sided alternative by default; we can divide it by 2 to get the one-sided version. In either case, the results indicate insignificant evidence of an age effect in puzzle-solving times.

 
McNemar's Test
Chi-Square   DF   Pr > ChiSq   Exact Pr >= ChiSq
0.3333       1    0.5637       0.7744

McNemar Test in R - Sibling Data

In R we can use the mcnemar.test() as demonstrated in Siblings.R:

siblings = matrix(c(15,5,7,10),nr=2,
 dimnames=list("older"=c("<1 min",">1 min"),"younger"=c("<1 min",">1 min")))
siblings

# usual test for independence comparing younger vs older
chisq.test(siblings, correct=F)

# McNemar test for equal proportions comparing younger vs older
mcnemar.test(siblings, correct=F)

Compare the value from the output below to the squared \(z\) value we computed above. The difference in the p-value (aside from some slight rounding) is due to the software using a two-sided alternative by default; we can divide it by 2 to get the one-sided version. In either case, the results indicate insignificant evidence of an age effect in puzzle-solving times.

> mcnemar.test(siblings, correct=F)

        McNemar's Chi-squared test

data:  siblings
McNemar's chi-squared = 0.33333, df = 1, p-value = 0.5637

11.2.3 - Efficiency of Matched Pairs


For the sibling puzzle-solving example, note that the data consisted of 37 responses for six-year-olds (younger siblings) and 37 responses for eight-year-olds (older siblings). How would the results differ if, instead of siblings, those same values had arisen from two independent samples of children of those ages?

To see why the approach with the dependent data, matched by siblings, is more powerful, consider the table below. The sample size here is defined as \(n = 37+37=74\) total responses, compared with \(n=37\) total pairs in the previous approach.

           <1 min   >1 min   total
Older      22       15       37
Younger    20       17       37

The estimated difference in proportions from this table is identical to the previous one:

\(\hat{d}=\hat{p}_1-\hat{p}_2=\dfrac{22}{37}-\dfrac{20}{37}=0.0541\)

But with independent data, the test for an age effect, which is equivalent to the usual \(\chi^2\) test of independence, gives \(X^2 = 0.2202\), compared with \(z^2 = 0.333\) from McNemar's test. The standard error for \(\hat{d}\) in this new table is

\(\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{37}+\dfrac{\hat{p}_2(1-\hat{p}_2)}{37}}=0.1150\)

compared with \(0.0932\), when using siblings and taking into account the covariance between them. Just as with matched pairs for quantitative variables, the covariance between pair members leads to smaller standard errors and greater overall power, compared with the independent samples approach.
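The two standard errors can be compared directly; a minimal R sketch using the marginal counts from the table above:

p1 = 22/37; p2 = 20/37                   # older and younger success proportions
sqrt(p1*(1 - p1)/37 + p2*(1 - p2)/37)    # 0.115 when the pairing is ignored,
                                         # versus 0.0932 when the covariance is used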

Let's take a look at the last part of Siblings.sas and its relevant output, where the same data are analyzed as if they were sampled independently and not matched by siblings.

data notsiblings;
input age $ time $ count ;
datalines;
 Older less1 22
 Older more1 15
 Younger less1 20
 Younger more1 17
;
proc freq data=notsiblings order=data;
 weight count;
    tables age*time /chisq riskdiff;
run;

Now we are doing just a regular test of independence, and the Pearson chi-square statistic is \(0.2202\) with a p-value of 0.6389. Although our conclusion appears to be the same (we still can't claim a significant age effect), notice that the p-value is larger (less significant) when the data are treated as independent. In general, the matched-pairs approach is more powerful.

Statistics for Table of age by time

Statistic                      DF   Value    Prob
Chi-Square                     1    0.2202   0.6389
Likelihood Ratio Chi-Square    1    0.2204   0.6388
Continuity Adj. Chi-Square     1    0.0551   0.8145
Mantel-Haenszel Chi-Square     1    0.2173   0.6411
Phi Coefficient                     0.0546
Contingency Coefficient             0.0545
Cramer's V                          0.0546

Column 1 Risk Estimates
             Risk     ASE      95% Confidence Limits   Exact 95% Confidence Limits
Row 1        0.5946   0.0807   0.4364    0.7528        0.4210    0.7525
Row 2        0.5405   0.0819   0.3800    0.7011        0.3692    0.7051
Total        0.5676   0.0576   0.4547    0.6804        0.4472    0.6823
Difference   0.0541   0.1150   -0.1714   0.2795

Difference is (Row 1 - Row 2)

Let's take a look at the last part of Siblings.R and its relevant output, where the same data are analyzed as if they were sampled independently and not matched by siblings.


notsiblings = matrix(c(22,20,15,17),nr=2,
 dimnames=list(c("Older","Younger"),c("<1 min",">1 min")))
notsiblings

chisq.test(notsiblings, correct=F)
prop.test(notsiblings, correct=F)

Now we are doing just a regular test of independence, and the Pearson chi-square statistic is \(0.2202\) with a p-value of 0.6389. Although our conclusion appears to be the same (we still can't claim a significant age effect), notice that the p-value is larger (less significant) when the data are treated as independent. In general, the matched-pairs approach is more powerful.

> chisq.test(notsiblings, correct=F)

        Pearson's Chi-squared test

data:  notsiblings
X-squared = 0.22024, df = 1, p-value = 0.6389

> prop.test(notsiblings, correct=F)

        2-sample test for equality of proportions without continuity
        correction

data:  notsiblings
X-squared = 0.22024, df = 1, p-value = 0.6389
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1713610  0.2794691

McNemar’s test applies whenever the hypothesis of marginal homogeneity is of interest, and questions of marginal homogeneity arise in a variety of problems involving dependent data.


11.2.4 - Measure of Agreement: Kappa


Another question of interest is the degree to which two different examiners agree in their evaluations. This has important applications in medicine, where two physicians may be called upon to evaluate the same group of patients for further treatment.

Cohen's kappa statistic (or simply kappa) is intended to measure agreement between two raters.

Example: Movie Critiques


Recall the example on movie ratings from the introduction. Do the two movie critics, in this case, Siskel and Ebert, classify the same movies into the same categories; do they really agree?

                  Ebert
Siskel      con    mixed   pro    total
con         24     8       13     45
mixed       8      13      11     32
pro         10     9       64     83
total       42     30      88     160

In the (necessarily) square table above, the main diagonal counts represent movies where both raters agreed. Let \(\pi_{ij}\) denote the probability that Siskel classifies a movie in category \(i\) and Ebert classifies the same movie in category \(j\). For example, \(\pi_{13}\) is the probability that Siskel rates a movie as "con", but Ebert rates it as "pro".

The term \(\sum_{i} \pi_{ii}\) then is the total probability of agreement. The extreme case that all observations are classified on the main diagonal is known as "perfect agreement".

 Stop and Think!
Is it possible to define perfect disagreement?

Kappa measures agreement. A perfect agreement is when all of the counts fall on the main diagonal of the table, and the probability of agreement will be equal to 1.

To define perfect disagreement, the ratings of the movies would have to be opposite one another, ideally at the extremes. In a \(2 \times 2\) table it is possible to define perfect disagreement because each positive rating has one specific negative counterpart (e.g., love it vs. hate it), but what about \(3 \times 3\) or larger square tables? In those cases there are more ways to disagree, and it quickly becomes more complicated to disagree "perfectly". Perfect disagreement would have to minimize agreement across all combinations, and in larger tables this would force some cells to be empty, because it is impossible to disagree maximally across all combinations at the same time.

In a \(3 \times 3\) table, here are two options that would provide no agreement at all (the # indicating a count):

    1   2   3
1   0   0   #
2   0   0   0
3   #   0   0

    1   2   3
1   0   #   0
2   #   0   0
3   0   0   0

Cohen’s kappa is a single summary index that describes the strength of inter-rater agreement.

For \(I \times I\) tables, it’s equal to

\(\kappa=\dfrac{\sum\pi_{ii}-\sum\pi_{i+}\pi_{+i}}{1-\sum\pi_{i+}\pi_{+i}}\)

This statistic compares the observed agreement to the expected agreement, computed assuming the ratings are independent.

Under the null hypothesis that the ratings are independent, the diagonal probabilities satisfy

\(\pi_{ii}=\pi_{i+}\pi_{+i}\quad\text{ for all }i\)

If the observed agreement is due to chance only, i.e., if the ratings are completely independent, then each diagonal element is a product of the two marginals.

Since the total probability of agreement is \(\sum_{i} \pi_{ii}\), the probability of agreement under the null hypothesis equals \(\sum_{i} \pi_{i+}\pi_{+i}\). Note also that \(\sum_{i} \pi_{ii} = 0\) means no agreement, and \(\sum_{i} \pi_{ii} = 1\) indicates perfect agreement. The kappa statistic is defined so that a larger value implies stronger agreement. Furthermore,

  • Perfect agreement \(\kappa = 1\).
  • \(\kappa = 0\), does not mean perfect disagreement but rather only agreement that would result from chance only, where the diagonal cell probabilities are simply products of the corresponding marginals.
  • If the actual agreement is greater than the agreement expected by chance, then \(\kappa> 0\).
  • If the actual agreement is less than the agreement expected by chance, then \(\kappa< 0\).
  • The minimum possible value of \(\kappa = −1\).
  • A value of kappa higher than 0.75 can be considered (arbitrarily) as "excellent" agreement, while lower than 0.4 will indicate "poor" agreement.
Note! Strong agreement implies strong association, but strong association need not imply strong agreement. For example, if Siskel puts most of the movies into the con category while Ebert puts them into the pro category, the association might be strong, but there is certainly no agreement. You may also think of the situation where one examiner is tougher than the other: the first consistently gives one grade less than the more lenient one. In this case, the association is again very strong, but the agreement may be poor.

Under multinomial sampling, the sample estimate \(\hat{\kappa}\) has an approximately normal distribution in large samples. Thus, we can rely on an asymptotic 95% confidence interval.
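Before turning to the software, here is what the kappa formula does with the movie table; a minimal hand computation in R (the vcd package used below reports the same value along with its ASE):

tab = matrix(c(24,8,10,8,13,9,13,11,64), nrow=3)   # Siskel in rows, Ebert in columns
p = tab/sum(tab)
p.agree  = sum(diag(p))                  # observed agreement, about 0.631
p.chance = sum(rowSums(p)*colSums(p))    # agreement expected under independence, about 0.397
(p.agree - p.chance)/(1 - p.chance)      # kappa, about 0.389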

In SAS, use the option AGREE as shown below and in the SAS program MovieCritics.sas.

data critic;
input siskel $ ebert $ count ;
datalines;
 con con  24 
 con mixed 8 
 con pro 13
 mixed con 8 
 mixed mixed 13
 mixed pro 11
 pro con 10
 pro mixed 9
 pro pro 64
 ; run;
 
proc freq; 
weight count;
tables siskel*ebert / agree chisq;
run;

From the output below, we can see that the "Simple Kappa" gives the estimated kappa value of 0.3888 with its asymptotic standard error (ASE) of 0.0598. The difference between observed agreement and expected under independence is about 40% of the maximum possible difference. Based on the reported 95% confidence interval, \(\kappa\) falls somewhere between 0.2716 and 0.5060 indicating only a moderate agreement between Siskel and Ebert.

 
Kappa Statistics
Statistic        Estimate   Standard Error   95% Confidence Limits
Simple Kappa     0.3888     0.0598           0.2716   0.5060
Weighted Kappa   0.4269     0.0635           0.3024   0.5513

Sample Size = 160

 

In R, we can use the Kappa function in the vcd package. The following are from the script MovieCritics.R.

critic = matrix(c(24,8,10,8,13,9,13,11,64),nr=3,
 dimnames=list("siskel"=c("con","mixed","pro"),"ebert"=c("con","mixed","pro")))
critic

# chi-square test for independence between raters
result = chisq.test(critic)
result

# kappa coefficient for agreement
library(vcd)
kappa = Kappa(critic)

From the output below, we can see that the "unweighted" statistic gives the estimated kappa value of 0.389 with its asymptotic standard error (ASE) of 0.063. The difference between observed agreement and expected under independence is about 40% of the maximum possible difference. Based on the reported values, the 95% confidence interval for \(\kappa\) ranges from 0.27 to 0.51, indicating only a moderate agreement between Siskel and Ebert.

> kappa
            value     ASE     z  Pr(>|z|)
Unweighted 0.3888 0.05979 6.503 7.870e-11
Weighted   0.4269 0.06350 6.723 1.781e-11
> confint(kappa)
            
Kappa              lwr       upr
  Unweighted 0.2716461 0.5060309
  Weighted   0.3024256 0.5513224

 Issue with Cohen's Kappa

Kappa strongly depends on the marginal distributions. That is, the same degree of agreement can produce very different \(\kappa\) values if the proportions of cases falling in the different categories differ. This is also one reason why the minimum value of \(\kappa\) depends on the marginal distributions, and the minimum possible value of \(-1\) is not always attainable.

 Solution

Modeling agreement (e.g., via log-linear or other models) is typically a more informative approach.

Weighted kappa is a version of kappa used for measuring agreement on ordered variables, where certain disagreements (e.g., lowest versus highest) can be weighted as more or less important.


11.3 - Inference for Log-linear Models - Dependent Samples


After the previous discussion on matched and dependent data for square tables, we now consider similar questions with the log-linear model. For the movie ratings example, we concluded that the model of independence is not good (as expected) and that there is only moderate agreement between Siskel and Ebert (estimated \(\kappa = 0.39\)).

 

                  Ebert
Siskel      con    mixed   pro    total
con         24     8       13     45
mixed       8      13      11     32
pro         10     9       64     83
total       42     30      88     160

 How can we use this approach to test for various degrees of agreement between two variables?

Log-linear Models for Square Tables

We can always fit the log-linear models we are already familiar with (e.g., the independence model), but given the nature of matched pairs or repeated measures data, we expect dependence. Thus, there are additional models, specific to these kinds of data that we'd like to consider. Some of those are

  • Symmetry
  • Quasi-symmetry
  • Marginal homogeneity

Fitting these models typically requires the creation of additional variables, either indicators or other types of numerical variables, to be included in the models we are already familiar with, such as the independence model.


11.3.1 - Symmetry Model


A rather restrictive form of agreement between two variables is symmetry, where the expected counts satisfy \(\mu_{ij}=\mu_{ji}\) for all \(i\) and \(j\). As the name suggests, this means the table of expected counts is a symmetric matrix.

Using the notation from the Siskel and Ebert example with general log-linear model

\(\log\mu_{ij}=\lambda+\lambda^S_i+\lambda^E_j+\lambda^{SE}_{ij}\)

the symmetry assumption would mean that

  • \(\lambda_{ij}^{SE}=\lambda^{SE}_{ji}\)
  • \(\lambda^S_i=\lambda^E_i\)

for all \(i\) and \(j\). Note that this differs from the independence assumption that all \(\lambda_{ij}^{SE}=0\). Both are restrictive models, but neither is a special case of the other.

For an \(I\times I\) table, there are \(I\) cells along the main diagonal and \(I(I-1)\) cells off the main diagonal, giving \(I+I(I-1)/2\) unique parameters to be estimated to fit the symmetry model.

The challenge in actually fitting this model, however, is that the symmetry restrictions are not simply removing certain terms (setting them equal to 0). Rather, the restriction is that certain counts---the off-diagonal ones---are required to be equal. So, we use a regression approach with some carefully defined indicator variables. Specifically, for each observation in the sample, we take \(I_{ij}=1\) if the observation falls in row \(i\) and column \(j\) OR if the observation falls in row \(j\) and column \(i\), for all \(i\le j\le I\); otherwise \(I_{ij}=0\).

The idea is that the symmetry model views the \(n_{ij}\) count and the \(n_{ji}\) count as coming from a common population, and so only a single parameter is attributed to its mean. For the Siskel and Ebert example, this reduces to

\(\log\mu_{ij}=\beta_1I_{11}+\beta_2I_{22}+\beta_3I_{33}+ \beta_4I_{12}+\beta_5I_{13}+\beta_6I_{23} \)

Thus, the \(i^{th}\) main diagonal mean is \(\mu_{ii}=e^{\beta_i}\), and the \((i,j)^{th}\) off-diagonal mean is \(\mu_{ij}=\mu_{ji}=e^{\beta}\), where \(\beta\) is the coefficient for \(I_{ij}\) (for \(i<j\)).

Let's see how we can do this in SAS and R.

Take a look at the SAS code in MovieCritics.sas. The data used previously are updated with the indicators we'll use for the symmetry model. These replace not only the original classification variables siskel and ebert but also the intercept, giving a total of \(I+I(I-1)/2=6\) parameters in this case (with \(I=3\)).

data critic;
set critic;

/* indicator variables for each diagonal cell*/
I11=0; if siskel='con' AND ebert='con' THEN I11=1;
I22=0; if siskel='mixed' AND ebert='mixed' THEN I22=1;
I33=0; if siskel='pro' AND ebert='pro' THEN I33=1;

/* indicator variables for each off-diagonal cell*/
I12=0; if siskel='con' AND ebert='mixed' THEN I12=1;
 if siskel='mixed' AND ebert='con' THEN I12=1;
I13=0; if siskel='con' AND ebert='pro' THEN I13=1;
 if siskel='pro' AND ebert='con' THEN I13=1;
I23=0; if siskel='mixed' AND ebert='pro' THEN I23=1;
 if siskel='pro' AND ebert='mixed' THEN I23=1;
; run;

/* symmetry */
proc genmod data=critic order=data;
model count=I11 I22 I33 I12 I13 I23 / 
 link=log dist=poisson noint predicted;
title "Symmetry Model";
run;

From the output, note that the deviance statistic \(G^2=0.5928\) is comparing the symmetry model to the saturated model, which has 3 degrees of freedom corresponding to the difference in the models' numbers of parameters: 9 versus 6. The p-value is fairly large, indicating that the symmetry model is a reasonable fit for this data.

 
Criteria For Assessing Goodness Of Fit
Criterion                  DF   Value      Value/DF
Deviance                   3    0.5928     0.1976
Scaled Deviance            3    0.5928     0.1976
Pearson Chi-Square         3    0.5913     0.1971
Scaled Pearson X2          3    0.5913     0.1971
Log Likelihood                  351.2829
Full Log Likelihood             -20.3921
AIC (smaller is better)         52.7842
AICC (smaller is better)        94.7842
BIC (smaller is better)         53.9676
The estimates of the model parameters are given in the table below. We can exponentiate these to get the mean estimates for the \(3\times 3\) table. For example, the estimate for \(\mu_{13}\) is \(e^{2.4423}=11.5\), which can also be calculated intuitively from the table counts directly:

\(\hat{\mu}_{13}=\dfrac{n_{13}+n_{31}}{2}=\dfrac{13+10}{2}=11.5\)

 
Analysis Of Maximum Likelihood Parameter Estimates
Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept   0    0.0000     0.0000           0.0000    0.0000             .                 .
I11         1    3.1781     0.2041           2.7780    3.5781             242.40            <.0001
I22         1    2.5649     0.2774           2.0214    3.1085             85.53             <.0001
I33         1    4.1589     0.1250           3.9139    4.4039             1106.96           <.0001
I12         1    2.0794     0.2500           1.5895    2.5694             69.19             <.0001
I13         1    2.4423     0.2085           2.0337    2.8510             137.20            <.0001
I23         1    2.3026     0.2236           1.8643    2.7408             106.04            <.0001
Scale       0    1.0000     0.0000           1.0000    1.0000

Note: The scale parameter was held fixed.

Take a look at the R code in MovieCritics.R. The data is converted to the "stacked" or long format to fit the log-linear model, and the indicators we need for the symmetry model replace the earlier factors S and E. Also, note that the intercept is dropped with the "-1" specification. This is to avoid an overparameterization, given the \(I+I(I-1)/2=6\) indicator variables are already present (with \(I=3\)).

S = factor(rep(c("con","mixed","pro"),each=3)) # siskel
E = factor(rep(c("con","mixed","pro"),times=3)) # ebert
count = c(24,8,13,8,13,11,10,9,64)

# indicator variables for diagonals
I11 = (S=="con")*(E=="con")
I22 = (S=="mixed")*(E=="mixed")
I33 = (S=="pro")*(E=="pro")

# indicator variables for off-diagonals
I12 = (S=="con")*(E=="mixed") + (E=="con")*(S=="mixed")
I13 = (S=="con")*(E=="pro") + (E=="con")*(S=="pro")
I23 = (S=="mixed")*(E=="pro") + (E=="mixed")*(S=="pro")

# symmetry model
fit.s = glm(count~I11+I12+I13+I22+I23+I33-1,family=poisson(link='log'))

At the end of the output, note that the deviance statistic \(G^2=0.59276\) is comparing the symmetry model to the saturated model, which has 3 degrees of freedom corresponding to the difference in the models' numbers of parameters: 9 versus 6. The p-value is fairly large, indicating that the symmetry model is a reasonable fit to this data.

The estimates of the model parameters are given in the "Coefficients" table below. We can exponentiate these to get the mean estimates for the \(3\times 3\) table. For example, the estimate for \(\mu_{13}\) is \(e^{2.4423}=11.5\), which can also be calculated intuitively from the table counts directly:

\(\hat{\mu}_{13}=\dfrac{n_{13}+n_{31}}{2}=\dfrac{13+10}{2}=11.5\)

> summary(fit.s)

Call:
glm(formula = count ~ I11 + I12 + I13 + I22 + I23 + I33 - 1, 
    family = poisson(link = "log"))

Deviance Residuals: 
      1        2        3        4        5        6        7        8  
 0.0000   0.0000   0.4332   0.0000   0.0000   0.3112  -0.4525  -0.3217  
      9  
 0.0000  

Coefficients:
    Estimate Std. Error z value Pr(>|z|)    
I11   3.1781     0.2041  15.569   <2e-16 ***
I12   2.0794     0.2500   8.318   <2e-16 ***
I13   2.4423     0.2085  11.713   <2e-16 ***
I22   2.5649     0.2774   9.248   <2e-16 ***
I23   2.3026     0.2236  10.297   <2e-16 ***
I33   4.1589     0.1250  33.271   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 721.15855  on 9  degrees of freedom
Residual deviance:   0.59276  on 3  degrees of freedom
AIC: 52.784

Number of Fisher Scoring iterations: 4

> mu.s = fitted.values(fit.s)
> matrix(mu.s,nrow=3)
     [,1] [,2] [,3]
[1,] 24.0    8 11.5
[2,]  8.0   13 10.0
[3,] 11.5   10 64.0

Although the symmetry model fits well in this example, it may be too restrictive for other types of data. So, something less restrictive (requiring less agreement) may be more suitable. This leads us to consider "quasi-symmetry".


11.3.2 - Quasi-symmetry Model


Less restrictive than the symmetry model but still assuming some agreement compared with the fully saturated (unspecified) model, the quasi-symmetry model assumes that

\(\lambda_{ij}^{SE}=\lambda^{SE}_{ji}\)

Note that, compared with the symmetry model, the quasi-symmetry model does not require marginal homogeneity, which is that \(\lambda^S_i=\lambda^E_i\). In terms of the notation we have so far, the model now becomes

\(\log(\mu_{ij}) = \lambda+\lambda_i^S+\lambda_j^E+ \beta_4I_{12}+\beta_5I_{13}+\beta_6I_{23} \)

Thus, we use the separate S and E factors to allow for more general marginal distributions but apply a restriction to their joint distribution.

Generally for an \(I\times I\) table, the number of parameters breaks down as follows:

\(1+(I-1)+(I-1)+\dfrac{I(I-1)}{2}\)

The first three quantities allow one parameter for \(\lambda\) and \(I-1\) parameters for each marginal distribution separately for S and E. The last quantity \(I(I-1)/2\) corresponds to the number of off-diagonal counts and is identical to the symmetry model. But the symmetry model requires only \(I\) parameters for a common marginal distribution, whereas the quasi-symmetry model here allows for additional parameters and is hence less restrictive.

While it may seem at a glance that using the same indicators for the off-diagonal table counts would require symmetry, note that the additional parameters allowed in the marginal distributions for S and E influence the off-diagonal counts as well. For example, the expected count for the \((1,2)\) cell would be

\(\mu_{12}=\exp[\lambda+\lambda_1^S+\lambda_2^E+\beta_4]\)

while the expected count for the \((2,1)\) cell would be

\(\mu_{21}=\exp[\lambda+\lambda_2^S+\lambda_1^E+\beta_4]\)

For the \(3\times3\) table we're currently working with, the quasi-symmetry model has 8 parameters, and so the deviance test versus the saturated model will have only a single degree of freedom. But for larger tables, the reduction can be more appreciable. As it is, we see below when fitting this model with software that the test statistic is \(G^2=0.0061\), so the fit is essentially as good as that of the saturated model.

In SAS we have...

/* quasi-symmetry */
proc genmod data=critic order=data;
class siskel ebert;
model count=siskel ebert I12 I13 I23 /
 link=log dist=poisson predicted;
title "Quasi-Symmetry Model";
run;

And from the output, we can verify the deviance statistic above as well as the non-symmetrical form of the fitted values.

 
Criteria For Assessing Goodness Of Fit
Criterion                  DF   Value      Value/DF
Deviance                   1    0.0061     0.0061
Scaled Deviance            1    0.0061     0.0061
Pearson Chi-Square         1    0.0061     0.0061
Scaled Pearson X2          1    0.0061     0.0061
Log Likelihood                  351.5763
Full Log Likelihood             -20.0988
AIC (smaller is better)         56.1975
AICC (smaller is better)        .
BIC (smaller is better)         57.7753
 
Observation Statistics
Observation   count   Predicted Value   Linear Predictor   Std Error of Linear Predictor   HessWgt
1             24      24                3.1780538          0.2041241                       24
2             8       8.0980871        2.0916279          0.3150296                       8.0980871
3             13      12.901913        2.5573756          0.2606862                       12.901913
4             8       7.9019129        2.0671049          0.3179476                       7.9019129
5             13      13                2.5649494          0.2773501                       13
6             11      11.098087        2.4067728          0.2778455                       11.098087
7             10      10.098087        2.312346           0.2888566                       10.098087
8             9       8.9019129        2.1862662          0.3037655                       8.9019129
9             64      64                4.1588831          0.125                           64

In R we have...

# quasi-symmetry model
fit.q = glm(count~S+E+I12+I13+I23,family=poisson(link='log'))

And from the output, we can verify the deviance statistic above as well as the non-symmetrical form of the fitted values.

> summary(fit.q)

Call:
glm(formula = count ~ S + E + I12 + I13 + I23, family = poisson(link = "log"))

Deviance Residuals: 
       1         2         3         4         5         6         7         8  
 0.00000  -0.03454   0.02727   0.03482   0.00000  -0.02949  -0.03092   0.03281  
       9  
 0.00000  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.1781     0.2041  15.569  < 2e-16 ***
Smixed       -0.3188     0.2594  -1.229  0.21913    
Spro          0.3679     0.2146   1.714  0.08652 .  
Emixed       -0.2943     0.2594  -1.134  0.25666    
Epro          0.6129     0.2146   2.856  0.00430 ** 
I12          -0.7921     0.3036  -2.609  0.00907 ** 
I13          -1.2336     0.2414  -5.110 3.22e-07 ***
I23          -1.0654     0.2712  -3.928 8.55e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1.0221e+02  on 8  degrees of freedom
Residual deviance: 6.0515e-03  on 1  degrees of freedom
AIC: 56.198

> mu.q = fitted.values(fit.q)
> matrix(mu.q,nrow=3)
          [,1]      [,2]      [,3]
[1,] 24.000000  7.901913 10.098087
[2,]  8.098087 13.000000  8.901913
[3,] 12.901913 11.098087 64.000000
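As a check on the expressions for \(\mu_{12}\) and \(\mu_{21}\) given above, the ratio of these two fitted means depends only on the main-effect terms; a small sketch, assuming fit.q from MovieCritics.R:

b = coef(fit.q)
exp(b["Emixed"] - b["Smixed"])        # fitted mu_12 / mu_21, about 1.025
fitted(fit.q)[2]/fitted(fit.q)[4]     # same ratio from the fitted counts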

11.3.3 - Marginal Homogeneity


We have seen this when we discussed the two-way table for the siblings' puzzle times. Marginal homogeneity means the row and column distributions (of a square table) are the same. That is, 

\(\mu_{i+}=\mu_{+i}\)

for all \(i\). There is no log-linear model that corresponds directly to marginal homogeneity, but we can use the log-linear models of symmetry and quasi-symmetry to indirectly test for marginal homogeneity.

Recall that symmetry has two restrictions:

  • \(\lambda_{ij}^{SE}=\lambda^{SE}_{ji}\)
  • \(\lambda^S_i=\lambda^E_i\)

And this suggests that symmetry consists of restrictions from both quasi-symmetry and marginal homogeneity. This correspondence is actually one-to-one, meaning that both quasi-symmetry and marginal homogeneity together are necessary and sufficient for symmetry. And thus if a symmetry model lacks fit when compared with a quasi-symmetry model, it must be due to the marginal homogeneity assumption. This gives us an avenue for testing marginal homogeneity:

\(G^2\) (marginal homogeneity versus saturated) \(= G^2\) (symmetry versus quasi-symmetry)

For our example, we calculated \(G^2 = 0.5928\) for the test of symmetry versus saturated (with 3 df) and \(G^2=0.0061\) for the test of quasi-symmetry versus saturated (with 1 df), giving a likelihood ratio test of \(G^2=0.5928-0.0061=0.5867\) for the test of symmetry versus quasi-symmetry (with 2 df). Indirectly, this serves as our test for marginal homogeneity, and in this case, it would be a reasonable fit to the data (p-value 0.746).
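In R, this comparison can be made directly from the two fits; a minimal sketch, assuming fit.s and fit.q from MovieCritics.R:

anova(fit.s, fit.q, test="LRT")     # difference 0.5928 - 0.0061 = 0.5867 on 2 df
1 - pchisq(0.5867, df=2)            # p-value, about 0.746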

A diagram of the models discussed in this lesson is depicted below.

[Model hierarchy diagram: the saturated model contains quasi-symmetry and marginal homogeneity as special cases, and symmetry is the intersection of the two.]

The most general (complex) model is the saturated model. Each of quasi-symmetry and marginal homogeneity is a special case of the saturated model, but neither is a special case of the other. And symmetry is both quasi-symmetry and marginal homogeneity.

In terms of deviance statistics versus saturated, the table below summarizes the test results. For this example, the symmetry model would be preferred because it's the most restrictive model that fits the data reasonably well.

Model                  df   \(G^2\)    p-value
Independence           4    43.2325    0.0001
Symmetry               3    0.5928     0.900
Marginal homogeneity   2    0.587      0.746
Quasi-symmetry         1    0.0061     0.938

Thus, Siskel and Ebert tend to agree on their ratings of movies.
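For reference, the independence entry in the table comes from fitting the main-effects-only model to the same stacked data; a sketch assuming S, E, and count from MovieCritics.R:

fit.i = glm(count ~ S + E, family=poisson(link='log'))
deviance(fit.i)                                   # 43.2325 on 4 df
pchisq(deviance(fit.i), df=4, lower.tail=FALSE)   # very small p-value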


11.4 - Lesson 11 Summary


In this lesson, we applied the log-linear model to ordinal data and showed how it can be used to describe a linear relationship between two ordinal variables. For a given set of numeric scores representing the ordinal categories, we introduced a regression-like slope parameter to describe the linear relationship, replacing the ANOVA-like interaction terms previously used when variables were nominal. The advantage of this was to reduce the number of parameters involved to a single slope and increase the power for testing its significance.

We also showed how agreement between two categorical variables can be measured and tested. This is a question of interest when variables are matched within-subjects, such as in repeated measures over time. In such cases, some association is usually taken for granted, and the focus is then on the specific type of association, such as symmetry or marginal homogeneity.

In the next lesson, we extend this idea of dependent or matched-pairs data to cases involving three or more measurements. This would be the three-way (or higher) table extension to the examples considered so far.

