11.2 - Two-Way Tables - Dependent Samples


Recall that for a quantitative response, when responses are recorded as matched pairs, the question of interest is usually not whether there's a relationship or correlation between the two responses but rather whether the means of the responses are equal. Likewise, when categorical responses are paired, such as when subjects respond to the same question at two different time points, we could carry out the usual test of independence for categorical variables, but we may be more interested in whether the proportions are equal or in the extent to which the subjects' opinions agree over the two time points.


11.2.1 - Dependent Samples - Introduction


Suppose you were asked shortly after his inauguration in 2021, "Do you think Joe Biden will be effective as President of the United States?" Then, after four years, suppose you are asked, "Do you think Joe Biden has been effective as President of the United States?"

Whatever your responses are over the two time points, they will not be independent because they come from the same person. This is not the same situation as being chosen by chance in two separate random samples at those time points. If the same people are asked again four years later, then we really have only one sample with pairs of (dependent) responses.

In dependent samples, each observation in one sample can be paired with an observation in the other sample. Examples may include responses for essentially the same question at two points in time, such as in the example above, or even responses for different individuals if they are related in some way. Consider the examples below, the first of which has responses paired by their sibling relationship.

Example: Siblings and Puzzle Solving


Suppose we conduct an experiment to see whether a two-year difference in age among children leads to a difference in the ability to solve a puzzle quickly. Thirty-seven pairs of siblings, ages six (younger) and eight (older), are sampled; each child is given the puzzle, and the time taken to solve it (<1 minute, >1 minute) is recorded.

                     younger sibling
older sibling     <1 min     >1 min
<1 min              15          7
>1 min               5         10

 

Do older siblings tend to be more (or less) likely than younger siblings to solve the puzzle in less than one minute?

When studying matched pairs data we might be interested in:

  • Comparing the margins of the table (i.e., a row proportion versus a column proportion). In the example here, this would be comparing the proportion of younger siblings solving the puzzle in less than one minute against the proportion of older siblings solving the puzzle in less than one minute.
  • Measuring agreement between two groups. In this case, a high agreement would mean puzzles that tend to be solved in less than one minute by younger siblings also tend to be solved in less than one minute by older siblings. Higher agreement also corresponds to larger values in the diagonal cells of the table.

We will focus on single summary statistics and tests in this lesson. A log-linear, model-based approach to matched data will be taken up in the next lesson.

Example: Movie Critiques


The data below are on Siskel's and Ebert's opinions about 160 movies from April 1995 through September 1996. Reference: Agresti and Winner (1997). "Evaluating agreement and disagreement among movie reviewers." Chance, 10, pp. 10-14.

                         Ebert
Siskel      con    mixed    pro    total
con          24      8       13      45
mixed         8     13       11      32
pro          10      9       64      83
total        42     30       88     160

 Do Siskel and Ebert really agree on the ratings of the same movies? If so, where is the dependence coming from?


11.2.2 - Comparing Dependent Proportions


Let us now look at the example involving the siblings in more detail. Both siblings solve the same puzzle, and the response for each is whether it took less than or longer than one minute. It is sensible to assume that these two responses should be related because siblings likely inherit similar problem-solving skills from their common parents, and indeed if we test for independence between siblings, we have \(X^2=4.3612\) with one degree of freedom and p-value 0.03677. If we view solving the puzzle in less than one minute as "success", then this is equivalent to testing for equal row proportions.

However, the question of primary interest in this study is whether the older siblings tend to have a higher proportion of success compared with the younger siblings, which is a comparison of the first row proportion against the first column proportion. Such a test does not require the use of siblings, and two samples of six-year-olds and eight-year-olds could have been independently chosen for this purpose. But using siblings allows for matched pairs of responses and controls for confounding factors that may be introduced with children from different parents.

The estimate for the difference in success proportions between ages is

\((15+7)/37-(15+5)/37=0.5946-0.5405=0.0541\)
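
This difference is easy to verify directly from the table of counts. Below is a minimal R sketch; the matrix layout (rows = older sibling, columns = younger sibling) matches the R code used later in this lesson.

siblings <- matrix(c(15, 5, 7, 10), nrow = 2)   # rows = older, columns = younger

n <- sum(siblings)                    # 37 pairs
p.older   <- sum(siblings[1, ]) / n   # first row margin: (15+7)/37 = 0.5946
p.younger <- sum(siblings[, 1]) / n   # first column margin: (15+5)/37 = 0.5405
p.older - p.younger                   # 0.0541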

 Stop and Think!
Why can't we apply the hypothesis test for two independent proportions here? What specific part of that approach is not appropriate?

Recall from Lesson 3 the variance for the estimated difference in proportions was

\(\displaystyle V(d)=\left[\frac{ \frac{\pi_{11}}{\pi_{1+}} (1-\frac{\pi_{11}}{\pi_{1+}})} {n_{1+}} + \frac{\frac{\pi_{21}}{\pi_{2+}} (1-\frac{\pi_{21}}{\pi_{2+}})} {n_{2+}} \right] \)

The variance of the difference is the sum of the individual variances only under independence; otherwise, we would need to take into account the covariance.

Another way of asking the question of no age effect is to ask whether the margins of the table are the same (rows versus columns), which can be done with the test of marginal homogeneity or McNemar's test, which we look at next.

Test of Marginal Homogeneity

The notation needed for this test is the same as what we've seen earlier for the test of independence, but instead of focusing on comparing row proportions, we compare row versus column.

For older siblings, the probability of solving the puzzle in less than one minute (success) is

\(\pi_{1+} = \pi_{11} + \pi_{12}\)

And for younger siblings, the probability of solving the puzzle in less than one minute is

\(\pi_{+1} = \pi_{11} + \pi_{21}\)

The null hypothesis of no difference (marginal homogeneity) in a \(2 \times 2\) table is

\(H_0 \colon \pi_{1+} = \pi_{+1} \)

and is equivalent to the hypothesis that the off-diagonal probabilities are equal:

\(H_0 \colon \pi_{12} = \pi_{21} \)

The second of these is also known as the hypothesis of symmetry and generalizes to equal off-diagonal cell counts for larger tables as well. Note that the diagonal elements are not important here. They correspond to pairs in which both siblings performed alike (either both under or both over one minute). All the information required to determine whether there's a difference due to age is contained in the off-diagonal elements.

For general square \(I \times I \) tables, the hypothesis of marginal homogeneity is different from the hypothesis of symmetry, and the latter is a stronger hypothesis; symmetry introduces more structure in the square table. In a \(2 \times 2\) table, however, these two are the same test.
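
As an aside, R's mcnemar.test() implements exactly this generalization: applied to a square table larger than \(2 \times 2\), it computes Bowker's test of symmetry, which sums \((n_{ij}-n_{ji})^2/(n_{ij}+n_{ji})\) over all pairs \(i<j\) and has \(I(I-1)/2\) degrees of freedom. A quick sketch with the Siskel and Ebert table from the introduction:

# 3x3 table of Siskel (rows) by Ebert (columns) ratings
critic <- matrix(c(24, 8, 10, 8, 13, 9, 13, 11, 64), nrow = 3,
 dimnames = list(siskel = c("con", "mixed", "pro"),
                 ebert  = c("con", "mixed", "pro")))

# for square tables larger than 2x2, this is Bowker's test of
# symmetry with I(I-1)/2 = 3 degrees of freedom
mcnemar.test(critic)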

McNemar’s test for \(2 \times 2\) tables

This is the usual test of marginal homogeneity (and symmetry) for a \(2 \times 2\) table.

\(H_0 : \pi_{1+} = \pi_{+1}\) or equivalently \(\pi_{12} = \pi_{21}\)

Suppose that we treat the total number of observations in the off-diagonal as fixed:

\(n^\ast =n_{12}+n_{21}\)

Under the null hypothesis above, each of \(n_{12}\) and \(n_{21}\) is assumed to follow a \(Bin (n^\ast , 0.5)\) distribution. The rationale is that, under the null hypothesis, we have \(n_{12}+n_{21}\) total "trials" that can either result in cell \((1,2)\) or \((2,1)\) with probability 0.5. And provided that \(n^*\) is sufficiently large, we can use the usual normal approximation to the binomial:

\(z=\dfrac{n_{12}-0.5n^\ast}{\sqrt{0.5(1-0.5)n^\ast}}=\dfrac{n_{12}-n_{21}}{\sqrt{n_{12}+n_{21}}}\)

where \(0.5n^*\) and \(0.5(1-0.5)n^\ast\) are the expected count and variance for \(n_{12}\) under the \(H_0\). Under \(H_0\), \(z\) is approximately standard normal. This approximation works well provided that \(n^* \ge 10\). The p-value would depend on the alternative hypothesis. If \(H_a\) is that older siblings have a greater success probability, then the p-value would be

\(P(Z\ge z)\)

where \(Z\) is standard normal, and \(z\) is the observed value of the test statistic. A lower-tailed alternative would correspondingly use the lower-tail probability \(P(Z\le z)\), and a two-sided alternative would double the one-sided probability. Alternatively, for a two-sided alternative we may compare

\(z^2=\dfrac{(n_{12}-n_{21})^2}{n_{12}+n_{21}}\)

to a chi-square distribution with one degree of freedom. This test is valid under general multinomial sampling when \(n^\ast\) is not fixed, but the grand total \(n\) is. When the sample size is small, we can compute exact probabilities (p-values) using the binomial probability distribution.

Applying this to our example data gives

\(z=\dfrac{7-5}{\sqrt{7+5}}=0.577\)

The p-value is \(P(Z\ge 0.577)=0.2820\), which is not evidence of a difference in success probabilities (solving the puzzle in less than one minute) between the two age groups.
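
These calculations, including the exact binomial version mentioned above, can be reproduced in a few lines of R. A minimal sketch with the off-diagonal counts:

n12 <- 7; n21 <- 5
nstar <- n12 + n21                 # number of discordant pairs

z <- (n12 - n21) / sqrt(nstar)     # 0.577
pnorm(z, lower.tail = FALSE)       # one-sided p-value: 0.2820
z^2                                # 0.3333, compare to chi-square(1)

# exact version: under H0, n12 ~ Bin(nstar, 0.5); the two-sided exact
# p-value (0.7744) matches the value from SAS's EXACT MCNEM below
binom.test(n12, nstar, p = 0.5)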

Point Estimation and Confidence Interval

A sensible effect-size measure associated with McNemar’s test is the difference between the marginal proportions,

\(d=\pi_{1+}-\pi_{+1}=\pi_{12}-\pi_{21}\)

In large samples, the estimate of \(d\),

\(\hat{d}=\dfrac{n_{12}}{n}-\dfrac{n_{21}}{n}\)

is unbiased and approximately normal with variance

\begin{align}
V(\hat{d}) &= n^{-2} V(n_{12}-n_{21})\\
&= n^{-2}[V(n_{12})+ V(n_{21})-2Cov(n_{12},n_{21})]\\
&= n^{-1} [\pi_{12}(1-\pi_{12})+\pi_{21}(1-\pi_{21})+2\pi_{12} \pi_{21}]\\
\end{align}

An estimate of the variance is

\(\hat{V}(\hat{d})=n^{-1}\left[\dfrac{n_{12}}{n}(1-\dfrac{n_{12}}{n})+\dfrac{n_{21}}{n}(1-\dfrac{n_{21}}{n})+2\dfrac{n_{12}n_{21}}{n^2}\right]\)

and an approximate 95% confidence interval is

\(\hat{d}\pm 1.96\sqrt{\hat{V}(\hat{d})}\)

In our example, we get an estimated effect of \(\hat{d} = 0.0541\) and its standard error of \(\sqrt{\hat{V}(\hat{d})}=0.0932\), giving 95% confidence interval

\(0.0541\pm 1.96(0.0932)=(-0.1286,  0.2368)\)

Thus, although the older siblings had a higher proportion of success, it was not statistically significant. We cannot conclude that the two-year age difference is associated with faster puzzle-solving times.
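
Before turning to the built-in procedures, here is a minimal R sketch that reproduces the estimate, standard error, and confidence interval by hand:

n <- 37; n12 <- 7; n21 <- 5

d.hat <- (n12 - n21) / n                       # 0.0541
v.hat <- ((n12/n) * (1 - n12/n) +
          (n21/n) * (1 - n21/n) +
          2 * n12 * n21 / n^2) / n
sqrt(v.hat)                                    # standard error: 0.0932
d.hat + c(-1, 1) * 1.96 * sqrt(v.hat)          # (-0.1286, 0.2368)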

Next, we do this analysis in SAS and R.

Example: Siblings and Puzzle Solving


McNemar Test in SAS - Sibling Data

In SAS under PROC FREQ, the AGREE option gives McNemar's test based on the normal (chi-square) approximation, while EXACT MCNEM gives the exact version based on binomial probabilities. Here is a sample of what the SAS coding would look like:

data siblings;
input older younger count ;
datalines;
 1 1 15
 1 2 7
 2 1 5
 2 2 10
; run;

/* normal approximation and exact McNemar test */
proc freq data=siblings; 
weight count;
tables older*younger / agree;
exact mcnem;
run;

Compare the value from the output below to the squared \(z\) value we computed above. The difference in the p-value (aside from some slight rounding) is due to the software using a two-sided alternative by default; we can divide it by 2 to get the one-sided version. In either case, the results indicate insignificant evidence of an age effect in puzzle-solving times.

 
McNemar's Test
Chi-Square    DF    Pr > ChiSq    Exact Pr >= ChiSq
0.3333         1        0.5637               0.7744

McNemar Test in R - Sibling Data

In R we can use the mcnemar.test() as demonstrated in Siblings.R:

siblings = matrix(c(15,5,7,10),nr=2,
 dimnames=list("older"=c("<1 min",">1 min"),"younger"=c("<1 min",">1 min")))
siblings

# usual test for independence comparing younger vs older
chisq.test(siblings, correct=F)

# McNemar test for equal proportions comparing younger vs older
mcnemar.test(siblings, correct=F)

The R output below agrees with the SAS results: the chi-square statistic matches the squared \(z\) value we computed above, and the two-sided p-value of 0.5637 can be halved for a one-sided alternative. Again, there is insignificant evidence of an age effect in puzzle-solving times.

> mcnemar.test(siblings, correct=F)

        McNemar's Chi-squared test

data:  siblings
McNemar's chi-squared = 0.33333, df = 1, p-value = 0.5637

11.2.3 - Efficiency of Matched Pairs


For the sibling puzzle-solving example, note that the data consisted of 37 responses for six-year-olds (younger siblings) and 37 responses for eight-year-olds (older siblings). How would the results differ if, instead of siblings, those same values had arisen from two independent samples of children of those ages?

To see why the approach with the dependent data, matched by siblings, is more powerful, consider the table below. The sample size here is defined as \(n = 37+37=74\) total responses, compared with \(n=37\) total pairs in the previous approach.

            <1 min    >1 min    total
Older         22        15        37
Younger       20        17        37

The estimated difference in proportions from this table is identical to the previous one:

\(\hat{d}=\hat{p}_1-\hat{p}_2=\dfrac{22}{37}-\dfrac{20}{37}=0.0541\)

But with independent data, the test for an age effect, which is equivalent to the usual \(\chi^2\) test of independence, gives \(X^2 = 0.2202\), compared with \(z^2 = 0.333\) from McNemar's test. The standard error for \(\hat{d}\) in this new table is

\(\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{37}+\dfrac{\hat{p}_2(1-\hat{p}_2)}{37}}=0.1150\)

compared with \(0.0932\), when using siblings and taking into account the covariance between them. Just as with matched pairs for quantitative variables, the covariance between pair members leads to smaller standard errors and greater overall power, compared with the independent samples approach.
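
A short R sketch makes this comparison explicit, computing both standard errors from the same counts:

n <- 37
p1 <- 22/n; p2 <- 20/n

# independent-samples standard error for p1 - p2
se.indep <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)   # 0.1150

# matched-pairs standard error, accounting for the covariance
n12 <- 7; n21 <- 5
se.matched <- sqrt(((n12/n) * (1 - n12/n) +
                    (n21/n) * (1 - n21/n) +
                    2 * n12 * n21 / n^2) / n)             # 0.0932

se.indep / se.matched   # ratio above 1: matched pairs is more precise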

Let's take a look at the last part of Siblings.sas and its relevant output, where the same data are analyzed as if they were sampled independently and not matched by siblings.

data notsiblings;
input age $ time $ count ;
datalines;
 older lt1min 22
 older gt1min 15
 younger lt1min 20
 younger gt1min 17
;
proc freq data=notsiblings order=data;
 weight count;
 tables age*time / chisq riskdiff;
run;

Now we are doing just a regular test of independence, and the Pearson chi-square is \(0.2202\) with a p-value of 0.6389. Although our conclusion seems to be identical (we still can't claim a significant age effect), notice that our p-value is larger (less significant) when the data are treated as independent. In general, the matched pairs approach is more powerful.

Statistics for Table of age by time

Statistic                        DF    Value     Prob
Chi-Square                        1    0.2202    0.6389
Likelihood Ratio Chi-Square       1    0.2204    0.6388
Continuity Adj. Chi-Square        1    0.0551    0.8145
Mantel-Haenszel Chi-Square        1    0.2173    0.6411
Phi Coefficient                         0.0546
Contingency Coefficient                 0.0545
Cramer's V                              0.0546

Column 1 Risk Estimates
              Risk      ASE       95% Confidence Limits    Exact 95% Confidence Limits
Row 1         0.5946    0.0807    0.4364    0.7528         0.4210    0.7525
Row 2         0.5405    0.0819    0.3800    0.7011         0.3692    0.7051
Total         0.5676    0.0576    0.4547    0.6804         0.4472    0.6823
Difference    0.0541    0.1150   -0.1714    0.2795

Difference is (Row 1 - Row 2)

Let's take a look at the last part of Siblings.R and its relevant output, where the same data are analyzed as if they were sampled independently and not matched by siblings.


notsiblings = matrix(c(22,20,15,17),nr=2,
 dimnames=list(c("Older","Younger"),c("<1 min",">1 min")))
notsiblings

chisq.test(notsiblings, correct=F)
prop.test(notsiblings, correct=F)

Again, we are doing just a regular test of independence, and the Pearson chi-square is \(0.2202\) with a p-value of 0.6389. Although our conclusion seems to be identical (we still can't claim a significant age effect), notice that our p-value is larger (less significant) when the data are treated as independent. In general, the matched pairs approach is more powerful.

> chisq.test(notsiblings, correct=F)

        Pearson's Chi-squared test

data:  notsiblings
X-squared = 0.22024, df = 1, p-value = 0.6389

> prop.test(notsiblings, correct=F)

        2-sample test for equality of proportions without continuity
        correction

data:  notsiblings
X-squared = 0.22024, df = 1, p-value = 0.6389
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1713610  0.2794691

McNemar's test applies whenever the hypothesis of marginal homogeneity is of interest, and such hypotheses arise in a wide variety of problems involving dependent samples.


11.2.4 - Measure of Agreement: Kappa


Another hypothesis of interest is the degree to which two examiners agree in their evaluations of the same subjects. This has important applications in medicine, where two physicians may be called upon to evaluate the same group of patients for further treatment.

Cohen's kappa statistic (or simply kappa) is intended to measure agreement between two raters.

Example: Movie Critiques


Recall the example on movie ratings from the introduction. Do the two movie critics, in this case, Siskel and Ebert, classify the same movies into the same categories; do they really agree?

                         Ebert
Siskel      con    mixed    pro    total
con          24      8       13      45
mixed         8     13       11      32
pro          10      9       64      83
total        42     30       88     160

In the (necessarily) square table above, the main diagonal counts represent movies where both raters agreed. Let \(\pi_{ij}\) denote the probability that Siskel classifies a movie in category \(i\) and Ebert classifies the same movie in category \(j\). For example, \(\pi_{13}\) is the probability that Siskel rates a movie as "con", but Ebert rates it as "pro".

The term \(\sum_{i} \pi_{ii}\) then is the total probability of agreement. The extreme case that all observations are classified on the main diagonal is known as "perfect agreement".

 Stop and Think!
Is it possible to define perfect disagreement?

Kappa measures agreement. Perfect agreement occurs when all of the counts fall on the main diagonal of the table, so that the probability of agreement equals 1.

To define perfect disagreement, the ratings would have to be opposite one another, ideally in the extremes. In a \(2 \times 2\) table, perfect disagreement is possible because each positive rating has one specific negative counterpart (e.g., loved it versus hated it). But in \(3 \times 3\) or larger square tables, there are more ways to disagree, and it quickly becomes impossible to disagree "perfectly" across all combinations at once. The best we can do is to minimize agreement, which forces zero counts in some cells.

In a \(3 \times 3\) table, here are two options that would provide no agreement at all (the # indicating a count):

    1   2   3
1   0   0   #
2   0   0   0
3   #   0   0

    1   2   3
1   0   #   0
2   #   0   0
3   0   0   0

Cohen’s kappa is a single summary index that describes strength of inter-rater agreement.

For \(I \times I\) tables, it’s equal to

\(\kappa=\dfrac{\sum\pi_{ii}-\sum\pi_{i+}\pi_{+i}}{1-\sum\pi_{i+}\pi_{+i}}\)

This statistic compares the observed agreement to the expected agreement, computed assuming the ratings are independent.

The null hypothesis that the ratings are independent is, therefore, equivalent to

\(\pi_{ii}=\pi_{i+}\pi_{+i}\quad\text{ for all }i\)

If the observed agreement is due to chance only, i.e., if the ratings are completely independent, then each diagonal element is a product of the two marginals.

Since the total probability of agreement is \(\sum_{i} \pi_{ii}\), the probability of agreement under the null hypothesis equals \(\sum_{i} \pi_{i+}\pi_{+i}\). Note also that \(\sum_{i} \pi_{ii} = 0\) means no agreement, and \(\sum_{i} \pi_{ii} = 1\) indicates perfect agreement. The kappa statistic is defined so that a larger value implies stronger agreement. Furthermore,

  • Perfect agreement \(\kappa = 1\).
  • \(\kappa = 0\), does not mean perfect disagreement but rather only agreement that would result from chance only, where the diagonal cell probabilities are simply products of the corresponding marginals.
  • If the actual agreement is greater than the agreement expected by chance, then \(\kappa > 0\).
  • If the actual agreement is less than the agreement expected by chance, then \(\kappa < 0\).
  • The minimum possible value of \(\kappa = −1\).
  • A value of kappa higher than 0.75 can be considered (arbitrarily) as "excellent" agreement, while lower than 0.4 will indicate "poor" agreement.
Note! Strong agreement implies strong association, but strong association need not imply strong agreement. For example, if Siskel puts most of the movies into the con category while Ebert puts the same movies into the pro category, the association might be strong, but there is certainly no agreement. You may also think of the situation where one examiner is consistently tougher than the other, always giving one grade less than the more lenient one. Here, too, the association is very strong, but the agreement may be minimal.

Under multinomial sampling, the sample estimate \(\hat{\kappa}\) has a large-sample normal distribution, so we can rely on the asymptotic 95% confidence interval.
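
It may help to see \(\hat{\kappa}\) computed directly from the cell proportions before turning to the software. A minimal R sketch with the movie table:

critic <- matrix(c(24, 8, 10, 8, 13, 9, 13, 11, 64), nrow = 3)
p <- critic / sum(critic)                 # cell proportions

p.obs <- sum(diag(p))                     # observed agreement: 101/160 = 0.631
p.exp <- sum(rowSums(p) * colSums(p))     # agreement expected under independence: 0.397

(p.obs - p.exp) / (1 - p.exp)             # kappa = 0.3888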

In SAS, use the option AGREE as shown below and in the SAS program MovieCritics.sas.

data critic;
input siskel $ ebert $ count ;
datalines;
 con con  24 
 con mixed 8 
 con pro 13
 mixed con 8 
 mixed mixed 13
 mixed pro 11
 pro con 10
 pro mixed 9
 pro pro 64
 ; run;
 
proc freq; 
weight count;
tables siskel*ebert / agree chisq;
run;

From the output below, we can see that the "Simple Kappa" gives the estimated kappa value of 0.3888 with its asymptotic standard error (ASE) of 0.0598. The difference between observed agreement and expected under independence is about 40% of the maximum possible difference. Based on the reported 95% confidence interval, \(\kappa\) falls somewhere between 0.2716 and 0.5060 indicating only a moderate agreement between Siskel and Ebert.

 
Kappa Statistics
Statistic         Estimate    Standard Error    95% Confidence Limits
Simple Kappa      0.3888      0.0598            0.2716    0.5060
Weighted Kappa    0.4269      0.0635            0.3024    0.5513

Sample Size = 160

 

In R, we can use the Kappa function in the vcd package. The following are from the script MovieCritics.R.

critic = matrix(c(24,8,10,8,13,9,13,11,64),nr=3,
 dimnames=list("siskel"=c("con","mixed","pro"),"ebert"=c("con","mixed","pro")))
critic

# chi-square test for independence between raters
result = chisq.test(critic)
result

# kappa coefficient for agreement
library(vcd)
kappa = Kappa(critic)

From the output below, we can see that the "Unweighted" statistic gives the estimated kappa value of 0.389 with an asymptotic standard error (ASE) of 0.060. The difference between observed agreement and expected agreement under independence is about 40% of the maximum possible difference. Based on the reported values, the 95% confidence interval for \(\kappa\) ranges from 0.27 to 0.51, indicating only moderate agreement between Siskel and Ebert.

> kappa
            value     ASE     z  Pr(>|z|)
Unweighted 0.3888 0.05979 6.503 7.870e-11
Weighted   0.4269 0.06350 6.723 1.781e-11
> confint(kappa)
Kappa              lwr       upr
  Unweighted 0.2716461 0.5060309
  Weighted   0.3024256 0.5513224

 Issue with Cohen's Kappa

Kappa strongly depends on the marginal distributions. That is, the same level of agreement can produce very different \(\kappa\) values when the proportions of cases in the categories differ. This is also why the minimum value of \(\kappa\) depends on the marginal distributions, and the minimum possible value of \(-1\) is not always attainable.
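
As a small illustration, consider two hypothetical \(2 \times 2\) rating tables constructed so that both have 90% observed agreement, one with balanced margins and one with skewed margins:

library(vcd)

balanced <- matrix(c(45, 5, 5, 45), nrow = 2)   # margins 50/50
skewed   <- matrix(c(85, 5, 5,  5), nrow = 2)   # margins 90/10

sum(diag(balanced)) / sum(balanced)   # observed agreement: 0.90
sum(diag(skewed))   / sum(skewed)     # observed agreement: 0.90

Kappa(balanced)   # kappa = 0.80
Kappa(skewed)     # kappa is only about 0.44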

 Solution

Modeling agreement (e.g., via log-linear or other models) is typically a more informative approach.

Weighted kappa is a version of kappa used for measuring agreement on ordinal variables, where certain disagreements (e.g., lowest versus highest category) can be weighted as more serious than others.
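
The weighted kappa of 0.4269 reported in the outputs above corresponds to equal-spacing (Cicchetti-Allison) agreement weights \(w_{ij} = 1 - |i-j|/(I-1)\), the default in both PROC FREQ and vcd's Kappa(), which give half credit to one-category disagreements in a \(3 \times 3\) table. A sketch computing it by hand under that weighting:

critic <- matrix(c(24, 8, 10, 8, 13, 9, 13, 11, 64), nrow = 3)
p <- critic / sum(critic)

# weights: 1 on the diagonal, 0.5 one category apart, 0 two apart
w <- 1 - abs(outer(1:3, 1:3, "-")) / 2

p.obs <- sum(w * p)                              # weighted observed agreement
p.exp <- sum(w * outer(rowSums(p), colSums(p)))  # weighted chance agreement

(p.obs - p.exp) / (1 - p.exp)                    # weighted kappa = 0.4269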

