Another question of interest is the degree to which two different examiners agree when they evaluate the same subjects. This has important applications in medicine, where two physicians may be called upon to evaluate the same group of patients for further treatment.

The **Cohen's kappa** statistic (or simply kappa) is intended to measure agreement between two variables, typically the ratings that two examiners assign to the same set of items.

## Example: Movie Critiques

Recall the example on movie ratings from the introduction. Do the two movie critics, in this case, Siskel and Ebert, classify the same movies into the same categories; do they really agree?

| Siskel \ Ebert | con | mixed | pro | total |
|---|---|---|---|---|
| con | 24 | 8 | 13 | 45 |
| mixed | 8 | 13 | 11 | 32 |
| pro | 10 | 9 | 64 | 83 |
| total | 42 | 30 | 88 | 160 |

In the (necessarily) square table above, the main diagonal counts represent movies where both raters agreed. Let \(\pi_{ij}\) denote the probability that Siskel classifies a movie in category \(i\) and Ebert classifies the same movie in category \(j\). For example, \(\pi_{13}\) is the probability that Siskel rates a movie as "con", but Ebert rates it as "pro".

The term \(\sum_{i} \pi_{ii}\) is then the total probability of agreement, which is what kappa is built on. The extreme case in which all of the counts fall on the main diagonal of the table, so that the probability of agreement equals 1, is known as "perfect agreement".
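As a quick illustration with the Siskel and Ebert table, the total probability of agreement can be estimated by the proportion of movies on the main diagonal. The hat notation for a sample proportion is introduced here only for this illustration:

\(\sum_{i} \hat{\pi}_{ii} = \dfrac{24 + 13 + 64}{160} = \dfrac{101}{160} \approx 0.631\)

So the two critics placed about 63% of the movies in the same category.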

Perfect disagreement is harder to define. The two ratings of each movie would have to be opposite one another, ideally at the extremes. In a \(2 \times 2\) table this is possible because each positive rating has exactly one opposite rating (e.g., "love it" vs. "hate it"). In a \(3 \times 3\) or larger square table, however, there are more ways to disagree, so disagreeing perfectly quickly becomes more complicated. Perfect disagreement would require minimizing agreement across every combination of categories at the same time, which is impossible; in larger tables it would therefore likely be a situation in which some cells contain no counts at all.

In a \(3 \times 3\) table, here are two options that would provide no agreement at all (the # indicating a count):

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0 | 0 | # |
| 2 | 0 | 0 | 0 |
| 3 | # | 0 | 0 |

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0 | # | 0 |
| 2 | # | 0 | 0 |
| 3 | 0 | 0 | 0 |

**Cohen’s kappa** is a single summary index that describes the strength of inter-rater agreement.

For \(I \times I\) tables, it is defined as

\(\kappa=\dfrac{\sum_{i}\pi_{ii}-\sum_{i}\pi_{i+}\pi_{+i}}{1-\sum_{i}\pi_{i+}\pi_{+i}}\)

where \(\pi_{i+}\) and \(\pi_{+i}\) are the row and column marginal probabilities of category \(i\). This statistic compares the observed agreement to the expected agreement, computed assuming the ratings are independent.

Under the null hypothesis that the ratings are independent, the diagonal probabilities therefore satisfy

\(\pi_{ii}=\pi_{i+}\pi_{+i}\quad\text{ for all }i\)

If the observed agreement is due to chance only, i.e., if the ratings are completely independent, then each diagonal element is a product of the two marginals.
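The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's own code: the function name `cohen_kappa` is an assumption, and the probabilities are estimated by the sample proportions from the Siskel and Ebert table.

```python
# Illustrative sketch: Cohen's kappa from a square table of counts.
# Rows are Siskel's ratings, columns are Ebert's (con, mixed, pro).
critic = [
    [24, 8, 13],
    [8, 13, 11],
    [10, 9, 64],
]

def cohen_kappa(table):
    """Return (observed agreement, chance agreement, kappa)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Observed agreement: proportion of counts on the main diagonal.
    p_obs = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: sum of products of matching row and column marginals.
    p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)

p_obs, p_exp, kappa = cohen_kappa(critic)
print(f"observed agreement = {p_obs:.3f}")
print(f"chance agreement   = {p_exp:.3f}")
print(f"kappa              = {kappa:.4f}")
```

The observed agreement is 101/160 and the chance agreement is 10154/25600, which gives a kappa of about 0.3888 for these data.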

Since the total probability of agreement is \(\sum_{i} \pi_{ii}\), the probability of agreement under the null hypothesis equals \(\sum_{i} \pi_{i+}\pi_{+i}\). Note also that \(\sum_{i} \pi_{ii} = 0\) means no agreement, and \(\sum_{i} \pi_{ii} = 1\) indicates perfect agreement. The kappa statistic is defined so that a larger value implies stronger agreement. Furthermore,

- Perfect agreement corresponds to \(\kappa = 1\).
- \(\kappa = 0\) does not mean perfect disagreement; it means only the agreement that would result from chance, where the diagonal cell probabilities are simply products of the corresponding marginals.
- If the actual agreement is greater than agreement by chance, then \(\kappa > 0\).
- If the actual agreement is less than agreement obtained by chance, then \(\kappa < 0\).
- The minimum possible value is \(\kappa = -1\).
- A value of kappa higher than 0.75 can be considered (arbitrarily) as "excellent" agreement, while a value lower than 0.4 indicates "poor" agreement.
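The three reference values in the list above can be checked on small hypothetical \(2 \times 2\) tables. The tables and the function name `cohen_kappa` below are made up for illustration and are not part of the movie data.

```python
# Illustrative sketch: kappa for three hypothetical 2 x 2 tables.
def cohen_kappa(table):
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    p_obs = sum(table[i][i] for i in range(len(table))) / n
    p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

perfect = [[50, 0], [0, 50]]     # all counts on the main diagonal
chance = [[25, 25], [25, 25]]    # every cell equals the product of its marginals
opposite = [[0, 50], [50, 0]]    # the raters always choose opposite categories

print(cohen_kappa(perfect))   # 1.0
print(cohen_kappa(chance))    # 0.0
print(cohen_kappa(opposite))  # -1.0
```

Note that the value \(-1\) is reached here only because both raters use the two categories equally often.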

**Note!** Strong agreement implies strong association, but strong association may not imply strong agreement. For example, if Siskel puts most of the movies into the con category while Ebert puts them into the pro category, the association might be strong, but there is certainly no agreement. You may also think of the situation where one examiner is tougher than the other and consistently gives one grade lower than the more lenient one. In this case, too, the association is very strong, but the agreement may be negligible.
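The tougher-examiner scenario can be made concrete with a hypothetical table in which the strict rater always gives exactly one grade lower than the lenient rater. The counts, the function name `cohen_kappa`, and the simple "predictable" check (each row has at most one nonzero cell, used here as a stand-in for perfect association) are all assumptions made for this sketch.

```python
# Illustrative sketch: perfect association with no agreement.
def cohen_kappa(table):
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    p_obs = sum(table[i][i] for i in range(len(table))) / n
    p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

# Rows: strict rater (low, middle, high); columns: lenient rater.
shifted = [
    [0, 40, 0],   # strict "low" always pairs with lenient "middle"
    [0, 0, 40],   # strict "middle" always pairs with lenient "high"
    [0, 0, 0],    # the strict rater never awards "high"
]

# One rating determines the other exactly: at most one nonzero cell per row.
predictable = all(sum(1 for x in row if x > 0) <= 1 for row in shifted)
kappa = cohen_kappa(shifted)
print(predictable)      # True
print(round(kappa, 4))  # -0.3333
```

Knowing one rating tells us the other exactly, yet no movie lands on the main diagonal, so kappa is negative.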

Under multinomial sampling, the sample estimate \(\hat{\kappa}\) has a large-sample normal distribution. Thus, we can rely on the asymptotic 95% confidence interval.
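As a worked check, assuming the usual large-sample form of estimate plus or minus 1.96 standard errors, the values reported for the movie data later in this section (\(\hat{\kappa} = 0.3888\), ASE \(= 0.0598\)) give

\(\hat{\kappa} \pm 1.96 \times \text{ASE} = 0.3888 \pm 1.96 \times 0.0598 \approx 0.3888 \pm 0.1172 = (0.2716,\ 0.5060)\)

which agrees with the confidence limits reported in the software output.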

In SAS, use the option AGREE as shown below and in the SAS program MovieCritics.sas.

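```
/* One row per (siskel, ebert) rating combination with its cell count */
data critic;
input siskel $ ebert $ count;
datalines;
con con 24
con mixed 8
con pro 13
mixed con 8
mixed mixed 13
mixed pro 11
pro con 10
pro mixed 9
pro pro 64
;
run;
/* AGREE requests the kappa statistics; CHISQ requests tests of independence */
proc freq data=critic;
weight count;
tables siskel*ebert / agree chisq;
run;
```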

From the output below, the "Simple Kappa" row gives an estimated kappa value of 0.3888 with an asymptotic standard error (ASE) of 0.0598. That is, the difference between the observed agreement and the agreement expected under independence is about 40% of the maximum possible difference. Based on the reported 95% confidence interval, \(\kappa\) falls somewhere between 0.2716 and 0.5060, indicating only moderate agreement between Siskel and Ebert.

Kappa Statistics

| Statistic | Estimate | Standard Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit |
|---|---|---|---|---|
| Simple Kappa | 0.3888 | 0.0598 | 0.2716 | 0.5060 |
| Weighted Kappa | 0.4269 | 0.0635 | 0.3024 | 0.5513 |

Sample Size = 160

In R, we can use the `Kappa` function in the `vcd` package. The following code is from the script MovieCritics.R.

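```
# 3 x 3 table of counts, filled column by column (rows = siskel, columns = ebert)
critic = matrix(c(24,8,10,8,13,9,13,11,64), nrow=3,
  dimnames=list("siskel"=c("con","mixed","pro"), "ebert"=c("con","mixed","pro")))
critic
# chi-square test for independence between raters
result = chisq.test(critic)
result
# kappa coefficient for agreement (requires the vcd package)
library(vcd)
kappa = Kappa(critic)  # note: this name masks base R's kappa() function
kappa                  # unweighted and weighted estimates
confint(kappa)         # asymptotic 95% confidence intervals
```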

From the output below, the "Unweighted" row gives an estimated kappa value of 0.389 with an asymptotic standard error (ASE) of 0.060. That is, the difference between the observed agreement and the agreement expected under independence is about 40% of the maximum possible difference. Based on the reported values, the 95% confidence interval for \(\kappa\) ranges from 0.27 to 0.51, indicating only moderate agreement between Siskel and Ebert.

```
> kappa
value ASE z Pr(>|z|)
Unweighted 0.3888 0.05979 6.503 7.870e-11
Weighted 0.4269 0.06350 6.723 1.781e-11
> confint(kappa)
Kappa lwr upr
Unweighted 0.2716461 0.5060309
Weighted 0.3024256 0.5513224
```

#### Issue with Cohen's Kappa

Kappa depends strongly on the marginal distributions. That is, the same rating behavior, but with different proportions of cases in the various categories, can give very different \(\kappa\) values. This is also why the minimum value of \(\kappa\) depends on the marginal distributions, and why the minimum possible value of \(-1\) is not always attainable.
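A small sketch can make this dependence concrete. The three \(2 \times 2\) tables below are hypothetical and chosen only for illustration, as is the function name `cohen_kappa`. The first two have the same observed agreement (80%) but different marginals. The third has no agreement at all, yet its kappa is far from \(-1\) because the marginals are unbalanced.

```python
# Illustrative sketch: kappa depends on the marginal distributions.
def cohen_kappa(table):
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    p_obs = sum(table[i][i] for i in range(len(table))) / n
    p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

balanced = [[40, 10], [10, 40]]   # 80% agreement, 50/50 marginals
skewed = [[70, 10], [10, 10]]     # 80% agreement, 80/20 marginals
no_agree = [[0, 10], [90, 0]]     # 0% agreement, 10/90 and 90/10 marginals

k_balanced = cohen_kappa(balanced)
k_skewed = cohen_kappa(skewed)
k_no_agree = cohen_kappa(no_agree)
print(round(k_balanced, 3))  # 0.6
print(round(k_skewed, 3))    # 0.375
print(round(k_no_agree, 3))  # -0.22
```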

#### Solution

Modeling agreement (e.g., via log-linear or other models) is typically a more informative approach.

Weighted kappa is a version of kappa used for measuring agreement on ordered variables, where certain disagreements (e.g., lowest versus highest) can be weighted as more or less important.
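A minimal sketch of weighted kappa for the movie data follows. It assumes linear (equal-spacing) agreement weights \(w_{ij} = 1 - |i-j|/(I-1)\), so that adjacent categories receive partial credit and the con versus pro disagreement receives none. The function name `weighted_kappa` is hypothetical, and the weight choice is an assumption, although it appears to reproduce the weighted value of 0.4269 reported in the output above.

```python
# Illustrative sketch: weighted kappa with linear (equal-spacing) weights.
critic = [
    [24, 8, 13],
    [8, 13, 11],
    [10, 9, 64],
]

def weighted_kappa(table):
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Assumed weights: 1 on the diagonal, decreasing linearly to 0 at the extremes.
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(w[i][j] * table[i][j] for i, j in cells) / n
    p_exp = sum(w[i][j] * row_tot[i] * col_tot[j] for i, j in cells) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

kappa_w = weighted_kappa(critic)
print(f"weighted kappa = {kappa_w:.4f}")
```

Other weighting schemes (for example, quadratic weights) would give a different value, so the weights should be reported alongside the estimate.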