11.2.4 - Measure of Agreement: Kappa

Another hypothesis of interest is to evaluate to what degree two different examiners agree on different systems of evaluation. This has important applications in medicine where two physicians may be called upon to evaluate the same group of patients for further treatment.

The Cohen's Kappa statistic (or simply kappa) is intended to measure agreement between two variables.

Example: Movie Critiques

Recall the movie ratings example from the introduction. Do the two movie critics, Siskel and Ebert, classify the same movies into the same categories? In other words, do they really agree?

                    Ebert
Siskel     con   mixed   pro   total
con         24       8    13      45
mixed        8      13    11      32
pro         10       9    64      83
total       42      30    88     160

In the (necessarily) square table above, the counts on the main diagonal represent movies on which both raters agreed. Let \(\pi_{ij}\) denote the probability that Siskel classifies a movie in category \(i\) and Ebert classifies the same movie in category \(j\). For example, \(\pi_{13}\) is the probability that Siskel rates a movie as "con", but Ebert rates it as "pro".

The sum \(\sum_{i} \pi_{ii}\) is then the total probability of agreement. The extreme case in which all observations fall on the main diagonal is known as "perfect agreement".
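
For the movie data above, the observed counts give an estimated total probability of agreement of

\(\hat{\pi}_{11}+\hat{\pi}_{22}+\hat{\pi}_{33}=\dfrac{24+13+64}{160}=\dfrac{101}{160}\approx 0.631\)

so Siskel and Ebert place about 63% of the movies in the same category.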

 Stop and Think!
Is it possible to define perfect disagreement?

Kappa measures agreement. Perfect agreement occurs when all of the counts fall on the main diagonal of the table, so that the probability of agreement equals 1.

Defining perfect disagreement is harder. The ratings would have to be opposite one another, ideally at the extremes. In a \(2 \times 2\) table, perfect disagreement is easy to define because each positive rating has exactly one opposing negative rating (e.g., loved it vs. hated it). In a \(3 \times 3\) or larger square table, however, there are many different ways to disagree, so it quickly becomes more complicated to disagree "perfectly". Any arrangement that minimizes agreement must leave some cells empty, because it is impossible for every pair of categories to be in maximal disagreement at the same time.

In a \(3 \times 3\) table, here are two rating patterns that would produce no agreement at all (# indicating a nonzero count):

     1  2  3           1  2  3
  1  0  0  #        1  0  #  0
  2  0  0  0        2  #  0  0
  3  #  0  0        3  0  0  0

Cohen's kappa is a single summary index that describes the strength of inter-rater agreement.

For an \(I \times I\) table, it is defined as

\(\kappa=\dfrac{\sum\pi_{ii}-\sum\pi_{i+}\pi_{+i}}{1-\sum\pi_{i+}\pi_{+i}}\)

This statistic compares the observed agreement to the expected agreement, computed assuming the ratings are independent.

Under the null hypothesis that the ratings are independent, the diagonal probabilities therefore satisfy

\(\pi_{ii}=\pi_{i+}\pi_{+i}\quad\text{ for all }i\)

If the observed agreement is due to chance only, i.e., if the ratings are completely independent, then each diagonal element is a product of the two marginals.

Since the total probability of agreement is \(\sum_{i} \pi_{ii}\), the probability of agreement under the null hypothesis equals \(\sum_{i} \pi_{i+}\pi_{+i}\). Note also that \(\sum_{i} \pi_{ii} = 0\) means no agreement, and \(\sum_{i} \pi_{ii} = 1\) indicates perfect agreement. The kappa statistic is defined so that a larger value implies stronger agreement (a worked calculation for the movie data follows the list below). Furthermore,

  • Perfect agreement corresponds to \(\kappa = 1\).
  • \(\kappa = 0\) does not mean perfect disagreement; it means only the level of agreement expected by chance, where the diagonal cell probabilities are simply products of the corresponding marginals.
  • If the actual agreement is greater than the agreement expected by chance, then \(\kappa > 0\).
  • If the actual agreement is less than the agreement expected by chance, then \(\kappa < 0\).
  • The minimum possible value of \(\kappa\) is \(-1\).
  • A value of kappa above 0.75 is often (somewhat arbitrarily) described as "excellent" agreement, while a value below 0.4 indicates "poor" agreement.
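
To make the formula concrete, here is a minimal sketch in R that computes \(\hat{\kappa}\) by hand from the Siskel-Ebert counts (the same critic matrix is also created in the MovieCritics.R script further below):

# observed counts, entered column by column (rows = Siskel, columns = Ebert)
critic = matrix(c(24,8,10, 8,13,9, 13,11,64), nrow=3,
 dimnames=list("siskel"=c("con","mixed","pro"), "ebert"=c("con","mixed","pro")))
p = critic / sum(critic)                  # cell proportions
p.obs = sum(diag(p))                      # observed agreement: 101/160 = 0.631
p.exp = sum(rowSums(p) * colSums(p))      # agreement expected under independence = 0.397
(p.obs - p.exp) / (1 - p.exp)             # kappa = 0.389

This reproduces the "Simple Kappa" estimate of 0.3888 reported in the SAS and R output below.
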
Note! Notice that strong agreement implies strong association, but strong association need not imply strong agreement. For example, if Siskel puts most of the movies into the con category while Ebert puts them into the pro category, the association might be strong, but there is certainly no agreement. You may also think of a situation where one examiner is tougher than the other and consistently gives one grade less than the more lenient one. Here too the association is very strong, but the agreement may be weak.
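
As a hypothetical illustration of this last point, suppose one critic always rates a movie one category more favorably than the other. The ratings are then completely predictable from one another (a very strong association), yet kappa is negative:

# hypothetical counts: Ebert always one category more favorable than Siskel
shift = matrix(c(0,0,0, 50,0,0, 0,50,0), nrow=3,
 dimnames=list("siskel"=c("con","mixed","pro"), "ebert"=c("con","mixed","pro")))
p = shift / sum(shift)
p.obs = sum(diag(p))                     # observed agreement = 0
p.exp = sum(rowSums(p) * colSums(p))     # chance agreement = 0.25
(p.obs - p.exp) / (1 - p.exp)            # kappa = -1/3 despite the strong association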

Under multinomial sampling, the sample estimate \(\hat{\kappa}\) has a large-sample normal distribution. Thus, we can rely on an asymptotic 95% confidence interval.
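
This interval takes the familiar Wald form

\(\hat{\kappa} \pm z_{\alpha/2}\,\widehat{ASE}(\hat{\kappa})\)

For example, using the estimates reported in the output below, \(0.3888 \pm 1.96(0.0598)\) reproduces the limits \((0.2716,\ 0.5060)\).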

In SAS, use the option AGREE as shown below and in the SAS program MovieCritics.sas.

data critic;
input siskel $ ebert $ count;
datalines;
con   con   24
con   mixed  8
con   pro   13
mixed con    8
mixed mixed 13
mixed pro   11
pro   con   10
pro   mixed  9
pro   pro   64
;
run;

proc freq data=critic;
weight count;
tables siskel*ebert / agree chisq;
run;

From the output below, we can see that the "Simple Kappa" row gives the estimated kappa value of 0.3888 with an asymptotic standard error (ASE) of 0.0598. The difference between the observed agreement and the agreement expected under independence is about 40% of the maximum possible difference. Based on the reported 95% confidence interval, \(\kappa\) falls somewhere between 0.2716 and 0.5060, indicating only moderate agreement between Siskel and Ebert.

 
Kappa Statistics

Statistic          Estimate   Standard Error   95% Confidence Limits
Simple Kappa         0.3888           0.0598       0.2716    0.5060
Weighted Kappa       0.4269           0.0635       0.3024    0.5513

Sample Size = 160

 

In R, we can use the Kappa function in the vcd package. The following code is from the script MovieCritics.R.

# rating counts, entered column by column (rows = Siskel, columns = Ebert)
critic = matrix(c(24,8,10, 8,13,9, 13,11,64), nrow=3,
 dimnames=list("siskel"=c("con","mixed","pro"), "ebert"=c("con","mixed","pro")))
critic

# chi-square test for independence between raters
result = chisq.test(critic)
result

# kappa coefficient for agreement (unweighted and weighted)
library(vcd)
kappa = Kappa(critic)
kappa
confint(kappa)

From the output below, we can see that the "Unweighted" statistic gives the estimated kappa value of 0.389 with an asymptotic standard error (ASE) of 0.060. The difference between the observed agreement and the agreement expected under independence is about 40% of the maximum possible difference. Based on the reported values, the 95% confidence interval for \(\kappa\) ranges from 0.27 to 0.51, indicating only moderate agreement between Siskel and Ebert.

> kappa
            value     ASE     z  Pr(>|z|)
Unweighted 0.3888 0.05979 6.503 7.870e-11
Weighted   0.4269 0.06350 6.723 1.781e-11
> confint(kappa)
            
Kappa              lwr       upr
  Unweighted 0.2716461 0.5060309
  Weighted   0.3024256 0.5513224

 Issue with Cohen's Kappa

Kappa strongly depends on the marginal distributions. That is, the same rating process applied to groups with different proportions of cases in the various categories can give very different values of \(\kappa\), as the sketch below illustrates. This is also one reason why the minimum value of \(\kappa\) depends on the marginal distributions, so the theoretical minimum of \(-1\) is not always attainable.
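
A small hypothetical comparison makes the point. The two \(2 \times 2\) tables below have exactly the same observed agreement (90%), yet very different kappa values because their marginal distributions differ:

# two hypothetical 2 x 2 rating tables, each with 90% observed agreement
balanced = matrix(c(45, 5, 5, 45), nrow=2)   # categories used about equally often
skewed   = matrix(c(85, 5, 5,  5), nrow=2)   # most cases fall in one category

kappa.hand = function(tab) {
  p = tab / sum(tab)
  p.obs = sum(diag(p))
  p.exp = sum(rowSums(p) * colSums(p))
  (p.obs - p.exp) / (1 - p.exp)
}

kappa.hand(balanced)   # 0.80
kappa.hand(skewed)     # about 0.44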

 Solution

Modeling agreement (e.g., via log-linear or other models) is typically a more informative approach.
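
As a sketch of this idea, one simple agreement model adds a single indicator for the diagonal cells to the usual independence log-linear model; a significant diagonal term indicates agreement beyond chance. Assuming the critic matrix created in the R code above:

# convert the table of counts to a data frame
critic.df = as.data.frame(as.table(critic))
critic.df$agree = as.numeric(critic.df$siskel == critic.df$ebert)

# independence model vs. independence plus a diagonal-agreement term
fit.ind = glm(Freq ~ siskel + ebert, family=poisson, data=critic.df)
fit.agr = glm(Freq ~ siskel + ebert + agree, family=poisson, data=critic.df)
anova(fit.ind, fit.agr, test="Chisq")   # likelihood-ratio test for agreement beyond chance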

Weighted kappa is a version of kappa used for measuring agreement on ordered variables, where certain disagreements (e.g., lowest versus highest) can be weighted as more or less important.
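
The weighted kappa of 0.4269 reported in both outputs above matches linear (equal-spacing) disagreement weights. Here is a minimal sketch reproducing it by hand, again assuming the critic matrix from the R code above:

# linear (equal-spacing) agreement weights: w_ij = 1 - |i - j| / (I - 1)
k = nrow(critic)
w = 1 - abs(outer(1:k, 1:k, "-")) / (k - 1)

p = critic / sum(critic)                        # cell proportions
obs.w = sum(w * p)                              # weighted observed agreement = 0.744
exp.w = sum(w * outer(rowSums(p), colSums(p)))  # weighted chance agreement = 0.553
(obs.w - exp.w) / (1 - exp.w)                   # weighted kappa = 0.427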

