
Another hypothesis of interest is whether two examiners agree with each other, or to what degree two different systems of evaluation are in agreement. This has important applications in medicine, where two physicians may be called upon to evaluate the same group of patients for further treatment.

Cohen's kappa statistic (or simply kappa) is intended to measure agreement between two raters.

Example - Movie Critiques

Recall the movies example from the introduction. Do the two movie critics, in this case Siskel and Ebert, classify the same movies into the same categories? In other words, do they really agree?

                     Ebert
Siskel     con   mixed   pro   total
con         24       8    13      45
mixed        8      13    11      32
pro         10       9    64      83
total       42      30    88     160

In the square $I\times I$ table, the main diagonal $\{i = j\}$ represents rater or observer agreement. Let $\pi_{ij}$ denote the probability that Siskel classifies the movie in category $i$ and Ebert classifies the same movie in category $j$. For example, $\pi_{13}$ is the probability that Siskel gave a "thumbs down" (con) while Ebert gave a "thumbs up" (pro).

The term $\pi_{ii}$ is the probability that they both placed the movie into the same category $i$, and $\sum_i \pi_{ii}$ is the total probability of agreement. Ideally, all or most of the observations will be classified on the main diagonal, which corresponds to perfect agreement.

Think about the following question before reading on:

Is it possible to define perfect disagreement?

Cohen’s kappa is a single summary index that describes the strength of inter-rater agreement.

For $I \times I$ tables, it is equal to:

\(\kappa=\dfrac{\sum\pi_{ii}-\sum\pi_{i+}\pi_{+i}}{1-\sum\pi_{i+}\pi_{+i}}\)

This statistic compares the observed agreement to the expected agreement, computed assuming the ratings are independent.

Under the null hypothesis that the ratings are independent, the diagonal probabilities therefore satisfy

\(\pi_{ii}=\pi_{i+}\pi_{+i}\text{ for all }i\)

If the observed agreement is due to chance only, i.e., if the ratings are completely independent, then each diagonal element is the product of the two corresponding marginals.

Since the total probability of agreement is $\sum_i \pi_{ii}$, the probability of agreement under the null hypothesis equals $\sum_i \pi_{i+}\pi_{+i}$. Note also that $\sum_i \pi_{ii} = 0$ means no agreement and $\sum_i \pi_{ii} = 1$ indicates perfect agreement. The kappa statistic is defined so that a larger value implies stronger agreement. Furthermore:

  • Perfect agreement: $\kappa = 1$.
  • $\kappa = 0$ does not mean perfect disagreement; it only means agreement no better than chance, since in that case the diagonal cell probabilities are simply the products of the corresponding marginals.
  • If agreement is greater than agreement expected by chance, then $\kappa > 0$.
  • If agreement is less than agreement expected by chance, then $\kappa < 0$.
  • The minimum possible value is $\kappa = −1$.
  • As a rough guideline, values of kappa above 0.75 indicate excellent agreement, while values below 0.4 indicate poor agreement.
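
To make the formula concrete, here is a minimal R sketch that computes kappa by hand for the Siskel and Ebert table above; the counts are those shown earlier, and the object name `ratings` is just for illustration.

```r
# Hand computation of Cohen's kappa for the Siskel-Ebert table (no packages needed)
ratings <- matrix(c(24,  8, 13,
                     8, 13, 11,
                    10,  9, 64),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Siskel = c("con", "mixed", "pro"),
                                  Ebert  = c("con", "mixed", "pro")))

p  <- ratings / sum(ratings)         # cell proportions pi_ij
po <- sum(diag(p))                   # observed agreement: sum_i pi_ii
pe <- sum(rowSums(p) * colSums(p))   # chance agreement:   sum_i pi_i+ * pi_+i
(po - pe) / (1 - pe)                 # kappa, approximately 0.389
```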

Notice that strong agreement implies strong association, but strong association need not imply strong agreement. For example, if Siskel puts most of the movies into the con category while Ebert puts them into the pro category, the association might be strong, but there is certainly no agreement. You may also think of the situation where one examiner is tougher than the other: the tougher one consistently gives one grade less than the more lenient one. In this case the association is again very strong, but the agreement may be negligible.
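
To see this in the extreme, here is a small hypothetical 2×2 example (the counts are made up purely for illustration) in which the two raters never agree: the association is perfect, yet kappa attains its minimum of −1.

```r
# Hypothetical 2x2 example: perfect (negative) association, complete disagreement
reversed <- matrix(c( 0, 50,
                     50,  0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(RaterA = c("con", "pro"),
                                   RaterB = c("con", "pro")))

p  <- reversed / sum(reversed)
po <- sum(diag(p))                   # 0: the raters never agree
pe <- sum(rowSums(p) * colSums(p))   # 0.5: expected agreement under independence
(po - pe) / (1 - pe)                 # kappa = -1, the minimum possible value
```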

Under multinomial sampling, the sample estimate \(\hat{\kappa}\) has a large-sample normal distribution. For the sample variance, see Agresti (2013), p. 435. Thus we can rely on the usual asymptotic 95% confidence interval.

In SAS, use the AGREE option as shown below and in the SAS program MovieCritiques.sas.

[SAS program: Lec12ex3.sas]

From the output below, we can see that the "Simple Kappa" line gives an estimated kappa value of 0.389 with an asymptotic standard error (ASE) of 0.0598. The difference between the observed agreement and the agreement expected under independence is about 40% of the maximum possible difference. Based on the reported 95% confidence interval, $\kappa$ falls somewhere between 0.27 and 0.51, indicating only moderate agreement between Siskel and Ebert.
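
As a quick check, this interval can be reproduced from the reported estimate and its ASE using the usual Wald construction:

\(\hat{\kappa} \pm z_{0.025}\cdot\widehat{ASE} = 0.389 \pm 1.96(0.0598) \approx (0.272,\ 0.506)\)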

[SAS output]

For R, see the file MovieCritiques.R. If you use the {vcd} package, you can use the function Kappa(); do not forget to load the package first, e.g., library(vcd).

[R code: MovieCritiques.R]
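
For reference, here is a minimal sketch of such a call, with the table entered by hand; this is an illustration under the assumptions above, not necessarily the exact contents of MovieCritiques.R.

```r
# Cohen's kappa via the vcd package (table entered by hand for illustration)
library(vcd)

ratings <- matrix(c(24,  8, 13,
                     8, 13, 11,
                    10,  9, 64),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Siskel = c("con", "mixed", "pro"),
                                  Ebert  = c("con", "mixed", "pro")))

k <- Kappa(ratings)   # reports unweighted and weighted kappa with their ASEs
k
confint(k)            # asymptotic confidence intervals
```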

From the output below, we can see that the "unweighted" statistic gives an estimated kappa value of 0.389 with an asymptotic standard error (ASE) of 0.063. The difference between the observed agreement and the agreement expected under independence is about 40% of the maximum possible difference. From the reported values, the 95% confidence interval again places $\kappa$ roughly between 0.27 and 0.51, indicating only moderate agreement between Siskel and Ebert.

[R output]

There are other similar functions written by R users that are worth exploring. You can also write your own function using the formulas above.

Issue with Cohen's kappa: Kappa strongly depends on the marginal distributions. That is, the same rating process, applied to populations with different proportions of cases in the various categories, can give very different $\kappa$ values. This is also why the minimum value of $\kappa$ depends on the marginal distributions, and the minimum possible value of −1 is not always attainable.

Solution: Modeling agreement (e.g. via log-linear or other models) is typically a more informative approach.

Weighted kappa is a version of kappa used for measuring agreement on ordinal variables (see Section 11.5.5 of Agresti, 2013). More details on measures of agreement and on modeling matched data can be found in Chapter 11 of Agresti (2013) and Chapter 8 of Agresti (2007). We will only touch upon some of these models later in the semester when we study log-linear and logit models.
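
If you want to try weighted kappa in R, the Kappa() function in {vcd} accepts a weights argument. A minimal sketch follows; the Fleiss-Cohen (quadratic) weight choice here is only an illustration and may differ from what the course files use.

```r
# Weighted kappa for the ordered categories con < mixed < pro (illustrative weights)
library(vcd)

ratings <- matrix(c(24,  8, 13,
                     8, 13, 11,
                    10,  9, 64),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Siskel = c("con", "mixed", "pro"),
                                  Ebert  = c("con", "mixed", "pro")))

Kappa(ratings, weights = "Fleiss-Cohen")   # quadratic (Fleiss-Cohen) weights
```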