4.2.5  Measure of Agreement: Kappa
Another hypothesis of interest is whether two different examiners agree with each other, or to what degree two different systems of evaluation are in agreement. This has important applications in medicine, where two physicians may be called upon to evaluate the same group of patients for further treatment.
Cohen's kappa statistic (or simply kappa) is intended to measure agreement between two raters.
Example  Movie Critiques
Recall the movies example from the introduction. Do the two movie critics, in this case Ebert and Siskel, classify the same movies into the same categories; do they really agree?
                    Ebert
    Siskel      con   mixed   pro   total
    con          24       8    13      45
    mixed         8      13    11      32
    pro          10       9    64      83
    total        42      30    88     160
In the square $I\times I$ table, the main diagonal {i = j} represents rater, or observer, agreement. Let π_{ij} denote the probability that Siskel classifies the movie in category i and Ebert classifies the same movie in category j. For example, π_{13} is the probability that Siskel gave a "thumbs down" (con) while Ebert gave a "thumbs up" (pro).
The term π_{ii} is the probability that they both placed the movie into the same category i, and Σ_{i} π_{ii} is the total probability of agreement. Ideally, all or most of the observations will be classified on the main diagonal, which denotes perfect agreement.
Think about the following question: Is it possible to define perfect disagreement?
Cohen’s kappa is a single summary index that describes strength of interrater agreement.
For I × I tables, it’s equal to:
\(\kappa=\dfrac{\sum_i\pi_{ii}-\sum_i\pi_{i+}\pi_{+i}}{1-\sum_i\pi_{i+}\pi_{+i}}\)
This statistic compares the observed agreement to the expected agreement, computed assuming the ratings are independent.
Under the null hypothesis that the ratings are independent, the diagonal probabilities therefore satisfy
\(\pi_{ii}=\pi_{i+}\pi_{+i}\text{ for all }i\)
If the observed agreement is due to chance only, i.e. if the ratings are completely independent, then each diagonal element is a product of the two marginals.
Since the total probability of agreement is Σ_{i} π_{ii}, the probability of agreement under the null hypothesis equals Σ_{i} π_{i+}π_{+i}. Note also that Σ_{i} π_{ii} = 0 means no agreement and Σ_{i} π_{ii} = 1 indicates perfect agreement. The kappa statistic is defined so that a larger value implies stronger agreement; furthermore:
- Perfect agreement gives $\kappa = 1$.
- $\kappa = 0$ does not mean perfect disagreement; it means only agreement at the chance level, i.e., the diagonal cell probabilities are simply products of the corresponding marginals.
- If agreement exceeds chance agreement, then $\kappa > 0$.
- If agreement is less than chance agreement, then $\kappa < 0$.
- The minimum possible value is $\kappa = -1$.
- As a rough guideline, $\kappa$ above 0.75 indicates excellent agreement, while $\kappa$ below 0.4 indicates poor agreement.
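To make the formula concrete, here is a small Python sketch (not the course's SAS or R code) that computes the unweighted kappa directly from the Siskel-Ebert counts:

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square I x I count table."""
    n = sum(sum(row) for row in table)
    I = len(table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(I)) for j in range(I)]
    p_o = sum(table[i][i] for i in range(I)) / n                  # observed agreement
    p_e = sum(row_tot[i] * col_tot[i] for i in range(I)) / n**2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Rows: Siskel (con, mixed, pro); columns: Ebert (con, mixed, pro)
movies = [[24, 8, 13],
          [8, 13, 11],
          [10, 9, 64]]

print(round(cohens_kappa(movies), 3))  # 0.389
```

Here the observed agreement is (24 + 13 + 64)/160 = 0.631 and the chance agreement is (45·42 + 32·30 + 83·88)/160² = 0.397, so kappa is (0.631 − 0.397)/(1 − 0.397) ≈ 0.389.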
Notice that strong agreement implies strong association, but strong association need not imply strong agreement. For example, if Siskel puts most of the movies into the con category while Ebert puts them into the pro category, the association might be strong, but there is certainly no agreement. You may also think of the situation where one examiner is tougher than the other: the first consistently gives one grade less than the more lenient one. Here, too, the association is very strong, but the agreement may be negligible.
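This distinction can be checked numerically. The sketch below (hypothetical counts, in Python rather than the section's SAS/R) builds a table in which one rater's category completely determines the other's, so the association is perfect, yet the two never agree:

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square I x I count table."""
    n = sum(sum(row) for row in table)
    I = len(table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(I)) for j in range(I)]
    p_o = sum(table[i][i] for i in range(I)) / n
    p_e = sum(row_tot[i] * col_tot[i] for i in range(I)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Rater B's category is a deterministic shift of rater A's, so knowing one
# rating tells you the other exactly -- but the diagonal is empty.
shifted = [[0, 30, 0],
           [0, 0, 30],
           [30, 0, 0]]

print(round(cohens_kappa(shifted), 3))  # -0.5
```

With zero observed agreement and chance agreement of 1/3, kappa is (0 − 1/3)/(1 − 1/3) = −0.5: below-chance agreement despite perfect association, and well above the theoretical minimum of −1.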
Under multinomial sampling, the sampled value \(\hat{\kappa}\) has a large-sample normal distribution. For the sample variance, refer to Agresti (2013), p. 435. We can therefore rely on the usual asymptotic 95% confidence interval.
In SAS, use the option AGREE as shown below and in the SAS program MovieCritiques.sas.
From the output below, we can see that the "Simple Kappa" gives the estimated kappa value of 0.389 with its asymptotic standard error (ASE) of 0.0598. The difference between observed agreement and agreement expected under independence is about 40% of the maximum possible difference. Based on the reported 95% confidence interval, $\kappa$ falls somewhere between 0.27 and 0.51, indicating only moderate agreement between Siskel and Ebert.
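As a quick arithmetic check, the reported interval follows from the usual Wald construction, \(\hat{\kappa}\pm 1.96\cdot ASE\), applied to the estimate and standard error quoted above:

```python
# Values reported in the SAS output for the Siskel-Ebert data
kappa_hat, ase = 0.389, 0.0598

# Usual large-sample (Wald) 95% confidence interval
lower = kappa_hat - 1.96 * ase
upper = kappa_hat + 1.96 * ase

print(round(lower, 2), round(upper, 2))  # 0.27 0.51
```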
For R, see the file MovieCritiques.R. If you use the {vcd} package, you can use the function Kappa(); do not forget to load the package first, e.g., library(vcd).
From the output below, we can see that the "unweighted" statistic gives the estimated kappa value of 0.389 with its asymptotic standard error (ASE) of 0.063. The difference between observed agreement and agreement expected under independence is about 40% of the maximum possible difference. From the reported values, the 95% confidence interval shows that $\kappa$ falls somewhere between 0.27 and 0.51, indicating only moderate agreement between Siskel and Ebert.
There are other similar functions built by R developers that are worth exploring. You can also write your own function from the formulas above.
Issue with Cohen's kappa: Kappa strongly depends on the marginal distributions. That is, the same underlying agreement pattern, but with different proportions of cases in the different categories, can give very different $\kappa$ values. This is also why the minimum value of $\kappa$ depends on the marginal distributions and the minimum possible value of −1 is not always attainable.
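A small numerical sketch (hypothetical 2×2 tables, in Python) makes this marginal dependence visible: in both tables below, the raters agree on 90% of the cases, yet kappa differs sharply because the marginals differ.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square I x I count table."""
    n = sum(sum(row) for row in table)
    I = len(table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(I)) for j in range(I)]
    p_o = sum(table[i][i] for i in range(I)) / n
    p_e = sum(row_tot[i] * col_tot[i] for i in range(I)) / n**2
    return (p_o - p_e) / (1 - p_e)

balanced = [[45, 5], [5, 45]]   # 90% agreement, marginals 50/50
skewed   = [[85, 5], [5, 5]]    # 90% agreement, marginals 90/10

print(round(cohens_kappa(balanced), 3))  # 0.8
print(round(cohens_kappa(skewed), 3))    # 0.444
```

With balanced marginals, chance agreement is only 0.5, so kappa = (0.9 − 0.5)/(1 − 0.5) = 0.8; with skewed marginals, chance agreement rises to 0.82, leaving kappa = (0.9 − 0.82)/(1 − 0.82) ≈ 0.444 for the identical observed agreement rate.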
Solution: Modeling agreement (e.g. via loglinear or other models) is typically a more informative approach.
Weighted kappa is a version of kappa used for measuring agreement on ordered variables (see Section 11.5.5 of Agresti, 2013). More details on measures of agreement and modeling of matched data can be found in Chapter 11 (Agresti, 2013) and Chapter 8 (Agresti, 2007). We will only touch upon some of these models later in the semester when we study loglinear and logit models.