4.2 - Measures of Positive and Negative Association

While two nominal variables can be associated (knowing something about one can tell us something about the other), without a natural ordering, we can't specify that association with a particular direction, such as positive or negative. With an ordinal variable, however, it makes sense to think of one of its outcomes as being "higher" or "greater" than another, even if they aren't necessarily numerically meaningful. And with two such ordinal variables, we can define a positive association to mean that both variables tend to be either "high" or "low" together; a negative association would exist if one tends to be "high" when the other is "low". We solidify these ideas in this lesson.

Concordance and Gamma

With this notion of "high" and "low" available for two (ordinal) variables \(X\) and \(Y\), we can define a quantity that measures both strength and direction of association. Suppose that \(X\) takes on the values \(1,2,\ldots,I\) and that \(Y\) takes on the values \(1,2,\ldots,J\). We can think of these as ranks of the categories in each variable once we have decided which direction "high" corresponds to. As we've seen already, the direction of the ordering is somewhat arbitrary. The "highest" category for happy could just as easily be "very happy" as "not too happy", as long as adjacent categories are kept together. One direction is often more intuitive than the other, however.

If \((i,j)\) represents an observation in the \(i\)th row and \(j\)th column (i.e., \(X=i\) and \(Y=j\) for that observation), then the pair of observations \((i,j)\) and \((i',j')\) is concordant if

\((i-i')(j-j')>0\)

If \((i-i')(j-j')<0\), then the pair is discordant. If either \(i=i'\) or \(j=j'\), then the pair is tied and is counted as neither concordant nor discordant; effectively, ties in either \(X\) or \(Y\) are ignored. If \(C\) and \(D\) denote the number of concordant and discordant pairs, respectively, then a correlation-like measure of association between \(X\) and \(Y\) due to Goodman and Kruskal (1954) is

\(\hat{\gamma}=\dfrac{C-D}{C+D}\)

Like the usual sample correlation between quantitative variables, \(\hat{\gamma}\) always falls within \([-1,1]\) and indicates stronger positive (or negative) association the closer it is to \(+1\) (or \(-1\)).
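As a minimal sketch of these definitions, the base R helpers below classify a single pair of observations and compute gamma from given pair counts. The function names `pair_type` and `gamma_hat` are illustrative choices, not part of any package.

```r
# Classify the pair (i, j), (i2, j2) by the sign of (i - i2) * (j - j2).
pair_type <- function(i, j, i2, j2) {
  s <- sign((i - i2) * (j - j2))
  if (s > 0) "concordant" else if (s < 0) "discordant" else "tied"
}

# Gamma from the numbers of concordant (n_conc) and discordant (n_disc) pairs.
gamma_hat <- function(n_conc, n_disc) (n_conc - n_disc) / (n_conc + n_disc)

pair_type(1, 1, 2, 2)  # "concordant": first observation is lower on both variables
pair_type(1, 2, 2, 1)  # "discordant": lower on X but higher on Y
pair_type(1, 2, 1, 3)  # "tied": same value of X, so the pair is ignored
gamma_hat(3, 1)        # 0.5
```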

Example

In our example from the 2018 GSS, respondents were asked to rate their agreement with the statement "Job security is good" with respect to the work they were doing (jobsecok). For the concordance calculations, the responses "not at all true", "not too true", "somewhat true", and "very true" are interpreted as increasing in job security. Similarly, the responses for general happiness (happy) "not too happy", "pretty happy", and "very happy" are interpreted as increasing in happiness. Recall the observed counts for these variables below, with jobsecok in the rows and happy in the columns.

jobsecok \ happy	Not too happy	Pretty happy	Very happy
Not at all true	15	25	5
Not too true	21	47	21
Somewhat true	64	248	100
Very true	73	474	311

First, we'll count the number of concordant pairs of individuals (observations). Consider first an individual from cell \((1,1)\) paired with an individual from cell \((2,2)\). This pair is concordant because

\((2-1)(2-1)>0\)

That is, the first individual has lower values for both happy ("not too happy") and jobsecok ("not at all true"). Moreover, this result holds for all pairs consisting of one individual chosen from the 15 in cell \((1,1)\) and one individual chosen from the 47 in cell \((2,2)\). Since there are \(15(47)=705\) such pairings, we've just counted 705 concordant pairs!

By the same argument, pairs chosen from cells \((1,1)\) and \((2,3)\) contribute \(15(21)=315\), pairs chosen from cells \((1,1)\) and \((3,2)\) contribute \(15(248)=3720\), and so on. If we're careful to skip over pairs for which any ties occur, our final count should be \(C=199,293\).

Counting discordant pairs works in a similar way. For example, a pair from cells \((1,2)\) and \((2,1)\) is discordant because

\((2-1)(1-2)<0\)

This pair contributes to a negative relationship because the first individual provided a lower value for jobsecok ("not at all true") but a higher value for happy ("pretty happy"), relative to the second individual. And all \(25(21)=525\) such pairs of individuals are counted likewise. Again, without consideration of tied values, we can count a total of \(D=105,867\) discordant pairs in this table. And as a final summary, the gamma correlation value is

\(\hat{\gamma}=\dfrac{199293-105867}{199293+105867}=0.30615\)

This represents a relatively weak positive association between perceived job security and happiness.
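The counting argument above can be sketched in base R as follows. The helper `count_pairs` is a hypothetical name introduced here for illustration; it assumes the rows and columns of the table are already arranged from "low" to "high". Each cell is paired only with cells in higher-numbered rows, so every pair is counted once and ties are skipped automatically.

```r
# Observed counts: rows are jobsecok (low to high), columns are happy (low to high).
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311),
              nrow = 4, byrow = TRUE)

# Hypothetical helper: count concordant and discordant pairs in an ordered table.
count_pairs <- function(tbl) {
  I <- nrow(tbl); J <- ncol(tbl)
  conc <- 0; disc <- 0
  for (i in seq_len(I)) for (j in seq_len(J)) {
    higher_rows <- seq_len(I) > i   # rows with a larger index than i
    # Concordant partners: higher row and higher column.
    conc <- conc + tbl[i, j] * sum(tbl[higher_rows, seq_len(J) > j])
    # Discordant partners: higher row but lower column.
    disc <- disc + tbl[i, j] * sum(tbl[higher_rows, seq_len(J) < j])
  }
  c(C = conc, D = disc)
}

cd <- count_pairs(tbl)
cd                                             # C = 199293, D = 105867
gamma_hat <- unname((cd["C"] - cd["D"]) / sum(cd))
round(gamma_hat, 5)                            # 0.30615
```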

Kendall's Tau and Tau-b

Another way to measure association for two ordinal variables is due to Kendall (1945) and incorporates the number of tied observations. With categorical variables, such as happiness or opinion on job security, many individuals will agree or "tie" in either the row variable, the column variable, or both. If we let \(T_r\) and \(T_c\) represent the number of pairs that are tied in the row variable and column variable, respectively, then we can count from the row and column totals

\(T_r=\sum_{i=1}^I\dfrac{n_{i+}(n_{i+}-1)}{2}, \) and \(T_c=\sum_{j=1}^J\dfrac{n_{+j}(n_{+j}-1)}{2} \)

Moreover, if \(n\) denotes the total number of observations in the sample, then Kendall's tau-b is calculated as

\( \hat{\tau}_b=\dfrac{C-D}{\sqrt{[n(n-1)/2-T_r][n(n-1)/2-T_c]}} \)

Like gamma, tau-b is a correlation-like quantity that measures both the strength and direction of association between two ordinal variables. It always falls within \([-1,1]\), with stronger positive (or negative) association corresponding to values closer to \(+1\) (or \(-1\)). However, the denominator for tau-b is generally larger than that for gamma, so tau-b is generally smaller in magnitude (closer to 0). To see why, note that \({n\choose2}=n(n-1)/2\) represents the total number of ways to choose two observations from the grand total and can be expanded as

\(\dfrac{n(n-1)}{2}=C+D+T_r+T_c-T_{rc} \)

where \(T_{rc}\) is the number of pairs of observations that tie on both the row and column variables (that is, pairs drawn from the same cell of the table). In other words, the denominator of tau-b includes pairs that tie on exactly one variable, while the denominator of gamma does not, and ties contribute to neither a positive nor a negative association.
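Rearranging this identity makes the comparison concrete. Each bracketed factor in the tau-b denominator equals \(C+D\) plus a count of pairs tied on only one variable:

\(\dfrac{n(n-1)}{2}-T_r=C+D+(T_c-T_{rc}), \) and \(\dfrac{n(n-1)}{2}-T_c=C+D+(T_r-T_{rc}) \)

Because every pair tied on both variables is also tied on each one, \(T_{rc}\le T_r\) and \(T_{rc}\le T_c\), so both factors are at least \(C+D\). Their geometric mean is therefore at least \(C+D\), which gives \(|\hat{\tau}_b|\le|\hat{\gamma}|\).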

It should also be noted that if there are no ties (i.e., every individual provides a unique response), both gamma and tau-b reduce to

\( \hat{\tau}=\dfrac{C-D}{n(n-1)/2} \)

which is known simply as Kendall's tau and is usually reserved for continuous data that has been converted to ranks. In the case of categorical variables, ties are unavoidable.
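To illustrate the no-ties case, here is a small sketch in base R using made-up continuous values (the data are purely illustrative). With no ties, \(C+D=n(n-1)/2\), so gamma and tau coincide, and both should match R's built-in `cor()` with `method = "kendall"`.

```r
# Small made-up sample with no ties in either variable.
x <- c(1.2, 3.4, 2.2, 5.0, 4.1)
y <- c(2.0, 2.9, 3.5, 6.1, 4.4)
n <- length(x)

# Sign of (x_a - x_b) * (y_a - y_b) for every pair a < b.
idx <- combn(n, 2)
s <- sign((x[idx[1, ]] - x[idx[2, ]]) * (y[idx[1, ]] - y[idx[2, ]]))
conc <- sum(s > 0)   # concordant pairs
disc <- sum(s < 0)   # discordant pairs

tau <- (conc - disc) / choose(n, 2)
gam <- (conc - disc) / (conc + disc)

c(C = conc, D = disc)             # C = 9, D = 1
c(tau = tau, gamma = gam)         # both 0.8
cor(x, y, method = "kendall")     # 0.8
```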

Example

For the data above regarding jobsecok and happy, we find

\(T_r=\dfrac{45(44)+89(88)+412(411)+858(857)}{2} =457225,\) and \(T_c=\dfrac{173(172)+794(793)+437(436)}{2} =424965\)

and then, with \(n=1404\), calculate Kendall's tau-b to be

\( \hat{\tau}_b=\dfrac{199293-105867}{\sqrt{[1404(1403)/2-457225][1404(1403)/2-424965]}}=0.1719 \)

Thus, both gamma and tau-b indicate a positive but relatively weak association between jobsecok and happy, with tau-b closer to 0 because its denominator also accounts for ties.
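These tie counts and the tau-b value can be reproduced with a short base R sketch. The helper `tied_pairs` is a hypothetical name used only for this illustration, and the concordant and discordant counts are taken as given from the gamma example rather than recomputed.

```r
# Observed counts: rows are jobsecok, columns are happy.
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311),
              nrow = 4, byrow = TRUE)
n <- sum(tbl)                      # 1404
conc <- 199293                     # concordant pairs (from the gamma example)
disc <- 105867                     # discordant pairs (from the gamma example)

# Hypothetical helper: number of tied pairs within groups of sizes m.
tied_pairs <- function(m) sum(m * (m - 1) / 2)

T_r  <- tied_pairs(rowSums(tbl))   # 457225, pairs tied on jobsecok
T_c  <- tied_pairs(colSums(tbl))   # 424965, pairs tied on happy
T_rc <- tied_pairs(tbl)            # 202444, pairs tied on both (same cell)

total <- choose(n, 2)              # 984906 pairs in all
total == conc + disc + T_r + T_c - T_rc   # TRUE: the pair-count identity holds

tau_b <- (conc - disc) / sqrt((total - T_r) * (total - T_c))
round(tau_b, 4)                    # 0.1719
```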

Code

The R code to carry out the calculations above:

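# Read the GSS extract; choose "GSS3.csv" in the file dialog
gss <- read.csv(file.choose(), header = TRUE)

# Drop non-substantive responses for each variable
gss <- gss[(gss$happy != "Don't know") & (gss$happy != "No answer"), ]
gss <- gss[(gss$jobsecok != "Dont know") & (gss$jobsecok != "No answer")
           & (gss$jobsecok != "Not applicable"), ]

# Two-way table: rows are jobsecok, columns are happy
tbl <- table(gss$jobsecok, gss$happy)

# Concordance-based measures from the DescTools package
library(DescTools)
ConDisPairs(tbl)                             # concordant and discordant pair counts
GoodmanKruskalGamma(tbl, conf.level = 0.95)  # gamma with a 95% confidence interval
KendallTauB(tbl, conf.level = 0.95)          # tau-b with a 95% confidence interval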