4: Tests for Ordinal Data and Small Samples
Overview
So far, we've dealt with the concepts of independence and association in two-way tables assuming nominal classification. The Pearson and likelihood ratio test statistics introduced do not assume any particular ordering of the rows or columns, and their numeric values do not change if the ordering is altered in any way. If a natural ordering exists (i.e., for ordinal data), we can utilize it to potentially increase the power of our tests. In the last section, we will consider a test proposed by R. A. Fisher for two-way tables that is applicable when the sample size and/or cell counts are small, and the large-sample methodology does not hold. But before getting into testing and inference, we revisit some measures of association and see how they can accommodate ordinal data.
4.1  Cumulative Odds and Odds Ratios
Recall that a discrete ordinal variable, say \(Y\), takes on values that can be sorted either least to greatest or greatest to least. Examples introduced earlier might be the face of a die (\(1,2,\ldots,6\)) or a person's attitude towards war ("strongly disagree", "disagree", "agree", "strongly agree"). For convenience, we may label such categories as 1 through \(J\), which allows us to express cumulative probabilities as
\(F(j) = P(Y\le j)=\pi_1+\cdots+\pi_j\),
where the parameter \(\pi_j\) represents the probability of category \(j\). So, we can write \(F(2)=\pi_1+\pi_2\) to represent the probability that the die is either 1 or 2, or, in the second example, the probability that an individual responds either "strongly disagree" or "disagree". Note that we need only the ordinal property for this to make sense; the values \(1,2,\ldots,J\) themselves do not represent any numerically meaningful quantities in these cases.
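To make the notation concrete, here is a quick R sketch of the cumulative probabilities for a fair die (our own illustration, not part of the lesson's code files):

```r
# Cumulative probabilities F(j) = pi_1 + ... + pi_j for a fair six-sided die
pi.j <- rep(1/6, 6)      # category probabilities pi_1, ..., pi_6
F.j  <- cumsum(pi.j)     # F(1), ..., F(6)
F.j[2]                   # P(Y <= 2) = 1/3
```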
Cumulative Odds
In addition to probabilities (or risks), ordinal categories also allow an odds to be defined for cumulative events. Recall the observed data for the war attitude example:
Attitude            Count
Strongly disagree    35
Disagree             27
Agree                23
Strongly agree       31
Total               116
If we focus on the "disagree" outcome in particular, the estimated probability would be \(\hat{\pi}_2=27/116=0.2328\) with corresponding estimated odds \(\hat{\pi}_2/(1-\hat{\pi}_2)=27/(35+23+31)=0.3034\). However, using the estimated cumulative probability \(\hat{F}(2)=\hat{\pi}_1+\hat{\pi}_2=(35+27)/116=0.5345\), we may also consider the estimated cumulative odds:
\( \dfrac{\hat{F}(2)}{1-\hat{F}(2)}=\dfrac{35+27}{23+31}=1.1481 \)
We interpret this value as the (estimated) cumulative odds that an individual will "disagree", where the category of "strongly disagree" is implicitly included. Equivalently, we may also refer to this as the (estimated) odds of "strongly disagree" or "disagree". In general, we define the cumulative odds for \(Y\le j\) as
\(\dfrac{F(j)}{1-F(j)}=\dfrac{\pi_1+\cdots+\pi_j}{\pi_{j+1}+\cdots+\pi_J},\quad\mbox{for }j=1,\ldots,J-1\)
with sample estimate \(\hat{F}(j)/(1-\hat{F}(j))\). The case \(j=J\) is not defined because \(1-F(J)=0\). Like cumulative probabilities, cumulative odds are necessarily non-decreasing in \(j\) and will be strictly increasing if the observed counts are all positive.
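These cumulative odds are easy to compute with cumsum(); a minimal sketch using the counts above:

```r
# War-attitude counts, ordered "strongly disagree" < ... < "strongly agree"
counts   <- c(35, 27, 23, 31)
F.hat    <- cumsum(counts) / sum(counts)  # estimated cumulative probabilities
cum.odds <- F.hat / (1 - F.hat)           # estimated cumulative odds
round(cum.odds[2], 4)                     # 1.1481, matching the hand calculation
# cum.odds[4] is Inf, since 1 - F(4) = 0 (the case j = J is undefined)
```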
Cumulative Odds Ratios
If an additional variable is involved, this idea extends to cumulative odds ratios. Consider the table below summarizing the responses for extent of agreement to the statement "job security is good" (jobsecok) and general happiness (happy) from the 2018 General Social Surveys. Additional possible responses of "don't know" and "no answer" are omitted here.
                 Not too happy  Pretty happy  Very happy
Not at all true       15             25            5
Not too true          21             47           21
Somewhat true         64            248          100
Very true             73            474          311
If we condition on jobsecok and view happy as the response variable, the cumulative odds ratio for "not too happy" or "pretty happy" for those who say "not at all true", relative to those who say "very true", would be estimated with
\(\dfrac{(15+25)/5}{(73+474)/311}=4.55 \)
If perhaps "pretty happy" and "very happy" seem like a more intuitive combination to consider, keep in mind that we're free to reverse the order to start with "very happy" and end with "not too happy" without violating the ordinal nature of this variable. By indexing "very happy" with \(j=1\), \(F(2)\) becomes the cumulative probability of "very happy" or "pretty happy", and the cumulative odds would likewise be \(F(2)/(1-F(2))\). Similarly, we could choose any two rows to serve as the groups for comparison. The row variable doesn't even have to be ordinal itself to define such a cumulative odds ratio.
However, if we happen to have two ordinal variables, as we do in this example, we can work with a cumulative version in both dimensions. We may, for example, consider the cumulative odds of "not too happy" or "pretty happy", for those who say "not at all" or "not too" true, relative to those who say "somewhat" or "very" true. The estimate of this would be
\(\dfrac{(15+25+21+47)/(5+21)}{(64+248+73+474)/(100+311)}=1.99 \)
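Both estimates can be reproduced by collapsing the table directly; a minimal sketch in R:

```r
# jobsecok (rows, "not at all" to "very" true) by happy (columns) counts
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
# rows 1 vs 4, accumulating columns 1-2 ("not too" or "pretty" happy)
or1 <- (sum(tbl[1, 1:2]) / tbl[1, 3]) / (sum(tbl[4, 1:2]) / tbl[4, 3])
# rows 1-2 vs rows 3-4, with the same accumulation of columns
or2 <- (sum(tbl[1:2, 1:2]) / sum(tbl[1:2, 3])) /
       (sum(tbl[3:4, 1:2]) / sum(tbl[3:4, 3]))
round(c(or1, or2), 2)   # 4.55 and 1.99
```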
The common theme in all these odds ratios is that they essentially convert an \(I\times J\) table into a \(2\times2\) table by combining or "accumulating" counts in adjacent categories, depending on our focus of interest. The collapsed tables below correspond to the two odds ratios just computed.

                 Not too or pretty happy  Very happy
Not at all true            40                 5
Very true                 547               311
                            Not too or pretty happy  Very happy
Not at all or not too true           108                 26
Somewhat or very true                859                411
CIs for Cumulative Odds Ratios
Recall for a \(2\times 2\) table with counts \((n_{11},n_{12},n_{21},n_{22})\), we have the sample odds ratio \(\dfrac{n_{11}n_{22}}{n_{12}n_{21}}\) and corresponding 95% confidence interval for the (population) log odds ratio:
\(\log\dfrac{n_{11}n_{22}}{n_{12}n_{21}} \pm 1.96\sqrt{\dfrac{1}{n_{11}}+\dfrac{1}{n_{12}}+\dfrac{1}{n_{21}}+\dfrac{1}{n_{22}}}\)
We can readily adapt this formula for a cumulative odds ratio \(\theta\) as well. We just need to work with the \(2\times2\) table of counts induced by the accumulation. For example, the \(2\times2\) table induced by the cumulative odds ratio for "not too" or "pretty" happy for those saying "not at all" compared with "very" true is
                 Not too or pretty happy  Very happy
Not at all true            40                 5
Very true                 547               311
With estimate \(\hat{\theta}=4.55\), the 95% confidence interval for \(\log\theta\) is
\(\log 4.55 \pm 1.96\sqrt{\dfrac{1}{40}+\dfrac{1}{5}+\dfrac{1}{547}+\dfrac{1}{311}} = (0.5747, 2.4549)\)
And by exponentiating the endpoints, we have the 95% CI for \(\theta\):
\(e^{(0.5747, 2.4549)}=(1.7766, 11.6448)\)
Likewise, for the cumulative odds of "not too" or "pretty" happy, for those saying "not at all" or "not too" true compared with those saying "somewhat" or "very" true, we have on the log scale
\(\log 1.99 \pm 1.96\sqrt{\dfrac{1}{108}+\dfrac{1}{26}+\dfrac{1}{859}+\dfrac{1}{411}} = (0.2429, 1.1308)\)
And, by exponentiating limits, we have the final CI for the odds ratio:
\(e^{(0.2429, 1.1308)}=(1.275, 3.098)\)
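Both intervals can be computed with a small helper function; a sketch (the name or.ci is our own, not from a package):

```r
# 95% Wald CI for a (cumulative) odds ratio from a collapsed 2x2 table
or.ci <- function(n11, n12, n21, n22, conf = 0.95) {
  log.or <- log((n11 * n22) / (n12 * n21))
  se <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
  z  <- qnorm(1 - (1 - conf) / 2)
  exp(log.or + c(-1, 1) * z * se)  # exponentiate the log-scale endpoints
}
ci1 <- or.ci(40, 5, 547, 311)    # approx (1.777, 11.645)
ci2 <- or.ci(108, 26, 859, 411)  # approx (1.275, 3.098)
```

Using the exact cell counts (rather than the rounded estimates 4.55 and 1.99) gives limits that agree with the hand calculations to about three decimal places.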
To put it a bit more loosely, we can say that individuals who generally don't agree as much with the statement "job security is good" have greater odds of being less happy. Or, equivalently, those who generally agree more with the statement have greater odds of being happier.
4.2  Measures of Positive and Negative Association
While two nominal variables can be associated (knowing something about one can tell us something about the other), without a natural ordering, we can't specify that association with a particular direction, such as positive or negative. With an ordinal variable, however, it makes sense to think of one of its outcomes as being "higher" or "greater" than another, even if they aren't necessarily numerically meaningful. And with two such ordinal variables, we can define a positive association to mean that both variables tend to be either "high" or "low" together; a negative association would exist if one tends to be "high" when the other is "low". We solidify these ideas in this lesson.
Concordance and Gamma
With this notion of "high" and "low" available for two (ordinal) variables \(X\) and \(Y\), we can define a quantity that measures both strength and direction of association. Suppose that \(X\) takes on the values \(1,2,\ldots,I\) and that \(Y\) takes on the values \(1,2,\ldots,J\). We can think of these as ranks of the categories in each variable once we have decided which direction "high" corresponds to. As we've seen already, the direction of the ordering is somewhat arbitrary. The "highest" category for happy could just as easily be "very happy" as "not too happy", as long as adjacent categories are kept together. One direction is often more intuitive than the other, however.
If \((i,j)\) represents an observation in the \(i\)th row and \(j\)th column (i.e., \(X=i\) and \(Y=j\) for that observation), then the pair of observations \((i,j)\) and \((i',j')\) are concordant if
\((i-i')(j-j')>0\)
If \((i-i')(j-j')<0\), then the pair is discordant. If either \(i=i'\) or \(j=j'\), then neither term applies; effectively, ties in either \(X\) or \(Y\) are ignored. If \(C\) and \(D\) denote the number of concordant and discordant pairs, respectively, then a correlation-like measure of association between \(X\) and \(Y\) due to Goodman and Kruskal (1954) is
\(\hat{\gamma}=\dfrac{C-D}{C+D}\)
Like the usual sample correlation between quantitative variables, \(\hat{\gamma}\) always falls within \([-1,1]\) and indicates stronger positive (or negative) association the closer it is to \(+1\) (or \(-1\)).
Example
In our example from the 2018 GSS, respondents were asked to rate their agreement to the statement "Job security is good" with respect to the work they were doing (jobsecok). For the concordance calculations, the responses "not at all true", "not too true", "somewhat true", and "very true" are interpreted as increasing in job security. Similarly, the responses for general happiness (happy) "not too happy", "pretty happy", and "very happy" are interpreted as increasing in happiness. Recall the observed counts for these variables below.
                 Not too happy  Pretty happy  Very happy
Not at all true       15             25            5
Not too true          21             47           21
Somewhat true         64            248          100
Very true             73            474          311
First, we'll count the number of concordant pairs of individuals (observations). Consider first an individual from cell \((1,1)\) paired with an individual from cell \((2,2)\). This pair is concordant because
\((2-1)(2-1)>0\)
That is, the first individual has lower values for both happy ("not too happy") and jobsecok ("not at all true"). Moreover, this result holds for all pairs consisting of one individual chosen from the 15 in cell \((1,1)\) and one individual chosen from the 47 in cell \((2,2)\). Since there are \(15(47)=705\) such pairings, we've just counted 705 concordant pairs!
By the same argument, pairs chosen from cells \((1,1)\) and \((2,3)\) contribute \(15(21)=315\), pairs chosen from cells \((1,1)\) and \((3,2)\) contribute \(15(248)=3720\), and so on. If we're careful to skip over pairs for which any ties occur, our final count should be \(C=199,293\).
Counting discordant pairs works in a similar way. For example, a pair from cells \((1,2)\) and \((2,1)\) are discordant because
\((2-1)(1-2)<0\)
This pair contributes to a negative relationship because the first individual provided a lower value for jobsecok ("not at all true") but a higher value for happy ("pretty happy"), relative to the second individual. And all \(25(21)=525\) such pairs of individuals are counted likewise. Again, without consideration of tied values, we can count a total of \(D=105,867\) discordant pairs in this table. And as a final summary, the gamma correlation value is
\(\hat{\gamma}=\dfrac{199293-105867}{199293+105867}=0.30615\)
This represents a relatively weak positive association between perceived job security and happiness.
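The counts \(C\) and \(D\) can be verified with a direct pass over the table; a sketch assuming no packages (the DescTools function ConDisPairs, used in the Code section below, does the same counting):

```r
# Count concordant (C) and discordant (D) pairs in an I x J table of counts
con.dis <- function(tbl) {
  I <- nrow(tbl); J <- ncol(tbl)
  C <- D <- 0
  for (i in 1:I) for (j in 1:J) {
    # cells strictly below and to the right form concordant pairs with (i, j)
    if (i < I && j < J) C <- C + tbl[i, j] * sum(tbl[(i+1):I, (j+1):J])
    # cells strictly below and to the left form discordant pairs with (i, j)
    if (i < I && j > 1) D <- D + tbl[i, j] * sum(tbl[(i+1):I, 1:(j-1)])
  }
  c(C = C, D = D, gamma = (C - D) / (C + D))
}
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
con.dis(tbl)   # C = 199293, D = 105867, gamma approx 0.3062
```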
Kendall's Tau and Tau-b
Another way to measure association for two ordinal variables is due to Kendall (1945) and incorporates the number of tied observations. With categorical variables, such as happiness or opinion on job security, many individuals will agree or "tie" in either the row variable, the column variable, or both. If we let \(T_r\) and \(T_c\) represent the number of pairs that are tied in the row variable and column variable, respectively, then we can count from the row and column totals
\(T_r=\sum_{i=1}^I\dfrac{n_{i+}(n_{i+}-1)}{2}, \) and \(T_c=\sum_{j=1}^J\dfrac{n_{+j}(n_{+j}-1)}{2} \)
Moreover, if \(n\) denotes the total number of observations in the sample, then Kendall's tau-b is calculated as
\( \hat{\tau}_b=\dfrac{C-D}{\sqrt{[n(n-1)/2-T_r][n(n-1)/2-T_c]}} \)
Like gamma, tau-b is a correlation-like quantity that measures both strength and direction of association for two ordinal variables. It always falls within \([-1,1]\), with stronger positive (or negative) association corresponding to values closer to \(+1\) (or \(-1\)). However, the denominator for tau-b is generally larger than that for gamma, so that tau-b is generally weaker (closer to 0 in absolute value). To see why this is, note that \({n\choose2}=n(n-1)/2\) represents the total number of ways to choose two observations from the grand total and can be expanded as
\(\dfrac{n(n-1)}{2}=C+D+T_r+T_c-T_{rc} \)
where \(T_{rc}\) is the number of pairs of observations (pairs falling in the same cell) that tie on both the row and column variables. In other words, tau-b includes some ties in its denominator, while gamma does not, and ties do not contribute to either a positive or negative association.
It should also be noted that if there are no ties (i.e., every individual provides a unique response), both gamma and tau-b reduce to
\( \hat{\tau}=\dfrac{C-D}{n(n-1)/2} \)
which is known simply as Kendall's tau and is usually reserved for continuous data that has been converted to ranks. In the case of categorical variables, ties are unavoidable.
Example
For the data above regarding jobsecok and happy, we find
\(T_r=\dfrac{45(44)+89(88)+412(411)+858(857)}{2} =457225,\) and \(T_c=\dfrac{173(172)+794(793)+437(436)}{2} =424965\)
and then, with \(n=1404\), calculate Kendall's taub to be
\( \hat{\tau}_b=\dfrac{199293-105867}{\sqrt{[1404(1403)/2-457225][1404(1403)/2-424965]}}=0.1719 \)
Thus, both gamma and tau-b estimate the relationship between jobsecok and happy to be moderately positive.
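The tie counts and tau-b can be checked directly from the margins; a sketch independent of the packaged functions used in the Code section below:

```r
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
n  <- sum(tbl)                          # 1404
Tr <- sum(choose(rowSums(tbl), 2))      # pairs tied on the row variable
Tc <- sum(choose(colSums(tbl), 2))      # pairs tied on the column variable
C <- 199293; D <- 105867                # concordant/discordant counts from above
tau.b <- (C - D) / sqrt((choose(n, 2) - Tr) * (choose(n, 2) - Tc))
c(Tr, Tc, round(tau.b, 4))              # 457225, 424965, 0.1719
```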
Code
The R code to carry out the calculations above:
gss = read.csv(file.choose(), header=T) # "GSS3.csv"
gss = gss[(gss$happy!="Don't know") & (gss$happy!="No answer"),]
gss = gss[(gss$jobsecok!="Dont know") & (gss$jobsecok!="No answer")
& (gss$jobsecok!="Not applicable"),]
tbl = table(gss$jobsecok, gss$happy)
# concordance
library(DescTools)
ConDisPairs(tbl)
GoodmanKruskalGamma(tbl, conf.level=.95)
KendallTauB(tbl, conf.level=.95)
4.3  Measures of Linear Trend
So far, we've referred to both gamma and tau-b as "correlation-like" in that they measure both the strength and the direction of an association. The correlation in reference is Pearson's correlation coefficient. As a parameter, Pearson's correlation is defined for two (quantitative) random variables \(Y\) and \(Z\) as
\( \rho=\dfrac{Cov(Y,Z)}{\sigma_Y\sigma_Z} \)
where \(Cov(Y,Z)=E[(Y-\mu_Y)(Z-\mu_Z)]\) is the covariance of \(Y\) and \(Z\). To see intuitively how this measures linear association, notice that \((Y-\mu_Y)(Z-\mu_Z)\) will be positive if \(Y\) and \(Z\) tend to be large (greater than their means) or tend to be small (less than their means) together. Dividing by their standard deviations removes the units involved with the variables and results in a quantity that always falls within \([-1,1]\). The sample estimate of \(\rho\) is
\(r = \dfrac{(n-1)^{-1}\sum_{i=1}^n(Y_i-\overline{Y})(Z_i-\overline{Z})}{s_Ys_Z} \)
where \(s_Y\) and \(s_Z\) are the sample standard deviations of \(Y\) and \(Z\), respectively. Both the population \(\rho\) and the sample estimate \(r\) satisfy the following properties:
 \(-1 \le r \le 1\)
 \(r = 0\) corresponds to no (linear) relationship
 \(r = \pm1\) corresponds to perfect association, which for a two-way (square) table means that all observations fall into the diagonal cells.
When classification is ordinal, we can sometimes assign quantitative or numerically meaningful scores to the categories, which allows Pearson's correlation to be defined and estimated. The null hypothesis of interest in such a situation is whether the correlation is 0, meaning specifically that there is no linear trend among the ordinal levels. If the null hypothesis of (linear) independence is rejected, it is natural and meaningful to then measure the linear trend. It should be noted that the correlation considered is defined from the scores of the categories. Different choices of scores will generally result in different correlation measures.
A common approach to assigning scores to \(J\) ordered categories is to use their ranks: \(1,2,\ldots,J\) from least to greatest. Pearson's correlation formula applied to this choice of scores is known as Spearman's correlation (Spearman's rho). Even when the variables are originally quantitative, Spearman's rho is a popular nonparametric choice because it's not influenced by outliers. For example, the values 3.2, 1.8, and 4.9 have the same ranks (2, 1, 3) as the values 3.2, 1.8 and 49.0. We've encountered this idea of ranks already in our discussion of cumulative odds, but the difference here is that scores are treated as numerically meaningful and, if two variables are scored as such, will define a correlation that measures linear association.
For the 2018 GSS survey data relating jobsecok to happy, we can assign the scores (1,2,3,4) going down the rows to indicate increasing agreement (to the statement on job security) and scores (1,2,3) going across the columns from left to right to indicate increasing happiness.
      1    2    3
1    15   25    5
2    21   47   21
3    64  248  100
4    73  474  311
The correlation calculated from these scores is then 0.1948, which represents a relatively weak positive, linear association.
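This score-based correlation can be computed from the cell counts without expanding the data; a sketch (the helper name score.cor is our own):

```r
# Pearson correlation of assigned scores, weighted by cell counts
score.cor <- function(tbl, u, v) {
  n <- sum(tbl)
  ubar <- sum(u * rowSums(tbl)) / n          # mean row score
  vbar <- sum(v * colSums(tbl)) / n          # mean column score
  num <- sum(outer(u - ubar, v - vbar) * tbl)
  den <- sqrt(sum((u - ubar)^2 * rowSums(tbl)) *
              sum((v - vbar)^2 * colSums(tbl)))
  num / den
}
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
score.cor(tbl, 1:4, 1:3)   # approx 0.1948
```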
Assigning Scores
As an alternative to scores based on ranks, we may have another choice that appropriately reflects the nature of the ordered categories. In general, we denote scores for the categories of the row variable by \(u_1 \le u_2 \le\cdots\le u_I\) and of the column variable by \(v_1\le v_2\le\cdots\le v_J\). This defines a correlation as a measure of linear association between these two scored variables and can be estimated with the same formula used for Pearson's correlation. For an \(I\times J\) table with cell counts \(n_{ij}\), this estimate is
\(r=\dfrac{\sum_i\sum_j(u_i-\bar{u})(v_j-\bar{v})n_{ij}}{\sqrt{[\sum_i\sum_j(u_i-\bar{u})^2 n_{ij}][\sum_i\sum_j(v_j-\bar{v})^2n_{ij}]}}\)
where \(\bar{u}=\sum_i\sum_j u_i n_{ij}/n\) is the mean row score, and \(\bar{v}=\sum_i\sum_j v_j n_{ij}/n\) is the mean column score.
Note that the assignment of scores is arbitrary and will in general influence the resulting correlation. As a demonstration of this in our current example, suppose we assign the scores (1,2,9,10) to jobsecok and the scores (1,9,10) to happy. Such a choice might make sense if the gaps between adjacent categories were believed to be uneven. The resulting correlation would be \(r=0.1698\).
Thus, to avoid "datasnooping" or artificially significant results, scores should be assigned first, chosen thoughtfully to reflect the nature of the data, and results should be interpreted precisely in terms of the scores that were chosen.
Midrank Scores
If the row and column totals of the table are highly unbalanced, different scores can produce large differences in the estimated correlation \(r\). One option for assigning scores that accounts for this is to use midranks. The idea is based on ranks, but instead of assigning 1, 2, etc. to the categories directly, the ranks are first assigned to the individuals \(1,2,\ldots,n\) for each variable, and then category scores are calculated based on the average rank for the individuals in that category.
For an \(I\times J\) table with \(n_{i+}\) and \(n_{+j}\) representing the row and column totals, respectively, the midrank score for the 1st category of the row variable is
\(u_1=\dfrac{1+2+\cdots+n_{1+}}{n_{1+}} \)
For the second category, we also assign the average rank of the individuals, but note that the ranks begin where they ended in the first category:
\(u_2=\dfrac{(n_{1+}+1)+\cdots+(n_{1+}+n_{2+})}{n_{2+}} \)
And in general, this formula is expressed as
\(u_i=\dfrac{(\sum_{k=1}^{i-1}n_{k+}+1)+\cdots+(\sum_{k=1}^{i}n_{k+})}{n_{i+}} \)
Similarly, midrank scores for the column variable are based on column totals. Ultimately, we end up with \((u_1,\ldots,u_I)\) for the row variable and \((v_1,\ldots,v_J)\) for the column variable, but the values and the spacing of these scores depend on the number of individuals in each category. Let's see how this works for the 2018 GSS example.
                 Not too happy  Pretty happy  Very happy
Not at all true       15             25            5
Not too true          21             47           21
Somewhat true         64            248          100
Very true             73            474          311
The first row corresponds to the \(n_{1+}=45\) individuals replying "Not at all true". If they were individually ranked from 1 to 45, the average of these ranks would be \(u_1=(1+\cdots+45)/45=23\). Likewise, the \(n_{2+}=89\) individuals in the second row would be ranked from 46 to 134 with average rank \(u_2=(46+\cdots+134)/89=90\). Continuing in this way, we find the midrank scores for jobsecok to be \((23,90,340.5,975.5)\) and those for happy to be \((87,570.5,1186)\), from which we can calculate the correlation \(0.1849\).
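The midrank scores and the resulting correlation can be sketched in R as follows (no packages assumed; the helper name midranks is our own):

```r
# Midrank scores from a vector of marginal totals
midranks <- function(totals) {
  upper <- cumsum(totals)        # last rank used in each category
  lower <- upper - totals + 1    # first rank used in each category
  (lower + upper) / 2            # average rank = midrank score
}
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
u <- midranks(rowSums(tbl))   # 23.0, 90.0, 340.5, 975.5
v <- midranks(colSums(tbl))   # 87.0, 570.5, 1186.0
# expand to one score per observation (column-major, matching c(tbl))
x <- rep(rep(u, times = ncol(tbl)), c(tbl))
y <- rep(rep(v, each  = nrow(tbl)), c(tbl))
cor(x, y)                     # approx 0.1849
```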
4.4  Mantel-Haenszel Test for Linear Trend
For a given set of scores and corresponding correlation \(\rho\), we can carry out the test of \(H_0\colon\rho=0\) versus \(H_A\colon\rho\ne0\) using the Mantel-Haenszel (MH) statistic:
\(M^2=(n-1)r^2\)
where \(n\) is the sample size (total number of individuals providing both row and column variable responses), and \(r\) is the sample estimate of the correlation from the given scores. Additional properties are as follows:
 When \(H_0\) is true, \(M^2\) has an approximate chi-square distribution with 1 degree of freedom, regardless of the size of the two-way table.
 \(\mbox{sign}(r)\cdot M\) has an approximate standard normal distribution, which can be used for testing one-sided alternatives too. Note that we use the sign of the sample correlation, so this test statistic may be either positive or negative.
 Larger values of \(M^2\) provide more evidence against the linear independence model.
 The value of \(M^2\) (and hence the evidence to reject \(H_0\)) increases with both sample size \(n\) and the absolute value of the estimated correlation \(r\).
Example
Now, we apply the Mantel-Haenszel test of linear relationship to the 2018 GSS variables jobsecok and happy. There are two separate R functions we can run, depending on whether we want to input scores directly (MH.test) or whether we want R to use the midranks (MH.test.mid). Both functions are defined in the file MantelHaenszel.R, so we need to run that code first.
For the scores \((1,2,3,4)\) assigned to jobsecok and \((1,2,3)\) assigned to happy, the correlation estimate is \(r=0.1948\) (slightly rounded), and with \(n=1404\) total observations, we calculate \(M^2=(1404-1)(0.1948)^2=53.248\), which R outputs, along with the \(p\)-value indicating extremely significant evidence that the correlation between job security and happiness is greater than 0, given these particular scores. Similarly, if the midranks are used, the correlation estimate becomes \(r=0.1849\) with MH test statistic \(M^2=47.97\). The conclusion is unchanged.
The R code to carry out the calculations above:
source(file.choose()) # choose MantelHaenszel.R
# input data: jobsecok (rows) and happy (columns)
tbl = matrix(c(15,21,64,73,25,47,248,474,5,21,100,311),nrow=4)
# user input scores u and v  need to match the dimensions of tbl
u = c(1,2,3,4); v = c(1,2,3)
MH.test(tbl, u, v)
# no user input scores, uses midranks
MH.test.mid(tbl)
If we search for a built-in function in R to compute the MH test statistic, we will find a few additional functions, such as mantelhaen.test. However, this is a different test, designed to work with \(2\times2\times K\) tables to measure conditional association, which we'll see later.
Power vs Pearson Test of Association
When dealing with ordinal data, if there is a positive or negative linear association between the variables, the MH test has a power advantage over the Pearson or LRT test that treats the categories as nominal, meaning the MH test is more likely to detect such an association when it exists. Here are some other comparisons between them.
 The Pearson and LRT tests consider the most general alternative hypothesis for any type of association. The degrees of freedom, which is the number of parameters separating the null and alternative models, is relatively large at \((I − 1)(J − 1)\).
 The MH test has only a single degree of freedom, reflecting the fact that the correlation is the only parameter separating the null model of independence from the alternative of linear association. The critical values (thresholds for establishing significance) will generally be smaller, compared with critical values from a chi-squared distribution with \((I-1)(J-1)\) degrees of freedom.
 For small to moderate sample size, the sampling distribution of \(M^2\) is better approximated by a chi-squared distribution than are the sampling distributions for \(X^2\) and \(G^2\), the Pearson and LRT statistics, respectively; this tends to hold in general for distributions with smaller degrees of freedom.
Since the MH test assumes from the beginning that the association is linear, it may not be as effective in detecting associations that are not linear. Consider the hypothetical table below relating the binary response (success, failure) to the level of drug applied (low, medium, high).
         Success  Failure
Low         5       15
Medium     10       10
High        5       15
Treating these categories as nominal, the Pearson test statistic would be \(X^2=3.75\) with \(p\)-value 0.1534 relative to a chi-square distribution with two degrees of freedom. By contrast, with scores of \((1,2,3)\) assigned to the drug levels and \((1,2)\) assigned to the binary outcomes, the MH test statistic for linear association would be \(M^2=0\) with \(p\)-value 1, indicating no evidence of linear association. The reason the MH test fails to find any evidence is that it's looking specifically for increasing or decreasing trends consistent with stronger correlation values, which are not present in this data. The success rate increases from low to medium drug level but then decreases from medium to high. The Pearson test is more powerful in this situation because it is essentially measuring whether there is any change in success rates.
Not surprisingly, if the table values were observed to be those below, where the success rates increase from drug levels low to high, the MH test would be more powerful (\(X^2=10.0\) with \(p\)-value 0.0067 versus \(M^2=9.83\) with \(p\)-value 0.0017).
         Success  Failure
Low         5       15
Medium     10       10
High       15        5
The takeaway point from this is that the MH test is potentially more powerful than the Pearson (or LRT) if the association between the variables is linear. Even if the ordered categories can be assigned meaningful scores, it doesn't guarantee that the variables' association is linear.
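Since \(M^2=(n-1)r^2\), both hypothetical tables can be checked with a few lines of R. The helper mh.stat below is our own sketch, not the MH.test function from MantelHaenszel.R:

```r
# M^2 = (n - 1) * r^2 from assigned scores
mh.stat <- function(tbl, u, v) {
  n <- sum(tbl)
  x <- rep(rep(u, times = ncol(tbl)), c(tbl))  # row scores per observation
  y <- rep(rep(v, each  = nrow(tbl)), c(tbl))  # column scores per observation
  M2 <- (n - 1) * cor(x, y)^2
  c(M2 = M2, p.value = pchisq(M2, df = 1, lower.tail = FALSE))
}
flat  <- matrix(c(5, 15, 10, 10,  5, 15), nrow = 3, byrow = TRUE)
trend <- matrix(c(5, 15, 10, 10, 15,  5), nrow = 3, byrow = TRUE)
mh.stat(flat,  1:3, 1:2)    # M2 = 0, p = 1: no linear trend detected
mh.stat(trend, 1:3, 1:2)    # M2 approx 9.83, p approx 0.0017
chisq.test(flat)$statistic  # Pearson X2 = 3.75 still reacts to the pattern
```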
Finally, although we did use the scores \((1,2)\) for the binary outcome, this choice is inconsequential for calculating a correlation involving a variable with only two categories, meaning that we would get the same value for the MH test statistic with any choice of scores. If the outcome variable had three or more categories, however, they would need to be ordinal with appropriately assigned scores for the MH test to be meaningful.
4.5  Fisher's Exact Test
The tests discussed so far that use the chi-square approximation, including the Pearson and LRT for nominal data as well as the Mantel-Haenszel test for ordinal data, perform well when the contingency tables have a reasonable number of observations in each cell, as already discussed in Lesson 1. When samples are small, the distributions of \(X^2\), \(G^2\), and \(M^2\) (and other large-sample statistics) are not well approximated by the chi-squared distribution, so their \(p\)-values are not to be trusted. In such situations, we can perform inference using an exact distribution (or estimates of exact distributions), but we should keep in mind that \(p\)-values based on exact tests can be conservative (i.e., reported as larger than they really are).
We may use an exact test if:
 the row totals \(n_{i+}\) and the column totals \(n_{+j}\) are both fixed by design of the study,
 the sample size \(n\) is small, or
 more than 20% of cells have expected counts less than 5, or some expected count is less than 1 (i.e., the usual rule of thumb for the chi-square approximation fails).
Example: Lady tea tasting
Here we consider the famous tea-tasting example! At a summer tea party in Cambridge, England, a lady claimed to be able to discern, by taste alone, whether a cup of tea with milk had the tea poured first or the milk poured first. An experiment was performed by Sir R. A. Fisher himself, then and there, to see if her claim was valid. Eight cups of tea were prepared and presented to her in random order. Four had the milk poured first, and the other four had the tea poured first. The lady tasted each one and rendered her opinion. The results are summarized in the following \(2 \times 2\) table:
                       Lady says poured first
Actually poured first      tea      milk
tea                         3        1
milk                        1        3
The row totals are fixed by the experimenter. The column totals are fixed by the lady, who knows that four of the cups are "tea first" and four are "milk first." Under \(H_0\), the lady has no discerning ability, which is to say the four cups she calls "tea first" are a random sample from the eight. If she selects four at random, the probability that three of these four are actually "tea first" comes from the hypergeometric distribution, \(P(n_{11}=3)\):
\(\dfrac{\dbinom{4}{3}\dbinom{4}{1}}{\dbinom{8}{4}}=\dfrac{\dfrac{4!}{3!1!}\dfrac{4!}{1!3!}}{\dfrac{8!}{4!4!}}=\dfrac{16}{70}=0.229\)
A \(p\)-value is the probability of getting a result as extreme or more extreme than the event actually observed, assuming \(H_0\) is true. In this example, the \(p\)-value would be \(P(n_{11}\ge t_0)\), where \(t_0\) is the observed value of \(n_{11}\), which in this case is 3. The only result more extreme would be the lady's (correct) selection of all four of the cups that are truly "tea first," which has probability
\(\dfrac{\dbinom{4}{4}\dbinom{4}{0}}{\dbinom{8}{4}}=\dfrac{1}{70}=0.014\)
As it turns out, the \(p\)-value is \(.229 + .014 = .243\), which is only weak evidence against the null. In other words, there is not enough evidence to reject the null hypothesis that the lady is purely guessing. To be fair, experiments with small amounts of data are generally not very powerful to begin with, given the limited information.
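The hypergeometric calculation can also be checked directly in R, using dhyper() for the individual probabilities and the built-in fisher.test():

```r
# Hypergeometric probabilities for the tea-tasting table under H0
p3 <- dhyper(3, m = 4, n = 4, k = 4)  # P(n11 = 3) = 16/70
p4 <- dhyper(4, m = 4, n = 4, k = 4)  # P(n11 = 4) = 1/70
p  <- p3 + p4                         # one-sided p-value, 17/70 = 0.243
# the built-in test gives the same one-sided p-value
tea <- matrix(c(3, 1, 1, 3), nrow = 2)
fisher.test(tea, alternative = "greater")$p.value  # 0.2429
```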
Here is how we can do this computation in SAS and R. Further below we describe in a bit more detail the underlying idea behind these calculations.
Code
/*
 Example: Fisher's Tea Lady
*/
data tea;
input poured $ lady $ count;
datalines;
tea tea 3
tea milk 1
milk tea 1
milk milk 3
;
run;
proc freq data=tea order=data;
weight count;
tables poured*lady/ chisq relrisk riskdiff expected;
exact fisher chisq or;
run;
Output
The FREQ Procedure


Statistics for Table of poured by lady
Statistic  DF  Value  Prob 

WARNING: 100% of the cells have expected counts less than 5. (Asymptotic) Chi-Square may not be a valid test. 

Chi-Square  1  2.0000  0.1573 
Likelihood Ratio Chi-Square  1  2.0930  0.1480 
Continuity Adj. Chi-Square  1  0.5000  0.4795 
Mantel-Haenszel Chi-Square  1  1.7500  0.1859 
Phi Coefficient  0.5000  
Contingency Coefficient  0.4472  
Cramer's V  0.5000 
Pearson Chi-Square Test  

Chi-Square  2.0000 
DF  1 
Asymptotic Pr > ChiSq  0.1573 
Exact Pr >= ChiSq  0.4857 
Likelihood Ratio Chi-Square Test  

Chi-Square  2.0930 
DF  1 
Asymptotic Pr > ChiSq  0.1480 
Exact Pr >= ChiSq  0.4857 
Mantel-Haenszel Chi-Square Test  

Chi-Square  1.7500 
DF  1 
Asymptotic Pr > ChiSq  0.1859 
Exact Pr >= ChiSq  0.4857 
Fisher's Exact Test  

Cell (1,1) Frequency (F)  3 
Left-sided Pr <= F  0.9857 
Right-sided Pr >= F  0.2429 
Table Probability (P)  0.2286 
Two-sided Pr <= P  0.4857 
Column 1 Risk Estimates  

Risk  ASE  95% Confidence Limits  Exact 95% Confidence Limits 

Row 1  0.7500  0.2165  0.3257  1.0000  0.1941  0.9937 
Row 2  0.2500  0.2165  0.0000  0.6743  0.0063  0.8059 
Total  0.5000  0.1768  0.1535  0.8465  0.1570  0.8430 
Difference  0.5000  0.3062  -0.1001  1.0000 

Difference is (Row 1 - Row 2)  
Column 2 Risk Estimates  

Risk  ASE  95% Confidence Limits  Exact 95% Confidence Limits 

Row 1  0.2500  0.2165  0.0000  0.6743  0.0063  0.8059 
Row 2  0.7500  0.2165  0.3257  1.0000  0.1941  0.9937 
Total  0.5000  0.1768  0.1535  0.8465  0.1570  0.8430 
Difference  -0.5000  0.3062  -1.0000  0.1001 

Difference is (Row 1 - Row 2)  
Odds Ratio and Relative Risks  

Statistic  Value  95% Confidence Limits  
Odds Ratio  9.0000  0.3666  220.9270 
Relative Risk (Column 1)  3.0000  0.5013  17.9539 
Relative Risk (Column 2)  0.3333  0.0557  1.9949 
Odds Ratio  

Odds Ratio  9.0000 
Asymptotic Conf Limits  
95% Lower Conf Limit  0.3666 
95% Upper Conf Limit  220.9270 
Exact Conf Limits  
95% Lower Conf Limit  0.2117 
95% Upper Conf Limit  626.2435 
Sample Size = 8
The EXACT statement in SAS indicates that we are doing exact tests, which consider ALL tables with the exact same margins as the observed table. This option works for any \(I \times J\) table. The FISHER option, more specifically, performs Fisher's exact test, which is an exact test only for a \(2 \times 2\) table in SAS.
For R, see TeaLady.R, where we use the fisher.test() function to perform Fisher's exact test for the \(2 \times 2\) table in question.
#### the 2 x 2 table (as in TeaLady.R; rows = actually poured first)
tea <- matrix(c(3, 1, 1, 3), nrow = 2,
              dimnames = list(poured = c("tea", "milk"),
                              lady = c("tea", "milk")))
#### one-sided Fisher's exact test
fisher.test(tea, alternative = "greater")
#### two-sided Fisher's exact test
fisher.test(tea)
The same could be done using chisq.test() with the option simulate.p.value=TRUE. By reading the help file for the fisher.test() function, you will see that certain options in this function only work for \(2 \times 2\) tables. For the output, see TeaLady.out
The basic idea behind exact tests is that they enumerate all possible tables that have the same margins, i.e., the same row and column sums, as the observed table. Then, to compute the relevant statistics (e.g., \(X^2\), \(G^2\), odds ratios), we look for all tables whose values are at least as extreme as the one we observed. The key is that within the set of tables with the same margins, once we know the value in one cell, we know the rest of the cells. Therefore, to find the probability of observing a table, we need only the probability of a single cell (rather than the probabilities of all four cells). Typically we use the value of cell (1,1).
Under the null hypothesis of independence, or more specifically when the odds ratio \(\theta = 1\), the probability distribution of that one cell \(n_{11}\) is hypergeometric, as discussed in the tea tasting example.
Stop and Think! In the lady tasting tea example, there are 5 possible \(2 \times 2\) tables that have the same observed margins. Can you figure out which ones they are?
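To check your answer: because both margins are fixed at (4, 4), the entire table is determined by the single cell \(n_{11}\). A quick enumeration, sketched here in Python rather than in the lesson's SAS/R files, lists all five tables and their hypergeometric probabilities:

```python
from math import comb

# Both margins fixed at (4, 4): n11 alone determines the whole 2x2 table.
tables = []
for n11 in range(5):                      # n11 = 0, 1, 2, 3, 4
    table = [[n11, 4 - n11], [4 - n11, n11]]
    prob = comb(4, n11) * comb(4, 4 - n11) / comb(8, 4)   # hypergeometric
    tables.append((table, prob))
    print(table, round(prob, 4))

# probabilities: 1/70, 16/70, 36/70, 16/70, 1/70 -- they sum to 1
```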
Extension of Fisher's test
For problems where the number of possible tables is too large to enumerate, Monte Carlo methods are used to approximate "exact" statistics (e.g., the MC option of the EXACT statement in SAS PROC FREQ; in R's chisq.test(), specify simulate.p.value = TRUE and indicate how many runs you want the MC simulation to do; for more, see the help files). This extension of Fisher's exact test for a \(2 \times 2\) table, and thus these options in SAS and R, in effect draws samples from the large number of possible tables in order to approximate the exact test.
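The sampling idea can be illustrated with a toy simulation. This Python sketch (an illustration of the principle, not the algorithm SAS or R actually uses) shuffles the lady's four "tea first" calls over the eight cups and counts how often the result is at least as extreme as the observed table:

```python
import random

random.seed(1)                      # reproducible simulation
truth = ["tea"] * 4 + ["milk"] * 4  # true pouring order of the 8 cups
n_sims, extreme = 100_000, 0
for _ in range(n_sims):
    guesses = random.sample(range(8), 4)            # cups she calls "tea first"
    n11 = sum(truth[i] == "tea" for i in guesses)   # correct "tea first" calls
    extreme += (n11 >= 3)                           # as extreme as observed
print(extreme / n_sims)             # close to the exact 17/70 = 0.243
```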
This test is "exact" because no large-sample approximations are used; the \(p\)-value is valid regardless of the sample size. Asymptotic results may be unreliable when the distribution of the data is sparse or skewed. Exact computations are based on the statistical theory of exact conditional inference for contingency tables.
Fisher's exact test is definitely appropriate when the row totals and column totals are both fixed by design. Some have argued that it may also be used when only one set of margins is truly fixed. This idea arises because the marginal totals \(n_{1+}, n_{+1}\) provide little information about the odds ratio \(\theta\).
Exact non-null inference for \(\theta\)
When \(\theta = 1\), the conditional distribution of \(n_{11}\) given the margins is hypergeometric; this is the distribution we used in Fisher's exact test. More generally, Fisher (1935) gave this conditional distribution (the noncentral hypergeometric) for any value of \(\theta\). Using it, it is easy to compute an exact \(p\)-value for testing the null hypothesis \(H_0:\theta=\theta^*\) for any \(\theta^*\). The set of all values \(\theta^*\) that cannot be rejected by an \(\alpha=.05\)-level test forms an exact 95% confidence region for \(\theta\).
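For the tea-tasting margins this conditional distribution is small enough to write out by hand. Here is a Python sketch (an independent illustration; in practice fisher.test() in R handles this) of the noncentral hypergeometric pmf and the resulting exact test:

```python
from math import comb

# Noncentral hypergeometric pmf of n11 for margins (4, 4) and odds ratio theta
def pmf(t, theta):
    w = [comb(4, u) * comb(4, 4 - u) * theta**u for u in range(5)]
    return w[t] / sum(w)

# One-sided exact p-value for H0: theta = theta_star, observed n11 = 3
def exact_pvalue(theta_star, t_obs=3):
    return sum(pmf(t, theta_star) for t in range(t_obs, 5))

print(round(exact_pvalue(1.0), 3))   # theta* = 1 recovers Fisher's 0.243
```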
Let's look at part of the SAS output a bit more closely; we get the same CIs in the R output. First, notice the sample estimate of the odds ratio, equal to 9, which we can compute from the cross-product ratio as discussed earlier. Note also that fisher.test() in R for \(2 \times 2\) tables gives the so-called "conditional estimate" of the odds ratio, so the value will be different (in this case, approximately 6.408).
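The conditional estimate maximizes the probability of the observed \(n_{11}=3\) under the noncentral hypergeometric distribution of the cell given the margins. A crude grid search in Python (an illustration of where the number comes from, not how fisher.test() actually computes it) recovers the value:

```python
from math import comb

# Conditional likelihood of theta given n11 = 3 and margins (4, 4)
def cond_lik(theta):
    w = [comb(4, u) * comb(4, 4 - u) * theta**u for u in range(5)]
    return w[3] / sum(w)

# Crude grid search for the conditional MLE over theta in [1, 20]
grid = [i / 1000 for i in range(1000, 20001)]
theta_cond = max(grid, key=cond_lik)
print(round(theta_cond, 3))          # about 6.408, matching fisher.test()
```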
Notice the difference between the exact and asymptotic CIs for the odds ratio for the two-sided alternative (i.e., \(\theta\ne1\)): the exact interval is wider. Recalling the interpretation of the odds ratio, what do these CIs tell us about the true unknown odds ratio? This is a simple example of how inference may vary when samples are small or sparse.
Odds Ratio  

Odds Ratio  9.0000 
Asymptotic Conf Limits  
95% Lower Conf Limit  0.3666 
95% Upper Conf Limit  220.9270 
Exact Conf Limits  
95% Lower Conf Limit  0.2117 
95% Upper Conf Limit  626.2435 
Sample Size = 8
Bias correction for estimating \(\theta\)
Earlier, we learned that the natural estimate of \(\theta\) is
\(\hat{\theta}=\dfrac{n_{11}n_{22}}{n_{12}n_{21}}\)
and that \(\log\hat{\theta}\) is approximately normally distributed about \(\log \theta\) with estimated variance
\(\hat{V}(\log\hat{\theta})=\dfrac{1}{n_{11}}+\dfrac{1}{n_{12}}+\dfrac{1}{n_{21}}+\dfrac{1}{n_{22}}\)
Advanced note: this is the estimated variance of the limiting distribution, not an estimate of the variance of \(\log\hat{\theta}\) itself. Because there is a nonzero probability that the numerator or the denominator of \(\hat{\theta}\) may be zero, the moments of \(\hat{\theta}\) and \(\log\hat{\theta}\) do not actually exist. If the estimate is modified by adding \(1/2\) to each \(n_{ij}\), we have
\(\tilde{\theta}=\dfrac{(n_{11}+0.5)(n_{22}+0.5)}{(n_{12}+0.5)(n_{21}+0.5)}\)
with estimated variance
\(\hat{V}(\log\tilde{\theta})=\sum\limits_{i,j} \dfrac{1}{(n_{ij}+0.5)}\)
In smaller samples, \(\log\tilde{\theta}\) may be slightly less biased than \(\log\hat{\theta}\).
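For the tea-tasting table, both estimates are easy to compute by hand. This Python sketch (an independent check; the Wald interval it produces matches the asymptotic SAS limits above up to rounding) shows the unadjusted and amended versions side by side:

```python
from math import exp, log, sqrt

n = [[3, 1], [1, 3]]                 # observed tea-tasting table

# Unadjusted estimate and 95% Wald interval on the log scale
theta_hat = n[0][0] * n[1][1] / (n[0][1] * n[1][0])               # 9.0
se = sqrt(sum(1 / n[i][j] for i in range(2) for j in range(2)))
lo, hi = exp(log(theta_hat) - 1.96 * se), exp(log(theta_hat) + 1.96 * se)
print(theta_hat, round(lo, 4), round(hi, 2))  # near SAS's (0.3666, 220.93)

# Amended estimate: add 0.5 to every cell before forming the ratio
a = [[n[i][j] + 0.5 for j in range(2)] for i in range(2)]
theta_tilde = a[0][0] * a[1][1] / (a[0][1] * a[1][0])
se_tilde = sqrt(sum(1 / a[i][j] for i in range(2) for j in range(2)))
print(round(theta_tilde, 3))         # 5.444, pulled toward 1 relative to 9
```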
4.6  Lesson 4 Summary
In this lesson, we extended the idea of association in two-way contingency tables to accommodate ordinal data and linear trends. The key point is that ordinal data carry ordering information that we can summarize and potentially take advantage of when carrying out a significance test. We also introduced an alternative "exact" test for \(2\times2\) tables that can be used when the observed counts are too small for the usual chi-square approximation to apply.
Coming up next, we consider new ways to measure associations if more than two variables are in the picture. Specifically, we'll see how one relationship can be confounded with another so that interpretations will change, depending on whether we condition on certain variables in advance.