4: Tests for Ordinal Data and Small Samples
Overview
So far, we've dealt with the concepts of independence and association in two-way tables assuming nominal classification. The Pearson and likelihood ratio test statistics introduced do not assume any particular ordering of the rows or columns, and their numeric values do not change if the ordering is altered in any way. If a natural ordering exists (i.e., for ordinal data), we can utilize it to potentially increase the power of our tests. In the last section, we will consider a test proposed by R. A. Fisher for two-way tables that is applicable when the sample size and/or cell counts are small, and the large-sample methodology does not hold. But before getting into testing and inference, we revisit some measures of association and see how they can accommodate ordinal data.
4.1  Cumulative Odds and Odds Ratios
Recall that a discrete ordinal variable, say \(Y\), takes on values that can be sorted either least to greatest or greatest to least. Examples introduced earlier might be the face of a die (\(1,2,\ldots,6\)) or a person's attitude towards war ("strongly disagree", "disagree", "agree", "strongly agree"). For convenience, we may label such categories as 1 through \(J\), which allows us to express cumulative probabilities as
\(F(j) = P(Y\le j)=\pi_1+\cdots+\pi_j\),
where the parameter \(\pi_j\) represents the probability of category \(j\). So, we can write \(F(2)=\pi_1+\pi_2\) to represent the probability that the die is either 1 or 2, or, in the second example, the probability that an individual responds either "strongly disagree" or "disagree". Note that we need only the ordinal property for this to make sense; the values \(1,2,\ldots,J\) themselves do not represent any numerically meaningful quantities in these cases.
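To make the notation concrete, here is a quick R sketch of the cumulative probabilities for a fair die (our own illustration, not part of the lesson's code files):

```r
# Cumulative probabilities F(j) = pi_1 + ... + pi_j for a fair six-sided die
pi.j <- rep(1/6, 6)      # category probabilities pi_1, ..., pi_6
F.j  <- cumsum(pi.j)     # F(1), ..., F(6)
F.j[2]                   # P(Y <= 2) = 1/3
```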
Cumulative Odds
In addition to probabilities (or risks), ordinal categories also allow an odds to be defined for cumulative events. Recall the observed data for the war attitude example:
Attitude            Count
Strongly disagree    35
Disagree             27
Agree                23
Strongly agree       31
Total               116
If we focus on the "disagree" outcome in particular, the estimated probability would be \(\hat{\pi}_2=27/116=0.2328\) with corresponding estimated odds \(\hat{\pi}_2/(1-\hat{\pi}_2)=27/(35+23+31)=0.3034\). However, using the estimated cumulative probability \(\hat{F}(2)=\hat{\pi}_1+\hat{\pi}_2=(35+27)/116=0.5345\), we may also consider the estimated cumulative odds:
\( \dfrac{\hat{F}(2)}{1-\hat{F}(2)}=\dfrac{35+27}{23+31}=1.1481 \)
We interpret this value as the (estimated) cumulative odds that an individual will "disagree", where the category of "strongly disagree" is implicitly included. Equivalently, we may also refer to this as the (estimated) odds of "strongly disagree" or "disagree". In general, we define the cumulative odds for \(Y\le j\) as
\(\dfrac{F(j)}{1-F(j)}=\dfrac{\pi_1+\cdots+\pi_j}{\pi_{j+1}+\cdots+\pi_J},\quad\mbox{for }j=1,\ldots,J-1\)
with sample estimate \(\hat{F}(j)/(1-\hat{F}(j))\). The case \(j=J\) is not defined because \(1-F(J)=0\). Like cumulative probabilities, cumulative odds are necessarily non-decreasing in \(j\) and will be strictly increasing if the observed counts are all positive.
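These cumulative odds are easy to compute with cumsum(); a minimal sketch using the counts above:

```r
# War-attitude counts, ordered "strongly disagree" < ... < "strongly agree"
counts   <- c(35, 27, 23, 31)
F.hat    <- cumsum(counts) / sum(counts)  # estimated cumulative probabilities
cum.odds <- F.hat / (1 - F.hat)           # estimated cumulative odds
round(cum.odds[2], 4)                     # 1.1481, matching the hand calculation
# cum.odds[4] is Inf, since 1 - F(4) = 0 (the case j = J is undefined)
```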
Cumulative Odds Ratios
If an additional variable is involved, this idea extends to cumulative odds ratios. Consider the table below summarizing the responses for extent of agreement to the statement "job security is good" (jobsecok) and general happiness (happy) from the 2018 General Social Surveys. Additional possible responses of "don't know" and "no answer" are omitted here.
                 Not too happy  Pretty happy  Very happy
Not at all true       15             25            5
Not too true          21             47           21
Somewhat true         64            248          100
Very true             73            474          311
If we condition on jobsecok and view happy as the response variable, the cumulative odds ratio for "not too happy" or "pretty happy" for those who say "not at all true", relative to those who say "very true", would be estimated with
\(\dfrac{(15+25)/5}{(73+474)/311}=4.55 \)
If perhaps "pretty happy" and "very happy" seem like a more intuitive combination to consider, keep in mind that we're free to reverse the order to start with "very happy" and end with "not too happy" without violating the ordinal nature of this variable. By indexing "very happy" with \(j=1\), \(F(2)\) becomes the cumulative probability of "very happy" or "pretty happy", and the cumulative odds would likewise be \(F(2)/(1-F(2))\). Similarly, we could choose any two rows to serve as the groups for comparison. The row variable doesn't even have to be ordinal itself to define such a cumulative odds ratio.
However, if we happen to have two ordinal variables, as we do in this example, we can work with a cumulative version in both dimensions. We may, for example, consider the cumulative odds of "not too happy" or "pretty happy", for those who say "not at all" or "not too" true, relative to those who say "somewhat" or "very" true. The estimate of this would be
\(\dfrac{(15+25+21+47)/(5+21)}{(64+248+73+474)/(100+311)}=1.99 \)
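Both estimates can be reproduced by collapsing the table directly; a minimal sketch in R:

```r
# jobsecok (rows, "not at all" to "very" true) by happy (columns) counts
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
# rows 1 vs 4, accumulating columns 1-2 ("not too" or "pretty" happy)
or1 <- (sum(tbl[1, 1:2]) / tbl[1, 3]) / (sum(tbl[4, 1:2]) / tbl[4, 3])
# rows 1-2 vs rows 3-4, with the same accumulation of columns
or2 <- (sum(tbl[1:2, 1:2]) / sum(tbl[1:2, 3])) /
       (sum(tbl[3:4, 1:2]) / sum(tbl[3:4, 3]))
round(c(or1, or2), 2)   # 4.55 and 1.99
```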
The common theme in all these odds ratios is that they essentially convert an \(I\times J\) table into a \(2\times2\) table by combining or "accumulating" counts in adjacent categories, depending on our focus of interest. The collapsed tables below correspond to the two odds ratios just computed.

                 Not too or pretty happy  Very happy
Not at all true            40                 5
Very true                 547               311
                            Not too or pretty happy  Very happy
Not at all or not too true           108                 26
Somewhat or very true                859                411
CIs for Cumulative Odds Ratios
Recall for a \(2\times 2\) table with counts \((n_{11},n_{12},n_{21},n_{22})\), we have the sample odds ratio \(\dfrac{n_{11}n_{22}}{n_{12}n_{21}}\) and corresponding 95% confidence interval for the (population) log odds ratio:
\(\log\dfrac{n_{11}n_{22}}{n_{12}n_{21}} \pm 1.96\sqrt{\dfrac{1}{n_{11}}+\dfrac{1}{n_{12}}+\dfrac{1}{n_{21}}+\dfrac{1}{n_{22}}}\)
We can readily adapt this formula for a cumulative odds ratio \(\theta\) as well. We just need to work with the \(2\times2\) table of counts induced by the accumulation. For example, the \(2\times2\) table induced by the cumulative odds ratio for "not too" or "pretty" happy for those saying "not at all" compared with "very" true is
                 Not too or pretty happy  Very happy
Not at all true            40                 5
Very true                 547               311
With estimate \(\hat{\theta}=4.55\), the 95% confidence interval for \(\log\theta\) is
\(\log 4.55 \pm 1.96\sqrt{\dfrac{1}{40}+\dfrac{1}{5}+\dfrac{1}{547}+\dfrac{1}{311}} = (0.5747, 2.4549)\)
And by exponentiating the endpoints, we have the 95% CI for \(\theta\):
\(e^{(0.5747, 2.4549)}=(1.7766, 11.6448)\)
Likewise, for the cumulative odds of "not too" or "pretty" happy, for those saying "not at all" or "not too" true compared with those saying "somewhat" or "very" true, we have on the log scale
\(\log 1.99 \pm 1.96\sqrt{\dfrac{1}{108}+\dfrac{1}{26}+\dfrac{1}{859}+\dfrac{1}{411}} = (0.2429, 1.1308)\)
And, by exponentiating limits, we have the final CI for the odds ratio:
\(e^{(0.2429, 1.1308)}=(1.275, 3.098)\)
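Both intervals can be computed with a small helper function; a sketch (the name or.ci is our own, not from a package):

```r
# 95% Wald CI for a (cumulative) odds ratio from a collapsed 2x2 table
or.ci <- function(n11, n12, n21, n22, conf = 0.95) {
  log.or <- log((n11 * n22) / (n12 * n21))
  se <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
  z  <- qnorm(1 - (1 - conf) / 2)
  exp(log.or + c(-1, 1) * z * se)  # exponentiate the log-scale endpoints
}
ci1 <- or.ci(40, 5, 547, 311)    # approx (1.777, 11.645)
ci2 <- or.ci(108, 26, 859, 411)  # approx (1.275, 3.098)
```

Using the exact cell counts (rather than the rounded estimates 4.55 and 1.99) gives limits that agree with the hand calculations to about three decimal places.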
To put it a bit more loosely, we can say that individuals who generally don't agree as much with the statement "job security is good" have greater odds of being less happy. Or, equivalently, those who generally agree more with the statement have greater odds of being happier.
4.2  Measures of Positive and Negative Association
While two nominal variables can be associated (knowing something about one can tell us something about the other), without a natural ordering, we can't specify that association with a particular direction, such as positive or negative. With an ordinal variable, however, it makes sense to think of one of its outcomes as being "higher" or "greater" than another, even if they aren't necessarily numerically meaningful. And with two such ordinal variables, we can define a positive association to mean that both variables tend to be either "high" or "low" together; a negative association would exist if one tends to be "high" when the other is "low". We solidify these ideas in this lesson.
Concordance and Gamma
With this notion of "high" and "low" available for two (ordinal) variables \(X\) and \(Y\), we can define a quantity that measures both strength and direction of association. Suppose that \(X\) takes on the values \(1,2,\ldots,I\) and that \(Y\) takes on the values \(1,2,\ldots,J\). We can think of these as ranks of the categories in each variable once we have decided which direction "high" corresponds to. As we've seen already, the direction of the ordering is somewhat arbitrary. The "highest" category for happy could just as easily be "very happy" as "not too happy", as long as adjacent categories are kept together. One direction is often more intuitive than the other, however.
If \((i,j)\) represents an observation in the \(i\)th row and \(j\)th column (i.e., \(X=i\) and \(Y=j\) for that observation), then the pair of observations \((i,j)\) and \((i',j')\) are concordant if
\((i-i')(j-j')>0\)
If \((i-i')(j-j')<0\), then the pair is discordant. If either \(i=i'\) or \(j=j'\), then neither term applies; effectively, ties in either \(X\) or \(Y\) are ignored. If \(C\) and \(D\) denote the number of concordant and discordant pairs, respectively, then a correlation-like measure of association between \(X\) and \(Y\) due to Goodman and Kruskal (1954) is
\(\hat{\gamma}=\dfrac{C-D}{C+D}\)
Like the usual sample correlation between quantitative variables, \(\hat{\gamma}\) always falls within \([-1,1]\) and indicates stronger positive (or negative) association the closer it is to \(+1\) (or \(-1\)).
Example
In our example from the 2018 GSS, respondents were asked to rate their agreement to the statement "Job security is good" with respect to the work they were doing (jobsecok). For the concordance calculations, the responses "not at all true", "not too true", "somewhat true", and "very true" are interpreted as increasing in job security. Similarly, the responses for general happiness (happy) "not too happy", "pretty happy", and "very happy" are interpreted as increasing in happiness. Recall the observed counts for these variables below.
                 Not too happy  Pretty happy  Very happy
Not at all true       15             25            5
Not too true          21             47           21
Somewhat true         64            248          100
Very true             73            474          311
First, we'll count the number of concordant pairs of individuals (observations). Consider first an individual from cell \((1,1)\) paired with an individual from cell \((2,2)\). This pair is concordant because
\((2-1)(2-1)>0\)
That is, the first individual has lower values for both happy ("not too happy") and jobsecok ("not at all true"). Moreover, this result holds for all pairs consisting of one individual chosen from the 15 in cell \((1,1)\) and one individual chosen from the 47 in cell \((2,2)\). Since there are \(15(47)=705\) such pairings, we've just counted 705 concordant pairs!
By the same argument, pairs chosen from cells \((1,1)\) and \((2,3)\) contribute \(15(21)=315\), pairs chosen from cells \((1,1)\) and \((3,2)\) contribute \(15(248)=3720\), and so on. If we're careful to skip over pairs for which any ties occur, our final count should be \(C=199,293\).
Counting discordant pairs works in a similar way. For example, a pair from cells \((1,2)\) and \((2,1)\) are discordant because
\((2-1)(1-2)<0\)
This pair contributes to a negative relationship because the first individual provided a lower value for jobsecok ("not at all true") but a higher value for happy ("pretty happy"), relative to the second individual. And all \(25(21)=525\) such pairs of individuals are counted likewise. Again, without consideration of tied values, we can count a total of \(D=105,867\) discordant pairs in this table. And as a final summary, the gamma correlation value is
\(\hat{\gamma}=\dfrac{199293-105867}{199293+105867}=0.30615\)
This represents a relatively weak positive association between perceived job security and happiness.
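The counts \(C\) and \(D\) can be verified with a direct pass over the table; a sketch assuming no packages (the DescTools function ConDisPairs, used in the Code section below, does the same counting):

```r
# Count concordant (C) and discordant (D) pairs in an I x J table of counts
con.dis <- function(tbl) {
  I <- nrow(tbl); J <- ncol(tbl)
  C <- D <- 0
  for (i in 1:I) for (j in 1:J) {
    # cells strictly below and to the right form concordant pairs with (i, j)
    if (i < I && j < J) C <- C + tbl[i, j] * sum(tbl[(i+1):I, (j+1):J])
    # cells strictly below and to the left form discordant pairs with (i, j)
    if (i < I && j > 1) D <- D + tbl[i, j] * sum(tbl[(i+1):I, 1:(j-1)])
  }
  c(C = C, D = D, gamma = (C - D) / (C + D))
}
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
con.dis(tbl)   # C = 199293, D = 105867, gamma approx 0.3062
```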
Kendall's Tau and Tau-b
Another way to measure association for two ordinal variables is due to Kendall (1945) and incorporates the number of tied observations. With categorical variables, such as happiness or opinion on job security, many individuals will agree or "tie" in either the row variable, the column variable, or both. If we let \(T_r\) and \(T_c\) represent the number of pairs that are tied in the row variable and column variable, respectively, then we can count from the row and column totals
\(T_r=\sum_{i=1}^I\dfrac{n_{i+}(n_{i+}-1)}{2}, \) and \(T_c=\sum_{j=1}^J\dfrac{n_{+j}(n_{+j}-1)}{2} \)
Moreover, if \(n\) denotes the total number of observations in the sample, then Kendall's tau-b is calculated as
\( \hat{\tau}_b=\dfrac{C-D}{\sqrt{[n(n-1)/2-T_r][n(n-1)/2-T_c]}} \)
Like gamma, tau-b is a correlation-like quantity that measures both strength and direction of association for two ordinal variables. It always falls within \([-1,1]\), with stronger positive (or negative) association corresponding to values closer to \(+1\) (or \(-1\)). However, the denominator for tau-b is generally larger than that for gamma, so that tau-b is generally weaker (closer to 0 in absolute value). To see why this is, note that \({n\choose2}=n(n-1)/2\) represents the total number of ways to choose two observations from the grand total and can be expanded as
\(\dfrac{n(n-1)}{2}=C+D+T_r+T_c-T_{rc} \)
where \(T_{rc}\) is the number of pairs of observations (pairs falling in the same cell) that tie on both the row and column variables. In other words, tau-b includes some ties in its denominator, while gamma does not, and ties do not contribute to either a positive or negative association.
It should also be noted that if there are no ties (i.e., every individual provides a unique response), both gamma and tau-b reduce to
\( \hat{\tau}=\dfrac{C-D}{n(n-1)/2} \)
which is known simply as Kendall's tau and is usually reserved for continuous data that has been converted to ranks. In the case of categorical variables, ties are unavoidable.
Example
For the data above regarding jobsecok and happy, we find
\(T_r=\dfrac{45(44)+89(88)+412(411)+858(857)}{2} =457225,\) and \(T_c=\dfrac{173(172)+794(793)+437(436)}{2} =424965\)
and then, with \(n=1404\), calculate Kendall's taub to be
\( \hat{\tau}_b=\dfrac{199293-105867}{\sqrt{[1404(1403)/2-457225][1404(1403)/2-424965]}}=0.1719 \)
Thus, both gamma and tau-b estimate the relationship between jobsecok and happy to be moderately positive.
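The tie counts and tau-b can be checked directly from the margins; a sketch independent of the packaged functions used in the Code section below:

```r
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
n  <- sum(tbl)                          # 1404
Tr <- sum(choose(rowSums(tbl), 2))      # pairs tied on the row variable
Tc <- sum(choose(colSums(tbl), 2))      # pairs tied on the column variable
C <- 199293; D <- 105867                # concordant/discordant counts from above
tau.b <- (C - D) / sqrt((choose(n, 2) - Tr) * (choose(n, 2) - Tc))
c(Tr, Tc, round(tau.b, 4))              # 457225, 424965, 0.1719
```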
Code
The R code to carry out the calculations above:
gss = read.csv(file.choose(), header=T) # "GSS3.csv"
gss = gss[(gss$happy!="Don't know") & (gss$happy!="No answer"),]
gss = gss[(gss$jobsecok!="Dont know") & (gss$jobsecok!="No answer")
& (gss$jobsecok!="Not applicable"),]
tbl = table(gss$jobsecok, gss$happy)
# concordance
library(DescTools)
ConDisPairs(tbl)
GoodmanKruskalGamma(tbl, conf.level=.95)
KendallTauB(tbl, conf.level=.95)
4.3  Measures of Linear Trend
So far, we've referred to both gamma and tau-b as "correlation-like" in that they measure both the strength and the direction of an association. The correlation in reference is Pearson's correlation coefficient. As a parameter, Pearson's correlation is defined for two (quantitative) random variables \(Y\) and \(Z\) as
\( \rho=\dfrac{Cov(Y,Z)}{\sigma_Y\sigma_Z} \)
where \(Cov(Y,Z)=E[(Y-\mu_Y)(Z-\mu_Z)]\) is the covariance of \(Y\) and \(Z\). To see intuitively how this measures linear association, notice that \((Y-\mu_Y)(Z-\mu_Z)\) will be positive if \(Y\) and \(Z\) tend to be large (greater than their means) or tend to be small (less than their means) together. Dividing by their standard deviations removes the units involved with the variables and results in a quantity that always falls within \([-1,1]\). The sample estimate of \(\rho\) is
\(r = \dfrac{(n-1)^{-1}\sum_{i=1}^n(Y_i-\overline{Y})(Z_i-\overline{Z})}{s_Ys_Z} \)
where \(s_Y\) and \(s_Z\) are the sample standard deviations of \(Y\) and \(Z\), respectively. Both the population \(\rho\) and the sample estimate \(r\) satisfy the following properties:
 \(-1 \le r \le 1\)
 \(r = 0\) corresponds to no (linear) relationship
 \(r = \pm1\) corresponds to perfect association, which for a two-way (square) table means that all observations fall into the diagonal cells.
When classification is ordinal, we can sometimes assign quantitative or numerically meaningful scores to the categories, which allows Pearson's correlation to be defined and estimated. The null hypothesis of interest in such a situation is whether the correlation is 0, meaning specifically that there is no linear trend among the ordinal levels. If the null hypothesis of (linear) independence is rejected, it is natural and meaningful to then measure the linear trend. It should be noted that the correlation considered is defined from the scores of the categories. Different choices of scores will generally result in different correlation measures.
A common approach to assigning scores to \(J\) ordered categories is to use their ranks: \(1,2,\ldots,J\) from least to greatest. Pearson's correlation formula applied to this choice of scores is known as Spearman's correlation (Spearman's rho). Even when the variables are originally quantitative, Spearman's rho is a popular nonparametric choice because it's not influenced by outliers. For example, the values 3.2, 1.8, and 4.9 have the same ranks (2, 1, 3) as the values 3.2, 1.8 and 49.0. We've encountered this idea of ranks already in our discussion of cumulative odds, but the difference here is that scores are treated as numerically meaningful and, if two variables are scored as such, will define a correlation that measures linear association.
For the 2018 GSS survey data relating jobsecok to happy, we can assign the scores (1,2,3,4) going down the rows to indicate increasing agreement (to the statement on job security) and scores (1,2,3) going across the columns from left to right to indicate increasing happiness.
      1    2    3
1    15   25    5
2    21   47   21
3    64  248  100
4    73  474  311
The correlation calculated from these scores is then 0.1948, which represents a relatively weak positive, linear association.
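This score-based correlation can be computed from the cell counts without expanding the data; a sketch (the helper name score.cor is our own):

```r
# Pearson correlation of assigned scores, weighted by cell counts
score.cor <- function(tbl, u, v) {
  n <- sum(tbl)
  ubar <- sum(u * rowSums(tbl)) / n          # mean row score
  vbar <- sum(v * colSums(tbl)) / n          # mean column score
  num <- sum(outer(u - ubar, v - vbar) * tbl)
  den <- sqrt(sum((u - ubar)^2 * rowSums(tbl)) *
              sum((v - vbar)^2 * colSums(tbl)))
  num / den
}
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
score.cor(tbl, 1:4, 1:3)   # approx 0.1948
```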
Assigning Scores
As an alternative to scores based on ranks, we may have another choice that appropriately reflects the nature of the ordered categories. In general, we denote scores for the categories of the row variable by \(u_1 \le u_2 \le\cdots\le u_I\) and of the column variable by \(v_1\le v_2\le\cdots\le v_J\). This defines a correlation as a measure of linear association between these two scored variables and can be estimated with the same formula used for Pearson's correlation. For an \(I\times J\) table with cell counts \(n_{ij}\), this estimate is
\(r=\dfrac{\sum_i\sum_j(u_i-\bar{u})(v_j-\bar{v})n_{ij}}{\sqrt{[\sum_i\sum_j(u_i-\bar{u})^2 n_{ij}][\sum_i\sum_j(v_j-\bar{v})^2n_{ij}]}}\)
where \(\bar{u}=\sum_i\sum_j u_i n_{ij}/n\) is the mean row score, and \(\bar{v}=\sum_i\sum_j v_j n_{ij}/n\) is the mean column score.
Note that the assignment of scores is arbitrary and will in general influence the resulting correlation. As a demonstration of this in our current example, suppose we assign the scores (1,2,9,10) to jobsecok and the scores (1,9,10) to happy. Such a choice might make sense if the gaps between adjacent categories were believed to be uneven. The resulting correlation would be \(r=0.1698\).
Thus, to avoid "datasnooping" or artificially significant results, scores should be assigned first, chosen thoughtfully to reflect the nature of the data, and results should be interpreted precisely in terms of the scores that were chosen.
Midrank Scores
If the row and column totals of the table are highly unbalanced, different scores can produce large differences in the estimated correlation \(r\). One option for assigning scores that accounts for this is to use midranks. The idea is based on ranks, but instead of assigning 1, 2, etc. to the categories directly, the ranks are first assigned to the individuals \(1,2,\ldots,n\) for each variable, and then category scores are calculated based on the average rank for the individuals in that category.
For an \(I\times J\) table with \(n_{i+}\) and \(n_{+j}\) representing the row and column totals, respectively, the midrank score for the 1st category of the row variable is
\(u_1=\dfrac{1+2+\cdots+n_{1+}}{n_{1+}} \)
For the second category, we also assign the average rank of the individuals, but note that the ranks begin where they ended in the first category:
\(u_2=\dfrac{(n_{1+}+1)+\cdots+(n_{1+}+n_{2+})}{n_{2+}} \)
And in general, this formula is expressed as
\(u_i=\dfrac{(\sum_{k=1}^{i-1}n_{k+}+1)+\cdots+(\sum_{k=1}^{i}n_{k+})}{n_{i+}} \)
Similarly, midrank scores for the column variable are based on column totals. Ultimately, we end up with \((u_1,\ldots,u_I)\) for the row variable and \((v_1,\ldots,v_J)\) for the column variable, but the values and the spacing of these scores depend on the number of individuals in each category. Let's see how this works for the 2018 GSS example.
                 Not too happy  Pretty happy  Very happy
Not at all true       15             25            5
Not too true          21             47           21
Somewhat true         64            248          100
Very true             73            474          311
The first row corresponds to the \(n_{1+}=45\) individuals replying "Not at all true". If they were individually ranked from 1 to 45, the average of these ranks would be \(u_1=(1+\cdots+45)/45=23\). Likewise, the \(n_{2+}=89\) individuals in the second row would be ranked from 46 to 134 with average rank \(u_2=(46+\cdots+134)/89=90\). Continuing in this way, we find the midrank scores for jobsecok to be \((23,90,340.5,975.5)\) and those for happy to be \((87,570.5,1186)\), from which we can calculate the correlation \(0.1849\).
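The midrank scores and the resulting correlation can be sketched in R as follows (no packages assumed; the helper name midranks is our own):

```r
# Midrank scores from a vector of marginal totals
midranks <- function(totals) {
  upper <- cumsum(totals)        # last rank used in each category
  lower <- upper - totals + 1    # first rank used in each category
  (lower + upper) / 2            # average rank = midrank score
}
tbl <- matrix(c(15,  25,   5,
                21,  47,  21,
                64, 248, 100,
                73, 474, 311), nrow = 4, byrow = TRUE)
u <- midranks(rowSums(tbl))   # 23.0, 90.0, 340.5, 975.5
v <- midranks(colSums(tbl))   # 87.0, 570.5, 1186.0
# expand to one score per observation (column-major, matching c(tbl))
x <- rep(rep(u, times = ncol(tbl)), c(tbl))
y <- rep(rep(v, each  = nrow(tbl)), c(tbl))
cor(x, y)                     # approx 0.1849
```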
4.4  Mantel-Haenszel Test for Linear Trend
For a given set of scores and corresponding correlation \(\rho\), we can carry out the test of \(H_0\colon\rho=0\) versus \(H_A\colon\rho\ne0\) using the Mantel-Haenszel (MH) statistic:
\(M^2=(n-1)r^2\)
where \(n\) is the sample size (total number of individuals providing both row and column variable responses), and \(r\) is the sample estimate of the correlation from the given scores. Additional properties are as follows:
 When \(H_0\) is true, \(M^2\) has an approximate chi-square distribution with 1 degree of freedom, regardless of the size of the two-way table.
 \(\mbox{sign}(r)\cdot M\) has an approximate standard normal distribution, which can be used for testing one-sided alternatives too. Note that we use the sign of the sample correlation, so this test statistic may be either positive or negative.
 Larger values of \(M^2\) provide more evidence against the linear independence model.
 The value of \(M^2\) (and hence the evidence to reject \(H_0\)) increases with both sample size \(n\) and the absolute value of the estimated correlation \(r\).
Example
Now, we apply the Mantel-Haenszel test of linear relationship to the 2018 GSS variables jobsecok and happy. There are two separate R functions we can run, depending on whether we want to input scores directly (MH.test) or whether we want R to use the midranks (MH.test.mid). Both functions are defined in the file MantelHaenszel.R, so we need to run that code first.
For the scores \((1,2,3,4)\) assigned to jobsecok and \((1,2,3)\) assigned to happy, the correlation estimate is \(r=0.1948\) (slightly rounded), and with \(n=1404\) total observations, we calculate \(M^2=(1404-1)(0.1948)^2=53.248\), which R outputs, along with the \(p\)-value indicating extremely significant evidence that the correlation between job security and happiness is greater than 0, given these particular scores. Similarly, if the midranks are used, the correlation estimate becomes \(r=0.1849\) with MH test statistic \(M^2=47.97\). The conclusion is unchanged.
The R code to carry out the calculations above:
source(file.choose()) # choose MantelHaenszel.R
# input data: jobsecok (rows) and happy (columns)
tbl = matrix(c(15,21,64,73,25,47,248,474,5,21,100,311),nrow=4)
# user input scores u and v  need to match the dimensions of tbl
u = c(1,2,3,4); v = c(1,2,3)
MH.test(tbl, u, v)
# no user input scores, uses midranks
MH.test.mid(tbl)
If we search for a built-in function in R to compute the MH test statistic, we will find a few additional functions, such as mantelhaen.test. However, this is a different test, designed to work with \(2\times2\times K\) tables to measure conditional association, which we'll see later.
Power vs Pearson Test of Association
When dealing with ordinal data, if there is a positive or negative linear association between the variables, the MH test has a power advantage over the Pearson or LRT test that treats the categories as nominal, meaning the MH test is more likely to detect such an association when it exists. Here are some other comparisons between them.
 The Pearson and LRT tests consider the most general alternative hypothesis for any type of association. The degrees of freedom, which is the number of parameters separating the null and alternative models, is relatively large at \((I − 1)(J − 1)\).
 The MH test has only a single degree of freedom, reflecting the fact that the correlation is the only parameter separating the null model of independence from the alternative of linear association. The critical values (thresholds for establishing significance) will generally be smaller, compared with critical values from a chi-squared distribution with \((I-1)(J-1)\) degrees of freedom.
 For small to moderate sample size, the sampling distribution of \(M^2\) is better approximated by a chi-squared distribution than are the sampling distributions for \(X^2\) and \(G^2\), the Pearson and LRT statistics, respectively; this tends to hold in general for distributions with smaller degrees of freedom.
Since the MH test assumes from the beginning that the association is linear, it may not be as effective in detecting associations that are not linear. Consider the hypothetical table below relating the binary response (success, failure) to the level of drug applied (low, medium, high).
         Success  Failure
Low         5       15
Medium     10       10
High        5       15
Treating these categories as nominal, the Pearson test statistic would be \(X^2=3.75\) with \(p\)-value 0.1534 relative to a chi-square distribution with two degrees of freedom. By contrast, with scores of \((1,2,3)\) assigned to the drug levels and \((1,2)\) assigned to the binary outcomes, the MH test statistic for linear association would be \(M^2=0\) with \(p\)-value 1, indicating no evidence of linear association. The reason the MH test fails to find any evidence is that it's looking specifically for increasing or decreasing trends consistent with stronger correlation values, which are not present in this data. The success rate increases from low to medium drug level but then decreases from medium to high. The Pearson test is more powerful in this situation because it is essentially measuring whether there is any change in success rates.
Not surprisingly, if the table values were observed to be those below, where the success rates increase from drug levels low to high, the MH test would be more powerful (\(X^2=10.0\) with \(p\)-value 0.0067 versus \(M^2=9.83\) with \(p\)-value 0.0017).
         Success  Failure
Low         5       15
Medium     10       10
High       15        5
The takeaway point from this is that the MH test is potentially more powerful than the Pearson (or LRT) if the association between the variables is linear. Even if the ordered categories can be assigned meaningful scores, it doesn't guarantee that the variables' association is linear.
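Since \(M^2=(n-1)r^2\), both hypothetical tables can be checked with a few lines of R. The helper mh.stat below is our own sketch, not the MH.test function from MantelHaenszel.R:

```r
# M^2 = (n - 1) * r^2 from assigned scores
mh.stat <- function(tbl, u, v) {
  n <- sum(tbl)
  x <- rep(rep(u, times = ncol(tbl)), c(tbl))  # row scores per observation
  y <- rep(rep(v, each  = nrow(tbl)), c(tbl))  # column scores per observation
  M2 <- (n - 1) * cor(x, y)^2
  c(M2 = M2, p.value = pchisq(M2, df = 1, lower.tail = FALSE))
}
flat  <- matrix(c(5, 15, 10, 10,  5, 15), nrow = 3, byrow = TRUE)
trend <- matrix(c(5, 15, 10, 10, 15,  5), nrow = 3, byrow = TRUE)
mh.stat(flat,  1:3, 1:2)    # M2 = 0, p = 1: no linear trend detected
mh.stat(trend, 1:3, 1:2)    # M2 approx 9.83, p approx 0.0017
chisq.test(flat)$statistic  # Pearson X2 = 3.75 still reacts to the pattern
```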
Finally, although we did use the scores \((1,2)\) for the binary outcome, this choice is inconsequential for calculating a correlation involving a variable with only two categories, meaning that we would get the same value for the MH test statistic with any choice of scores. If the outcome variable had three or more categories, however, they would need to be ordinal with appropriately assigned scores for the MH test to be meaningful.
4.5  Fisher's Exact Test
The tests discussed so far that use the chi-square approximation, including the Pearson and LRT for nominal data as well as the Mantel-Haenszel test for ordinal data, perform well when the contingency tables have a reasonable number of observations in each cell, as already discussed in Lesson 1. When samples are small, the distributions of \(X^2\), \(G^2\), and \(M^2\) (and other large-sample statistics) are not well approximated by the chi-squared distribution, so their \(p\)-values are not to be trusted. In such situations, we can perform inference using an exact distribution (or estimates of exact distributions), but we should keep in mind that \(p\)-values based on exact tests can be conservative (i.e., reported as larger than they really are).
We may use an exact test if:
 the row totals \(n_{i+}\) and the column totals \(n_{+j}\) are both fixed by design of the study,
 the sample size \(n\) is small, or
 more than 20% of cells have expected counts less than 5, or some expected count is less than 1 (i.e., the usual rule of thumb for the chi-square approximation fails).
Example: Lady tea tasting
Here we consider the famous tea-tasting example! At a summer tea party in Cambridge, England, a lady claimed to be able to discern, by taste alone, whether a cup of tea with milk had the tea poured first or the milk poured first. An experiment was performed by Sir R. A. Fisher himself, then and there, to see if her claim was valid. Eight cups of tea were prepared and presented to her in random order. Four had the milk poured first, and the other four had the tea poured first. The lady tasted each one and rendered her opinion. The results are summarized in the following \(2 \times 2\) table:
                       Lady says poured first
Actually poured first      tea      milk
tea                         3        1
milk                        1        3
The row totals are fixed by the experimenter. The column totals are fixed by the lady, who knows that four of the cups are "tea first" and four are "milk first." Under \(H_0\), the lady has no discerning ability, which is to say the four cups she calls "tea first" are a random sample from the eight. If she selects four at random, the probability that three of these four are actually "tea first" comes from the hypergeometric distribution, \(P(n_{11}=3)\):
\(\dfrac{\dbinom{4}{3}\dbinom{4}{1}}{\dbinom{8}{4}}=\dfrac{\dfrac{4!}{3!1!}\dfrac{4!}{1!3!}}{\dfrac{8!}{4!4!}}=\dfrac{16}{70}=0.229\)
A \(p\)-value is the probability of getting a result as extreme or more extreme than the event actually observed, assuming \(H_0\) is true. In this example, the \(p\)-value would be \(P(n_{11}\ge t_0)\), where \(t_0\) is the observed value of \(n_{11}\), which in this case is 3. The only result more extreme would be the lady's (correct) selection of all four of the cups that are truly "tea first," which has probability
\(\dfrac{\dbinom{4}{4}\dbinom{4}{0}}{\dbinom{8}{4}}=\dfrac{1}{70}=0.014\)
As it turns out, the \(p\)-value is \(.229 + .014 = .243\), which is only weak evidence against the null. In other words, there is not enough evidence to reject the null hypothesis that the lady is purely guessing. To be fair, experiments with small amounts of data are generally not very powerful to begin with, given the limited information.
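The hypergeometric calculation can also be checked directly in R, using dhyper() for the individual probabilities and the built-in fisher.test():

```r
# Hypergeometric probabilities for the tea-tasting table under H0
p3 <- dhyper(3, m = 4, n = 4, k = 4)  # P(n11 = 3) = 16/70
p4 <- dhyper(4, m = 4, n = 4, k = 4)  # P(n11 = 4) = 1/70
p  <- p3 + p4                         # one-sided p-value, 17/70 = 0.243
# the built-in test gives the same one-sided p-value
tea <- matrix(c(3, 1, 1, 3), nrow = 2)
fisher.test(tea, alternative = "greater")$p.value  # 0.2429
```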
Here is how we can do this computation in SAS and R. Further below we describe in a bit more detail the underlying idea behind these calculations.
Code
/*
 Example: Fisher's Tea Lady
*/
data tea;
input poured $ lady $ count;
datalines;
tea tea 3
tea milk 1
milk tea 1
milk milk 3
;
run;
proc freq data=tea order=data;
weight count;
tables poured*lady/ chisq relrisk riskdiff expected;
exact fisher chisq or;
run;
Output
The FREQ Procedure


Statistics for Table of poured by lady
Statistic  DF  Value  Prob 

WARNING: 100% of the cells have expected counts less than 5. (Asymptotic) Chi-Square may not be a valid test. 

Chi-Square  1  2.0000  0.1573 
Likelihood Ratio Chi-Square  1  2.0930  0.1480 
Continuity Adj. Chi-Square  1  0.5000  0.4795 
Mantel-Haenszel Chi-Square  1  1.7500  0.1859 
Phi Coefficient  0.5000  
Contingency Coefficient  0.4472  
Cramer's V  0.5000 
Pearson Chi-Square Test  

Chi-Square  2.0000 
DF  1 
Asymptotic Pr > ChiSq  0.1573 
Exact Pr >= ChiSq  0.4857 
Likelihood Ratio Chi-Square Test  

Chi-Square  2.0930 
DF  1 
Asymptotic Pr > ChiSq  0.1480 
Exact Pr >= ChiSq  0.4857 
Mantel-Haenszel Chi-Square Test  

Chi-Square  1.7500 
DF  1 
Asymptotic Pr > ChiSq  0.1859 
Exact Pr >= ChiSq  0.4857 
Fisher's Exact Test  

Cell (1,1) Frequency (F)  3 
Left-sided Pr <= F  0.9857 
Right-sided Pr >= F  0.2429 
Table Probability (P)  0.2286 
Two-sided Pr <= P  0.4857 
Column 1 Risk Estimates  

Risk  ASE  95% Confidence Limits  Exact 95% Confidence Limits 

Row 1  0.7500  0.2165  0.3257  1.0000  0.1941  0.9937 
Row 2  0.2500  0.2165  0.0000  0.6743  0.0063  0.8059 
Total  0.5000  0.1768  0.1535  0.8465  0.1570  0.8430 
Difference  0.5000  0.3062  -0.1001  1.0000 

Difference is (Row 1 - Row 2)  
Column 2 Risk Estimates  

Risk  ASE  95% Confidence Limits  Exact 95% Confidence Limits 

Row 1  0.2500  0.2165  0.0000  0.6743  0.0063  0.8059 
Row 2  0.7500  0.2165  0.3257  1.0000  0.1941  0.9937 
Total  0.5000  0.1768  0.1535  0.8465  0.1570  0.8430 
Difference  -0.5000  0.3062  -1.0000  0.1001 

Difference is (Row 1 - Row 2)  
Odds Ratio and Relative Risks  

Statistic  Value  95% Confidence Limits  
Odds Ratio  9.0000  0.3666  220.9270 
Relative Risk (Column 1)  3.0000  0.5013  17.9539 
Relative Risk (Column 2)  0.3333  0.0557  1.9949 
Odds Ratio  

Odds Ratio  9.0000 
Asymptotic Conf Limits  
95% Lower Conf Limit  0.3666 
95% Upper Conf Limit  220.9270 
Exact Conf Limits  
95% Lower Conf Limit  0.2117 
95% Upper Conf Limit  626.2435 
Sample Size = 8
The EXACT statement in SAS indicates that we are doing exact tests, which consider ALL tables with the exact same margins as the observed table. This option works for any \(I \times J\) table. The FISHER option, more specifically, performs Fisher's exact test, which is an exact test only for a \(2 \times 2\) table in SAS.
For R, see TeaLady.R, where we use the fisher.test() function to perform Fisher's exact test for the \(2 \times 2\) table in question.
#### the 2 x 2 table (as in TeaLady.R; rows = actually poured first)
tea <- matrix(c(3, 1, 1, 3), nrow = 2,
              dimnames = list(poured = c("tea", "milk"),
                              lady = c("tea", "milk")))
#### one-sided Fisher's exact test
fisher.test(tea, alternative = "greater")
#### two-sided Fisher's exact test
fisher.test(tea)
The same could be done using chisq.test() with the option simulate.p.value=TRUE. By reading the help file for the fisher.test() function, you will see that certain options in this function only work for \(2 \times 2\) tables. For the output, see TeaLady.out
The basic idea behind exact tests is that they enumerate all possible tables that have the same margins, i.e., the same row and column sums, as the observed table. Then, to compute the relevant statistics (e.g., \(X^2\), \(G^2\), odds ratios), we look for all tables whose values are at least as extreme as the one we observed. The key is that within the set of tables with the same margins, once we know the value in one cell, we know the rest of the cells. Therefore, to find the probability of observing a table, we need only the probability of a single cell (rather than the probabilities of all four cells). Typically we use the value of cell (1,1).
Under the null hypothesis of independence, or more specifically when the odds ratio \(\theta = 1\), the probability distribution of that one cell \(n_{11}\) is hypergeometric, as discussed in the tea tasting example.
Stop and Think! In the lady tasting tea example, there are 5 possible \(2 \times 2\) tables that have the same observed margins. Can you figure out which ones they are?
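To check your answer: because both margins are fixed at (4, 4), the entire table is determined by the single cell \(n_{11}\). A quick enumeration, sketched here in Python rather than in the lesson's SAS/R files, lists all five tables and their hypergeometric probabilities:

```python
from math import comb

# Both margins fixed at (4, 4): n11 alone determines the whole 2x2 table.
tables = []
for n11 in range(5):                      # n11 = 0, 1, 2, 3, 4
    table = [[n11, 4 - n11], [4 - n11, n11]]
    prob = comb(4, n11) * comb(4, 4 - n11) / comb(8, 4)   # hypergeometric
    tables.append((table, prob))
    print(table, round(prob, 4))

# probabilities: 1/70, 16/70, 36/70, 16/70, 1/70 -- they sum to 1
```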
Extension of Fisher's test
For problems where the number of possible tables is too large to enumerate, Monte Carlo methods are used to approximate "exact" statistics (e.g., the MC option of the EXACT statement in SAS PROC FREQ; in R's chisq.test(), specify simulate.p.value = TRUE and indicate how many runs you want the MC simulation to do; for more, see the help files). This extension of Fisher's exact test for a \(2 \times 2\) table, and thus these options in SAS and R, in effect draws samples from the large number of possible tables in order to approximate the exact test.
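The sampling idea can be illustrated with a toy simulation. This Python sketch (an illustration of the principle, not the algorithm SAS or R actually uses) shuffles the lady's four "tea first" calls over the eight cups and counts how often the result is at least as extreme as the observed table:

```python
import random

random.seed(1)                      # reproducible simulation
truth = ["tea"] * 4 + ["milk"] * 4  # true pouring order of the 8 cups
n_sims, extreme = 100_000, 0
for _ in range(n_sims):
    guesses = random.sample(range(8), 4)            # cups she calls "tea first"
    n11 = sum(truth[i] == "tea" for i in guesses)   # correct "tea first" calls
    extreme += (n11 >= 3)                           # as extreme as observed
print(extreme / n_sims)             # close to the exact 17/70 = 0.243
```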
This test is "exact" because no large-sample approximations are used; the \(p\)-value is valid regardless of the sample size. Asymptotic results may be unreliable when the distribution of the data is sparse or skewed. Exact computations are based on the statistical theory of exact conditional inference for contingency tables.
Fisher's exact test is definitely appropriate when the row totals and column totals are both fixed by design. Some have argued that it may also be used when only one set of margins is truly fixed. This idea arises because the marginal totals \(n_{1+}, n_{+1}\) provide little information about the odds ratio \(\theta\).
Exact non-null inference for \(\theta\)
When \(\theta = 1\), the conditional distribution of \(n_{11}\) given the margins is hypergeometric; this is the distribution we used in Fisher's exact test. More generally, Fisher (1935) gave this conditional distribution (the noncentral hypergeometric) for any value of \(\theta\). Using it, it is easy to compute an exact \(p\)-value for testing the null hypothesis \(H_0:\theta=\theta^*\) for any \(\theta^*\). The set of all values \(\theta^*\) that cannot be rejected by an \(\alpha=.05\)-level test forms an exact 95% confidence region for \(\theta\).
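For the tea-tasting margins this conditional distribution is small enough to write out by hand. Here is a Python sketch (an independent illustration; in practice fisher.test() in R handles this) of the noncentral hypergeometric pmf and the resulting exact test:

```python
from math import comb

# Noncentral hypergeometric pmf of n11 for margins (4, 4) and odds ratio theta
def pmf(t, theta):
    w = [comb(4, u) * comb(4, 4 - u) * theta**u for u in range(5)]
    return w[t] / sum(w)

# One-sided exact p-value for H0: theta = theta_star, observed n11 = 3
def exact_pvalue(theta_star, t_obs=3):
    return sum(pmf(t, theta_star) for t in range(t_obs, 5))

print(round(exact_pvalue(1.0), 3))   # theta* = 1 recovers Fisher's 0.243
```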
Let's look at part of the SAS output a bit more closely; we get the same CIs in the R output. First, notice the sample estimate of the odds ratio, equal to 9, which we can compute from the cross-product ratio as discussed earlier. Note also that fisher.test() in R for \(2 \times 2\) tables gives the so-called "conditional estimate" of the odds ratio, so the value will be different (in this case, approximately 6.408).
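The conditional estimate maximizes the probability of the observed \(n_{11}=3\) under the noncentral hypergeometric distribution of the cell given the margins. A crude grid search in Python (an illustration of where the number comes from, not how fisher.test() actually computes it) recovers the value:

```python
from math import comb

# Conditional likelihood of theta given n11 = 3 and margins (4, 4)
def cond_lik(theta):
    w = [comb(4, u) * comb(4, 4 - u) * theta**u for u in range(5)]
    return w[3] / sum(w)

# Crude grid search for the conditional MLE over theta in [1, 20]
grid = [i / 1000 for i in range(1000, 20001)]
theta_cond = max(grid, key=cond_lik)
print(round(theta_cond, 3))          # about 6.408, matching fisher.test()
```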
Notice the difference between the exact and asymptotic CIs for the odds ratio for the two-sided alternative (i.e., \(\theta\ne1\)): the exact interval is wider. Recalling the interpretation of the odds ratio, what do these CIs tell us about the true unknown odds ratio? This is a simple example of how inference may vary when samples are small or sparse.
Odds Ratio  

Odds Ratio  9.0000 
Asymptotic Conf Limits  
95% Lower Conf Limit  0.3666 
95% Upper Conf Limit  220.9270 
Exact Conf Limits  
95% Lower Conf Limit  0.2117 
95% Upper Conf Limit  626.2435 
Sample Size = 8
Bias correction for estimating \(\theta\)
Earlier, we learned that the natural estimate of \(\theta\) is
\(\hat{\theta}=\dfrac{n_{11}n_{22}}{n_{12}n_{21}}\)
and that \(\log\hat{\theta}\) is approximately normally distributed about \(\log \theta\) with estimated variance
\(\hat{V}(\log\hat{\theta})=\dfrac{1}{n_{11}}+\dfrac{1}{n_{12}}+\dfrac{1}{n_{21}}+\dfrac{1}{n_{22}}\)
Advanced note: this is the estimated variance of the limiting distribution, not an estimate of the variance of \(\log\hat{\theta}\) itself. Because there is a nonzero probability that the numerator or the denominator of \(\hat{\theta}\) may be zero, the moments of \(\hat{\theta}\) and \(\log\hat{\theta}\) do not actually exist. If the estimate is modified by adding \(1/2\) to each \(n_{ij}\), we have
\(\tilde{\theta}=\dfrac{(n_{11}+0.5)(n_{22}+0.5)}{(n_{12}+0.5)(n_{21}+0.5)}\)
with estimated variance
\(\hat{V}(\log\tilde{\theta})=\sum\limits_{i,j} \dfrac{1}{(n_{ij}+0.5)}\)
In smaller samples, \(\log\tilde{\theta}\) may be slightly less biased than \(\log\hat{\theta}\).
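For the tea-tasting table, both estimates are easy to compute by hand. This Python sketch (an independent check; the Wald interval it produces matches the asymptotic SAS limits above up to rounding) shows the unadjusted and amended versions side by side:

```python
from math import exp, log, sqrt

n = [[3, 1], [1, 3]]                 # observed tea-tasting table

# Unadjusted estimate and 95% Wald interval on the log scale
theta_hat = n[0][0] * n[1][1] / (n[0][1] * n[1][0])               # 9.0
se = sqrt(sum(1 / n[i][j] for i in range(2) for j in range(2)))
lo, hi = exp(log(theta_hat) - 1.96 * se), exp(log(theta_hat) + 1.96 * se)
print(theta_hat, round(lo, 4), round(hi, 2))  # near SAS's (0.3666, 220.93)

# Amended estimate: add 0.5 to every cell before forming the ratio
a = [[n[i][j] + 0.5 for j in range(2)] for i in range(2)]
theta_tilde = a[0][0] * a[1][1] / (a[0][1] * a[1][0])
se_tilde = sqrt(sum(1 / a[i][j] for i in range(2) for j in range(2)))
print(round(theta_tilde, 3))         # 5.444, pulled toward 1 relative to 9
```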
4.6  Lesson 4 Summary
In this lesson, we extended the idea of association in two-way contingency tables to accommodate ordinal data and linear trends. The key point is that ordinal data carry ordering information that we can summarize and potentially take advantage of when carrying out a significance test. We also introduced an alternative "exact" test for \(2\times2\) tables that can be used when the observed counts are too small for the usual chi-square approximation to apply.
Coming up next, we consider new ways to measure associations if more than two variables are in the picture. Specifically, we'll see how one relationship can be confounded with another so that interpretations will change, depending on whether we condition on certain variables in advance.