4.4 - Mantel-Haenszel Test for Linear Trend

For a given set of scores and corresponding correlation \(\rho\), we can carry out the test of \(H_0\colon\rho=0\) versus \(H_A\colon\rho\ne0\) using the Mantel-Haenszel (MH) statistic:

\(M^2=(n-1)r^2\)

where \(n\) is the sample size (total number of individuals providing both row and column variable responses), and \(r\) is the sample estimate of the correlation from the given scores. Additional properties are as follows:

When \(H_0\) is true, \(M^2\) has an approximate chi-square distribution with 1 degree of freedom, regardless of the size of the two-way table.
\(\mbox{sign}(r)\cdot|M|\) has an approximate standard normal distribution, which can also be used for testing one-sided alternatives. Note that we use the sign of the sample correlation, so this test statistic may be either positive or negative.
Larger values of \(M^2\) provide more evidence against the independence model, in favor of a linear association.
The value of \(M^2\) (and hence the evidence to reject \(H_0\)) increases with both sample size \(n\) and the absolute value of the estimated correlation \(|r|\).
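
These properties translate directly into \(p\)-value calculations. The following is a minimal sketch in base R, assuming for illustration the values \(r=0.1948\) and \(n=1404\) from the example that follows; the variable names are our own.

```r
# Illustrative inputs (taken from the GSS example in this section)
n <- 1404      # sample size
r <- 0.1948    # sample correlation for the chosen scores (rounded)

M2 <- (n - 1) * r^2         # MH statistic
z  <- sign(r) * sqrt(M2)    # signed version, approximately N(0, 1) under H0

# Two-sided test of rho = 0: upper tail of the chi-square distribution with 1 df
p_two_sided <- pchisq(M2, df = 1, lower.tail = FALSE)

# One-sided tests based on the signed statistic
p_greater <- pnorm(z, lower.tail = FALSE)  # H_A: rho > 0
p_less    <- pnorm(z)                      # H_A: rho < 0

c(M2 = M2, z = z, p_two_sided = p_two_sided, p_greater = p_greater)
```

Because \(M^2=z^2\), the two-sided chi-square \(p\)-value equals twice the smaller of the two one-sided normal \(p\)-values.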

Example

Now, we apply the Mantel-Haenszel test of linear relationship to the 2018 GSS variables jobsecok and happy. There are two separate R functions we can run, depending on whether we want to input scores directly (MH.test) or whether we want R to use the midranks (MH.test.mid). Both functions are defined in the file MantelHaenszel.R, so we need to run that code first.

For the scores \((1,2,3,4)\) assigned to jobsecok and \((1,2,3)\) assigned to happy, the correlation estimate is \(r=0.1948\) (slightly rounded). With \(n=1404\) total observations, we calculate \(M^2=(1404-1)r^2=53.248\) (using the unrounded \(r\)), which R outputs along with the \(p\)-value. The \(p\)-value indicates extremely significant evidence that the correlation between job security and happiness differs from 0 (here, that it is positive), given these particular scores. Similarly, if the midranks are used, the correlation estimate becomes \(r=0.1849\) with MH test statistic \(M^2=47.97\). The conclusion is unchanged.

The R code to carry out the calculations above:

source(file.choose()) # choose MantelHaenszel.R

# input data: jobsecok (rows) and happy (columns)
tbl = matrix(c(15,21,64,73,25,47,248,474,5,21,100,311),nrow=4)

# user input scores u and v -- need to match the dimensions of tbl
u = c(1,2,3,4); v = c(1,2,3)
MH.test(tbl, u, v)

# no user input scores, uses midranks
MH.test.mid(tbl)
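
Since MantelHaenszel.R is not reproduced here, the following is a minimal sketch of what such functions might compute. It is not the course's implementation. The helper names mh_sketch and midrank_scores are hypothetical. The sketch expands the table into one pair of scores per individual and applies cor(); midranks are taken to be the average rank within each category of the marginal totals.

```r
# Hypothetical helper: MH statistic for a two-way table with given scores.
# tbl: matrix of counts; u: row scores; v: column scores.
mh_sketch <- function(tbl, u, v) {
  stopifnot(length(u) == nrow(tbl), length(v) == ncol(tbl))
  counts <- as.vector(tbl)                                # column-major cell counts
  u_obs <- rep(u[as.vector(row(tbl))], times = counts)    # row score per individual
  v_obs <- rep(v[as.vector(col(tbl))], times = counts)    # column score per individual
  n  <- sum(counts)
  r  <- cor(u_obs, v_obs)
  M2 <- (n - 1) * r^2
  list(r = r, M2 = M2, p.value = pchisq(M2, df = 1, lower.tail = FALSE))
}

# Hypothetical helper: midrank of each category from its marginal total,
# i.e. (number of observations in earlier categories) + (total + 1) / 2.
midrank_scores <- function(totals) {
  cumsum(totals) - totals + (totals + 1) / 2
}

# jobsecok (rows) by happy (columns), as in the code above
tbl <- matrix(c(15,21,64,73,25,47,248,474,5,21,100,311), nrow = 4)

res_scores <- mh_sketch(tbl, u = 1:4, v = 1:3)
res_mid    <- mh_sketch(tbl, midrank_scores(rowSums(tbl)),
                        midrank_scores(colSums(tbl)))

# Compare with the values of r and M^2 quoted in the text
round(c(r = res_scores$r, M2 = res_scores$M2), 4)
round(c(r = res_mid$r, M2 = res_mid$M2), 4)
```

Expanding the table to individual observations is wasteful for large \(n\), but it keeps the sketch short and makes the link to an ordinary Pearson correlation explicit.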

If we search for a built-in function in R to compute the MH test statistic, we will find a few similarly named functions, such as mantelhaen.test. However, this is a different test: it is designed for \(2\times2\times K\) tables and measures conditional association, which we'll see later.

Power vs Pearson Test of Association

With ordinal data, if there is a positive or negative linear association between the variables, the MH test has a power advantage over the Pearson and likelihood-ratio (LRT) tests, which treat the categories as nominal. That is, the MH test is more likely to detect such an association when it exists. Here are some other comparisons between them.

The Pearson and LRT tests consider the most general alternative hypothesis for any type of association. The degrees of freedom, which is the number of parameters separating the null and alternative models, is relatively large at \((I-1)(J-1)\).
The MH test has only a single degree of freedom, reflecting the fact that the correlation is the only parameter separating the null model of independence from the alternative of linear association. The critical values (thresholds for establishing significance) will generally be smaller, compared with critical values from a chi-squared distribution with \((I-1)(J-1)\) degrees of freedom.
For small to moderate sample size, the sampling distribution of \(M^2\) is better approximated by a chi-squared distribution than are the sampling distributions for \(X^2\) and \(G^2\), the Pearson and LRT statistics, respectively; this tends to hold in general for distributions with smaller degrees of freedom.
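
To make the difference in critical values concrete, here is a small sketch in base R. It assumes a significance level of 0.05 and the \(4\times3\) dimensions of the jobsecok-by-happy table used earlier; both choices are for illustration only.

```r
# Table dimensions (rows I, columns J) and significance level: illustrative choices
I <- 4; J <- 3
alpha <- 0.05

crit_mh      <- qchisq(1 - alpha, df = 1)                  # MH test: 1 df
crit_pearson <- qchisq(1 - alpha, df = (I - 1) * (J - 1))  # Pearson / LRT: 6 df

round(c(MH = crit_mh, Pearson_LRT = crit_pearson), 3)
# MH = 3.841, Pearson_LRT = 12.592
```

The MH statistic only needs to exceed 3.841 to reach significance at this level, whereas \(X^2\) or \(G^2\) would need to exceed 12.592.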

Since the MH test assumes from the beginning that the association is linear, it may not be as effective in detecting associations that are not linear. Consider the hypothetical table below relating the binary response (success, failure) to the level of drug applied (low, medium, high).

Hypothetical table with neither increasing nor decreasing success rates
Drug level	Success	Failure
Low	5	15
Medium	10	10
High	5	15

Treating these categories as nominal, the Pearson test statistic would be \(X^2=3.75\) with \(p\)-value 0.1534 relative to a chi-square distribution with two degrees of freedom. By contrast, with scores of \((1,2,3)\) assigned to the drug levels and \((1,2)\) assigned to the binary outcomes, the MH test statistic for linear association would be \(M^2=0\) with \(p\)-value 1, indicating no evidence for linear association. The reason the MH test fails to find any evidence is that it looks specifically for increasing or decreasing trends, which correspond to stronger correlation values and which are not present in these data. The success rate increases from low to medium drug level but then decreases from medium to high. The Pearson test is more powerful in this situation because it is essentially measuring whether there is any change in success rates.

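Not surprisingly, if the table values were observed to be those below, where the success rates are increasing from drug levels low to high, the MH test would be more powerful (\(M^2=9.83\) with \(p\)-value 0.0017 versus \(X^2=10.0\) with \(p\)-value 0.0067). The two statistics are similar in size, but \(M^2\) is compared with a chi-square distribution with one degree of freedom rather than two.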

Hypothetical table with increasing success rates
Drug level	Success	Failure
Low	5	15
Medium	10	10
High	15	5
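
The statistics quoted for the two hypothetical tables can be checked with a short sketch in base R. The helper mh_sketch is hypothetical (it is not the course's MH.test); it expands each table to individual observations and computes \(M^2=(n-1)r^2\) with cor(). The Pearson statistic comes from the built-in chisq.test.

```r
# Hypothetical helper (not the course's MH.test): MH statistic from a table and scores
mh_sketch <- function(tbl, u, v) {
  counts <- as.vector(tbl)                               # column-major cell counts
  u_obs <- rep(u[as.vector(row(tbl))], times = counts)   # row score per individual
  v_obs <- rep(v[as.vector(col(tbl))], times = counts)   # column score per individual
  r  <- cor(u_obs, v_obs)
  M2 <- (sum(counts) - 1) * r^2
  c(M2 = M2, p_M2 = pchisq(M2, df = 1, lower.tail = FALSE))
}

# Rows: low, medium, high; columns: success, failure
flat  <- matrix(c(5, 10, 5, 15, 10, 15), nrow = 3)   # no monotone trend
trend <- matrix(c(5, 10, 15, 15, 10, 5), nrow = 3)   # increasing success rates

for (tbl in list(flat, trend)) {
  pearson <- chisq.test(tbl, correct = FALSE)
  print(round(c(X2 = unname(pearson$statistic), p_X2 = pearson$p.value,
                mh_sketch(tbl, u = 1:3, v = 1:2)), 4))
}
```

For the first table the sample correlation is zero by symmetry, so \(M^2\) is zero (up to floating-point error) no matter how strongly the middle row departs from the other two.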

The takeaway point is that the MH test is potentially more powerful than the Pearson (or LRT) test if the association between the variables is linear. Even when the ordered categories can be assigned meaningful scores, this does not guarantee that the variables' association is linear.

Finally, although we did use the scores \((1,2)\) for the binary outcome, this choice is inconsequential for calculating a correlation involving a variable with only two categories: any two distinct scores are a linear transformation of any other two, which can change only the sign of \(r\), so we would get the same value for the MH test statistic with any such choice. If the outcome variable had three or more categories, however, they would need to be ordinal with appropriately assigned scores for the MH test to be meaningful.
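
This invariance can be checked numerically. The sketch below, in base R, recomputes \(M^2\) for the table with increasing success rates under three arbitrary pairs of scores for the binary outcome; the function name m2_for and the particular score pairs are illustrative assumptions.

```r
# Table with increasing success rates; rows: low, medium, high; columns: success, failure
trend  <- matrix(c(5, 10, 15, 15, 10, 5), nrow = 3)
counts <- as.vector(trend)                                 # column-major cell counts

u_obs   <- rep(c(1, 2, 3)[as.vector(row(trend))], times = counts)  # drug-level scores
outcome <- as.vector(col(trend))                           # 1 = success, 2 = failure

# M^2 for a given pair of outcome scores v = c(score_success, score_failure)
m2_for <- function(v) {
  r <- cor(u_obs, rep(v[outcome], times = counts))
  (sum(counts) - 1) * r^2
}

m2_values <- sapply(list(c(1, 2), c(0, 1), c(10, -3)), m2_for)
round(m2_values, 4)   # all three agree with the M^2 = 9.83 reported above
```

Reversing the order of the two scores, as in the third pair, flips the sign of \(r\) but leaves \(r^2\), and hence \(M^2\), unchanged.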
