# 6.3 - Issues with Multiple Testing

If we are conducting a hypothesis test with an $$\alpha$$ level of 0.05, then we are accepting a 5% chance of making a Type I error (i.e., rejecting the null hypothesis when the null hypothesis is really true). If we were to conduct 100 hypothesis tests at a 0.05 $$\alpha$$ level where the null hypotheses are all really true, we would expect to reject the null and make a Type I error in about 5 of those tests.
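This expectation can be illustrated with a short simulation. The sketch below is a minimal example, not part of the original lesson: the sample size, random seed, and the choice of a one-sample z-test with known standard deviation are all assumptions made purely for illustration. It runs 100 tests in which the null hypothesis is really true and counts how many are (wrongly) rejected at the 0.05 level.

```python
import random

random.seed(42)

Z_CRIT = 1.96        # two-sided critical z value for alpha = 0.05
N_TESTS = 100        # number of tests, all with a true null hypothesis
N_PER_SAMPLE = 50    # observations per test (an arbitrary choice)

false_positives = 0
for _ in range(N_TESTS):
    # Sample from a standard normal, so the true mean is 0 and H0 is true.
    sample = [random.gauss(0, 1) for _ in range(N_PER_SAMPLE)]
    sample_mean = sum(sample) / N_PER_SAMPLE
    # z statistic for H0: mu = 0, assuming known sigma = 1
    z = sample_mean / (1 / N_PER_SAMPLE ** 0.5)
    if abs(z) > Z_CRIT:
        false_positives += 1

print(f"Type I errors in {N_TESTS} true-null tests: {false_positives}")
```

Any single run will vary, but across many runs the count of false positives averages about 5 out of 100, matching the 5% Type I error rate set by $$\alpha$$.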

Later in this course you will learn about some statistical procedures that may be used instead of performing multiple tests. For example, to compare the means of more than two groups you can use an analysis of variance ("ANOVA"). To compare the proportions of more than two groups you can conduct a chi-square goodness-of-fit test.

## Publication Bias

A related issue is publication bias. Research studies with statistically significant results are published much more often than studies without statistically significant results. This means that if 100 studies are performed in which there is really no difference in the population, the 5 studies that found statistically significant results may be published while the 95 studies that did not will not be published. Thus, a review of the published literature would turn up only the studies that found statistically significant results; the studies that did not find statistically significant results would never be found.

## Quick Correction for Multiple Tests

One quick method for correcting for multiple tests is to divide the alpha level by the number of tests being conducted. For instance, if you are comparing three groups using a series of three pairwise tests, you could divide your overall alpha level ("family-wise alpha level") by three. If we were using a standard alpha level of 0.05, then our pairwise alpha level would be $$\frac{0.05}{3}=0.016667$$. We would then compare each of our three p-values to 0.016667 to determine statistical significance. This is known as the Bonferroni method. It is one of the most conservative approaches to controlling for multiple tests (i.e., it is more likely to lead to a Type II error). Later in the course you will learn how to use the Tukey method when comparing the means of three or more groups; that approach is often preferred because it is more liberal.
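As a sketch of the Bonferroni method described above, the adjustment can be applied directly to a set of p-values. The three p-values below are hypothetical numbers chosen only for illustration; they are not from the lesson.

```python
# Hypothetical p-values from three pairwise comparisons (made-up for
# illustration; any real analysis would supply its own values).
p_values = [0.012, 0.030, 0.245]

family_alpha = 0.05
# Bonferroni: divide the family-wise alpha by the number of tests.
per_test_alpha = family_alpha / len(p_values)   # 0.05 / 3 = 0.016667

for p in p_values:
    decision = "reject H0" if p < per_test_alpha else "fail to reject H0"
    print(f"p = {p:.3f} vs adjusted alpha = {per_test_alpha:.6f}: {decision}")
```

Note that p = 0.030 would be statistically significant at the usual 0.05 level but is not significant after the Bonferroni adjustment, which is exactly the conservatism (increased Type II error risk) described above.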