3.3 - Multiple Comparisons

Scheffé's Method

Scheffé's method for investigating all possible contrasts of the means corresponds exactly to the F-test in the following sense. If the F-test rejects the null hypothesis at level \(\alpha\), then there exists at least one contrast which would be rejected using the Scheffé procedure at level \(\alpha\). Therefore, Scheffé provides \(\alpha\) level protection against rejecting the null hypothesis when it is true, regardless of how many contrasts of the means are tested.

Fisher's LSD

Fisher’s LSD is the F-test followed by ordinary t-tests among all pairs of means, performed only if the F-test rejects the null hypothesis. The F-test provides the overall protection against rejecting \(H_0\) when it is true. The t-tests are each performed at level \(\alpha\) and thus are likely to reject more often than they should once the F-test has rejected. A simple example illustrates this: assume there are eight treatment groups, and one treatment has a mean higher than the other seven, which all share the same value, so that the F-test rejects \(H_0\). When following up with the pairwise t-tests, the \(7 \times 6 / 2 = 21\) pairwise t-tests among the seven equal means are quite likely, by chance alone, to reject at least one pairwise hypothesis \(H_0 \colon \mu_i = \mu_{i'}\) at \(\alpha = 0.05\). Despite this drawback, Fisher's LSD remains a favorite method since it has overall \(\alpha\)-level protection when all means are equal and is simple to understand and interpret.
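
A rough sense of the size of this problem can be had with a few lines of arithmetic. The sketch below treats the 21 tests as independent, which they are not (they share group means and the pooled variance estimate), so the probability it prints is only an illustrative approximation. The expected count of false rejections, by contrast, does not depend on independence.

```python
from math import comb

alpha = 0.05
equal_groups = 7  # the seven treatments whose true means are identical

# Number of pairwise comparisons among the equal means: 7 choose 2 = 21.
n_pairs = comb(equal_groups, 2)

# ASSUMPTION (for illustration only): the 21 tests are independent.
# In reality they are correlated, so this is an approximation.
p_at_least_one = 1 - (1 - alpha) ** n_pairs

# Expected number of false rejections; this holds with or without independence.
expected_false = alpha * n_pairs

print(f"pairs: {n_pairs}")
print(f"approx. P(at least one false rejection): {p_at_least_one:.2f}")
print(f"expected number of false rejections: {expected_false:.2f}")
```

Under the independence approximation the chance of at least one false rejection is about 0.66, and on average about one of the 21 tests rejects falsely.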

Bonferroni Method

The Bonferroni method for \(g\) comparisons uses \(\alpha / g\) instead of \(\alpha\) as the significance level for each of the \(g\) comparisons.

Comparing the Bonferroni Procedure with Fisher's LSD

Fisher’s LSD method is an alternative to other pairwise comparison methods (for post-ANOVA analysis). This method controls the error rate at level \(\alpha\) for each pairwise comparison separately, so it does not control the family error rate. The procedure uses the t statistic for testing \(H_0 \colon \mu_i = \mu_j\) for all pairs \(i\) and \(j\).


The Bonferroni method, in contrast, does control the family error rate by performing each pairwise comparison test at the \(\alpha/g\) level of significance, where \(g\) is the number of pairwise comparisons. Hence, the Bonferroni confidence intervals for differences of the means are wider than those of Fisher’s LSD. In addition, it can be easily shown that the p-value of each pairwise comparison calculated by the Bonferroni method is \(g\) times the p-value calculated by Fisher’s LSD method (capped at 1).
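
The relationship between the two sets of p-values can be sketched in a few lines. The p-values below are hypothetical and chosen only to show a comparison that is significant under Fisher's LSD but not under Bonferroni. The function name `bonferroni_adjust` is an illustrative choice, not taken from any particular package.

```python
def bonferroni_adjust(p_values):
    """Multiply each unadjusted p-value by g (the number of tests), capping at 1."""
    g = len(p_values)
    return [min(1.0, g * p) for p in p_values]

# HYPOTHETICAL unadjusted (Fisher's LSD) p-values for g = 3 pairwise comparisons.
lsd_p = [0.004, 0.030, 0.400]
bonf_p = bonferroni_adjust(lsd_p)

alpha = 0.05
reject_lsd = [p < alpha for p in lsd_p]    # each test judged on its own
reject_bonf = [p < alpha for p in bonf_p]  # family error rate controlled

for p, pb, rl, rb in zip(lsd_p, bonf_p, reject_lsd, reject_bonf):
    print(f"LSD p = {p:.3f} (reject: {rl})   Bonferroni p = {pb:.3f} (reject: {rb})")
```

Comparing the adjusted p-value with \(\alpha\) is equivalent to comparing the unadjusted p-value with \(\alpha / g\).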

Tukey's Studentized Range

Tukey’s Studentized Range procedure considers the differences among all pairs of means, each divided by the estimated standard deviation of a mean, and compares them with the tabled critical values provided in Appendix VII. Why is it called the studentized range? The denominator uses an estimated standard deviation; hence, the statistic is studentized, like Student's t statistic. The Tukey procedure assumes all \(n_i\) are equal, say to \(n\).


Comparing the Tukey Procedure with the Bonferroni Procedure

The Bonferroni procedure is a good all-around tool, but for all pairwise comparisons the Tukey studentized range procedure is slightly better, as we show here.

The studentized range is the distribution of the difference between the maximum and the minimum of the sample means divided by the standard error of a single mean. A critical value from the t-distribution, with or without the Bonferroni adjustment for \(g\) comparisons, is not directly comparable to a studentized range critical value, because the two statistics have different denominators. The Tukey statistic, \(q = (\bar{y}_{\max} - \bar{y}_{\min}) / \sqrt{MS_E / n}\), uses the standard error of a single mean, whereas the t statistic, \(t = (\bar{y}_{i.} - \bar{y}_{j.}) / \sqrt{2 MS_E / n}\), uses the standard error of the difference between two means. For the same pair of means, therefore, \(q = \sqrt{2}\, t\).

Example 3.3: Tukey vs. Bonferroni approaches

Here is an example we can work out. Let's say we have 5 means, so \(a = 5\), we let \(\alpha = 0.05\), and the total number of observations is \(N = 35\), so each group has seven observations and the error degrees of freedom are \(N - a = 30\).

If we look at the studentized range distribution for 5 means and 30 degrees of freedom, we find a critical value of 4.11.

If we took a Bonferroni approach, we would use \(g = 5 \times 4 / 2 = 10\) pairwise comparisons since \(a = 5\). Thus, again for an \(\alpha = 0.05\) two-sided test, all we need to look at is the t-distribution with upper-tail area \(\alpha / (2g) = 0.0025\) and \(N - a = 30\) df. Looking at the t-table we get the value 3.03. However, to compare with the Tukey Studentized Range statistic, we need to multiply the tabled critical value by \(\sqrt{2} = 1.414\); therefore \(3.03 \times 1.414 = 4.28\), which is slightly larger than the 4.11 obtained from the Tukey table.

The point that we want to make is that the Bonferroni procedure is slightly more conservative than the Tukey procedure, since the Tukey procedure is exact in this situation whereas Bonferroni is only approximate.

Tukey's procedure is exact for equal sample sizes. However, there is an approximate procedure called the Tukey-Kramer test for unequal \(n_i\).
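
The table lookups in this example can be reproduced numerically. The sketch below assumes SciPy 1.7 or later, which provides `scipy.stats.studentized_range`. Computed quantiles may differ from printed table values in the last digit because of rounding.

```python
from math import sqrt
from scipy.stats import studentized_range, t

a, N, alpha = 5, 35, 0.05
df_error = N - a          # 30 error degrees of freedom
g = a * (a - 1) // 2      # 10 pairwise comparisons

# Tukey: upper-alpha quantile of the studentized range for a means, df_error df.
q_tukey = studentized_range.ppf(1 - alpha, a, df_error)

# Bonferroni: two-sided t critical value with upper-tail area alpha / (2g).
t_bonf = t.ppf(1 - alpha / (2 * g), df_error)

# Put the Bonferroni value on the studentized range scale (q = sqrt(2) * t).
q_bonf = sqrt(2) * t_bonf

print(f"Tukey critical value:               {q_tukey:.2f}")
print(f"Bonferroni t critical value:        {t_bonf:.2f}")
print(f"Bonferroni on the q scale:          {q_bonf:.2f}")
print(f"Bonferroni more conservative: {q_bonf > q_tukey}")
```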

If you are looking at all pairwise comparisons then Tukey's exact procedure is probably the best procedure to use. The Bonferroni, however, is a good general procedure.

Contrasts of Means

A pairwise comparison is just one example of a contrast of the means. A general contrast can be written as a set of coefficients of the means that sum to zero. This will often involve more than just a pair of treatments. In general, we can write a contrast to make any comparison we like. We will also consider sets of orthogonal contrasts.

Example 3.4: Gas Mileage

We want to compare the gas mileage on a set of cars: Ford Escape (hybrid), Toyota Camry, Toyota Prius (hybrid), Honda Accord, and the Honda Civic (hybrid). A consumer testing group wants to test each of these cars for gas mileage under certain conditions. They take n prescribed test runs and record the mileage for each vehicle.

They first need to define some contrasts among these means. A contrast is a set of coefficients that provides a meaningful comparison. Then they can test and estimate these contrasts. For the first contrast, \(C_1\), they could compare the American brand to the foreign brands. We need the coefficients of each contrast to sum to 0, and for convenience we use only integers. Other possibilities are comparing Toyota to Honda (that is \(C_2\)), or hybrids to non-hybrids (that is \(C_3\)).

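| Contrast | Ford Escape | Toyota Camry | Toyota Prius | Honda Accord | Honda Civic |
| --- | --- | --- | --- | --- | --- |
| Mean | \(Y_{1.}\) | \(Y_{2.}\) | \(Y_{3.}\) | \(Y_{4.}\) | \(Y_{5.}\) |
| \(C_1\) | 4 | -1 | -1 | -1 | -1 |
| \(C_2\) | 0 | -1 | -1 | 1 | 1 |
| \(C_3\) | 2 | -3 | 2 | -3 | 2 |
| \(C_4\) | 0 | -1 | 1 | 0 | 0 |
| \(C_5\) | 0 | 0 | 0 | -1 | 1 |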

So the first three sets of contrast coefficients specify the comparisons described, and \(C_4\) and \(C_5\) are comparisons within the brands that have two models.

After we develop a set of contrasts, we can test these contrasts or estimate them. We can also calculate a confidence interval for the true contrast of the means as the estimated contrast ± the t-distribution critical value times the estimated standard deviation of the contrast. See equation 3-30 in the text.
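
A minimal sketch of this calculation for equal sample sizes follows. It assumes the standard result that the estimated contrast \(\sum c_i \bar{y}_{i.}\) has estimated standard deviation \(\sqrt{MS_E \sum c_i^2 / n}\). The group means, \(MS_E\), and \(n\) below are hypothetical numbers invented for illustration, and `contrast_ci` is an illustrative name rather than a library function.

```python
from math import sqrt
from scipy.stats import t

def contrast_ci(coefs, means, mse, n, df_error, alpha=0.05):
    """Estimate a contrast and its two-sided (1 - alpha) confidence interval.

    Assumes equal sample size n per group and a pooled error variance mse.
    """
    if sum(coefs) != 0:
        raise ValueError("contrast coefficients must sum to zero")
    estimate = sum(c * m for c, m in zip(coefs, means))
    se = sqrt(mse * sum(c * c for c in coefs) / n)
    t_crit = t.ppf(1 - alpha / 2, df_error)
    return estimate, (estimate - t_crit * se, estimate + t_crit * se)

# HYPOTHETICAL mean mpg for Escape, Camry, Prius, Accord, Civic.
means = [30.0, 28.0, 48.0, 27.0, 40.0]
c3 = [2, -3, 2, -3, 2]  # hybrid vs. non-hybrid, from the table above

# HYPOTHETICAL design: n = 6 runs per car, so df_error = 5 * (6 - 1) = 25.
est, (lo, hi) = contrast_ci(c3, means, mse=4.0, n=6, df_error=25)
print(f"estimate = {est:.1f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Because \(C_3\) uses integer coefficients 2 and -3, the estimate is on a scale of six times the difference between the hybrid and non-hybrid averages. Dividing the coefficients by 6 would give the difference directly without changing the test.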

Concerning Sets of Multiple Contrasts

Scheffé’s Method provides \(\alpha\text{-level}\) protection for all possible contrasts, which is especially useful when we don't know in advance how many contrasts we will examine. The test is quite conservative because it is valid simultaneously for all possible contrasts of the means. The Scheffé procedure is therefore equivalent to the F-test: if the F-test rejects, there will be some contrast whose confidence interval does not contain zero.
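
To see how conservative Scheffé is for pairwise comparisons, the sketch below puts several procedures on a common scale: the multiplier applied to the standard error of a difference of two means. It assumes the standard Scheffé multiplier \(\sqrt{(a-1) F_{\alpha,\, a-1,\, N-a}}\), which is not stated in this section, and reuses the setting of Example 3.3 (\(a = 5\), 30 error df). It requires SciPy 1.7 or later for `studentized_range`.

```python
from math import sqrt
from scipy.stats import f, t, studentized_range

a, df_error, alpha = 5, 30, 0.05
g = a * (a - 1) // 2  # 10 pairwise comparisons

# Critical multipliers for |difference of two means| / SE(difference).
unadjusted = t.ppf(1 - alpha / 2, df_error)                        # Fisher's LSD
tukey = studentized_range.ppf(1 - alpha, a, df_error) / sqrt(2)    # Tukey, on the t scale
bonferroni = t.ppf(1 - alpha / (2 * g), df_error)                  # Bonferroni, g = 10
scheffe = sqrt((a - 1) * f.ppf(1 - alpha, a - 1, df_error))        # Scheffe, all contrasts

for name, value in [("Fisher's LSD", unadjusted), ("Tukey", tukey),
                    ("Bonferroni", bonferroni), ("Scheffe", scheffe)]:
    print(f"{name:>12}: {value:.2f}")
```

The ordering of the multipliers reflects the size of the family each procedure protects: a single test, all pairs, any 10 prespecified comparisons, and all possible contrasts.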

What is an orthogonal contrast?

Two contrasts are orthogonal if the sum of the products of their corresponding coefficients is zero. Each contrast in an orthogonal set is also orthogonal to the overall mean, since its coefficients sum to zero.

Look at the table above and locate which contrasts are orthogonal.

There always exist a-1 mutually orthogonal contrasts of a means. When the sample sizes are equal, the sums of squares for these contrasts add up to the sum of squares due to treatment. In other words, any set of a-1 orthogonal contrasts partitions the variation among treatments, so that the variation corresponding to those a-1 contrasts equals the total sum of squares among treatments. When the sample sizes are not equal, the definition of orthogonal contrasts involves the sample sizes.
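
The orthogonality check for the gas mileage contrasts can be sketched as follows, assuming equal sample sizes so that the plain sum of products applies. The dictionary simply transcribes the coefficient table from Example 3.4.

```python
from itertools import combinations

# Coefficients from the Example 3.4 table, in the order
# Escape, Camry, Prius, Accord, Civic.
contrasts = {
    "C1": [4, -1, -1, -1, -1],
    "C2": [0, -1, -1, 1, 1],
    "C3": [2, -3, 2, -3, 2],
    "C4": [0, -1, 1, 0, 0],
    "C5": [0, 0, 0, -1, 1],
}

# Every contrast must have coefficients that sum to zero.
sums_to_zero = all(sum(c) == 0 for c in contrasts.values())

def dot(u, v):
    """Sum of products of corresponding coefficients."""
    return sum(x * y for x, y in zip(u, v))

# Two contrasts are orthogonal when that sum of products is zero.
orthogonal_pairs = [
    (p, q)
    for (p, cp), (q, cq) in combinations(contrasts.items(), 2)
    if dot(cp, cq) == 0
]

print("all sum to zero:", sums_to_zero)
print("orthogonal pairs:", orthogonal_pairs)
```

The check shows that \(C_1\), \(C_2\), \(C_4\), and \(C_5\) are mutually orthogonal, giving a full set of a-1 = 4 orthogonal contrasts, while \(C_3\) is orthogonal only to \(C_2\).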

Dunnett's Procedure

Dunnett’s procedure is another multiple comparison procedure specifically designed to compare each treatment to a control. If we have a groups, let the last one be a control group and the first a - 1 be treatment groups. We want to compare each of these treatment groups to this one control. Therefore, we will have a - 1 contrasts or a - 1 pairwise comparisons. To perform multiple comparisons on these a - 1 contrasts we use special tables for finding hypothesis test critical values, derived by Dunnett.

Comparing Dunnett’s procedure to the Bonferroni procedure

We can compare the Bonferroni approach to the Dunnett procedure. The Dunnett procedure calculates the difference of means for the control versus treatment one, control versus treatment two, and so on up to treatment a - 1, which provides a - 1 pairwise comparisons.

So, we now consider an example where we have six groups, a = 6, made up of t = 5 treatments and one control, with n = 6 observations per group. Dunnett's procedure gives the critical point for comparing the difference of means. From the table, for an \(\alpha = 0.05\) two-sided comparison we get d(a-1, f) = 2.66, where a - 1 = 5 and f = df = 30.

Using the Bonferroni approach, if we look at the t-distribution for g = 5 comparisons and a two-sided test, that is, an upper-tail area of \(\alpha / (2g) = 0.005\), with 30 degrees of freedom for error, we get 2.75.

Comparing the two, we can see that the Bonferroni approach is a bit more conservative. Dunnett's is an exact procedure for comparing a control to a-1 treatments. Bonferroni is a general tool but not exact. However, there is not much of a difference in this example.
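
The Bonferroni side of this comparison is easy to reproduce. In the sketch below the Dunnett value is not computed; it is simply the tabled 2.66 quoted in the text.

```python
from scipy.stats import t

a, n, alpha = 6, 6, 0.05
g = a - 1                # five treatment-versus-control comparisons
df_error = a * (n - 1)   # 6 groups x 5 df each = 30

# Bonferroni two-sided critical value: upper-tail area alpha / (2g) = 0.005.
t_bonf = t.ppf(1 - alpha / (2 * g), df_error)

# Dunnett's two-sided critical value d(5, 30), as quoted from the table in the text.
d_dunnett_tabled = 2.66

print(f"Bonferroni: {t_bonf:.2f}   Dunnett (tabled): {d_dunnett_tabled:.2f}")
print(f"ratio of interval widths: {t_bonf / d_dunnett_tabled:.3f}")
```

Both values multiply the same standard error of a difference, so their ratio is also the ratio of the confidence interval widths.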

Fisher's LSD has the practical advantage of always using the same measuring stick, the unadjusted t-test. It is widely understood that if you run many such tests at \(\alpha = 0.05\), then about one in every 20 tests of a true null hypothesis will reject by chance. Accepting that is one way to handle the uncertainty. All of the other methods protect you from making too many Type I errors across a family of inferences, whether you are doing hypothesis tests or constructing confidence intervals. But in your lifetime, how many tests are you going to do?

So, in a sense, you have to ask yourself: what is the set of tests over which I want protection against making a Type I error? In Fisher's LSD procedure each test stands on its own, so it is not really a multiple comparisons procedure. If you are looking for any type of difference and you don't know how many comparisons you are going to end up making, you should probably use Scheffé to protect you against all of them. But if you know you want all pairwise comparisons and nothing else, then Tukey's would be best. If you're comparing a set of treatments against a control, then Dunnett's would be best.

There is a whole family of step-wise procedures which are now available, but we will not consider them here. Each can be shown to be better in certain situations. Another approach to this problem is called False Discovery Rate control. It is used when there are hundreds of hypotheses, a situation that occurs, for example, in testing gene expression of all genes in an organism, or differences in pixel intensities for pixels in a set of images. The multiple comparisons procedures discussed above all control the probability of making even one false significant call. But when there are hundreds of tests, we might prefer to make a few false significant calls if doing so greatly increases our power to detect the true differences. False Discovery Rate methods attempt to control the expected proportion of false significant calls among the tests declared significant.
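
As one concrete illustration, the sketch below implements the Benjamini-Hochberg step-up procedure. This is one commonly used False Discovery Rate method, but the text does not name a specific method, so the choice is an assumption made here. The p-values are hypothetical, and the guarantee of the procedure relies on the tests being independent or positively dependent.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= k * q / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# HYPOTHETICAL p-values from m = 10 tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

bh = benjamini_hochberg(p, q=0.05)
bonf = [pv <= 0.05 / len(p) for pv in p]  # Bonferroni at the same overall level

print("Benjamini-Hochberg rejections:", sum(bh))
print("Bonferroni rejections:        ", sum(bonf))
```

With these numbers the False Discovery Rate rule declares two tests significant while Bonferroni declares one, which shows the gain in power that comes from tolerating a controlled proportion of false calls.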
