2: ANOVA Foundations

Overview

In this lesson, we will begin to learn about the notation and formulas used to compute the fundamental quantities for ANOVA-related hypothesis testing as well as for mean comparison procedures. The application of these statistical procedures will be illustrated using the greenhouse example from Lesson 1.

Objectives

Upon completion of this lesson, you should be able to:

  1. Perform basic computations for Single Factor ANOVA and interpret the results.
  2. Carry out the Tukey pairwise mean comparison method.
  3. Learn about other pairwise mean comparison methods.
  4. Conduct a contrast analysis that accommodates the comparison of group means.

2.1 - Building the ANOVA Table: Notation

Recall that the alternative hypothesis is usually what we suspect to be true and hope to conclude. Typically we are looking to find differences among at least one pair of our treatment means. Because of this, the null hypothesis (the opposite of the alternative) states that there are no differences among the treatment group means.

The idea behind ANOVA methods is to compare different sources of variability: between sample variability and within sample variability. To test the Null hypothesis, traditionally written as \(H_0 \colon \mu_1 = \mu_2 = ⋯ = \mu_T\), we need to compute a test (F) statistic that compares the between sample variability to within sample variability.

To understand the computation of this statistic it is helpful to look at the ANOVA table. The table below is an example of a blank (no entries yet) ANOVA table.

ANOVA

Source     df   SS   MS   F
Treatment
Error
Total

To define the elements of the table and fill in these quantities, let’s return to our example data (Lesson1 Data) for the hypothetical greenhouse experiment:

Control F1 F2 F3
21 32 22.5 28
19.5 30.5 26 27.5
22.5 25 28 31
21.5 27.5 27 29.5
20.5 28 26.5 30
21 28.6 25.2 29.2

Notation

Each observation in the dataset can be referenced by two indicator subscripts, i and j, as \(Y_{ij}\).

For those of you not familiar with this notation, we use Y to indicate that it is a response variable. The subscript i refers to the \(i^{th}\) level of the treatment; our example has 4 treatments, so i takes on the values 1, 2, 3, and 4. The subscript j refers to the \(j^{th}\) observation; our example has 6 observations for each treatment, so j takes the values 1, 2, 3, 4, 5, and 6. It is important to note that the \(j^{th}\) observation occurs within the \(i^{th}\) treatment level. For example, it can be seen in the table below that \(Y_{4,2} = 27.5\).

subscripts i = 1 i = 2 i = 3 i = 4
Control F1 F2 F3
j = 1 21 32 22.5 28
j = 2 19.5 30.5 26 27.5
j = 3 22.5 25 28 31
j = 4 21.5 27.5 27 29.5
j = 5 20.5 28 26.5 30
j = 6 21 28.6 25.2 29.2

We now can define the various means explicitly using these subscripts. The overall or Grand Mean is given by

\(\text{Grand Mean }=\bar{Y}_{..}\)

where the dots indicate that the quantity has been averaged over that subscript. For the Grand Mean, we have averaged over all j observations in all i treatment levels. Alternatively, the treatment means are given by

\(\text{Treatment Mean }=\bar{Y}_{i.}\)

indicating that we have averaged over the j observations in each of the i treatment levels.

The means can be found in the output from the summary procedure generated in SAS, seen below. These and other coding details will be discussed in Lesson 3.

Summary Output for Lesson 1

Fert _Type_ _FREQ_ mean
  0 24 26.1667
Control 1 6 21.0000
F1 1 6 28.6000
F2 1 6 25.8667
F3 1 6 29.2000

In the output, we see the column heading _TYPE_. The summary procedure in SAS calculates all possible means when specified, thus the _TYPE_ indicates what mean is being computed. _TYPE_ = 0 is the Grand Mean, and we can see this from the number of observations (given by _FREQ_) of 24. Further, each of the treatment means is listed as _TYPE_ = 1, and we confirm that 6 replications were made for each treatment level (remember that j took on values 1 through 6). Note that SAS automatically has ordered the treatment levels alphabetically.

In this example, the grand mean and treatment means are all we need to compute the quantities for the ANOVA table, continued in the next section.
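The grand mean and treatment means above are easy to reproduce numerically. The lesson obtains them from SAS's summary procedure; the sketch below does the same arithmetic in plain Python, with the `data` dictionary simply re-entering the greenhouse table:

```python
# Greenhouse data: plant heights for the control and 3 fertilizer levels.
data = {
    "Control": [21, 19.5, 22.5, 21.5, 20.5, 21],
    "F1": [32, 30.5, 25, 27.5, 28, 28.6],
    "F2": [22.5, 26, 28, 27, 26.5, 25.2],
    "F3": [28, 27.5, 31, 29.5, 30, 29.2],
}

# Grand mean: average over all j observations in all i treatment levels.
all_obs = [y for ys in data.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)

# Treatment means: average over the j observations within each level i.
trt_means = {trt: sum(ys) / len(ys) for trt, ys in data.items()}

print(round(grand_mean, 4))   # 26.1667, matching _TYPE_ = 0 in the SAS output
print({t: round(m, 4) for t, m in trt_means.items()})
```

The printed values match the _TYPE_ = 0 and _TYPE_ = 1 rows of the SAS summary output.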


2.2 - Computing Quantities for the ANOVA Table

When working with ANOVA, we start with the total variability in the response variable and divide or "partition" it into different parts: the variability due to our treatment (i.e. the between sample variability) and the leftover or residual variability (i.e. the within sample variability). The variability that is due to our treatment we hope is significantly large whereas the variability in the response that is leftover can be thought of as the nuisance, "error", or "residual" variability.

To imagine this further, think about the data storage capacity of a computer. If you have 8GB of storage total, you can ask your computer to show the types of files that are occupying the storage. We can visualize the response variability similarly, seen below. The ANOVA model is (in a very elementary fashion) going to compare the variability due to the treatment to the variability left over.

Figure: the total response variability depicted as a storage pie, partitioned into the variability due to the treatment and the residual variability.

From elementary statistics, when we think of computing a variance of a random variable (say X), we use the expression:

\(\text{variance }=\dfrac{\sum(X_i-\bar{X})^2}{N-1}=\dfrac{SS}{df}\)

The numerator of this expression is referred to as the Sum of Squares, or Sum of Squared deviations from the mean, or simply SS. (If you don't recognize this then we suggest you sharpen your introductory statistics skills!) The denominator is the degrees of freedom, (N - 1), or df.

ANOVA Table Rules

  1. Total SS = sum of the SS of all Sources (i.e. Total SS = Treatment SS + Error SS)
  2. Total df = sum of dfs of all Sources
  3. MS = SS/df
  4. \(F_{calculated}=\dfrac{\text{Treatment  MS}}{\text{Error  MS}}\), with numerator df = number of treatments - 1 and denominator df = error df

The ANOVA table is set up to generate quantities analogous to the simple variance calculation above. In our greenhouse experiment example:

  1. We start by considering the TOTAL variability in the response variable. This is done by calculating the SSTotal

    \(\text{Total SS }=\sum_{i}\sum_{j}(Y_{ij}-\bar{Y}_{..})^2\)

    = 312.47.

    The degrees of freedom for the Total SS is N - 1 = 24 – 1 = 23, where N is the total sample size.

  2. Our next step determines how much of the variability in Y is accounted for by our treatment. We now calculate SSTreatment or SSTrt:

    \(\text{Treatment SS }=\sum_{i}n_i(\bar{Y_{i.}}-\bar{Y}_{..})^2\)

    Note: The sum of squares for the treatment is the deviation of the group mean from the grand mean. So in some sense, we are "aggregating" all of the responses from that group and representing the "group effect" as the group mean.

    For our example:

    \begin{aligned}\text{Treatment SS }= 6*(21.0-26.1667)^2 + 6*(28.6-26.1667)^2 \\ +6*(25.8667-26.1667)^2 + 6*(29.2-26.1667)^2 = 251.44\end{aligned}

    Note that in this case we have equal numbers of observations (6) per treatment level, and it is, therefore, a balanced ANOVA.

  3. Finally, we need to determine how much variability is "leftover". This is the Error or Residual sums of squares. SSError or SSE is calculated by subtraction:

    \(\text{Error SS }=\sum_{i}\sum_{j}(Y_{ij}-\bar{Y}_{i.})^2 = \text{ Total SS - Treatment SS}\)

    \(SS_{Error} = 312.47 – 251.44 = \mathbf{61.033}\)

    Note here that the "leftover" is really the deviation of any score from its group mean.
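The three sums of squares above, and the partitioning rule Total SS = Treatment SS + Error SS, can be verified numerically. A minimal sketch in plain Python (re-entering the greenhouse data):

```python
# Greenhouse data: plant heights for the control and 3 fertilizer levels.
data = {
    "Control": [21, 19.5, 22.5, 21.5, 20.5, 21],
    "F1": [32, 30.5, 25, 27.5, 28, 28.6],
    "F2": [22.5, 26, 28, 27, 26.5, 25.2],
    "F3": [28, 27.5, 31, 29.5, 30, 29.2],
}
all_obs = [y for ys in data.values() for y in ys]
grand = sum(all_obs) / len(all_obs)

# Total SS: squared deviations of every observation from the grand mean.
ss_total = sum((y - grand) ** 2 for y in all_obs)

# Treatment SS: n_i-weighted squared deviations of group means from the grand mean.
ss_trt = sum(len(ys) * (sum(ys) / len(ys) - grand) ** 2 for ys in data.values())

# Error SS: the "leftover" -- deviations of each score from its own group mean.
ss_error = sum((y - sum(ys) / len(ys)) ** 2 for ys in data.values() for y in ys)

print(round(ss_total, 2), round(ss_trt, 2), round(ss_error, 2))  # 312.47 251.44 61.03
```

The three quantities agree with the hand calculations, and the first equals the sum of the other two.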

We can now fill in the table:

ANOVA

Source     df           SS       MS   F
Treatment  T - 1 = 3    251.44
Error      23 - 3 = 20  61.033
Total      N - 1 = 23   312.47

We have T treatment levels and so we use T - 1 for the df for the treatment. In our example, there are 4 treatment levels (the control and the 3 fertilizers) so T = 4 and T - 1 = 4 - 1 or 3. Finally, we obtain the error df by subtraction as we did with the SS.

The Mean Squares (MS) can now be calculated as:

\(MS_{Trt}=\dfrac{SS_{Trt}}{df_{Trt}}=\dfrac{251.44}{3}=83.813\)

and

\(MS_{Error}=\dfrac{SS_{Error}}{df_{Error}}=\dfrac{61.033}{20}=3.052\)

It is important to note that \(MS_{\text{Error}}\) is commonly referred to as \(MSE\). Also notice we do not need to calculate the \(MS_{\text{Total}}\).

ANOVA

Source     df   SS       MS      F
Treatment  3    251.44   83.813
Error      20   61.033   3.052
Total      23   312.47

Finally, we can compute the F statistic for our ANOVA. Conceptually we are comparing the ratio of the variability due to our treatment (remember we expect this to be relatively large) to the variability leftover, or due to error (and of course, since this is an error we want this to be small). Following this logic, we expect our F to be a large number. If we go back and think about the computer storage space we can picture most of the storage space taken up by our treatment, and less of it taken up by error. In our example, the F is calculated as:

\(F=\dfrac{MS_{Trt}}{MS_{Error}}=\dfrac{83.813}{3.052}=27.46\)

ANOVA

Source     df   SS       MS      F
Treatment  3    251.44   83.813  27.46
Error      20   61.033   3.052
Total      23   312.47

So how do we know if the F is "large enough" to conclude we have a significant amount of variability due to our treatment? We look up the critical value of F and compare it to the value we calculated. Specifically the critical F is \(F_\alpha = F_{(0.05, 3, 20)} = 3.10\). The critical value can be found using tables or technology. An example using SAS is seen below. 

Finding a critical value of F

Using a table:

Using SAS:

data Fvalue;
  q = finv(0.95, 3, 20);
  put q=;
run;
proc print data=work.Fvalue;
run;

The Print Procedure
Data Set WORK.FVALUE

Obs  q
1    3.09839


Most F tables actually index this value as \(1 - \alpha = .95\)

Figure: the F distribution, with the critical value \(F_\alpha = 3.1\) marking the rejection region and the p-value shown as the area under the curve to the right of \(F_{\text{calculated}}\).

The \(F_{\text{calculated}}\) > \(F_\alpha\) so we reject \(H_0\) in favor of the alternative \(H_A\). The p-value (which we don't typically calculate by hand) is the area under the curve to the right of the \(F_{\text{calculated}}\) and is the way the process is reported in statistical software. Note that in the unlikely event that the \(F_{\text{calculated}}\) is exactly equal to the \(F_{\alpha}\) then the \(\text{p-value} = \alpha\). As the calculated F statistic increases beyond the \(F_{\alpha}\) and we go further into the rejection region, the area under the curve (hence the p-value) gets smaller and smaller. This leads us to the analogous decision rule: If the p-value is \(<\alpha\) then we reject \(H_0\).
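The SAS finv lookup above has equivalents in most environments. As a sketch, here is the same critical-value and p-value calculation in Python, assuming SciPy is available (`scipy.stats.f`):

```python
from scipy.stats import f

alpha, df_num, df_den = 0.05, 3, 20

# Critical value: the 0.95 quantile, matching SAS's finv(0.95, 3, 20).
f_crit = f.ppf(1 - alpha, df_num, df_den)

# p-value: area under the F(3, 20) density to the right of the calculated F.
f_calc = 27.46
p_value = f.sf(f_calc, df_num, df_den)

print(round(f_crit, 5))   # 3.09839
print(p_value < alpha)    # True: reject H0
```

Since \(F_{\text{calculated}} = 27.46\) far exceeds the critical value, the p-value is essentially zero.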


2.3 - Tukey Test for Pairwise Mean Comparisons

If (and only if) we reject the null hypothesis, we then conclude that at least one group mean differs from another (importantly, we do NOT conclude that all the group means differ).

If we reject the null, then we want to know WHICH group, or groups, are different. In our example we are not satisfied knowing at least one treatment level is different; we want to know where the difference is and the nature of the difference. To answer this question, we can follow up the ANOVA with a mean comparison procedure to find out which means differ from each other and which ones do not.

You might think we could not bother with the ANOVA and proceed with a series of t-tests to compare the groups. While that is intuitively simple, it creates inflation of the type I error. How does this inflation of type I error happen? For a single test, the probability of type I error is

\(\alpha=1-(.95) = 0.05\)

The probability of committing a type I error for two simultaneous tests follows from the Multiplication Rule for independent events in probability. Recall that for two independent events A and B, the probability of A and B both occurring is P(A and B) = P(A) * P(B). So for two tests, we have

\(\alpha = 1 - ( (.95)*(.95) ) = 0.0975\)

which is now larger than the \(\alpha\) that we originally set. For our example, we have 6 comparisons, so

\(\alpha = 1 - (.95^6) = 0.2649\)

which is a much larger (inflated) probability of committing a type I error than we originally set (0.05). The multiple comparison procedures are designed to compensate for the type I error inflation (although each does so in a slightly different way).
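The inflation calculation generalizes to any number of independent tests. A quick sketch:

```python
def familywise_alpha(m, alpha=0.05):
    """P(at least one type I error) across m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

print(round(familywise_alpha(1), 4))  # 0.05
print(round(familywise_alpha(2), 4))  # 0.0975
print(round(familywise_alpha(6), 4))  # 0.2649
```

With the 6 pairwise comparisons of our example, the chance of at least one false rejection has risen to about 26%.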

There are several multiple comparison procedures that can be employed, but we will start with the one most commonly used, the Tukey procedure. In the Tukey procedure, we compute a "yardstick" value ( \(w\)) based on the \(MS_{\text{Error}}\) and the number of means being compared. If any two means differ by more than the Tukey \(w\) value, then we conclude they are significantly different.

  1. Step 1: Compute Tukey’s \(w\) value

    \(w=q_{\alpha(p, df_{Error})}\cdot s_{\bar{Y}}\)

    where \(q_\alpha\) is obtained from a Table of Tukey \(q\) values,

    df for Error Term  \(\alpha\)  p = Number of Treatments
                             2     3     4     5     6     7     8     9     10
    5                  0.05  3.64  4.60  5.22  5.67  6.03  6.33  6.58  6.80  6.99
                       0.01  5.70  6.98  7.80  8.42  8.91  9.32  9.67  9.97  10.24
    6                  0.05  3.46  4.34  4.90  5.30  5.63  5.90  6.12  6.32  6.49
                       0.01  5.24  6.33  7.03  7.56  7.97  8.32  8.61  8.87  9.10
    7                  0.05  3.34  4.16  4.68  5.06  5.36  5.61  5.82  6.00  6.16
                       0.01  4.95  5.92  6.54  7.01  7.37  7.68  7.94  8.17  8.37
    8                  0.05  3.26  4.04  4.53  4.89  5.17  5.40  5.60  5.77  5.92
                       0.01  4.75  5.64  6.20  6.62  6.96  7.24  7.47  7.68  7.86
    9                  0.05  3.20  3.95  4.41  4.76  5.02  5.24  5.43  5.59  5.74
                       0.01  4.60  5.43  5.96  6.35  6.66  6.91  7.13  7.33  7.49
    10                 0.05  3.15  3.88  4.33  4.65  4.91  5.12  5.30  5.46  5.60
                       0.01  4.48  5.27  5.77  6.14  6.43  6.67  6.87  7.05  7.21
    11                 0.05  3.11  3.82  4.26  4.57  4.82  5.03  5.20  5.35  5.49
                       0.01  4.39  5.15  5.62  5.97  6.25  6.48  6.67  6.84  6.99
    12                 0.05  3.08  3.77  4.20  4.51  4.75  4.95  5.12  5.27  5.39
                       0.01  4.32  5.05  5.50  5.84  6.10  6.32  6.51  6.67  6.81
    13                 0.05  3.06  3.73  4.15  4.45  4.69  4.88  5.05  5.19  5.32
                       0.01  4.26  4.96  5.40  5.73  5.98  6.19  6.37  6.53  6.67
    14                 0.05  3.03  3.70  4.11  4.41  4.64  4.83  4.99  5.13  5.25
                       0.01  4.21  4.89  5.32  5.63  5.88  6.08  6.26  6.41  6.54
    15                 0.05  3.01  3.67  4.08  4.37  4.59  4.78  4.94  5.08  5.20
                       0.01  4.17  4.84  5.25  5.56  5.80  5.99  6.16  6.31  6.44
    16                 0.05  3.00  3.65  4.05  4.33  4.56  4.74  4.90  5.03  5.15
                       0.01  4.13  4.79  5.19  5.49  5.72  5.92  6.08  6.22  6.35
    17                 0.05  2.98  3.63  4.02  4.30  4.52  4.70  4.86  4.99  5.11
                       0.01  4.10  4.74  5.14  5.43  5.66  5.85  6.01  6.15  6.27
    18                 0.05  2.97  3.61  4.00  4.28  4.49  4.67  4.82  4.96  5.07
                       0.01  4.07  4.70  5.09  5.38  5.60  5.79  5.94  6.08  6.20
    19                 0.05  2.96  3.59  3.98  4.25  4.47  4.65  4.79  4.92  5.04
                       0.01  4.05  4.67  5.05  5.33  5.55  5.73  5.89  6.02  6.14
    20                 0.05  2.95  3.58  3.96  4.23  4.45  4.62  4.77  4.90  5.01
                       0.01  4.02  4.64  5.02  5.29  5.51  5.69  5.84  5.97  6.09
    24                 0.05  2.92  3.53  3.90  4.17  4.37  4.54  4.68  4.81  4.92
                       0.01  3.96  4.55  4.91  5.17  5.37  5.54  5.69  5.81  5.92
    30                 0.05  2.89  3.49  3.84  4.10  4.30  4.46  4.60  4.72  4.83
                       0.01  3.89  4.45  4.80  5.05  5.24  5.40  5.54  5.65  5.76
    40                 0.05  2.86  3.44  3.79  4.04  4.23  4.39  4.52  4.63  4.74
                       0.01  3.82  4.37  4.70  4.93  5.11  5.27  5.39  5.50  5.60
    and \(p\) = the number of treatment levels,
    \(s_\bar{Y}\) = standard error of a treatment mean = \(\sqrt{MS_{Error}/n}\),
    \(n\) = number of replications.

    For our greenhouse example we get:

    \(w=q_{.05(4,20)}\sqrt{(3.052⁄6)}=3.96(0.7132)=2.824\)

  2. Step 2: Rank the means, calculate differences

    For the greenhouse example, we rank the group means as:

    29.20 28.6 25.87 21.00

    Start with the largest and second-largest means and calculate the difference:

    \(29.20 – 28.60 = 0.60\) which is less than our w of 2.824, so we indicate there is no significant difference between these two means by placing the letter “a” under each.

    29.20 28.6 25.87 21.00
    a a    

    Then calculate the difference between the largest and third-largest means:

    \(29.20 – 25.87 = 3.33\) which exceeds the critical w of 2.824, so we can label the third group mean with a “b” to show this difference is significant.

    29.20 28.6 25.87 21.00
    a a b  

    Now we have to consider whether or not the second-largest and third-largest differ significantly. This is a step that sets up a back and forth process. Here

    \(28.6 – 25.87 = 2.73\), less than the critical w of 2.824, so these two means do not differ significantly. We need to add a “b” to the second group mean to show this:

    29.20 28.6 25.87 21.00
    a ab b  

    Continuing down the line, we now calculate the next difference:

    \(28.60 – 21.00 = 7.60\), exceeding the critical w, so we now add a “c”:

    29.20 28.6 25.87 21.00
    a ab b c

    Again, we need to go back and check to see if the third-largest also differs from the smallest:

    \(25.87 – 21.00 = 4.87\), which it does. So we are done.

These letters can be added to figures summarizing the results of the ANOVA.
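The yardstick-and-ranking procedure above can be sketched in code. This version computes the \(q\) value with SciPy's studentized range distribution (an assumption: it requires SciPy ≥ 1.7, which provides `scipy.stats.studentized_range`); the table lookup gives q = 3.96, and the computed quantile agrees to two decimals:

```python
from itertools import combinations
from math import sqrt
from scipy.stats import studentized_range

mse, n = 3.052, 6       # error mean square and replications from the ANOVA
p, df_error = 4, 20     # number of treatment means and error df

# Step 1: Tukey's yardstick w = q * standard error of a treatment mean.
q = studentized_range.ppf(0.95, p, df_error)   # tables index this as 1 - alpha
w = q * sqrt(mse / n)

# Step 2: compare every pair of ranked means against w.
means = {"F3": 29.20, "F1": 28.60, "F2": 25.87, "Control": 21.00}
for (a, ma), (b, mb) in combinations(means.items(), 2):
    sig = abs(ma - mb) > w
    print(f"{a} vs {b}: diff = {abs(ma - mb):.2f}, significant = {sig}")
```

Only the F3 vs F1 and F1 vs F2 differences fall short of \(w\), reproducing the letter grouping (a, ab, b, c) derived by hand above.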

The Tukey procedure explained above is valid only with equal sample sizes for each treatment level. In the presence of unequal sample sizes, the Tukey–Kramer method, which calculates the standard error for each pairwise comparison separately, is more appropriate. This method is available in SAS, R, and most other statistical software.


2.4 - Other Pairwise Mean Comparison Methods

Although the Tukey procedure is the most widely used multiple comparison procedure, there are many other multiple comparison techniques.

An older approach, no longer offered in many statistical computing packages, is Fisher’s Protected Least Significant Difference (LSD). This method compares all possible pairs of means with t-tests. Unlike an ordinary two-sample t-test, however, it relies on the pooled error estimate from the ANOVA (the MSE). The LSD is calculated as:

\(LSD(\alpha)=t_{\alpha,df}s_{\bar{d}}\)

where \(t_\alpha\) is based on \(\alpha\) and df = error degrees of freedom from the ANOVA table. The standard error for the difference between two treatment means (\(s_{\bar{d}}\) or SE) is calculated as:

\(s_{\bar{d}}=\sqrt{\dfrac{2s^2}{n}}\)

Where n is the number of observations per treatment mean (replications) and \(s^2\) is the MSE from the ANOVA. As in the Tukey method, any pair of means that differ by more than the LSD value differ significantly. The major drawback of this method is that it does not control \(\alpha\) over an entire set of pairwise comparisons (the experiment-wise error rate) and hence is associated with Type 1 inflation.
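As a sketch of the LSD formula (Python, with SciPy's t distribution assumed), using MSE = 3.052, n = 6, and 20 error df from the greenhouse ANOVA:

```python
from math import sqrt
from scipy.stats import t

alpha, df_error = 0.05, 20
mse, n = 3.052, 6

# Standard error of the difference between two treatment means.
se_d = sqrt(2 * mse / n)

# LSD: any pair of means differing by more than this differs significantly.
lsd = t.ppf(1 - alpha / 2, df_error) * se_d
print(round(lsd, 3))   # 2.104
```

Note that 7.600 ± 2.104 reproduces the LSD interval (5.496, 9.704) for the Control vs F1 comparison shown later in this section.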

The following multiple comparison procedures are much more assertive in dealing with Type 1 inflation. In theory, while we can set \(\alpha\) for a single test, the fact that we have T treatment levels means there are T(T - 1)/2 tests (the number of pairs of possible comparisons), and so we need to adjust \(\alpha\) to have the desired confidence level for the set of tests. The Tukey, Bonferroni, and Scheffé methods control the experiment-wise error, but in different ways. All three use a

“multiplier * SE”

but differ in the form of the multiplier.

Contrasts are comparisons involving two or more factor level means (discussed more in the following section). Pairwise mean comparisons can be thought of as a subset of possible contrasts among the means. If only pairwise comparisons are made, the Tukey method will produce the narrowest confidence intervals and is the recommended method. The Bonferroni and Scheffé methods are used for general tests of possible contrasts. The Bonferroni method is better when the number of contrasts being tested is about the same as the number of factor levels. The Scheffé method covers all possible contrasts, and as a result, is the most conservative of all the methods. The drawback for such a highly conservative test, however, is that it becomes more difficult to resolve differences among means, even though the ANOVA would indicate that they exist.

When treatment levels include a control and mean comparisons are restricted to comparing treatment levels against the control level only, Dunnett’s mean comparison method is appropriate. Because fewer comparisons are made in this case, the test provides more power than a test using the full set of all pairwise comparisons (see Section 3.7).

To illustrate these methods, the following output was obtained for the hypothetical greenhouse data of our example. We will be running these types of analyses later in the course.

Fisher’s Least Significant Difference (LSD)

Figure: t grouping for means of Fertilizer (alpha = 0.05), with estimates F3 = 29.2000, F1 = 28.6000, F2 = 25.8667, Control = 21.0000; means covered by the same bar are not significantly different.

Since the estimated means for F1 and F3 are covered by the same colored bar, they are not significantly different using the LSD approach.

Tukey

Figure: Tukey grouping for means of Fertilizer (alpha = 0.05), with estimates F3 = 29.2000, F1 = 28.6000, F2 = 25.8667, Control = 21.0000; means covered by the same bar are not significantly different.

Since the estimated means for F1 and F3 are covered by the same colored bar (red bar), they are not significantly different using Tukey's approach. Similarly, since F1 and F2 are covered by the same colored bar (blue bar) they are not significantly different using Tukey's approach.

Bonferroni

Figure: Bonferroni grouping for means of Fertilizer (alpha = 0.05), with estimates F3 = 29.2000, F1 = 28.6000, F2 = 25.8667, Control = 21.0000; means covered by the same bar are not significantly different.

Results from the Bonferroni approach are similar to the ones from Tukey's approach.

Scheffé

Figure: Scheffé grouping for means of Fertilizer (alpha = 0.05), with estimates F3 = 29.2000, F1 = 28.6000, F2 = 25.8667, Control = 21.0000; means covered by the same bar are not significantly different.

Results from the Scheffé approach are similar to the ones from Tukey's and Bonferroni's approaches.

Dunnett

 
Comparisons significant at the 0.05 level are indicated by ***.

Fertilizer Comparison  Difference Between Means  Simultaneous 95% Confidence Limits
F3 - Control           8.200                     5.638  10.762  ***
F1 - Control           7.600                     5.038  10.162  ***
F2 - Control           4.867                     2.305   7.429  ***

We can see that the LSD method was the most liberal, that is, it indicated the largest number of significant differences between means. In this example, Tukey, Bonferroni, and Scheffé produced the same results. The Dunnett test was consistent with the other 4 methods, and this is not surprising given the small value of the control mean compared to the other treatment levels.

To get a closer look at the results of employing the different methods, we can focus on the differences between the means for each possible pair:

Comparison Difference between means
Control F1 7.6000
Control F2 4.8667
Control F3 8.2000
F1 F2 2.7333
F1 F3 0.6000
F2 F3 3.3333

and compare the 95% confidence intervals produced:

Type LSD Tukey Bonferroni Scheffé Dunnett
Comparison Lower Upper Lower Upper Lower Upper Lower Upper Lower Upper
Control F1 5.496 9.704 4.777 10.423 4.648 10.552 4.525 10.675 5.038 10.162
Control F2 2.763 6.971 2.044 7.690 1.914 7.819 1.792 7.942 2.305 7.429
Control F3 6.096 10.304 5.377 11.023 5.248 11.152 5.125 11.275 5.638 10.762
F1 F2 0.629 4.837 -0.090 5.556 -0.2189 5.686 -0.342 5.808 X X
F1 F3 -1.504 2.704 -2.223 3.423 -2.352 3.552 -2.475 3.675 X X
F2 F3 1.229 5.437 0.510 6.156 0.3811 6.286 0.258 6.408 X X

You can see that the LSD produced the narrowest confidence intervals for the differences between means. Dunnett’s test had the next most narrow intervals (but only compares treatment levels to the control). The Tukey method produced intervals that were similar to those obtained for the LSD, and the Scheffé method produced the broadest confidence intervals.

What does this mean? When we need to be REALLY sure about our results, we should use conservative tests. If you are working in life-and-death situations such as in most clinical trials or bridge building you might want to be surer. If the consequences are less severe you can use a more liberal test, understanding there is more of a chance you might be incorrect (but still able to detect differences). In reality, you need to be consistent with the rigor used in your discipline. While we can't tell you which comparison to use, we can tell you the differences among the tests and the trade-offs for each one.


2.5 - Contrast Analysis

Unsurprisingly, pairwise comparison methods (presented in Sections 2.3 and 2.4) are limited to comparisons made only between pairs of treatment means. A contrast analysis procedure, however, can carry out comparisons in a much wider context, such as comparisons of groups of treatment levels or even tests for trends. In the context of a single-factor ANOVA model, a linear contrast can be defined as a linear combination of the treatment means such that their numerical coefficients add to zero. Mathematically, a contrast can be represented by

\(A=\sum_{i=1}^{T} a_{i} \bar{y}_{i}\)

where \(\bar{y}_{1}, \bar{y}_{2}, \ldots, \bar{y}_{T}\) represent the sample treatment means and \(\sum_{i=1}^{T} a_{i}=0\). The quantity A is a sample statistic and serves as an estimate for the parameter contrast \(\sum_{i=1}^{T} a_{i} \mu_{i}\). By choosing the numerical coefficients appropriately, linear contrasts can be used to make different comparisons among groups of treatment means, including but not limited to mean pairs. The table below gives 4 linear contrasts defined in terms of the 3 fertilizer levels, F1, F2, F3, and the Control in the greenhouse example.

Table: Greenhouse example contrasts
Ex \(a_1\) \(a_2\) \(a_3\) \(a_4\) Contrast
1 1 -1 0 0 F1-F2
2 1 1 1 -3 F1+F2+F3-3C
3 1 1 -2 0 F1+F2-2F3
4 0 1 -1 0 F2-F3

Notice that values of each list of \(a_{i}\) (i = 1, 2, 3, 4) add to zero. The first contrast compares the first two fertilizer types in terms of their means (a pairwise comparison), the second compares the means of the 3 fertilizer types with the Control mean. The 3rd one is a comparison between the combined effect of fertilizer types 1 and 2 with fertilizer type 3, while the last contrast compares the 2nd and 3rd fertilizer types.

A pair of contrasts \(A=\sum_{i=1}^{T} a_{i} \bar{y}_{i}\), and \(B=\sum_{i=1}^{T} b_{i} \bar{y}_{i}\) is orthogonal if the products of their numerical coefficients add to zero. This can be expressed mathematically as

\(\sum_{i=1}^{T} a_{i}b_{i}=0\)

A set of contrasts is said to be orthogonal if every pair of contrasts in the set is orthogonal. Two orthogonal contrasts are uncorrelated: if A and B are orthogonal, then Covariance(A, B) = 0. Furthermore, the treatment sum of squares, usually displayed in the ANOVA table, can be partitioned into a set of (T-1) orthogonal contrasts, each with 1 degree of freedom. Note that the maximal number of orthogonal contrasts associated with a treatment of T levels is (T-1), and each of them is associated with one specific comparison, independent of the others. In the table above, contrasts 1, 2, and 3 form an orthogonal set of (T-1) contrasts.
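The orthogonality conditions are easy to check mechanically. A short sketch verifies that contrasts 1, 2, and 3 from the table above (coefficients ordered F1, F2, F3, Control) form an orthogonal set:

```python
from itertools import combinations

# Coefficients (a1, a2, a3, a4) for contrasts 1-3 from the table above.
contrasts = {
    "1: F1 - F2":           [1, -1,  0,  0],
    "2: F1 + F2 + F3 - 3C": [1,  1,  1, -3],
    "3: F1 + F2 - 2F3":     [1,  1, -2,  0],
}

# Each contrast's coefficients must add to zero.
assert all(sum(a) == 0 for a in contrasts.values())

# Every pair must have a zero sum of coefficient products to be orthogonal.
for (na, a), (nb, b) in combinations(contrasts.items(), 2):
    print(na, "vs", nb, "->", sum(x * y for x, y in zip(a, b)))  # all 0
```

All three pairwise cross-product sums are zero, so the set is orthogonal.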

The statistical significance of a linear contrast, which can be equated to testing for the zero contrast value, can be formulated using the null and alternative hypotheses:

\(H_0\colon \sum_{i=1}^{T} a_{i} \mu_{i}=0 \text { vs. } H_A\colon  \sum_{i=1}^{T} a_{i} \mu_{i} \neq 0 \text {, }\)

and can be tested using either,

\(t=\dfrac{\sum_{i=1}^{T} a_{i} \bar{y}_{i}}{\sqrt{\operatorname{MSE} \sum_{i=1}^{T} \frac{a_{i}^{2}}{n_i}}}\) with (N-T) degrees of freedom or \(F=\dfrac{\left(\sum_{i=1}^{T} a_{i} \bar{y}_{i}\right)^{2}}{\operatorname{MSE} \sum_{i=1}^{T} \frac{a_{i}^{2}}{n_i}}\)

with the numerator and denominator degrees of freedom equal to 1 and (N-T) respectively.

Note that MSE can be obtained from the ANOVA table. Applying the above formula, the t statistic for testing contrast 2 above is

\(t=\dfrac{\sum_{i=1}^{T} a_{i} \bar{y}_{i}}{\sqrt{\operatorname{MSE} \sum_{i=1}^{T}\frac{a_{i}^{2}}{n_i}}}=\dfrac{28.6+25.867+29.2-(3 * 21)}{\sqrt{3.052 \times \frac{(1+1+1+9)}{6}}}=8.365\)

with df=20 and has a p-value of approximately 0. This indicates that the average plant height due to the combined treatment of the 3 fertilizer types differs significantly from the average plant height yielded by the control.
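The t statistic above can be verified in code. A sketch (Python with SciPy; means ordered F1, F2, F3, Control to match the contrast table):

```python
from math import sqrt
from scipy.stats import t

means = [28.6, 25.867, 29.2, 21.0]   # F1, F2, F3, Control
a = [1, 1, 1, -3]                    # contrast 2: F1 + F2 + F3 - 3C
mse, n, df_error = 3.052, 6, 20      # from the greenhouse ANOVA table

estimate = sum(ai * m for ai, m in zip(a, means))
se = sqrt(mse * sum(ai ** 2 for ai in a) / n)
t_stat = estimate / se
p_value = 2 * t.sf(abs(t_stat), df_error)

print(round(t_stat, 3))    # 8.365
print(p_value < 0.0001)    # True
```

The two-sided p-value is effectively zero, matching the conclusion in the text.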

The above testing procedure is applicable to non-orthogonal contrasts as well. However, as non-orthogonal contrasts are not guaranteed to be uncorrelated, the conclusions arrived at may be "overlapping", leading to redundancies. In Lesson 3, examples are provided to illustrate how software can be used to conduct contrast testing. The hypothesis testing for trends using contrasts will be discussed in Lesson 10 ANCOVA II.


2.6 - Try it!

Exercise 1: Teaching Effectiveness

To compare the teaching effectiveness of 3 teaching methods, the semester averages, based on 4 midterm exams, of five randomly selected students enrolled in each teaching method were used.

  1. What is the response in this study?
  2. How many replicates are there?
  3. Write the appropriate null and alternative hypotheses.
  4. Complete the partially filled ANOVA table given below. Round your answers to 4 decimal places.
    Source df SS MS F p-value
    teach_mtd   245      
    error          
    total   345.1      
  5. Find the critical value at \(\alpha = .01\)
  6. Make your conclusion.
  7. From the ANOVA analysis you performed, can you detect the teaching method which yields the highest semester average? If not, suggest a technique that will.

Solutions

  1. Average of 4 mid-terms
  2. 5
  3. \(H_0\colon \mu_1=\mu_2=\mu_3=\mu, \ \text{where} \ \mu_1, \mu_2, \mu_3\) are the actual semester averages of all students enrolled in teaching method 1, method 2, and method 3, respectively.
    \(H_a\colon\) Not all semester averages are equal. This means that there are at least two teaching methods that differ in their actual semester averages.
  4. Source df SS MS F p-value
    teach_mtd 2 245 122.5000 14.6853 0.0006
    error 12 100.1 8.3417    
    total 14 345.1      
  5. 6.927
  6. Since the calculated F-statistic value = 14.6853 exceeds the critical value of 6.927, \(H_0\) should be rejected. We can therefore conclude that not all 3 teaching methods have the same semester average, indicating that at least 2 teaching methods differ in their actual semester averages.
  7. The ANOVA conclusion indicated that not all 3 teaching methods are equally effective, but did not indicate which one yields the highest mean score. The Tukey comparison method is one procedure that shows the teaching method that yields the significantly highest average semester score.
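The entries in this table can be reproduced from the two given sums of squares alone. A minimal sketch, assuming scipy is available:

```python
# Complete the Exercise 1 ANOVA table from the two given sums of squares.
from scipy import stats

k, n_per = 3, 5                     # 3 teaching methods, 5 students each
N = k * n_per                       # 15 observations
ss_trt, ss_total = 245.0, 345.1

df_trt, df_error = k - 1, N - k     # 2 and 12
ss_error = ss_total - ss_trt        # 100.1
ms_trt = ss_trt / df_trt            # 122.5
ms_error = ss_error / df_error      # 8.3417
f_stat = ms_trt / ms_error          # 14.6853
p_value = stats.f.sf(f_stat, df_trt, df_error)  # upper-tail area ~ 0.0006
f_crit = stats.f.ppf(0.99, df_trt, df_error)    # critical value at alpha = .01
print(round(f_stat, 4), round(p_value, 4), round(f_crit, 3))
```

Note that `stats.f.sf` gives the upper-tail probability directly, which is the ANOVA p-value since the F test is always right-tailed.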

Exercise 2: Commuter Times

For a local commuter bus service, the number of daily passengers was recorded for 50 weeks. The purpose was to determine whether passenger volume is significantly lower during weekends than on workdays. Summary statistics for each day of the week, along with a partially filled ANOVA table and a Tukey plot, are shown below.

Statistics
 
Day N Mean SE Mean Std Dev
Sun 50 486.500 9.003 63.661
Mon 50 514.600 6.891 48.724
Tue 50 501.340 7.922 56.018
Wed 50 520.640 7.055 49.886
Thu 50 512.880 10.258 72.532
Fri 50 512.600 8.086 57.174
Sat 50 469.860 8.988 63.555
  1. State the appropriate null and alternative hypotheses for this test.

    \(H_0\colon\mu_{Sun}=\mu_{Mon}=\mu_{Tues}=\mu_{Wed}=\mu_{Thurs}=\mu_{Fri}=\mu_{Sat}\)

    \(H_a\colon \text{At least one }\mu_{dayi}\ne\mu_{dayj}, \text{for some }i,j=1,2,...,7\text{ OR not all means are equal}\)

  2. Complete the partially filled ANOVA table given below. Use two decimal places in the F-statistic.
    Source df SS MS F p-value
    Groups   100391      
    Error          
    Total   1306887      

    Source df SS MS F p-value
    Day 6 100391 16731.8 4.76 0.0001
    Error 343 1206496 3517.5    
    Total 349 1306887      
  3. Use the appropriate F-distribution cumulative probabilities to verify that the p-value for the test is approximately zero.

    p-value \(\approx\) 0 (from the F-distribution with 6 and 343 degrees of freedom)
  4. Use \(\alpha=0.05\), to test if the mean passenger volume differs significantly by day of the week.

    Since the p-value \(\le \alpha=0.05\), we reject \(H_0\). There is strong evidence to indicate that the mean passenger volume differs significantly by day of the week (i.e. for some days of the week the average number of commuters is more than others, but this test does not indicate which days have a higher passenger volume).
    Grouping Information Using the Tukey Method and 95% Confidence
    [Figure: Tukey grouping for means of Day (\(\alpha = 0.05\)), with days ordered by estimated mean volume: Wed (520.64), Mon (514.60), Thu (512.88), Fri (512.60), Tue (501.34), Sun (486.50), Sat (469.86). Means covered by the same bar are not significantly different.]
  5. Use the output to make a statement about how the mean daily passenger volume differs significantly by day of the week.

    The passenger volume on Sundays is not statistically different from Saturdays or from Tuesdays. However, the mean passenger volume on Saturdays is significantly lower than on workdays other than Tuesday.
  6. The management would like to know if the overall number of commuters is significantly more during workdays than during weekends. An appropriate comparison to respond to their query would be to compare the average number of commuters between workdays (Monday through Friday) and the weekend. Write the weights (coefficients) for a linear contrast to make this comparison. Test the hypothesis that the average commuter volume during the weekends is less.

    The weights (coefficients) for the appropriate contrast are given below.
    Day Mon Tue Wed Thu Fri Sat Sun
    weight 1 1 1 1 1 -2.5 -2.5

    \( t=\dfrac{\sum_{i=1}^{T} a_{i} \bar{y}_{i}}{\sqrt{\operatorname{MSE} \sum_{i=1}^{T} \frac{a_{i}^{2}}{n_{i}}}}=\dfrac{171.16}{\sqrt{3517.5 \times \frac{17.5}{50}}}=4.878 \)

    Under the null hypothesis, this test statistic has a t-distribution with 343 degrees of freedom. You can obtain the p-value using statistical software. Recall this is a one-tailed test.

    Student's t distribution with 343 DF
    x P(X\(\ge\)x)
    4.878 8.216815e-07\(\approx\)0

    This p-value indicates that the difference in the average number of passengers is statistically significant between workdays and weekends.

    See the table below for computations:

    Factor N Mean weight product weight²
    Mon 50 514.6 1.0 514.6 1.00
    Tue 50 501.34 1.0 501.34 1.00
    Wed 50 520.64 1.0 520.64 1.00
    Thu 50 512.88 1.0 512.88 1.00
    Fri 50 512.6 1.0 512.6 1.00
    Sat 50 469.86 -2.5 -1174.65 6.25
    Sun 50 486.5 -2.5 -1216.25 6.25

    Recall that the MSE (error mean squares) is 3517.5 with \(df_{error} = 343\)
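The numbers in this solution, both the ANOVA table entries and the contrast test, can be reproduced from the summary statistics alone. A sketch assuming scipy is available:

```python
# Reproduce the Exercise 2 results from the given sums of squares and day means.
import math
from scipy import stats

# ANOVA table entries
k, n_per = 7, 50                           # 7 days, 50 weeks each
N = k * n_per                              # 350 observations
ss_day, ss_total = 100391.0, 1306887.0
df_day, df_error = k - 1, N - k            # 6 and 343
ms_day = ss_day / df_day                   # 16731.8
ms_error = (ss_total - ss_day) / df_error  # MSE ~ 3517.5
f_stat = ms_day / ms_error
f_pvalue = stats.f.sf(f_stat, df_day, df_error)

# Workday-vs-weekend contrast from the day means
means = {"Mon": 514.60, "Tue": 501.34, "Wed": 520.64, "Thu": 512.88,
         "Fri": 512.60, "Sat": 469.86, "Sun": 486.50}
coef = {"Mon": 1, "Tue": 1, "Wed": 1, "Thu": 1, "Fri": 1,
        "Sat": -2.5, "Sun": -2.5}          # weights sum to zero
estimate = sum(coef[d] * means[d] for d in means)  # 171.16
se = math.sqrt(ms_error * sum(c ** 2 for c in coef.values()) / n_per)
t_stat = estimate / se
t_pvalue = stats.t.sf(t_stat, df_error)    # one-tailed p-value

print(round(f_stat, 2), round(t_stat, 3))  # 4.76 4.878
```

Both the F-statistic (4.76) and the contrast t-statistic (4.878) match the hand computations above.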


2.7 - Lesson 2 Summary


In this lesson, we became familiar with the ANOVA methodology for testing equality among treatment means. As a follow-up procedure, we were introduced to the Tukey method for pairwise mean comparisons, which helps identify significantly different treatment (factor) levels. Contrast analysis was also discussed as a means of comparing differences among group means.

