In the univariate case, the data can often be arranged as shown in the table below:
| Subject | 1 | 2 | \(\dots\) | \(g\) |
|---|---|---|---|---|
| 1 | \(Y_{11}\) | \(Y_{21}\) | \(\dots\) | \(Y_{g1}\) |
| 2 | \(Y_{12}\) | \(Y_{22}\) | \(\dots\) | \(Y_{g2}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | | \(\vdots\) |
| \(n_i\) | \(Y_{1n_1}\) | \(Y_{2n_2}\) | \(\dots\) | \(Y_{gn_g}\) |
The columns correspond to the responses under g different treatments (or from g different populations), and the rows correspond to the subjects within each of these treatments or populations.
Notation:
 \(Y_{ij}\) = Observation from subject j in group i
 \(n_{i}\) = Number of subjects in group i
 \(N = n_{1} + n_{2} + \dots + n_{g}\) = Total sample size.
The assumptions for the Analysis of Variance are the same as for a two-sample t-test, except that there are more than two groups:
 The data from group i have a common mean \(\mu_{i}\); i.e., \(E\left(Y_{ij}\right) = \mu_{i}\). This means that there are no subpopulations with different means.
 Homoskedasticity: The data from all groups have common variance \(\sigma^2\); i.e., \(var(Y_{ij}) = \sigma^{2}\). That is, the variability in the data does not depend on group membership.
 Independence: The subjects are independently sampled.
 Normality: The data are normally distributed.
The hypothesis of interest is that all of the means are equal. Mathematically we write this as:
\(H_0\colon \mu_1 = \mu_2 = \dots = \mu_g\)
The alternative is expressed as:
\(H_a\colon \mu_i \ne \mu_j \) for at least one \(i \ne j\).
i.e., there is a difference between at least one pair of group population means. The following notation is used:
 \(\bar{y}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\) = Sample mean for group i. This is the average of the observations for j = 1 to \(n_{i}\) belonging to the ith group; the dot in the second subscript indicates that the average involves summing over the second subscript of Y.
 \(\bar{y}_{..} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_i}Y_{ij}\) = Grand mean. This is the sum of all the observations, within each group and over the groups, divided by the total sample size; the double dots indicate that we are summing over both subscripts of Y.
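As a quick numerical sketch (the data below are hypothetical, chosen only to illustrate the definitions), the group means \(\bar{y}_{i.}\) and the grand mean \(\bar{y}_{..}\) can be computed directly:

```python
# Hypothetical data: g = 3 groups with unequal sample sizes n_i
groups = [
    [4.0, 5.0, 6.0],         # group 1, n_1 = 3
    [7.0, 8.0],              # group 2, n_2 = 2
    [5.0, 6.0, 7.0, 8.0],    # group 3, n_3 = 4
]

n = [len(grp) for grp in groups]                       # group sizes n_i
N = sum(n)                                             # total sample size N = n_1 + ... + n_g
group_means = [sum(grp) / len(grp) for grp in groups]  # ybar_{i.} for each group
grand_mean = sum(sum(grp) for grp in groups) / N       # ybar_{..}

print(group_means)   # → [5.0, 7.5, 6.5]
```

Note that the grand mean is the average of all N observations, which equals a weighted (not a simple) average of the group means when the \(n_i\) differ.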
ANOVA

The Analysis of Variance involves the partitioning of the total sum of squares, which is defined below:

\(SS_{total} = \sum\limits_{i=1}^{g}\sum\limits_{j=1}^{n_i}(Y_{ij} - \bar{y}_{..})^2\)
Here we are looking at the squared difference between each observation and the grand mean, summed over all observations. Note that if the observations tend to be far away from the grand mean, this will take a large value. Conversely, if all of the observations tend to be close to the grand mean, this will take a small value. Thus, the total sum of squares measures the variation of the data about the grand mean.
An Analysis of Variance (ANOVA) is a partitioning of the total sum of squares. In the second line of the expression below, we add and subtract the sample mean for the ith group. In the third line, this expands into two terms (the cross-product term sums to zero): the first term involves the differences between the observations and the group means, \(\bar{y}_{i.}\), while the second term involves the differences between the group means and the grand mean.
\(\begin{array}{lll} SS_{total} & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{y}_{..}\right)^2 \\ & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left((Y_{ij}-\bar{y}_{i.})+(\bar{y}_{i.}-\bar{y}_{..})\right)^2 \\ & = &\underset{SS_{error}}{\underbrace{\sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{i.})^2}}+\underset{SS_{treat}}{\underbrace{\sum_{i=1}^{g}n_i(\bar{y}_{i.}-\bar{y}_{..})^2}} \end{array}\)
The first term is called the error sum of squares, \(SS_{error}\), and measures the variation in the data about their group means. If the observations tend to be close to their group means, this value will be small; if they tend to be far away from their group means, it will be large. The second term is called the treatment sum of squares, \(SS_{treat}\), and measures the variation of the group means about the grand mean. If the group means are close to the grand mean, this value will be small; if they tend to be far away from the grand mean, it will take a large value.
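The partition can be checked numerically. This sketch (with hypothetical data) computes \(SS_{error}\) and \(SS_{treat}\) separately and confirms that they add up to \(SS_{total}\):

```python
# Hypothetical data: g = 3 groups
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0],
    [5.0, 6.0, 7.0, 8.0],
]

N = sum(len(grp) for grp in groups)
grand_mean = sum(sum(grp) for grp in groups) / N
group_means = [sum(grp) / len(grp) for grp in groups]

# Total SS: variation of all observations about the grand mean
ss_total = sum((y - grand_mean) ** 2 for grp in groups for y in grp)

# Error SS: variation of observations about their own group means
ss_error = sum((y - m) ** 2
               for grp, m in zip(groups, group_means) for y in grp)

# Treatment SS: variation of the group means about the grand mean,
# weighted by the group sizes n_i
ss_treat = sum(len(grp) * (m - grand_mean) ** 2
               for grp, m in zip(groups, group_means))

# The identity SS_total = SS_error + SS_treat holds exactly
# (up to floating-point rounding)
assert abs(ss_total - (ss_error + ss_treat)) < 1e-9
```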
The Analysis of Variance results are summarized in the analysis of variance table below:
| Source | d.f. | SS | MS | F |
|---|---|---|---|---|
| Treatments | \(g - 1\) | \(\sum_{i=1}^{g} n_i \left( \bar{y}_{i.} - \bar{y}_{..} \right)^2\) | \(\dfrac{SS_{treat}}{g - 1}\) | \(\dfrac{MS_{treat}}{MS_{error}}\) |
| Error | \(N - g\) | \(\sum_{i=1}^{g} \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{y}_{i.} \right)^2\) | \(\dfrac{SS_{error}}{N - g}\) | |
| Total | \(N - 1\) | \(\sum_{i=1}^{g} \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{y}_{..} \right)^2\) | | |
The ANOVA table contains columns for Source, Degrees of Freedom, Sum of Squares, Mean Square and F. Sources include Treatment and Error which together add up to the Total.
The degrees of freedom for treatment in the first row of the table are calculated by taking the number of groups or treatments minus 1. The total degrees of freedom are the total sample size minus 1. The Error degrees of freedom are obtained by subtracting the treatment degrees of freedom from the total degrees of freedom, giving \(N - g\).
The formulae for the Sum of Squares are given in the SS column. The Mean Square terms are obtained by taking the Sums of Squares terms and dividing them by the corresponding degrees of freedom.
The final column contains the F statistic which is obtained by taking the MS for treatment and dividing it by the MS for Error.
Under the null hypothesis of equal group means, that is \(H_{0} \colon \mu_{1} = \mu_{2} = \dots = \mu_{g}\), this F statistic is F-distributed with \(g - 1\) and \(N - g\) degrees of freedom:
\(F \sim F_{g-1, N-g}\)
The numerator degrees of freedom, \(g - 1\), come from the degrees of freedom for treatments in the ANOVA table; they are referred to as the numerator degrees of freedom since the formula for the F-statistic involves the Mean Square for Treatment in the numerator. The denominator degrees of freedom, \(N - g\), equal the degrees of freedom for error in the ANOVA table; they are referred to as the denominator degrees of freedom because the formula for the F-statistic involves the Mean Square Error in the denominator.
We reject \(H_{0}\) at level \(\alpha\) if the F statistic is greater than the critical value of the F-table with \(g - 1\) and \(N - g\) degrees of freedom, evaluated at level \(\alpha\):

\(F > F_{g-1, N-g, \alpha}\)
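Putting the pieces together, this sketch (with the same kind of hypothetical data as above) computes every quantity in the ANOVA table and the F statistic; the critical value itself would come from an F-table or a statistical library, since the Python standard library does not provide F quantiles:

```python
# Hypothetical data: g = 3 groups
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0],
    [5.0, 6.0, 7.0, 8.0],
]

g = len(groups)
N = sum(len(grp) for grp in groups)
grand_mean = sum(sum(grp) for grp in groups) / N
group_means = [sum(grp) / len(grp) for grp in groups]

ss_error = sum((y - m) ** 2
               for grp, m in zip(groups, group_means) for y in grp)
ss_treat = sum(len(grp) * (m - grand_mean) ** 2
               for grp, m in zip(groups, group_means))

ms_treat = ss_treat / (g - 1)   # numerator d.f. = g - 1
ms_error = ss_error / (N - g)   # denominator d.f. = N - g
f_stat = ms_treat / ms_error    # F ~ F_{g-1, N-g} under H_0

print(g - 1, N - g, round(f_stat, 4))   # → 2 6 3.2222
```

With these numbers, \(F \approx 3.22\) on 2 and 6 degrees of freedom; this would be compared against the tabled critical value \(F_{g-1, N-g, \alpha}\) (roughly 5.14 at \(\alpha = 0.05\)), so here we would fail to reject \(H_0\).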