Lesson 3a: 'Behind the Curtains' - How is ANOVA Calculated?

In the lessons so far we developed the ANOVA conceptually in terms of deviations from means. In the case of the treatment SS, we calculated the deviations of treatment level means from the overall mean, and for the residual error SS we calculated the deviations of individual observations from treatment level means. In practice, however, computing the SS for ANOVA is done differently. To understand this we can utilize the fundamental equivalence of the following two expressions:

\(SS=\Sigma(Y_i-\bar{Y})^2=\Sigma Y_{i}^{2}-\frac{(\Sigma Y_i)^2}{n}\)

The latter expression is often referred to as the ‘working formula’ or ‘machine formula’ and is used by computer software to calculate the various SS in the ANOVA.
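The equivalence of the two expressions can be checked numerically. The following is a minimal sketch in plain Python (not the SAS code used in this lesson), with a small made-up data set; the variable names are illustrative only.

```python
# Hypothetical observations, invented for illustration.
y = [2, 4, 6, 5, 7, 9, 9, 10, 11]
n = len(y)

# Definitional form: sum of squared deviations from the mean.
y_bar = sum(y) / n
ss_definitional = sum((yi - y_bar) ** 2 for yi in y)

# Working formula: sum of squared observations minus (sum of observations)^2 / n.
ss_working = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

print(ss_definitional, ss_working)  # 72.0 72.0
```

The working formula needs only two running totals (the sum and the sum of squares), which is why it lends itself to a single pass through the data.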

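The first part of the working formula is simply squaring and summing the observations. For computing the total SS, the formula above is used as is. For the treatment SS, only the first part is modified: the observations within each of the r treatment levels are summed, each level total is squared and divided by the number of observations in that level, and the same second term (with n the total number of observations) is subtracted:

\(SS_{treatment} = \sum_{i=1}^{r}\frac{\left( \sum_{j=1}^{n_i}Y_{ij} \right)^2}{n_i}-\frac{\left( \sum_{i=1}^{r}\sum_{j=1}^{n_i}Y_{ij} \right)^2}{n}\)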
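As a worked illustration, the sketch below computes the total, treatment, and error SS with the working formula, subtracting \((\Sigma Y)^2/n\) from the uncorrected sums in each case. It is plain Python rather than SAS, and the data and names (`groups`, `cf`) are hypothetical.

```python
# Hypothetical toy data: 3 treatment levels with 3 observations each.
groups = {
    "level 1": [2, 4, 6],
    "level 2": [5, 7, 9],
    "level 3": [9, 10, 11],
}
all_obs = [y for ys in groups.values() for y in ys]
n = len(all_obs)

# Second part of the working formula: (grand total)^2 / n.
cf = sum(all_obs) ** 2 / n

# Total SS: square and sum every observation, then subtract cf.
ss_total = sum(y ** 2 for y in all_obs) - cf

# Treatment SS: square each level total, divide by its n_i, sum, subtract cf.
ss_treatment = sum(sum(ys) ** 2 / len(ys) for ys in groups.values()) - cf

# Error SS follows by subtraction.
ss_error = ss_total - ss_treatment

print(ss_total, ss_treatment, ss_error)  # 72.0 54.0 18.0
```

The same values result from the deviation-based definitions: the level means are 4, 7, and 10 around an overall mean of 7, so the treatment SS is 3(9 + 0 + 9) = 54.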

The second part of the working formula, \(\dfrac{(\Sigma Y_i)^2}{n}\), is referred to as the ‘correction factor’, abbreviated as CF. This formulation is readily computed with matrix algebra, as we will see below. To gain a view into the process that software uses, we will work with a procedure in SAS, PROC IML (the Interactive Matrix Language), which provides an ideal setting in which to see how the various ANOVA models discussed in the online lesson notes and in our textbook are actually computed. The following four models will be set up and run in this lesson:

Model 1 - The Overall Mean Model

 See Textbook: Equation 16.58

\(Y_{ij}=\mu_.+\epsilon_{ij}\)

which simply fits an overall or ‘grand’ mean. This model reflects the situation where \(H_0\) is true and \(\mu_1=\mu_2=\cdots=\mu_r\).

Model 2 - The Cell Means Model

 See Textbook: Equation 16.57

\(Y_{ij}=\mu_i+\epsilon_{ij}\)

where \(\mu_i\) are the factor level means. Note that in this model there is no overall mean being fitted.

Model 3 - Dummy Variable Regression (No Textbook Equation)
\(Y_{ij}=\mu_.+\mu_i+\epsilon_{ij}\), fitted as \(Y_{ij}=\beta_0+\beta_{Level_1}+\beta_{Level_2}+\cdots+\beta_{Level_{r-1}}+\epsilon_{ij}\)

where \(\beta_i\) are regression coefficients for r-1 indicator-coded regression ‘dummy’ variables that are coded to indicate the r categorical factor levels. The rth factor level mean is given by the regression intercept \(\beta_0\).

Model 4 - The Effects Model

 See Textbook: Equation 16.62

\(Y_{ij}=\mu_.+\tau_i+\epsilon_{ij}\)

where \(\tau_i\) are the deviations of each factor level mean from the overall mean.

Running the different models above is accomplished by simply changing the design matrix \(\mathbf{X}\) in the general linear model (GLM): \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\).
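To make the role of the design matrix concrete, here is a minimal sketch in plain Python (not PROC IML) that fits all four models to a small made-up data set by solving the normal equations \(\mathbf{X}'\mathbf{X}\mathbf{b}=\mathbf{X}'\mathbf{Y}\). The data, the helper functions `solve` and `fit`, and the particular codings (last level as reference for Model 3, sum-to-zero coding with the last level coded -1 for Model 4) are assumptions made for illustration, not necessarily the codings used later in this lesson.

```python
# Hypothetical toy data: r = 3 factor levels, 3 observations per level.
y = [2, 4, 6, 5, 7, 9, 9, 10, 11]
level = [1, 1, 1, 2, 2, 2, 3, 3, 3]

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(x):
    """Return (b, sse), where b solves X'X b = X'Y and sse = Y'Y - b'X'Y."""
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    b = solve(xtx, xty)
    sse = sum(yi ** 2 for yi in y) - sum(bi * ti for bi, ti in zip(b, xty))
    return b, sse

# Model 1: overall mean, X is a single column of ones.
x_mean = [[1] for _ in level]
# Model 2: cell means, one indicator column per level and no intercept.
x_cell = [[int(g == i) for i in (1, 2, 3)] for g in level]
# Model 3: dummy regression, intercept plus indicators for levels 1 and 2.
x_dummy = [[1, int(g == 1), int(g == 2)] for g in level]
# Model 4: effects model, intercept plus sum-to-zero columns (level 3 coded -1).
x_effect = [[1] + ([-1, -1] if g == 3 else [int(g == 1), int(g == 2)])
            for g in level]

for name, x in [("overall mean", x_mean), ("cell means", x_cell),
                ("dummy", x_dummy), ("effects", x_effect)]:
    b, sse = fit(x)
    print(name, b, sse)
```

Up to floating-point rounding, Model 1 returns the overall mean 7 with an error SS of 72 (the total SS), while Models 2 through 4 all leave an error SS of 18 and differ only in how the coefficients are parameterized: (4, 7, 10) for the cell means, (10, -6, -3) for the dummy regression, and (7, -3, 0) for the effects coding. Note that the quantity \(\mathbf{b}'\mathbf{X}'\mathbf{Y}\) for Model 1 is exactly the correction factor from the working formula.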