Lesson 3a: 'Behind the Curtains' - How is ANOVA Calculated?

In the lessons so far we developed the ANOVA conceptually in terms of deviations from means. In the case of the treatment SS, we calculated the deviations of treatment level means from the overall mean, and for the residual error SS we calculated the deviations of individual observations from treatment level means. In practice, however, computing the SS for ANOVA is done differently. To understand this we can utilize the fundamental equivalence of the following two expressions:

\(SS=\Sigma(Y_i-\bar{Y})^2=\Sigma Y_{i}^{2}-\frac{(\Sigma Y_i)^2}{n}\)

The latter expression is often referred to as the ‘working formula’ or ‘machine formula’ and is used by computer software to calculate the various SS in the ANOVA.
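As a quick numeric check (a Python sketch, purely illustrative and separate from the SAS code used later), both sides of this identity produce the same SS on the example data used throughout this lesson:

```python
# Check that the definitional SS and the 'working formula' agree on the
# example data used throughout this lesson (Y = 2, 1, 3, 4, 6, 5).
y = [2, 1, 3, 4, 6, 5]
n = len(y)
ybar = sum(y) / n

ss_definitional = sum((yi - ybar) ** 2 for yi in y)       # sum of squared deviations
ss_working = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n   # sum of squares minus (sum Y)^2/n

print(ss_definitional, ss_working)  # both 17.5
```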

The first part of the working formula is simply squaring and summing the observations. For computing the SS for the total variance, the formula above is used directly, but for the treatment SS the first part is modified to use the squared treatment-level totals:

\(SS_{treatment} = \sum_{i=1}^{r}\frac{\left( \sum_{j=1}^{n_i}Y_{ij} \right)^2}{n_i} - \frac{(\Sigma Y_i)^2}{n}\)
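A numeric sketch of this treatment-SS computation (in Python, purely illustrative), using the lesson's example data grouped into its three treatment levels of two observations each:

```python
# Treatment SS via the working formula: squared group totals divided by the
# group sizes, minus the second part of the working formula.
groups = [[2, 1], [3, 4], [6, 5]]   # three treatment levels, n_i = 2 each
all_obs = [obs for g in groups for obs in g]
n = len(all_obs)

cf = sum(all_obs) ** 2 / n                               # (sum Y)^2 / n = 73.5
ss_trt = sum(sum(g) ** 2 / len(g) for g in groups) - cf  # 89.5 - 73.5

print(ss_trt)  # 16.0
```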

The second part of the working formula, \(\dfrac{(\Sigma Y_i)^2}{n}\) is referred to as a ‘correction factor’, abbreviated as CF. Employing this formulation is readily accomplished by using matrix algebra as we will see below. To gain a view into the process that software is using, we will use a procedure in SAS, Proc IML (the Interactive Matrix Language). The interactive matrix language in SAS provides an ideal setting in which to see how the various ANOVA models discussed in the online lesson notes and in our textbook are actually computed. The following 4 models will be set up and run here in this lesson:

Model 1 - The Overall Mean Model

 See Textbook: Equation 16.58

\(Y_{ij}=\mu_.+\epsilon_{ij}\)

which simply fits an overall or ‘grand’ mean. This model reflects the situation where \(H_0\) is true and \(\mu_1=\mu_2=\cdots=\mu_r\).

Model 2 - The Cell Means Model

 See Textbook: Equation 16.57

\(Y_{ij}=\mu_i+\epsilon_{ij}\)

where \(\mu_i\) are the factor level means. Note that in this model there is no overall mean being fitted.

Model 3 - Dummy Variable Regression (No Textbook Equation)
\(Y_{ij}=\mu_.+\mu_i+\epsilon_{ij}\), fitted as \(Y_{ij}=\beta_0+\beta_{Level1}+\beta_{Level2}+ \cdots +\beta_{Level(r-1)}+\epsilon_{ij}\)

where \(\beta_i\) are regression coefficients for r-1 indicator-coded regression ‘dummy’ variables that are coded to indicate the r categorical factor levels. The rth factor level mean is given by the regression intercept \(\beta_0\).

Model 4 - The Effects Model

 See Textbook: Equation 16.62

\(Y_{ij}=\mu_.+\tau_i+\epsilon_{ij}\)

where \(\tau_i\) are the deviations of each factor level mean from the overall mean.

Running the different models above is accomplished by simply changing the design matrix \(\mathbf{X}\) in the general linear model (GLM): \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\).
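Before turning to IML, it may help to see the same computation sketched in a neutral language. The following pure-Python illustration (an assumption-laden sketch, not the SAS code itself) computes \(\mathbf{b}=(\mathbf{X}^{'}\mathbf{X})^{-1}\mathbf{X}^{'}\mathbf{Y}\); swapping in a different design matrix reproduces each model below:

```python
# Minimal least-squares sketch: form X'X and X'Y, then solve
# (X'X) b = X'Y by Gauss-Jordan elimination (equivalent to applying
# the inverse of X'X).

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(x, y):
    """Least-squares coefficients b = (X'X)^(-1) X'Y."""
    p = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)

y = [2, 1, 3, 4, 6, 5]

# Cell-means design matrix (one indicator column per treatment level):
x_cell_means = [[1, 0, 0], [1, 0, 0],
                [0, 1, 0], [0, 1, 0],
                [0, 0, 1], [0, 0, 1]]
print(ols(x_cell_means, y))  # [1.5, 3.5, 5.5] -- the three level means
```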


3a.1 - The Overall Mean Model

The design matrix for the overall mean model is:

\(\mathbf{Y}=\begin{bmatrix}
2\\
1\\
3\\
4\\
6\\
5
\end{bmatrix}
\mathbf{X}=\begin{bmatrix}
1\\
1\\
1\\
1\\
1\\
1
\end{bmatrix}
\boldsymbol{\beta}=(\mu) \;\;
\boldsymbol{\epsilon}=\begin{bmatrix}
\epsilon_1\\
\epsilon_2\\
\epsilon_3\\
\epsilon_4\\
\epsilon_5\\
\epsilon_6
\end{bmatrix}\)

The estimates of the parameters in \(\boldsymbol{\beta}\) are obtained as:

\(b= (\mathbf{X}^{'} \mathbf{X} ) ^{-1} \mathbf{X}^{' } \mathbf{Y}\)

In this simplest of cases, we can see how this works. The term \(\mathbf{X}^{'} \mathbf{X}\) would be:

\(\begin{bmatrix}
1 & 1 & 1 & 1 &1 & 1
\end{bmatrix}\ast \begin{bmatrix}
1\\
1\\
1\\
1\\
1\\
1
\end{bmatrix}=1+1+1+1+1+1=6 = n\)

In the estimation formula for \(\mathbf{b}\), this term carries a '\(-1\)' exponent, indicating that we will be dividing by \(\mathbf{X}^{'} \mathbf{X}\).

The term \(\mathbf{X}^{'} \mathbf{Y}\) would be:

 \(\begin{bmatrix}
2 & 1 & 3 & 4 & 6 & 5
\end{bmatrix}\ast \begin{bmatrix}
1\\
1\\
1\\
1\\
1\\
1
\end{bmatrix}=2+1+3+4+6+5=21 = \Sigma Y_i\)

So in this case, the estimate \(\mathbf{b}\) is simply the overall mean: \(\hat{\mu}_{.}= 21/6 = 3.5\)
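In Python terms (an illustrative sketch), the overall-mean model reduces to scalar arithmetic:

```python
# For the overall-mean model X is a column of ones, so (X'X)^(-1) X'Y
# collapses to sum(Y) / n -- the grand mean.
y = [2, 1, 3, 4, 6, 5]
x = [1] * len(y)

xtx = sum(xi * xi for xi in x)              # X'X = n = 6
xty = sum(xi * yi for xi, yi in zip(x, y))  # X'Y = sum(Y) = 21
b = xty / xtx
print(b)  # 3.5
```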

To perform these matrix operations in SAS IML, we will open a regular SAS editor window, and then copy and paste three components from the file (IML Design Matrices):

 SAS: Overall Mean Model

  1. Step 1

    Procedure initiation, and specification of the dependent variable vector, \(\mathbf{Y}\).

    For our example we have:

    /* Initiate IML, define response variable */
    proc iml;
    y={
    2, 
    1,
    3,
    4,
    6,
    5};
  2. Step 2

    We then enter a design matrix (\(\mathbf{X}\)). For the Overall Mean model and our example data, we have:

    
    x={
    1,
    1,
    1,
    1,
    1,
    1};
  3. Step 3

    We can now copy and paste a program for the matrix computations to generate results (regression coefficients and ANOVA output):

    beta=inv(x`*x)*x`*y; 
    beta_label={"Beta_0","Beta_1","Beta_2","Beta_3"};
    print beta [label="Regression Coefficients"
    			rowname=beta_label];
    n=nrow(y); 
    p=ncol(x); 
    j=j(n,n,1);
    ss_tot = (y`*y) - (1/n)*(y`*j)*y;
    ss_trt = (beta`*(x`*y)) - (1/n)*(y`*j)*y;
    ss_error = ss_tot - ss_trt; 
    total_df=n-1;
    trt_df=p-1;
    error_df=n-p; 
    MS_trt = ss_trt/(p-1);
    MS_error = ss_error / error_df;
    F=ms_trt/ms_error; 
    
    empty={.}; 
    row_label= {"Treatment", "Error", "Total"}; 
    col_label={"df" "SS" "MS" "F"}; 
    trt_row= trt_df || ss_trt || ms_trt || F; 
    error_row= error_df || ss_error || ms_error || empty;
    tot_row=total_df || ss_tot || empty || empty; 
    aov = trt_row // error_row // tot_row;
    print aov [label="ANOVA"
    			rowname=row_label
    			colname=col_label];


The program can then be run to produce the following output:

Regression Coefficients
Beta_0    3.5

ANOVA
             df      SS     MS      F
Treatment     0       0
Error         5    17.5    3.5
Total         5    17.5

We see the estimate of the regression coefficient for \(\beta_0\) is indeed the overall mean of the response variable. In this simple case, where the treatment factor has not entered the model, the only item of interest from the ANOVA table would be the \(SS_{Error}\) for later use in the General Linear F test.

To check the internal calculations, we can add the following few lines to the end of the calculation code.

 /* (Optional) Intermediates in the matrix computations */
xprimex=x`*x; print xprimex;
xprimey=x`*y; print xprimey;
xprimexinv=inv(x`*x); print xprimexinv;
check=xprimexinv*xprimex; print check;
SumY2=beta`*(x`*y); print SumY2;
CF=(1/n)*(y`*j)*y; print CF;

This additional code produces the following output:

xprimex
6
xprimey
21
xprimexinv
0.1666667
check
1
SumY2
73.5
CF
73.5

From this we can verify the computation of the \(SS_{treatment}\): \(\mathbf{b}^{'}(\mathbf{X}^{'}\mathbf{Y}) - CF = 73.5 - 73.5 = 0\).

The ‘check’ calculation is confirming that \((\mathbf{X}^{'}\mathbf{X})^{-1} (\mathbf{X}^{'}\mathbf{X}) = \mathbf{I}\), which is what defines the matrix ‘division’ operation. In this simple case it amounts to simple division by n, but in the other models that we will work with, the matrix inversion is more complicated. The inverse \((\mathbf{X}^{'}\mathbf{X})^{-1}\) is defined by the product \((\mathbf{X}^{'}\mathbf{X})^{-1} (\mathbf{X}^{'}\mathbf{X}) = \mathbf{I}\), where \(\mathbf{I}\) is the identity matrix (a diagonal matrix of ‘1’s). Here in this simple case there is only a single diagonal element.


3a.2 - Cell Means Model

To run the cell means model, wherein we don’t fit an overall mean, but instead fit an individual mean for each of the treatment levels, we simply replace the design matrix in the IML code with:

/* The Cell Means Model */
x={
1	0	0,
1	0	0,
0	1	0,
0	1	0,
0	0	1,
0	0	1};
 

Each column uses indicator coding: a ‘1’ for observations in that treatment level and a ‘0’ otherwise. So column 1 will generate the mean for treatment level 1, column 2 for treatment level 2, and column 3 for treatment level 3.

We then re-run the program with the new design matrix to get the following output:

Regression Coefficients
Beta_0    1.5
Beta_1    3.5
Beta_2    5.5

ANOVA
             df      SS     MS      F
Treatment     2      16      8     16
Error         3     1.5    0.5
Total         5    17.5

We can see that the regression coefficients are now the means for each treatment level, and in the ANOVA table we see that the \(SS_{Error}\) is 1.5. The reduction in the \(SS_{Error}\) relative to the Overall Mean model (17.5 - 1.5 = 16) is the \(SS_{treatment}\).

Internally, we have:

xprimex
2 0 0
0 2 0
0 0 2
check
1 0 0
0 1 0
0 0 1
xprimey
3
7
11
SumY2
89.5
CF
73.5
xprimexinv
0.5 0 0
0 0.5 0
0 0 0.5

Here we can see that \(\mathbf{X}^{'}\mathbf{X}\) now contains diagonal elements that are the \(n_i\), the number of observations for each treatment level mean being computed. In addition, we can verify the ‘working formula’ for the treatment SS: \(\sum_i \frac{\left(\sum_j Y_{ij}\right)^2}{n_i} - CF = 89.5 - 73.5 = 16\).

We can now test for the significance of the treatment by using the General Linear F test:

 See Textbook: Section 16.7

\(F=\dfrac{(SSE_{reduced} - SSE_{full})/(dfE_{reduced} - dfE_{full})}{SSE_{full} / dfE_{full}}\)

The Overall Mean model is the ‘Reduced’ model, and the Cell Means model is the ‘Full’ model. From the ANOVA tables, we get:

\(F=\dfrac{(17.5-1.5)/(5-3)}{1.5/3}=16\)

which can be compared to \(F_{.05,2,3}=9.55\).
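The same arithmetic, as a short Python sketch using the values from the two ANOVA tables above:

```python
# General Linear F test: reduced model = overall mean, full model = cell means.
sse_reduced, dfe_reduced = 17.5, 5
sse_full, dfe_full = 1.5, 3

f = ((sse_reduced - sse_full) / (dfe_reduced - dfe_full)) / (sse_full / dfe_full)
print(f)  # 16.0
```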


3a.3 - Dummy Variable Regression

The GLM can be viewed from the regression perspective as an ordinary multiple linear regression (MLR) with ‘dummy’ coding (actually indicator coding) for the categorical treatment levels. Typically software performing the MLR will automatically include an intercept, which complicates the interpretation of the regression coefficients.

In IML, we now replace the design matrix with:

/* Dummy Variable Regression Model */
x = {
1	1	0,
1	1	0, 
1	0	1,
1	0	1,
1   	0   	0,
1   	0   	0};

We can see something strange with this design matrix. Although we have three treatment levels in this example, we have a column of ‘1’s for the intercept and only two columns that have been indicator coded for treatment levels. The reason for this is that the complete matrix

\(\begin{bmatrix}
1 & 1 & 0 & 0\\
1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\
1 & 0 & 0 & 1
\end{bmatrix}\)

has the property that the sum of columns 2-4 equals the first column for the intercept. As a result, a condition called singularity is created and the matrix computations will not run. So one of the treatment levels is omitted from the coding in our design matrix above for IML. It doesn’t matter which one we eliminate. The eliminated level is often called the ‘reference’ level: its mean becomes the intercept, and the remaining coefficients are differences from it. The choice is simply an algebraic manipulation and does not change the ANOVA results. In our design matrix for IML we have eliminated the indicator coding for treatment level 3.

Re-running IML, we now get the following output:

Regression Coefficients
Beta_0    5.5
Beta_1   -4
Beta_2   -2

The coefficient for \(\beta_0\) is the mean for treatment level3. The mean for treatment level1 is then calculated from \(\beta_0+\beta_1=1.5\). Likewise, the mean for treatment level2 is calculated as \(\beta_0+\beta_2=3.5\).
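A quick Python sketch of that recovery, using the coefficients reported above:

```python
# Recover the treatment-level means from the dummy-coded coefficients.
# Level 3 is the omitted level, so its mean is the intercept itself.
b0, b1, b2 = 5.5, -4.0, -2.0

mean_level3 = b0
mean_level1 = b0 + b1
mean_level2 = b0 + b2
print(mean_level1, mean_level2, mean_level3)  # 1.5 3.5 5.5
```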

Notice that the F statistic calculated from this model is the same as that produced from the Cell Means model.

ANOVA
             df      SS     MS      F
Treatment     2      16      8     16
Error         3     1.5    0.5
Total         5    17.5

 Using Minitab

We can confirm our ANOVA table by running the analysis in ordinary software such as Minitab, given that we can set the coding that the software uses. In Minitab, under Stat > ANOVA > General Linear Model, we control this by specifying Indicator (1, 0) coding:

 [Minitab dialog box for the General Linear Model]

This produces the regular ANOVA output:

Analysis of Variance
Source    DF   Adj SS   Adj MS   F-Value   P-Value
trt        2   16.000   8.0000     16.00     0.025
Error      3    1.500   0.5000
Total      5   17.500

And also the Regression Equation

Regression Equation

y = 5.500 - 4.000 trt_level1 - 2.000 trt_level2 + 0.0 trt_level3

 Using SAS

In SAS, the default coding is indicator coding, so when you specify the option

model y=trt / solution;

you get the regression coefficients:

Solution for Fixed Effects
Effect      trt      Estimate   Standard Error   DF   t Value   Pr > |t|
Intercept             5.5000        0.5000        3     11.00     0.0016
trt         level1   -4.0000        0.7071        3     -5.66     0.0109
trt         level2   -2.0000        0.7071        3     -2.83     0.0663
trt         level3    0

And the same ANOVA table:

Type 3 Analysis of Variance
Source     DF   Sum of Squares   Mean Square   Expected Mean Square   Error Term     Error DF   F Value   Pr > F
trt         2        16.000000      8.000000   Var(Residual)+Q(trt)   MS(Residual)          3     16.00   0.0251
Residual    3         1.500000      0.500000   Var(Residual)

The intermediate calculations for this model are:

xprimex
6 2 2
2 2 0
2 0 2
check
1 -2.22E-16 0
3.331E-16 1 0
0 0 1
xprimey
21
3
7
SumY2
89.5
CF
73.5
xprimexinv
0.5 -0.5 -0.5
-0.5 1 0.5
-0.5 0.5 1

3a.4 - The Effects Model

In the effects model we are no longer estimating means, but instead are estimating the deviations of treatment means from the overall mean (the \(\tau_i\)). The model must include the intercept, so we have the following design matrix for IML:

/* The Effects Model */
x={
1	1	0,
1	1	0,
1	0	1,
1	0	1,
1	-1	-1,
1	-1	-1};
 

Here we have another omission of a treatment level, but for a different reason. In the effects model we have the constraint \(\Sigma \tau_i =0\), so only r - 1 of the \(\tau_i\) are free parameters and one treatment level’s column must be removed; its rows are coded ‘-1’ in the remaining columns. Again, the coding for treatment level 3 has been omitted.

Re-running with this design matrix yields:

Regression Coefficients
Beta_0    3.5
Beta_1   -2
Beta_2    0

ANOVA
             df      SS     MS      F
Treatment     2      16      8     16
Error         3     1.5    0.5
Total         5    17.5

The regression coefficient \(\beta_0\) is the overall mean, and the coefficients \(\beta_1\) and \(\beta_2\) are \(\tau_1\) and \(\tau_2\) respectively. The estimate for \(\tau_3\) is obtained as \(-\tau_1 - \tau_2 = 2.0\).
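In Python terms (an illustrative sketch using the fitted values above):

```python
# Recover the effect estimates and level means from the effects-model fit.
# tau_3 is implied by the sum-to-zero constraint on the tau_i.
mu, tau1, tau2 = 3.5, -2.0, 0.0
tau3 = -tau1 - tau2                        # 2.0

# Each level mean is the overall mean plus its effect.
means = [mu + t for t in (tau1, tau2, tau3)]
print(means)  # [1.5, 3.5, 5.5]
```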

If we change the coding in Minitab now to be Effect coding (1,0,-1), the default setting, we get the following:

Regression Equation

y = 3.500 - 2.000 trt_level1 - 0.000 trt_level2 + 2.000 trt_level3

The ANOVA table is the same as for the dummy-variable regression model above.

The intermediates were:

xprimex
6 0 0
0 4 2
0 2 4
check
1 0 0
0 1 0
0 0 1
xprimey
21
-8
-4
SumY2
89.5
CF
73.5
xprimexinv
0.1666667 0 0
0 0.3333333 -0.166667
0 -0.166667 0.3333333

3a.5 - Summary

By coding treatment or factor levels into numerical terms, we can use regression methods to perform the ANOVA.

 See Textbook: Section 16.8

To state the ANOVA model

\(Y_{ij}=\mu_. + \tau_i + \epsilon_{ij}\)

as a regression model, we need to represent the parameters \(\mu_., \tau_1 , \ldots , \tau_r\) in the model. However, constraint 16.64 (from text) for the case of equal weightings:

\(\sum_{i=1}^{r} \tau_i =0\)

implies that one of the r parameters \(\tau_i\) is not needed since it can be expressed in terms of the other r - 1 parameters. We shall drop the parameter \(\tau_r\), which according to constraint 16.64 (from text) can be expressed in terms of the other r - 1 parameters \(\tau_i\) as follows:

\(\tau_r = -\tau_1 - \tau_2 - \cdots - \tau_{r-1}\)

Thus, we shall use only the parameters \(\mu_., \tau_1 , ... , \tau_{r-1}\) for the linear model.

To illustrate how a linear model is developed with this approach, consider a single-factor study with r = 3 factor levels when \(n_1=n_2=n_3=2\). The Y, X, \(\boldsymbol{\beta}\) and \(\boldsymbol{\epsilon}\) matrices for this case are as follows:

\(\mathbf{Y} = \begin{bmatrix}Y_{11}\\ Y_{12}\\ Y_{21}\\ Y_{22}\\ Y_{31}\\ Y_{32}\end{bmatrix} \mathbf{X} = \begin{bmatrix}1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ 1 & -1 & -1\\ 1 & -1 & -1\end{bmatrix} \boldsymbol{\beta} = \begin{bmatrix}\beta_{0}\\ \beta_{1}\\ \beta_{2}\end{bmatrix} \boldsymbol{\epsilon} = \begin{bmatrix}\epsilon_{11}\\ \epsilon_{12}\\ \epsilon_{21}\\ \epsilon_{22}\\ \epsilon_{31}\\ \epsilon_{32}\end{bmatrix}\)

Note that the vector of expected values \(\mathbf{E}\{\mathbf{Y}\} =\mathbf{X}\boldsymbol{\beta}\), yields the following:

\(\mathbf{E}\{\mathbf{Y}\} = \begin{bmatrix}E\{Y_{11}\}\\ E\{Y_{12}\}\\ E\{Y_{21}\}\\ E\{Y_{22}\}\\E\{Y_{31}\}\\E\{Y_{32}\}\end{bmatrix} =\mathbf{X}\boldsymbol{\beta}= \begin{bmatrix}1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ 1 & -1 & -1\\ 1 & -1 & -1\end{bmatrix}\begin{bmatrix}\beta_{0}\\ \beta_{1}\\ \beta_{2}\end{bmatrix}=\begin{bmatrix}\mu_{.}+\tau_{1}\\ \mu_{.}+\tau_{1}\\ \mu_{.}+\tau_{2}\\ \mu_{.}+\tau_{2}\\ \mu_{.}-\tau_{1}-\tau_{2}\\ \mu_{.}-\tau_{1}-\tau_{2}\end{bmatrix}\)

Since \(\tau_3=-\tau_{1}-\tau_{2}\), see above, we see that \(E\{Y_{31}\} =E\{Y_{32}\}=\mu_{.}+\tau_{3}\). Thus, the above X matrix and \(\beta\) vector representation provides in all cases the appropriate expected values:

\(E\{Y_{ij}\} =\mu_{.}+\tau_{i}\)
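As a final illustrative check (a Python sketch), multiplying the effect-coded X matrix by the fitted \(\boldsymbol{\beta}\) reproduces \(\mu_.+\tau_{i}\) for every observation, including the third level:

```python
# E{Y} = X beta under effect coding, using the estimates from this lesson.
x = [[1, 1, 0], [1, 1, 0],
     [1, 0, 1], [1, 0, 1],
     [1, -1, -1], [1, -1, -1]]
mu, tau1, tau2 = 3.5, -2.0, 0.0
beta = [mu, tau1, tau2]

expected = [sum(xij * bj for xij, bj in zip(row, beta)) for row in x]
print(expected)  # [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
```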

