4.5 - A Note About Balanced Designs

Introduction to Unbalanced Data

Up to this point in the course, the data have all come from balanced designs (an equal number of observations in every cell). In neatly designed experiments, this is often the case. However, there are countless examples where test samples are lost or contaminated, people drop out of studies, or data are somehow "lost". The result? Unbalanced designs. The information below demonstrates how the "mechanics" of ANOVA are influenced when the designs are not balanced.

In Lesson 2 we developed the ANOVA model for a single treatment (the one-way ANOVA) using the following model:

\(Y_{ij}=\mu+\tau_i+\epsilon_{ij}\)

We developed the ideas for this model using situations where we had an equal number of observations (the \(n_i\)) in each of the treatment levels. This is referred to as a balanced design. The computations we used to obtain the ANOVA table began by calculating 1) the overall or grand mean, and 2) the means for each of the treatment levels. Using these, we obtained the \(SS_{\text{Total}}\), the \(SS_{\text{Trt}}\), and (by difference) the \(SS_{\text{Error}}\).

Looking back at the example data that we used (Lesson1 Data):

Control F1 F2 F3
21 32 22.5 28
19.5 30.5 26 27.5
22.5 25 28 31
21.5 27.5 27 29.5
20.5 28 26.5 30
21 28.6 25.2 29.2

We obtained the following table of means:

Grand Mean 26.1667
F1 28.6000
F2 25.8667
F3 29.2000
Control 21.0000

General Linear Model: ht versus trt

Factor Type Levels Values
trt fixed 4 control, f1, f2, f3
Analysis of Variance for ht, using Adjusted SS for Tests
Source DF Seq SS Adj SS Adj MS F P
trt 3 251.440 251.440 83.813 27.46 0.000
Error 20 61.033 61.033 3.052    
Total 23 312.473        
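The hand computations described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software that produced the output in this lesson, and the variable names (`heights`, `ss_trt`, and so on) are our own. Run on the balanced data, it should reproduce the sums of squares and the F statistic in the ANOVA table above.

```python
# One-way ANOVA sums of squares computed "by hand" for the balanced data.
heights = {
    "control": [21, 19.5, 22.5, 21.5, 20.5, 21],
    "f1": [32, 30.5, 25, 27.5, 28, 28.6],
    "f2": [22.5, 26, 28, 27, 26.5, 25.2],
    "f3": [28, 27.5, 31, 29.5, 30, 29.2],
}

# 1) overall (grand) mean and 2) treatment level means
all_obs = [y for ys in heights.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)
trt_means = {trt: sum(ys) / len(ys) for trt, ys in heights.items()}

# Total SS, treatment SS, and error SS by difference
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
ss_trt = sum(
    len(ys) * (trt_means[trt] - grand_mean) ** 2 for trt, ys in heights.items()
)
ss_error = ss_total - ss_trt

# Degrees of freedom and the F statistic for the treatment effect
df_trt = len(heights) - 1
df_error = len(all_obs) - len(heights)
f_stat = (ss_trt / df_trt) / (ss_error / df_error)

print(f"grand mean = {grand_mean:.4f}")
print(f"SS_Trt = {ss_trt:.3f}, SS_Error = {ss_error:.3f}, SS_Total = {ss_total:.3f}")
print(f"F = {f_stat:.2f} on {df_trt} and {df_error} df")
```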

Using the Regression Approach:

C1-T C2 C3 C4 C5 C6 C7 C8-T
trt ht x0 x1 x2 x3 COEF1  
control 21.0 1 1 0 0 26.1667 grandmean
control 19.5 1 1 0 0 -5.1667  
control 22.5 1 1 0 0 2.4333  
control 21.5 1 1 0 0 -0.3000  
control 20.5 1 1 0 0    
control 21.0 1 1 0 0    
f1 32.0 1 0 1 0    
f1 30.5 1 0 1 0    
f1 25.0 1 0 1 0    
f1 27.5 1 0 1 0    
f1 28.0 1 0 1 0    
f1 28.6 1 0 1 0    
f2 22.5 1 0 0 1    
f2 26.0 1 0 0 1    
f2 28.0 1 0 0 1    
f2 27.0 1 0 0 1    
f2 26.5 1 0 0 1    
f2 25.2 1 0 0 1    
f3 28.0 1 -1 -1 -1    
f3 27.5 1 -1 -1 -1    
f3 31.0 1 -1 -1 -1    
f3 29.5 1 -1 -1 -1    
f3 30.0 1 -1 -1 -1    
f3 29.2 1 -1 -1 -1    
 

we obtained the parameter estimates as the regression coefficients (shown in column C7 above). Note that, in this case, the estimate of \(\beta_0\) in the regression model was the overall or grand mean.
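A minimal sketch of this regression fit in Python, assuming the effect coding shown in columns x1 to x3 of the worksheet (f3 is the level coded -1 in every column). To keep the example free of dependencies, it solves the normal equations with a small helper, `solve`, which is our own hypothetical function and not a library routine; in practice a statistics package would do this step. The fitted coefficients should match the COEF1 column.

```python
# Least squares fit of ht on the effect-coded columns x0..x3 (balanced data).
heights = {
    "control": [21, 19.5, 22.5, 21.5, 20.5, 21],
    "f1": [32, 30.5, 25, 27.5, 28, 28.6],
    "f2": [22.5, 26, 28, 27, 26.5, 25.2],
    "f3": [28, 27.5, 31, 29.5, 30, 29.2],
}
# Effect coding (x1, x2, x3) for each treatment level, as in the worksheet.
codes = {"control": (1, 0, 0), "f1": (0, 1, 0), "f2": (0, 0, 1), "f3": (-1, -1, -1)}

# Design matrix rows (x0, x1, x2, x3) and the response vector.
X = [(1, *codes[trt]) for trt, ys in heights.items() for _ in ys]
y = [ht for ys in heights.values() for ht in ys]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


# Normal equations: (X'X) b = X'y
p = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
coef = solve(xtx, xty)

for name, b in zip(["b0", "b1", "b2", "b3"], coef):
    print(f"{name} = {b:.4f}")
```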

Now, what would happen if one of the plants died?

The data might look like this:

Control F1 F2 F3
21 32 22.5 28
19.5 30.5 26 27.5
22.5 25 28 31
21.5 27.5 27 29.5
20.5 28 26.5 30
21 28.6 25.2 (dead)

For the new data, we get the following table of means:

Grand Mean 26.0348
F1 28.6000
F2 25.8667
F3 29.2000
Control 21.0000
Note! The plant that died just happened to have a height equal to the mean for that treatment, so the treatment mean did not change in this example. We did this on purpose, to illustrate what is happening when we run the ANOVA on unbalanced data.

The ANOVA now is:

Factor Type Levels Values
trt fixed 4 control, f1, f2, f3
Analysis of Variance for ht, using Adjusted SS for Tests
Source DF Seq SS Adj SS Adj MS F P
trt 3 241.839 241.839 80.613 25.10 0.000
Error 19 61.033 61.033 3.212    
Total 22 302.872        
 

The treatment SS has changed because the formula for \(SS_{\text{Trt}}\) multiplies each squared deviation of a treatment mean from the overall mean by \(n_i\), and there are now only 5 plants instead of 6 in the F3 treatment level. The overall mean itself has also changed, from 26.1667 to 26.0348. The error degrees of freedom are now 19 and the total degrees of freedom are now 22, but the treatment degrees of freedom do not change.
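To see where the new treatment SS comes from, we can plug the treatment means and the new overall mean into the usual formula, writing \(\bar{Y}_{i.}\) for the treatment level means and \(\bar{Y}_{..}\) for the overall mean. This is a worked sketch with the means rounded to four decimals, so the last digit may differ slightly from the software output:

\(SS_{\text{Trt}}=\sum_{i} n_i\left(\bar{Y}_{i.}-\bar{Y}_{..}\right)^2\)

\(SS_{\text{Trt}}=6(21.0000-26.0348)^2+6(28.6000-26.0348)^2+6(25.8667-26.0348)^2+5(29.2000-26.0348)^2\approx 241.84\)

Only the last term carries the weight 5, but every term is affected because \(\bar{Y}_{..}\) moved from 26.1667 to 26.0348.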

We can again run the ANOVA as a Regression:

C1-T C2 C3 C4 C5 C6 C7 C
trt ht x0 x1 x2 x3 COEF1  
control 21.0 1 1 0 0 26.1667  
control 19.5 1 1 0 0 -5.1667  
control 22.5 1 1 0 0 2.4333  
control 21.5 1 1 0 0 -0.3000  
control 20.5 1 1 0 0    
control 21.0 1 1 0 0    
f1 32.0 1 0 1 0    
f1 30.5 1 0 1 0    
f1 25.0 1 0 1 0    
f1 27.5 1 0 1 0    
f1 28.0 1 0 1 0    
f1 28.6 1 0 1 0    
f2 22.5 1 0 0 1    
f2 26.0 1 0 0 1    
f2 28.0 1 0 0 1    
f2 27.0 1 0 0 1    
f2 26.5 1 0 0 1    
f2 25.2 1 0 0 1    
f3 28.0 1 -1 -1 -1    
f3 27.5 1 -1 -1 -1    
f3 31.0 1 -1 -1 -1    
f3 29.5 1 -1 -1 -1    
f3 30.0 1 -1 -1 -1    
 

What is going on here? The coefficient in the regression model for the overall or grand mean (26.1667, the first entry in COEF1) is unchanged! But we saw that losing one of the plants changed the overall mean from 26.1667 to 26.0348.

This exercise highlights an important aspect of the ANOVA model as we have developed it so far. In Section 16.7 of the text, we see that the model appears (Equation 16.62) as:

\(Y_{ij}=\mu_{.}+\tau_{i}+\epsilon_{ij}\)

When we calculated the SS for the ANOVA by hand in Lesson 2, we used the overall mean (\(\bar{Y}_{..}\)) and the treatment level means (\(\bar{Y}_{i.}\)). In the regression analysis, the estimate of the overall mean is calculated as the least squares estimator of \(\mu_{.}\):

\(\hat{\mu}_{.}=\dfrac{\sum_{i=1}^{r}\bar{Y}_{i.}}{r}\)

as shown in Equation 16.75a in the textbook. This equation shows that the estimate of the overall mean is being calculated as the mean of the treatment level means. The authors point out that "this quantity is generally not the same as the overall mean \(\bar{Y}_{..}\) unless the cell sample sizes are equal." The \(\tau_i\) that are computed will, therefore, depend on how the overall mean is obtained, but the \(SS_{\text{Trt}}\) and F test for the treatment effect will not be affected.
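The two candidate "overall means" can be compared directly on the unbalanced plant data. The sketch below is illustrative only and the variable names are our own. It computes the mean of all observations and the mean of the treatment level means, along with the treatment effects implied by each. The unweighted effects sum to zero, which is the constraint imposed by the -1, -1, -1 coding in the regression, while the effects based on the overall mean sum to zero only after weighting by \(n_i\).

```python
# Overall mean versus mean of treatment means for the unbalanced data.
heights = {
    "control": [21, 19.5, 22.5, 21.5, 20.5, 21],
    "f1": [32, 30.5, 25, 27.5, 28, 28.6],
    "f2": [22.5, 26, 28, 27, 26.5, 25.2],
    "f3": [28, 27.5, 31, 29.5, 30],  # the sixth f3 plant died
}

n = {trt: len(ys) for trt, ys in heights.items()}
trt_means = {trt: sum(ys) / len(ys) for trt, ys in heights.items()}

# Mean of all 23 observations (what we used in the hand calculation)
overall_mean = sum(sum(ys) for ys in heights.values()) / sum(n.values())
# Unweighted mean of the four treatment means (what the regression estimates)
mean_of_means = sum(trt_means.values()) / len(trt_means)

# Treatment effects under each definition of the overall mean
tau_unweighted = {trt: m - mean_of_means for trt, m in trt_means.items()}
tau_weighted = {trt: m - overall_mean for trt, m in trt_means.items()}

sum_tau_unweighted = sum(tau_unweighted.values())               # zero
sum_tau_weighted = sum(tau_weighted.values())                   # not zero
sum_n_tau_weighted = sum(n[t] * tau_weighted[t] for t in n)     # zero

print(f"overall mean  = {overall_mean:.4f}")
print(f"mean of means = {mean_of_means:.4f}")
for trt in heights:
    print(f"{trt:8s} tau (mean of means) = {tau_unweighted[trt]:8.4f}   "
          f"tau (overall mean) = {tau_weighted[trt]:8.4f}")
```

Either way, the overall mean plus the treatment effect gives back the same treatment mean, which is why the fitted values, the \(SS_{\text{Trt}}\), and the F test do not depend on the choice.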

In the example above, we worked with a one-way or single factor ANOVA. The problems associated with unbalanced data are more important to recognize in multi-factor studies, as we will see next.

Example 4-1: Diastolic Blood Pressure with Gender and Treatment

Here is an example of an unbalanced design examining the effects of gender (female and male) and two different treatments (D and E) on diastolic blood pressure.

Tabulated statistics: Gender, Trt

 Rows: Gender  Columns: Trt

  D E All
F 84.00 80.00 82.67
  3.162 7.071 4.502
  4 2 6
  4 2 6
M 82.00 74.00 76.67
  2.828 4.082 5.354
  2 4 6
  2 4 6
All 83.33 76.00 79.67
  6 6 12
  6 6 12
Cell Contents: DiasBP: Mean
               DiasBP: Standard deviation
               DiasBP: Nonmissing
               Count

Tabulated statistics: Gender, Trt

Rows: Gender Columns: Trt

  D E All
F 4 2 6
  80 75  
  86 85  
  83    
  87    
M 2 4 6
  80 70  
  84 78  
    71  
    77  
All 6 6 12
Cell Contents: Count, followed by the individual DiasBP data values
 

Full Model with Interaction

General Linear Model: DiasBP versus Gender, Trt

Factor Type Levels Values
Gender fixed 2 F,M
Trt fixed 2 D,E
Analysis of Variance for DiasBP, using Adjusted SS for Tests
Source DF Seq SS Adj SS  Adj MS F P
Gender 1 108.00 42.67 42.67 2.47 0.154
Trt 1 96.00 96.00 96.00 5.57 0.046
Gender*Trt 1 10.67 10.67 10.67 0.62 0.454
Error 8 138.00 138.00 17.25    
Total 11 352.67        

 S = 4.15331  R-Sq = 60.87%  R-Sq(adj) = 46.20%

Term Coef SE Coef T P
Constant 80.000 1.272 62.91 0.000
Gender F 2.000 1.272 1.57 0.154
Trt D 3.000 1.272 2.36 0.046
Gender F*Trt D -1.000 1.272 -0.79 0.454
Least Squares Means for DiasBP
Gender Mean SE Mean
F   82.00 1.798
M   78.00 1.798
Trt      
D   83.00 1.798
E   77.00 1.798
Gender*Trt    
F D 84.00 2.077
F E 80.00 2.937
M D 82.00 2.937
M E 74.00 2.077
       

\(\hat{\mu}_{i.}=\hat{\mu}+\hat{\alpha}_{i}\)

\(\hat{\mu}_{1.}=80 + 2 =82\)
\(\hat{\mu}_{2.}=80 - 2 =78\)

\(\hat{\mu}_{ij}=\hat{\mu}+\hat{\alpha}_{i}+\hat{\beta}_{j}+\widehat{(\alpha\beta)}_{ij}\)

e.g. \(\hat{\mu}_{11}=80+2+3+(-1)=84\)
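The estimates above can also be recovered directly from the four cell means. The following is a minimal sketch, assuming the sum-to-zero coding implied by the coefficient table (only Gender F and Trt D are listed; the other levels take the negatives). The variable names are our own. It also shows how the least squares means, which average the cell means with equal weight, differ from the raw marginal means in the tabulated statistics, which weight each cell by its count.

```python
# Two-way unbalanced example: estimates built from the cell means.
dias_bp = {
    ("F", "D"): [80, 86, 83, 87],
    ("F", "E"): [75, 85],
    ("M", "D"): [80, 84],
    ("M", "E"): [70, 78, 71, 77],
}
genders, trts = ["F", "M"], ["D", "E"]

cell_mean = {cell: sum(v) / len(v) for cell, v in dias_bp.items()}

# Constant: unweighted mean of the cell means
mu_hat = sum(cell_mean.values()) / len(cell_mean)

# Least squares means: equal-weight averages of the cell means
ls_gender = {g: sum(cell_mean[(g, t)] for t in trts) / len(trts) for g in genders}
ls_trt = {t: sum(cell_mean[(g, t)] for g in genders) / len(genders) for t in trts}

# Main effects and interaction effects (each set sums to zero)
alpha = {g: ls_gender[g] - mu_hat for g in genders}
beta = {t: ls_trt[t] - mu_hat for t in trts}
ab = {(g, t): cell_mean[(g, t)] - mu_hat - alpha[g] - beta[t]
      for g in genders for t in trts}

# Raw marginal means: every observation weighted equally
raw_gender = {}
for g in genders:
    obs = [y for t in trts for y in dias_bp[(g, t)]]
    raw_gender[g] = sum(obs) / len(obs)

print(f"constant = {mu_hat:.3f}")
print(f"Gender F = {alpha['F']:.3f}, Trt D = {beta['D']:.3f}, "
      f"Gender F*Trt D = {ab[('F', 'D')]:.3f}")
for g in genders:
    print(f"Gender {g}: LS mean = {ls_gender[g]:.2f}, raw mean = {raw_gender[g]:.2f}")
```

For females, for example, the least squares mean is 82.00 while the raw mean is 82.67, because the raw mean gives the D cell (4 observations) twice the weight of the E cell (2 observations).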