Introduction to Unbalanced Data
Up to this point in the course, the data have all come from balanced designs (there are equal numbers of observations in all cells). In neatly designed experiments, this is often the case. However, there are countless examples where test samples are lost or contaminated, people drop out of studies, or data are somehow "lost". The result? Unbalanced designs. The information below demonstrates how the "mechanics" of ANOVA are influenced when the designs are not balanced.
In Lesson 2 we developed the ANOVA model for a single treatment (the one-way ANOVA) using the following model:
\(Y_{ij}=\mu+\tau_i+\epsilon_{ij}\)
We developed the ideas for this model using situations where we had an equal number of observations (the \(n_i\)) in each of the treatment levels. This is referred to as a balanced design. The computations we used to obtain the ANOVA table began by calculating 1) the overall or grand mean, and 2) the means for each of the treatment levels. Using these, we obtained the \(SS_{\text{Total}}\), the \(SS_{\text{Trt}}\), and (by difference) the \(SS_{\text{Error}}\).
Looking back at the example data that we used (Lesson 1 data):
Control  F1  F2  F3 
21  32  22.5  28 
19.5  30.5  26  27.5 
22.5  25  28  31 
21.5  27.5  27  29.5 
20.5  28  26.5  30 
21  28.6  25.2  29.2 
We obtained the following table of means:
Grand Mean  26.1667 

F1  28.6000 
F2  25.8667 
F3  29.2000 
Control  21.0000 
General Linear Model: ht versus trt
Factor  Type  Levels  Values 

trt  fixed  4  control, f1, f2, f3 
Analysis of Variance for ht, using Adjusted SS for Tests
Source  DF  Seq SS  Adj SS  Adj MS  F  P 

trt  3  251.440  251.440  83.813  27.46  0.000 
Error  20  61.033  61.033  3.052  
Total  23  312.473 
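The hand computation described above (grand mean, treatment means, then the sums of squares) can be sketched in a few lines of Python. This is only an illustrative check, not the Minitab procedure; the variable names (`data`, `ss_trt`, and so on) are chosen here and do not come from the lesson.

```python
# Lesson 1 plant-height data, keyed by treatment level.
data = {
    "control": [21, 19.5, 22.5, 21.5, 20.5, 21],
    "f1": [32, 30.5, 25, 27.5, 28, 28.6],
    "f2": [22.5, 26, 28, 27, 26.5, 25.2],
    "f3": [28, 27.5, 31, 29.5, 30, 29.2],
}

all_obs = [y for ys in data.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)
trt_means = {trt: sum(ys) / len(ys) for trt, ys in data.items()}

# SS_Total: squared deviations of every observation from the grand mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
# SS_Trt: squared deviations of treatment means from the grand mean,
# each weighted by the number of observations n_i in that level.
ss_trt = sum(len(ys) * (trt_means[trt] - grand_mean) ** 2 for trt, ys in data.items())
# SS_Error is obtained by difference.
ss_error = ss_total - ss_trt

df_trt = len(data) - 1
df_error = len(all_obs) - len(data)
f_stat = (ss_trt / df_trt) / (ss_error / df_error)

print(round(grand_mean, 4), round(ss_trt, 3), round(ss_error, 3), round(f_stat, 2))
```

The printed values should agree, up to rounding, with the grand mean and the trt, Error and F entries in the ANOVA table above.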
Using the Regression Approach:
C1T  C2  C3  C4  C5  C6  C7  C8T 

trt  ht  x0  x1  x2  x3  COEF1  
control  21.0  1  1  0  0  26.1667  grandmean 
control  19.5  1  1  0  0  -5.1667  
control  22.5  1  1  0  0  2.4333  
control  21.5  1  1  0  0  -0.3000  
control  20.5  1  1  0  0  
control  21.0  1  1  0  0  
f1  32.0  1  0  1  0  
f1  30.5  1  0  1  0  
f1  25.0  1  0  1  0  
f1  27.5  1  0  1  0  
f1  28.0  1  0  1  0  
f1  28.6  1  0  1  0  
f2  22.5  1  0  0  1  
f2  26.0  1  0  0  1  
f2  28.0  1  0  0  1  
f2  27.0  1  0  0  1  
f2  26.5  1  0  0  1  
f2  25.2  1  0  0  1  
f3  28.0  1  -1  -1  -1  
f3  27.5  1  -1  -1  -1  
f3  31.0  1  -1  -1  -1  
f3  29.5  1  -1  -1  -1  
f3  30.0  1  -1  -1  -1  
f3  29.2  1  -1  -1  -1 
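Using this approach, we obtained the parameter estimates as the regression coefficients (shown in C7 above). Note that the overall or grand mean, in this case, was the coefficient for \(\beta_0\) in the regression model. The remaining coefficients are the estimated treatment effects \(\hat{\tau}_i\) for control, f1 and f2.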
Now, what would happen if one of the plants died?
The data might look like this:
Control  F1  F2  F3 
21  32  22.5  28 
19.5  30.5  26  27.5 
22.5  25  28  31 
21.5  27.5  27  29.5 
20.5  28  26.5  30 
21  28.6  25.2  (dead) 
For the new data, we get the following table of means:
Grand Mean  26.0348 
F1  28.6000 
F2  25.8667 
F3  29.2000 
Control  21.0000 
The ANOVA now is:
Factor  Type  Levels  Values 

trt  fixed  4  control, f1, f2, f3 
Analysis of Variance for ht, using Adjusted SS for Tests
Source  DF  Seq SS  Adj SS  Adj MS  F  P 

trt  3  241.839  241.839  80.613  25.10  0.000 
Error  19  61.033  61.033  3.212  
Total  22  302.872 
The treatment SS has changed because the formula for \(SS_{\text{Trt}}\) multiplies each squared deviation of a treatment mean from the overall mean by \(n_i\), and the F3 treatment level now has only 5 plants instead of 6. The overall mean itself has also changed. The error degrees of freedom are now 19 and the total degrees of freedom are now 22, but the treatment degrees of freedom do not change.
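As a worked check of the role of \(n_i\), the treatment sum of squares can be written out using the treatment means and the new grand mean from the table above (values rounded as displayed, so the result is approximate):
\(SS_{\text{Trt}}=\sum_{i} n_i(\bar{Y}_{i.}-\bar{Y}_{..})^2\)
\(=6(21.0000-26.0348)^2+6(28.6000-26.0348)^2+6(25.8667-26.0348)^2+5(29.2000-26.0348)^2\approx 241.84\)
In this particular data set the lost observation (29.2) happened to equal the F3 mean, which is why the F3 mean and the \(SS_{\text{Error}}\) are the same as before.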
We can again run the ANOVA as a Regression:
C1T  C2  C3  C4  C5  C6  C7  C 

trt  ht  x0  x1  x2  x3  COEF1  
control  21.0  1  1  0  0  26.1667  
control  19.5  1  1  0  0  -5.1667  
control  22.5  1  1  0  0  2.4333  
control  21.5  1  1  0  0  -0.3000  
control  20.5  1  1  0  0  
control  21.0  1  1  0  0  
f1  32.0  1  0  1  0  
f1  30.5  1  0  1  0  
f1  25.0  1  0  1  0  
f1  27.5  1  0  1  0  
f1  28.0  1  0  1  0  
f1  28.6  1  0  1  0  
f2  22.5  1  0  0  1  
f2  26.0  1  0  0  1  
f2  28.0  1  0  0  1  
f2  27.0  1  0  0  1  
f2  26.5  1  0  0  1  
f2  25.2  1  0  0  1  
f3  28.0  1  -1  -1  -1  
f3  27.5  1  -1  -1  -1  
f3  31.0  1  -1  -1  -1  
f3  29.5  1  -1  -1  -1  
f3  30.0  1  -1  -1  -1 
What is going on here? The coefficient in the regression model for the overall or grand mean (the first entry in C7) is unchanged at 26.1667! Yet we saw that losing one of the plants changed the overall mean from 26.1667 to 26.0348.
This exercise highlights an important aspect of the ANOVA model as we have developed it so far. In Section 16.7 of the text, we see that the model appears (Equation 16.62) as:
\(Y_{ij}=\mu_{.}+\tau_{i}+\epsilon_{ij}\)
When we calculated the SS for the ANOVA by hand in Lesson 2, we used the overall mean (\(\bar{Y}_{..}\)) and the treatment level means (\(\bar{Y}_{i.}\)). In the regression analysis, the estimate of the overall mean is being calculated as the least squares estimator of \(\mu_{.}\):
\(\hat{\mu}_{.}=\dfrac{\sum_{i=1}^{r}\bar{Y}_{i.}}{r}\)
as shown in Equation 16.75a in the textbook. This equation shows that the estimate of the overall mean is being calculated as the mean of the treatment level means. The authors point out that "this quantity is generally not the same as the overall mean \(\bar{Y}_{..}\) unless the cell sample sizes are equal." The \(\tau_i\) that are computed will, therefore, depend on how the overall mean is obtained, but the \(SS_{\text{Trt}}\) and the F test for the treatment effect will not be affected.
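The point can be checked numerically with a minimal sketch that fits the regression to the unbalanced data. It assumes the last treatment level (f3) is coded -1 on every indicator column, and it solves the normal equations with a small hand-written elimination routine. The helper names `solve` and `effect_coded_fit` are hypothetical and written only for this illustration.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def effect_coded_fit(data):
    """Least squares fit on x0..x3, with the last level coded -1 on x1..x3 (an assumption)."""
    levels = list(data)
    rows, y = [], []
    for k, trt in enumerate(levels):
        if k < len(levels) - 1:
            codes = [1.0 if j == k else 0.0 for j in range(len(levels) - 1)]
        else:
            codes = [-1.0] * (len(levels) - 1)
        for obs in data[trt]:
            rows.append([1.0] + codes)  # x0 is the intercept column
            y.append(obs)
    p = len(levels)
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


unbalanced = {
    "control": [21, 19.5, 22.5, 21.5, 20.5, 21],
    "f1": [32, 30.5, 25, 27.5, 28, 28.6],
    "f2": [22.5, 26, 28, 27, 26.5, 25.2],
    "f3": [28, 27.5, 31, 29.5, 30],  # the sixth plant died
}

coef = effect_coded_fit(unbalanced)
obs = [v for ys in unbalanced.values() for v in ys]
grand_mean = sum(obs) / len(obs)
mean_of_means = sum(sum(ys) / len(ys) for ys in unbalanced.values()) / len(unbalanced)

print([round(c, 4) for c in coef])
print(round(grand_mean, 4), round(mean_of_means, 4))
```

Under these assumptions, the intercept `coef[0]` coincides with `mean_of_means` (the mean of the four treatment means), not with `grand_mean` (the mean of the 23 observations).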
In the example above, we worked with a one-way or single factor ANOVA. The problems associated with unbalanced data are more important to recognize in multifactor studies, as we will see next.
Example 41: Diastolic Blood Pressure with Gender and Treatment
Here is an example of an unbalanced design examining the effects of gender (females and males) and two different treatments (D and E) on diastolic blood pressure.
Tabulated statistics: Gender, Trt
Rows: Gender Columns: Trt
D  E  All  

F  84.00  80.00  82.67 
3.162  7.071  4.502  
4  2  6  
4  2  6  
M  82.00  74.00  76.67 
2.828  4.082  5.354  
2  4  6  
2  4  6  
All  83.33  76.00  79.67 
6  6  12  
6  6  12 
Cell Contents:  DiasBP:  Mean 
DiasBP:  Standard deviation 
DiasBP:  Nonmissing count 
Tabulated statistics: Gender, Trt
Rows: Gender Columns: Trt
D  E  All  

F  4  2  6 
80  75  
86  85  
83  
87  
M  2  4  6 
80  70  
84  78  
71  
77  
All  6  6  12 
Cell Contents:  DiasBP:  Count DATA 
Full Model with Interaction
General Linear Model: DiasBP versus Gender, Trt
Factor  Type  Levels  Values 

Gender  fixed  2  F,M 
Trt  fixed  2  D,E 
Analysis of Variance for DiasBP, using Adjusted SS for Tests
Source  DF  Seq SS  Adj SS  Adj MS  F  P 

Gender  1  108.00  42.67  42.67  2.47  0.154 
Trt  1  96.00  96.00  96.00  5.57  0.046 
Gender*Trt  1  10.67  10.67  10.67  0.62  0.454 
Error  8  138.00  138.00  17.25  
Total  11  352.67 
S = 4.15331  R-Sq = 60.87%  R-Sq(adj) = 46.20%
Term  Coef  SE Coef  T  P 

Constant  80.000  1.272  62.91  0.000 
Gender F  2.000  1.272  1.57  0.154 
Trt D  3.000  1.272  2.36  0.046 
Gender F*Trt D  -1.000  1.272  -0.79  0.454 
Least Squares Means for DiasBP
Gender  Mean  SE Mean  

F  82.00  1.798  
M  78.00  1.798  
Trt  
D  83.00  1.798  
E  77.00  1.798  
Gender*Trt  
F  D  84.00  2.077 
F  E  80.00  2.937 
M  D  82.00  2.937 
M  E  74.00  2.077 
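In the ANOVA table above, the Seq SS and Adj SS for Gender differ (108.00 versus 42.67). One way to see where the two numbers come from is sketched below. It assumes that, for this 2 x 2 layout, each adjusted SS can be obtained as a single degree of freedom contrast of the cell means, \(L^2/\sum c^2/n\); the helper `contrast_ss` is a hypothetical name introduced here, not a Minitab function.

```python
# Diastolic blood pressure by (gender, treatment) cell, from the data table above.
cells = {
    ("F", "D"): [80, 86, 83, 87],
    ("F", "E"): [75, 85],
    ("M", "D"): [80, 84],
    ("M", "E"): [70, 78, 71, 77],
}
cell_mean = {k: sum(v) / len(v) for k, v in cells.items()}
cell_n = {k: len(v) for k, v in cells.items()}


def contrast_ss(weights):
    """SS for a 1-df contrast of cell means: L^2 / sum(c^2 / n)."""
    estimate = sum(c * cell_mean[k] for k, c in weights.items())
    scale = sum(c ** 2 / cell_n[k] for k, c in weights.items())
    return estimate ** 2 / scale


# Sequential SS for Gender (entered first): uses the raw marginal means,
# so each gender mean is weighted by the cell sample sizes.
obs = [y for v in cells.values() for y in v]
grand_mean = sum(obs) / len(obs)
seq_gender = 0.0
for g in ("F", "M"):
    ys = [y for (gg, _), v in cells.items() if gg == g for y in v]
    seq_gender += len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2

# Adjusted SS: contrasts that give every cell mean equal weight.
adj_gender = contrast_ss({("F", "D"): 0.5, ("F", "E"): 0.5, ("M", "D"): -0.5, ("M", "E"): -0.5})
adj_trt = contrast_ss({("F", "D"): 0.5, ("F", "E"): -0.5, ("M", "D"): 0.5, ("M", "E"): -0.5})
adj_inter = contrast_ss({("F", "D"): 1, ("F", "E"): -1, ("M", "D"): -1, ("M", "E"): 1})

print(round(seq_gender, 2), round(adj_gender, 2), round(adj_trt, 2), round(adj_inter, 2))
```

With balanced cells the two kinds of SS would coincide. Here they differ for Gender because the raw gender means weight the D and E cells unequally.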
\(\hat{\mu}_{i.}=\hat{\mu}+\hat{\alpha}_{i}\)
\(\hat{\mu}_{1.}=80 + 2 =82\)
\(\hat{\mu}_{2.}=80 - 2 =78\)
\(\hat{\mu}_{ij}=\hat{\mu}+\hat{\alpha}_{i}+\hat{\beta}_{j}+\widehat{(\alpha\beta)}_{ij}\)
e.g. \(\hat{\mu}_{11}=80+2+3+(-1)=84\)
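The relations above can be illustrated with a short sketch that rebuilds the constant, the effects and the least squares means from the four cell means. It assumes the sum-to-zero parameterization implied by the equations, in which \(\hat{\mu}\) is the unweighted mean of the cell means; the variable names are illustrative only.

```python
cells = {
    ("F", "D"): [80, 86, 83, 87],
    ("F", "E"): [75, 85],
    ("M", "D"): [80, 84],
    ("M", "E"): [70, 78, 71, 77],
}
genders, trts = ("F", "M"), ("D", "E")
cell_mean = {k: sum(v) / len(v) for k, v in cells.items()}

# mu-hat: unweighted mean of the four cell means (the Constant term).
mu = sum(cell_mean.values()) / len(cell_mean)

# Least squares means: unweighted averages of the cell means in a row or column.
ls_gender = {g: sum(cell_mean[(g, t)] for t in trts) / len(trts) for g in genders}
ls_trt = {t: sum(cell_mean[(g, t)] for g in genders) / len(genders) for t in trts}

# Raw marginal means: every observation counts once, so larger cells weigh more.
raw_gender = {
    g: sum(sum(cells[(g, t)]) for t in trts) / sum(len(cells[(g, t)]) for t in trts)
    for g in genders
}

# Effects under the sum-to-zero constraints.
alpha = {g: ls_gender[g] - mu for g in genders}
beta = {t: ls_trt[t] - mu for t in trts}
ab = {(g, t): cell_mean[(g, t)] - mu - alpha[g] - beta[t] for g in genders for t in trts}

print(mu, alpha["F"], beta["D"], ab[("F", "D")])
print(ls_gender, {g: round(m, 2) for g, m in raw_gender.items()})
```

The least squares means for F and M (82 and 78) differ from the raw marginal means in the "All" column of the first table (82.67 and 76.67) because the cells are unbalanced.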