4.2 - RCBD and RCBDs with Missing Data

Example 4.1: Vascular Graft

This example investigates a procedure to create artificial arteries using a resin. The resin is pressed or extruded through an aperture that forms the resin into a tube.

To conduct this experiment as an RCBD, we need to assign all 4 pressures at random to each of the 6 batches of resin. Each batch of resin is called a “block”, since a batch is a more homogeneous set of experimental units on which to test the extrusion pressures. Below is a table which provides the percentages of those products that met the specifications.

	Batch of Resin (Block)
Extrusion Pressure (PSI)	1	2	3	4	5	6	Treatment Total
8500	90.3	89.2	98.2	93.9	87.4	97.9	556.9
8700	92.5	89.5	90.6	94.7	87.0	95.8	550.1
8900	85.5	90.8	89.6	86.2	88.0	93.4	533.5
9100	82.5	89.5	85.6	87.4	78.9	90.7	514.6
Block Totals	350.8	359.0	364.0	362.2	341.3	377.8	\(y_{..} = 2155.1\)
Table 4-3 Randomized Complete Block Design for the Vascular Graft Experiment

NOTE! Since percent response data do not generally meet the assumption of constant variance, we might consider a variance-stabilizing transformation, i.e., the arcsine square root of the proportion. However, the range of the percent data is quite limited (it goes from the high 70s through the 90s), so these data seem fairly homogeneous.
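For reference, here is a minimal sketch of the arcsine square root transformation mentioned in the note. The function name `arcsine_sqrt` is ours, chosen only for illustration, and the result is in radians.

```python
import math

def arcsine_sqrt(percent):
    """Arcsine square root transform of a percentage in [0, 100] (result in radians)."""
    p = percent / 100.0                      # convert the percentage to a proportion
    if not 0.0 <= p <= 1.0:
        raise ValueError("percent must be between 0 and 100")
    return math.asin(math.sqrt(p))

# Smallest and largest yields in Table 4-3
low, high = arcsine_sqrt(78.9), arcsine_sqrt(98.2)
print(round(low, 3), round(high, 3))
```

If you did decide to transform, you would apply this to every yield and then run the same RCBD analysis on the transformed values.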

Output...

Response: Yield

ANOVA for selected Factorial Model

Analysis of variance table [Partial sum of squares]

Source	Sum of Squares	DF	Mean Square	F Value	Prob > F
Block	192.25	5	38.45
Model	178.17	3	59.39	8.11	0.0019
A	178.17	3	59.39	8.11	0.0019
Residual	109.89	15	7.33
Cor Total	480.31	23

Std. Dev.	2.71	R-Squared	0.6185
Mean	89.80	Adj R-Squared	0.5422
C.V.	3.01	Pred R-Squared	0.0234
PRESS	281.31	Adeq Precision	9.759

Notice that Design Expert does not perform the hypothesis test on the block factor. Should we test the block factor?

Below is the Minitab output which treats both batch and treatment the same and tests the hypothesis of no effect.

ANOVA: Yield versus Batch, Pressure

Factor	Type	Level	Values
Batch	random	6	1,2,3,4,5,6
Pressure	fixed	4	8500, 8700, 8900, 9100

Analysis of Variance for Yield

Source	DF	SS	MS	F	P
Batch	5	192.252	38.450	5.25	0.006
Pressure	3	178.171	59.390	8.11	0.002
Error	15	109.886	7.326
Total	23	480.310

S = 2.70661

R-Sq = 77.12%

R-Sq(adj) = 64.92%

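This example shows the output from the ANOVA command in Minitab (Menu > Stat > ANOVA > Balanced ANOVA). It does hypothesis tests for both batch and pressure, and they are both significant. Otherwise, the results from both programs are very similar: the sums of squares, mean squares, and the F test for pressure agree. The R-squared values differ (0.6185 versus 77.12%) because the Design Expert value matches 178.17 / (178.17 + 109.89), which leaves the block sum of squares out, while the Minitab value matches (192.252 + 178.171) / 480.310, which counts the blocks as part of the model.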
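If you want to see where the numbers in these tables come from, the sums of squares can be reproduced directly from Table 4-3. This is only a sketch using NumPy (our choice, not part of the course software); the variable names are ours, and p-values are left out to keep it short.

```python
import numpy as np

# Rows: pressures 8500, 8700, 8900, 9100; columns: batches 1-6 (Table 4-3)
y = np.array([
    [90.3, 89.2, 98.2, 93.9, 87.4, 97.9],
    [92.5, 89.5, 90.6, 94.7, 87.0, 95.8],
    [85.5, 90.8, 89.6, 86.2, 88.0, 93.4],
    [82.5, 89.5, 85.6, 87.4, 78.9, 90.7],
])
a, b = y.shape                     # a = 4 treatments, b = 6 blocks
grand = y.mean()

ss_total = ((y - grand) ** 2).sum()
ss_treat = b * ((y.mean(axis=1) - grand) ** 2).sum()   # pressure
ss_block = a * ((y.mean(axis=0) - grand) ** 2).sum()   # batch
ss_error = ss_total - ss_treat - ss_block

df_treat, df_block = a - 1, b - 1
df_error = df_treat * df_block                          # (a-1)(b-1) = 15
ms_error = ss_error / df_error
f_treat = (ss_treat / df_treat) / ms_error
f_block = (ss_block / df_block) / ms_error

print(f"Batch    SS={ss_block:.3f}  F={f_block:.2f}")
print(f"Pressure SS={ss_treat:.3f}  F={f_treat:.2f}")
print(f"Error    SS={ss_error:.3f}  MS={ms_error:.3f}")
```

The two F ratios use the same error mean square in the denominator, which is why Minitab can report a test for batch even though Design Expert chooses not to.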

Again, should we test the block factor? Generally, the answer is no, but in some instances, this might be helpful. We use the RCBD because we hope to remove from error the variation due to the block. If the block factor is not significant, then the mean square due to blocks is no greater than the mean square due to error. In other words, if the block F ratio is close to 1 (or generally not greater than 2), you have wasted effort in doing the experiment as a block design. In this case you would have used 5 degrees of freedom that could have been part of the error degrees of freedom, so the design could actually be less efficient! Here the block F ratio is 5.25, so blocking was worthwhile.

Therefore, one can test the block simply to confirm that the block factor is effective and explains variation that would otherwise be part of your experimental error. However, you generally cannot make any stronger conclusions from the test on a block factor, because you may not have randomly selected the blocks from any population, nor randomly assigned the levels.

Why did I first say no?

There are two cases we should consider separately: blocks that are 1) a classification factor and 2) an experimental factor.

In this example the block is a batch, which is a classification factor; subjects or plots of land are classification factors as well. For an RCBD you can apply your experiment to convenient subjects. In general, to make inferences about a classification factor you should sample from the population of interest, and these observed batches are not necessarily a sample from any population. If you want to make inferences about a factor, there should be an appropriate randomization, i.e., random selection of its levels.

In the case of experimental factors, such as oven temperature for a process, all you want is a representative set of temperatures such that the treatment is given under homogeneous conditions. The point is that we set the temperature once in each block; we don't reset it for each observation. So, there is no replication of the block factor. We do our randomization of treatments within a block, so there is an asymmetry between treatment and block factors.

In summary, you are only including the block factor to reduce the error variation due to this nuisance factor, not to test the effect of this factor.


The residual analysis for the Vascular Graft example is shown:

plot

The pattern does not strike me as indicating an unequal variance.

Another way to look at these residuals is to plot the residuals against the two factors. Notice that pressure is the treatment factor and batch is the block factor. Here we'll check for homogeneous variance. Against treatment these look quite homogeneous.

plot

Plotted against block, the sixth batch does raise one's eyebrows a bit: its residuals all seem to be very close to zero.

plot

The basic residual plots indicate that the normality and constant variance assumptions are satisfied, and there seem to be no obvious problems with randomization. The plots against the factors provide more information about the constant variance assumption and can reveal possible outliers. The plot of residuals versus order sometimes indicates a problem with the independence assumption.
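The residuals behind these plots are easy to compute yourself. The sketch below assumes the additive two-way model (treatment effect plus block effect, no interaction) and uses NumPy; the variable names are ours. It also shows numerically why the sixth batch stands out.

```python
import numpy as np

# Rows: pressures 8500, 8700, 8900, 9100; columns: batches 1-6 (Table 4-3)
y = np.array([
    [90.3, 89.2, 98.2, 93.9, 87.4, 97.9],
    [92.5, 89.5, 90.6, 94.7, 87.0, 95.8],
    [85.5, 90.8, 89.6, 86.2, 88.0, 93.4],
    [82.5, 89.5, 85.6, 87.4, 78.9, 90.7],
])

# Fitted value for cell (i, j): treatment mean + block mean - grand mean
fitted = y.mean(axis=1, keepdims=True) + y.mean(axis=0, keepdims=True) - y.mean()
resid = y - fitted

# Largest absolute residual within each batch and within each pressure
max_abs_by_batch = np.abs(resid).max(axis=0)
max_abs_by_pressure = np.abs(resid).max(axis=1)

print("By batch:   ", np.round(max_abs_by_batch, 2))
print("By pressure:", np.round(max_abs_by_pressure, 2))
```

Plotting `resid` against `fitted`, against the pressure index, and against the batch index would give rough versions of the plots discussed above.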

Missing Data

In the example dataset above, what if the data point 94.7 (second treatment, fourth block) was missing? What data point can I substitute for the missing point?

If this point is missing, we can substitute x, calculate the residual sum of squares as a function of x, and solve for the x which minimizes the error. This gives us a point based on all the other data and the two-way model. We sometimes call this an imputed point, where you use the least squares approach to estimate the missing data point.

After calculating x, you could substitute the estimated data point and repeat your analysis. Now you have an artificial point whose residual is known to be zero. You can analyze the resulting data, but you should reduce your error degrees of freedom by one. In any event, these are all approximate methods, i.e., using the best fitting or imputed point.
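Here is a rough sketch of that calculation for the 94.7 data point, using NumPy. Because the error sum of squares of the two-way fit is exactly quadratic in x, a parabola through any three trial values recovers it, and its vertex is the minimizing x. The helper name `sse_with` and the three trial values are our own choices.

```python
import numpy as np

# Rows: pressures 8500, 8700, 8900, 9100; columns: batches 1-6 (Table 4-3)
y = np.array([
    [90.3, 89.2, 98.2, 93.9, 87.4, 97.9],
    [92.5, 89.5, 90.6, 94.7, 87.0, 95.8],
    [85.5, 90.8, 89.6, 86.2, 88.0, 93.4],
    [82.5, 89.5, 85.6, 87.4, 78.9, 90.7],
])
i, j = 1, 3   # 0-based position of the "missing" cell: pressure 8700, batch 4

def sse_with(x):
    """Error SS of the additive two-way fit when x is placed in the missing cell."""
    z = y.copy()
    z[i, j] = x
    fitted = z.mean(axis=1, keepdims=True) + z.mean(axis=0, keepdims=True) - z.mean()
    return ((z - fitted) ** 2).sum()

# Fit the quadratic SSE(x) = c2*x^2 + c1*x + c0 through three trial values
xs = np.array([80.0, 90.0, 100.0])
c2, c1, c0 = np.polyfit(xs, [sse_with(x) for x in xs], 2)
x_hat = -c1 / (2 * c2)            # vertex of the parabola = minimizing x
sse_min = sse_with(x_hat)

df_error = (y.shape[0] - 1) * (y.shape[1] - 1) - 1   # one fewer than 15
print(f"imputed x = {x_hat:.2f}, SSE = {sse_min:.3f}, error df = {df_error}")
```

The imputed value comes out near 91.08, noticeably lower than the 94.7 that was actually observed, which is a reminder that the imputed point carries no information of its own.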

Before high-speed computing, data imputation was often done because the ANOVA computations are more readily done using a balanced design. There are times where imputation is still helpful but in the case of a two-way or multiway ANOVA we generally will use the General Linear Model (GLM) and use the full and reduced model approach to do the appropriate test. This is often called the General Linear Test (GLT).


The sum of squares you want to use to test your hypothesis will be based on the adjusted treatment sum of squares, \(R( \tau_i | \mu, \beta_j) \) using the notation for testing:

\(H_0 \colon \tau_i = 0\)

The numerator of the F-test for the hypothesis you want to test should be based on the adjusted sum of squares, i.e., the sequential sum of squares obtained when that term is entered last in the sequence. That will be very close to what you would get using the approximate method we mentioned earlier. The general linear test is the most powerful test for this type of situation with unbalanced data.

The General Linear Test can be used to test for significance of multiple parameters of the model at the same time. Generally, the significance of all those parameters which are in the Full model but are not included in the Reduced model are tested, simultaneously. The F test statistic is defined as

\(F^\ast=\dfrac{SSE(R)-SSE(F)}{df_R-df_F}\div \dfrac{SSE(F)}{df_F}\)

where F stands for “Full” and R stands for “Reduced.” The numerator and denominator degrees of freedom for the F statistic are \(df_R - df_F\) and \(df_F\), respectively.
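To make the formula concrete, here is a sketch of the general linear test for pressure using the 23 observations that remain when the 94.7 data point (pressure 8700, batch 4) is treated as missing. It fits the full and reduced models by least squares with NumPy. The helper names `dummies` and `sse_and_df`, and the reference-level dummy coding, are our own choices for illustration and not Minitab's internals.

```python
import numpy as np

# Rows: pressures 8500, 8700, 8900, 9100; columns: batches 1-6 (Table 4-3)
y = np.array([
    [90.3, 89.2, 98.2, 93.9, 87.4, 97.9],
    [92.5, 89.5, 90.6, 94.7, 87.0, 95.8],
    [85.5, 90.8, 89.6, 86.2, 88.0, 93.4],
    [82.5, 89.5, 85.6, 87.4, 78.9, 90.7],
])
pressure, batch = np.indices(y.shape)      # 0-based level codes for each cell
keep = np.ones(y.shape, dtype=bool)
keep[1, 3] = False                         # drop 94.7 (pressure 8700, batch 4)
yv, p, b = y[keep], pressure[keep], batch[keep]

def dummies(codes, n_levels):
    """Indicator columns for levels 1..n_levels-1 (level 0 is the reference)."""
    return (codes[:, None] == np.arange(1, n_levels)).astype(float)

def sse_and_df(X, yv):
    """Residual sum of squares and residual df of a least squares fit."""
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    resid = yv - X @ beta
    return resid @ resid, len(yv) - np.linalg.matrix_rank(X)

ones = np.ones((len(yv), 1))
X_full = np.hstack([ones, dummies(b, 6), dummies(p, 4)])   # mu + batch + pressure
X_red = np.hstack([ones, dummies(b, 6)])                   # mu + batch only

sse_f, df_f = sse_and_df(X_full, yv)
sse_r, df_r = sse_and_df(X_red, yv)
f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

print(f"SSE(F)={sse_f:.3f} df_F={df_f}  SSE(R)={sse_r:.3f} df_R={df_r}  F*={f_star:.2f}")
```

The difference SSE(R) - SSE(F) is the adjusted sum of squares for pressure, \(R( \tau_i | \mu, \beta_j) \), so this is the same test that GLM reports for pressure.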

Here are the results for the GLM with all the data intact. There are 23 degrees of freedom total here so this is based on the full set of 24 observations.

General Linear Model: Yield versus Batch, Pressure

Factor	Type	Levels	Values
Batch	fixed	6	1, 2, 3, 4, 5, 6
Pressure	fixed	4	8500, 8700, 8900, 9100

Analysis of variance for Yield, using Adjusted SS for Tests

Source	DF	Seq SS	Adj SS	Adj MS	F	P
Batch	5	192.252	192.252	38.450	5.25	0.006
Pressure	3	178.171	178.171	59.390	8.11	0.002
Error	15	109.886	109.886	7.326
Total	23	480.310

S = 2.70661

R-Sq = 77.12%

R-Sq(adj) =64.92%

Least Squares Means for Yield

Pressure	Mean	SE Mean
8500	92.82	1.105
8700	91.68	1.105
8900	88.92	1.105
9100	85.77	1.105
Main Effects Plot (fitted means) for Yield

When the data are complete, this analysis from GLM is correct and equivalent to the results from the two-way command in Minitab. When you have missing data, the raw marginal means can be misleading. What if the missing data point were from a very high measuring block? The raw mean for that treatment would be pulled down, and the estimated treatment mean would be biased.

Above you have the least squares means that correspond exactly to the simple means from the earlier analysis.

We now illustrate the GLM analysis for the missing data situation, with one observation missing (the data point for Batch 4 at pressure 8700 removed). As you can see below, the least squares mean for pressure 8700 is slightly different. Also notice that the standard error of the mean (SE Mean) for the second treatment is slightly larger, because you do not have as many data points on that particular treatment. The fact that you are missing a point is also reflected in the estimate of error: the error degrees of freedom drop from 15 to 14.

Results for: Ex4-1miss.MTW

General Linear Model: Yield versus Batch, Pressure

Factor	Type	Levels	Values
Batch	fixed	6	1, 2, 3, 4, 5, 6
Pressure	fixed	4	8500, 8700, 8900, 9100

Analysis of variance for Yield, using Adjusted SS for Tests

Source	DF	Seq SS	Adj SS	Adj MS	F	P
Batch	5	190.119	189.522	37.904	5.22	0.007
Pressure	3	163.398	163.398	54.466	7.50	0.003
Error	14	101.696	101.696	7.264
Total	22	455.213

S = 2.69518

R-Sq = 77.66%

R-Sq(adj) =64.99%

Least Squares Means for Yield

Pressure	Mean	SE Mean
8500	92.82	1.100
8700	91.08	1.238
8900	88.92	1.100
9100	85.77	1.100

The overall results are similar. We have only lost one point and our hypothesis test is still significant, with a p-value of 0.003 rather than 0.002.
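The least squares means and their standard errors for the missing data case can be sketched from the same least squares fit. Each least squares mean is the fitted value for a pressure averaged over all six batches with equal weight, and its standard error comes from the error mean square and \((X'X)^{-1}\). As before, this uses NumPy with reference-level dummy coding, and the names `dummies`, `ls_means`, and `se_means` are ours.

```python
import numpy as np

# Rows: pressures 8500, 8700, 8900, 9100; columns: batches 1-6 (Table 4-3)
y = np.array([
    [90.3, 89.2, 98.2, 93.9, 87.4, 97.9],
    [92.5, 89.5, 90.6, 94.7, 87.0, 95.8],
    [85.5, 90.8, 89.6, 86.2, 88.0, 93.4],
    [82.5, 89.5, 85.6, 87.4, 78.9, 90.7],
])
pressure, batch = np.indices(y.shape)
keep = np.ones(y.shape, dtype=bool)
keep[1, 3] = False                         # drop 94.7 (pressure 8700, batch 4)
yv, p, b = y[keep], pressure[keep], batch[keep]

def dummies(codes, n_levels):
    """Indicator columns for levels 1..n_levels-1 (level 0 is the reference)."""
    return (codes[:, None] == np.arange(1, n_levels)).astype(float)

# Columns: intercept, 5 batch dummies, 3 pressure dummies
X = np.hstack([np.ones((len(yv), 1)), dummies(b, 6), dummies(p, 4)])
beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
resid = yv - X @ beta
df_error = len(yv) - X.shape[1]            # 23 - 9 = 14
mse = resid @ resid / df_error
xtx_inv = np.linalg.inv(X.T @ X)

ls_means, se_means = [], []
for level in range(4):
    # Contrast vector: intercept, each batch dummy weighted 1/6, this pressure's dummy
    c = np.concatenate([[1.0], np.full(5, 1 / 6), np.arange(1, 4) == level])
    ls_means.append(c @ beta)
    se_means.append(np.sqrt(mse * (c @ xtx_inv @ c)))

for lab, m, s in zip([8500, 8700, 8900, 9100], ls_means, se_means):
    print(f"{lab}  mean={m:.2f}  SE={s:.3f}")
```

Only the pressure with the missing observation has an inflated standard error; the other three are still based on six observations each.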

Here is a plot of the least squares means for Yield with all of the observations included.

plot

Here is a plot of the least squares means for Yield with the missing data. It is not very different.

plot

Again, for any unbalanced data situation, we will use the GLM. For most of our examples, GLM will be a useful tool for analyzing and getting the analysis of variance summary table. If you are unsure whether your data are orthogonal, or whether you simply made a mistake in entering your data, one way to check is to see whether the sequential sums of squares agree with the adjusted sums of squares; for balanced (orthogonal) data they agree.
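As a sketch of that check, the code below computes the sequential sum of squares for batch (batch entered first) and the adjusted sum of squares for batch (batch entered last), once for the complete data and once with the 94.7 data point removed. It uses NumPy least squares fits; the helper names `dummies`, `sse`, and `batch_ss` are ours.

```python
import numpy as np

# Rows: pressures 8500, 8700, 8900, 9100; columns: batches 1-6 (Table 4-3)
y = np.array([
    [90.3, 89.2, 98.2, 93.9, 87.4, 97.9],
    [92.5, 89.5, 90.6, 94.7, 87.0, 95.8],
    [85.5, 90.8, 89.6, 86.2, 88.0, 93.4],
    [82.5, 89.5, 85.6, 87.4, 78.9, 90.7],
])
pressure, batch = np.indices(y.shape)

def dummies(codes, n_levels):
    """Indicator columns for levels 1..n_levels-1 (level 0 is the reference)."""
    return (codes[:, None] == np.arange(1, n_levels)).astype(float)

def sse(yv, *columns):
    """Residual sum of squares after regressing yv on the given column blocks."""
    X = np.hstack(columns)
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    resid = yv - X @ beta
    return resid @ resid

def batch_ss(keep):
    """Sequential (batch first) and adjusted (batch last) SS for batch."""
    yv = y[keep]
    ones = np.ones((len(yv), 1))
    B = dummies(batch[keep], 6)
    P = dummies(pressure[keep], 4)
    seq = sse(yv, ones) - sse(yv, ones, B)
    adj = sse(yv, ones, P) - sse(yv, ones, B, P)
    return seq, adj

complete = np.ones(y.shape, dtype=bool)
missing = complete.copy()
missing[1, 3] = False                      # drop 94.7 (pressure 8700, batch 4)

seq_c, adj_c = batch_ss(complete)
seq_m, adj_m = batch_ss(missing)
print(f"complete: Seq SS={seq_c:.3f}  Adj SS={adj_c:.3f}")
print(f"missing:  Seq SS={seq_m:.3f}  Adj SS={adj_m:.3f}")
```

With the complete data the two agree, and with one point missing they do not, which matches the Seq SS and Adj SS columns for Batch in the two GLM tables above.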
