Lesson 13: One-Factor Analysis of Variance
We previously learned how to compare two population means using either the pooled two-sample t-test or Welch's t-test. What happens if we want to compare more than two means? In this lesson, we'll learn how to do just that. More specifically, we'll learn how to use the analysis of variance method to compare the equality of the (unknown) means \(\mu_1 , \mu_2 , \dots, \mu_m\) of m normal distributions with an unknown but common variance \(\sigma^2\). Take specific note of that last part... "an unknown but common variance \(\sigma^2\)." That is, the analysis of variance method assumes that the population variances are equal. In that regard, the analysis of variance method can be thought of as an extension of the pooled two-sample t-test.
13.1  The Basic Idea
We could take a top-down approach by first presenting the theory of analysis of variance and then following it up with an example. We're not going to do it that way, though. We're going to take a bottom-up approach, in which we first develop the idea behind the analysis of variance on this page, and then present the results on the next page. Only after we've completed those two steps will we take a step back and look at the theory behind analysis of variance. That said, let's start with our first example of the lesson.
Example 13-1
A researcher for an automobile safety institute was interested in determining whether or not the distance that it takes to stop a car going 60 miles per hour depends on the brand of the tire. The researcher measured the stopping distance (in feet) of ten randomly selected cars for each of five different brands. So that he and his assistants would remain blinded, the researcher arbitrarily labeled the brands of the tires as Brand1, Brand2, Brand3, Brand4, and Brand5. Here are the data resulting from his experiment:
| Brand1 | Brand2 | Brand3 | Brand4 | Brand5 |
| ------ | ------ | ------ | ------ | ------ |
| 194 | 189 | 185 | 183 | 195 |
| 184 | 204 | 183 | 193 | 197 |
| 189 | 190 | 186 | 184 | 194 |
| 189 | 190 | 183 | 186 | 202 |
| 188 | 189 | 179 | 194 | 200 |
| 186 | 207 | 191 | 199 | 211 |
| 195 | 203 | 188 | 196 | 203 |
| 186 | 193 | 196 | 188 | 206 |
| 183 | 181 | 189 | 193 | 202 |
| 188 | 206 | 194 | 196 | 195 |
Do the data provide enough evidence to conclude that at least one of the brands is different from the others with respect to stopping distance?
Answer
The first thing we might want to do is to create some sort of summary plot of the data. Here is a box plot of the data:
Hmmm. It appears that the box plots for Brand1 and Brand5 have very little, if any, overlap at all. The same can be said for Brand3 and Brand5. Here are some summary statistics of the data:
| Brand | N | MEAN | SD |
| ----- | - | ---- | -- |
| 1 | 10 | 188.20 | 3.88 |
| 2 | 10 | 195.20 | 9.02 |
| 3 | 10 | 187.40 | 5.27 |
| 4 | 10 | 191.20 | 5.55 |
| 5 | 10 | 200.50 | 5.44 |
It appears that the sample means differ quite a bit. For example, the average stopping distance of Brand3 is 187.4 feet (with a standard deviation of 5.27 feet), while the average stopping distance of Brand5 is 200.5 feet (with a standard deviation of 5.44 feet). A difference of 13 feet could mean the difference between getting into an accident or not. But, of course, we can't draw conclusions about the performance of the brands based on one sample. After all, a different random sample of cars could yield different results. Instead, we need to use the sample means to try to draw conclusions about the population means.
More specifically, the researcher needs to test the null hypothesis that the group population means are all the same against the alternative that at least one group population mean differs from the others. That is, the researcher needs to test this null hypothesis:
\(H_0 \colon \mu_1=\mu_2=\mu_3=\mu_4=\mu_5\)
against this alternative hypothesis:
\(H_A \colon \) at least one of the \(\mu_i\) differs from the others
In this lesson, we are going to learn how to use a method called analysis of variance to answer the researcher's question. Jumping right to the punch line, with no development or theoretical justification whatsoever, we'll use an analysis of variance table, such as this one:
Analysis of Variance for comparing all 5 brands

| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Brand | 4 | 1174.8 | 293.7 | 7.95 | 0.000 |
| Error | 45 | 1661.7 | 36.9 | | |
| Total | 49 | 2836.5 | | | |
to draw conclusions about the equality of two or more population means. And, as we always do when performing hypothesis tests, we'll compare the P-value to \(\alpha\), our desired willingness to commit a Type I error. In this case, the researcher's P-value is very small (0.000, to three decimal places), so he should reject his null hypothesis. That is, there is sufficient evidence, at even a 0.01 level, to conclude that the mean stopping distance for at least one brand of tire is different from the mean stopping distances of the others.
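The arithmetic behind this table can be checked by hand. As a sketch (not part of the original lesson), here is the tire-study ANOVA computed in plain Python from the sums-of-squares formulas that are developed later in this lesson:

```python
# Hypothetical recomputation of the tire-study ANOVA table in plain Python.
brands = {
    "Brand1": [194, 184, 189, 189, 188, 186, 195, 186, 183, 188],
    "Brand2": [189, 204, 190, 190, 189, 207, 203, 193, 181, 206],
    "Brand3": [185, 183, 186, 183, 179, 191, 188, 196, 189, 194],
    "Brand4": [183, 193, 184, 186, 194, 199, 196, 188, 193, 196],
    "Brand5": [195, 197, 194, 202, 200, 211, 203, 206, 202, 195],
}

n = sum(len(x) for x in brands.values())          # 50 cars in total
m = len(brands)                                   # 5 brands
grand_mean = sum(sum(x) for x in brands.values()) / n

# SS(Between): squared distances of the group means from the grand mean
ss_between = sum(len(x) * (sum(x) / len(x) - grand_mean) ** 2
                 for x in brands.values())
# SS(Error): squared distances of the observations from their group means
ss_error = sum(sum((xij - sum(x) / len(x)) ** 2 for xij in x)
               for x in brands.values())

msb = ss_between / (m - 1)                        # mean square between
mse = ss_error / (n - m)                          # mean square error
f_stat = msb / mse

print(round(ss_between, 1), round(ss_error, 1), round(f_stat, 2))
# → 1174.8 1661.7 7.95
```

The between and error sums of squares (1174.8 and 1661.7) and the F-statistic (7.95) match the table above; filling in the P-value column would additionally require the F(4, 45) distribution.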
So far, we have seen a typical null and alternative hypothesis in the analysis of variance framework, as well as an analysis of variance table. Let's take a look at another example with the idea of continuing to work on developing the basic idea behind the analysis of variance method.
Example 13-2
Suppose an education researcher is interested in determining whether a learning method affects students' exam scores. Specifically, suppose she considers these three methods:
- standard
- osmosis
- shock therapy
Suppose she convinces 15 students to take part in her study, so she randomly assigns 5 students to each method. Then, after waiting eight weeks, she tests the students to get exam scores.
What would the researcher's data have to look like to be able to conclude that at least one of the methods yields different exam scores than the others?
Answer
Suppose a dot plot of the researcher's data looked like this:
What would we want to conclude? Well, there's a lot of separation in the data between the three methods. In this case, there is little variation in the data within each method, but a lot of variation in the data across the three methods. For these data, we would probably be willing to conclude that there is a difference between the three methods.
Now, suppose instead that a dot plot of the researcher's data looked like this:
What would we want to conclude? Well, there's less separation in the data between the three methods. In this case, there is a lot of variation in the data within each method, and still some variation in the data across the three methods, but not as much as in the previous dot plot. For these data, it is not as obvious that we can conclude that there is a difference between the three methods.
Let's consider one more possible dot plot:
What would we want to conclude here? Well, there's even less separation in the data between the three methods. In this case, there is a great deal of variation in the data within each method, and not much variation at all in the data across the three methods. For these data, we would probably want to conclude that there is no difference between the three methods.
If you go back and look at the three possible data sets, you'll see that we drew our conclusions by comparing the variation in the data within a method to the variation in the data across methods. Let's try to formalize that idea a bit more by revisiting the two most extreme examples. First, the example in which we concluded that the methods differ:
Let's quantify (or are we still just qualifying?) the amount of variation within a method by comparing the five data points within a method to the method's mean, as represented in the plot as a color-coded triangle. And, let's quantify (or qualify?) the amount of variation across the methods by comparing the method means, again represented in the plot as a color-coded triangle, to the overall grand mean, that is, the average of all fifteen data points (ignoring the method). In this case, the variation between the group means and the grand mean is larger than the variation within the groups.
Now, let's revisit the example in which we wanted to conclude that there was no difference in the three methods:
In this case, the variation between the group means and the grand mean is smaller than the variation within the groups.
Hmmm... these two examples suggest that our method should compare the variation between the groups to that of the variation within the groups. That's just what an analysis of variance does!
Let's see what conclusion we draw from an analysis of variance of these data. Here's the analysis of variance table for the first study, in which we wanted to conclude that there was a difference in the three methods:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 2510.5 | 1255.3 | 93.44 | 0.000 |
| Error | 12 | 161.2 | 13.4 | | |
| Total | 14 | 2671.7 | | | |
In this case, the P-value is small (0.000, to three decimal places). We can reject the null hypothesis of equal means at the 0.05 level. That is, there is sufficient evidence at the 0.05 level to conclude that the mean exam scores of the three study methods are significantly different.
Here's the analysis of variance table for the third study, in which we wanted to conclude that there was no difference in the three methods:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 80.1 | 40.1 | 0.46 | 0.643 |
| Error | 12 | 1050.8 | 87.6 | | |
| Total | 14 | 1130.9 | | | |
In this case, the P-value, 0.643, is large. We fail to reject the null hypothesis of equal means at the 0.05 level. That is, there is insufficient evidence at the 0.05 level to conclude that the mean exam scores of the three study methods are significantly different.
Hmmm. It seems like we're on to something! Let's summarize.
The Basic Idea Behind Analysis of Variance
Analysis of variance involves dividing the overall variability in observed data values so that we can draw conclusions about the equality, or lack thereof, of the means of the populations from which the data came. The overall (or "total") variability is divided into two components:

- the variability "between" groups
- the variability "within" groups
We summarize the division of the variability in an "analysis of variance table", which is often shortened and called an "ANOVA table." Without knowing what we were really looking at, we looked at a few examples of ANOVA tables here on this page. Let's now take an in-depth look at the content of ANOVA tables.
13.2  The ANOVA Table
For the sake of concreteness here, let's recall one of the analysis of variance tables from the previous page:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 2510.5 | 1255.3 | 93.44 | 0.000 |
| Error | 12 | 161.2 | 13.4 | | |
| Total | 14 | 2671.7 | | | |
In working to digest what is all contained in an ANOVA table, let's start with the column headings:
- Source means "the source of the variation in the data." As we'll soon see, the possible choices for a one-factor study, such as the learning study, are Factor, Error, and Total. The factor is the characteristic that defines the populations being compared. In the tire study, the factor is the brand of tire. In the learning study, the factor is the learning method.
- DF means "the degrees of freedom in the source."
- SS means "the sum of squares due to the source."
- MS means "the mean sum of squares due to the source."
- F means "the F-statistic."
- P means "the P-value."
Now, let's consider the row headings:
- Factor means "the variability due to the factor of interest." In the tire example on the previous page, the factor was the brand of the tire. In the learning example on the previous page, the factor was the method of learning. Sometimes, the factor is a treatment, and therefore the row heading is instead labeled as Treatment. And, sometimes the row heading is labeled as Between to make it clear that the row concerns the variation between the groups.
- Error means "the variability within the groups" or "unexplained random error." Sometimes, the row heading is labeled as Within to make it clear that the row concerns the variation within the groups.
- Total means "the total variation in the data from the grand mean" (that is, ignoring the factor of interest).
With the column headings and row headings now defined, let's take a look at the individual entries inside a general one-factor ANOVA table:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | m−1 | SS(Between) | MSB | MSB/MSE | \(P(F(m-1,n-m)\geq F)\) |
| Error | n−m | SS(Error) | MSE | | |
| Total | n−1 | SS(Total) | | | |
Yikes, that looks overwhelming! Let's work our way through it entry by entry to see if we can make it all clear. Let's start with the degrees of freedom (DF) column:
- If there are n total data points collected, then there are n−1 total degrees of freedom.
- If there are m groups being compared, then there are m−1 degrees of freedom associated with the factor of interest.
- If there are n total data points collected and m groups being compared, then there are n−m error degrees of freedom.
Now, the sums of squares (SS) column:
- As we'll soon formalize below, SS(Between) is the sum of squares between the group means and the grand mean. As the name suggests, it quantifies the variability between the groups of interest.
- Again, as we'll formalize below, SS(Error) is the sum of squares between the data and the group means. It quantifies the variability within the groups of interest.
- SS(Total) is the sum of squares between the n data points and the grand mean. As the name suggests, it quantifies the total variability in the observed data. We'll soon see that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is:
SS(Total) = SS(Between) + SS(Error)
The mean squares (MS) column, as the name suggests, contains the "average" sum of squares for the Factor and the Error:
- The Mean Sum of Squares between the groups, denoted MSB, is calculated by dividing the Sum of Squares between the groups by the between group degrees of freedom. That is, MSB = SS(Between)/(m−1).
- The Error Mean Sum of Squares, denoted MSE, is calculated by dividing the Sum of Squares within the groups by the error degrees of freedom. That is, MSE = SS(Error)/(n−m).
The F column, not surprisingly, contains the F-statistic. Because we want to compare the "average" variability between the groups to the "average" variability within the groups, we take the ratio of the Between Mean Sum of Squares to the Error Mean Sum of Squares. That is, the F-statistic is calculated as F = MSB/MSE.
When, on the next page, we delve into the theory behind the analysis of variance method, we'll see that the F-statistic follows an F-distribution with m−1 numerator degrees of freedom and n−m denominator degrees of freedom. Therefore, we'll calculate the P-value, as it appears in the column labeled P, by comparing the F-statistic to an F-distribution with m−1 numerator degrees of freedom and n−m denominator degrees of freedom.
Now, having defined the individual entries of a general ANOVA table, let's revisit and, in the process, dissect the ANOVA table for the first learning study on the previous page, in which n = 15 students were subjected to one of m = 3 methods of learning:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 2510.5 | 1255.3 | 93.44 | 0.000 |
| Error | 12 | 161.2 | 13.4 | | |
| Total | 14 | 2671.7 | | | |
- Because n = 15, there are n−1 = 15−1 = 14 total degrees of freedom.
- Because m = 3, there are m−1 = 3−1 = 2 degrees of freedom associated with the factor.
- The degrees of freedom add up, so we can get the error degrees of freedom by subtracting the degrees of freedom associated with the factor from the total degrees of freedom. That is, the error degrees of freedom is 14−2 = 12. Alternatively, we can calculate the error degrees of freedom directly from n−m = 15−3 = 12.
- We'll learn how to calculate the sum of squares in a minute. For now, take note that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is: 2671.7 = 2510.5 + 161.2
- MSB is SS(Between) divided by the between group degrees of freedom. That is, 1255.3 = 2510.5 ÷ 2.
- MSE is SS(Error) divided by the error degrees of freedom. That is, 13.4 = 161.2 ÷ 12.
- The F-statistic is the ratio of MSB to MSE. That is, F = 1255.3 ÷ 13.4 = 93.44.
- The P-value is P(F(2,12) ≥ 93.44) < 0.001.
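To see the same bookkeeping in code, here is a hypothetical helper (a plain-Python sketch, not from the lesson) that fills in the DF, SS, MS, and F entries of a one-factor ANOVA table; the scores are the learning-study data listed in full in the Using Minitab section at the end of this lesson:

```python
# A sketch of a one-factor ANOVA table builder in plain Python.
def one_way_anova(groups):
    n = sum(len(g) for g in groups)
    m = len(groups)
    grand = sum(sum(g) for g in groups) / n
    # SS(Between) and SS(Error), straight from their definitions
    ss_t = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_e = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    msb, mse = ss_t / (m - 1), ss_e / (n - m)
    return {"df": (m - 1, n - m, n - 1),         # factor, error, total
            "ss": (ss_t, ss_e, ss_t + ss_e),
            "ms": (msb, mse),
            "F": msb / mse}

scores = [[51, 45, 40, 41, 41],    # standard
          [58, 68, 64, 63, 62],    # osmosis
          [77, 72, 78, 73, 75]]    # shock therapy
table = one_way_anova(scores)
print(table["df"], round(table["F"], 2))   # → (2, 12, 14) 93.44
```

The returned entries — degrees of freedom (2, 12, 14), sums of squares 2510.5 and 161.2, and F = 93.44 — agree with the table dissected above.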
Okay, we slowly, but surely, keep on adding bit by bit to our knowledge of an analysis of variance table. Let's now work a bit on the sums of squares.
The Sums of Squares
In essence, we now know that we want to break down the TOTAL variation in the data into two components:
- a component that is due to the TREATMENT (or FACTOR), and
- a component that is due to just RANDOM ERROR.
Let's see what kind of formulas we can come up with for quantifying these components. But first, as always, we need to define some notation. Let's represent our data, the group means, and the grand mean as follows:
| Group | Data | Means |
| ----- | ---- | ----- |
| 1 | \(X_{11}, X_{12}, \ldots, X_{1n_1}\) | \(\bar{X}_{1.}\) |
| 2 | \(X_{21}, X_{22}, \ldots, X_{2n_2}\) | \(\bar{X}_{2.}\) |
| ⋮ | ⋮ | ⋮ |
| \(m\) | \(X_{m1}, X_{m2}, \ldots, X_{mn_m}\) | \(\bar{X}_{m.}\) |
| Grand Mean | | \(\bar{X}_{..}\) |
That is, we'll let:
- m denote the number of groups being compared
- \(X_{ij}\) denote the \(j^{th}\) observation in the \(i^{th}\) group, where \(i = 1, 2, \dots , m\) and \(j = 1, 2, \dots, n_i\). The important thing to note here is that j goes from 1 to \(n_i\), not to \(n\). That is, the number of data points in a group depends on the group i, so the number of data points in each group need not be the same. We could have 5 measurements in one group and 6 measurements in another.
- \(\bar{X}_{i.}=\dfrac{1}{n_i}\sum\limits_{j=1}^{n_i} X_{ij}\) denote the sample mean of the observed data for group i, where \(i = 1, 2, \dots , m\)
- \(\bar{X}_{..}=\dfrac{1}{n}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X_{ij}\) denote the grand mean of all n observed data points
Okay, with the notation now defined, let's first consider the total sum of squares, which we'll denote here as SS(TO). Because we want the total sum of squares to quantify the variation in the data regardless of its source, it makes sense that SS(TO) would be the sum of the squared distances of the observations \(X_{ij}\) to the grand mean \(\bar{X}_{..}\). That is:
\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{..})^2\)
With just a little bit of algebraic work, the total sum of squares can be alternatively calculated as:
\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X^2_{ij}-n\bar{X}_{..}^2\)
Can you do the algebra?
Now, let's consider the treatment sum of squares, which we'll denote SS(T). Because we want the treatment sum of squares to quantify the variation between the treatment groups, it makes sense that SS(T) would be the sum of the squared distances of the treatment means \(\bar{X}_{i.}\) to the grand mean \(\bar{X}_{..}\). That is:
\(SS(T)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (\bar{X}_{i.}-\bar{X}_{..})^2\)
Again, with just a little bit of algebraic work, the treatment sum of squares can be alternatively calculated as:
\(SS(T)=\sum\limits_{i=1}^{m}n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\)
Can you do the algebra?
Finally, let's consider the error sum of squares, which we'll denote SS(E). Because we want the error sum of squares to quantify the variation in the data, not otherwise explained by the treatment, it makes sense that SS(E) would be the sum of the squared distances of the observations \(X_{ij}\) to the treatment means \(\bar{X}_{i.}\). That is:
\(SS(E)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2\)
As we'll see in just a minute, the easiest way to calculate the error sum of squares is to subtract the treatment sum of squares from the total sum of squares. That is:
\(SS(E)=SS(TO)-SS(T)\)
Okay, now, do you remember that part about wanting to break down the total variation SS(TO) into a component due to the treatment SS(T) and a component due to random error SS(E)? Well, some simple algebra leads us to this:
\(SS(TO)=SS(T)+SS(E)\)
which explains the simple way of calculating the error sum of squares. At any rate, here's the simple algebra:
Well, okay, so the proof does involve a little trick of adding 0 in a special way to the total sum of squares:
\(SS(TO) = \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}((X_{ij}\color{red}\overbrace{\color{black}-\bar{X}_{i.})+(\bar{X}_{i.}}^{\text{Add to 0}}\color{black}-\bar{X}_{..}))^{2}\)
Then, squaring the term in parentheses, as well as distributing the summation signs, we get:
\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2+2\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})(\bar{X}_{i.}-\bar{X}_{..})+\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (\bar{X}_{i.}-\bar{X}_{..})^2\)
Now, it's just a matter of recognizing each of the terms:
\(SS(TO)=
\color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}\left(X_{ij}-\bar{X}_{i.}\right)^{2}}^{\text{SS(E)}}
\color{black}+2
\color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}\left(X_{ij}-\bar{X}_{i.}\right)\left(\bar{X}_{i.}-\bar{X}_{..}\right)}^{0}
\color{black}+
\color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}\left(\bar{X}_{i.}-\bar{X}_{..}\right)^{2}}^{\text{SS(T)}}\)

(The middle term is 0 because, within each group, the deviations \(X_{ij}-\bar{X}_{i.}\) sum to 0.)
That is, we've shown that:
\(SS(TO)=SS(T)+SS(E)\)
as was to be proved.
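The identity, and the two shortcut formulas given above, can also be checked numerically. Here is a small sanity check (made-up data, not from the lesson) using groups of unequal sizes:

```python
# Made-up data with unequal group sizes: m = 3 groups, n = 9 observations
data = [[5, 7, 9], [2, 4], [8, 6, 7, 9]]

n = sum(len(g) for g in data)
grand = sum(sum(g) for g in data) / n
means = [sum(g) / len(g) for g in data]

# the three definitional sums of squares
ss_to = sum((x - grand) ** 2 for g in data for x in g)
ss_t = sum(len(g) * (xbar - grand) ** 2 for g, xbar in zip(data, means))
ss_e = sum((x - xbar) ** 2 for g, xbar in zip(data, means) for x in g)
assert abs(ss_to - (ss_t + ss_e)) < 1e-8      # SS(TO) = SS(T) + SS(E)

# the shortcut forms: raw sums of squares minus a grand-mean correction
ss_to_shortcut = sum(x ** 2 for g in data for x in g) - n * grand ** 2
ss_t_shortcut = sum(len(g) * xbar ** 2
                    for g, xbar in zip(data, means)) - n * grand ** 2
assert abs(ss_to - ss_to_shortcut) < 1e-8
assert abs(ss_t - ss_t_shortcut) < 1e-8

print(round(ss_to, 1), round(ss_t, 1), round(ss_e, 1))   # → 44.0 29.0 15.0
```

Up to floating-point rounding, the decomposition and both shortcut formulas agree with the definitional sums.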
13.3  Theoretical Results
So far, in an attempt to understand the analysis of variance method conceptually, we've been waving our hands at the theory behind the method. We can't procrastinate any further... we now need to address some of the theory behind the method. Specifically, we need to address the distribution of the error sum of squares (SSE), the distribution of the treatment sum of squares (SST), and the distribution of the all-important F-statistic.
The Error Sum of Squares (SSE)
Recall that the error sum of squares:
\(SS(E)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2\)
quantifies the error remaining after explaining some of the variation in the observations \(X_{ij}\) by the treatment means. Let's see what we can say about SSE. Well, the following theorem enlightens us as to the distribution of the error sum of squares.
If:

the \(j^{th}\) measurement of the \(i^{th}\) group, that is, \(X_{ij}\), is an independently and normally distributed random variable with mean \(\mu_i\) and variance \(\sigma^2\)

and \(W^2_i=\dfrac{1}{n_i-1}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2\) is the sample variance of the \(i^{th}\) sample
Then:
\(\dfrac{SSE}{\sigma^2}\)
follows a chi-square distribution with n−m degrees of freedom.
Proof
A theorem we learned (way) back in Stat 414 tells us that if the two conditions stated in the theorem hold, then:
\(\dfrac{(n_i-1)W^2_i}{\sigma^2}\)
follows a chi-square distribution with \(n_i-1\) degrees of freedom. Another theorem we learned back in Stat 414 states that if we add up a bunch of independent chi-square random variables, then we get a chi-square random variable with the degrees of freedom added up, too. So, let's add up the above quantity over all m groups, that is, for \(i = 1\) to m. Doing so, we get:
\(\sum\limits_{i=1}^{m}\dfrac{(n_i-1)W^2_i}{\sigma^2}=\dfrac{\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2}{\sigma^2}=\dfrac{SSE}{\sigma^2}\)
Because we assume independence of the observations \(X_{ij}\), we are adding up independent chi-square random variables. (By the way, the assumption of independence is a perfectly fine assumption as long as we take a random sample when we collect the data.) Therefore, the theorem tells us that \(\dfrac{SSE}{\sigma^2}\) follows a chi-square distribution with:
\((n_1-1)+(n_2-1)+\cdots+(n_m-1)=n-m\)
degrees of freedom... as was to be proved.
Now, what can we say about the mean square error MSE? Well, one thing is this: MSE is an unbiased estimator of \(\sigma^2\). Here's why.
Recall that to show that MSE is an unbiased estimator of \(\sigma^2\), we need to show that \(E(MSE) = \sigma^2\). Also, recall that the expected value of a chisquare random variable is its degrees of freedom. The results of the previous theorem, therefore, suggest that:
\(E\left[ \dfrac{SSE}{\sigma^2}\right]=n-m\)
That said, here's the crux of the proof:
\(E[MSE]=E\left[\dfrac{SSE}{n-m} \right]=E\left[\dfrac{\sigma^2}{n-m} \cdot \dfrac{SSE}{\sigma^2} \right]=\dfrac{\sigma^2}{n-m}(n-m)=\sigma^2\)
The first equality comes from the definition of MSE. The second equality comes from multiplying MSE by 1 in a special way. The third equality comes from taking the expected value of \(\dfrac{SSE}{\sigma^2}\). And, the fourth and final equality comes from simple algebra.
Because \(E(MSE) = \sigma^2\), we have shown that, no matter what, MSE is an unbiased estimator of \(\sigma^2\)... always!
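A quick simulation illustrates this "always" (a sketch, not from the lesson; the group means, \(\sigma\), and sample sizes are arbitrary choices): even with wildly different group means, the average of many simulated MSE values stays close to \(\sigma^2\).

```python
import random

# Arbitrary setup: three very different group means, common sigma = 2,
# so the null hypothesis of equal means is badly violated.
random.seed(1)
mus, sigma, n_i, reps = [10.0, 20.0, 30.0], 2.0, 10, 2000
m, n = len(mus), len(mus) * n_i

mse_values = []
for _ in range(reps):
    groups = [[random.gauss(mu, sigma) for _ in range(n_i)] for mu in mus]
    # SSE/(n - m), exactly as in the ANOVA table
    sse = sum(sum((x - sum(g) / n_i) ** 2 for x in g) for g in groups)
    mse_values.append(sse / (n - m))

avg_mse = sum(mse_values) / reps
print(round(avg_mse, 2))    # close to sigma^2 = 4, even though H0 is false
```

The average of the simulated MSE values sits near \(\sigma^2 = 4\), illustrating that the unbiasedness of MSE does not depend on the null hypothesis being true.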
The Treatment Sum of Squares (SST)
Recall that the treatment sum of squares:
\(SS(T)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i}(\bar{X}_{i.}-\bar{X}_{..})^2\)
quantifies the distance of the treatment means from the grand mean. We'll just state the distribution of SST without proof.
If the null hypothesis:
\(H_0: \text{all }\mu_i \text{ are equal}\)
is true, then:
\(\dfrac{SST}{\sigma^2}\)
follows a chisquare distribution with m−1 degrees of freedom.
When we investigated the mean square error MSE above, we were able to conclude that MSE was always an unbiased estimator of \(\sigma^2\). Can the same be said for the mean square due to treatment MST = SST/(m−1)? Well...
The mean square due to treatment is an unbiased estimator of \(\sigma^2\) only if the null hypothesis is true, that is, only if the m population means are equal.
Proof
Since MST is a function of the sum of squares due to treatment SST, let's start with finding the expected value of SST. We learned, on the previous page, that the definition of SST can be written as:
\(SS(T)=\sum\limits_{i=1}^{m}n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\)
Therefore, the expected value of SST is:
\(E(SST)=E\left[\sum\limits_{i=1}^{m}n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\right]=\left[\sum\limits_{i=1}^{m}n_iE(\bar{X}^2_{i.})\right]-nE(\bar{X}_{..}^2)\)
Now, because, in general, \(E(X^2)=Var(X)+\mu^2\), we can do some substituting into that last equation, which simplifies to:
\(E(SST)=\left[\sum\limits_{i=1}^{m}n_i\left(\dfrac{\sigma^2}{n_i}+\mu_i^2\right)\right]-n\left[\dfrac{\sigma^2}{n}+\bar{\mu}^2\right]\)
where:
\(\bar{\mu}=\dfrac{1}{n}\sum\limits_{i=1}^{m}n_i \mu_i\)
because:
\(E(\bar{X}_{..})=\dfrac{1}{n}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} E(X_{ij})=\dfrac{1}{n}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} \mu_i=\dfrac{1}{n}\sum\limits_{i=1}^{m}n_i \mu_i=\bar{\mu}\)
Simplifying our expectation yet more, we get:
\(E(SST)=\left[\sum\limits_{i=1}^{m}\sigma^2\right]+\left[\sum\limits_{i=1}^{m}n_i\mu^2_i\right]-\sigma^2-n\bar{\mu}^2\)
And, simplifying yet again, we get:
\(E(SST)=\sigma^2(m-1)+\left[\sum\limits_{i=1}^{m}n_i(\mu_i-\bar{\mu})^2\right]\)
Okay, so we've simplified E(SST) as far as is probably necessary. Let's use it now to find E(MST).
Well, if the null hypothesis is true, \(\mu_1=\mu_2=\cdots=\mu_m=\bar{\mu}\), say, the expected value of the mean square due to treatment is:
\(E[MST]=E\left[\dfrac{SST}{m-1}\right]=\sigma^{2}+\dfrac{1}{m-1} \color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} n_{i}\left(\mu_{i}-\bar{\mu}\right)^{2}}^{0} \color{black}=\sigma^{2}\)
On the other hand, if the null hypothesis is not true, that is, if not all of the \(\mu_i\) are equal, then:
\(E(MST)=E\left[\dfrac{SST}{m-1}\right]=\sigma^2+\dfrac{1}{m-1}\sum\limits_{i=1}^{m} n_i(\mu_i-\bar{\mu})^2>\sigma^2\)
So, in summary, we have shown that MST is an unbiased estimator of \(\sigma^2\) if the null hypothesis is true, that is, if all of the means are equal. On the other hand, we have shown that, if the null hypothesis is not true, that is, if all of the means are not equal, then MST is a biased estimator of \(\sigma^2\) because E(MST) is inflated above \(\sigma^2\). Our proof is complete.
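A companion simulation (again a sketch with arbitrary means, \(\sigma\), and sample sizes, not from the lesson) shows the contrast: the average simulated MST sits near \(\sigma^2 = 4\) when the means are equal, but near \(\sigma^2 + \frac{1}{m-1}\sum n_i(\mu_i-\bar{\mu})^2 = 4 + 30 = 34\) for means 5, 5, 8 with \(n_i = 10\) and \(\sigma = 2\).

```python
import random

def mean_mst(mus, sigma=2.0, n_i=10, reps=2000):
    """Average of many simulated MST = SST/(m-1) values for given means."""
    m, n = len(mus), len(mus) * n_i
    total = 0.0
    for _ in range(reps):
        groups = [[random.gauss(mu, sigma) for _ in range(n_i)] for mu in mus]
        grand = sum(sum(g) for g in groups) / n
        sst = sum(n_i * (sum(g) / n_i - grand) ** 2 for g in groups)
        total += sst / (m - 1)
    return total / reps

random.seed(2)
mst_h0 = mean_mst([5.0, 5.0, 5.0])   # null true: centered at sigma^2 = 4
mst_ha = mean_mst([5.0, 5.0, 8.0])   # null false: inflated, near 34
print(round(mst_h0, 1), round(mst_ha, 1))
```

Under the null hypothesis the simulated average hovers near 4; under the alternative it is inflated to roughly 34, matching the bias term derived above.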
Our work on finding the expected values of MST and MSE suggests a reasonable statistic for testing the null hypothesis:
\(H_0: \text{all }\mu_i \text{ are equal}\)
against the alternative hypothesis:
\(H_A: \text{at least one of the }\mu_i \text{ differs from the others}\)
is:
\(F=\dfrac{MST}{MSE}\)
Now, why would this F be a reasonable statistic? Well, we showed above that \(E(MSE) = \sigma^2\). We also showed that under the null hypothesis, when the means are assumed to be equal, \(E(MST) = \sigma^2\), and under the alternative hypothesis when the means are not all equal, E(MST) is inflated above \(\sigma^2\). That suggests then that:

- If the null hypothesis is true, that is, if all of the population means are equal, we'd expect the ratio MST/MSE to be close to 1.
- If the alternative hypothesis is true, that is, if at least one of the population means differs from the others, we'd expect the ratio MST/MSE to be inflated above 1.
Now, just two questions remain:
- Why do you suppose we call MST/MSE an F-statistic?
- And, how inflated would MST/MSE have to be in order to reject the null hypothesis in favor of the alternative hypothesis?
Both of these questions are answered by knowing the distribution of MST/MSE.
The F-statistic
If \(X_{ij} \sim N(\mu, \sigma^2)\), that is, if the null hypothesis of equal means is true, then:
\(F=\dfrac{MST}{MSE}\)
follows an F distribution with m−1 numerator degrees of freedom and n−m denominator degrees of freedom.
Proof
It can be shown (we won't) that SST and SSE are independent. Then, it's just a matter of recalling that an F random variable is defined to be the ratio of two independent chi-square random variables, each divided by its degrees of freedom. That is:
\(F=\dfrac{SST/(m-1)}{SSE/(n-m)}=\dfrac{MST}{MSE} \sim F(m-1,n-m)\)
as was to be proved.
Now this all suggests that we should reject the null hypothesis of equal population means:
if \(F\geq F_{\alpha}(m-1,n-m)\) or if \(P=P(F(m-1,n-m)\geq F)\leq \alpha\)
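This decision rule can be visualized without F tables by simulating the null distribution of F = MST/MSE directly (a Monte Carlo sketch, not from the lesson): for m = 3 groups of \(n_i = 5\) observations each, the empirical 95th percentile of the simulated F values should land near the tabled critical value \(F_{0.05}(2, 12) \approx 3.89\).

```python
import random

# Simulate the null distribution of F = MST/MSE for m = 3 groups of
# n_i = 5 observations each (sizes chosen to match the learning study).
random.seed(3)
m, n_i, reps = 3, 5, 4000
n = m * n_i

f_stats = []
for _ in range(reps):
    # under H0, every group shares the same mean and variance
    groups = [[random.gauss(0.0, 1.0) for _ in range(n_i)] for _ in range(m)]
    grand = sum(sum(g) for g in groups) / n
    sst = sum(n_i * (sum(g) / n_i - grand) ** 2 for g in groups)
    sse = sum(sum((x - sum(g) / n_i) ** 2 for x in g) for g in groups)
    f_stats.append((sst / (m - 1)) / (sse / (n - m)))

f_stats.sort()
crit = f_stats[int(0.95 * reps)]    # empirical 95th percentile
print(round(crit, 2))               # should land near F_0.05(2, 12) ≈ 3.89
```

Observed F-statistics far beyond this simulated critical value (such as the learning study's F = 93.44) would clearly lead to rejecting the null hypothesis.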
If you go back and look at the assumptions that we made in deriving the analysis of variance F-test, you'll see that the F-test for the equality of means depends on three assumptions about the data:
- independence
- normality
- equal group variances
That means that you'll want to use the F-test only if there is evidence to believe that the assumptions are met. That said, as is the case with the two-sample t-test, the F-test works quite well even if the underlying measurements are not normally distributed, unless the data are highly skewed or the variances are markedly different. If the data are highly skewed, or if there is evidence that the variances differ greatly, we have two analysis options at our disposal. We could attempt to transform the observations (take the natural log of each value, for example) to make the data more symmetric with more similar variances. Alternatively, we could use nonparametric methods (that are unfortunately not covered in this course).
13.4  Another Example
Example 13-3
A researcher was interested in investigating whether Holocaust survivors have more sleep problems than others. She evaluated \(n = 120\) subjects in total, a subset of them were Holocaust survivors, a subset of them were documented as being depressed, and another subset of them were deemed healthy. (Of course, it's not at all obvious that these are mutually exclusive groups.) At any rate, all n = 120 subjects completed a questionnaire about the quality and duration of their regular sleep patterns. As a result of the questionnaire, each subject was assigned a Pittsburgh Sleep Quality Index (PSQI). Here's a dot plot of the resulting data:
Is there sufficient evidence at the \(\alpha = 0.05\) level to conclude that the mean PSQI for the three groups differ?
Answer
We can use Minitab to obtain the analysis of variance table. Doing so, we get:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 1723.8 | 861.9 | 61.69 | 0.000 |
| Error | 117 | 1634.8 | 14.0 | | |
| Total | 119 | 3358.6 | | | |
Since P < 0.001 ≤ 0.05, we reject the null hypothesis of equal means in favor of the alternative hypothesis of unequal means. There is sufficient evidence at the 0.05 level to conclude that the mean Pittsburgh Sleep Quality Index differs among the three groups.
Using Minitab®
There is no doubt that you'll want to use Minitab when performing an analysis of variance. The commands necessary to perform a one-factor analysis of variance in Minitab depend on whether the data in your worksheet are "stacked" or "unstacked." Let's illustrate using the learning method study data. Here's what the data would look like unstacked:
| std1 | osm1 | shk1 |
| ---- | ---- | ---- |
| 51 | 58 | 77 |
| 45 | 68 | 72 |
| 40 | 64 | 78 |
| 41 | 63 | 73 |
| 41 | 62 | 75 |
That is, the data from each group reside in a different column in the worksheet. If your data are entered in this way, then follow these instructions for performing the one-factor analysis of variance:
- Under the Stat menu, select ANOVA.
- Select OneWay (Unstacked).
- In the box labeled Responses, specify the columns containing the data.
- If you want dot plots and/or boxplots of the data, select Graphs...
- Select OK.
- The output should appear in the Session Window.
Here's what the data would look like stacked:
| Method | Score |
| ------ | ----- |
| 1 | 51 |
| 1 | 45 |
| 1 | 40 |
| 1 | 41 |
| 1 | 41 |
| 2 | 58 |
| 2 | 68 |
| 2 | 64 |
| 2 | 63 |
| 2 | 62 |
| 3 | 77 |
| 3 | 72 |
| 3 | 78 |
| 3 | 73 |
| 3 | 75 |
That is, one column contains a grouping variable, and another column contains the responses. If your data are entered in this way, then follow these instructions for performing the one-factor analysis of variance:
- Under the Stat menu, select ANOVA.
- Select OneWay.
- In the box labeled Response, specify the column containing the responses.
- In the box labeled Factor, specify the column containing the grouping variable.
- If you want dot plots and/or boxplots of the data, select Graphs...
- Select OK.
- The output should appear in the Session Window.
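If Minitab is not at hand, the stacking itself is easy to reproduce. Here is a plain-Python sketch (hypothetical variable names, not Minitab syntax) that converts the unstacked learning-study columns into stacked (Method, Score) pairs and back:

```python
# Convert the unstacked learning-study worksheet (one column per group)
# into stacked form: one (Method, Score) pair per observation.
unstacked = {"std1": [51, 45, 40, 41, 41],
             "osm1": [58, 68, 64, 63, 62],
             "shk1": [77, 72, 78, 73, 75]}

stacked = []
for method, column in enumerate(unstacked.values(), start=1):
    for score in column:
        stacked.append((method, score))

print(stacked[:3])        # → [(1, 51), (1, 45), (1, 40)]

# ...and unstack again: gather the scores back into one list per method
restacked = {}
for method, score in stacked:
    restacked.setdefault(method, []).append(score)

print(restacked[3])       # → [77, 72, 78, 73, 75]
```

The grouping variable is simply the column's position (1, 2, 3), matching the Method column in the stacked worksheet above.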