Lesson 13: One-Factor Analysis of Variance
We previously learned how to compare two population means using either the pooled two-sample t-test or Welch's t-test. What happens if we want to compare more than two means? In this lesson, we'll learn how to do just that. More specifically, we'll learn how to use the analysis of variance method to compare the equality of the (unknown) means \(\mu_1 , \mu_2 , \dots, \mu_m\) of m normal distributions with an unknown but common variance \(\sigma^2\). Take specific note of that last part... "an unknown but common variance \(\sigma^2\)." That is, the analysis of variance method assumes that the population variances are equal. In that regard, the analysis of variance method can be thought of as an extension of the pooled two-sample t-test.
13.1  The Basic Idea
We could take a top-down approach by first presenting the theory of analysis of variance and then following it up with an example. We're not going to do it that way, though. We're going to take a bottom-up approach, in which we first develop the idea behind the analysis of variance on this page, and then present the results on the next page. Only after we've completed those two steps will we take a step back and look at the theory behind analysis of variance. That said, let's start with our first example of the lesson.
Example 13-1
A researcher for an automobile safety institute was interested in determining whether or not the distance that it takes to stop a car going 60 miles per hour depends on the brand of the tire. The researcher measured the stopping distance (in feet) of ten randomly selected cars for each of five different brands. So that he and his assistants would remain blinded, the researcher arbitrarily labeled the brands of the tires as Brand1, Brand2, Brand3, Brand4, and Brand5. Here are the data resulting from his experiment:
| Brand1 | Brand2 | Brand3 | Brand4 | Brand5 |
| ------ | ------ | ------ | ------ | ------ |
| 194 | 189 | 185 | 183 | 195 |
| 184 | 204 | 183 | 193 | 197 |
| 189 | 190 | 186 | 184 | 194 |
| 189 | 190 | 183 | 186 | 202 |
| 188 | 189 | 179 | 194 | 200 |
| 186 | 207 | 191 | 199 | 211 |
| 195 | 203 | 188 | 196 | 203 |
| 186 | 193 | 196 | 188 | 206 |
| 183 | 181 | 189 | 193 | 202 |
| 188 | 206 | 194 | 196 | 195 |
Do the data provide enough evidence to conclude that at least one of the brands is different from the others with respect to stopping distance?
Answer
The first thing we might want to do is to create some sort of summary plot of the data. Here is a box plot of the data:
Hmmm. It appears that the box plots for Brand1 and Brand5 have very little, if any, overlap at all. The same can be said for Brand3 and Brand5. Here are some summary statistics of the data:
| Brand | N | MEAN | SD |
| ----- | - | ---- | -- |
| 1 | 10 | 188.20 | 3.88 |
| 2 | 10 | 195.20 | 9.02 |
| 3 | 10 | 187.40 | 5.27 |
| 4 | 10 | 191.20 | 5.55 |
| 5 | 10 | 200.50 | 5.44 |
It appears that the sample means differ quite a bit. For example, the average stopping distance of Brand3 is 187.4 feet (with a standard deviation of 5.27 feet), while the average stopping distance of Brand5 is 200.5 feet (with a standard deviation of 5.44 feet). A difference of 13 feet could mean the difference between getting into an accident or not. But, of course, we can't draw conclusions about the performance of the brands based on one sample. After all, a different random sample of cars could yield different results. Instead, we need to use the sample means to try to draw conclusions about the population means.
More specifically, the researcher needs to test the null hypothesis that the group population means are all the same against the alternative that at least one group population mean differs from the others. That is, the researcher needs to test this null hypothesis:
\(H_0 \colon \mu_1=\mu_2=\mu_3=\mu_4=\mu_5\)
against this alternative hypothesis:
\(H_A \colon \) at least one of the \(\mu_i\) differs from the others
In this lesson, we are going to learn how to use a method called analysis of variance to answer the researcher's question. Jumping right to the punch line, with no development or theoretical justification whatsoever, we'll use an analysis of variance table, such as this one:
Analysis of Variance for comparing all 5 brands

| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Brand | 4 | 1174.8 | 293.7 | 7.95 | 0.000 |
| Error | 45 | 1661.7 | 36.9 | | |
| Total | 49 | 2836.5 | | | |
to draw conclusions about the equality of two or more population means. And, as we always do when performing hypothesis tests, we'll compare the P-value to \(\alpha\), our desired willingness to commit a Type I error. In this case, the researcher's P-value is very small (0.000, to three decimal places), so he should reject his null hypothesis. That is, there is sufficient evidence, at even a 0.01 level, to conclude that the mean stopping distance for at least one brand of tire is different from the mean stopping distances of the others.
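The arithmetic behind this table can be checked by hand. As a sketch (not part of the original lesson), here is the tire-study ANOVA computed in plain Python from the sums-of-squares formulas that are developed later in this lesson:

```python
# Hypothetical recomputation of the tire-study ANOVA table in plain Python.
brands = {
    "Brand1": [194, 184, 189, 189, 188, 186, 195, 186, 183, 188],
    "Brand2": [189, 204, 190, 190, 189, 207, 203, 193, 181, 206],
    "Brand3": [185, 183, 186, 183, 179, 191, 188, 196, 189, 194],
    "Brand4": [183, 193, 184, 186, 194, 199, 196, 188, 193, 196],
    "Brand5": [195, 197, 194, 202, 200, 211, 203, 206, 202, 195],
}

n = sum(len(x) for x in brands.values())          # 50 cars in total
m = len(brands)                                   # 5 brands
grand_mean = sum(sum(x) for x in brands.values()) / n

# SS(Between): squared distances of the group means from the grand mean
ss_between = sum(len(x) * (sum(x) / len(x) - grand_mean) ** 2
                 for x in brands.values())
# SS(Error): squared distances of the observations from their group means
ss_error = sum(sum((xij - sum(x) / len(x)) ** 2 for xij in x)
               for x in brands.values())

msb = ss_between / (m - 1)                        # mean square between
mse = ss_error / (n - m)                          # mean square error
f_stat = msb / mse

print(round(ss_between, 1), round(ss_error, 1), round(f_stat, 2))
# → 1174.8 1661.7 7.95
```

The between and error sums of squares (1174.8 and 1661.7) and the F-statistic (7.95) match the table above; filling in the P-value column would additionally require the F(4, 45) distribution.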
So far, we have seen a typical null and alternative hypothesis in the analysis of variance framework, as well as an analysis of variance table. Let's take a look at another example with the idea of continuing to work on developing the basic idea behind the analysis of variance method.
Example 13-2
Suppose an education researcher is interested in determining whether a learning method affects students' exam scores. Specifically, suppose she considers these three methods:
- standard
- osmosis
- shock therapy
Suppose she convinces 15 students to take part in her study, so she randomly assigns 5 students to each method. Then, after waiting eight weeks, she tests the students to get exam scores.
What would the researcher's data have to look like to be able to conclude that at least one of the methods yields different exam scores than the others?
Answer
Suppose a dot plot of the researcher's data looked like this:
What would we want to conclude? Well, there's a lot of separation in the data between the three methods. In this case, there is little variation in the data within each method, but a lot of variation in the data across the three methods. For these data, we would probably be willing to conclude that there is a difference between the three methods.
Now, suppose instead that a dot plot of the researcher's data looked like this:
What would we want to conclude? Well, there's less separation in the data between the three methods. In this case, there is a lot of variation in the data within each method, and still some variation in the data across the three methods, but not as much as in the previous dot plot. For these data, it is not as obvious that we can conclude that there is a difference between the three methods.
Let's consider one more possible dot plot:
What would we want to conclude here? Well, there's even less separation in the data between the three methods. In this case, there is a great deal of variation in the data within each method, and not much variation at all in the data across the three methods. For these data, we would probably want to conclude that there is no difference between the three methods.
If you go back and look at the three possible data sets, you'll see that we drew our conclusions by comparing the variation in the data within a method to the variation in the data across methods. Let's try to formalize that idea a bit more by revisiting the two most extreme examples. First, the example in which we concluded that the methods differ:
Let's quantify (or are we still just qualifying?) the amount of variation within a method by comparing the five data points within a method to the method's mean, as represented in the plot as a color-coded triangle. And, let's quantify (or qualify?) the amount of variation across the methods by comparing the method means, again represented in the plot as a color-coded triangle, to the overall grand mean, that is, the average of all fifteen data points (ignoring the method). In this case, the variation between the group means and the grand mean is larger than the variation within the groups.
Now, let's revisit the example in which we wanted to conclude that there was no difference in the three methods:
In this case, the variation between the group means and the grand mean is smaller than the variation within the groups.
Hmmm... these two examples suggest that our method should compare the variation between the groups to that of the variation within the groups. That's just what an analysis of variance does!
Let's see what conclusion we draw from an analysis of variance of these data. Here's the analysis of variance table for the first study, in which we wanted to conclude that there was a difference in the three methods:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 2510.5 | 1255.3 | 93.44 | 0.000 |
| Error | 12 | 161.2 | 13.4 | | |
| Total | 14 | 2671.7 | | | |
In this case, the P-value is small (0.000, to three decimal places). We can reject the null hypothesis of equal means at the 0.05 level. That is, there is sufficient evidence at the 0.05 level to conclude that the mean exam scores of the three study methods are significantly different.
Here's the analysis of variance table for the third study, in which we wanted to conclude that there was no difference in the three methods:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 80.1 | 40.1 | 0.46 | 0.643 |
| Error | 12 | 1050.8 | 87.6 | | |
| Total | 14 | 1130.9 | | | |
In this case, the P-value, 0.643, is large. We fail to reject the null hypothesis of equal means at the 0.05 level. That is, there is insufficient evidence at the 0.05 level to conclude that the mean exam scores of the three study methods are significantly different.
Hmmm. It seems like we're on to something! Let's summarize.
The Basic Idea Behind Analysis of Variance
Analysis of variance involves dividing the overall variability in observed data values so that we can draw conclusions about the equality, or lack thereof, of the means of the populations from which the data came. The overall (or "total") variability is divided into two components:

- the variability "between" groups
- the variability "within" groups
We summarize the division of the variability in an "analysis of variance table", which is often shortened and called an "ANOVA table." Without knowing what we were really looking at, we looked at a few examples of ANOVA tables here on this page. Let's now take an in-depth look at the content of ANOVA tables.
13.2  The ANOVA Table
For the sake of concreteness here, let's recall one of the analysis of variance tables from the previous page:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 2510.5 | 1255.3 | 93.44 | 0.000 |
| Error | 12 | 161.2 | 13.4 | | |
| Total | 14 | 2671.7 | | | |
In working to digest what is all contained in an ANOVA table, let's start with the column headings:
- Source means "the source of the variation in the data." As we'll soon see, the possible choices for a one-factor study, such as the learning study, are Factor, Error, and Total. The factor is the characteristic that defines the populations being compared. In the tire study, the factor is the brand of tire. In the learning study, the factor is the learning method.
- DF means "the degrees of freedom in the source."
- SS means "the sum of squares due to the source."
- MS means "the mean sum of squares due to the source."
- F means "the F-statistic."
- P means "the P-value."
Now, let's consider the row headings:
- Factor means "the variability due to the factor of interest." In the tire example on the previous page, the factor was the brand of the tire. In the learning example on the previous page, the factor was the method of learning. Sometimes, the factor is a treatment, and therefore the row heading is instead labeled as Treatment. And, sometimes the row heading is labeled as Between to make it clear that the row concerns the variation between the groups.
- Error means "the variability within the groups" or "unexplained random error." Sometimes, the row heading is labeled as Within to make it clear that the row concerns the variation within the groups.
- Total means "the total variation in the data from the grand mean" (that is, ignoring the factor of interest).
With the column headings and row headings now defined, let's take a look at the individual entries inside a general one-factor ANOVA table:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | m−1 | SS(Between) | MSB | MSB/MSE | \(P(F(m-1,n-m)\geq F)\) |
| Error | n−m | SS(Error) | MSE | | |
| Total | n−1 | SS(Total) | | | |
Yikes, that looks overwhelming! Let's work our way through it entry by entry to see if we can make it all clear. Let's start with the degrees of freedom (DF) column:
- If there are n total data points collected, then there are n−1 total degrees of freedom.
- If there are m groups being compared, then there are m−1 degrees of freedom associated with the factor of interest.
- If there are n total data points collected and m groups being compared, then there are n−m error degrees of freedom.
Now, the sums of squares (SS) column:
- As we'll soon formalize below, SS(Between) is the sum of squares between the group means and the grand mean. As the name suggests, it quantifies the variability between the groups of interest.
- Again, as we'll formalize below, SS(Error) is the sum of squares between the data and the group means. It quantifies the variability within the groups of interest.
- SS(Total) is the sum of squares between the n data points and the grand mean. As the name suggests, it quantifies the total variability in the observed data. We'll soon see that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is:
SS(Total) = SS(Between) + SS(Error)
The mean squares (MS) column, as the name suggests, contains the "average" sum of squares for the Factor and the Error:
- The Mean Sum of Squares between the groups, denoted MSB, is calculated by dividing the Sum of Squares between the groups by the between group degrees of freedom. That is, MSB = SS(Between)/(m−1).
- The Error Mean Sum of Squares, denoted MSE, is calculated by dividing the Sum of Squares within the groups by the error degrees of freedom. That is, MSE = SS(Error)/(n−m).
The F column, not surprisingly, contains the F-statistic. Because we want to compare the "average" variability between the groups to the "average" variability within the groups, we take the ratio of the Between Mean Sum of Squares to the Error Mean Sum of Squares. That is, the F-statistic is calculated as F = MSB/MSE.
When, on the next page, we delve into the theory behind the analysis of variance method, we'll see that the F-statistic follows an F-distribution with m−1 numerator degrees of freedom and n−m denominator degrees of freedom. Therefore, we'll calculate the P-value, as it appears in the column labeled P, by comparing the F-statistic to an F-distribution with m−1 numerator degrees of freedom and n−m denominator degrees of freedom.
Now, having defined the individual entries of a general ANOVA table, let's revisit and, in the process, dissect the ANOVA table for the first learning study on the previous page, in which n = 15 students were subjected to one of m = 3 methods of learning:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 2510.5 | 1255.3 | 93.44 | 0.000 |
| Error | 12 | 161.2 | 13.4 | | |
| Total | 14 | 2671.7 | | | |
- Because n = 15, there are n−1 = 15−1 = 14 total degrees of freedom.
- Because m = 3, there are m−1 = 3−1 = 2 degrees of freedom associated with the factor.
- The degrees of freedom add up, so we can get the error degrees of freedom by subtracting the degrees of freedom associated with the factor from the total degrees of freedom. That is, the error degrees of freedom is 14−2 = 12. Alternatively, we can calculate the error degrees of freedom directly from n−m = 15−3 = 12.
- We'll learn how to calculate the sum of squares in a minute. For now, take note that the total sum of squares, SS(Total), can be obtained by adding the between sum of squares, SS(Between), to the error sum of squares, SS(Error). That is: 2671.7 = 2510.5 + 161.2
- MSB is SS(Between) divided by the between group degrees of freedom. That is, 1255.3 = 2510.5 ÷ 2.
- MSE is SS(Error) divided by the error degrees of freedom. That is, 13.4 = 161.2 ÷ 12.
- The F-statistic is the ratio of MSB to MSE. That is, F = 1255.3 ÷ 13.4 = 93.44.
- The P-value is P(F(2,12) ≥ 93.44) < 0.001.
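To see the same bookkeeping in code, here is a hypothetical helper (a plain-Python sketch, not from the lesson) that fills in the DF, SS, MS, and F entries of a one-factor ANOVA table; the scores are the learning-study data listed in full in the Using Minitab section at the end of this lesson:

```python
# A sketch of a one-factor ANOVA table builder in plain Python.
def one_way_anova(groups):
    n = sum(len(g) for g in groups)
    m = len(groups)
    grand = sum(sum(g) for g in groups) / n
    # SS(Between) and SS(Error), straight from their definitions
    ss_t = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_e = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    msb, mse = ss_t / (m - 1), ss_e / (n - m)
    return {"df": (m - 1, n - m, n - 1),         # factor, error, total
            "ss": (ss_t, ss_e, ss_t + ss_e),
            "ms": (msb, mse),
            "F": msb / mse}

scores = [[51, 45, 40, 41, 41],    # standard
          [58, 68, 64, 63, 62],    # osmosis
          [77, 72, 78, 73, 75]]    # shock therapy
table = one_way_anova(scores)
print(table["df"], round(table["F"], 2))   # → (2, 12, 14) 93.44
```

The returned entries — degrees of freedom (2, 12, 14), sums of squares 2510.5 and 161.2, and F = 93.44 — agree with the table dissected above.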
Okay, we slowly, but surely, keep on adding bit by bit to our knowledge of an analysis of variance table. Let's now work a bit on the sums of squares.
The Sums of Squares
In essence, we now know that we want to break down the TOTAL variation in the data into two components:
- a component that is due to the TREATMENT (or FACTOR), and
- a component that is due to just RANDOM ERROR.
Let's see what kind of formulas we can come up with for quantifying these components. But first, as always, we need to define some notation. Let's represent our data, the group means, and the grand mean as follows:
| Group | Data | Means |
| ----- | ---- | ----- |
| 1 | \(X_{11}, X_{12}, \ldots, X_{1n_1}\) | \(\bar{X}_{1.}\) |
| 2 | \(X_{21}, X_{22}, \ldots, X_{2n_2}\) | \(\bar{X}_{2.}\) |
| ⋮ | ⋮ | ⋮ |
| \(m\) | \(X_{m1}, X_{m2}, \ldots, X_{mn_m}\) | \(\bar{X}_{m.}\) |
| Grand Mean | | \(\bar{X}_{..}\) |
That is, we'll let:
- m denote the number of groups being compared
- \(X_{ij}\) denote the \(j^{th}\) observation in the \(i^{th}\) group, where \(i = 1, 2, \dots , m\) and \(j = 1, 2, \dots, n_i\). The important thing to note here is that j goes from 1 to \(n_i\), not to \(n\). That is, the number of data points in a group depends on the group i, so the number of data points in each group need not be the same. We could have 5 measurements in one group and 6 measurements in another.
- \(\bar{X}_{i.}=\dfrac{1}{n_i}\sum\limits_{j=1}^{n_i} X_{ij}\) denote the sample mean of the observed data for group i, where \(i = 1, 2, \dots , m\)
- \(\bar{X}_{..}=\dfrac{1}{n}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X_{ij}\) denote the grand mean of all n observed data points
Okay, with the notation now defined, let's first consider the total sum of squares, which we'll denote here as SS(TO). Because we want the total sum of squares to quantify the variation in the data regardless of its source, it makes sense that SS(TO) would be the sum of the squared distances of the observations \(X_{ij}\) to the grand mean \(\bar{X}_{..}\). That is:
\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{..})^2\)
With just a little bit of algebraic work, the total sum of squares can be alternatively calculated as:
\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} X^2_{ij}-n\bar{X}_{..}^2\)
Can you do the algebra?
Now, let's consider the treatment sum of squares, which we'll denote SS(T). Because we want the treatment sum of squares to quantify the variation between the treatment groups, it makes sense that SS(T) would be the sum of the squared distances of the treatment means \(\bar{X}_{i.}\) to the grand mean \(\bar{X}_{..}\). That is:
\(SS(T)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (\bar{X}_{i.}-\bar{X}_{..})^2\)
Again, with just a little bit of algebraic work, the treatment sum of squares can be alternatively calculated as:
\(SS(T)=\sum\limits_{i=1}^{m}n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\)
Can you do the algebra?
Finally, let's consider the error sum of squares, which we'll denote SS(E). Because we want the error sum of squares to quantify the variation in the data, not otherwise explained by the treatment, it makes sense that SS(E) would be the sum of the squared distances of the observations \(X_{ij}\) to the treatment means \(\bar{X}_{i.}\). That is:
\(SS(E)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2\)
As we'll see in just a minute, the easiest way to calculate the error sum of squares is to subtract the treatment sum of squares from the total sum of squares. That is:
\(SS(E)=SS(TO)-SS(T)\)
Okay, now, do you remember that part about wanting to break down the total variation SS(TO) into a component due to the treatment SS(T) and a component due to random error SS(E)? Well, some simple algebra leads us to this:
\(SS(TO)=SS(T)+SS(E)\)
which explains the simple way of calculating the error sum of squares. At any rate, here's the simple algebra:
Well, okay, so the proof does involve a little trick of adding 0 in a special way to the total sum of squares:
\(SS(TO) = \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}((X_{ij}\color{red}\overbrace{\color{black}-\bar{X}_{i.})+(\bar{X}_{i.}}^{\text{Add to 0}}\color{black}-\bar{X}_{..}))^{2}\)
Then, squaring the term in parentheses, as well as distributing the summation signs, we get:
\(SS(TO)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2+2\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})(\bar{X}_{i.}-\bar{X}_{..})+\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (\bar{X}_{i.}-\bar{X}_{..})^2\)
Now, it's just a matter of recognizing each of the terms:
\(SS(TO)=
\color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}\left(X_{ij}-\bar{X}_{i.}\right)^{2}}^{\text{SS(E)}}
\color{black}+2
\color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}\left(X_{ij}-\bar{X}_{i.}\right)\left(\bar{X}_{i.}-\bar{X}_{..}\right)}^{0}
\color{black}+
\color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{n_{i}}\left(\bar{X}_{i.}-\bar{X}_{..}\right)^{2}}^{\text{SS(T)}}\)

(The middle term is 0 because, within each group, the deviations \(X_{ij}-\bar{X}_{i.}\) sum to 0.)
That is, we've shown that:
\(SS(TO)=SS(T)+SS(E)\)
as was to be proved.
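The identity, and the two shortcut formulas given above, can also be checked numerically. Here is a small sanity check (made-up data, not from the lesson) using groups of unequal sizes:

```python
# Made-up data with unequal group sizes: m = 3 groups, n = 9 observations
data = [[5, 7, 9], [2, 4], [8, 6, 7, 9]]

n = sum(len(g) for g in data)
grand = sum(sum(g) for g in data) / n
means = [sum(g) / len(g) for g in data]

# the three definitional sums of squares
ss_to = sum((x - grand) ** 2 for g in data for x in g)
ss_t = sum(len(g) * (xbar - grand) ** 2 for g, xbar in zip(data, means))
ss_e = sum((x - xbar) ** 2 for g, xbar in zip(data, means) for x in g)
assert abs(ss_to - (ss_t + ss_e)) < 1e-8      # SS(TO) = SS(T) + SS(E)

# the shortcut forms: raw sums of squares minus a grand-mean correction
ss_to_shortcut = sum(x ** 2 for g in data for x in g) - n * grand ** 2
ss_t_shortcut = sum(len(g) * xbar ** 2
                    for g, xbar in zip(data, means)) - n * grand ** 2
assert abs(ss_to - ss_to_shortcut) < 1e-8
assert abs(ss_t - ss_t_shortcut) < 1e-8

print(round(ss_to, 1), round(ss_t, 1), round(ss_e, 1))   # → 44.0 29.0 15.0
```

Up to floating-point rounding, the decomposition and both shortcut formulas agree with the definitional sums.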
13.3  Theoretical Results
So far, in an attempt to understand the analysis of variance method conceptually, we've been waving our hands at the theory behind the method. We can't procrastinate any further... we now need to address some of the theory behind the method. Specifically, we need to address the distribution of the error sum of squares (SSE), the distribution of the treatment sum of squares (SST), and the distribution of the all-important F-statistic.
The Error Sum of Squares (SSE)
Recall that the error sum of squares:
\(SS(E)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2\)
quantifies the error remaining after explaining some of the variation in the observations \(X_{ij}\) by the treatment means. Let's see what we can say about SSE. Well, the following theorem enlightens us as to the distribution of the error sum of squares.
If:

the \(j^{th}\) measurement of the \(i^{th}\) group, that is, \(X_{ij}\), is an independently and normally distributed random variable with mean \(\mu_i\) and variance \(\sigma^2\)

and \(W^2_i=\dfrac{1}{n_i-1}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2\) is the sample variance of the \(i^{th}\) sample
Then:
\(\dfrac{SSE}{\sigma^2}\)
follows a chi-square distribution with n−m degrees of freedom.
Proof
A theorem we learned (way) back in Stat 414 tells us that if the two conditions stated in the theorem hold, then:
\(\dfrac{(n_i-1)W^2_i}{\sigma^2}\)
follows a chi-square distribution with \(n_i-1\) degrees of freedom. Another theorem we learned back in Stat 414 states that if we add up a bunch of independent chi-square random variables, then we get a chi-square random variable with the degrees of freedom added up, too. So, let's add up the above quantity over all m groups, that is, for \(i = 1\) to m. Doing so, we get:
\(\sum\limits_{i=1}^{m}\dfrac{(n_i-1)W^2_i}{\sigma^2}=\dfrac{\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} (X_{ij}-\bar{X}_{i.})^2}{\sigma^2}=\dfrac{SSE}{\sigma^2}\)
Because we assume independence of the observations \(X_{ij}\), we are adding up independent chi-square random variables. (By the way, the assumption of independence is a perfectly fine assumption as long as we take a random sample when we collect the data.) Therefore, the theorem tells us that \(\dfrac{SSE}{\sigma^2}\) follows a chi-square distribution with:
\((n_1-1)+(n_2-1)+\cdots+(n_m-1)=n-m\)
degrees of freedom... as was to be proved.
Now, what can we say about the mean square error MSE? Well, one thing is this: MSE is an unbiased estimator of \(\sigma^2\). Here's why.
Recall that to show that MSE is an unbiased estimator of \(\sigma^2\), we need to show that \(E(MSE) = \sigma^2\). Also, recall that the expected value of a chisquare random variable is its degrees of freedom. The results of the previous theorem, therefore, suggest that:
\(E\left[ \dfrac{SSE}{\sigma^2}\right]=n-m\)
That said, here's the crux of the proof:
\(E[MSE]=E\left[\dfrac{SSE}{n-m} \right]=E\left[\dfrac{\sigma^2}{n-m} \cdot \dfrac{SSE}{\sigma^2} \right]=\dfrac{\sigma^2}{n-m}(n-m)=\sigma^2\)
The first equality comes from the definition of MSE. The second equality comes from multiplying MSE by 1 in a special way. The third equality comes from taking the expected value of \(\dfrac{SSE}{\sigma^2}\). And, the fourth and final equality comes from simple algebra.
Because \(E(MSE) = \sigma^2\), we have shown that, no matter what, MSE is an unbiased estimator of \(\sigma^2\)... always!
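A quick simulation illustrates this "always" (a sketch, not from the lesson; the group means, \(\sigma\), and sample sizes are arbitrary choices): even with wildly different group means, the average of many simulated MSE values stays close to \(\sigma^2\).

```python
import random

# Arbitrary setup: three very different group means, common sigma = 2,
# so the null hypothesis of equal means is badly violated.
random.seed(1)
mus, sigma, n_i, reps = [10.0, 20.0, 30.0], 2.0, 10, 2000
m, n = len(mus), len(mus) * n_i

mse_values = []
for _ in range(reps):
    groups = [[random.gauss(mu, sigma) for _ in range(n_i)] for mu in mus]
    # SSE/(n - m), exactly as in the ANOVA table
    sse = sum(sum((x - sum(g) / n_i) ** 2 for x in g) for g in groups)
    mse_values.append(sse / (n - m))

avg_mse = sum(mse_values) / reps
print(round(avg_mse, 2))    # close to sigma^2 = 4, even though H0 is false
```

The average of the simulated MSE values sits near \(\sigma^2 = 4\), illustrating that the unbiasedness of MSE does not depend on the null hypothesis being true.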
The Treatment Sum of Squares (SST)
Recall that the treatment sum of squares:
\(SS(T)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i}(\bar{X}_{i.}-\bar{X}_{..})^2\)
quantifies the distance of the treatment means from the grand mean. We'll just state the distribution of SST without proof.
If the null hypothesis:
\(H_0: \text{all }\mu_i \text{ are equal}\)
is true, then:
\(\dfrac{SST}{\sigma^2}\)
follows a chisquare distribution with m−1 degrees of freedom.
When we investigated the mean square error MSE above, we were able to conclude that MSE was always an unbiased estimator of \(\sigma^2\). Can the same be said for the mean square due to treatment MST = SST/(m−1)? Well...
The mean square due to treatment is an unbiased estimator of \(\sigma^2\) only if the null hypothesis is true, that is, only if the m population means are equal.
Proof
Since MST is a function of the sum of squares due to treatment SST, let's start with finding the expected value of SST. We learned, on the previous page, that the definition of SST can be written as:
\(SS(T)=\sum\limits_{i=1}^{m}n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\)
Therefore, the expected value of SST is:
\(E(SST)=E\left[\sum\limits_{i=1}^{m}n_i\bar{X}^2_{i.}-n\bar{X}_{..}^2\right]=\left[\sum\limits_{i=1}^{m}n_iE(\bar{X}^2_{i.})\right]-nE(\bar{X}_{..}^2)\)
Now, because, in general, \(E(X^2)=Var(X)+\mu^2\), we can do some substituting into that last equation, which simplifies to:
\(E(SST)=\left[\sum\limits_{i=1}^{m}n_i\left(\dfrac{\sigma^2}{n_i}+\mu_i^2\right)\right]-n\left[\dfrac{\sigma^2}{n}+\bar{\mu}^2\right]\)
where:
\(\bar{\mu}=\dfrac{1}{n}\sum\limits_{i=1}^{m}n_i \mu_i\)
because:
\(E(\bar{X}_{..})=\dfrac{1}{n}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} E(X_{ij})=\dfrac{1}{n}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_i} \mu_i=\dfrac{1}{n}\sum\limits_{i=1}^{m}n_i \mu_i=\bar{\mu}\)
Simplifying our expectation yet more, we get:
\(E(SST)=\left[\sum\limits_{i=1}^{m}\sigma^2\right]+\left[\sum\limits_{i=1}^{m}n_i\mu^2_i\right]-\sigma^2-n\bar{\mu}^2\)
And, simplifying yet again, we get:
\(E(SST)=\sigma^2(m-1)+\left[\sum\limits_{i=1}^{m}n_i(\mu_i-\bar{\mu})^2\right]\)
Okay, so we've simplified E(SST) as far as is probably necessary. Let's use it now to find E(MST).
Well, if the null hypothesis is true, \(\mu_1=\mu_2=\cdots=\mu_m=\bar{\mu}\), say, the expected value of the mean square due to treatment is:
\(E[MST]=E\left[\dfrac{SST}{m-1}\right]=\sigma^{2}+\dfrac{1}{m-1} \color{red}\overbrace{\color{black}\sum\limits_{i=1}^{m} n_{i}\left(\mu_{i}-\bar{\mu}\right)^{2}}^{0} \color{black}=\sigma^{2}\)
On the other hand, if the null hypothesis is not true, that is, if not all of the \(\mu_i\) are equal, then:
\(E(MST)=E\left[\dfrac{SST}{m-1}\right]=\sigma^2+\dfrac{1}{m-1}\sum\limits_{i=1}^{m} n_i(\mu_i-\bar{\mu})^2>\sigma^2\)
So, in summary, we have shown that MST is an unbiased estimator of \(\sigma^2\) if the null hypothesis is true, that is, if all of the means are equal. On the other hand, we have shown that, if the null hypothesis is not true, that is, if all of the means are not equal, then MST is a biased estimator of \(\sigma^2\) because E(MST) is inflated above \(\sigma^2\). Our proof is complete.
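A companion simulation (again a sketch with arbitrary means, \(\sigma\), and sample sizes, not from the lesson) shows the contrast: the average simulated MST sits near \(\sigma^2 = 4\) when the means are equal, but near \(\sigma^2 + \frac{1}{m-1}\sum n_i(\mu_i-\bar{\mu})^2 = 4 + 30 = 34\) for means 5, 5, 8 with \(n_i = 10\) and \(\sigma = 2\).

```python
import random

def mean_mst(mus, sigma=2.0, n_i=10, reps=2000):
    """Average of many simulated MST = SST/(m-1) values for given means."""
    m, n = len(mus), len(mus) * n_i
    total = 0.0
    for _ in range(reps):
        groups = [[random.gauss(mu, sigma) for _ in range(n_i)] for mu in mus]
        grand = sum(sum(g) for g in groups) / n
        sst = sum(n_i * (sum(g) / n_i - grand) ** 2 for g in groups)
        total += sst / (m - 1)
    return total / reps

random.seed(2)
mst_h0 = mean_mst([5.0, 5.0, 5.0])   # null true: centered at sigma^2 = 4
mst_ha = mean_mst([5.0, 5.0, 8.0])   # null false: inflated, near 34
print(round(mst_h0, 1), round(mst_ha, 1))
```

Under the null hypothesis the simulated average hovers near 4; under the alternative it is inflated to roughly 34, matching the bias term derived above.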
Our work on finding the expected values of MST and MSE suggests a reasonable statistic for testing the null hypothesis:
\(H_0: \text{all }\mu_i \text{ are equal}\)
against the alternative hypothesis:
\(H_A: \text{at least one of the }\mu_i \text{ differs from the others}\)
is:
\(F=\dfrac{MST}{MSE}\)
Now, why would this F be a reasonable statistic? Well, we showed above that \(E(MSE) = \sigma^2\). We also showed that under the null hypothesis, when the means are assumed to be equal, \(E(MST) = \sigma^2\), and under the alternative hypothesis when the means are not all equal, E(MST) is inflated above \(\sigma^2\). That suggests then that:

- If the null hypothesis is true, that is, if all of the population means are equal, we'd expect the ratio MST/MSE to be close to 1.
- If the alternative hypothesis is true, that is, if at least one of the population means differs from the others, we'd expect the ratio MST/MSE to be inflated above 1.
Now, just two questions remain:
- Why do you suppose we call MST/MSE an F-statistic?
- And, how inflated would MST/MSE have to be in order to reject the null hypothesis in favor of the alternative hypothesis?
Both of these questions are answered by knowing the distribution of MST/MSE.
The F-statistic
If \(X_{ij} \sim N(\mu, \sigma^2)\), that is, if the null hypothesis of equal means is true, then:
\(F=\dfrac{MST}{MSE}\)
follows an F distribution with m−1 numerator degrees of freedom and n−m denominator degrees of freedom.
Proof
It can be shown (we won't) that SST and SSE are independent. Then, it's just a matter of recalling that an F random variable is defined to be the ratio of two independent chi-square random variables, each divided by its degrees of freedom. That is:
\(F=\dfrac{SST/(m-1)}{SSE/(n-m)}=\dfrac{MST}{MSE} \sim F(m-1,n-m)\)
as was to be proved.
Now this all suggests that we should reject the null hypothesis of equal population means:
if \(F\geq F_{\alpha}(m-1,n-m)\) or if \(P=P(F(m-1,n-m)\geq F)\leq \alpha\)
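This decision rule can be visualized without F tables by simulating the null distribution of F = MST/MSE directly (a Monte Carlo sketch, not from the lesson): for m = 3 groups of \(n_i = 5\) observations each, the empirical 95th percentile of the simulated F values should land near the tabled critical value \(F_{0.05}(2, 12) \approx 3.89\).

```python
import random

# Simulate the null distribution of F = MST/MSE for m = 3 groups of
# n_i = 5 observations each (sizes chosen to match the learning study).
random.seed(3)
m, n_i, reps = 3, 5, 4000
n = m * n_i

f_stats = []
for _ in range(reps):
    # under H0, every group shares the same mean and variance
    groups = [[random.gauss(0.0, 1.0) for _ in range(n_i)] for _ in range(m)]
    grand = sum(sum(g) for g in groups) / n
    sst = sum(n_i * (sum(g) / n_i - grand) ** 2 for g in groups)
    sse = sum(sum((x - sum(g) / n_i) ** 2 for x in g) for g in groups)
    f_stats.append((sst / (m - 1)) / (sse / (n - m)))

f_stats.sort()
crit = f_stats[int(0.95 * reps)]    # empirical 95th percentile
print(round(crit, 2))               # should land near F_0.05(2, 12) ≈ 3.89
```

Observed F-statistics far beyond this simulated critical value (such as the learning study's F = 93.44) would clearly lead to rejecting the null hypothesis.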
If you go back and look at the assumptions that we made in deriving the analysis of variance F-test, you'll see that the F-test for the equality of means depends on three assumptions about the data:
- independence
- normality
- equal group variances
That means that you'll want to use the F-test only if there is evidence to believe that the assumptions are met. That said, as is the case with the two-sample t-test, the F-test works quite well even if the underlying measurements are not normally distributed, unless the data are highly skewed or the variances are markedly different. If the data are highly skewed, or if there is evidence that the variances differ greatly, we have two analysis options at our disposal. We could attempt to transform the observations (take the natural log of each value, for example) to make the data more symmetric with more similar variances. Alternatively, we could use nonparametric methods (that are unfortunately not covered in this course).
13.4  Another Example
Example 13-3
A researcher was interested in investigating whether Holocaust survivors have more sleep problems than others. She evaluated \(n = 120\) subjects in total, a subset of them were Holocaust survivors, a subset of them were documented as being depressed, and another subset of them were deemed healthy. (Of course, it's not at all obvious that these are mutually exclusive groups.) At any rate, all n = 120 subjects completed a questionnaire about the quality and duration of their regular sleep patterns. As a result of the questionnaire, each subject was assigned a Pittsburgh Sleep Quality Index (PSQI). Here's a dot plot of the resulting data:
Is there sufficient evidence at the \(\alpha = 0.05\) level to conclude that the mean PSQI for the three groups differ?
Answer
We can use Minitab to obtain the analysis of variance table. Doing so, we get:
| Source | DF | SS | MS | F | P |
| ------ | -- | -- | -- | - | - |
| Factor | 2 | 1723.8 | 861.9 | 61.69 | 0.000 |
| Error | 117 | 1634.8 | 14.0 | | |
| Total | 119 | 3358.6 | | | |
Since P < 0.001 ≤ 0.05, we reject the null hypothesis of equal means in favor of the alternative hypothesis of unequal means. There is sufficient evidence at the 0.05 level to conclude that the mean Pittsburgh Sleep Quality Index differs among the three groups.
Using Minitab®
There is no doubt that you'll want to use Minitab when performing an analysis of variance. The commands necessary to perform a one-factor analysis of variance in Minitab depend on whether the data in your worksheet are "stacked" or "unstacked." Let's illustrate using the learning method study data. Here's what the data would look like unstacked:
| std1 | osm1 | shk1 |
| ---- | ---- | ---- |
| 51 | 58 | 77 |
| 45 | 68 | 72 |
| 40 | 64 | 78 |
| 41 | 63 | 73 |
| 41 | 62 | 75 |
That is, the data from each group reside in a different column in the worksheet. If your data are entered in this way, then follow these instructions for performing the one-factor analysis of variance:
- Under the Stat menu, select ANOVA.
- Select OneWay (Unstacked).
- In the box labeled Responses, specify the columns containing the data.
- If you want dot plots and/or boxplots of the data, select Graphs...
- Select OK.
- The output should appear in the Session Window.
Here's what the data would look like stacked:
| Method | Score |
| ------ | ----- |
| 1 | 51 |
| 1 | 45 |
| 1 | 40 |
| 1 | 41 |
| 1 | 41 |
| 2 | 58 |
| 2 | 68 |
| 2 | 64 |
| 2 | 63 |
| 2 | 62 |
| 3 | 77 |
| 3 | 72 |
| 3 | 78 |
| 3 | 73 |
| 3 | 75 |
That is, one column contains a grouping variable, and another column contains the responses. If your data are entered in this way, then follow these instructions for performing the one-factor analysis of variance:
- Under the Stat menu, select ANOVA.
- Select OneWay.
- In the box labeled Response, specify the column containing the responses.
- In the box labeled Factor, specify the column containing the grouping variable.
- If you want dot plots and/or boxplots of the data, select Graphs...
- Select OK.
- The output should appear in the Session Window.
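If Minitab is not at hand, the stacking itself is easy to reproduce. Here is a plain-Python sketch (hypothetical variable names, not Minitab syntax) that converts the unstacked learning-study columns into stacked (Method, Score) pairs and back:

```python
# Convert the unstacked learning-study worksheet (one column per group)
# into stacked form: one (Method, Score) pair per observation.
unstacked = {"std1": [51, 45, 40, 41, 41],
             "osm1": [58, 68, 64, 63, 62],
             "shk1": [77, 72, 78, 73, 75]}

stacked = []
for method, column in enumerate(unstacked.values(), start=1):
    for score in column:
        stacked.append((method, score))

print(stacked[:3])        # → [(1, 51), (1, 45), (1, 40)]

# ...and unstack again: gather the scores back into one list per method
restacked = {}
for method, score in stacked:
    restacked.setdefault(method, []).append(score)

print(restacked[3])       # → [77, 72, 78, 73, 75]
```

The grouping variable is simply the column's position (1, 2, 3), matching the Method column in the stacked worksheet above.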