# Lesson 17: Contingency Tables

## Overview

In this lesson, we'll investigate two more applications of the chi-square test. We'll first look at a method for testing whether two or more multinomial distributions are equal. This method is often referred to as the **test for homogeneity**. (Homogenized milk... get it?)

We'll then look at a method for testing whether two or more categorical variables are independent. This test is often referred to as the **test for independence**. It allows us to test for independence of, say, an individual's political affiliation and his/her preference for a particular presidential candidate.

## Objectives

- To learn how to conduct a test for homogeneity.
- To learn how to conduct a test for independence.
- To understand the proofs in the lesson.
- To be able to apply the methods learned in the lesson to new situations.

# 17.1 - Test For Homogeneity

As suggested in the introduction to this lesson, the test for homogeneity is a method, based on the chi-square statistic, for testing whether two or more multinomial distributions are equal. Let's start by trying to get a feel for how our data might "look" if we have two equal multinomial distributions.

## Example 17-1

A university admissions officer was concerned that males and females were accepted at different rates into the four different schools (business, engineering, liberal arts, and science) at her university. She collected the following data on the acceptance of 1200 males and 800 females who applied to the university:

Acceptances | Business | Engineering | Liberal Arts | Science | (FIXED) Total |
---|---|---|---|---|---|
Male | 300 (25%) | 240 (20%) | 300 (25%) | 360 (30%) | 1200 |
Female | 200 (25%) | 160 (20%) | 200 (25%) | 240 (30%) | 800 |
Total | 500 (25%) | 400 (20%) | 500 (25%) | 600 (30%) | 2000 |

Are males and females distributed equally among the various schools?

### Answer

Let's start by focusing on the business school. We can see that, of the 1200 males who applied to the university, 300 (or 25%) were accepted into the business school. Of the 800 females who applied to the university, 200 (or 25%) were accepted into the business school. So, the business school looks to be in good shape, as an equal percentage of males and females, namely 25%, were accepted into it.

Now, for the engineering school. We can see that, of the 1200 males who applied to the university, 240 (or 20%) were accepted into the engineering school. Of the 800 females who applied to the university, 160 (or 20%) were accepted into the engineering school. So, the engineering school also looks to be in good shape, as an equal percentage of males and females, namely 20%, were accepted into it.

We probably don't have to drag this out any further. If we look at each column in the table, we see that the proportion of males accepted into each school is the same as the proportion of females accepted into each school... which therefore happens to equal the proportion of students accepted into each school, regardless of gender. Therefore, we can conclude that males and females are distributed equally among the four schools.

## Example 17-2

A university admissions officer was concerned that males and females were accepted at different rates into the four different schools (business, engineering, liberal arts, and science) at her university. She collected the following data on the acceptance of 1200 males and 800 females who applied to the university:

Acceptances | Business | Engineering | Liberal Arts | Science | (FIXED) Total |
---|---|---|---|---|---|
Male | 240 (20%) | 480 (40%) | 120 (10%) | 360 (30%) | 1200 |
Female | 240 (30%) | 80 (10%) | 320 (40%) | 160 (20%) | 800 |
Total | 480 (24%) | 560 (28%) | 440 (22%) | 520 (26%) | 2000 |

Are males and females distributed equally among the various schools?

### Answer

Let's again start by focusing on the business school. In this case, of the 1200 males who applied to the university, 240 (or 20%) were accepted into the business school. And, of the 800 females who applied to the university, 240 (or 30%) were accepted into the business school. So, the business school appears to have different rates of acceptance for males and females, 20% compared to 30%.

Now, for the engineering school. We can see that, of the 1200 males who applied to the university, 480 (or 40%) were accepted into the engineering school. Of the 800 females who applied to the university, only 80 (or 10%) were accepted into the engineering school. So, the engineering school also appears to have different rates of acceptance for males and females, 40% compared to 10%.

Again, there's no need to drag this out any further. If we look at each column in the table, we see that the proportion of males accepted into each school is different from the proportion of females accepted into each school... and therefore the proportion of students accepted into each school, regardless of gender, is different from the proportion of males and females accepted into each school. Therefore, we can conclude that males and females are not distributed equally among the four schools.
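The column-by-column comparison in the two examples can be sketched in a few lines of Python. This is only an illustration: the function name `row_proportions` and the dictionary layout are assumptions made here, and exact fractions are used so that "equal percentages" can be checked without rounding.

```python
from fractions import Fraction

# Acceptance counts from Examples 17-1 and 17-2
# (columns: Business, Engineering, Liberal Arts, Science).
example_17_1 = {"Male": [300, 240, 300, 360], "Female": [200, 160, 200, 240]}
example_17_2 = {"Male": [240, 480, 120, 360], "Female": [240, 80, 320, 160]}


def row_proportions(counts):
    """Return each count divided by its row total, as exact fractions."""
    total = sum(counts)
    return [Fraction(y, total) for y in counts]


for name, table in [("17-1", example_17_1), ("17-2", example_17_2)]:
    male = row_proportions(table["Male"])
    female = row_proportions(table["Female"])
    same = male == female
    print(f"Example {name}: identical sample proportions? {same}")
    for m, f in zip(male, female):
        print(f"  male {float(m):.0%}   female {float(f):.0%}")
```

Identical sample proportions, as in Example 17-1, are the extreme case; real data will rarely match exactly, which is why a formal test is needed.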

In the context of the two examples above, it quickly becomes apparent that if we wanted to formally test the hypothesis that males and females are distributed equally among the four schools, we'd want to test the hypotheses:

\(H_0 : p_{MB} =p_{FB} \text{ and } p_{ME} =p_{FE} \text{ and } p_{ML} =p_{FL} \text{ and } p_{MS} =p_{FS}\)

\(H_A : p_{MB} \ne p_{FB} \text{ or } p_{ME} \ne p_{FE} \text{ or } p_{ML} \ne p_{FL} \text{ or } p_{MS} \ne p_{FS}\)

where:

- \(p_{Mj}\) is the proportion of males accepted into school *j* = *B*, *E*, *L*, or *S*
- \(p_{Fj}\) is the proportion of females accepted into school *j* = *B*, *E*, *L*, or *S*

In conducting such a hypothesis test, we're comparing the proportions of two multinomial distributions. Before we can develop the method for conducting such a hypothesis test, that is, for comparing the proportions of two multinomial distributions, we first need to define some notation.

## Notation

We'll use what I think most statisticians would consider standard notation, namely that:

- The letter *i* will index the *h* **row** categories, and
- The letter *j* will index the *k* **column** categories

(The text reverses the use of the *i* index and the *j* index.) That said, let's use the framework of the previous examples to introduce the notation we'll use. That is, rewrite the tables above using the following generic notation:

Acceptances | Business \(\left(j = 1 \right)\) | Engineering \(\left(j = 2 \right)\) | Liberal Arts \(\left(j = 3 \right)\) | Science \(\left(j = 4 \right)\) | (FIXED) Total |
---|---|---|---|---|---|
Male \(\left(i = 1 \right)\) | \(y_{11} \left(\hat{p}_{11} \right)\) | \(y_{12} \left(\hat{p}_{12} \right)\) | \(y_{13} \left(\hat{p}_{13} \right)\) | \(y_{14} \left(\hat{p}_{14} \right)\) | \(n_{1}=\sum\limits_{j=1}^{k} y_{1 j}\) |
Female \(\left(i = 2 \right)\) | \(y_{21} \left(\hat{p}_{21} \right)\) | \(y_{22} \left(\hat{p}_{22} \right)\) | \(y_{23} \left(\hat{p}_{23} \right)\) | \(y_{24} \left(\hat{p}_{24} \right)\) | \(n_{2}=\sum\limits_{j=1}^{k} y_{2 j}\) |
Total | \(y_{11} + y_{21} \left(\hat{p}_1 \right)\) | \(y_{12} + y_{22} \left(\hat{p}_2 \right)\) | \(y_{13} + y_{23} \left(\hat{p}_3 \right)\) | \(y_{14} + y_{24} \left(\hat{p}_4 \right)\) | \(n_1 + n_2\) |

with:

- \(y_{ij}\) denoting the number falling into the \(j^{th}\) category of the \(i^{th}\) sample
- \(\hat{p}_{ij}=y_{ij}/n_i\) denoting the proportion in the \(i^{th}\) sample falling into the \(j^{th}\) category
- \(n_i=\sum_{j=1}^{k}y_{ij}\) denoting the total number in the \(i^{th}\) sample
- \(\hat{p}_{j}=(y_{1j}+y_{2j})/(n_1+n_2)\) denoting the (overall) proportion falling into the \(j^{th}\) category
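As a concrete instance of this notation, the sketch below computes \(n_i\), \(\hat{p}_{ij}\), and the pooled \(\hat{p}_j\) for the Example 17-2 counts. The variable names are illustrative choices, not part of the lesson.

```python
# Observed counts y[i][j] from Example 17-2: rows are samples (i), columns are categories (j).
y = [
    [240, 480, 120, 360],  # i = 1: males
    [240, 80, 320, 160],   # i = 2: females
]
k = len(y[0])

# n_i: total number in the i-th sample (fixed in advance by the design).
n = [sum(row) for row in y]

# p_hat[i][j] = y_ij / n_i: proportion of the i-th sample falling into category j.
p_hat = [[y_ij / n_i for y_ij in row] for row, n_i in zip(y, n)]

# pooled[j] = (y_1j + y_2j) / (n_1 + n_2): overall proportion falling into category j.
pooled = [sum(row[j] for row in y) / sum(n) for j in range(k)]

print("n_i     :", n)
print("p_hat_1j:", [round(p, 2) for p in p_hat[0]])
print("p_hat_2j:", [round(p, 2) for p in p_hat[1]])
print("p_hat_j :", [round(p, 2) for p in pooled])
```

The pooled proportions reproduce the percentages in the "Total" row of the Example 17-2 table.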

With the notation defined as such, we are now ready to formulate the chi-square test statistic for testing the equality of two multinomial distributions.

## The Chi-Square Test Statistic

### Theorem

The chi-square test statistic for testing the equality of two multinomial distributions:

\(Q=\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{(y_{ij}- n_i\hat{p}_j)^2}{n_i\hat{p}_j}\)

follows an approximate chi-square distribution with *k*−1 degrees of freedom. Reject the null hypothesis of equal proportions if *Q* is large, that is, if:

\(Q \ge \chi_{\alpha, k-1}^{2}\)

### Proof

For the sake of concreteness, let's again use the framework of our example above to derive the chi-square test statistic. For one of the samples, say for the males, we know that:

\(\sum_{j=1}^{k}\frac{(\text{observed }-\text{ expected})^2}{\text{expected}}=\sum_{j=1}^{k}\frac{(y_{1j}- n_1p_{1j})^2}{n_1p_{1j}} \)

follows an approximate chi-square distribution with *k*−1 degrees of freedom. For the other sample, that is, for the females, we know that:

\(\sum_{j=1}^{k}\frac{(\text{observed }-\text{ expected})^2}{\text{expected}}=\sum_{j=1}^{k}\frac{(y_{2j}- n_2p_{2j})^2}{n_2p_{2j}} \)

follows an approximate chi-square distribution with *k*−1 degrees of freedom. Therefore, by the independence of the two samples, we can "add up the chi-squares," that is:

\(\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{(y_{ij}- n_ip_{ij})^2}{n_ip_{ij}}\)

follows an approximate chi-square distribution with *k*−1+ *k*−1 = 2(*k*−1) degrees of freedom.

Oops.... but we have a problem! The \(p_{ij}\)'s are unknown to us. Of course, we know by now that the solution is to estimate the \(p_{ij}\)'s. Now just how to do that? Well, if the null hypothesis is true, the proportions are equal, that is, if:

\(p_{11}=p_{21}, p_{12}=p_{22}, ... , p_{1k}=p_{2k} \)

we would be best served by using all of the data across the sample categories. That is, the best estimate for each \(j^{th}\) category is the pooled estimate:

\(\hat{p}_j=\frac{y_{1j}+y_{2j}}{n_1+n_2}\)

We also know by now that because we are estimating some parameters, we have to adjust the degrees of freedom. The pooled estimates \(\hat{p}_j\) estimate the true unknown proportions \(p_{1j} = p_{2j} = p_j\). Now, if we know the first *k*−1 estimates, that is, if we know:

\(\hat{p}_1, \hat{p}_2, ... , \hat{p}_{k-1}\)

then the \(k^{th}\) one, that is \(\hat{p}_k\), is determined because:

\(\sum_{j=1}^{k}\hat{p}_j=1\)

That is:

\(\hat{p}_k=1-(\hat{p}_1+\hat{p}_2+ ... + \hat{p}_{k-1})\)

So, we are estimating *k*−1 parameters, and therefore we have to subtract *k*−1 from the degrees of freedom. Doing so, we get that

\(Q=\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{(y_{ij}- n_i\hat{p}_j)^2}{n_i\hat{p}_j}\)

follows an approximate chi-square distribution with 2(*k*−1) − (*k*−1) = *k* − 1 degrees of freedom. As was to be proved!
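One informal way to see the degrees-of-freedom bookkeeping at work is a small simulation. The sketch below is an illustration under assumed settings (the category probabilities, sample sizes, number of replications, and seed are all arbitrary choices made here): it repeatedly draws two samples from the same multinomial distribution, computes \(Q\) each time, and compares the average of \(Q\) with \(k-1\), the mean of a chi-square random variable with \(k-1\) degrees of freedom.

```python
import random

random.seed(415)  # arbitrary seed, for reproducibility only

p = [0.25, 0.20, 0.25, 0.30]  # common category probabilities (H0 true), assumed
n1, n2 = 120, 80              # fixed sample sizes, assumed
k = len(p)
reps = 2000


def draw_counts(n, probs):
    """Draw one multinomial sample of size n and return the category counts."""
    counts = [0] * len(probs)
    for j in random.choices(range(len(probs)), weights=probs, k=n):
        counts[j] += 1
    return counts


def q_statistic(samples):
    """Q = sum over i, j of (y_ij - n_i * p_hat_j)^2 / (n_i * p_hat_j)."""
    sizes = [sum(row) for row in samples]
    total = sum(sizes)
    q = 0.0
    for j in range(len(samples[0])):
        p_hat_j = sum(row[j] for row in samples) / total  # pooled estimate
        if p_hat_j == 0:
            continue  # an empty category contributes nothing
        for row, n_i in zip(samples, sizes):
            expected = n_i * p_hat_j
            q += (row[j] - expected) ** 2 / expected
    return q


q_values = [q_statistic([draw_counts(n1, p), draw_counts(n2, p)]) for _ in range(reps)]
mean_q = sum(q_values) / reps
print(f"average Q over {reps} replications: {mean_q:.2f} (k - 1 = {k - 1})")
```

A simulated average near \(k-1\) rather than \(2(k-1)\) is consistent with the adjustment for the \(k-1\) estimated parameters, although a simulation is of course no substitute for the proof.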

## Note

Our only example on this page has involved \(h = 2\) samples. If there are more than two samples, that is, if \(h > 2\), then the definition of the chi-square statistic is appropriately modified. That is:

\(Q=\sum_{i=1}^{h}\sum_{j=1}^{k}\frac{(y_{ij}- n_i\hat{p}_j)^2}{n_i\hat{p}_j}\)

follows an approximate chi-square distribution with \(h(k−1) − (k−1) = (h−1)(k − 1)\) degrees of freedom.
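The general statistic translates directly into code. The following is a minimal sketch, not a reference implementation; the function name `homogeneity_q` and its return format are choices made here. It takes an \(h \times k\) table of observed counts (one row per sample) and returns \(Q\) together with the \((h-1)(k-1)\) degrees of freedom.

```python
def homogeneity_q(table):
    """Chi-square statistic for equality of h multinomial distributions.

    table[i][j] is y_ij, the count in category j of sample i.
    Returns (Q, degrees of freedom).
    """
    h, k = len(table), len(table[0])
    n = [sum(row) for row in table]  # n_i
    grand_total = sum(n)
    # Pooled estimate p_hat_j for each category.
    p_hat = [sum(row[j] for row in table) / grand_total for j in range(k)]

    q = 0.0
    for i in range(h):
        for j in range(k):
            expected = n[i] * p_hat[j]
            q += (table[i][j] - expected) ** 2 / expected
    return q, (h - 1) * (k - 1)


# Examples 17-1 and 17-2 (h = 2 samples, k = 4 schools).
q1, df1 = homogeneity_q([[300, 240, 300, 360], [200, 160, 200, 240]])
q2, df2 = homogeneity_q([[240, 480, 120, 360], [240, 80, 320, 160]])
print(f"Example 17-1: Q = {q1:.3f}, df = {df1}")
print(f"Example 17-2: Q = {q2:.3f}, df = {df2}")
```

For Example 17-1 the observed counts coincide with the expected counts, so \(Q\) is zero up to floating-point rounding. For Example 17-2, \(Q\) is very large relative to a chi-square distribution with 3 degrees of freedom.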

Let's take a look at another example.

## Example 17-3

The head of a surgery department at a university medical center was concerned that surgical residents in training applied unnecessary blood transfusions at a different rate than the more experienced attending physicians. Therefore, he ordered a study of the **49 Attending Physicians** and **71 Residents in Training** with privileges at the hospital. For each of the 120 surgeons, the number of blood transfusions prescribed unnecessarily in a one-year period was recorded. Based on the number recorded, a surgeon was identified as either prescribing unnecessary blood transfusions Frequently, Occasionally, Rarely, or Never. Here's a summary table (or "**contingency table**") of the resulting data:

Physician | Frequently | Occasionally | Rarely | Never | Total |
---|---|---|---|---|---|
Attending | 2 (4.1%) | 3 (6.1%) | 31 (63.3%) | 13 (26.5%) | 49 |
Resident | 15 (21.1%) | 28 (39.4%) | 23 (32.4%) | 5 (7.0%) | 71 |
Total | 17 | 31 | 54 | 18 | 120 |

Are attending physicians and residents in training distributed equally among the various unnecessary blood transfusion categories?

### Answer

We are interested in testing the null hypothesis:

\(H_0 : p_{RF} =p_{AF} \text{ and } p_{RO} =p_{AO} \text{ and } p_{RR} =p_{AR} \text{ and } p_{RN} =p_{AN}\)

against the alternative hypothesis:

\(H_A : p_{RF} \ne p_{AF} \text{ or } p_{RO} \ne p_{AO} \text{ or } p_{RR} \ne p_{AR} \text{ or } p_{RN} \ne p_{AN}\)

The observed data were given to us in the table above. So, the next thing we need to do is find the expected counts for each cell of the table:

Physician | Frequently | Occasionally | Rarely | Never | Total |
---|---|---|---|---|---|
Attending | | | | | 49 |
Resident | | | | | 71 |
Total | 17 | 31 | 54 | 18 | 120 |

It is in the calculation of the expected values that you can readily see why we have (2−1)(4−1) = 3 degrees of freedom in this case. That's because we only have to calculate three of the cells directly.

Physician | Frequently | Occasionally | Rarely | Never | Total |
---|---|---|---|---|---|
Attending | 6.942 | 12.658 | 22.05 | | 49 |
Resident | | | | | 71 |
Total | 17 | 31 | 54 | 18 | 120 |

Once we do that, the remaining five cells can be calculated by way of subtraction:

Physician | Frequently | Occasionally | Rarely | Never | Total |
---|---|---|---|---|---|
Attending | 6.942 | 12.658 | 22.05 | 7.35 | 49 |
Resident | 10.058 | 18.342 | 31.95 | 10.65 | 71 |
Total | 17 | 31 | 54 | 18 | 120 |
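The subtraction argument can be checked numerically. In this sketch (variable names are illustrative), only three expected counts in the Attending row are computed from the formula \(n_i\hat{p}_j\); the other five follow from the fixed row and column totals.

```python
row_totals = {"Attending": 49, "Resident": 71}
col_totals = {"Frequently": 17, "Occasionally": 31, "Rarely": 54, "Never": 18}
n = sum(row_totals.values())  # 120 surgeons in all

# Step 1: compute three cells directly as (row total) * (column total) / n.
expected = {"Attending": {}, "Resident": {}}
for col in ["Frequently", "Occasionally", "Rarely"]:
    expected["Attending"][col] = row_totals["Attending"] * col_totals[col] / n

# Step 2: the last cell in the Attending row is forced by the row total.
expected["Attending"]["Never"] = row_totals["Attending"] - sum(expected["Attending"].values())

# Step 3: every Resident cell is forced by its column total.
for col, total in col_totals.items():
    expected["Resident"][col] = total - expected["Attending"][col]

for row, cells in expected.items():
    print(row, {col: round(value, 3) for col, value in cells.items()})
```

The three freely computed cells are exactly the 3 degrees of freedom; once they are known, the margins determine everything else.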

Now that we have the observed and expected counts, calculating the chi-square statistic is a straightforward exercise:

\(Q=\frac{(2-6.942)^2}{6.942}+ ... +\frac{(5-10.65)^2}{10.65} =31.88 \)

The chi-square test tells us to reject the null hypothesis, at the 0.05 level, if *Q* is greater than a chi-square random variable with 3 degrees of freedom, that is, if \(Q > 7.815\). Because \(Q = 31.88 > 7.815\), we reject the null hypothesis. There is sufficient evidence at the 0.05 level to conclude that the distribution of unnecessary transfusions differs between attending physicians and residents.
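The arithmetic behind \(Q = 31.88\) can be reproduced as follows. This is a sketch with illustrative variable names; the critical value 7.815 is taken from the text rather than computed.

```python
observed = [
    [2, 3, 31, 13],    # attending physicians
    [15, 28, 23, 5],   # residents in training
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all eight cells,
# with expected = (row total) * (column total) / n.
q = 0.0
for i, row in enumerate(observed):
    for j, y in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        q += (y - expected) ** 2 / expected

critical_value = 7.815  # chi-square, alpha = 0.05, 3 degrees of freedom (from the text)
print(f"Q = {q:.2f}; reject H0 at the 0.05 level: {q > critical_value}")
```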


## Using Minitab

If you...

- Enter the data (in the inside of the frequency table only) into the columns of the worksheet
- Select `Stat` >> `Tables` >> `Chi-square test`

then you'll get typical chi-square test output that looks something like this:

Row | Freq | Occ | Rare | Never | Total |
---|---|---|---|---|---|
1 (observed) | 2 | 3 | 31 | 13 | 49 |
1 (expected) | 6.94 | 12.66 | 22.05 | 7.35 | |
2 (observed) | 15 | 28 | 23 | 5 | 71 |
2 (expected) | 10.06 | 18.34 | 31.95 | 10.65 | |
Total | 17 | 31 | 54 | 18 | 120 |

Chi-Sq = 3.518 + 7.369 + 3.633 + 4.343 + 2.428 + 5.086 + 2.507 + 2.997 = 31.881

DF = 3, P-Value = 0.000

# 17.2 - Test for Independence

One of the primary things that distinguishes the test for independence, which we'll be studying on this page, from the test for homogeneity is the way in which the data are collected. So, let's start by addressing the sampling schemes for each of the two situations.

## The Sampling Schemes

For the sake of concreteness, suppose we're interested in comparing the proportions of high school freshmen and high school seniors falling into various driving categories — perhaps, those who don't drive at all, those who drive unsafely, and those who drive safely. We randomly select 100 freshmen and 100 seniors and then observe into which of the three driving categories each student falls:

**Driving Habits (OBSERVED)**

Samples | Category \(j = 1\) | Category \(j = 2\) | \(\cdots\) | Category \(j = k\) | Total |
---|---|---|---|---|---|
Freshmen \(i = 1\) | | | | | \(n_1 = 100\) |
Seniors \(i = 2\) | | | | | \(n_2 = 100\) |
Total | | | | | |

In this case, we are interested in conducting a **test of homogeneity** for testing the null hypothesis:

\(H_0 : p_{F1}=p_{S1} \text{ and } p_{F2}=p_{S2} \text{ and } \cdots \text{ and } p_{Fk}=p_{Sk}\)

against the alternative hypothesis:

\(H_A : p_{F1}\ne p_{S1} \text{ or } p_{F2}\ne p_{S2} \text{ or } \cdots \text{ or } p_{Fk}\ne p_{Sk}\).

For this example, the sampling scheme involves:

- Taking two random (and therefore independent) samples with \(n_1\) and \(n_2\) fixed in advance,
- Observing into which of the *k* categories the freshmen fall, and
- Observing into which of the *k* categories the seniors fall.

Now, let's consider a different example to illustrate an alternative sampling scheme. Suppose 395 people are randomly selected and are "cross-classified" into one of eight cells, depending on which age category they fall into and whether or not they support legalizing marijuana:

**Marijuana Support (Variable A) by Age (Variable B), OBSERVED**

Variable A | (18-24) \(B_1\) | (25-34) \(B_2\) | (35-49) \(B_3\) | (50-64) \(B_4\) | Total |
---|---|---|---|---|---|
(YES) \(A_1\) | 60 | 54 | 46 | 41 | 201 |
(NO) \(A_2\) | 40 | 44 | 53 | 57 | 194 |
Total | 100 | 98 | 99 | 98 | \(n = 395\) |

In this case, we are interested in conducting a **test of independence** for testing the null hypothesis:

\(H_0 \colon\) Variable *A* is independent of variable *B*, that is, \(P(A_i \cap B_j)=P(A_i) \times P(B_j)\) for all *i* and *j*.

against the alternative hypothesis \(H_A \colon\) Variable *A* is not independent of variable *B*.

For this example, the sampling scheme involves:

- Taking one random sample of size *n*, with *n* fixed in advance, and
- Then "cross-classifying" each subject into one and only one of the mutually exclusive and exhaustive \(A_i \cap B_j \) cells.
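The difference between the two sampling schemes can be made concrete with a small simulation. The sketch below is illustrative only: the probabilities, function names, and seed are assumptions made here, not data from the lesson. Under the first scheme each row is its own sample of fixed size; under the second, a single sample is cross-classified.

```python
import random

random.seed(17)  # arbitrary seed, for reproducibility only


def sample_homogeneity(row_sizes, category_probs):
    """Scheme 1: one independent sample per row, each with its size fixed in advance."""
    table = []
    for n_i in row_sizes:
        counts = [0] * len(category_probs)
        for j in random.choices(range(len(category_probs)), weights=category_probs, k=n_i):
            counts[j] += 1
        table.append(counts)
    return table


def sample_independence(n, cell_probs):
    """Scheme 2: one sample of size n, each subject cross-classified into a single cell."""
    h, k = len(cell_probs), len(cell_probs[0])
    cells = [(i, j) for i in range(h) for j in range(k)]
    weights = [cell_probs[i][j] for i, j in cells]
    table = [[0] * k for _ in range(h)]
    for i, j in random.choices(cells, weights=weights, k=n):
        table[i][j] += 1
    return table


# Hypothetical probabilities for three driving categories and a 2 x 3 cross-classification.
scheme1 = sample_homogeneity([100, 100], [0.5, 0.2, 0.3])
scheme2 = sample_independence(200, [[0.25, 0.10, 0.15], [0.25, 0.10, 0.15]])

print("scheme 1 row totals:", [sum(row) for row in scheme1])  # always 100 and 100
print("scheme 2 row totals:", [sum(row) for row in scheme2])  # random, but they sum to 200
```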

Note that, in this case, both the row totals and column totals are random... it is only the total number *n* sampled that is fixed in advance. It is this sampling scheme and the resulting test for independence that will be the focus of our attention on this page. Now, let's jump right to the punch line.

## The Punch Line

The same chi-square test works! It doesn't matter how the sampling was done. But, it's traditional to still think of the two tests, the one for homogeneity and the one for independence, in different lights.

Just as we did before, let's start with clearly defining the notation we will use.

## Notation

Suppose we have *k* (column) levels of Variable B indexed by the letter *j*, and *h* (row) levels of Variable A indexed by the letter *i*. Then, we can summarize the data and probability model in tabular format, as follows:

Variable A (rows), Variable B (columns) | \(B_1 \left(j = 1\right)\) | \(B_2 \left(j = 2\right)\) | \(B_3 \left(j = 3\right)\) | \(B_4 \left(j = 4\right)\) | Total |
---|---|---|---|---|---|
\(A_1 \left(i = 1\right)\) | \(Y_{11} \left(p_{11}\right)\) | \(Y_{12} \left(p_{12}\right)\) | \(Y_{13} \left(p_{13}\right)\) | \(Y_{14} \left(p_{14}\right)\) | \(\left(p_{1.}\right)\) |
\(A_2 \left(i = 2\right)\) | \(Y_{21} \left(p_{21}\right)\) | \(Y_{22} \left(p_{22}\right)\) | \(Y_{23} \left(p_{23}\right)\) | \(Y_{24} \left(p_{24}\right)\) | \(\left(p_{2.}\right)\) |
Total | \(\left(p_{.1}\right)\) | \(\left(p_{.2}\right)\) | \(\left(p_{.3}\right)\) | \(\left(p_{.4}\right)\) | \(n\) |

where:

- \(Y_{ij}\) denotes the frequency of event \(A_i \cap B_j \)
- The probability that a randomly selected observation falls into the cell defined by \(A_i \cap B_j \) is \(p_{ij}=P(A_i \cap B_j)\) and is estimated by \(Y_{ij}/n\)
- The probability that a randomly selected observation falls into a row defined by \(A_i\) is \(p_{i.}=P(A_i)=\sum_{j=1}^{k}p_{ij}\) ("**dot notation**") and is estimated by \(Y_{i.}/n\), where \(Y_{i.}=\sum_{j=1}^{k}Y_{ij}\)
- The probability that a randomly selected observation falls into a column defined by \(B_j\) is \(p_{.j}=P(B_j)=\sum_{i=1}^{h}p_{ij}\) ("**dot notation**") and is estimated by \(Y_{.j}/n\), where \(Y_{.j}=\sum_{i=1}^{h}Y_{ij}\)
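To make the dot notation concrete, the sketch below computes the marginal totals \(y_{i.}\) and \(y_{.j}\), and the corresponding estimated marginal probabilities, for the marijuana-support table above. Variable names are illustrative.

```python
# Observed counts Y_ij: rows are A_1 (yes), A_2 (no); columns are age groups B_1..B_4.
y = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]
n = sum(sum(row) for row in y)

# A dot in a subscript means "summed over that index".
y_row = [sum(row) for row in y]        # y_i. : sum over j
y_col = [sum(col) for col in zip(*y)]  # y_.j : sum over i

p_row = [total / n for total in y_row]  # estimates of p_i. = P(A_i)
p_col = [total / n for total in y_col]  # estimates of p_.j = P(B_j)

print("y_i. =", y_row, " y_.j =", y_col, " n =", n)
print("estimated P(A_i):", [round(p, 3) for p in p_row])
print("estimated P(B_j):", [round(p, 3) for p in p_col])
```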

With the notation defined as such, we are now ready to formulate the chi-square test statistic for testing the independence of two categorical variables.

## The Chi-Square Test Statistic

### Theorem

The chi-square test statistic:

\(Q=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(y_{ij}-\frac{y_{i.}y_{.j}}{n})^2}{\frac{y_{i.}y_{.j}}{n}} \)

for testing the independence of two categorical variables, one with *h* levels and the other with *k* levels, follows an approximate chi-square distribution with (*h*−1)(*k*−1) degrees of freedom.

### Proof

We should be getting to be pros at deriving these chi-square tests. We'll do the proof in four steps.

**Step 1** We can think of the \(h \times k\) cells as arising from a multinomial distribution with \(h \times k\) categories. Then, in that case, as long as *n* is large, we know that:

\(Q_{kh-1}=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(\text{observed }-\text{ expected})^2}{\text{ expected}} =\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(y_{ij}-np_{ij})^2}{np_{ij}}\)

follows an approximate chi-square distribution with \(kh−1\) degrees of freedom.

**Step 2** But the chi-square statistic, as defined in the first step, depends on some unknown parameters \(p_{ij}\). So, we'll estimate the \(p_{ij}\) assuming that the null hypothesis is true, that is, assuming independence:

\(p_{ij}=P(A_i \cap B_j)=P(A_i) \times P(B_j)=p_{i.}p_{.j} \)

Under the assumption of independence, it is therefore reasonable to estimate the \(p_{ij}\) with:

\(\hat{p}_{ij}=\hat{p}_{i.}\hat{p}_{.j}=\left(\frac{\sum_{j=1}^{k}y_{ij}}{n}\right) \left(\frac{\sum_{i=1}^{h}y_{ij}}{n}\right)=\frac{y_{i.}y_{.j}}{n^2}\)

**Step 3** Now, we have to determine how many parameters we estimated in the second step. Well, the fact that the row probabilities add to 1:

\(\sum_{i=1}^{h}p_{i.}=1 \)

implies that we've estimated \(h−1\) row parameters. And, the fact that the column probabilities add to 1:

\(\sum_{j=1}^{k}p_{.j}=1 \)

implies that we've estimated \(k−1\) column parameters. Therefore, we've estimated a total of \(h−1 + k − 1 = h + k − 2\) parameters.

**Step 4** Because we estimated \(h + k - 2\) parameters, we have to adjust the test statistic and degrees of freedom accordingly. Doing so, we get that:

\(Q=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{\left(y_{ij}-n\left(\frac{y_{i.}y_{.j}}{n^2}\right) \right)^2}{n\left(\frac{y_{i.}y_{.j}}{n^2}\right)} =\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{\left(y_{ij}-\frac{y_{i.}y_{.j}}{n} \right)^2}{\left(\frac{y_{i.}y_{.j}}{n}\right)} \)

follows an approximate chi-square distribution with \((kh - 1) - (h + k - 2)\) degrees of freedom, that is, upon simplification, \((h - 1)(k - 1)\) degrees of freedom.
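The simplification is a short piece of algebra, spelled out here for completeness:

\((kh-1)-(h+k-2)=kh-h-k+1=h(k-1)-(k-1)=(h-1)(k-1)\)

For instance, with \(h = 2\) and \(k = 4\), this gives \((8-1)-(2+4-2)=3=(2-1)(4-1)\).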

By the way, I think I might have mumbled something up above about the equivalence of the chi-square statistic for homogeneity and the chi-square statistic for independence. In order to prove that the two statistics are indeed equivalent, we just have to show, for example, in the case when \(h = 2\), that:

\(\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{\left(y_{ij}-n_i\left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right) \right)^2}{n_i\left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right)} =\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{\left(y_{ij}-\frac{y_{i.}y_{.j}}{n} \right)^2}{\left(\frac{y_{i.}y_{.j}}{n}\right)} \)

Errrrrrr. That probably looks like a scarier proposition than it is, as showing that the above is true amounts to showing that:

\(n_i\left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right)=\frac{y_{i.}y_{.j}}{n} \)

Well, rewriting the left side a bit using dot notation, we get:

\(n_i\left(\frac{y_{.j}}{n}\right)=\frac{y_{i.}y_{.j}}{n} \)

and doing some algebraic simplification, we get:

\(n_i= y_{i.}\)

which certainly holds true, as \(n_i\) and \(y_{i.}\) mean the same thing, that is, the number of experimental units in the \(i^{th}\) row.
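The equivalence can also be checked numerically. The sketch below (function names are illustrative) computes the statistic both ways for the Example 17-3 counts: once with pooled proportions, as in the test for homogeneity, and once with the product of marginal totals, as in the test for independence.

```python
def q_homogeneity(table):
    """Expected count written as n_i * p_hat_j, with p_hat_j the pooled proportion."""
    n = [sum(row) for row in table]
    total = sum(n)
    q = 0.0
    for i, row in enumerate(table):
        for j, y in enumerate(row):
            p_hat_j = sum(r[j] for r in table) / total
            expected = n[i] * p_hat_j
            q += (y - expected) ** 2 / expected
    return q


def q_independence(table):
    """Expected count written as y_i. * y_.j / n."""
    y_row = [sum(row) for row in table]
    y_col = [sum(col) for col in zip(*table)]
    n = sum(y_row)
    q = 0.0
    for i, row in enumerate(table):
        for j, y in enumerate(row):
            expected = y_row[i] * y_col[j] / n
            q += (y - expected) ** 2 / expected
    return q


table = [[2, 3, 31, 13], [15, 28, 23, 5]]  # Example 17-3
print(f"homogeneity form:  {q_homogeneity(table):.6f}")
print(f"independence form: {q_independence(table):.6f}")
```

The two printed values agree up to floating-point rounding, because the two expressions for the expected count are algebraically the same quantity.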

## Example 17-4

Is age independent of the desire to ride a bicycle? A random sample of 395 people was surveyed. Each person was asked about their interest in riding a bicycle (Variable A) and their age (Variable B). The data that resulted from the survey are summarized in the following table:

**Bicycle Riding Interest (Variable A) by Age (Variable B), OBSERVED**

Variable A | (18-24) | (25-34) | (35-49) | (50-64) | Total |
---|---|---|---|---|---|
YES | 60 | 54 | 46 | 41 | 201 |
NO | 40 | 44 | 53 | 57 | 194 |
Total | 100 | 98 | 99 | 98 | 395 |

Is there evidence to conclude, at the 0.05 level, that the desire to ride a bicycle depends on age?

### Answer

Here's the table of expected counts:

**Bicycle Riding Interest (Variable A) by Age (Variable B), EXPECTED**

Variable A | (18-24) | (25-34) | (35-49) | (50-64) | Total |
---|---|---|---|---|---|
YES | 50.886 | 49.868 | 50.377 | 49.868 | 201 |
NO | 49.114 | 48.132 | 48.623 | 48.132 | 194 |
Total | 100 | 98 | 99 | 98 | 395 |

which results in a chi-square statistic of 8.006:

\(Q=\frac{(60-50.886)^2}{50.886}+ ... +\frac{(57-48.132)^2}{48.132}=8.006 \)

The chi-square test tells us to reject the null hypothesis, at the 0.05 level, if *Q* is greater than a chi-square random variable with 3 degrees of freedom, that is, if *Q* > 7.815. Because *Q* = 8.006 > 7.815, we reject the null hypothesis. There is sufficient evidence at the 0.05 level to conclude that the desire to ride a bicycle depends on age.
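The calculation can be reproduced with the Python standard library alone. In this sketch, `chi2_sf` is a helper written here (not a library function) that evaluates the upper-tail probability of a chi-square distribution with integer degrees of freedom through the standard recurrence for the regularized upper incomplete gamma function; the other names are illustrative as well.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for a chi-square random variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)             # Q(1, z) = exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z) = erfc(sqrt(z))
    # Recurrence: Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail


observed = [
    [60, 54, 46, 41],  # YES
    [40, 44, 53, 57],  # NO
]
y_row = [sum(row) for row in observed]
y_col = [sum(col) for col in zip(*observed)]
n = sum(y_row)

q = 0.0
for i, row in enumerate(observed):
    for j, y in enumerate(row):
        expected = y_row[i] * y_col[j] / n  # y_i. * y_.j / n
        q += (y - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi2_sf(q, df)
print(f"Q = {q:.3f}, df = {df}, p-value = {p_value:.3f}")
```

The p-value comes out a little under 0.05, consistent with \(Q\) only slightly exceeding the critical value 7.815.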

## Using Minitab

If you...

- Enter the data (in the inside of the observed frequency table only) into the columns of the worksheet
- Select `Stat` >> `Tables` >> `Chi-square test`

then Minitab will display typical chi-square test output that looks something like this:

Row | 18-24 | 25-34 | 35-49 | 50-64 | Total |
---|---|---|---|---|---|
1 (observed) | 60 | 54 | 46 | 41 | 201 |
1 (expected) | 50.89 | 49.87 | 50.38 | 49.87 | |
2 (observed) | 40 | 44 | 53 | 57 | 194 |
2 (expected) | 49.11 | 48.13 | 48.62 | 48.13 | |
Total | 100 | 98 | 99 | 98 | 395 |

Chi-Sq = 1.632 + 0.342 + 0.380 + 1.577 + 1.691 + 0.355 + 0.394 + 1.634 = 8.006

DF = 3, P-Value = 0.046