17.2 - Test for Independence

One of the primary things that distinguishes the test for independence, which we'll be studying on this page, from the test for homogeneity is the way in which the data are collected. So, let's start by addressing the sampling schemes for each of the two situations.

The Sampling Schemes

For the sake of concreteness, suppose we're interested in comparing the proportions of high school freshmen and high school seniors falling into various driving categories — perhaps, those who don't drive at all, those who drive unsafely, and those who drive safely. We randomly select 100 freshmen and 100 seniors and then observe into which of the three driving categories each student falls:

Driving Habits		Categories
Samples	OBSERVED	\( j = 1\)	\(j = 2\)	\(\cdots\)	\(j = k\)	Total
	Freshmen \(i = 1\)					\(n_1 = 100\)
	Seniors \(i = 2\)					\(n_2 = 100\)
	Total

In this case, we are interested in conducting a test of homogeneity for testing the null hypothesis:

\(H_0 : p_{F1}=p_{S1} \text{ and }p_{F2}=p_{S2} \text{ and } \cdots \text{ and } p_{Fk}=p_{Sk}\)

against the alternative hypothesis:

\(H_A : p_{F1}\ne p_{S1} \text{ or }p_{F2}\ne p_{S2} \text{ or } \cdots \text{ or } p_{Fk}\ne p_{Sk}\).

For this example, the sampling scheme involves:

Taking two random (and therefore independent) samples with n₁ and n₂ fixed in advance,
Observing into which of the k categories the freshmen fall, and
Observing into which of the k categories the seniors fall.

Now, let's consider a different example to illustrate an alternative sampling scheme. Suppose 395 people are randomly selected, and each is "cross-classified" into one of eight cells, depending on the age category into which they fall and whether or not they support legalizing marijuana:

Marijuana Support		Variable B (Age)
Variable A	OBSERVED	(18-24) \(B_1\)	(25-34) \(B_2\)	(35-49) \(B_3\)	(50-64) \(B_4\)	Total
	(YES) \(A_1\)	60	54	46	41	201
	(NO) \(A_2\)	40	44	53	57	194
	Total	100	98	99	98	\(n = 395\)

In this case, we are interested in conducting a test of independence for testing the null hypothesis:

\(H_0 \colon\) Variable A is independent of variable B, that is, \(P(A_i \cap B_j)=P(A_i) \times P(B_j)\) for all i and j.

against the alternative hypothesis \(H_A \colon\) Variable A is not independent of variable B.

For this example, the sampling scheme involves:

Taking one random sample of size n, with n fixed in advance, and
Then "cross-classifying" each subject into one and only one of the mutually exclusive and exhaustive \(A_i \cap B_j \) cells.
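
The difference between the two schemes can be sketched with a small simulation. This is only an illustration: the category probabilities, the equally likely cells, and the seed are all made-up assumptions, not values from either study.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Scheme 1 (homogeneity): one sample per group, group sizes fixed in advance.
# The category probabilities below are hypothetical.
driving = ["none", "unsafe", "safe"]
group_probs = {"freshmen": [0.6, 0.1, 0.3], "seniors": [0.2, 0.3, 0.5]}
group_sizes = {"freshmen": 100, "seniors": 100}
scheme1 = {
    group: Counter(rng.choices(driving, weights=group_probs[group], k=group_sizes[group]))
    for group in group_probs
}

# Scheme 2 (independence): one sample of size n, each subject cross-classified.
# The eight cells are taken to be equally likely, purely for illustration.
cells = [(a, b) for a in ("yes", "no") for b in ("18-24", "25-34", "35-49", "50-64")]
n = 395
scheme2 = Counter(rng.choices(cells, k=n))

row_totals_1 = {g: sum(c.values()) for g, c in scheme1.items()}
row_totals_2 = {
    a: sum(count for (row, _), count in scheme2.items() if row == a)
    for a in ("yes", "no")
}
print(row_totals_1)  # 100 and 100 every time, by design
print(row_totals_2)  # adds up to 395, but the split changes from sample to sample
```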

Note that, in this case, both the row totals and column totals are random... it is only the total number n sampled that is fixed in advance. It is this sampling scheme and the resulting test for independence that will be the focus of our attention on this page. Now, let's jump right to the punch line.

The Punch Line

The same chi-square test works! It doesn't matter how the sampling was done. But, it's traditional to still think of the two tests, the one for homogeneity and the one for independence, in different lights.

Just as we did before, let's start with clearly defining the notation we will use.

Notation

Suppose we have k (column) levels of Variable B indexed by the letter j, and h (row) levels of Variable A indexed by the letter i. Then, we can summarize the data and probability model in tabular format, as follows:

Variable B
Variable A	\(B_1 \left(j = 1\right)\)	\(B_2 \left(j = 2\right)\)	\(B_3 \left(j = 3\right)\)	\(B_4 \left(j = 4\right)\)	Total
\(A_1 \left(i = 1\right)\)	\(Y_{11} \left(p_{11}\right)\)	\(Y_{12} \left(p_{12}\right)\)	\(Y_{13} \left(p_{13}\right)\)	\(Y_{14} \left(p_{14}\right)\)	\(\left(p_{1.}\right)\)
\(A_2 \left(i = 2\right)\)	\(Y_{21} \left(p_{21}\right)\)	\(Y_{22} \left(p_{22}\right)\)	\(Y_{23} \left(p_{23}\right)\)	\(Y_{24} \left(p_{24}\right)\)	\(\left(p_{2.}\right)\)
Total	\(\left(p_{.1}\right)\)	\(\left(p_{.2}\right)\)	\(\left(p_{.3}\right)\)	\(\left(p_{.4}\right)\)	\(n\)

where:

\(Y_{ij}\) denotes the frequency of event \(A_i \cap B_j \)
The probability that a randomly selected observation falls into the cell defined by \(A_i \cap B_j \) is \(p_{ij}=P(A_i \cap B_j)\) and is estimated by \(Y_{ij}/n\)
The probability that a randomly selected observation falls into a row defined by \(A_i\) is \(p_{i.}=P(A_i )=\sum_{j=1}^{k}p_{ij}\) ("dot notation") and is estimated by \(Y_{i.}/n\), where \(Y_{i.}=\sum_{j=1}^{k}Y_{ij}\)
The probability that a randomly selected observation falls into a column defined by \(B_j\) is \(p_{.j}=P(B_j)=\sum_{i=1}^{h}p_{ij}\) ("dot notation") and is estimated by \(Y_{.j}/n\), where \(Y_{.j}=\sum_{i=1}^{h}Y_{ij}\)
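
To make the dot notation concrete, here is a minimal sketch that computes the cell, row, and column estimates for the cross-classified table shown earlier. The variable names (`p_hat`, `p_row`, `p_col`) are our own choices for illustration.

```python
# Observed counts Y_ij: rows are A_1 (YES) and A_2 (NO), columns are B_1..B_4.
observed = [[60, 54, 46, 41],
            [40, 44, 53, 57]]

n = sum(sum(row) for row in observed)

# Cell estimates: Y_ij / n
p_hat = [[y / n for y in row] for row in observed]

# Row totals Y_i. and column totals Y_.j ("dot" = summed over that index)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Marginal estimates: Y_i. / n and Y_.j / n
p_row = [total / n for total in row_totals]
p_col = [total / n for total in col_totals]

print(n, row_totals, col_totals)
```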

With the notation defined as such, we are now ready to formulate the chi-square test statistic for testing the independence of two categorical variables.

The Chi-Square Test Statistic

Theorem

The chi-square test statistic:

\(Q=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(y_{ij}-\frac{y_{i.}y_{.j}}{n})^2}{\frac{y_{i.}y_{.j}}{n}} \)

for testing the independence of two categorical variables, one with h levels and the other with k levels, follows an approximate chi-square distribution with (h−1)(k−1) degrees of freedom.
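
The statistic in the theorem can be sketched in a few lines of plain Python. The function name `chi_square_independence` is our own, and the data are the counts from the cross-classified table shown earlier; treat this as an illustration of the formula rather than a replacement for statistical software.

```python
def chi_square_independence(observed):
    """Return (Q, degrees of freedom) for an h x k table of observed counts."""
    row_totals = [sum(row) for row in observed]        # y_i.
    col_totals = [sum(col) for col in zip(*observed)]  # y_.j
    n = sum(row_totals)

    q = 0.0
    for i, row in enumerate(observed):
        for j, y_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # y_i. * y_.j / n
            q += (y_ij - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)    # (h - 1)(k - 1)
    return q, df


observed = [[60, 54, 46, 41],
            [40, 44, 53, 57]]
q, df = chi_square_independence(observed)
print(round(q, 3), df)  # 8.006 3
```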

Proof

We should be getting to be pros at deriving these chi-square tests. We'll do the proof in four steps.

Step 1
We can think of the \(h \times k\) cells as arising from a multinomial distribution with \(h \times k\) categories. Then, in that case, as long as n is large, we know that:

\(Q_{kh-1}=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(\text{observed }-\text{ expected})^2}{\text{ expected}} =\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(y_{ij}-np_{ij})^2}{np_{ij}}\)

follows an approximate chi-square distribution with \(kh−1\) degrees of freedom.
Step 2
But the chi-square statistic, as defined in the first step, depends on some unknown parameters \(p_{ij}\). So, we'll estimate the \(p_{ij}\) assuming that the null hypothesis is true, that is, assuming independence:

\(p_{ij}=P(A_i \cap B_j)=P(A_i) \times P(B_j)=p_{i.}p_{.j} \)

Under the assumption of independence, it is therefore reasonable to estimate the \(p_{ij}\) with:

\(\hat{p}_{ij}=\hat{p}_{i.}\hat{p}_{.j}=\left(\frac{\sum_{j=1}^{k}y_{ij}}{n}\right) \left(\frac{\sum_{i=1}^{h}y_{ij}}{n}\right)=\frac{y_{i.}y_{.j}}{n^2}\)
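
As a quick illustration, take the YES / 18-24 cell of the cross-classified table shown earlier, for which \(y_{1.}=201\), \(y_{.1}=100\), and \(n=395\):

\(\hat{p}_{11}=\frac{y_{1.}y_{.1}}{n^2}=\frac{(201)(100)}{395^2}\approx 0.1288\)

so the estimated expected count for that cell is \(n\hat{p}_{11}=\frac{(201)(100)}{395}\approx 50.886\).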
Step 3
Now, we have to determine how many parameters we estimated in the second step. Well, the fact that the row probabilities add to 1:

\(\sum_{i=1}^{h}p_{i.}=1 \)

implies that we've estimated \(h−1\) row parameters. And, the fact that the column probabilities add to 1:

\(\sum_{j=1}^{k}p_{.j}=1 \)

implies that we've estimated \(k−1\) column parameters. Therefore, we've estimated a total of \(h−1 + k − 1 = h + k − 2\) parameters.
Step 4
Because we estimated \(h + k − 2\) parameters, we have to adjust the test statistic and degrees of freedom accordingly. Doing so, we get that:

\(Q=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{\left(y_{ij}-n\left(\frac{y_{i.}y_{.j}}{n^2}\right) \right)^2}{n\left(\frac{y_{i.}y_{.j}}{n^2}\right)} =\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{\left(y_{ij}-\frac{y_{i.}y_{.j}}{n} \right)^2}{\left(\frac{y_{i.}y_{.j}}{n}\right)} \)

follows an approximate chi-square distribution with \((kh − 1)− ( h + k − 2)\) degrees of freedom, that is, upon simplification, \((h − 1)(k − 1)\) degrees of freedom.
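
In case the simplification isn't obvious, it can be written out as:

\((kh-1)-(h+k-2)=kh-h-k+1=h(k-1)-(k-1)=(h-1)(k-1)\)

For instance, for a table with \(h = 2\) rows and \(k = 4\) columns, \((8-1)-(2+4-2)=3=(2-1)(4-1)\).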

By the way, I think I might have mumbled something up above about the equivalence of the chi-square statistic for homogeneity and the chi-square statistic for independence. In order to prove that the two statistics are indeed equivalent, we just have to show, for example, in the case when \(h = 2\), that:

\(\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{\left(y_{ij}-n_i\left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right) \right)^2}{n_i\left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right)} =\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{\left(y_{ij}-\frac{y_{i.}y_{.j}}{n} \right)^2}{\left(\frac{y_{i.}y_{.j}}{n}\right)} \)

Errrrrrr. That probably looks like a scarier proposition than it is, as showing that the above is true amounts to showing that:

\(n_i \left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right)=\left(\frac{y_{i.}y_{.j}}{n}\right) \)

Well, rewriting the left-side a bit using dot notation, we get:

\(n_i \left(\frac{y_{.j}}{n}\right)=\left(\frac{y_{i.}y_{.j}}{n}\right) \)

and doing some algebraic simplification, we get:

\(n_i= y_{i.}\)

which certainly holds true, as \(n_i\) and \(y_{i.}\) mean the same thing, that is, the number of experimental units in the \(i^{th}\) row.
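
For the skeptical, here is a small numerical check of that equivalence, using the 2 × 4 table of counts shown earlier. It only confirms that the two expected-count formulas agree cell by cell for this one table; the variable names are our own.

```python
import math

observed = [[60, 54, 46, 41],
            [40, 44, 53, 57]]

n_1, n_2 = (sum(row) for row in observed)          # n_i, the two row sizes
col_totals = [sum(col) for col in zip(*observed)]  # y_.j
n = n_1 + n_2

pairs = []
for i, n_i in enumerate((n_1, n_2)):
    y_i_dot = sum(observed[i])  # y_i., which is the same number as n_i
    for j in range(len(col_totals)):
        pooled = (observed[0][j] + observed[1][j]) / (n_1 + n_2)
        expected_homogeneity = n_i * pooled
        expected_independence = y_i_dot * col_totals[j] / n
        pairs.append((expected_homogeneity, expected_independence))

all_equal = all(math.isclose(a, b) for a, b in pairs)
print(all_equal)  # True
```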

Example 17-4

Is age independent of the desire to ride a bicycle? A random sample of 395 people was surveyed. Each person was asked about their interest in riding a bicycle (Variable A) and their age (Variable B). The data that resulted from the survey are summarized in the following table:

Bicycle Riding Interest		Variable B (Age)
Variable A	OBSERVED	(18-24)	(25-34)	(35-49)	(50-64)	Total
	YES	60	54	46	41	201
	NO	40	44	53	57	194
	Total	100	98	99	98	395

Is there evidence to conclude, at the 0.05 level, that the desire to ride a bicycle depends on age?

Answer

Here's the table of expected counts:

Bicycle Riding Interest		Variable B (Age)
Variable A	EXPECTED	18-24	25-34	35-49	50-64	Total
	YES	50.886	49.868	50.377	49.868	201
	NO	49.114	48.132	48.623	48.132	194
	Total	100	98	99	98	395

which results in a chi-square statistic of 8.006:

\(Q=\frac{(60-50.886)^2}{50.886}+ ... +\frac{(57-48.132)^2}{48.132}=8.006 \)

The chi-square test tells us to reject the null hypothesis, at the 0.05 level, if Q is greater than the upper 0.05 critical value of a chi-square distribution with (2−1)(4−1) = 3 degrees of freedom, that is, if Q > 7.815. Because Q = 8.006 > 7.815, we reject the null hypothesis. There is sufficient evidence at the 0.05 level to conclude that the desire to ride a bicycle depends on age.
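
As a cross-check on the decision, here is a minimal sketch that computes the P-value for Q = 8.006. It relies on a closed-form upper-tail formula that holds only for 3 degrees of freedom, so the helper `chi2_sf_df3` (our own name) should not be reused for other tables without changing the formula.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

q = 8.006          # test statistic from the example
critical = 7.815   # upper 0.05 critical value quoted in the text
p_value = chi2_sf_df3(q)
reject = q > critical
print(round(p_value, 3), reject)  # 0.046 True
```

The P-value falls just below 0.05, which agrees with Q only slightly exceeding the critical value.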

Using Minitab

If you...

Enter the data (in the inside of the observed frequency table only) into the columns of the worksheet
Select Stat >> Tables >> Chi-square test

then Minitab will display typical chi-square test output that looks something like this:

Chi-Square Test: 18-24, 25-34, 35-49, 50-64
Expected counts are printed below observed counts
	18-24	25-34	35-49	50-64	Total
1	60 50.89	54 49.87	46 50.38	41 49.87	201
2	40 49.11	44 48.13	53 48.62	57 48.13	194
Total	100	98	99	98	395

Chi-Sq = 1.632 + 0.342 + 0.380 + 1.577 +
         1.691 + 0.355 + 0.394 + 1.634 = 8.006

DF = 3, P-Value = 0.046
