17.2 - Test for Independence

One of the primary things that distinguishes the test for independence, that we'll be studying on this page, from the test for homogeneity is the way in which the data are collected. So, let's start by addressing the sampling schemes for each of the two situations.

The Sampling Schemes Section

For the sake of concreteness, suppose we're interested in comparing the proportions of high school freshmen and high school seniors falling into various driving categories — perhaps, those who don't drive at all, those who drive unsafely, and those who drive safely. We randomly select 100 freshmen and 100 seniors and then observe into which of the three driving categories each student falls:

Driving Habits Categories
Samples OBSERVED \( j = 1\) \(j = 2\) \(\cdots\) \(j = k\) Total
Freshmen \(i = 1\)         \(n_1 = 100\)
Seniors \(i = 2\)         \(n_2 = 100\)
Total          

In this case, we are interested in conducting a test of homogeneity for testing the null hypothesis:

\(H_0 : p_{F1}=p_{S1} \text{ and } p_{F2}=p_{S2} \text{ and } \cdots \text{ and } p_{Fk}=p_{Sk}\)

against the alternative hypothesis:

\(H_A : p_{F1}\ne p_{S1} \text{ or } p_{F2}\ne p_{S2} \text{ or } \cdots \text{ or } p_{Fk}\ne p_{Sk}\).

For this example, the sampling scheme involves:

  1. Taking two random (and therefore independent) samples with \(n_1\) and \(n_2\) fixed in advance,
  2. Observing into which of the k categories the freshmen fall, and
  3. Observing into which of the k categories the seniors fall.

Now, let's consider a different example to illustrate an alternative sampling scheme. Suppose 395 people are randomly selected and "cross-classified" into one of eight cells, depending on which age category they fall into and whether or not they support legalizing marijuana:

Marijuana Support Variable B (Age)
Variable A OBSERVED (18-24) \(B_1\) (25-34) \(B_2\) (35-49) \(B_3\) (50-64) \(B_4\) Total
(YES) \(A_1\) 60 54 46 41 201
(NO) \(A_2\) 40 44 53 57 194
Total 100 98 99 98 \(n = 395\)

In this case, we are interested in conducting a test of independence for testing the null hypothesis:

\(H_0 \colon\) Variable A is independent of variable B, that is, \(P(A_i \cap B_j)=P(A_i) \times P(B_j)\) for all i and j.

against the alternative hypothesis \(H_A \colon\) Variable A is not independent of variable B.

For this example, the sampling scheme involves:

  1. Taking one random sample of size n, with n fixed in advance, and
  2. Then "cross-classifying" each subject into one and only one of the mutually exclusive and exhaustive \(A_i \cap B_j \) cells.

Note that, in this case, both the row totals and column totals are random... it is only the total number n sampled that is fixed in advance. It is this sampling scheme and the resulting test for independence that will be the focus of our attention on this page. Now, let's jump right to the punch line.
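
If you'd like to see the difference between the two sampling schemes in action, here's a minimal simulation sketch. (Python, NumPy, and the particular cell probabilities are my own illustrative assumptions here; they aren't part of the examples above.)

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical cell probabilities for a 2 x 4 cross-classification
# (rows A1, A2; columns B1..B4). Illustrative values only.
p = np.array([[0.15, 0.14, 0.12, 0.10],
              [0.10, 0.11, 0.13, 0.15]])

# Homogeneity scheme: two independent samples with n1 and n2 fixed in
# advance; each row is drawn from its own multinomial over the k columns.
n1, n2 = 100, 100
row_probs = p / p.sum(axis=1, keepdims=True)
freshmen = rng.multinomial(n1, row_probs[0])
seniors = rng.multinomial(n2, row_probs[1])
print("homogeneity rows:", freshmen, seniors)      # row totals are fixed

# Independence scheme: one sample of size n, fixed in advance; every
# subject is cross-classified into one of the h*k = 8 cells, so the row
# totals and the column totals are both random.
n = 395
table = rng.multinomial(n, p.ravel()).reshape(2, 4)
print("cross-classified table:\n", table)
print("row totals (random):   ", table.sum(axis=1))
print("column totals (random):", table.sum(axis=0))
```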

The Punch Line Section


The same chi-square test works! It doesn't matter how the sampling was done. But, it's traditional to still think of the two tests, the one for homogeneity and the one for independence, in different lights.

Just as we did before, let's start with clearly defining the notation we will use.

Notation Section

Suppose we have k (column) levels of Variable B indexed by the letter j, and h (row) levels of Variable A indexed by the letter i. Then, we can summarize the data and probability model in tabular format, as follows:

Variable B
Variable A \(B_1 \left(j = 1\right)\) \(B_2 \left(j = 2\right)\) \(B_3 \left(j = 3\right)\) \(B_4 \left(j = 4\right)\) Total
\(A_1 \left(i = 1\right)\) \(Y_{11} \left(p_{11}\right)\) \(Y_{12} \left(p_{12}\right)\) \(Y_{13} \left(p_{13}\right)\) \(Y_{14} \left(p_{14}\right)\) \(\left(p_{1.}\right)\)
\(A_2 \left(i = 2\right)\) \(Y_{21} \left(p_{21}\right)\) \(Y_{22} \left(p_{22}\right)\) \(Y_{23} \left(p_{23}\right)\) \(Y_{24} \left(p_{24}\right)\) \(\left(p_{2.}\right)\)
Total \(\left(p_{.1}\right)\) \(\left(p_{.2}\right)\) \(\left(p_{.3}\right)\) \(\left(p_{.4}\right)\) \(n\)

where:

  1. \(Y_{ij}\) denotes the frequency of event \(A_i \cap B_j \)
  2. The probability that a randomly selected observation falls into the cell defined by \(A_i \cap B_j \) is \(p_{ij}=P(A_i \cap B_j)\) and is estimated by \(Y_{ij}/n\)
  3. The probability that a randomly selected observation falls into the row defined by \(A_i\) is \(p_{i.}=P(A_i)=\sum_{j=1}^{k}p_{ij}\) ("dot notation") and is estimated by \(Y_{i.}/n\), where \(Y_{i.}=\sum_{j=1}^{k}Y_{ij}\) is the \(i^{th}\) row total
  4. The probability that a randomly selected observation falls into the column defined by \(B_j\) is \(p_{.j}=P(B_j)=\sum_{i=1}^{h}p_{ij}\) ("dot notation") and is estimated by \(Y_{.j}/n\), where \(Y_{.j}=\sum_{i=1}^{h}Y_{ij}\) is the \(j^{th}\) column total
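
Here's a minimal sketch of the "dot notation" estimates, using the 2 × 4 table of observed counts from the marijuana example above. (Python and NumPy are my own choice here, not part of the course materials.)

```python
import numpy as np

# Observed counts y_ij from the 2 x 4 table above (row A1 = yes, A2 = no).
y = np.array([[60, 54, 46, 41],
              [40, 44, 53, 57]])
n = y.sum()                    # n = 395

p_hat = y / n                  # estimates of p_ij = P(A_i and B_j)
p_row_hat = p_hat.sum(axis=1)  # estimates of p_i. = P(A_i)  ("row dots")
p_col_hat = p_hat.sum(axis=0)  # estimates of p_.j = P(B_j)  ("column dots")

print(p_row_hat)   # approximately [0.509, 0.491]
print(p_col_hat)   # approximately [0.253, 0.248, 0.251, 0.248]
```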

With the notation defined as such, we are now ready to formulate the chi-square test statistic for testing the independence of two categorical variables.

The Chi-Square Test Statistic Section

Theorem

The chi-square test statistic:

\(Q=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(y_{ij}-\frac{y_{i.}y_{.j}}{n})^2}{\frac{y_{i.}y_{.j}}{n}} \)

for testing the independence of two categorical variables, one with h levels and the other with k levels, follows an approximate chi-square distribution with (h−1)(k−1) degrees of freedom.
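
Before we dive into the proof, here's a minimal computational sketch of the statistic as just stated. (Python is my own choice here, and the function name is made up; Minitab output for the same calculation appears at the bottom of this page.)

```python
import numpy as np

def chi_square_independence(y):
    """Return Q and its degrees of freedom for an h x k table of counts."""
    y = np.asarray(y, dtype=float)
    n = y.sum()
    # Expected count for cell (i, j) is y_i. * y_.j / n
    expected = np.outer(y.sum(axis=1), y.sum(axis=0)) / n
    q = ((y - expected) ** 2 / expected).sum()
    df = (y.shape[0] - 1) * (y.shape[1] - 1)
    return q, df

# The 2 x 4 table from Example 17-4 below gives Q = 8.006 with 3 d.f.
q, df = chi_square_independence([[60, 54, 46, 41],
                                 [40, 44, 53, 57]])
print(round(q, 3), df)   # 8.006 3
```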

Proof

We should be getting to be pros at deriving these chi-square tests. We'll do the proof in four steps.

  1. Step 1

    We can think of the \(h \times k\) cells as arising from a multinomial distribution with \(h \times k\) categories. Then, in that case, as long as n is large, we know that:

    \(Q_{kh-1}=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(\text{observed }-\text{ expected})^2}{\text{ expected}} =\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{(y_{ij}-np_{ij})^2}{np_{ij}}\)

    follows an approximate chi-square distribution with \(kh−1\) degrees of freedom.

  2. Step 2

    But the chi-square statistic, as defined in the first step, depends on some unknown parameters \(p_{ij}\). So, we'll estimate the \(p_{ij}\) assuming that the null hypothesis is true, that is, assuming independence:

    \(p_{ij}=P(A_i \cap B_j)=P(A_i) \times P(B_j)=p_{i.}p_{.j} \)

    Under the assumption of independence, it is therefore reasonable to estimate the \(p_{ij}\) with:

    \(\hat{p}_{ij}=\hat{p}_{i.}\hat{p}_{.j}=\left(\frac{\sum_{j=1}^{k}y_{ij}}{n}\right) \left(\frac{\sum_{i=1}^{h}y_{ij}}{n}\right)=\frac{y_{i.}y_{.j}}{n^2}\)

  3. Step 3

    Now, we have to determine how many parameters we estimated in the second step. Well, the fact that the row probabilities add to 1:

    \(\sum_{i=1}^{h}p_{i.}=1 \)

    implies that we've estimated \(h−1\) row parameters. And, the fact that the column probabilities add to 1:

    \(\sum_{j=1}^{k}p_{.j}=1 \)

    implies that we've estimated \(k−1\) column parameters. Therefore, we've estimated a total of \(h−1 + k − 1 = h + k − 2\) parameters.

  4. Step 4

    Because we estimated \(h + k − 2\) parameters, we have to adjust the test statistic and degrees of freedom accordingly. Doing so, we get that:

    \(Q=\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{\left(y_{ij}-n\left(\frac{y_{i.}y_{.j}}{n^2}\right) \right)^2}{n\left(\frac{y_{i.}y_{.j}}{n^2}\right)} =\sum_{j=1}^{k}\sum_{i=1}^{h}\frac{\left(y_{ij}-\frac{y_{i.}y_{.j}}{n} \right)^2}{\left(\frac{y_{i.}y_{.j}}{n}\right)} \)

    follows an approximate chi-square distribution with \((kh − 1)− (h + k − 2)\) degrees of freedom, that is, upon simplification, \(kh − h − k + 1 = (h − 1)(k − 1)\) degrees of freedom.

    By the way, I think I might have mumbled something up above about the equivalence of the chi-square statistic for homogeneity and the chi-square statistic for independence. In order to prove that the two statistics are indeed equivalent, we just have to show, for example, in the case when \(h = 2\), that:

    \(\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{\left(y_{ij}-n_i\left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right) \right)^2}{n_i\left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right)} =\sum_{i=1}^{2}\sum_{j=1}^{k}\frac{\left(y_{ij}-\frac{y_{i.}y_{.j}}{n} \right)^2}{\left(\frac{y_{i.}y_{.j}}{n}\right)} \)

    Errrrrrr. That probably looks like a scarier proposition than it is, as showing that the above is true amounts to showing that:

    \(n_i \left(\frac{y_{1j}+y_{2j}}{n_1+n_2}\right)=\frac{y_{i.}y_{.j}}{n} \)

    Well, rewriting the left-side a bit using dot notation, we get:

    \(n_i \left(\frac{y_{.j}}{n}\right)=\frac{y_{i.}y_{.j}}{n} \)

    and doing some algebraic simplification, we get:

    \(n_i= y_{i.}\)

    which certainly holds true, as \(n_i\) and \(y_{i·}\) mean the same thing, that is, the number of experimental units in the \(i^{th}\) row.
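
If you'd like a numerical sanity check of that equivalence, here's a minimal sketch that computes the statistic both ways, in its homogeneity form and in its independence form, for the 2 × 4 table used on this page. (Python and NumPy are my own choice here.)

```python
import numpy as np

# Observed 2 x k table; in the homogeneity formulation, row i is sample i
# and n_i = y_i. is treated as fixed in advance.
y = np.array([[60, 54, 46, 41],
              [40, 44, 53, 57]], dtype=float)
n_i = y.sum(axis=1)                   # n_1 and n_2 (the row totals)
n = y.sum()

# Homogeneity form: expected count is n_i * (y_1j + y_2j) / (n_1 + n_2)
exp_hom = np.outer(n_i, y.sum(axis=0) / n)
q_hom = ((y - exp_hom) ** 2 / exp_hom).sum()

# Independence form: expected count is y_i. * y_.j / n
exp_ind = np.outer(y.sum(axis=1), y.sum(axis=0)) / n
q_ind = ((y - exp_ind) ** 2 / exp_ind).sum()

print(q_hom, q_ind)                    # both are about 8.006
print(np.allclose(exp_hom, exp_ind))   # True, as the argument above shows
```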

Example 17-4 Section


Is age independent of the desire to ride a bicycle? A random sample of 395 people was surveyed. Each person was asked their interest in riding a bicycle (Variable A) and their age (Variable B). The data that resulted from the survey are summarized in the following table:

Bicycle Riding Interest Variable B (Age)
Variable A OBSERVED (18-24) (25-34) (35-49) (50-64) Total
YES 60 54 46 41 201
NO 40 44 53 57 194
Total 100 98 99 98 395

Is there evidence to conclude, at the 0.05 level, that the desire to ride a bicycle depends on age?

Answer

Here's the table of expected counts:

Bicycle Riding Interest Variable B (Age)
Variable A EXPECTED 18-24 25-34 35-49 50-64 Total
YES 50.886 49.868 50.377 49.868 201
NO 49.114 48.132 48.623 48.132 194
Total 100 98 99 98 395

which results in a chi-square statistic of 8.006:

\(Q=\frac{(60-50.886)^2}{50.886}+ ... +\frac{(57-48.132)^2}{48.132}=8.006 \)
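
For the record, each expected count in the table above is just the corresponding row total times the column total, divided by \(n\), that is, the \(\frac{y_{i.}y_{.j}}{n}\) quantity from the theorem. For example, the expected count for the (YES, 18-24) cell is:

\(\frac{y_{1.}y_{.1}}{n}=\frac{201 \times 100}{395}=50.886 \)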

The chi-square test tells us to reject the null hypothesis, at the 0.05 level, if Q exceeds the upper 0.05 critical value of a chi-square random variable with 3 degrees of freedom, that is, if Q > 7.815. Because Q = 8.006 > 7.815, we reject the null hypothesis. There is sufficient evidence at the 0.05 level to conclude that the desire to ride a bicycle depends on age.
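
If you'd rather let software verify the arithmetic, here's a minimal check using Python's SciPy. (SciPy is my own assumption here; the course's Minitab route is shown in the next section.)

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

# chi2_contingency returns the Pearson statistic Q, the p-value, the
# degrees of freedom, and the table of expected counts. correction=False
# turns off the continuity correction (it only applies to 2 x 2 tables).
q, p_value, df, expected = chi2_contingency(observed, correction=False)
print(round(q, 3), df, round(p_value, 3))   # 8.006 3 0.046

# The 0.05 critical value for 3 degrees of freedom, for comparison:
print(round(chi2.ppf(0.95, df), 3))         # 7.815
```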

Using Minitab Section

If you...

  1. Enter the data (the inside of the observed frequency table only, without the totals) into the columns of the worksheet
  2. Select Stat >> Tables >> Chi-square test

then Minitab will display typical chi-square test output that looks something like this:

Chi-Square Test: 18-24, 25-34, 35-49, 50-64
Expected counts are printed below observed counts
        18-24  25-34  35-49  50-64  Total
    1      60     54     46     41    201
        50.89  49.87  50.38  49.87

    2      40     44     53     57    194
        49.11  48.13  48.62  48.13

Total     100     98     99     98    395

Chi-Sq = 1.632 + 0.342 + 0.380 + 1.577 +
         1.691 + 0.355 + 0.394 + 1.634 = 8.006

DF = 3, P-Value = 0.046