6.1 - Chi-Square Test for Independence

We can take Donna’s table and begin to fill out the row and column totals. To do so, we simply add up the observations for each row and column.

| Location | Low Entrepreneurialism | High Entrepreneurialism | Total |
|---|---|---|---|
| Northeast | 300 | 460 | 760 |
| Midwest | 249 | 95 | 344 |
| Total | 549 | 555 | 1104 |

Note! The contingency table now includes a 'total' row and a 'total' column, which hold the marginal totals, i.e., the total count in each row and the total count in each column. The total row and total column are NOT included in the size of the table. The size refers to the number of levels of the actual categorical variables in the study, so Donna's table is a 2 × 2 table.
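As a minimal sketch of the bookkeeping described above, the marginal totals can be computed directly from the four observed counts. The dictionary layout and the variable names (`observed`, `row_totals`, `col_totals`, `grand_total`) are illustrative assumptions, not part of the lesson.

```python
# Observed counts from Donna's table
# (rows: location; columns: level of entrepreneurialism).
observed = {
    "Northeast": {"Low": 300, "High": 460},
    "Midwest": {"Low": 249, "High": 95},
}
levels = ["Low", "High"]

# Row totals: add across the columns within each location.
row_totals = {loc: sum(counts.values()) for loc, counts in observed.items()}

# Column totals: add down the rows within each entrepreneurialism level.
col_totals = {lvl: sum(observed[loc][lvl] for loc in observed) for lvl in levels}

# Grand total: the overall sample size.
grand_total = sum(row_totals.values())

print(row_totals)   # {'Northeast': 760, 'Midwest': 344}
print(col_totals)   # {'Low': 549, 'High': 555}
print(grand_total)  # 1104
```

Only the four inner cells are stored, which mirrors the point that the totals are not part of the table's size.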

From here, Donna wants to determine whether an association (relationship) exists between Location and Entrepreneurialism. Note that we are focusing on an association, not on whether one variable causes the other. This is a very important limitation of our conclusion. Another way to think about an association is that the two variables are dependent: knowing where a town is located tells us something about the level of entrepreneurialism in the town.
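One informal way to see what "dependent" means is to compare the share of high-entrepreneurialism towns within each location. The sketch below is only a descriptive look at Donna's sample, with illustrative variable names; it does not replace the formal test, which judges whether a difference like this is larger than chance alone would explain.

```python
# Observed counts from Donna's table.
observed = {
    "Northeast": {"Low": 300, "High": 460},
    "Midwest": {"Low": 249, "High": 95},
}

# Within each location, what fraction of towns are high entrepreneurialism?
share_high = {}
for loc, counts in observed.items():
    total = counts["Low"] + counts["High"]
    share_high[loc] = counts["High"] / total
    print(f"{loc}: {share_high[loc]:.1%} high entrepreneurialism")
```

If the two variables were unrelated in the sample, these two shares would be roughly equal; here they are about 60.5% and 27.6%.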

How do we test the independence of two categorical variables? It will be done using the Chi-square test of independence.

As with all prior statistical tests, we need to define null and alternative hypotheses. As we have learned, the null hypothesis is what is assumed to be true until we have evidence against it. In this lesson, we are interested in whether two categorical variables are associated (i.e., dependent). Therefore, until we have evidence to suggest that they are, we must assume that they are not. This is the motivation behind the hypotheses for the Chi-square Test of Independence.

Hypotheses:

  • \(H_0\): In the population, the two categorical variables are independent.
  • \(H_a\): In the population, the two categorical variables are dependent.
Note! There are several ways to phrase these hypotheses. Instead of using the words "independent" and "dependent," one could say "there is no association between the two categorical variables" versus "there is an association between the two categorical variables." The important part is that the null hypothesis states that the two categorical variables are not associated, while the alternative is trying to show that they are related.

Once we have gathered our data, we summarize the data in the two-way contingency table. An example of a "generic" contingency table looks like this:

 

| | Success | Failure | Total |
|---|---|---|---|
| Group 1 | A | B | A + B |
| Group 2 | C | D | C + D |
| Total | A + C | B + D | A + B + C + D |

The question becomes, "How would this table look if the two variables were not related?" That is, under the null hypothesis that the two variables are independent, what would we expect our data to look like? To answer this question, we calculate the "expected count" for each cell in the table. In the Chi-square test, the expected cell counts are the counts we would expect to see if the null hypothesis were true.

Expected Cell Count

The expected count for each cell under the null hypothesis is:

\(E=\dfrac{(\text{row total})(\text{column total})}{\text{total sample size}}\)
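
As a worked illustration of this formula, take the Group 1 / Success cell of the generic table, whose row total is A + B and whose column total is A + C:

\(E=\dfrac{(A+B)(A+C)}{A+B+C+D}\)

Applied to the Northeast / Low Entrepreneurialism cell of Donna's table (row total 760, column total 549, sample size 1104), this gives:

\(E=\dfrac{(760)(549)}{1104}=\dfrac{417240}{1104}\approx 377.9\)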

Placing the expected count beneath the observed count in each cell of Donna’s table yields:

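| | Low Entrepreneurialism | High Entrepreneurialism | All |
|---|---|---|---|
| Northeast (observed) | 300 | 460 | 760 |
| Northeast (expected) | 377.9 | 382.1 | |
| Midwest (observed) | 249 | 95 | 344 |
| Midwest (expected) | 171.1 | 172.9 | |
| All | 549 | 555 | 1104 |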
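
The expected counts in this table can be reproduced with a few lines of code. The following is a minimal sketch that applies the (row total)(column total)/(total sample size) formula to every cell; the function name `expected_counts` and the list-of-rows layout are illustrative assumptions, not part of the lesson.

```python
def expected_counts(observed):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)  # total sample size
    # Each cell: (row total) * (column total) / (total sample size).
    return [[r * c / n for c in col_totals] for r in row_totals]


# Donna's observed counts: rows are Northeast and Midwest,
# columns are Low and High entrepreneurialism.
observed = [[300, 460], [249, 95]]
expected = expected_counts(observed)

for label, row in zip(["Northeast", "Midwest"], expected):
    print(label, [round(e, 1) for e in row])
# Northeast [377.9, 382.1]
# Midwest [171.1, 172.9]
```

Note that the expected counts in each row and column add up to the same marginal totals as the observed counts; only the way the counts are spread across the cells changes.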