We can take Donna’s table and begin to fill out the row and column totals. To do so, we simply add up the observations for each row and column.
Location | Low Entrepreneurialism | High Entrepreneurialism | Total |
---|---|---|---|
Northeast | 300 | 460 | 760 |
Midwest | 249 | 95 | 344 |
Total | 549 | 555 | 1104 |
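The totals above can be checked with a few lines of Python. This is a minimal sketch rather than part of Donna's analysis; the variable names (`observed`, `row_totals`, `column_totals`, `grand_total`) are our own choices for illustration.

```python
# Observed counts from Donna's table. Each row is a location; the two
# entries are (Low Entrepreneurialism, High Entrepreneurialism).
observed = {
    "Northeast": [300, 460],
    "Midwest": [249, 95],
}

# Row totals: add across each row.
row_totals = {location: sum(counts) for location, counts in observed.items()}

# Column totals: add down each column.
column_totals = [sum(column) for column in zip(*observed.values())]

# Grand total: the sum of all observations.
grand_total = sum(row_totals.values())

print(row_totals)     # {'Northeast': 760, 'Midwest': 344}
print(column_totals)  # [549, 555]
print(grand_total)    # 1104
```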
From here, Donna wants to determine whether an association (relationship) exists between Location and Entrepreneurialism. Note that we are testing for an association, not for whether one variable causes the other. This is a very important limitation of our conclusion. Another way to say that two variables are associated is to say that they are dependent: knowing where a town is located tells us something about the level of entrepreneurialism in the town.
How do we test the independence of two categorical variables? We use the Chi-square test of independence.
As with all prior statistical tests, we need to define null and alternative hypotheses. As we have learned, the null hypothesis is what is assumed to be true until we have evidence against it. In this lesson, we are interested in whether two categorical variables are associated (i.e., dependent). Therefore, until we have evidence to suggest that they are, we must assume that they are not. This is the motivation behind the hypotheses for the Chi-square Test of Independence.
Hypotheses:
- \(H_0\): In the population, the two categorical variables are independent.
- \(H_a\): In the population, the two categorical variables are dependent.
Once we have gathered our data, we summarize it in a two-way contingency table. A "generic" contingency table looks like this:
Group | Success | Failure | Total |
---|---|---|---|
Group 1 | A | B | A + B |
Group 2 | C | D | C + D |
Total | A + C | B + D | A + B + C + D |
The question becomes, "How would this table look if the two variables were not related?" That is, under the null hypothesis that the two variables are independent, what would we expect our data to look like? To answer this question, we calculate the "expected count" for each cell in the table. In the Chi-square test, the expected cell counts represent the counts we would expect to see if the null hypothesis were true.
Expected Cell Count
The expected count for each cell under the null hypothesis is:
\[E = \dfrac{(\text{row total}) \times (\text{column total})}{\text{total sample size}}\]
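As a worked instance with Donna's totals, consider the Northeast, Low Entrepreneurialism cell. Its expected count comes from multiplying its row total (760) by its column total (549) and dividing by the overall total (1104). The subscript label below is our own notation for this cell.

\[E_{\text{Northeast, Low}} = \dfrac{760 \times 549}{1104} = \dfrac{417240}{1104} \approx 377.9\]

The other three cells follow the same pattern, each using its own row and column totals.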
Adding the expected counts (shown in parentheses) to the observed counts for Donna’s data yields a table such as:
Location | Low Entrepreneurialism | High Entrepreneurialism | All |
---|---|---|---|
Northeast | 300 (377.9) | 460 (382.1) | 760 |
Midwest | 249 (171.1) | 95 (172.9) | 344 |
All | 549 | 555 | 1104 |
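The expected counts in the table above can be reproduced with a short Python sketch that multiplies each row total by each column total and divides by the overall total. The variable names are hypothetical, and the rounding to one decimal place is only there to match the table.

```python
# Observed counts: rows are locations, columns are
# (Low Entrepreneurialism, High Entrepreneurialism).
observed = [
    [300, 460],  # Northeast
    [249, 95],   # Midwest
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under independence:
# (row total * column total) / total sample size.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

# Round to one decimal place to match the table.
rounded = [[round(value, 1) for value in row] for row in expected]
print(rounded)  # [[377.9, 382.1], [171.1, 172.9]]
```

Note that the expected counts in each row and each column add up to the same totals as the observed counts, which is a quick way to check the arithmetic.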