To better understand what these expected counts represent, first recall that the expected counts table is designed to reflect what the sample data counts would be if the two variables were independent (the null hypothesis). In other words, under the null hypothesis we expect each region to show the same split between the levels of entrepreneurialism. For example, if we considered ONLY the Northeast and looked at its expected counts across the two levels of entrepreneurialism, under the null hypothesis we should have about 50% in each level. With observed values of 300 and 460, we can begin to suspect that levels of entrepreneurialism may not be "independent" of location.
You may be looking at the expected counts for the Northeast and wondering why they aren't exactly 50/50. This is because the expected value is calculated as a function of both the ROWS and the COLUMNS! Each expected count uses its row total and its column total, and the column totals here (549 and 555) are not exactly equal. Fortunately, software will do the calculations for you, but it is still helpful to have a conceptual understanding of expected values.
Region | Low Entrepreneurialism | High Entrepreneurialism | All |
---|---|---|---|
Northeast (observed) | 300 | 460 | 760 |
Northeast (expected) | 377.9 | 382.1 | |
Midwest (observed) | 249 | 95 | 344 |
Midwest (expected) | 171.1 | 172.9 | |
All | 549 | 555 | 1104 |
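The expected counts in the table can be reproduced by hand. The sketch below assumes the standard formula for a test of independence, expected count = (row total × column total) / grand total. The function and variable names are illustrative only and are not part of any software output shown here.

```python
# Observed counts from the table above.
# Each row is a region; columns are (low, high) entrepreneurialism.
observed = {
    "Northeast": [300, 460],
    "Midwest": [249, 95],
}


def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = {name: sum(counts) for name, counts in table.items()}
    col_totals = [sum(col) for col in zip(*table.values())]
    grand_total = sum(row_totals.values())
    return {
        name: [row_totals[name] * col_total / grand_total for col_total in col_totals]
        for name in table
    }


expected = expected_counts(observed)
for region, counts in expected.items():
    print(region, [round(c, 1) for c in counts])
```

Rounded to one decimal place, the results agree with the expected counts in the table: 377.9 and 382.1 for the Northeast, 171.1 and 172.9 for the Midwest.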
The statistical question becomes, "Are the observed counts so different from the expected counts that we can conclude a relationship exists between the two variables?" To conduct this test we compute a Chi-square test statistic where we compare each cell's observed count to its respective expected count.
In a summary table, we have \(r\times c=rc\) cells. Let \(O_1, O_2, …, O_{rc}\) denote the observed counts for each cell and \(E_1, E_2, …, E_{rc}\) denote the respective expected counts for each cell.
Chi-Square Test Statistic
The Chi-square test statistic is calculated as follows:
\(\chi^{2*}=\displaystyle\sum\limits_{i=1}^{rc} \dfrac{(O_i-E_i)^2}{E_i}\)
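As a rough illustration of the formula, the sketch below sums \((O_i-E_i)^2/E_i\) over the four cells of the Northeast/Midwest table. The expected counts are recomputed from the row and column totals (assuming the usual row total × column total / grand total rule) rather than taken from the rounded table values. All names are illustrative.

```python
# Observed counts as rows of the 2x2 table: (low, high) entrepreneurialism.
observed = [
    [300, 460],  # Northeast
    [249, 95],   # Midwest
]


def chi_square_statistic(table):
    """Sum of (O - E)^2 / E over every cell of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence (assumed standard formula).
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp
    return statistic


chi_sq = chi_square_statistic(observed)
print(round(chi_sq, 3))
```

Every cell contributes a non-negative amount, so a large total can only come from cells whose observed counts are far from their expected counts.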
Under the null hypothesis and certain conditions (discussed below), the test statistic follows a Chi-square distribution with degrees of freedom equal to \((r-1)(c-1)\), where \(r\) is the number of rows and \(c\) is the number of columns. We omit the mathematical details of why this test statistic is used and why it follows a Chi-square distribution.
As we have done with other statistical tests, we make our decision either by comparing the value of the test statistic to a critical value or by finding the probability of getting this test statistic value or one more extreme (the p-value). The p-value is found by \(P(\chi^2>\chi^{2*})\) with degrees of freedom \(=(r-1)(c-1)\).
So for Donna’s data, we compute the chi-square statistics:
Test | Chi-Square | DF | P-Value |
---|---|---|---|
Pearson | 102.596 | 1 | 0.000 |
Likelihood Ratio | 105.357 | 1 | 0.000 |
The resulting chi-square statistic is 102.596 with a p-value reported as 0.000 (that is, less than 0.001). The 2×2 table also includes the expected values. Remember that the chi-square statistic compares the expected values to the observed values from Donna’s study. The results of the chi-square test indicate that this difference (observed minus expected) is large. Thus, Donna can reject the null hypothesis that entrepreneurialism and geographic location are independent, and she can conclude that entrepreneurialism levels depend on geographic location.
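A p-value printed as 0.000 is rounded, not exactly zero. The sketch below shows one way to check this for a 2×2 table. It relies on the fact that a Chi-square variable with 1 degree of freedom is the square of a standard normal variable, so \(P(\chi^2 > x)\) equals `math.erfc(sqrt(x / 2))`. This shortcut applies only when the degrees of freedom equal 1; for larger tables a statistics library would be needed. The function names are illustrative.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a two-way table: (r - 1)(c - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def p_value_one_df(chi_sq):
    """Upper-tail probability P(chi2 > chi_sq); valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(chi_sq / 2))


df = degrees_of_freedom(2, 2)       # 2 regions x 2 entrepreneurialism levels
p_value = p_value_one_df(102.596)   # Pearson statistic reported for Donna's data

print(df)
print(p_value < 0.001)
```

The resulting p-value is far below 0.001, which is why it is displayed as 0.000 at three decimal places.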
Conditions for Using the Chi-Square Test
Exercise caution when there are small expected counts. Minitab will give a count of the number of cells that have expected frequencies less than five. Some statisticians hesitate to use the chi-square test if more than 20% of the cells have expected frequencies below five, especially if the p-value is small and these cells give a large contribution to the total chi-square value.
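The check described above can be sketched in a few lines. The threshold of five and the 20% guideline come from the paragraph above; the function name and the way the result is reported are assumptions for illustration and do not reproduce Minitab's output.

```python
def small_expected_cells(expected, threshold=5):
    """Return (number of cells with expected count < threshold, share of such cells)."""
    cells = [value for row in expected for value in row]
    n_small = sum(1 for value in cells if value < threshold)
    return n_small, n_small / len(cells)


# Expected counts from the Northeast/Midwest table in this lesson.
expected = [
    [377.9, 382.1],
    [171.1, 172.9],
]

n_small, share = small_expected_cells(expected)
print(n_small, share)

# Guideline from the text: be cautious if more than 20% of cells fall below five.
needs_caution = share > 0.20
print(needs_caution)
```

For these data no expected count is below five, so the guideline raises no concern.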
Caution!
Sometimes researchers will categorize quantitative data (e.g., take height measurements and categorize them as 'below average,' 'average,' and 'above average'). Doing so results in a loss of information: one cannot reverse the process and reproduce the raw quantitative measurements from the categories. Instead of categorizing, the data should be analyzed using quantitative methods.