6: Categorical Data Comparisons

Case Study: Entrepreneurialism

As a town planner, Donna is always thinking about ways in which the economy of her town might grow. She starts thinking about her town in the Northeast and how enthusiastic the residents of her town are about starting and supporting small businesses. This is called entrepreneurialism. She begins to wonder whether the increase in support for small business start-ups (entrepreneurialism) is a growing trend across the country or something unique to her own town. She decides to compare levels of entrepreneurialism between her town and a town in the Midwest to see if location makes a difference. Her measure of entrepreneurialism categorizes respondents as “high” or “low”. Donna recognizes that her data are categorical but is not quite sure how to proceed from there.

When we need to represent two categorical variables, such as location and categories of Entrepreneurialism, a table, called a contingency table, is typically the best format.

Location	Low Entrepreneurialism	High Entrepreneurialism
Northeast
Midwest

This table is referred to as a “2X2” contingency table because there are two categories of each variable. If Donna added a third region, say the South, she would have a 3X2 table (three locations by two levels of entrepreneurialism). Donna can now add her data to the table:

Location	Low Entrepreneurialism	High Entrepreneurialism
Northeast	300	460
Midwest	249	95

Now Donna can see her data in table form. Let’s take a look at some of the ways Donna can begin to describe and analyze her data.

Objectives

Upon completion of this lesson, you should be able to:

Recognize applications dealing with multiple categorical variables
Construct a 2X2 contingency table
Interpret row and column totals and percentages from a contingency table
State the null and alternative hypotheses for a chi square test
Identify the elements of the formula for a chi square test
Identify the similarity between a chi square test and a test of two proportions

6.1 - Chi-Square Test for Independence

We can take Donna’s table and begin to fill out the row and column totals. To do so, we simply add up the observations for each row and column.

Location	Low Entrepreneurialism	High Entrepreneurialism	Total
Northeast	300	460	760
Midwest	249	95	344
Total	549	555	1104

Note! The contingency table now includes a 'total' row and a 'total' column, which represent the marginal totals, i.e., the total count in each row and the total count in each column. This total row and total column are NOT included in the size of the table. The size refers to the number of levels of the actual categorical variables in the study.
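The marginal totals can also be reproduced with a few lines of code. The following is a minimal Python sketch using Donna's counts; the variable names (`observed`, `row_totals`, `col_totals`, `row_pcts`) are illustrative choices, not part of the lesson or of any statistical package.

```python
# Observed counts from Donna's table: each location maps to
# [low entrepreneurialism, high entrepreneurialism].
observed = {
    "Northeast": [300, 460],
    "Midwest": [249, 95],
}

# Marginal totals: sum across each row and down each column.
row_totals = {loc: sum(counts) for loc, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Row percentages: share of each location's respondents in each level.
row_pcts = {
    loc: [round(100 * n / row_totals[loc], 1) for n in counts]
    for loc, counts in observed.items()
}

print(row_totals)    # row totals: 760 and 344
print(col_totals)    # column totals: 549 and 555
print(grand_total)   # 1104
print(row_pcts)
```

The row percentages show roughly 60% of Northeast respondents in the high category compared with roughly 28% of Midwest respondents, which is the kind of difference the rest of this lesson tests formally.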

From here, Donna wants to determine if an association (relationship) exists between Location and Entrepreneurialism. Note that we are focusing on an association, not whether one causes another. This is a very important limitation of our conclusion. Another way to think about an association is that the two variables are dependent: knowing where a town is located tells us something about the level of entrepreneurialism in the town.

How do we test the independence of two categorical variables? It will be done using the Chi-square test of independence.

As with all prior statistical tests, we need to define null and alternative hypotheses. Also, as we have learned, the null hypothesis is what is assumed to be true until we have evidence to go against it. In this lesson, we are interested in researching if two categorical variables are associated (i.e., dependent). Therefore, until we have evidence to suggest that they are, we must assume that they are not. This is the motivation behind the hypotheses for the Chi-square test of independence.

Hypotheses:

\(H_0\): In the population, the two categorical variables are independent.
\(H_a\): In the population, the two categorical variables are dependent.

Note! There are several ways to phrase these hypotheses. Instead of using the words "independent" and "dependent," one could say "there is no association between the two categorical variables" versus "there is an association between the two categorical variables." The important part is that the null hypothesis refers to the two categorical variables not being associated, while the alternative is trying to show that they are related.

Once we have gathered our data, we summarize the data in the two-way contingency table. An example of a "generic" contingency table looks like this:

	Success	Failure	Total
Group 1	A	B	A + B
Group 2	C	D	C + D
Total	A + C	B + D	A + B + C + D

The question becomes, "How would this table look if the two variables were not related?" That is, under the null hypothesis that the two variables are independent, what would we expect our data to look like? To answer this question, we calculate the "expected count" for each cell in the table. What are we expecting? In the chi square, our "expected cell count" represents the null hypothesis condition!

Expected Cell Count

The expected count for each cell under the null hypothesis is:

\(E=\dfrac{\text{(row total)}(\text{column total})}{\text{total sample size}}\)

Adding the expected counts to the observed counts for Donna’s data yields a table such as the following, where the second row for each location holds the expected counts:

	Low Entrepreneurialism	High Entrepreneurialism	All
Northeast (observed)	300	460	760
Northeast (expected)	377.9	382.1
Midwest (observed)	249	95	344
Midwest (expected)	171.1	172.9
All	549	555	1104
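To make the expected-count formula concrete, here is a minimal Python sketch that applies (row total)(column total)/(total sample size) to every cell of Donna's table. The list layout and variable names are assumptions made for illustration.

```python
# Observed counts: rows are locations, columns are (low, high).
observed = [
    [300, 460],  # Northeast
    [249, 95],   # Midwest
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence:
# (row total * column total) / total sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["Northeast", "Midwest"], expected):
    print(label, [round(e, 1) for e in row])
```

Rounded to one decimal place, the results match the expected counts in the table: 377.9 and 382.1 for the Northeast, 171.1 and 172.9 for the Midwest.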

6.2 - Chi-Square Test Statistic

To better understand what these expected counts represent, first recall that the expected counts table is designed to reflect what the sample data counts would be if the two variables were independent (the null hypothesis). In other words, under the null hypothesis we expect the split between low and high entrepreneurialism to be the same in each location, matching the overall split in the sample (549 low and 555 high, or roughly 50% each). For example, if we ONLY considered the Northeast and looked at the expected counts across the two levels of entrepreneurialism, under the null hypothesis we should have roughly 50% in each level. With actual observed values of 300 and 460, we can begin to suspect that levels of entrepreneurialism may not be "independent" of location.

You may be looking at the expected counts for the Northeast and wondering why they aren't exactly 50/50. This is because the expected value is calculated as a function of both the ROWS and the COLUMNS! The great thing is that our software will do the calculations for you, but again, it is helpful to have a conceptual understanding of expected values.

	Low Entrepreneurialism	High Entrepreneurialism	All
Northeast (observed)	300	460	760
Northeast (expected)	377.9	382.1
Midwest (observed)	249	95	344
Midwest (expected)	171.1	172.9
All	549	555	1104

The statistical question becomes, "Are the observed counts so different from the expected counts that we can conclude a relationship exists between the two variables?" To conduct this test we compute a Chi-square test statistic where we compare each cell's observed count to its respective expected count.

In a summary table, we have \(r\times c=rc\) cells. Let \(O_1, O_2, \ldots, O_{rc}\) denote the observed counts for each cell and \(E_1, E_2, \ldots, E_{rc}\) denote the respective expected counts for each cell.

Chi-Square Test Statistic

The Chi-square test statistic is calculated as follows:

\(\chi^{2*}=\displaystyle\sum\limits_{i=1}^{rc} \dfrac{(O_i-E_i)^2}{E_i}\)

Under the null hypothesis and certain conditions (discussed below), the test statistic follows a Chi-square distribution with degrees of freedom equal to \((r-1)(c-1)\), where \(r\) is the number of rows and \(c\) is the number of columns. We leave out the mathematical details that show why this test statistic is used and why it follows a Chi-square distribution.

As we have done with other statistical tests, we make our decision either by comparing the value of the test statistic to a critical value or by finding the probability of getting this test statistic value or one more extreme (the p-value). The p-value is found by \(P(\chi^2>\chi^{2*})\) with degrees of freedom = \((r - 1)(c - 1)\).

So for Donna’s data, we compute the chi-square statistics:

	Chi-Square	DF	P-Value
Pearson	102.596	1	0.000
Likelihood	105.357	1	0.000

The resulting chi-square statistic is 102.596 with a p-value reported as 0.000 (that is, less than 0.001). The 2X2 table also includes the expected values. Remember that the chi-square statistic compares the expected values to the observed values from Donna’s study. The results of the chi-square test indicate that this difference (observed – expected) is large. Thus, Donna can reject the null hypothesis that entrepreneurialism and geographic location are independent, and she can conclude that entrepreneurialism levels are associated with (dependent on) geographic location.
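The Pearson statistic reported above can be reproduced by hand. The following is a minimal Python sketch using only the standard library; it is not how the lesson's software computes the result. One assumption to note: the p-value line uses the identity \(P(\chi^2>x)=\operatorname{erfc}(\sqrt{x/2})\), which holds only when the degrees of freedom equal 1, as in a 2X2 table.

```python
import math

# Observed counts: rows are locations, columns are (low, high).
observed = [
    [300, 460],  # Northeast
    [249, 95],   # Midwest
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (O - E)^2 / E over all cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this shortcut is valid for df = 1 only.
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 2), df, p_value < 0.001)
```

The statistic agrees with the Pearson row of the output (about 102.6 with 1 degree of freedom), and the p-value is far below 0.001, which is why the software displays it as 0.000.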

Conditions for Using the Chi-Square Test

Exercise caution when there are small expected counts. Minitab will give a count of the number of cells that have expected frequencies less than five. Some statisticians hesitate to use the chi-square test if more than 20% of the cells have expected frequencies below five, especially if the p-value is small and these cells give a large contribution to the total chi-square value.
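As a rough illustration of this check, the Python sketch below counts how many expected cell counts fall below five and what share of the cells that is. The 20% threshold is the rule of thumb described above; the variable names are illustrative, and the expected counts are the rounded values from Donna's table.

```python
# Expected counts from Donna's table (rounded to one decimal place).
expected = [
    [377.9, 382.1],  # Northeast
    [171.1, 172.9],  # Midwest
]

cells = [e for row in expected for e in row]
small_cells = [e for e in cells if e < 5]
share_small = len(small_cells) / len(cells)

# Rule of thumb from the text: be cautious if more than 20% of cells
# have expected counts below five.
needs_caution = share_small > 0.20

print(len(small_cells), share_small, needs_caution)
```

For Donna's data no cell has an expected count below five, so this condition raises no concern.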

Caution!

Sometimes researchers will categorize quantitative data (e.g., take height measurements and categorize them as 'below average,' 'average,' and 'above average'). Doing so results in a loss of information: one cannot do the reverse of taking the categories and reproducing the raw quantitative measurements. Instead of categorizing, the data should be analyzed using quantitative methods.

6.3 - Risk, Relative Risk and Odds

In this section, we will introduce some other measures we can find using a contingency table. One of the most straightforward measures to find is the risk of any given event.

Risk: The probability that an event will occur.

In simple terms, a risk for a group is the same as the proportion of "success" for a particular group.

Have you ever heard a doctor tell you or a family member something similar to the following: "If you do not lose weight or get your cholesterol under control you are about five times more likely to suffer a heart attack than if you had these numbers in the normal range." If so, how alarmed should one be? "Five times" sounds alarming!

First off, this "five times" represents what is called relative risk.

Relative Risk: Relative risk is a ratio of the risks of two groups.

In the example described above, it would be the risk of heart attack for a person in their current condition compared to the risk of heart attack if that person were in the normal ranges. However, to truly interpret the severity of a relative risk we have to know the baseline risk.

Baseline Risk: The baseline risk is the denominator of relative risk, i.e., the risk of the group being compared to.

In our example, this would be the risk of heart attack for the normal range. If this baseline risk is high, then a relative risk of 5 would be alarming; if the baseline risk is small, then a relative risk of 5 may not be too serious.

For instance, if the risk of a heart attack for someone in the normal range was 1 out of 10, then the risk of a heart attack for a person with the above average numbers would be five times this or 5 out of 10. That is, the person would have roughly a 50/50 chance of suffering a heart attack if they didn't get their weight and cholesterol in check. However, if the risk of a heart attack for the normal range group was 1 out of 500, then the risk of a heart attack for a person with above average numbers would be 5 out of 500 or 0.01. The person would have about a 1% chance of a heart attack if they didn't improve their health. In both cases the relative risk was 5, but with entirely different levels of impact. Please note this example is not meant to be interpreted that taking care of your health is not important!!!
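The arithmetic in this example can be sketched in a few lines of Python. The helper name `risk_in_group` is hypothetical. The last part applies the same idea to Donna's table, treating "high entrepreneurialism" as the event and the Midwest as the baseline group; that framing is an assumption made for illustration and is not a calculation from the lesson.

```python
def risk_in_group(baseline_risk: float, relative_risk: float) -> float:
    """Risk implied by a baseline risk and a relative risk."""
    return baseline_risk * relative_risk

# The two scenarios from the text: the same relative risk of 5
# applied to a high and a low baseline risk.
risk_high_baseline = risk_in_group(1 / 10, 5)   # 5 out of 10
risk_low_baseline = risk_in_group(1 / 500, 5)   # 5 out of 500

# Illustration on Donna's table (assumed framing: event = high
# entrepreneurialism, baseline group = Midwest).
risk_northeast = 460 / 760
risk_midwest = 95 / 344
relative_risk = risk_northeast / risk_midwest

print(round(risk_high_baseline, 3), round(risk_low_baseline, 3))
print(round(risk_northeast, 3), round(risk_midwest, 3), round(relative_risk, 2))
```

Under that framing, the risk of high entrepreneurialism is about 0.61 in the Northeast and about 0.28 in the Midwest, for a relative risk of roughly 2.2.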

Another measure we can find is odds.

Odds: Odds are the ratio of the number of “successes” to the number of “failures.” They can be reported as a fraction or as “number of successes: number of failures.” When we reflect back on Donna's contingency table, the odds are calculated using the interior cells of the table.

	Low Entrepreneurialism	High Entrepreneurialism	All
Northeast (observed)	300	460	760
Northeast (expected)	377.9	382.1
Midwest (observed)	249	95	344
Midwest (expected)	171.1	172.9
All	549	555	1104

Let's say we want to calculate the "odds" that an observation from the Northeast is high entrepreneurialism. We simply divide the number of observations of high entrepreneurialism by the number of observations of low entrepreneurialism for the Northeast. In this example that would be 460/300, or odds of about 1.53. For an observation from the Northeast, high entrepreneurialism is about 1.5 times as likely as low.

Let's try this for the Midwest, where we end up with odds of 95/249, or about 0.38. This brings up a really important point about odds. Odds of 1 are "equal odds" or 50/50. Any odds greater than 1 mean the numerator category (in our example, high entrepreneurialism) is more likely, and any odds less than 1 mean the numerator category is less likely. For the Midwest, being high entrepreneurialism is actually less likely than being low entrepreneurialism. Odds below 1 are hard to interpret, so typically we take the inverse (249/95) and say that an observation from the Midwest is about 2.6 times as likely to be low entrepreneurialism as high.

We can also calculate an "odds ratio," which is the ratio of two odds, if we want to compare two groups. In Donna's example, we could calculate the odds of high entrepreneurialism in the Northeast compared to the odds of high entrepreneurialism in the Midwest. We need to make sure both odds are in the same format (so both with high entrepreneurialism in the numerator). Now we simply take the ratio of the two odds!

(460/300)/(95/249), or an odds ratio of about 4.02. Now we can say that the odds of an observation from the Northeast being high entrepreneurialism are about 4 times the odds for an observation from the Midwest!
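The odds and odds ratio can be checked with a short Python sketch; the variable names are illustrative. Working from the unrounded counts matters here: the result comes out near 4.02, whereas dividing the already rounded odds (1.5 by 0.38) would give roughly 3.9.

```python
# Counts from Donna's table.
northeast_low, northeast_high = 300, 460
midwest_low, midwest_high = 249, 95

# Odds of high entrepreneurialism within each location
# (high in the numerator for both, so the odds are comparable).
odds_northeast = northeast_high / northeast_low
odds_midwest = midwest_high / midwest_low

# Odds ratio: Northeast odds relative to Midwest odds.
odds_ratio = odds_northeast / odds_midwest

# Inverse odds for the Midwest (low relative to high).
inverse_odds_midwest = midwest_low / midwest_high

print(round(odds_northeast, 2), round(odds_midwest, 2))
print(round(odds_ratio, 2), round(inverse_odds_midwest, 2))
```

For a 2X2 table the same odds ratio can be written as the cross-product (460)(249)/((300)(95)), which avoids intermediate rounding altogether.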

6.4 - Lesson Summary

From this lesson, Donna can now conduct a chi-square test to answer her question about the level of entrepreneurialism in the Northeast and Midwest. Using the chi-square test, she can now move forward with her study.
