6: Categorical Data Comparisons
Case Study: Entrepreneurialism
As a town planner, Donna is always thinking about ways in which the economy of her town might grow. She starts thinking about her town in the Northeast and how enthusiastic its residents are about starting and supporting small businesses. This is called entrepreneurialism. She begins to wonder whether the increase in support for small business startups (entrepreneurialism) is a growing trend across the country or something unique to her own town. She decides to compare levels of entrepreneurialism between her town and a town in the Midwest to see if location makes a difference. Her measure of entrepreneurialism categorizes respondents as “high” or “low”. Donna recognizes that her data are categorical but is not quite sure how to proceed from there.
When we need to represent two categorical variables, such as location and level of entrepreneurialism, a table, called a contingency table, is typically the best format.
| Location | Low Entrepreneurialism | High Entrepreneurialism |
|---|---|---|
| Northeast | | |
| Midwest | | |
This table is referred to as a “2X2” contingency table because each variable has two categories. If Donna added a third region, say the South, she would have a 3X2 table (three locations by two levels of entrepreneurialism). Donna can now add her data to the table:
| Location | Low Entrepreneurialism | High Entrepreneurialism |
|---|---|---|
| Northeast | 300 | 460 |
| Midwest | 249 | 95 |
Now Donna can see her data in table form. Let’s take a look at some of the ways Donna can begin to describe and analyze her data.
Objectives
 Recognize applications dealing with multiple categorical variables
 Construct a 2X2 contingency table
 Interpret row and column totals and percentages from a contingency table
 State the null and alternative hypotheses for a chi-square test
 Identify the elements of the formula for a chi-square test
 Identify the similarity between a chi-square test and a test of two proportions
6.1  Chi-Square Test for Independence
We can take Donna’s table and begin to fill out the row and column totals. To do so, we simply add up the observations for each row and column.
| Location | Low Entrepreneurialism | High Entrepreneurialism | Total |
|---|---|---|---|
| Northeast | 300 | 460 | 760 |
| Midwest | 249 | 95 | 344 |
| Total | 549 | 555 | 1104 |
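The row and column totals above can be reproduced with a few lines of code. The following is a minimal sketch in Python (an illustration only; the lesson itself relies on statistical software, and the variable names here are assumptions made for this example). It also computes row percentages, which show the share of each location's respondents at each level.

```python
# Observed counts from Donna's table.
# Each list holds [low, high] entrepreneurialism counts for one location.
observed = {
    "Northeast": [300, 460],
    "Midwest": [249, 95],
}

# Row totals: add the counts across the columns of each row.
row_totals = {location: sum(counts) for location, counts in observed.items()}

# Column totals: add the counts down the rows of each column.
column_totals = [sum(column) for column in zip(*observed.values())]

# Grand total: every observation in the table.
grand_total = sum(row_totals.values())

# Row percentages: share of each location's respondents in each level.
row_percentages = {
    location: [100 * count / row_totals[location] for count in counts]
    for location, counts in observed.items()
}

print(row_totals)     # {'Northeast': 760, 'Midwest': 344}
print(column_totals)  # [549, 555]
print(grand_total)    # 1104
for location, percentages in row_percentages.items():
    print(location, [f"{p:.1f}%" for p in percentages])
```

The row percentages give a first descriptive look at the data: about 60.5% of Northeast respondents are in the high category, compared with about 27.6% of Midwest respondents.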
From here, Donna wants to determine if an association (relationship) exists between Location and Entrepreneurialism. Note that we are focusing on an association, not on whether one variable causes the other. This is a very important limitation of our conclusion. Another way to think about it is to ask whether these two variables are dependent: if they are, knowing where a town is located tells us something about the level of entrepreneurialism in the town.
How do we test the independence of two categorical variables? We use the chi-square test of independence.
As with all prior statistical tests, we need to define null and alternative hypotheses. Also, as we have learned, the null hypothesis is what is assumed to be true until we have evidence against it. In this lesson, we are interested in whether two categorical variables are associated (i.e., dependent). Therefore, until we have evidence to suggest that they are, we must assume that they are not. This is the motivation behind the hypotheses for the chi-square test of independence.
Hypotheses:
 \(H_0\): In the population, the two categorical variables are independent.
 \(H_a\): In the population, the two categorical variables are dependent.
Once we have gathered our data, we summarize the data in a two-way contingency table. An example of a "generic" contingency table looks like this:
| | Success | Failure | Total |
|---|---|---|---|
| Group 1 | A | B | A + B |
| Group 2 | C | D | C + D |
| Total | A + C | B + D | A + B + C + D |
The question becomes, "How would this table look if the two variables were not related?" That is, under the null hypothesis that the two variables are independent, what would we expect our data to look like? To answer this question, we calculate the "expected count" for each cell in the table. What are we expecting? In the chi-square test, the "expected cell count" is the count we would expect to see if the null hypothesis were true!
Expected Cell Count
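The expected count for each cell under the null hypothesis is:
\(E=\dfrac{(\text{row total})\times(\text{column total})}{\text{total sample size}}\)
For example, the expected count for the Northeast, low entrepreneurialism cell is \(760\times 549/1104\approx 377.9\).
Adding the expected counts to the observed counts for Donna’s data yields a table such as: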
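| Location | Low Entrepreneurialism | High Entrepreneurialism | All |
|---|---|---|---|
| Northeast (observed) | 300 | 460 | 760 |
| Northeast (expected) | 377.9 | 382.1 | |
| Midwest (observed) | 249 | 95 | 344 |
| Midwest (expected) | 171.1 | 172.9 | |
| All | 549 | 555 | 1104 |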
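The expected counts in this table come from multiplying each cell's row total by its column total and dividing by the total sample size. The following is a minimal Python sketch of that calculation for Donna's data (the variable names are assumptions made for this illustration, not output from the course software).

```python
# Observed counts: rows are locations, columns are [low, high].
observed = [
    [300, 460],  # Northeast
    [249, 95],   # Midwest
]

row_totals = [sum(row) for row in observed]                 # 760 and 344
column_totals = [sum(column) for column in zip(*observed)]  # 549 and 555
grand_total = sum(row_totals)                               # 1104

# Expected count for a cell = (row total * column total) / grand total.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

for row in expected:
    print([round(value, 1) for value in row])
# [377.9, 382.1]
# [171.1, 172.9]
```

Each row of expected counts adds up to the same row total as the observed counts, and each column adds up to the same column total. Only the way the counts are spread across the cells changes.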
6.2  Chi-Square Test Statistic
To better understand what these expected counts represent, first recall that the expected counts are designed to reflect what the sample counts would be if the two variables were independent (the null hypothesis). In other words, under the null hypothesis we expect the proportion of observations at each level of entrepreneurialism to be the same in every location. For example, if we ONLY considered the Northeast and looked at its expected counts across the two levels of entrepreneurialism, under the null hypothesis we should have roughly 50% in each level, because the sample as a whole splits almost evenly (549 low, 555 high). With observed values of 300 and 460, we can begin to suspect that levels of entrepreneurialism may not be "independent" of location.
You may be looking at the expected counts for the Northeast and wondering why they aren't exactly 50/50. This is because the expected count is calculated from both the ROW totals and the COLUMN totals! The great thing is that our software will do the calculations for us, but it is still helpful to have a conceptual understanding of expected values.
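| Location | Low Entrepreneurialism | High Entrepreneurialism | All |
|---|---|---|---|
| Northeast (observed) | 300 | 460 | 760 |
| Northeast (expected) | 377.9 | 382.1 | |
| Midwest (observed) | 249 | 95 | 344 |
| Midwest (expected) | 171.1 | 172.9 | |
| All | 549 | 555 | 1104 |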
The statistical question becomes, "Are the observed counts so different from the expected counts that we can conclude a relationship exists between the two variables?" To conduct this test, we compute a chi-square test statistic in which we compare each cell's observed count to its respective expected count.
In a summary table, we have \(r\times c=rc\) cells. Let \(O_1, O_2, …, O_{rc}\) denote the observed counts for each cell and \(E_1, E_2, …, E_{rc}\) denote the respective expected counts for each cell.
 Chi-Square Test Statistic

The chi-square test statistic is calculated as follows:
\(\chi^{2*}=\displaystyle\sum\limits_{i=1}^{rc} \dfrac{(O_i-E_i)^2}{E_i}\)
Under the null hypothesis and certain conditions (discussed below), the test statistic follows a chi-square distribution with degrees of freedom equal to \((r-1)(c-1)\), where \(r\) is the number of rows and \(c\) is the number of columns. We leave out the mathematical details of why this test statistic is used and why it follows a chi-square distribution.
As we have done with other statistical tests, we make our decision either by comparing the value of the test statistic to a critical value or by finding the probability of getting this test statistic value or one more extreme (the p-value). The p-value is found by \(P(\chi^2>\chi^{2*})\) with degrees of freedom = \((r-1)(c-1)\).
So for Donna’s data, we compute the chi-square statistics:
| | Chi-Square | DF | P-Value |
|---|---|---|---|
| Pearson | 102.596 | 1 | 0.000 |
| Likelihood | 105.357 | 1 | 0.000 |
The resulting chi-square statistic is 102.596 with a p-value reported as 0.000 (that is, less than 0.001). The 2X2 table also includes the expected values. Remember that the chi-square statistic compares the expected values to the observed values from Donna’s study. The results of the chi-square test indicate that this difference (observed – expected) is large. Thus, Donna can reject the null hypothesis that entrepreneurialism and geographic location are independent and conclude that entrepreneurialism levels depend on geographic location.
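To see where the Pearson value comes from, the statistic can be computed by hand from the observed and expected counts. The following is a minimal Python sketch using only the standard library; it is an illustration, not the software output shown in the lesson. One assumption to note: the p-value line uses a shortcut that is valid only for 1 degree of freedom, where \(P(\chi^2>x)\) equals `erfc(sqrt(x / 2))`. For larger tables, a statistics library would be needed instead.

```python
import math

# Observed counts: rows are locations, columns are [low, high].
observed = [
    [300, 460],  # Northeast
    [249, 95],   # Midwest
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under the null hypothesis of independence.
expected = [
    [r * c / grand_total for c in column_totals]
    for r in row_totals
]

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Shortcut valid ONLY when degrees_of_freedom == 1 (an assumption of this sketch).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 3))   # 102.596
print(degrees_of_freedom)     # 1
print(p_value < 0.001)        # True
```

The p-value is far smaller than 0.001, which is why software that rounds to three decimal places displays it as 0.000.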
Conditions for Using the Chi-Square Test
Exercise caution when there are small expected counts. Minitab will give a count of the number of cells that have expected frequencies less than five. Some statisticians hesitate to use the chi-square test if more than 20% of the cells have expected frequencies below five, especially if the p-value is small and these cells give a large contribution to the total chi-square value.
Caution!
Sometimes researchers will categorize quantitative data (e.g., take height measurements and categorize them as 'below average,' 'average,' and 'above average'). Doing so results in a loss of information: one cannot do the reverse and reproduce the raw quantitative measurements from the categories. Instead of categorizing, the data should be analyzed using quantitative methods.
6.3  Risk, Relative Risk and Odds
In this section, we will introduce some other measures we can find using a contingency table. One of the most straightforward measures to find is the risk of any given event.
 Risk
 The probability that an event will occur.
In simple terms, a risk for a group is the same as the proportion of "success" for a particular group.
Have you ever heard a doctor tell you or a family member something similar to the following: "If you do not lose weight or get your cholesterol under control you are about five times more likely to suffer a heart attack than if you had these numbers in the normal range." If so, how alarmed should one be? "Five times" sounds alarming!
First off, this "five times" represents what is called relative risk.
 Relative Risk
 Relative risk is a ratio of the risks of two groups.
In the example described above, it would be the risk of heart attack for a person in their current condition compared to the risk of heart attack if that person were in the normal ranges. However, to truly interpret the severity of a relative risk we have to know the baseline risk.
 Baseline Risk
 The baseline risk is the denominator of relative risk, i.e., the risk of the group being compared to.
In our example, this would be the risk of heart attack for the normal range. If this baseline risk is high, then a relative risk of 5 would be alarming; if the baseline risk is small, then a relative risk of 5 may not be too serious.
For instance, if the risk of a heart attack for someone in the normal range was 1 out of 10, then the risk of a heart attack for a person with the above-average numbers would be five times this, or 5 out of 10. That is, the person would have roughly a 50/50 chance of suffering a heart attack if they didn't get their weight and cholesterol in check. However, if the risk of a heart attack for the normal range group was 1 out of 500, then the risk of a heart attack for a person with above-average numbers would be 5 out of 500, or 0.01. The person would have about a 1% chance of a heart attack if they didn't improve their health. In both cases the relative risk was 5, but with entirely different levels of impact. Please note that this example does not mean that taking care of your health is unimportant!
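The arithmetic in this example can be written out in a few lines. The following is a minimal Python sketch; the function name `risk_from_relative_risk` is hypothetical, chosen only for this illustration, and the two baseline risks are the ones used in the paragraph above.

```python
def risk_from_relative_risk(baseline_risk, relative_risk):
    """Risk for the comparison group: baseline risk times relative risk."""
    return baseline_risk * relative_risk


relative_risk = 5

# Two baseline risks from the example: 1 out of 10 and 1 out of 500.
high_baseline = 1 / 10
low_baseline = 1 / 500

risk_with_high_baseline = risk_from_relative_risk(high_baseline, relative_risk)
risk_with_low_baseline = risk_from_relative_risk(low_baseline, relative_risk)

print(f"{risk_with_high_baseline:.2f}")  # 0.50
print(f"{risk_with_low_baseline:.2f}")   # 0.01
```

The same relative risk of 5 produces a 50% risk in one case and a 1% risk in the other, which is why the baseline risk has to be known before a relative risk can be interpreted.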
Another measure we can find is odds.
 Odds
 Odds is a ratio of the number of “successes” to the number of “failures.” It can be reported as a fraction or as “number of successes : number of failures.” When we reflect back on Donna's contingency table, the odds are calculated using the interior cells of the table.
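| Location | Low Entrepreneurialism | High Entrepreneurialism | All |
|---|---|---|---|
| Northeast (observed) | 300 | 460 | 760 |
| Northeast (expected) | 377.9 | 382.1 | |
| Midwest (observed) | 249 | 95 | 344 |
| Midwest (expected) | 171.1 | 172.9 | |
| All | 549 | 555 | 1104 |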
Let's say we want to calculate the "odds" of high entrepreneurialism in the Northeast. We would simply divide the number of observations of high entrepreneurialism by the number of observations of low entrepreneurialism for the Northeast. In this example that would be 460/300, or odds of about 1.53. An observation from the Northeast is about 1.5 times as likely to be high entrepreneurialism as low.
Let's try this for the Midwest, where we end up with odds of 95/249, or about 0.38. This brings up a really important point about odds. Odds of 1 are "equal odds," or 50/50. Any odds greater than 1 mean the numerator category (in our example, high entrepreneurialism) is more likely, and any odds less than 1 mean the numerator category is less likely. For the Midwest, being high entrepreneurialism is actually less likely than being low entrepreneurialism. This gets hard to interpret, so typically we take the inverse (249/95) and say that an observation from the Midwest is about 2.6 times as likely to be low entrepreneurialism as high.
We can also calculate an "odds ratio," which is the ratio of two odds, if we want to compare two groups. In Donna's example, we could compare the odds of high entrepreneurialism in the Northeast to the odds of high entrepreneurialism in the Midwest. We need to make sure both odds are in the same format (so both with high entrepreneurialism in the numerator). Now we simply take the ratio of the two odds!
(460/300)/(95/249) gives an odds ratio of about 4.02. Now we can say that the odds of an observation from the Northeast being high entrepreneurialism are about 4 times the odds for an observation from the Midwest!
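The odds and the odds ratio can be checked directly from the interior cells of Donna's table. The following is a minimal Python sketch (the function name `odds_of_high` and the dictionary layout are assumptions made for this illustration).

```python
# Interior cells of Donna's table for each location.
northeast = {"low": 300, "high": 460}
midwest = {"low": 249, "high": 95}


def odds_of_high(counts):
    """Odds of high entrepreneurialism: number high divided by number low."""
    return counts["high"] / counts["low"]


odds_northeast = odds_of_high(northeast)  # 460 / 300
odds_midwest = odds_of_high(midwest)      # 95 / 249

# Inverse odds for the Midwest: odds of low rather than high.
inverse_odds_midwest = 1 / odds_midwest   # 249 / 95

# Odds ratio: both odds have high entrepreneurialism in the numerator.
odds_ratio = odds_northeast / odds_midwest

print(f"{odds_northeast:.2f}")        # 1.53
print(f"{odds_midwest:.2f}")          # 0.38
print(f"{inverse_odds_midwest:.2f}")  # 2.62
print(f"{odds_ratio:.2f}")            # 4.02
```

Working from the unrounded counts gives an odds ratio of about 4.02. Rounding the two odds before dividing (1.5 / 0.38) gives a slightly smaller value, about 3.95, so it is safer to round only at the end.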
6.4  Lesson Summary
From this lesson, Donna can now conduct a chi-square test to answer her question about the level of entrepreneurialism in the Northeast and Midwest. Using the chi-square test, she can now move forward with her study.