6: Relationships Between Categorical Variables
Lesson Overview
We have seen that the way in which you display and summarize a variable depends on whether it is a categorical variable or a measurement variable. For example, a pie chart or bar graph might be used to display the distribution of a categorical variable, while a boxplot or histogram might be used to picture the distribution of a measurement variable. To study the relationship between two variables, a comparative bar graph will show associations between categorical variables, while a scatterplot illustrates associations for measurement variables. We have also learned different ways to summarize quantitative variables with measures of center, spread, and correlation. In this lesson we focus on statistical summaries of categorical variables and their relationships.
Example 6.1: One Sample of One Categorical Variable
The following question was asked of 555 students taking STAT 100.
Survey Question: How would you describe your hometown? "Rural," "Suburban," "Small Town," or "Big City?"
The results from this question are pictured in Figure 6.2 below where you can see that the majority of Penn State students who were enrolled in STAT 100 during that semester were from the suburbs.
Figure 6.2. Bar Graph of Hometown Description
The bar graph provides a more informative picture than a pie chart in this case as it allows us to see the natural ordering of the categories.
Now, it is important to remember that before data is displayed in a bar graph like the one above, it must first be tabulated to calculate the percents that let us see the variable's distribution. For one variable, that just involves dividing the count in each category by the total to get the proportion and then converting those proportions to percents by multiplying by 100% (if percents are desired). Table 6.1 shows the distribution and the calculations for the data in Example 6.1.
Hometown  Count  Proportion  Percent 

Rural  75  \(75 / 555 = 0.14\)  \(0.14 × 100% = 14%\) 
Suburb  296  \(296 / 555 = 0.53\)  \(0.53 × 100% = 53%\) 
Small Town  139  \(139 / 555 = 0.25\)  \(0.25 × 100% = 25%\) 
Big City  45  \(45 / 555 = 0.08\)  \(0.08 × 100% = 8%\) 
Total  n = 555  \(555 / 555 = 1.0\)  \(1.0 × 100% = 100%\) 
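The tabulation in Table 6.1 can be sketched in a few lines of Python. This is only an illustration: the variable names (`counts`, `proportions`, `percents`) and the print layout are choices made here, not part of the original lesson.

```python
# Counts from Table 6.1 (hometown description, n = 555)
counts = {"Rural": 75, "Suburb": 296, "Small Town": 139, "Big City": 45}

n = sum(counts.values())  # total sample size

# Proportion = count / total; percent = proportion x 100
proportions = {k: v / n for k, v in counts.items()}
percents = {k: 100 * p for k, p in proportions.items()}

for category in counts:
    print(f"{category:<10} {counts[category]:>4} "
          f"{proportions[category]:.2f} {percents[category]:.0f}%")
```

The printed proportions and percents should agree with Table 6.1 after rounding.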
Describing Categorical Data
 Contingency Table
 A contingency table displays how two categorical variables are related in a table with how many individuals fall in each combination of categories. The categories of one variable define the rows and categories of the other variable define the columns of the table.
 Credit Card Response  Freshman  Sophomore  Junior  Senior  Total
 Yes  42  55  76  81  254
 No  58  45  24  19  146
 Total  100  100  100  100  400
 2x2 Table
 A 2×2 table is a contingency table with 2 rows and 2 columns (i.e. it shows how categorical variables that have only two possibilities each are related).
 Poverty Rate of Neighborhood  Uneven Sidewalks  Even Sidewalks
 High (over 20%)  98  418
 Low (under 10%)  9  301
 Total  107  719
 Sample Proportion
 A sample proportion is the proportion of times something happens in the sample data. The sample proportion is often used to estimate the value of a population proportion (the corresponding proportion of times it happens in the whole population).
 Conditional proportion
 A conditional proportion is the proportion of times something occurs assuming a specific condition is true (such as assuming the data come from one specific row of a contingency table).
 Population probability
 A population probability is the proportion of times you expect something to occur when you draw randomly from a population. A conditional probability is the proportion of times you expect something to occur assuming a specific condition is true.
 Odds
 The odds of something happening is the proportion of times it happens divided by the proportion of times it doesn’t.
 Odds Ratio
 An odds ratio is the odds of something happening under one circumstance divided by the odds in another circumstance.
 Risk
 A risk is a proportion of individuals that have an undesirable trait.
 Relative Risk
 A relative risk is a risk under one circumstance divided by the risk in another circumstance.
 Increased Risk
 An increased risk is the percentage increase in risk under one circumstance over a baseline risk without that circumstance. Increased risk = 100% × (relative risk − 1)
Objectives
 Use a table of cross-classified data to find summaries that answer questions of interest about data. This includes the ability to:
 Distinguish between and interpret: percentages, probabilities, conditional percentages, and conditional probabilities
 Distinguish between and interpret: risk, relative risk, and increased risk.
 Distinguish between and interpret: odds and odds ratio
 Understand that an observed association between two variables can be misleading or even reverse direction when there is another variable that interacts strongly with both variables (Simpson's Paradox)
6.1  Two Different Categorical Variables
Example 6.2
Consider the following survey question that was asked of four different samples of Penn State students: 100 freshmen (Fr), 100 sophomores (So), 100 juniors (Jr), and 100 seniors (Sr).
Question: Do you currently own at least one credit card?
 Yes
 No
The results for the responses to this question are found in Table 6.2 below.
Credit Card Response  Freshman  Sophomore  Junior  Senior  Total 

Yes  42  55  76  81  254 
No  58  45  24  19  146 
Total  100  100  100  100  400 
This is an example of a 2 × 4 contingency table because there are 2 rows and 4 columns to the data in the table. Conditioning on the class rank (i.e. looking at the distribution within each column separately), we find that the percentage of Seniors who have a credit card is 81%. Conditioning on credit card ownership, we find that the percentage of credit card holders in the study who are seniors = 81 / 254 or about 32%.
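The two ways of conditioning described above can be sketched in Python. This is a minimal illustration using the counts from Table 6.2; the names `pct_given_rank` and `pct_given_response` are invented here for clarity.

```python
# Counts from Table 6.2: rows are responses, columns are class ranks
ranks = ["Freshman", "Sophomore", "Junior", "Senior"]
table = {
    "Yes": [42, 55, 76, 81],
    "No": [58, 45, 24, 19],
}

# Column totals (one per class rank) and row totals (one per response)
col_totals = [sum(table[r][j] for r in table) for j in range(len(ranks))]
row_totals = {r: sum(vals) for r, vals in table.items()}

# Conditioning on class rank: divide each count by its column total
pct_given_rank = {
    r: [100 * table[r][j] / col_totals[j] for j in range(len(ranks))]
    for r in table
}

# Conditioning on the response: divide each count by its row total
pct_given_response = {
    r: [100 * v / row_totals[r] for v in vals] for r, vals in table.items()
}

senior = ranks.index("Senior")
print(f"Seniors who own a card: {pct_given_rank['Yes'][senior]:.0f}%")
print(f"Card owners who are seniors: {pct_given_response['Yes'][senior]:.0f}%")
```

The same cell count (81) gives two different conditional percents because the denominator changes: the column total (100 seniors) in the first case and the row total (254 card owners) in the second.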
In this example, the most relevant percentages for comparison are the ones that condition on the class rank. Table 6.3 shows the conversion of counts to percents for this sample. Each of these percents is called a conditional percent because each calculation is restricted to, or contingent on, the year in school. In this case, it was trivial to convert the counts into percents because the sample size is exactly 100 for each sample. However, since this doesn't usually happen, it is good practice to include the percentages most relevant to the problem at hand in the table and to include a total that allows the reader to quickly pick out what is adding to 100%. Of course, the comparison of interest might also be displayed graphically in a cluster bar graph. Figure 6.3 is an example of a cluster bar graph that displays the conditional percents for the data found in Table 6.3.
Credit Card Response  Freshman  Sophomore  Junior  Senior  Total 

Yes  42 / 100 = .42 (42%)  55 (55%)  76 (76%)  81 (81%)  254 (63.5%) 
No  58 / 100 = .58 (58%)  45 (45%)  24 (24%)  19 (19%)  146 (36.5%) 
Total  100 / 100 = 1.00 (100%)  100 (100%)  100 (100%)  100 (100%)  400 (100%) 
Figure 6.3. Credit Card Ownership by Year in School
The graph in Figure 6.3 above does suggest that there is a difference in the percent of Penn State students who own at least one credit card when considering the year in school. Specifically, as a Penn State student progresses from freshman to senior year, it is more likely that he or she will own at least one credit card.
You should also notice that there is redundant information on the graph because the question allows for only a "yes" or "no" response. As the percent who say "yes" increases from freshman to senior year, the percent who say "no" necessarily decreases by the same amount. This holds true because the data is summarized as percents within each school year, so the two percents always add to 100%.
6.2  Numbers That Can Describe 2×2 Tables
Here we examine five measures that are often used to describe data collected in a 2 × 2 table.
 Risk = proportion with the undesirable trait = (number with trait / total)
 Relative Risk = Risk_{1} / Risk_{2}
 Increased Risk = (Relative Risk − 1.0) × 100%
 Odds = (number or proportion with a trait / number or proportion without the trait)
 Odds Ratio = Odds_{1} / Odds_{2}
The first three of these, involving "risk", are applied in situations describing an outcome variable that is undesirable (e.g. pertaining to outcomes that are diseases or injuries). The last two, involving "odds" are applied more generally.
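As a compact restatement of the five measures, suppose (using notation introduced here only for illustration, not taken from the lesson) that group 1 has \(a\) individuals with the trait and \(b\) without, and group 2 has \(c\) with the trait and \(d\) without. Under that assumed labeling, the measures can be written as:

```latex
\begin{aligned}
\text{Risk}_1 &= \frac{a}{a+b}, \qquad \text{Risk}_2 = \frac{c}{c+d} \\
\text{Relative Risk} &= \frac{\text{Risk}_1}{\text{Risk}_2} \\
\text{Increased Risk} &= (\text{Relative Risk} - 1.0) \times 100\% \\
\text{Odds}_1 &= \frac{a}{b}, \qquad \text{Odds}_2 = \frac{c}{d} \\
\text{Odds Ratio} &= \frac{\text{Odds}_1}{\text{Odds}_2} = \frac{a\,d}{b\,c}
\end{aligned}
```

Note that risk divides by the group total, while odds divide by the count without the trait, so the two are close only when the trait is rare.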
Example 6.4
A 2013 study in the journal Medicine & Science in Sports & Exercise examined the incidence of knee injuries for male and female high school athletes in one of nine different sports, as recorded in the new National Sports-Related Injury Surveillance System. The response variable was whether the athlete experienced a knee injury in a sports competition, recorded for each game they participated in. The data are summarized in Table 6.4 below.
Experienced Knee Injury?  

Sex  Yes  No  Total number of games 
Female  1,492  6,513,792  6,515,284 
Male  3,624  10,655,200  10,658,824 
Total  5,116  17,168,992  17,174,108 
Risk
In this example, the undesirable trait (outcome) is experiencing a knee injury. So the calculated risk of injury for each gender is:
For Females: Risk = (number with trait)/total = 1492/6515284 = 0.000229 (0.0229%)
For Males: Risk = (number with trait)/total = 3624/10658824 = 0.000340 (0.0340%)
Risk interpretation: High School girls run a risk of knee injury in about 2.29 out of every 10,000 games they take part in while High School boys have a risk of about 3.40 knee injuries per 10,000 games.
Risk is just another name for a probability or proportion for an adverse outcome. Depending on the context, we might be speaking about the true population risk (probability) or about a sample-based proportion that provides an estimate of the probability. Risks are often reported in percentage terms or, for risks of rare events, in terms of cases per some large number of events (for example per 10,000 events as above).
Relative and increased risk
Risks between groups or between different situations are then compared using the relative risk and the increased risk. For example, we might compare the risk of knee injuries for males to the risk for females competing in high school sports.
Relative risk = Risk_{males} / Risk_{females} = 3.40 / 2.29 ≈ 1.485
Relative risk interpretation: For each high school sporting event they take part in, a male athlete is about 1.485 times as likely to experience a knee injury as a female athlete.
Increased risk = (Relative Risk − 1.0) × 100% = (1.485 − 1.0) × 100% = 48.5%
Increased risk interpretation: The risk of knee injury during a high school sporting event is 48.5% higher for male athletes than for female athletes.
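The risk calculations for Example 6.4 can be reproduced with a short Python sketch. The counts come from Table 6.4; the variable names are illustrative choices made here.

```python
# Counts from Table 6.4: knee injuries and total games by sex
injuries = {"Female": 1492, "Male": 3624}
games = {"Female": 6_515_284, "Male": 10_658_824}

# Risk = number with the trait / total
risk = {sex: injuries[sex] / games[sex] for sex in injuries}

# Relative risk compares males to females (females are the baseline)
relative_risk = risk["Male"] / risk["Female"]

# Increased risk = (relative risk - 1.0) x 100%
increased_risk = (relative_risk - 1.0) * 100

for sex, r in risk.items():
    print(f"{sex}: {r * 10_000:.2f} knee injuries per 10,000 games")
print(f"Relative risk: {relative_risk:.3f}")
print(f"Increased risk: {increased_risk:.1f}%")
```

Swapping which group serves as the baseline would give the reciprocal relative risk, so it is worth stating the baseline group whenever a relative risk is reported.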
Relative risks and increased risks are reported in the news all the time. However, these measures are almost always descriptive statistics arising from observational data so be careful to examine possible confounders before drawing conclusions. In this example, high school boys are seen to have a greater risk of knee injury than high school girls competing in sports competitions. However, there is a very different mix of the type of sports played and it turns out that girls have a higher risk if you condition on a particular sport played by both sexes like basketball, volleyball, or soccer. The data for the males is strongly affected by football which has the highest rate of knee injuries of any sport and is played exclusively in high school by boys.
Cautions
Some ways you can be misled...
Watch the baseline:
In reading about relative or increased risks, it is important to keep in mind the baseline risk when deciding if a risk is acceptable for your own life. For example, a recent report found that the risk of dying in a plane crash is about 2.7 times greater when flying on a commuter airline compared with flying on a major carrier. This high relative risk might dissuade you from ever flying on a commuter airline. But, for major U.S. carriers, the chance of being killed on a particular flight is about 1 in 20 million, so the risk is very small regardless.
Think about whether the risk pertains to you:
As mentioned, the risk of knee injuries in high school sports depends a great deal on the particular sport. If you are taking part in swimming or diving competitions, knee injuries occur in just one or two swim meets out of every 100,000. On the other hand, a boy playing football might injure his knee in about 1 or 2 of every 1000 games. That latter risk might be a deterrent to some when considering the risk during the 40 or 50 games that would be played over a high school player's four years in school. The lesson: risk calculations are often conditional on a specific population or situation. To evaluate your own risk, it is important to consider how closely the situation fits your own.
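Both cautions can be checked with simple arithmetic. The sketch below converts the airline relative risk into an absolute risk using the stated baseline, and then approximates the chance of at least one knee injury over a football player's high school career. The second calculation rests on assumptions made here and not in the lesson: games are treated as independent with a constant per-game risk, and the values 1.5 per 1000 and 45 games are illustrative midpoints of the ranges quoted above.

```python
# Watch the baseline: absolute risk = baseline risk x relative risk
baseline_risk = 1 / 20_000_000   # major carrier, per flight (from the text)
relative_risk = 2.7              # commuter vs. major carrier (from the text)
commuter_risk = baseline_risk * relative_risk
print(f"Commuter risk: about 1 in {1 / commuter_risk:,.0f} flights")

# Does the risk pertain to you? Cumulative risk over many games.
# ASSUMPTION: games are independent with the same per-game risk.
per_game_risk = 1.5 / 1000       # illustrative midpoint of "1 or 2 per 1000"
n_games = 45                     # illustrative midpoint of "40 or 50 games"
career_risk = 1 - (1 - per_game_risk) ** n_games
print(f"Chance of at least one knee injury: {career_risk:.1%}")
```

A relative risk of 2.7 applied to a tiny baseline still gives a tiny absolute risk, whereas a small per-game risk repeated over dozens of games can add up to a chance of several percent.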
Example 6.5
An article in The Journal of Epidemiology and Community Health examined how well sidewalks are maintained in different neighborhoods in St. Louis, Missouri. A number of sidewalk segments were randomly selected from throughout the city and classified as to whether the sidewalk was too uneven for walking or whether it was satisfactory for walking. Some data from this study is provided in Table 6.5 below, which focuses on how the condition of the sidewalks relates to the poverty rate of the households in the neighborhood. Low poverty rate areas are those with less than 10% of the households in poverty, while high poverty rate neighborhoods would be those with more than 20% of the households in poverty.
Poverty Rate of neighborhood  Uneven Sidewalks  Even Sidewalks  Total 
High (over 20%)  98  418  516 
Low (under 10%)  9  301  310 
Total  107  719  826 
You should be able to answer questions like the following about this table:

What percent of the sidewalk segments were too uneven for walking? Answer = 107 / 826 or about 13%

What percent of the sidewalk segments in high poverty areas were too uneven for walking? Answer = 98 / 516 or about 19%

What are the odds that a sidewalk would be too uneven for walking? Answer = 107 / 719 = 0.149 or about 1 to 6.7

What is the ratio of the odds that a sidewalk segment is too uneven in a high poverty area to the odds of being too uneven in a low poverty area? Answer = (98 / 418) ÷ (9 / 301) = (98 × 301) / (418 × 9) ≈ 7.84
Odds Ratio
The last question leads us to the odds ratio.
Odds ratio interpretation: The odds of a sidewalk being uneven are about 7.84 times as big in a high poverty neighborhood compared with a low poverty area.
An interesting property of the odds ratio is that it comes out the same regardless of whether you condition on the explanatory variable or on the response variable (e.g. it will be the same for prospective studies that condition on the exposure and for retrospective studies that condition on the outcome). In the sidewalk study, the odds that an uneven sidewalk is in a high poverty area are 98 / 9 and the odds that an even sidewalk is in a high poverty area are 418 / 301. The odds ratio is then (98 / 9) ÷ (418 / 301) = (98 × 301) / (418 × 9) ≈ 7.84
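The invariance of the odds ratio can be verified numerically. The following Python sketch uses the counts from Table 6.5 and computes the odds ratio both ways; the variable names are illustrative choices made here.

```python
# Counts from Table 6.5
uneven = {"High": 98, "Low": 9}
even = {"High": 418, "Low": 301}

# Overall odds that a segment is uneven: uneven count / even count
overall_odds = sum(uneven.values()) / sum(even.values())

# Conditioning on poverty rate: odds of "uneven" within each neighborhood type
odds_high = uneven["High"] / even["High"]
odds_low = uneven["Low"] / even["Low"]
odds_ratio_by_poverty = odds_high / odds_low

# Conditioning on sidewalk condition: odds of "high poverty" within each condition
odds_uneven = uneven["High"] / uneven["Low"]
odds_even = even["High"] / even["Low"]
odds_ratio_by_condition = odds_uneven / odds_even

print(f"Overall odds of uneven: {overall_odds:.3f}")
print(f"Odds ratio (conditioning on poverty rate): {odds_ratio_by_poverty:.2f}")
print(f"Odds ratio (conditioning on condition):    {odds_ratio_by_condition:.2f}")
```

Both versions reduce to the same cross-product of the four cell counts, which is why the two results match.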
6.3  Simpson's Paradox
Example 6.6
The Civil Rights Act of 1964 outlawed discriminatory practices based on race in voting rights, segregation in schools, workplace rules, and at facilities that serve the public. After a long filibuster by some southern Senators, the final bill was approved in a bipartisan vote in June of 1964.
The following tables show the results of the votes in both the Senate and the House broken down by political party and region of the country.
Table 6.6: House and Senate Votes
The House Vote  Democrats  Republicans  

Yes  No  Yes  No  
Northern  145  9  138  24 
Southern  7  87  0  10 
Total  152  96  138  34 
The Senate Vote  Democrats  Republicans  

Yes  No  Yes  No  
Northern  45  1  27  5 
Southern  1  20  0  1 
Total  46  21  27  6 
Try it!  By Party
Use Table 6.6 above to answer the following questions (the answers are all shown in Table 6.7, but try not to look there until you have tried them on your own).

What percent of Democrats in the House voted for the bill? Answer = 61%

What percent of Republicans in the House voted for the bill? Answer = 80%

What percent of Democrats in the Senate voted for the bill? Answer = 69%

What percent of Republicans in the Senate voted for the bill? Answer = 82%

In each chamber of Congress, what party voted proportionately more for the Civil Rights Bill of 1964? Answer = Republicans
Table 6.7: House and Senate Votes with Percentages
The House Vote  Democrats  Republicans  

Yes  No  Yes  No  
Northern  145 (94%)  9  138 (85%)  24 
Southern  7 (7%)  87  0 (0%)  10 
Total  152 (61%)  96  138 (80%)  34 
The Senate Vote  Democrats  Republicans  

Yes  No  Yes  No  
Northern  45 (98%)  1  27 (84%)  5 
Southern  1 (5%)  20  0 (0%)  1 
Total  46 (69%)  21  27 (82%)  6 
Try it!  By Region
Now use Table 6.6 to consider the following questions where the region represented (north or south) is taken into account (again the answers are all shown in Table 6.7, but try not to look there until you have tried them on your own).

What percent of Northern Senate Democrats voted for the bill? How about Southern Senate Democrats? Answer = Northern Democrats: 98%, Southern Democrats: 5%

What percent of Northern Senate Republicans voted for the bill? How about Southern Senate Republicans? Answer = Northern Republicans: 84%, Southern Republicans: 0%

What percent of Northern House Democrats voted for the bill? How about Southern House Democrats? Answer = Northern Democrats: 94%, Southern Democrats: 7%

What percent of Northern House Republicans voted for the bill? How about Southern House Republicans? Answer = Northern Republicans: 85%, Southern Republicans: 0%

In each chamber of Congress, and in each region of the country, what party voted proportionately more for the Civil Rights Bill of 1964? Answer = Democrats, in both chambers and in both regions
Simpson’s Paradox:
Answering the above questions leads to an interesting finding. In each chamber of Congress, Democrats had a higher percentage than Republicans voting for the bill in both the North and the South (the only two possibilities), and yet they had a lower percentage overall. Look again at the numbers in Table 6.7 and you will see how that happened. Back in 1964, there was a huge imbalance in representation by region: most Southern Senators and Representatives were Democrats at the time, and the negative votes were almost all associated with that region.
The relationship between party and vote on the civil rights bill was highly affected by a third variable: the region represented. This is an example of Simpson's Paradox.
An observed association between two variables can change or even reverse direction when there is another variable that interacts strongly with both variables.
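The reversal can be checked directly from the counts in Table 6.6. The Python sketch below computes the percent voting yes for each party within each region and then again after pooling the regions. The nested-dictionary layout and the helper `pct_yes` are illustrative choices made here, not part of the lesson.

```python
# (yes, no) vote counts from Table 6.6, keyed by chamber, region, and party
votes = {
    "House": {
        "Northern": {"Democrats": (145, 9), "Republicans": (138, 24)},
        "Southern": {"Democrats": (7, 87), "Republicans": (0, 10)},
    },
    "Senate": {
        "Northern": {"Democrats": (45, 1), "Republicans": (27, 5)},
        "Southern": {"Democrats": (1, 20), "Republicans": (0, 1)},
    },
}

def pct_yes(yes, no):
    """Percent voting yes among those who voted yes or no."""
    return 100 * yes / (yes + no)

for chamber, regions in votes.items():
    # Within each region, compare the two parties
    for region, parties in regions.items():
        d = pct_yes(*parties["Democrats"])
        r = pct_yes(*parties["Republicans"])
        print(f"{chamber} {region}: Democrats {d:.0f}%, Republicans {r:.0f}%")

    # Pool the regions and compare again
    overall = {}
    for party in ("Democrats", "Republicans"):
        yes = sum(regions[reg][party][0] for reg in regions)
        no = sum(regions[reg][party][1] for reg in regions)
        overall[party] = pct_yes(yes, no)
    print(f"{chamber} overall: Democrats {overall['Democrats']:.0f}%, "
          f"Republicans {overall['Republicans']:.0f}%")
```

Within every region the Democratic percentage is higher, yet the pooled percentage is lower, because the Democratic totals are weighted much more heavily toward the Southern delegations.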
Example 6.7: Smoking and Survival
1314 women took part in a study of heart disease and smoking that was conducted in 1972-1974 in Newcastle, United Kingdom. A follow-up study of the same subjects was conducted twenty years later. Of the 582 women who were smokers in the original study, 76.2% were still alive in the follow-up study. Of the 732 nonsmokers, 68.6% were still alive twenty years later. Does this show a beneficial effect of smoking? What might have caused this counterintuitive result?
6.5  Have Fun With It!
J.B. Landers ©