Lesson 6: Relationships Between Categorical Variables

Lesson Overview

We have seen that the way you display and summarize a variable depends on whether it is a categorical variable or a measurement variable. For example, a pie chart or bar graph might be used to display the distribution of a categorical variable, while a boxplot or histogram might be used to picture the distribution of a measurement variable. To study the relationship between two variables, a comparative bar graph shows associations between categorical variables, while a scatterplot illustrates associations between measurement variables. We have also learned different ways to summarize quantitative variables using measures of center and spread, and to summarize their relationships using correlation. In this lesson we focus on statistical summaries of categorical variables and their relationships.

Example 6.1: One Sample of One Categorical Variable

The following question was asked of 555 students taking STAT 100.

Survey Question: How would you describe your hometown? "Rural," "Suburban," "Small Town," or "Big City?"

The results from this question are pictured in Figure 6.2 below, where you can see that the majority of Penn State students who were enrolled in STAT 100 during that semester were from the suburbs.

A bar graph (or bar chart) is composed of several bars of equal width and differing heights arranged along the x-axis. Each bar represents one category, and its height shows the number of observations in that category.

Figure 6.2. Bar Graph of Hometown Description

The bar graph provides a more informative picture than a pie chart in this case as it allows us to see the natural ordering of the categories.

It is important to remember that before data are displayed in a bar graph like the one above, they must first be tabulated to calculate the percents that let us see the variable's distribution. For one variable, that just involves dividing the count in each category by the total to get the proportion, and then converting the proportions to percents by multiplying by 100% (if percents are desired). Table 6.1 shows the distribution and the calculations for the data in Example 6.1.

Table 6.1. Numerical Summary of Hometown Description

Hometown      Count      Proportion               Percent
Rural         75         \(75 / 555 = 0.14\)      \(0.14 \times 100\% = 14\%\)
Suburb        296        \(296 / 555 = 0.53\)     \(0.53 \times 100\% = 53\%\)
Small Town    139        \(139 / 555 = 0.25\)     \(0.25 \times 100\% = 25\%\)
Big City      45         \(45 / 555 = 0.08\)      \(0.08 \times 100\% = 8\%\)
Total         n = 555    \(555 / 555 = 1.0\)      \(1.0 \times 100\% = 100\%\)
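
The arithmetic in Table 6.1 can be reproduced with a short Python sketch. The counts come from the table; the variable names are illustrative only and are not part of the lesson.

```python
# Counts from Table 6.1 (hometown description of STAT 100 students)
counts = {"Rural": 75, "Suburb": 296, "Small Town": 139, "Big City": 45}

# Total sample size n
total = sum(counts.values())

# Proportion = category count / total; percent = proportion x 100
proportions = {name: count / total for name, count in counts.items()}
percents = {name: round(100 * p) for name, p in proportions.items()}

# Print one row per category: name, count, proportion, percent
for name in counts:
    print(f"{name:<11}{counts[name]:>5}{proportions[name]:>7.2f}{percents[name]:>5}%")
```

Each proportion is a share of the same total, so the proportions sum to 1. In this example the rounded percents also sum to 100%, although rounding will not always preserve that.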

Describing Categorical Data

Contingency Table
A contingency table shows how two categorical variables are related by giving the number of individuals that fall in each combination of categories. The categories of one variable define the rows of the table and the categories of the other variable define the columns.
Credit Card Response:
         Freshman    Sophomore    Junior    Senior    Total
Yes      42          55           76        81        254
No       58          45           24        19        146
Total    100         100          100       100       400
2×2 Table
A 2×2 table is a contingency table with 2 rows and 2 columns (i.e., it shows how two categorical variables that each have only two possible values are related).
Poverty Rate of Neighborhood
                   Uneven Sidewalks    Even Sidewalks
High (over 20%)    98                  418
Low (under 10%)    9                   301
Total              107                 719
Sample Proportion
A sample proportion is the proportion of times something happens in the sample data. The sample proportion is often used to estimate the value of a population proportion (the corresponding proportion of times it happens in the whole population).
Conditional Proportion
A conditional proportion is the proportion of times something occurs assuming a specific condition is true (such as assuming the data come from one specific row of a contingency table).
Population Probability
A probability is the proportion of times you expect something to occur when you draw randomly from a population. A conditional probability is the proportion of times you expect something to occur assuming a specific condition is true.
Odds
The odds of something happening is the proportion of times it happens divided by the proportion of times it doesn’t.
Odds Ratio
An odds ratio is the odds of something happening under one circumstance divided by the odds in another circumstance.
Risk
A risk is the proportion of individuals that have an undesirable trait.
Relative Risk
A relative risk is a risk under one circumstance divided by the risk in another circumstance.
Increased Risk
An increased risk is the percentage increase in risk under one circumstance over a baseline risk without that circumstance. Increased risk = 100% × (relative risk - 1)
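
The definitions above can be illustrated with the two tables shown in this section. The following is a minimal Python sketch, not a prescribed method: it assumes, purely for illustration, that "uneven sidewalks" is the undesirable trait and that low-poverty neighborhoods are the baseline group. The function and variable names are hypothetical.

```python
# Credit card table: counts of "Yes" and "No" responses by class year
credit = {
    "Freshman": {"Yes": 42, "No": 58},
    "Sophomore": {"Yes": 55, "No": 45},
    "Junior": {"Yes": 76, "No": 24},
    "Senior": {"Yes": 81, "No": 19},
}

# Conditional proportion: proportion answering "Yes" given the class year
cond_yes = {year: c["Yes"] / (c["Yes"] + c["No"]) for year, c in credit.items()}

# 2x2 table: sidewalk condition by neighborhood poverty rate
sidewalks = {
    "High": {"Uneven": 98, "Even": 418},
    "Low": {"Uneven": 9, "Even": 301},
}


def risk(row):
    """Proportion of the group with the trait (assumed here: uneven sidewalks)."""
    return row["Uneven"] / (row["Uneven"] + row["Even"])


def odds(row):
    """Proportion with the trait divided by the proportion without it."""
    return risk(row) / (1 - risk(row))


# Compare high-poverty neighborhoods with the assumed baseline (low poverty)
relative_risk = risk(sidewalks["High"]) / risk(sidewalks["Low"])
increased_risk = 100 * (relative_risk - 1)  # expressed as a percent
odds_ratio = odds(sidewalks["High"]) / odds(sidewalks["Low"])

print(cond_yes)
print(f"Relative risk:  {relative_risk:.2f}")
print(f"Increased risk: {increased_risk:.0f}%")
print(f"Odds ratio:     {odds_ratio:.2f}")
```

Under these assumptions the relative risk is about 6.5 and the odds ratio is about 7.8. The two measures differ because risk divides by the whole group, while odds divides by the part of the group without the trait.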

Objectives

After successfully completing this lesson, you should be able to:

  • Use a table of cross-classified data to find summaries that answer questions of interest about data. This includes the ability to:
    • Distinguish between and interpret: percentages, probabilities, conditional percentages, and conditional probabilities.
    • Distinguish between and interpret: risk, relative risk, and increased risk.
    • Distinguish between and interpret: odds and odds ratio.
  • Understand that an observed association between two variables can be misleading or even reverse direction when there is another variable that interacts strongly with both variables (Simpson's Paradox)