Comparing Two Categorical Variables

Some categorical variables exist naturally (e.g. a person’s race, political party affiliation, or class standing), while others are created by grouping a quantitative variable (e.g. taking height and creating the groups Short, Medium, and Tall). We analyze categorical data by recording the counts or percents of cases occurring in each category. Although you can compare several categorical variables, we will consider only the relationship between two such variables.

Example

The Class Survey data set (CLASS_SURVEY.MTW or CLASS_SURVEY.XLS) consists of student responses to a survey given last semester in a Stat200 course. We can construct a two-way table showing the relationship between Smoke Cigarettes (row variable) and Gender (column variable) using either Minitab or SPSS.

To create a two-way table in Minitab:

  1. Open the Class Survey data set.
  2. From the menu bar select Stat > Tables > Cross Tabulation and Chi-Square
  3. In the text box For Rows enter the variable Smoke Cigarettes and in the text box For Columns enter the variable Gender
  4. Under Display be sure the box is checked for Counts (should be already checked as this is the default display in Minitab).
  5. Click OK

[Minitab output: two-way table of counts, Smoke Cigarettes by Gender]

To create a two-way table in SPSS:

  1. Import the data set
  2. From the menu bar select Analyze > Descriptive Statistics > Crosstabs
  3. Click on variable Smoke Cigarettes and enter this in the Rows box.
  4. Click on variable Gender and enter this in the Columns box.
  5. Click OK

This should result in the following two-way table:

[SPSS output: two-way table of counts, Smoke Cigarettes by Gender]
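For readers working outside Minitab or SPSS, the same table of counts can be tallied in a few lines of Python. This is only a sketch under stated assumptions: the four cell counts are not read from the CLASS_SURVEY file but inferred from the totals quoted in this lesson (127 females, 99 males, 120 female non-smokers, 209 non-smokers overall, so 209 - 120 = 89, 127 - 120 = 7, and 99 - 89 = 10). The names `cells`, `records`, and `two_way_table` are illustrative.

```python
from collections import Counter

# Cell counts inferred from the totals quoted in the lesson (an assumption,
# not a read of the actual CLASS_SURVEY file).
cells = {("No", "Female"): 120, ("No", "Male"): 89,
         ("Yes", "Female"): 7, ("Yes", "Male"): 10}

# Expand the counts into one (smoke, gender) record per student,
# mimicking the raw rows a survey file would contain.
records = [pair for pair, n in cells.items() for _ in range(n)]


def two_way_table(rows):
    """Tally (row level, column level) pairs and add the 'All' margins."""
    table = Counter(rows)
    for smoke, gender in rows:
        table[(smoke, "All")] += 1     # right-hand margin: Smoke Cigarettes only
        table[("All", gender)] += 1    # bottom margin: Gender only
        table[("All", "All")] += 1     # grand total
    return table


table = two_way_table(records)
for smoke in ("No", "Yes", "All"):
    print(smoke, [table[(smoke, g)] for g in ("Female", "Male", "All")])
```

The last row and last column printed here play the role of the All margins in the software output.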

The marginal distribution along the bottom (the bottom row, All) gives the distribution of Gender only (disregarding Smoke Cigarettes). The marginal distribution on the right (the values under the column All) is for Smoke Cigarettes only (disregarding Gender). Since more females (127) than males (99) participated in the survey, we should report percentages instead of counts in order to compare the cigarette smoking behavior of females and males. These percentages give the conditional distribution of Smoke Cigarettes given Gender, which treats Gender as an explanatory variable (i.e. a variable that we use to explain what is happening with another variable). The conditional percentages are calculated by dividing the number of observations at each level of Smoke Cigarettes (No, Yes) within a level of Gender (Female, Male) by the total number of observations at that level of Gender. For example, the conditional percentage of No given Female is 120/127 = 94.5%.
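The conditional percentage calculation can also be written out as a short Python sketch. The counts for the cells other than No/Female are inferred from the totals quoted in this lesson rather than read from the data file, and the names `counts` and `column_percents` are hypothetical.

```python
# Counts of Smoke Cigarettes within each Gender. Only 120, 127, 99 and 209
# appear in the lesson; the remaining cells are inferred from those totals.
counts = {
    "Female": {"No": 120, "Yes": 7},
    "Male": {"No": 89, "Yes": 10},
}


def column_percents(counts):
    """Conditional percent of each Smoke Cigarettes level given Gender."""
    result = {}
    for gender, by_smoke in counts.items():
        total = sum(by_smoke.values())  # column total for this gender
        result[gender] = {level: 100 * n / total for level, n in by_smoke.items()}
    return result


col_pct = column_percents(counts)
for gender, by_smoke in col_pct.items():
    print(gender, {level: f"{pct:.2f}%" for level, pct in by_smoke.items()})
```

Each gender's percentages sum to 100, which is a quick check that the division used the column total and not the grand total.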

We can calculate these conditional percentages using either Minitab or SPSS:

To calculate these conditional percentages using Minitab:

  1. Open the Class Survey data set.
  2. From the menu bar select Stat > Tables > Cross Tabulation and Chi-Square
  3. In the text box For Rows enter the variable Smoke Cigarettes and in the text box For Columns enter the variable Gender
  4. Under Display be sure the box is checked for Counts and also check the box for Column Percents.
  5. Click OK

[Minitab output: two-way table of counts and column percents]

To calculate these conditional percentages using SPSS:

  1. Import the data set
  2. From the menu bar select Analyze > Descriptive Statistics > Crosstabs
  3. Click on variable Smoke Cigarettes and enter this in the Rows box.
  4. Click on variable Gender and enter this in the Columns box.
  5. Click Cells and select Column under Percentages.
  6. Click Continue
  7. Click OK

This should result in the following two-way table with column percents:

[SPSS output: two-way table of counts and column percents]

Although you do not need the counts, having them visible helps show how the conditional probabilities of smoking behavior within gender are calculated. We can see from this display that the 94.49% conditional probability of No Smoking given that Gender is Female is found by dividing the number of students who are both No and Female (count of 120) by the number of Females (count of 127). The Cell Contents legend tells you what is displayed in each cell: the top value is Count and the bottom value is Percent of Column. Alternatively, we could compute the conditional probabilities of Gender given Smoking by calculating the Row Percents; for example, 120 divided by 209 gives 57.42%. This would be interpreted as follows: of those who say they do not smoke, 57.42% are Female, which means that 42.58% of those who do not smoke are Male (found by 100% - 57.42%).
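The row percent calculation can be sketched in Python as well. As before, only some of the counts appear in this lesson; the rest are inferred from the quoted totals, and `cells` and `row_percents` are illustrative names.

```python
# Cell counts keyed by (Smoke Cigarettes, Gender); inferred from the totals
# quoted in the lesson, not read from the survey file.
cells = {("No", "Female"): 120, ("No", "Male"): 89,
         ("Yes", "Female"): 7, ("Yes", "Male"): 10}


def row_percents(cells):
    """Conditional percent of each Gender given the Smoke Cigarettes level."""
    row_totals = {}
    for (smoke, _gender), n in cells.items():
        row_totals[smoke] = row_totals.get(smoke, 0) + n
    # Divide every cell by the total of its own row.
    return {key: 100 * n / row_totals[key[0]] for key, n in cells.items()}


row_pct = row_percents(cells)
for (smoke, gender), pct in row_pct.items():
    print(f"{gender} given {smoke}: {pct:.2f}%")
```

The only difference from column percents is the denominator: each cell is divided by its row total instead of its column total, so the two sets of percentages answer different questions.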

Simpson’s Paradox

Suppose, hypothetically, that observational studies of sugar and hyperactivity have been conducted, first separately for boys and girls, and then with the data combined. The following tables list these hypothetical results:

Results of hyperactivity study for boys.

  Boys          Normal   Hyper   Rate of Hyperactivity
  Low Sugar     25       50      50/75 = 0.67
  High Sugar    50       100     100/150 = 0.67

Results of hyperactivity study for girls.

  Girls         Normal   Hyper   Rate of Hyperactivity
  Low Sugar     75       25      25/100 = 0.25
  High Sugar    25       8       8/33 = 0.24

Combined results of hyperactivity studies for boys and girls.

  Combined      Normal   Hyper   Rate of Hyperactivity
  Low Sugar     100      75      75/175 = 0.43
  High Sugar    75       108     108/183 = 0.59
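The rates in the three tables can be rechecked with a short Python sketch. The counts are copied from the hypothetical tables above; the names `boys`, `girls`, `combined`, and `hyper_rate` are illustrative.

```python
# (Normal, Hyper) counts copied from the hypothetical tables.
boys = {"Low Sugar": (25, 50), "High Sugar": (50, 100)}
girls = {"Low Sugar": (75, 25), "High Sugar": (25, 8)}


def hyper_rate(normal, hyper):
    """Proportion of children in a row who are hyperactive."""
    return hyper / (normal + hyper)


# Combining the groups means adding the counts cell by cell.
combined = {
    level: (boys[level][0] + girls[level][0], boys[level][1] + girls[level][1])
    for level in boys
}

for name, table in (("Boys", boys), ("Girls", girls), ("Combined", combined)):
    for level, (normal, hyper) in table.items():
        print(f"{name:9s}{level:11s}{hyper_rate(normal, hyper):.2f}")
```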

Notice that the rates for Boys (67%) and Girls (roughly 25%) are essentially the same regardless of sugar intake. These percentages are exactly what we would expect if no relationship existed between sugar intake and activity level. However, when the two groups are combined, the hyperactivity rates do differ: 43% for Low Sugar and 59% for High Sugar. This difference appears large enough to suggest that a relationship does exist between sugar intake and activity level.

This phenomenon is known as Simpson’s Paradox, which describes the apparent change in a relationship in a two-way table when groups are combined. In this hypothetical example, boys tended to consume more sugar than girls, and also tended to be more hyperactive than girls. Together these two tendencies produce the apparent relationship in the combined table. The confounding variable, gender, should be controlled for by studying boys and girls separately rather than ignored by combining the groups.

By definition, a confounding variable is a variable whose effect is mixed together with the effect of another variable, so that the combined analysis gives a different picture than analyzing each group separately. By contrast, a lurking variable is a variable that is not included in the study but has the potential to confound. Suppose that in the previous example only the combined table had been analyzed, and a researcher later wondered whether a variable such as gender mattered. In that situation gender would be a lurking variable, because it would not have been measured and analyzed.
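One way to see where the combined rates come from is to write each one as a weighted average of the boys' and girls' rates, with weights equal to each group's share of that sugar level. The sketch below uses the counts from the hypothetical tables; `counts`, `share_of_boys`, and `weighted_rate` are illustrative names.

```python
# (Normal, Hyper) counts copied from the hypothetical tables in this section.
counts = {
    "Boys": {"Low Sugar": (25, 50), "High Sugar": (50, 100)},
    "Girls": {"Low Sugar": (75, 25), "High Sugar": (25, 8)},
}


def group_sizes(sugar_level):
    """Number of boys and girls at the given sugar level."""
    return {group: sum(counts[group][sugar_level]) for group in counts}


def share_of_boys(sugar_level):
    """Fraction of children at this sugar level who are boys."""
    sizes = group_sizes(sugar_level)
    return sizes["Boys"] / sum(sizes.values())


def weighted_rate(sugar_level):
    """Combined hyperactivity rate as a weighted average of group rates."""
    sizes = group_sizes(sugar_level)
    total = sum(sizes.values())
    rate = 0.0
    for group, size in sizes.items():
        group_rate = counts[group][sugar_level][1] / size
        rate += (size / total) * group_rate  # weight = group's share of the row
    return rate


for level in ("Low Sugar", "High Sugar"):
    print(f"{level}: boys are {share_of_boys(level):.0%} of the group, "
          f"combined rate {weighted_rate(level):.2f}")
```

Boys make up a much larger share of the High Sugar group than of the Low Sugar group, so the High Sugar average is pulled toward the boys' higher rate even though sugar makes no difference within either group.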