2.1 - Categorical Variables


Categorical variables are discussed in Sections 2.1 and P.1 of the Lock5 textbook.

Variables can be classified as categorical or quantitative. In this section of the lesson, we will be focusing on categorical variables. Categorical variables are those that provide groupings that may have no logical order, or a logical order with inconsistent differences between groups (e.g., the difference between 1 and 2 is not equivalent to the difference between 3 and 4).

This course includes many examples and practice problems for you. Many of these will apply the concepts that we learn to experiments involving rolling a die or randomly selecting a card from a standard 52-card deck. If you are unfamiliar with either of these, take a moment here to review.

Die

A standard die has 6 sides: 1, 2, 3, 4, 5, 6

52-Card Deck

A standard 52-card deck of playing cards has 13 Hearts, 13 Diamonds, 13 Spades, and 13 Clubs. Hearts (♥) and Diamonds (♦) are red suits. Spades (♠) and Clubs (♣) are black suits. For each suit, there is a 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King, and Ace. Jacks, Queens, and Kings are "face cards."

Figure: a standard 52-card deck arranged by suit (Clubs, Spades, Hearts, Diamonds), with each suit running from Ace, King, Queen, Jack, and 10 down to 2

2.1.1 - One Categorical Variable


Data concerning one categorical variable can be summarized using a proportion.

Proportion
\(Proportion=\dfrac{Number\;in\;the\;category}{Total\;number}\)

The symbol for a sample proportion is \(\widehat{p}\) and is read as "p-hat." The symbol for a population proportion is \(p\). 

The formula for a sample proportion may also be written as \(\widehat p = \frac{x}{n}\) where \(x\) is the number in the sample with the trait of interest and \(n\) is the sample size.

A proportion must be between 0 and 1.00.

Example: Black Cards

A standard 52-card deck contains \(26\) red cards and \(26\) black cards. What proportion of cards are black?

\(p=\dfrac{26}{52}=0.50\)

The symbol \(p\) was used because this is the proportion of all cards (i.e., the population) that are black.

Example: World Campus Undergraduate Students

In the Fall 2014 semester, there were \(82,382\) undergraduate students enrolled in Penn State. Of those, \(6,245\) were World Campus students. What proportion of all Penn State undergraduate students were World Campus students?

\(p=\dfrac{6245}{82382}=0.076\)

The symbol \(p\) was used because this is the proportion of all Penn State undergraduate students (i.e., the population) that are World Campus students.

Example: Broken Cookies

In a sample of \(30\) randomly selected packages of chocolate chip cookies, \(18\) contained broken cookies. What proportion of these selected packages had broken cookies?

\(\widehat{p}=\dfrac{18}{30}=0.60\)

These data were collected from a sample so the symbol \(\widehat{p}\) was used to denote a sample proportion. 
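The three examples above all apply the same formula. As a minimal sketch in Python (the function name `proportion` and the variable names are illustrative assumptions, not part of the course materials), the calculation could look like this:

```python
def proportion(count_in_category, total):
    """Return (number in the category) / (total number)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= count_in_category <= total:
        raise ValueError("count must be between 0 and total")
    return count_in_category / total


# Population proportions (p): every card / every student is counted
black_cards = proportion(26, 52)
world_campus = proportion(6245, 82382)

# Sample proportion (p-hat): only 30 sampled packages were inspected
broken_cookies = proportion(18, 30)

print(round(black_cards, 2), round(world_campus, 3), round(broken_cookies, 2))
```

The input checks mirror the rule that a proportion must fall between 0 and 1; whether a result is labeled \(p\) or \(\widehat{p}\) depends on where the data came from, not on the arithmetic.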


2.1.1.1 - Risk and Odds


You may have heard the terms risk and odds before. They are both ways to communicate the likelihood of an event.

Risk and odds are often confused with one another. The formulas for computing risk and odds are different and their interpretations are different.

Risk

In statistics, the word risk communicates the likelihood of an event occurring. This is synonymous with probability or proportion (i.e., the formulas are the same).

Risk
The probability that an event will occur. It may be written as a decimal, a fraction, or a percent.
Risk
\(Risk= \dfrac{number \;with \;the\; outcome}{total\;number\;of\;outcomes}\)

Example: Asthma Risk

\(60\) out of \(1000\) teens have asthma.

\(risk=\dfrac{60}{1000}=0.06\)

This means that \(6\%\) of teens experience asthma.

Example: Flu Risk

\(45\) out of \(100\) children get the flu each year.

\(risk=\dfrac{45}{100}=0.45\) or \(45\%\)

Odds

Odds
Express risk by comparing the likelihood of an event happening to the likelihood it does not happen.
Odds

\(odds = \dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}\)

OR

\(odds=\dfrac{risk}{1-risk}\)

We often interpret odds in relation to the value of 1. For example, if the odds of a game are in favor of the house 2 to 1, that means for every 2 games the house wins it will lose 1. 

Example: Passing Odds

In one large class, 850 students passed an exam while 150 students failed. Because we have the raw counts, we can use the first odds formula.

\(odds=\dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}=\dfrac{850}{150}=5.667\)

The odds of passing were 5.667 to 1. In other words, for every 5.667 students who passed the exam there was 1 who failed.

Example: Flu Odds

The risk of a child getting the flu is \(45\%\) which can also be written as \(0.45\). Because we have the risk, we can use the second odds formula.

\(odds=\dfrac{risk}{1-risk}=\dfrac{0.45}{1-0.45}=\dfrac{0.45}{0.55}=0.818\)

The odds of a child getting the flu are \(0.818\) to \(1\).
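The two odds formulas can be sketched in Python as follows. The function names are illustrative assumptions. The last function, which converts odds back to risk, is not given in this lesson; it is simply the second odds formula solved for risk.

```python
def odds_from_counts(with_outcome, without_outcome):
    """odds = number with the outcome / number without the outcome."""
    return with_outcome / without_outcome


def odds_from_risk(risk):
    """odds = risk / (1 - risk); undefined when risk equals 1."""
    if not 0 <= risk < 1:
        raise ValueError("risk must be at least 0 and less than 1")
    return risk / (1 - risk)


def risk_from_odds(odds):
    """Algebraic inverse of odds_from_risk: risk = odds / (1 + odds)."""
    return odds / (1 + odds)


passing_odds = odds_from_counts(850, 150)  # passing example
flu_odds = odds_from_risk(0.45)            # flu example

print(round(passing_odds, 3), round(flu_odds, 3))
print(round(risk_from_odds(passing_odds), 2))  # risk of passing: 850 / 1000
```

Note how different the two measures are for the same data: the risk of passing is 0.85, while the odds of passing are about 5.667 to 1.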


2.1.1.2 - Visual Representations


Frequency tables, pie charts, and bar charts can all be used to display data concerning one categorical (i.e., nominal- or ordinal-level) variable. Below are descriptions for each along with some examples. At the end of this lesson you will learn how to construct each of these using Minitab.

Frequency Tables

A frequency table contains the counts of how often each value occurs in the dataset. Some statistical software, such as Minitab, will use the term tally to describe a frequency table. Frequency tables are most commonly used with nominal- and ordinal-level variables, though they may also be used with interval- or ratio-level variables if there are a limited number of possible outcomes.

In addition to containing counts, some frequency tables may also include the percent of the dataset that falls into each category, and some may include cumulative values. A cumulative count is the number of cases in that category and all previous categories. A cumulative percent is the percent in that category and all previous categories. Cumulative counts and cumulative percentages should only be presented when the data are at least ordinal-level. 

The first example is a frequency table displaying the counts and percentages for Penn State undergraduate student enrollment by campus. Because this is a nominal-level variable, cumulative values were not included.

 

Frequencies of Campus
Campus Count Percent
University Park 40,639 50.1%
Commonwealth Campuses 27,100 33.4%
PA College of Technology 4,981 6.1%
World Campus 8,360 10.3%
Total 81,080 100%

Penn State Fall 2019 Undergraduate Enrollments

 

The next example is a frequency table for an ordinal-level variable: class standing. Because ordinal-level variables have a meaningful order, we sometimes want to look at the cumulative counts or cumulative percents, which tell us the number or percent of cases at or below that level.

As an example, let's interpret the values in the "Sophomore" row. There are 22 sophomore students in this sample. There are 27 students who are sophomore or below (i.e., first-year or sophomore). In terms of percentages, 34.4% of students are sophomores and 42.2% of students are sophomores or below.

Frequencies of Class Standing
Class Standing Count Cumulative Count Percent Cumulative Percent
First-Year 5 5 7.8% 7.8%
Sophomore 22 27 34.4% 42.2%
Junior 17 44 26.6% 68.8%
Senior 20 64 31.3% 100.0%
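To see where the cumulative columns come from, here is a minimal Python sketch that rebuilds the class standing table from its counts. The variable names are illustrative assumptions; the counts are those in the table above. Percentages are printed to two decimals, and the table above reports them rounded to one.

```python
# Counts by class standing, listed in their natural (ordinal) order
counts = {"First-Year": 5, "Sophomore": 22, "Junior": 17, "Senior": 20}

total = sum(counts.values())
rows = []
running = 0
for level, count in counts.items():
    running += count  # this category plus all previous categories
    rows.append((level, count, running,
                 100 * count / total, 100 * running / total))

print(f"{'Class Standing':<16}{'Count':>6}{'Cum Count':>11}{'Percent':>10}{'Cum Pct':>10}")
for level, count, cum_count, pct, cum_pct in rows:
    print(f"{level:<16}{count:>6}{cum_count:>11}{pct:>9.2f}%{cum_pct:>9.2f}%")
```

The running total only makes sense because the categories are entered in order, which is why cumulative values are reserved for data that are at least ordinal-level.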

Pie Charts

A pie chart displays data concerning one categorical variable by partitioning a circle into "slices" that represent the proportion in each category. When constructing a pie chart, pay special attention to the colors being used to ensure that it is accessible to individuals with different types of colorblindness. 

Pie Chart of Campus
Category
  •  University Park (48.5%)
  •  Commonwealth Campuses (34.9%)
  •  PA College of Technology (6.5%)
  •  World Campus (10.1%)
Penn State Fall 2017 Undergraduate Enrollments

Bar Charts

A bar chart is a graph that can be used to display data concerning one nominal- or ordinal-level variable. The bars, which may be vertical or horizontal, symbolize the number of cases in each category. Note that the bars on a bar chart are separated by spaces; this communicates that this is a categorical variable.

The first example below is a bar chart with vertical bars. The second example is a bar chart with horizontal bars. Both examples are displaying the same data. On both charts, the size of the bar represents the number of cases in that category. 

Bar chart of undergraduate enrollment with vertical bars: Campus (University Park, Commonwealth Campuses, PA College of Technology, World Campus) on the x-axis and Count (0 to 40,000) on the y-axis

Penn State Fall 2019 Undergraduate Enrollments

 

Bar chart of undergraduate enrollment with horizontal bars: Count (0 to 40,000) on the x-axis and Campus (University Park, Commonwealth Campuses, PA College of Technology, World Campus) on the y-axis

Penn State Fall 2019 Undergraduate Enrollments

Considerations

Pie charts tend to work best when there are only a few categories. If a variable has many categories, a pie chart may be difficult to read. In those cases, a frequency table or bar chart may be more appropriate. Each visual display has its own strengths and weaknesses. When first starting out, you may need to make a few different types of displays to determine which most clearly communicates your data.


2.1.1.2.1 - Minitab: Frequency Tables


Minitab®  – Frequency Table

This example will use data collected from a sample of STAT 200 students. These data can be downloaded using:

WCStudentData.xlsx

To create a frequency table of the primary campus variable in Minitab:

  1. Open the data file in Minitab
  2. From the tool bar, select Stat > Tables > Tally Individual Variables
  3. Double click the variable Primary Campus in the box on the left to insert it into the Variable box on the right
  4. Under Statistics, check Counts and Percents
  5. Click OK

This should result in the following frequency table:

Tally
Primary Campus Count Percent
Commonwealth Campus 5 1.46
University Park 223 65.01
World Campus 115 33.53
N= 343  


2.1.1.2.2 - Minitab: Pie Charts


Minitab®  – Pie Chart (Raw Data)

This example will use data collected from a sample of students enrolled in online sections of STAT 200 during the Summer 2020 semester. These data can be downloaded as a CSV file:

WCStudentData.csv

To create a pie chart using raw data:

  1. Open the data file in Minitab 
  2. From the tool bar, select Graph > Pie Chart...
  3. Select Counts of Unique Values
  4. Click OK
  5. Double click the variable Primary Campus in the box on the left to insert it into the Categorical variables box on the right
  6. Click OK

This should result in the pie chart below:

Pie chart of primary campus made in Minitab

Minitab®  – Pie Chart (Summarized Data)

In the example above, raw data were used. In other words, the data file contained one row for each case. It is also possible to use Minitab to construct a pie chart with summarized data, for example, if you have your counts in a frequency table. If this is the case, follow the steps below. This example uses the following data concerning Penn State undergraduate enrollment:

Enrollment by Campus
Campus Count
University Park 40,639
Commonwealth Campuses 27,100
PA College of Technology 4,981
World Campus 8,360

Penn State Fall 2019 Undergraduate Enrollments

 

To create a pie chart using summarized data:

  1. Enter the data into a blank Minitab worksheet with one column containing the Campus names and a second column containing the Count for each campus
  2. From the tool bar, select Graph > Pie Chart...
  3. Select Summarized Data in a Table
  4. Click OK
  5. Double click Campus in the box on the left to insert it into the Categorical variable box on the right
  6. Double click Count in the box on the left to insert it into the Summary variables box on the right
  7. Click OK

This should result in the pie chart below:

Pie chart of primary campus made in Minitab using summarized data in a table


2.1.1.2.3 - Minitab: Bar Charts


Minitab®  – Bar Chart (Raw Data)

This example will use data collected from a sample of students enrolled in online sections of STAT 200 during the Summer 2020 semester. These data can be downloaded as a CSV file:

WCStudentData.csv

To create a bar graph of the primary campus variable in Minitab:

  1. Open the data file in Minitab
  2. From the tool bar, select Graph > Bar Chart > Counts of Unique Values...
  3. Select One Variable
  4. Click OK
  5. Double click the variable Primary Campus in the box on the left to insert it into the Categorical variable box on the right
  6. Click OK

This should result in the bar graph below:

Bar chart of primary campus made using Minitab

Minitab®  – Bar Chart (Summarized Data)

In the example above, raw data were used. In other words, the data file contained one row for each case. It is also possible to use Minitab to construct a bar chart with summarized data, for example, if you have your counts in a frequency table. If this is the case, follow the steps below. This example uses the following data concerning Penn State undergraduate enrollment:

Enrollment by Campus
Campus Count
University Park 40,639
Commonwealth Campuses 27,100
PA College of Technology 4,981
World Campus 8,360

Penn State Fall 2019 Undergraduate Enrollments

 

To create a bar chart using summarized data:

  1. Enter the data into a blank Minitab worksheet with one column containing the Campus names and a second column containing the Count for each campus
  2. From the tool bar, select Graph > Bar Chart > Summarized Data in a Table...
  3. Under One Column of Values, select Simple
  4. Click OK
  5. Double click Count in the box on the left to insert it into the Y-variable box on the right
  6. Double click Campus in the box on the left to insert it into the Categorical variable box on the right
  7. Click OK

This should result in the bar chart below:

Bar chart of enrollment made using data in a summarized table


2.1.2 - Two Categorical Variables


Data concerning two categorical (i.e., nominal- or ordinal-level) variables can be displayed in a two-way contingency table, clustered bar chart, or stacked bar chart. Here, we'll look at an example of each. At the end of this lesson, you will learn how Minitab can be used to make two-way contingency tables and clustered bar charts.

Two-Way Contingency Table

A two-way contingency table, also known as a two-way table or just a contingency table, displays data from two categorical variables. This is similar to the frequency tables we saw in the last lesson, but with two dimensions. One variable will be represented in the rows and a second variable will be represented in the columns. Later in this lesson we'll see how a two-way table can be used to compute a variety of different proportions.

The example below displays the counts of Penn State undergraduate and graduate students who are Pennsylvania residents and not Pennsylvania residents.

Two-Way Table of Penn State Enrollment by Academic Level & State Residency
  PA Resident Non-PA Resident Total
Undergraduate 54,239 26,841 81,080
Graduate 5,596 9,732 15,328
Total 59,835 36,573 96,408

Stacked Bar Chart

A stacked bar chart is also known as a segmented bar chart. One categorical variable is represented on the x-axis and the second categorical variable is displayed as different parts (i.e., segments) of each bar. The stacked bar chart below was constructed using the statistical software program R.

Stacked bar chart of Penn State enrollment by academic level and PA state residency: residency (PA Resident, Non-PA Resident) on the x-axis, counts on the y-axis, and segments for academic level (Undergraduate, Graduate)

On this stacked bar chart, the bar on the left represents the number of students who are Pennsylvania residents. The bar on the right represents the number of students who are not Pennsylvania residents. The bottom of each bar, which is light green, represents the number of students who are enrolled at the undergraduate level. The top of each bar, which is blue, represents the number of students who are enrolled at the graduate level.

From this bar chart, we can see that overall there are more students who are Pennsylvania residents than non-Pennsylvania residents because the bar on the left is higher than the bar on the right. In both bars, the light green section is much bigger than the blue section, which tells us that there are more undergraduate students than graduate students in both groups.

The light green section is bigger in the left bar compared to the right bar, which tells us that undergraduate students are more likely to be Pennsylvania residents. The blue section is bigger in the right bar compared to the left bar, which tells us that graduate students are more likely to be non-Pennsylvania residents.

Clustered Bar Chart

In a clustered bar chart each bar represents one combination of the two categorical variables. If you compare this to the two-way contingency table above, each bar represents the value in one cell. This is also known as a side-by-side bar chart. The clustered bar chart below was made using Minitab.

Clustered bar chart of Penn State enrollment by academic level and PA state residency: bars for Undergraduate and Graduate students clustered by residency (PA Resident, Non-PA Resident), with counts from 0 to 60,000 on the y-axis

Choosing the Best Visual Display

The two-way contingency table, stacked bar chart, and clustered bar chart shown above were all made using the same data concerning Penn State enrollments by academic level and state residency. The best visual display depends on the scenario. For example, if our primary goal was to compare the number of students who are Pennsylvania residents and non-Pennsylvania residents, and academic level was a secondary variable of interest, the stacked bar chart may be preferred. If we wanted to compare the number of students in each combination of academic level and state residency to see which groups were largest and smallest, the clustered bar chart may be preferred. Often, more than one of these graphs may be appropriate. 


2.1.2.1 - Minitab: Two-Way Contingency Table


Minitab®  – Two-Way Contingency Table

This example will use data collected from a sample of students enrolled in online sections of STAT 200. 

WCStudentData.csv

To create a two-way table of the Work Status and Primary Campus variables in Minitab:

  1. Open the data file in Minitab 
  2. From the tool bar, select Stat > Tables > Cross Tabulation and Chi-Square
  3. We have a data file where each row represents one case, so we will keep the default data entry method of Raw data (categorical variables) in the drop down menu
  4. Click in the Rows box, then double click the variable Work Status to insert it into the Rows box on the right
  5. Click in the Columns box, then double click the variable Primary Campus to insert it into the Columns box on the right
  6. Click OK

This should result in the two-way table below:

Tabulated Statistics: Work Status, Primary Campus
Rows: Work Status  Columns: Primary Campus
  Commonwealth Campus University Park World Campus All
Full-time 0 26 78 104
Not working 1 99 25 125
Part-Time 4 96 12 112
Missing 0 2 0 *
All 5 221 115 341
Cell Contents: Count  

Additional Display Options

The default in Minitab is to display the counts. Under Display you also have the option to select Row percents, Column percents, and Total percents.

 

Screenshot of Minitab showing where to select the row percents, column percents, and total percents

 

The output below is what you would get if you selected all four display options:

 

Tabulated Statistics: Work Status, Primary Campus
Rows: Work Status  Columns: Primary Campus
                  Commonwealth Campus  University Park  World Campus      All
Full-time
    Count                           0               26            78      104
    % of Row                     0.00            25.00         75.00   100.00
    % of Column                  0.00            11.76         67.83    30.50
    % of Total                   0.00             7.62         22.87    30.50
Not working
    Count                           1               99            25      125
    % of Row                     0.80            79.20         20.00   100.00
    % of Column                 20.00            44.80         21.74    36.66
    % of Total                   0.29            29.03          7.33    36.66
Part-Time
    Count                           4               96            12      112
    % of Row                     3.57            85.71         10.71   100.00
    % of Column                 80.00            43.44         10.43    32.84
    % of Total                   1.17            28.15          3.52    32.84
Missing
    Count                           0                2             0        *
    % of Row                        *                *             *        *
    % of Column                     *                *             *        *
    % of Total                      *                *             *        *
All
    Count                           5              221           115      341
    % of Row                     1.47            64.81         33.72   100.00
    % of Column                100.00           100.00        100.00   100.00
    % of Total                   1.47            64.81         33.72   100.00

Cell Contents
    Count
    % of Row
    % of Column
    % of Total

 

 

Here, each cell contains four values. The top number in each cell is the count. This is the number of students in that group. For example, there were 78 World Campus students who were working full-time.

The second number in each cell is the percentage of the row. In the cell for World Campus students working full-time, that value is 75.00. The row represents the students who were working full-time. This means that 75% of all students who were working full time were World Campus students. This is an example of a conditional probability: P(World Campus | Full-Time).

The third number in each cell is the percentage for that column. In the cell for World Campus students working full-time, that value is 67.83. The column represents World Campus. This means that 67.83% of all World Campus students were working full-time. This is an example of a conditional probability: P(Full-Time | World Campus).

The last number in each cell is the percentage of the total. In the cell for World Campus students working full-time, that value is 22.87. This means that 22.87% of all students who completed this survey were World Campus students who were working full-time. This is an example of an intersection: P(World Campus ∩ Full-Time).
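The three percentages differ only in their denominators. A minimal Python sketch (the dictionary layout and the function name `cell_percents` are illustrative assumptions, not Minitab functionality) reproduces the values for the World Campus, full-time cell from the counts alone. As in the Minitab output, the cases with a missing work status are left out.

```python
# Counts from the two-way table (rows: work status, columns: primary campus)
table = {
    "Full-time":   {"Commonwealth Campus": 0, "University Park": 26, "World Campus": 78},
    "Not working": {"Commonwealth Campus": 1, "University Park": 99, "World Campus": 25},
    "Part-Time":   {"Commonwealth Campus": 4, "University Park": 96, "World Campus": 12},
}


def cell_percents(table, row, col):
    """Return the count and the row, column, and total percents for one cell."""
    count = table[row][col]
    row_total = sum(table[row].values())
    col_total = sum(r[col] for r in table.values())
    grand_total = sum(sum(r.values()) for r in table.values())
    return {
        "count": count,
        "row_pct": 100 * count / row_total,      # P(col | row)
        "col_pct": 100 * count / col_total,      # P(row | col)
        "total_pct": 100 * count / grand_total,  # P(row and col)
    }


result = cell_percents(table, "Full-time", "World Campus")
for name, value in result.items():
    print(name, round(value, 2))
```

Dividing by the row total conditions on work status, dividing by the column total conditions on campus, and dividing by the grand total gives the intersection.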


2.1.2.2 - Minitab: Clustered Bar Chart


Minitab®  – Clustered Bar Chart

This example will use data collected from a sample of students enrolled in online sections of STAT 200. 

WCStudentData.csv

To create a clustered bar chart of the Work Status and Primary Campus variables in Minitab:

  1. Open the data file in Minitab 
  2. From the tool bar, select Graph > Bar Chart > Counts of  Unique Values
  3. Select Multiple Variables
  4. Click OK
  5. Double click the variables Work Status and Primary Campus to insert them both into the Categorical variables box on the right
  6. Click OK

This should result in the clustered bar chart below:

Clustered bar chart for work status and primary campus

Note: The order in which the variables are entered into the Categorical variables box determines how the bars will be clustered. For example, if we entered Primary Campus and then Work Status, the result would be the following clustered bar chart:

Clustered bar chart for work status and primary campus

Summarized Data

In the example above, raw data were used. In other words, our Minitab worksheet contained one row for each case. It is also possible to use Minitab to construct a clustered bar chart with summarized data, for example, if you have data in a frequency table. To do this, select Graph > Bar Chart > Summarized Data in a Table > Two-Way Table > Clustered or Stacked. Double click each of your variables to move them into the Y-variables box. Move the column containing row labels into the Row labels box. The default is to Cluster variables, which is what should be selected to create a clustered bar chart, with Rows first, Y's below. You also have the option of choosing which variable your bars are clustered by; to flip the variables, select Y's first, rows below from the drop-down. 


2.1.2.3 - Minitab: Stacked Bar Chart


Minitab®  – Stacked Bar Chart (Raw Data)

This example will use data collected from a sample of students enrolled in online sections of STAT 200. 

WCStudentData.csv

To create a stacked bar chart of the Work Status and Primary Campus variables in Minitab:

  1. Open the data file in Minitab
  2. Select Graph > Bar Chart > Counts of Unique Values
  3. Select Multiple Variables
  4. Click OK
  5. Double click the variables Work Status and Primary Campus to insert them both into the Categorical variables box on the right
  6. Under Display categorical variables select Last variable stacked
  7. Click OK

This should result in the stacked bar chart below:

Stacked bar chart of work status by primary campus

 

Note: The order in which the variables are entered into the Categorical variables box in Minitab determines how the bars will be stacked. For example, if we entered Primary Campus and then Work Status, the result would be the following stacked bar chart:

Stacked bar chart of primary campus by work status

Summarized Data

In the example above, raw data were used. In other words, our Minitab worksheet contained one row for each case. It is also possible to use Minitab to construct a stacked bar chart with summarized data, for example, if you have data in a frequency table. To do this, select Graph > Bar Chart > Summarized Data in a Table > Two-Way Table > Clustered or Stacked. Double click each of your variables to move them into the Y-variables box. Move the column containing row labels into the Row labels box. Select Stack variables. The default is to Stack Y-variables; you can flip the variables by changing this to Stack rows.


2.1.3 - Probability Rules


The probability rules covered in this lesson can be found in section P.1 of the Lock5 textbook.

Earlier in this lesson you were introduced to proportions. We used the notation: \(Proportion=\frac{Number\;in\;the\;category}{Total\;number}\).

When we discuss probabilities, we will use the notation below where \(P(A)\) is the probability of event \(A\) occurring. Probabilities are typically written in decimal form but may also be translated to percentages. 

Note that this is the same formula that you learned earlier in Lesson 2.1.1 for a proportion.

Probability of Event A
\(P(A)=\dfrac{Number\;in\;group\;A}{Total\;number}\)

Example: Spades

What is the probability that a randomly selected card from a standard 52-card deck will be a spade? There are 13 spades in the deck of 52.

\(P(spade)=\dfrac{13}{52}=0.25\)

The probability of pulling a spade is 0.25. We could also say that there is a 25% chance of pulling a spade.

Example: Odd Numbers

If you roll a six-sided die, what is the probability of getting an odd number? There are three odd numbers on the die (1, 3, 5).

\(P(odd)=\dfrac{3}{6}=0.50\)

The probability of rolling an odd number is 0.50. We could also say that there is a 50% chance of rolling an odd number.

Example: Raffle

There are a total of 500 raffle tickets and you have purchased 10. What is the probability that one of your tickets will be randomly selected to win the raffle?

\(P(winning)=\dfrac{10}{500}=0.02\)

The probability of you winning is 0.02. We could also say that there is a 2% chance that you will win.
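When every outcome is equally likely, \(P(A)\) can be found by listing the outcomes and counting. The Python sketch below does this for the die and card examples; the helper name `probability` and the way the deck is built are illustrative assumptions. It uses exact fractions so that no rounding is involved until the result is displayed.

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(A) = number in group A / total number, for equally likely outcomes."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))


# Six-sided die
die = [1, 2, 3, 4, 5, 6]
p_odd = probability(lambda face: face % 2 == 1, die)

# Standard 52-card deck as (rank, suit) pairs
suits = ["Hearts", "Diamonds", "Spades", "Clubs"]
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
deck = [(rank, suit) for suit in suits for rank in ranks]
p_spade = probability(lambda card: card[1] == "Spades", deck)

# Raffle: 10 of the 500 tickets are yours
p_winning = Fraction(10, 500)

print(float(p_odd), float(p_spade), float(p_winning))
```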


2.1.3.1 - Range of Probabilities


The probability of an impossible event is 0 and the probability of a certain event is 1. The range of possible probabilities is: \(0 \leq P(A) \leq 1\). It is not possible to have a probability less than 0 or greater than 1. 

Example: Rolling an 8

It is impossible to roll an eight on a six-sided die.

\(P(rolling\; 8)= \dfrac{0}{6} = 0\)

Example: Blue Cards

In a standard 52-card deck all cards are black or red. There are no blue cards.

\(P(blue)=\dfrac{0}{52}=0\)

Example: Rolling a Value Between 1 and 6

A six-sided die contains the values 1, 2, 3, 4, 5, and 6. All rolls will result in a value between 1 and 6.

\(P(rolling \;1 \;to\; 6)=\dfrac{6}{6}=1.00\)


2.1.3.2 - Combinations of Events


In situations with two or more categorical variables there are a number of different ways that combinations of events can be described: intersections, unions, complements, and conditional probabilities. Each of these combinations of events is covered in your textbook. However, note that your textbook does not use the symbols that are most commonly used when discussing these combinations of events. The symbols that we will be using are in the table below. In this section, you will also learn about disjoint events and independent events.

Combination of Events
Combination Symbol Definition
Disjoint   Never occurring together
Independent   Unrelated
Intersection \(P(A\cap B)\) Probability of A and B
Union \(P(A\cup B)\) Probability of A or B (note: this includes the possibility of A and B)

Complement \(P(A^C)\) The probability of NOT A
Conditional \(P(A\mid B)\) The probability of A given B

2.1.3.2.1 - Disjoint & Independent Events


Disjoint events and independent events are different. Events are considered disjoint if they never occur at the same time; these are also known as mutually exclusive events. Events are considered independent if they are unrelated.

Disjoint Events

Disjoint events are events that never occur at the same time. These are also known as mutually exclusive events.

These are often visually represented by a Venn diagram, such as the one below. In this diagram, there is no overlap between event A and event B. These two events never occur together, so they are disjoint events.

Mutually Exclusive

Example: First-Year & Sophomore Students

Let's consider undergraduate class level. A student can be classified as a first-year student, sophomore, junior, or senior.

Being a first-year student and being a sophomore are disjoint events because an individual cannot be classified as both at the same time. 

Independent Events

Independent events are unrelated events. The outcome of one event does not impact the outcome of the other event. Independent events can, and do often, occur together. 

The following examples use stacked bar charts to demonstrate what two variables that are and are not independent look like in relation to one another.

Example: Penguin Species & Biological Sex

The segmented bar chart above displays data from a research study concerning penguins (see Palmer Penguins). Within each of the three species of penguin, half of the penguins are male and half are female. In this sample, penguin species and biological sex are independent. Knowing the species of a penguin does not change the probability that they are male or female. And, knowing the biological sex of a penguin does not change the probability that it is an Adelie, Chinstrap, or Gentoo penguin.

Non-Example: Enrollment Status by Campus

Stacked bar chart of Penn State enrollment status by campus: Campus (University Park, Commonwealth Campuses, PA College of Technology, World Campus) on the x-axis, counts from 0 to 50,000 on the y-axis, and segments for Full-Time and Part-Time

The segmented bar chart above displays data concerning Penn State students' status as full- or part-time and their primary campus (data from Penn State's Data Digest). The proportion of students who are part-time is different at each campus. Only 2.7% of University Park students are enrolled part-time while 69.2% of World Campus students are enrolled part-time. Enrollment status and primary campus are not independent. If we know a student's campus, that changes the probability of them being a full- or part-time student. If we know that a student is full- or part-time, that changes the probability that they came from a specific campus.
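One informal way to check for independence is to compare the part-time proportion within each campus to the overall part-time proportion; if the two variables were independent, these would all be about the same. The Python sketch below does this using the Fall 2019 undergraduate counts from the two-way table in the Intersections section of this lesson. The variable names are illustrative assumptions, and this is a descriptive comparison only, not a formal test.

```python
# (full-time, part-time) counts per campus, Fall 2019 undergraduate enrollment
enrollment = {
    "University Park": (39529, 1110),
    "Commonwealth Campuses": (24306, 2794),
    "PA College of Technology": (4110, 871),
    "World Campus": (2574, 5786),
}

total_students = sum(full + part for full, part in enrollment.values())
total_part_time = sum(part for _, part in enrollment.values())
overall_rate = total_part_time / total_students  # P(part-time)

# P(part-time | campus) for each campus
part_time_rate = {
    campus: part / (full + part)
    for campus, (full, part) in enrollment.items()
}

print(f"Overall: {overall_rate:.1%}")
for campus, rate in part_time_rate.items():
    print(f"{campus}: {rate:.1%}")
```

Because the campus-level proportions are far from the overall proportion, knowing a student's campus changes the probability that they are part-time, which is what it means for the variables to not be independent.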


2.1.3.2.2 - Intersections


The term intersection is used to describe the overlap of two or more events. This is communicated using the character ∩. The phrase \(P(A \cap B)\) is read as "the probability of A and B."

In the form of a Venn diagram, we can picture this as the overlap between two [or more] events. 

Intersection of A and B

Example: Cards

What is the probability of randomly selecting a card from a standard 52-card deck that is a red card and a king?

There are 2 kings that are red cards: the king of hearts and the king of diamonds.

\(P(red \cap king)=\dfrac{2}{52}=.0385\)
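The count of red kings can also be checked by enumerating the deck. The sketch below is only an illustration; the names `RANKS`, `SUITS`, `RED_SUITS`, and `deck` are our own choices and do not come from the lesson.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs.
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
RED_SUITS = {"Hearts", "Diamonds"}
deck = list(product(RANKS, SUITS))

# Intersection: keep only the cards that are red AND a king.
red_kings = [(rank, suit) for rank, suit in deck
             if suit in RED_SUITS and rank == "King"]

# Probability = number of cards in the intersection / total cards.
p_red_and_king = Fraction(len(red_kings), len(deck))
print(p_red_and_king, round(float(p_red_and_king), 4))
```

Using `Fraction` keeps the exact value of 2/52 before it is rounded for display.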

Example: Penn State Enrollment

The two-way contingency table below displays Penn State's undergraduate enrollments from Fall 2019 in terms of status (full-time and part-time) and primary campus (data from the Penn State Factbook).

Campus                     Full-Time  Part-Time  Total
University Park            39529      1110       40639
Commonwealth Campuses      24306      2794       27100
PA College of Technology   4110       871        4981
World Campus               2574       5786       8360
Total                      70519      10561      81080

 

 

What proportion of Penn State students were full-time University Park students?

This is an example of an intersection because we are looking for the proportion of all students who are both full-time and University Park.

\(P(FullTime \cap UniversityPark)=\dfrac{39529}{81080}=0.488\)

 

What proportion of Penn State students were part-time World Campus students?

This is an example of an intersection because we are looking for the proportion of all students who are both part-time and World Campus.

\(P(PartTime \cap WorldCampus) = \dfrac{5786}{81080}=0.071\)
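As a rough sketch, the two intersections above can be reproduced by storing the table cells and dividing one cell by the grand total. The dictionary layout and the helper name `p_intersection` are assumptions made for illustration.

```python
# Fall 2019 counts copied from the table above:
# campus -> (full-time count, part-time count)
enrollment = {
    "University Park": (39529, 1110),
    "Commonwealth Campuses": (24306, 2794),
    "PA College of Technology": (4110, 871),
    "World Campus": (2574, 5786),
}

# Grand total of all students in the table.
total = sum(full + part for full, part in enrollment.values())

def p_intersection(campus, status):
    """Proportion of ALL students in one (campus, status) cell."""
    full, part = enrollment[campus]
    count = full if status == "full-time" else part
    return count / total

print(round(p_intersection("University Park", "full-time"), 3))
print(round(p_intersection("World Campus", "part-time"), 3))
```

The denominator is always the grand total because an intersection is a proportion of all students, not of one row or column.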


2.1.3.2.3 - Unions


A union is communicated using the symbol ∪. \(P(A \cup B)\) is read as "the probability of A or B." Note that in mathematics, "or" means "and/or." The Venn diagram below depicts the union of A and B.

Union of A and B

 

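If the values of P(A), P(B), and P(A ∩ B) are all known, the formula below can be used to compute the probability of the union of A and B. Conceptually, the probability of A or B is equal to the probability of A plus the probability of B minus the probability of their overlap, which would otherwise be counted twice.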

Union
\(P(A\cup B) = P(A)+P(B)-P(A\cap B)\)

Example: Hearts or Spades

What is the probability of randomly selecting a card from a standard 52-card deck that is a heart or spade?

In a standard 52-card deck, 13 cards are hearts, 13 cards are spades, and no cards are both a heart and a spade.

\(P(heart) = \dfrac{13}{52}\)

\(P(spade) = \dfrac{13}{52}\)

\(P(heart \cap spade) = \dfrac{0}{52}\)

Using the formula given above:

\(P(heart \cup spade)=\dfrac{13}{52}+\dfrac{13}{52}-\dfrac{0}{52}= \dfrac {26}{52}=0.5\)

Example: Hearts or Aces

What is the probability of randomly selecting a card from a standard 52-card deck that is a heart or an ace?

In a standard 52-card deck, 13 cards are hearts and 4 cards are aces. There is one ace of hearts.

\(P(heart) = \dfrac{13}{52}\)

\(P(ace) = \dfrac{4}{52}\)

\(P(heart \cap ace) = \dfrac{1}{52}\)

Using the formula given above:

\(P(heart \cup ace)=\dfrac{13}{52}+\dfrac{4}{52}-\dfrac{1}{52}=\dfrac{16}{52}=0.308\)
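To see why the overlap is subtracted, the union can be counted directly and compared with the formula. This is a minimal sketch using Python sets; the names `deck`, `hearts`, `aces`, and `prob` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
deck = set(product(RANKS, SUITS))

# Events are represented as sets of cards.
hearts = {card for card in deck if card[1] == "Hearts"}
aces = {card for card in deck if card[0] == "Ace"}

def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(len(event), len(deck))

# Direct count: the set union lists the ace of hearts only once.
direct = prob(hearts | aces)

# Formula: P(A) + P(B) - P(A and B).
by_formula = prob(hearts) + prob(aces) - prob(hearts & aces)

print(direct, by_formula, round(float(direct), 3))
```

Both approaches give 16/52 because the ace of hearts is counted once in the set union and subtracted once in the formula.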

Example: Part-Time or World Campus

The two-way contingency table below displays Penn State's undergraduate enrollments from Fall 2019 in terms of status (full-time and part-time) and primary campus (data from the Penn State Factbook).

Campus                     Full-Time  Part-Time  Total
University Park            39529      1110       40639
Commonwealth Campuses      24306      2794       27100
PA College of Technology   4110       871        4981
World Campus               2574       5786       8360
Total                      70519      10561      81080

 

What proportion of all students are part-time students or World Campus students?

When we have a contingency table, we can take the appropriate values from the table as opposed to using the formula given above. In this table there are 1110 part-time University Park students, 2794 part-time Commonwealth Campus students, 871 part-time PA College of Technology students, 5786 part-time World Campus students, and 2574 full-time World Campus students. Combined, these are all of the cells this question is asking about.

\(P(PartTime \cup WorldCampus)=\dfrac{1110+2794+871+5786+2574}{81080}=\dfrac{13135}{81080}=0.162\)

Note that the final answer would be the same if we had used the formula:

\(P(PartTime \cup WorldCampus) = \dfrac{10561}{81080}+\dfrac{8360}{81080}-\dfrac{5786}{81080}= \dfrac{13135}{81080}=0.162\)
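The agreement between the two methods can be sketched in a few lines. The variable names below are our own, and the counts are copied from the table above.

```python
# campus -> (full-time count, part-time count), Fall 2019
enrollment = {
    "University Park": (39529, 1110),
    "Commonwealth Campuses": (24306, 2794),
    "PA College of Technology": (4110, 871),
    "World Campus": (2574, 5786),
}
total = sum(full + part for full, part in enrollment.values())

part_time_total = sum(part for _, part in enrollment.values())
wc_full, wc_part = enrollment["World Campus"]
world_campus_total = wc_full + wc_part

# Method 1: add every cell that is part-time or World Campus.
# All part-time cells, plus the one remaining full-time World Campus cell.
count_from_cells = part_time_total + wc_full

# Method 2: union formula, subtracting the overlap
# (part-time World Campus students) so it is not counted twice.
count_from_formula = part_time_total + world_campus_total - wc_part

print(count_from_cells, count_from_formula)
print(round(count_from_cells / total, 3))
```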


2.1.3.2.4 - Complements


The complement of an event is the event that it does not occur. The probability of the complement of A is written as \(P(A^C)\) or \(P(A')\).

In the diagram below, we can see that \(A^{C}\) is everything in the sample space that is not A.

Complement of A

Mathematically, if we know \(P(A)\), we can use that value to compute \(P(A^{C})\) using the following formula.

Complement of A
\(P(A^{C})=1-P(A)\)

Example: Coin Flip

When flipping a coin, one can flip heads or tails. Thus, \(P(Tails^{C})=P(Heads)\) and \(P(Heads^{C})=P(Tails)\)

Example: Hearts

If you randomly select a card from a standard 52-card deck, you could pull a heart, diamond, spade, or club. The complement of pulling a heart is pulling a diamond, spade, or club. In other words: \(P(Heart^{C})=P(Diamond \cup Spade \cup Club)\)

Example: Rain


According to the weather report, there is a 30% chance of rain today: \(P(Rain) = .30\) 

Raining and not raining are complements.

\(P(Not \:rain)=P(Rain^{C})=1-P(Rain)=1-.30=.70\)

There is a 70% chance that it will not rain today.
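The complement rule is a single subtraction, sketched below. The function name `complement` and its range check are illustrative assumptions; `Fraction` is used only to avoid floating-point rounding.

```python
from fractions import Fraction

def complement(p):
    """Return P(A^C) = 1 - P(A) for a probability p."""
    if not 0 <= p <= 1:
        raise ValueError("a probability must be between 0 and 1")
    return 1 - p

p_rain = Fraction(30, 100)        # 30% chance of rain
p_not_rain = complement(p_rain)   # chance that it does not rain

print(float(p_not_rain))
```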

The probabilities of all possible outcomes, when those outcomes do not overlap, sum to 1.

Example: Cards

In a standard 52-card deck there are 26 black cards and 26 red cards. All cards are either black or red.

\(P(red)+P(black)=\frac{26}{52}+\frac{26}{52}=1\)

Example: Dominant Hand

Of individuals with two hands, it is possible to be right-handed, left-handed, or ambidextrous. Assuming that these are the only three possibilities and that there is no overlap between any of these possibilities:

\(P(right\;handed)+P(left\;handed)+P(ambidextrous) = 1\)


2.1.3.2.5 - Conditional Probability


A conditional probability is the probability of one event occurring given that a second event is known to have occurred. This is communicated using the symbol \(\mid\) which is read as "given." For example, \(P(A\mid B)\) is read as "Probability of A given B."

A conditional probability can be computed using a two-way contingency table. In the examples below, note that we're only interested in the events in one row or column.

Example: PA Resident given Undergraduate

The two-way contingency table below displays Penn State World Campus enrollments from Fall 2019 in terms of academic level (undergraduate and graduate) and state residency (Pennsylvania and non-Pennsylvania). 

Two-Way Table of Academic Level and State Residency
Academic Level   Pennsylvania  Non-Pennsylvania  Total
Undergraduate    3757          4603              8360
Graduate         2253          4074              6327
Total            6010          8677              14687

 

Given an individual is an undergraduate student, what is the probability they are a Pennsylvania resident?

We know the individual is an undergraduate student, so we will only look at the row containing the 8360 undergraduate students. Of those 8360 undergraduate students, 3757 were Pennsylvania residents.

\(P(PA \mid Undergrad) = \dfrac{3757}{8360}=0.449\)

 

Given an individual is a Pennsylvania resident, what is the probability they are an undergraduate student?

Note that in most cases, \(P(A\mid B) \ne P(B \mid A)\). This question is different from the first question because the two events are flipped. Here, we know the individual is a Pennsylvania resident, so we will only look at the column containing the 6010 Pennsylvania residents. Of those 6010 Pennsylvania residents, 3757 were undergraduate students.

\(P(Undergrad \mid PA) = \dfrac{3757}{6010}=0.625\)

 

What proportion of graduate students are Pennsylvania residents?

This question is worded slightly differently, but it is also a conditional probability. This translates to \(P(PA \mid Graduate)\). Of the 6327 graduate students, 2253 were Pennsylvania residents. 

\(P(PA \mid Graduate) = \dfrac{2253}{6327}=0.356\)
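The three conditional probabilities above differ only in which row or column supplies the denominator. The following sketch makes that explicit; the nested-dictionary layout and both function names are assumptions chosen for illustration.

```python
# World Campus counts, Fall 2019: academic level -> residency -> count
table = {
    "Undergraduate": {"Pennsylvania": 3757, "Non-Pennsylvania": 4603},
    "Graduate": {"Pennsylvania": 2253, "Non-Pennsylvania": 4074},
}

def p_residency_given_level(residency, level):
    """P(residency | level): restrict attention to one ROW."""
    row = table[level]
    return row[residency] / sum(row.values())

def p_level_given_residency(level, residency):
    """P(level | residency): restrict attention to one COLUMN."""
    column_total = sum(row[residency] for row in table.values())
    return table[level][residency] / column_total

print(round(p_residency_given_level("Pennsylvania", "Undergraduate"), 3))
print(round(p_level_given_residency("Undergraduate", "Pennsylvania"), 3))
print(round(p_residency_given_level("Pennsylvania", "Graduate"), 3))
```

The first two calls share the numerator 3757 but use different denominators (8360 and 6010), which is why \(P(A\mid B)\) and \(P(B\mid A)\) usually differ.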

Sensitivity & Specificity

Sensitivity and specificity are two specific types of conditional probabilities that are often applied in situations involving testing (e.g., medical testing for a given condition). Sensitivity is the probability of testing positive given that one actually has the condition. Specificity is the probability of testing negative given that one actually does not have the condition. Ideally, we would like both sensitivity and specificity rates to be high.

Example: Sensitivity & Specificity

Compute the sensitivity and specificity of the test data presented in the following two-way contingency table.

Two-Way Table of Test Results and Actual Health
Test Result      Actually Sick  Actually Healthy  Total
Tested Positive  15             5                 20
Tested Negative  2              19                21
Total            17             24                41

 

Sensitivity is the proportion of all people who were actually sick who tested positive. As a conditional probability, this is \(P(positive \mid sick)\). There were 17 people in the sample who were actually sick. Of those, 15 tested positive.

\(Sensitivity = \dfrac{15}{17}=0.882\)

 

Specificity is the proportion of all people who were actually healthy who tested negative. As a conditional probability, this is \(P(negative \mid healthy)\). There were 24 people in the sample who were actually healthy. Of those, 19 tested negative.

\(Specificity = \dfrac {19}{24}=0.792\)
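A short sketch of both calculations follows. The function and parameter names are our own; the four counts come from the table above.

```python
def sensitivity(sick_positive, sick_negative):
    """P(positive | sick): positives among everyone actually sick."""
    return sick_positive / (sick_positive + sick_negative)

def specificity(healthy_negative, healthy_positive):
    """P(negative | healthy): negatives among everyone actually healthy."""
    return healthy_negative / (healthy_negative + healthy_positive)

# Counts from the two-way table of test results and actual health.
sens = sensitivity(sick_positive=15, sick_negative=2)
spec = specificity(healthy_negative=19, healthy_positive=5)

print(round(sens, 3), round(spec, 3))
```

Each function conditions on a column of the table (actually sick or actually healthy), so the row totals of 20 and 21 are never used.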


2.1.3.2.5.1 - Advanced Conditional Probability Applications


Advanced Formulas

Conditional probabilities can also be computed using the following formulas. Note that the two formulas have the same form, with A and B switched. Again, if the contingency table is available, it is usually most efficient to take the appropriate values from the table, as shown above, as opposed to using these formulas.

Conditional Probability of A Given B
\(P(A\mid B)=\dfrac{P(A \: \cap\: B)}{P(B)}\)
Conditional Probability of B Given A
\(P(B\mid A)=\dfrac{P(A \: \cap\: B)}{P(A)}\)

Example: Clubs

In a standard 52-card deck, there are 26 black cards including 13 clubs. All clubs are black, therefore there are 13 black clubs.

What is the probability that a randomly selected card is a club given that it is a black card?

We are given that \(P(club)=\frac{13}{52}=0.25\), \(P(black)=\frac{26}{52}=0.50\), and  \(P(club \: \cap\: black)=\frac{13}{52}=0.25\)

\(P(club\mid black)=\dfrac{P(club \: \cap\: black)}{P(black)}=\dfrac{0.25}{0.50}=0.50\)

Given that a randomly selected card is black, there is a 50% chance that it's a club.
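The formula can be written as a small helper, shown here as a sketch under the assumption that P(B) is greater than 0. The function name `conditional` is hypothetical.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) is 0."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return p_a_and_b / p_b

p_black = Fraction(26, 52)
p_club_and_black = Fraction(13, 52)   # every club is black

p_club_given_black = conditional(p_club_and_black, p_black)
print(p_club_given_black)
```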

Independent Events Written as Conditional Probabilities

If events A and B are independent then \(P(A) = P(A \mid B)\). In other words, whether or not event B occurs does not change the probability of event A occurring.

Example: Checking for Independence, Aces and Hearts

A card is randomly drawn from a 52-card deck. Are the events of drawing an ace and drawing a heart independent?

In a standard 52-card deck, there are 4 aces and 13 hearts. Therefore \(P(ace)=\frac{4}{52}\) and \(P(heart)=\frac{13}{52}\). Out of 13 hearts, 1 is an ace, which translates to \(P(ace \mid heart) = \frac{1}{13}\).

To determine if these two events are independent we can compare \(P(A)\) to \(P(A\mid B)\). If we call being an ace event A and being a heart event B, then we're comparing \(P(ace)\) to \(P(ace \mid heart)\).

\(P(ace)=\frac{4}{52}=0.0769\)

\(P(ace \mid heart) = \frac{1}{13}=0.0769\)

These values are identical, therefore we can conclude that the events of drawing an ace and drawing a heart are independent. 
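The same comparison can be sketched by enumerating the deck. The `prob` helper and the set names are illustrative assumptions. The hearts-versus-red-cards check at the end is an extra contrast that does not appear in the lesson; it shows a pair of events that fails the test.

```python
from fractions import Fraction
from itertools import product

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
deck = set(product(RANKS, SUITS))

def prob(event, given=None):
    """P(event), or P(event | given) when a conditioning set is supplied."""
    space = deck if given is None else given
    return Fraction(len(event & space), len(space))

aces = {card for card in deck if card[0] == "Ace"}
hearts = {card for card in deck if card[1] == "Hearts"}
red = {card for card in deck if card[1] in {"Hearts", "Diamonds"}}

# Independent: knowing the card is a heart leaves P(ace) unchanged.
aces_hearts_independent = prob(aces) == prob(aces, given=hearts)

# Not independent (added contrast): knowing the card is red
# raises P(heart) from 13/52 to 13/26.
hearts_red_independent = prob(hearts) == prob(hearts, given=red)

print(aces_hearts_independent, hearts_red_independent)
```

Comparing exact fractions avoids concluding that two probabilities differ merely because of rounding.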

 

