Lesson 2: Describing Data, Part 1

Objectives

Upon successful completion of this lesson, you will be able to:

• Compute and interpret a basic proportion/risk/probability and odds
• Select and interpret the appropriate visual representations for one categorical variable, two categorical variables, and one quantitative variable
• Use Minitab Express to construct frequency tables, pie charts, bar charts, two-way tables, clustered bar charts, histograms, and dotplots
• Compute and interpret complements, intersections, unions, and conditional probabilities given a two-way table
• Identify outliers on a histogram or dotplot
• Interpret the shape of a distribution
• Compute and interpret the mean, median, mode, and standard deviation
• Compute and interpret percentiles and z scores
• Apply the Empirical Rule
• Interpret a five number summary

This lesson corresponds to Sections 2.1-2.3 and P.1 in the Lock5 textbook.

Recall from Lesson 1 that variables can be classified as categorical or quantitative:

Categorical
Names or labels (i.e., categories) with no logical order or with a logical order but inconsistent differences between groups, also known as qualitative.
Quantitative
Numerical values with magnitudes that can be placed in a meaningful order with consistent intervals, also known as numerical.

The graphs, descriptive statistics, and inferential statistics that are appropriate depend on the nature of the variable(s) in a given scenario. Before beginning this lesson, you should be able to classify variables as categorical or quantitative. If you are having difficulties with this, go back to review Lesson 1 or speak with your instructor.

2.1 - Categorical Variables

Categorical variables are discussed in Sections 2.1 and P.1 of the Lock5 textbook.

Variables can be classified as categorical or quantitative. In this section of the lesson, we will be focusing on categorical variables. Categorical variables are those that provide groupings that may have no logical order, or a logical order with inconsistent differences between groups (e.g., the difference between 1 and 2 is not equivalent to the difference between 3 and 4).

This course includes many examples and practice problems for you. Many of these will apply the concepts that we learn to experiments involving rolling a die or randomly selecting a card from a standard 52-card deck. If you are unfamiliar with either of these, take a moment here to review.

Die

A standard die has 6 sides: 1, 2, 3, 4, 5, 6

52-Card Deck

A standard 52-card deck of playing cards has 13 Hearts, 13 Diamonds, 13 Spades, and 13 Clubs. Hearts (♥) and Diamonds (♦) are red suits. Spades (♠) and Clubs (♣) are black suits. For each suit, there is a 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King, and Ace. Jacks, Queens, and Kings are "face cards."

2.1.1 - One Categorical Variable

Data concerning one categorical variable can be summarized using a proportion.

Proportion
$Proportion=\dfrac{Number\;in\;the\;category}{Total\;number}$

The symbol for a sample proportion is $\widehat{p}$ and is read as "p-hat." The symbol for a population proportion is $p$.

The formula for a sample proportion may also be written as $\widehat p = \frac{x}{n}$ where $x$ is the number in the sample with the trait of interest and $n$ is the sample size.

A proportion must be between 0 and 1, inclusive.

Example: Black Cards

A standard 52-card deck contains $26$ red cards and $26$ black cards. What proportion of cards are black?

$p=\dfrac{26}{52}=0.50$

The symbol $p$ was used because this is the proportion of all cards (i.e., the population) that are black.

In the Fall 2014 semester, there were $82,382$ undergraduate students enrolled in Penn State. Of those, $6,245$ were World Campus students. What proportion of all Penn State undergraduate students were World Campus students?

$p=\dfrac{6245}{82382}=0.076$

The symbol $p$ was used because this is the proportion of all Penn State undergraduate students (i.e., the population) that are World Campus students.

In a sample of $30$ randomly selected packages of chocolate chip cookies, $18$ contained broken cookies. What proportion of these selected packages had broken cookies?

$\widehat{p}=\dfrac{18}{30}=0.60$

These data were collected from a sample so the symbol $\widehat{p}$ was used to denote a sample proportion.
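
The three examples above all apply the same formula. A minimal Python sketch of the arithmetic follows; the helper name `proportion` is chosen here for illustration only and is not part of Minitab Express or the Lock5 textbook.

```python
def proportion(count_in_category, total):
    """Return count_in_category / total (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= count_in_category <= total:
        raise ValueError("count must be between 0 and total")
    return count_in_category / total


# Population proportions (p): every member of the population was counted.
p_black_cards = proportion(26, 52)
p_world_campus = proportion(6245, 82382)

# Sample proportion (p-hat): only a sample of 30 packages was examined.
p_hat_broken = proportion(18, 30)

print(round(p_black_cards, 3))   # 0.5
print(round(p_world_campus, 3))  # 0.076
print(round(p_hat_broken, 3))    # 0.6
```

The validation inside the helper reflects the rule that a proportion can never fall below 0 or above 1.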

2.1.1.1 - Risk and Odds

You may have heard the terms risk and odds before. They are both ways to communicate the likelihood of an event.

Risk and odds are often confused with one another. The formulas for computing risk and odds are different and their interpretations are different.

Risk

In statistics, the word risk communicates the likelihood of an event occurring. This is synonymous with probability or proportion (i.e., the formulas are the same).

Risk
The probability that an event will occur. It may be written as a decimal, a fraction, or a percent.
Risk
$Risk= \dfrac{number \;with \;the\; outcome}{total\;number\;of\;outcomes}$

Example: Asthma Risk

$60$ out of $1000$ teens have asthma.

$risk=\dfrac{60}{1000}=0.06$

This means that $6\%$ of teens experience asthma.

Example: Flu Risk

$45$ out of $100$ children get the flu each year.

$risk=\dfrac{45}{100}=0.45$ or $45\%$

Odds

Odds
Express risk by comparing the likelihood of an event happening to the likelihood it does not happen.
Odds

$odds = \dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}$

OR

$odds=\dfrac{risk}{1-risk}$

We often interpret odds in relation to the value of 1. For example, if the odds of a game are in favor of the house 2 to 1, that means for every 2 games the house wins it will lose 1.

Example: Passing Odds

In one large class, 850 students passed an exam while 150 students failed. Because we have the raw counts, we can use the first odds formula.

$odds=\dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}=\dfrac{850}{150}=5.667$

The odds of passing were 5.667 to 1. In other words, for every 5.667 students who passed the exam there was 1 who failed.

Example: Flu Odds

The risk of a child getting the flu is $45\%$ which can also be written as $0.45$. Because we have the risk, we can use the second odds formula.

$odds=\dfrac{risk}{1-risk}=\dfrac{0.45}{1-0.45}=\dfrac{0.45}{0.55}=0.818$

The odds of a child getting the flu are $0.818$ to $1$.
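
The two odds formulas can be checked against the passing and flu examples with a short Python sketch. The function names are hypothetical. The third function, which converts odds back to risk, is not stated in this lesson; it is included as an assumption-free algebraic rearrangement of $odds=\dfrac{risk}{1-risk}$.

```python
def odds_from_counts(with_outcome, without_outcome):
    """odds = number with the outcome / number without the outcome."""
    if without_outcome <= 0:
        raise ValueError("need at least one case without the outcome")
    return with_outcome / without_outcome


def odds_from_risk(risk):
    """odds = risk / (1 - risk); undefined when risk equals 1."""
    if not 0 <= risk < 1:
        raise ValueError("risk must satisfy 0 <= risk < 1")
    return risk / (1 - risk)


def risk_from_odds(odds):
    """Rearrangement of the second formula: risk = odds / (1 + odds)."""
    return odds / (1 + odds)


passing_odds = odds_from_counts(850, 150)  # raw counts are available
flu_odds = odds_from_risk(0.45)            # only the risk is available

print(round(passing_odds, 3))                  # 5.667
print(round(flu_odds, 3))                      # 0.818
print(round(risk_from_odds(passing_odds), 3))  # 0.85
```

The last line shows why risk and odds should not be confused: the same exam has passing odds of about 5.667 to 1 but a passing risk (probability) of 0.85.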

2.1.1.2 - Visual Representations

Frequency tables, pie charts, and bar charts are the most appropriate graphical displays for categorical variables. Below are a frequency table, a pie chart, and a bar graph for data concerning Penn State’s undergraduate enrollments by campus in Fall 2017.

Note that in the bar chart, the bars are separated by a space. The spaces between the bars signify that this is a categorical variable. On the following pages you will learn how to make these graphs using Minitab Express.

Frequency Table

A table containing the counts of how often each category occurs.

Tally
Campus Count Percent
University Park 40835 48.5%
Commonwealth Campuses 29388 34.9%
PA College of Technology 5465 6.5%
World Campus 8513 10.1%
Total 84201 100.0%

Penn State Fall 2017 Undergraduate Enrollments

Pie chart

Graphical representation for categorical data in which a circle is partitioned into “slices” on the basis of the proportions of each category.

Bar chart

Graphical representation for categorical data in which vertical (or sometimes horizontal) bars are used to depict the number of experimental units in each category; bars are separated by space.

Penn State Fall 2017 Undergraduate Enrollments

Tips

Pie charts tend to work best when there are only a few categories. If a variable has many categories, a pie chart may be more difficult to read. In those cases, a frequency table or bar chart may be more appropriate.

When selecting a visual display for your data you should first determine how many variables you are going to display and whether they are categorical or quantitative. Then, you should think about what you are trying to communicate. Each visual display has its own strengths and weaknesses. When first starting out, you may need to make a few different types of displays to determine which best communicates your data.
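
The percentages in a frequency table follow directly from the counts. As a minimal sketch, the Fall 2017 enrollment counts from the frequency table above can be turned into percents in Python. Python is used here only for illustration (the course itself uses Minitab Express), and the variable names are assumptions of this sketch.

```python
# Counts copied from the Fall 2017 frequency table in this section.
counts = {
    "University Park": 40835,
    "Commonwealth Campuses": 29388,
    "PA College of Technology": 5465,
    "World Campus": 8513,
}

total = sum(counts.values())

# Percent for each category = 100 * count / total.
percents = {campus: 100 * n / total for campus, n in counts.items()}

print(f"{'Campus':<26}{'Count':>7}{'Percent':>9}")
for campus, n in counts.items():
    print(f"{campus:<26}{n:>7}{percents[campus]:>8.1f}%")
print(f"{'Total':<26}{total:>7}{100.0:>8.1f}%")
```

Because every case falls in exactly one category, the percents add to 100.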

2.1.1.2.1 - Minitab Express: Frequency Tables

The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state. To get an idea of the pattern of variation of a categorical variable such as region, we can display the information with a frequency table, pie chart, or bar graph.

MinitabExpress – Frequency Table

To create a frequency table in Minitab Express:

1. Open the data set:
2. On a PC: In the menu bar select STATISTICS > Describe > Tally
On a Mac: In the menu bar select Statistics > Summary Statistics > Tally
3. Double click the variable Region in the box on the left to insert the variable into the Variable box
4. Under Statistics, check Counts and Percents
5. Click OK

This should result in the following frequency table:

Tally
Region Count Percent
ENC 5 9.8039%
ESC 4 7.8431%
MA 3 5.8824%
MTN 8 15.6863%
NE 6 11.7647%
PAC 5 9.8039%
SA 9 17.6471%
WNC 7 13.7255%
WSC 4 7.8431%
N= 51

2.1.1.2.2 - Minitab Express: Pie Charts

The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state.

MinitabExpress – Pie Chart (Raw Data)

To create a pie chart in Minitab Express:

1. Open the data set:
2. On a PC or Mac: Select Graphs > Pie Chart
3. Select Counts of Unique Values
4. Double click the variable Region in the box on the left to insert the variable into the Categorical variable box
5. Click OK

This should result in the pie chart below:

Summarized Data

In the examples above raw data were used. In other words, the dataset contained one row for each case. It is also possible to use Minitab Express to construct a pie chart given summarized data, for example, if you had your counts in a frequency table. If this were the case, in step 3 you would select Summarized Data and enter the names of the categories in the Category names box and the frequency counts in the Summary values box.

2.1.1.2.3 - Minitab Express: Bar Charts

The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state.

MinitabExpress – Bar Chart (Raw Data)

To create a bar graph in Minitab Express:

1. Open the data set
2. On a PC or Mac: Select Graphs > Bar Chart
3. In the Bars represent drop-down menu, keep the default: Counts of unique values in a categorical variable
4. Select Simple
5. Double click the variable Region in the box on the left to insert the variable into the Categorical variable box
6. Click OK

This should result in the bar graph below:

Summarized Data

In the examples above raw data were used. In other words, the Minitab Express file consisted of one row for each case. We can also use Minitab Express to construct a bar chart with summarized data, for example, if you had data in a frequency table. To do this, in the third step shown above you will change the dropdown of Bars represent to Summarized values for each category in a table. You will still select Simple. The Summary variable will be the numerical values and the Categorical variable will be the names of the categories.

2.1.2 - Two Categorical Variables

Data concerning two categorical variables may be communicated using a two-way table, also known as a contingency table. Data concerning two categorical variables can be visualized using a segmented bar chart or a clustered bar chart. A clustered bar chart is also known as a side-by-side bar chart.

Two-Way Table
A display of counts for two categorical variables in which the rows represent one variable and the columns represent a second variable. Also known as a contingency table.

Example: World Campus Enrollments by Sex

We will use the two-way table below of Penn State World Campus enrollments by biological sex and level to walk through a few examples of how to read a two-way table. These data are from the Penn State Factbook and from the Fall of 2015.

Level Female Male Total
Undergraduate 3814 3428 7242
Graduate 2213 2787 5000
Total 6027 6215 12242

What proportion of the population of World Campus students is undergraduate?

$p=\frac{7242}{12242}=0.592$

Proportion of Females who are Undergraduates

What proportion of females in this population are undergraduate students?

$p=\frac{3814}{6027}=0.633$

Later in this lesson, you will learn that this is known as a conditional probability.

Proportion of Undergraduates who are Female

What proportion of undergraduate students in this population are female?

$p=\frac{3814}{7242}=0.527$
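
The three questions above use the same cell count (3814 female undergraduates) or the same table but different denominators. A minimal Python sketch makes the choice of denominator explicit. The nested-dictionary layout and variable names are assumptions of this sketch; the counts are the Fall 2015 World Campus enrollments used throughout this lesson.

```python
# Rows are levels, columns are biological sex (Fall 2015 World Campus).
table = {
    "Undergraduate": {"Female": 3814, "Male": 3428},
    "Graduate": {"Female": 2213, "Male": 2787},
}

grand_total = sum(sum(row.values()) for row in table.values())
undergrad_total = sum(table["Undergraduate"].values())
female_total = sum(row["Female"] for row in table.values())
female_undergrads = table["Undergraduate"]["Female"]

# Denominator: all students.
p_undergrad = undergrad_total / grand_total
# Denominator: females only.
p_undergrad_among_females = female_undergrads / female_total
# Denominator: undergraduates only.
p_female_among_undergrads = female_undergrads / undergrad_total

print(round(p_undergrad, 3))                # 0.592
print(round(p_undergrad_among_females, 3))  # 0.633
print(round(p_female_among_undergrads, 3))  # 0.527
```

Swapping which group is in the denominator changes the question being answered, which is why the last two proportions differ.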

Segmented Bar Chart

Also known as a stacked bar chart, one categorical variable is represented on the x-axis while the second categorical variable is denoted within the bars. Minitab Express will not construct a stacked bar chart, but other software will. The segmented bar chart below was constructed using Excel.

Clustered Bar Chart

Each bar represents one combination of the two categorical variables (i.e., one cell in a contingency table). This is also known as a side-by-side bar chart.

2.1.2.1 - Minitab Express: Two-Way Table

This dataset consists of STAT 200 students' responses to a survey. We can construct a two-way table showing the relationship between Smoke Cigarettes (row variable) and Biological Sex (column variable) using Minitab Express.

MinitabExpress – Two-Way Table

To create a two-way table in Minitab Express:

1. Open the data set:
2. On a PC: Select STATISTICS > Cross Tabulation and Chi-square
On a Mac: Select Statistics > Tables > Cross Tabulation and Chi-Square
3. Select Raw data (categorical variable) from the drop down menu
4. Double click the variable Smoke Cigarettes in the box on the left to insert the variable into the Rows box
5. Double click the variable Biological Sex in the box on the left to insert the variable into the Columns box
6. Click OK

This should result in the two-way table below:

Tabulated Statistics: Smoke Cigarettes, Biological Sex
Female Male All
120 89 209
7 10 17
All 127 99 226
Cell Contents: Count

2.1.2.2 - Minitab Express: Clustered Bar Chart

We are going to use the Class Survey data set again in this example:

MinitabExpress – Clustered Bar Chart

To create a clustered bar chart in Minitab Express:

1. Open the data set:
2. On a PC or Mac: Select Graphs > Bar Chart
3. In this example we have a datafile with the responses from each case so for Bars represent select Counts of unique values in a categorical variable
4. Select Clustered
5. Double click the variables Biological Sex and Smoke Cigarettes in the box on the left to insert them into the Categorical variables box
6. Click OK

This should result in the clustered bar chart below:

Note: The order in which the variables are entered into the Categorical variables box in Minitab Express determines how the bars will be clustered. For example, if we entered Smoke Cigarettes and then Biological Sex, the result would be the following clustered bar chart:

2.1.3 - Probability Rules

The probability rules covered in this lesson can be found in section P.1 of the Lock5 textbook.

Earlier in this lesson you were introduced to proportions. We used the notation: $Proportion=\frac{Number\;in\;the\;category}{Total\;number}$.

When we discuss probabilities, we will use the notation below where $P(A)$ is the probability of event $A$ occurring. Probabilities are typically written in decimal form but may also be translated to percentages.

Note that this is the same formula that you learned earlier in Lesson 2.1.1 for a proportion.

Probability of Event A
$P(A)=\dfrac{Number\;in\;group\;A}{Total\;number}$

What is the probability that a randomly selected card from a standard 52-card deck will be a spade? There are 13 spades in the deck of 52.

$P(spade)=\dfrac{13}{52}=0.25$

The probability of pulling a spade is 0.25. We could also say that there is a 25% chance of pulling a spade.

Example: Odd Numbers

If you roll a six-sided die, what is the probability of getting an odd number? There are three odd numbers on the die (1, 3, 5).

$P(odd)=\dfrac{3}{6}=0.50$

The probability of rolling an odd number is 0.50. We could also say that there is a 50% chance of rolling an odd number.

Example: Raffle

There are a total of 500 raffle tickets and you have purchased 10. What is the probability that one of your tickets will be randomly selected to win the raffle?

$P(winning)=\dfrac{10}{500}=0.02$

The probability of you winning is 0.02. We could also say that there is a 2% chance that you will win.
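
When every outcome is equally likely, $P(A)$ can be found by listing the outcomes and counting those in group A. The Python sketch below does this for the spade, odd-number, and raffle examples. The function name `probability`, the deck representation, and the ticket numbering (your tickets are assumed to be numbers 1 through 10) are hypothetical choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def probability(outcomes, in_group):
    """P(A) = number in group A / total number, for equally likely outcomes."""
    outcomes = list(outcomes)
    favorable = sum(1 for outcome in outcomes if in_group(outcome))
    return Fraction(favorable, len(outcomes))


p_spade = probability(DECK, lambda card: card[1] == "Spades")
p_odd = probability(range(1, 7), lambda roll: roll % 2 == 1)
# Assumption: tickets are numbered 1-500 and yours are numbers 1-10.
p_winning = probability(range(1, 501), lambda ticket: ticket <= 10)

print(float(p_spade))    # 0.25
print(float(p_odd))      # 0.5
print(float(p_winning))  # 0.02
```

Using `Fraction` keeps the results exact until they are converted to decimals for display.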

2.1.3.1 - Range of Probabilities

The probability of an impossible event is 0 and the probability of a certain event is 1. The range of possible probabilities is: $0 \leq P(A) \leq 1$. It is not possible to have a probability less than 0 or greater than 1.

Example: Rolling an 8

It is impossible to roll an eight on a six-sided die.

$P(rolling\; 8)= \dfrac{0}{6} = 0$

Example: Blue Cards

In a standard 52-card deck all cards are black or red. There are no blue cards.

$P(blue)=\dfrac{0}{52}=0$

Example: Rolling a Value Between 1 and 6

A six-sided die contains the values 1, 2, 3, 4, 5, and 6. All rolls will result in a value between 1 and 6.

$P(rolling \;1 \;to\; 6)=\dfrac{6}{6}=1.00$

2.1.3.2 - Combinations of Events

In situations with two or more categorical variables there are a number of different ways that combinations of events can be described: intersections, unions, complements, and conditional probabilities. Each of these combinations of events is covered in your textbook. However, note that your textbook does not use the symbols that are most commonly used when discussing these combinations of events. The symbols that we will be using are in the table below. In this section, you will also learn about disjoint events and independent events.

Combination of Events
Combination Symbol Definition
Intersection $P(A\cap B)$ Probability of A and B
Union $P(A\cup B)$ Probability of A or B (note: this includes the possibility of A and B)

Complement $P(A^C)$ The probability of NOT A
Conditional $P(A\mid B)$ The probability of A given B

2.1.3.2.1 - Disjoint & Independent Events

Note that disjoint events and independent events are different. Events are considered disjoint if they never occur at the same time; these are also known as mutually exclusive events. Events are considered independent if they are unrelated.

Disjoint Events

Two events that cannot occur at the same time. These are also known as mutually exclusive events.

In the Venn diagram below event A and event B are disjoint events because the two do not overlap.

Venn diagram
A visual representation in which the sample space is depicted as a box and events are represented as circles within the sample space.
Independent Events
Unrelated events. The outcome of one event does not impact the outcome of the other event.

Example: Freshmen & Sophomores

Let's consider undergraduate class status. A student can be classified as a freshman, sophomore, junior, or senior.

Being a freshman and being a sophomore are disjoint events because an individual cannot be classified as both at the same time.

Being a freshman is not independent of being a sophomore. If I know that an individual is a freshman then the probability that they are a sophomore is 0; knowing that the student was a freshman provided information that influenced my prediction of them being a sophomore.

Example: Class Status & Gender

Assume that there is no relationship between gender and class status. This means that within each class (freshmen, sophomores, juniors, seniors) the proportion of students who are men is consistent. It also means that within each gender the proportion of students who are freshmen, sophomores, juniors, and seniors is consistent.

In this case, we could say that the events of being a man and being a senior are independent events. Knowing that a student is a man does not influence the likelihood of him being a senior. Knowing that a student is a senior does not change the likelihood of them being a man.

There are some men who are seniors, so these events are not disjoint.

2.1.3.2.2 - Intersections

Intersection

The overlap of two or more events is symbolized by the character $\cap$.

$P(A \cap B)$ is read as "the probability of A and B."

Example: Red King

What is the probability of randomly selecting a card from a standard 52-card deck that is a red card and a king?

There are 2 kings that are red cards: the king of hearts and the king of diamonds.

$P(red \cap king)=\dfrac{2}{52}=.0385$

The two-way table below displays the World Campus enrollment from Fall 2015 in terms of level (undergraduate and graduate) and biological sex. What proportion of World Campus students were female and undergraduate students?

Level Female Male Total
Undergraduate 3814 3428 7242
Graduate 2213 2787 5000
Total 6027 6215 12242

There are 3814 students who are females and undergraduates out of a total of 12242 students.

$P(F \cap U)=\dfrac{3814}{12242}=0.312$

2.1.3.2.3 - Unions

Union

A union contains the area in A or B and is symbolized by $\cup$. Note that this also includes the overlap of A and B (i.e., the intersection).

$P(A \cup B)$ is read as "the probability of A or B."

Union
$P(A\cup B) = P(A)+P(B)-P(A\cap B)$

What is the probability of randomly selecting a card from a standard 52-card deck that is a heart or spade?

There are 13 cards that are hearts, 13 cards that are spades, and no cards that are both a heart and a spade.

$P(heart \cup spade)=\dfrac{13}{52}+\dfrac{13}{52}-\dfrac{0}{52}= \dfrac {26}{52}=0.5$

Example: Hearts or Aces

What is the probability of randomly selecting a card from a standard 52-card deck that is a heart or an ace?

There are 13 cards that are hearts and 4 cards that are aces. There is one ace of hearts, so one of those 4 aces has already been counted.

$P(heart \cup ace)=\dfrac{13}{52}+\dfrac{4}{52}-\dfrac{1}{52}=\dfrac{16}{52}=0.308$

The two-way table below displays the World Campus enrollment from Fall 2015 in terms of level (undergraduate and graduate) and biological sex. What proportion of World Campus students were female or undergraduate students?

Level Female Male Total
Undergraduate 3814 3428 7242
Graduate 2213 2787 5000
Total 6027 6215 12242

When we have a contingency table we can take the appropriate values from the table as opposed to using the formula given above. There are 3814 female undergraduate students, 3428 male undergraduate students, 2213 female graduate students, and a total of 12242 students.

$P(F \cup U)=\dfrac{3814+3428+2213}{12242}=\dfrac{9455}{12242}=0.772$

Note that the final answer would be the same if we had used the formula:

$P(F \cup U) = \dfrac{6027}{12242}+\dfrac{7242}{12242}-\dfrac{3814}{12242}= \dfrac{9455}{12242}=0.772$
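
Intersections and unions of events correspond to intersections and unions of sets of outcomes. The Python sketch below represents each card event as a set, checks the card examples from this section and the previous one, and confirms that the union formula agrees with direct counting. The helper `p` and the set names are hypothetical and exist only for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
DECK = set(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def p(event):
    """Probability of an event given as a set of equally likely cards."""
    return Fraction(len(event), len(DECK))


hearts = {card for card in DECK if card[1] == "Hearts"}
spades = {card for card in DECK if card[1] == "Spades"}
red = {card for card in DECK if card[1] in ("Hearts", "Diamonds")}
kings = {card for card in DECK if card[0] == "King"}
aces = {card for card in DECK if card[0] == "Ace"}

p_red_and_king = p(red & kings)        # intersection: A and B
p_heart_or_spade = p(hearts | spades)  # union: A or B
p_heart_or_ace = p(hearts | aces)

# The union formula gives the same answer as counting the union directly.
assert p_heart_or_ace == p(hearts) + p(aces) - p(hearts & aces)

# Two-way table: females or undergraduates, using the union formula.
p_female_or_undergrad = Fraction(6027 + 7242 - 3814, 12242)

print(round(float(p_red_and_king), 4))         # 0.0385
print(float(p_heart_or_spade))                 # 0.5
print(round(float(p_heart_or_ace), 3))         # 0.308
print(round(float(p_female_or_undergrad), 3))  # 0.772
```

Subtracting the intersection matters only when the events overlap: hearts and spades share no cards, so nothing is subtracted, while hearts and aces share the ace of hearts.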

2.1.3.2.4 - Complements

Complement

The event that a given event does not occur. The complement of event $A$ is $A^C$, and its probability is $P(A^C)$, which may also be written as $P(A')$.

In the diagram below we can see that $A^{C}$ is everything in the sample space that is not A.

Complement of A
$P(A^{C})=1-P(A)$

Example: Coin Flip

When flipping a coin, one can flip heads or tails. Thus, $P(Tails^{C})=P(Heads)$ and $P(Heads^{C})=P(Tails)$

Example: Hearts

If you randomly select a card from a standard 52-card deck, you could pull a heart, diamond, spade, or club. The complement of pulling a heart is pulling a diamond, spade, or club. In other words: $P(Heart^{C})=P(Diamond\;or\;Spade\;or\;Club)$

The probability of the complement of any event is equal to one minus the probability of the event. In other words: $P(A^{C})=1-P(A)$

It is also true then that: $P(A)=1-P(A^{C})$

Example: Rain

According to the weather report, there is a 30% chance of rain today: $P(Rain) = .30$

Raining and not raining are complements.

$P(Not \:rain)=P(Rain^{C})=1-P(Rain)=1-.30=.70$

There is a 70% chance that it will not rain today.

Example: Winning

The probability that your team will win their next game is calculated to be .45, in other words:

$P(Winning)=.45$

Winning and losing are complements of one another. Therefore the probability that they will lose is:

$P(Losing)=P(Winning^{C})=1-.45=.55$

The probabilities of all possible outcomes sum to 1, provided that the outcomes do not overlap.

Example: Cards

In a standard 52-card deck there are 26 black cards and 26 red cards. All cards are either black or red.

$P(red)+P(black)=\frac{26}{52}+\frac{26}{52}=1$

Example: Dominant Hand

Of individuals with two hands, it is possible to be right-handed, left-handed, or ambidextrous. Assuming that these are the only three possibilities and that there is no overlap between any of these possibilities:

$P(right\;handed)+P(left\;handed)+P(ambidextrous) = 1$
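
The complement rule is a one-line calculation. The short Python sketch below applies it to the rain, winning, and hearts examples; the function name `complement` is a hypothetical choice for this illustration.

```python
from fractions import Fraction


def complement(p_event):
    """P(A^C) = 1 - P(A)."""
    if not 0 <= p_event <= 1:
        raise ValueError("a probability must be between 0 and 1")
    return 1 - p_event


p_not_rain = complement(Fraction(30, 100))   # P(Rain) = .30
p_losing = complement(Fraction(45, 100))     # P(Winning) = .45
p_not_heart = complement(Fraction(13, 52))   # 13 hearts in 52 cards

print(float(p_not_rain))   # 0.7
print(float(p_losing))     # 0.55
print(float(p_not_heart))  # 0.75
```

In each case the probability of the event and the probability of its complement add to 1.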

2.1.3.2.5 - Conditional Probability

Conditional Probability

The probability of one event occurring given that it is known that a second event has occurred. This is communicated using the symbol $\mid$ which is read as "given."

For example, $P(A\mid B)$ is read as "Probability of A given B."

The two-way table below displays the World Campus enrollment from Fall 2019 in terms of level (undergraduate and graduate) and residency (Pennsylvania and non-Pennsylvania). Given that an individual is an undergraduate student, what is the probability that the student is a Pennsylvania resident?

Level Pennsylvania Non-Pennsylvania Total
Undergraduate 3757 4603 8360
Graduate 2253 4074 6327
Total 6010 8677 14687

We know that the individual is an undergraduate student so we will only look at the 8360 undergraduate students. Of those 8360 undergraduate students, 3757 were Pennsylvania residents.

$P(PA \mid Undergrad) = \dfrac{3757}{8360}=0.449$
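
Reading a conditional probability from a two-way table amounts to restricting attention to one row (or column) and dividing within it. A minimal Python sketch of that step, using the counts quoted in the example above, follows. The function name `conditional_from_counts` is hypothetical.

```python
def conditional_from_counts(count_a_and_b, count_b):
    """P(A | B) from counts: cases with both A and B, out of cases with B."""
    if count_b <= 0:
        raise ValueError("the given event must include at least one case")
    if not 0 <= count_a_and_b <= count_b:
        raise ValueError("joint count cannot exceed the count of the given event")
    return count_a_and_b / count_b


undergrad_total = 8360        # all undergraduate students (the given group)
undergrad_pa_residents = 3757  # undergraduates who are Pennsylvania residents

p_pa_given_undergrad = conditional_from_counts(undergrad_pa_residents,
                                               undergrad_total)

print(round(p_pa_given_undergrad, 3))  # 0.449
```

Note that the overall total of 14687 students never enters the calculation, because the condition has already narrowed the group to undergraduates.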

2.1.3.2.5.1 - Advanced Conditional Probability Applications

Conditional probabilities can also be computed using the following formulas. Note that these two formulas are identical, but A and B are switched. Again, if the contingency table is available it is usually most efficient to take the appropriate values from the table, as shown above, as opposed to using these formulas.

Conditional Probability of A Given B
$P(A\mid B)=\dfrac{P(A \: \cap\: B)}{P(B)}$
Conditional Probability of B Given A
$P(B\mid A)=\dfrac{P(A \: \cap\: B)}{P(A)}$

Example: Clubs

In a standard 52-card deck, there are 26 black cards including 13 clubs. All clubs are black, therefore there are 13 black clubs.

What is the probability that a randomly selected card is a club given that it is a black card?

We are given that $P(club)=\frac{13}{52}=0.25$, $P(black)=\frac{26}{52}=0.50$, and $P(club \: \cap\: black)=\frac{13}{52}=0.25$.

$P(club\mid black)=\dfrac{P(club \: \cap\: black)}{P(black)}=\dfrac{0.25}{0.50}=0.50$

Given that a randomly selected card is black, there is a 50% chance that it's a club.

Independent Events Written as Conditional Probabilities

If events A and B are independent then $P(A) = P(A \mid B)$. In other words, whether or not event B occurs does not change the probability of event A occurring.

Example: Checking for Independence, Aces and Hearts

A card is randomly drawn from a 52-card deck. Are the events of drawing an ace and drawing a heart independent?

In a standard 52-card deck, there are 4 aces and 13 hearts. Therefore $P(ace)=\frac{4}{52}$ and $P(heart)=\frac{13}{52}$. Out of 13 hearts, 1 is an ace, which translates to $P(ace \mid heart) = \frac{1}{13}$.

To determine if these two events are independent we can compare $P(A)$ to $P(A\mid B)$. If we call being an ace event A and being a heart event B, then we're comparing $P(ace)$ to $P(ace \mid heart)$.

$P(ace)=\frac{4}{52}=0.0769$

$P(ace \mid heart) = \frac{1}{13}=0.0769$

These values are identical, therefore we can conclude that the events of drawing an ace and drawing a heart are independent.
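
The conditional probability formula and the independence check can both be verified by enumerating the deck. In the Python sketch below, the helpers `p`, `p_given`, and `independent` are hypothetical names introduced for this illustration. Exact fractions are used so that the comparison of $P(A)$ with $P(A \mid B)$ is not affected by rounding.

```python
from fractions import Fraction
from itertools import product

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
DECK = set(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def p(event):
    """Probability of an event given as a set of equally likely cards."""
    return Fraction(len(event), len(DECK))


def p_given(a, b):
    """P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    if not b:
        raise ValueError("cannot condition on an impossible event")
    return p(a & b) / p(b)


def independent(a, b):
    """A and B are independent when P(A) equals P(A | B)."""
    return p(a) == p_given(a, b)


clubs = {card for card in DECK if card[1] == "Clubs"}
black = {card for card in DECK if card[1] in ("Spades", "Clubs")}
aces = {card for card in DECK if card[0] == "Ace"}
hearts = {card for card in DECK if card[1] == "Hearts"}

print(p_given(clubs, black))       # 1/2
print(p(aces))                     # 1/13
print(p_given(aces, hearts))       # 1/13
print(independent(aces, hearts))   # True
print(independent(clubs, black))   # False
```

The last line gives a contrast case: knowing that a card is black raises the probability of a club from 0.25 to 0.50, so those two events are not independent.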

2.2 - One Quantitative Variable

One quantitative variable is covered in Sections 2.2 and 2.3 of the Lock5 textbook. In these sections, you will learn how to describe the distribution of a quantitative variable in terms of shape, central tendency, and variability. You will be introduced to the normal distribution, z scores, percentiles, graphs, and the five-number summary.

2.2.1 - Graphs: Dotplots and Histograms

Dotplots and histograms are both graphical displays that can be used with one quantitative variable. In both of these plots the horizontal axis represents the values of the variable. The number of dots in a dotplot, or the height of the bars in a histogram, represent the number of cases with each value or range of values.

Dotplot

Histogram

MinitabExpress – Dotplot

To create a dotplot in Minitab Express:

1. Open the data set:
2. On a PC or Mac: Select GRAPHS > Dotplot
3. Select Simple
4. Double click the variable Verbal SAT (2005) in the box on the left to insert the variable into the Y variable box
5. Click OK

This should result in the following dotplot:

MinitabExpress – Histogram

To create a histogram in Minitab Express:

1. Open the data set:
2. On a PC or Mac: Select GRAPHS > Histogram
3. Select Simple
4. Double click the variable Verbal SAT(2005) in the box on the left to insert the variable into the Y variable box
5. Click OK

This should result in the following histogram:

2.2.2 - Outliers

Some observations within a set of data may fall outside the general scope of the other observations. Such observations are called outliers. Outliers can be identified by looking at a dotplot or histogram. In Lesson 3 you'll learn about boxplots, which can also be used to identify outliers. When constructing a boxplot, Minitab Express identifies outliers using mathematical methods that you will see next week. This week we will identify outliers by making a relatively subjective judgement from a given list of data points, a dotplot, or a histogram.

Example: Dotplot of Hours Watching TV

A sample of STAT 200 students was surveyed and asked how many hours per week they watch television. A dotplot was constructed using these data.

The right-most dot is definitely an outlier because it is much higher than any other points. The other higher points, around 55, 50, and 46, may be outliers. Next week we will learn about some mathematical methods for identifying outliers that can help us make decisions in cases like this where it is not obvious which values are outliers.

Example: Histogram of Best Marriage Age

This sample of students was also asked what they believed was the best age to get married. A histogram was constructed using these data.

There appear to be three outliers in this sample, all on the higher end.

2.2.3 - Shape

Quantitative variables are often discussed in terms of their shape. Both dotplots and histograms can be used to interpret a distribution's shape. A distribution may be described in terms of symmetry and skewness.

Symmetrical Distribution

A distribution that is similar on both sides of the center.

Normal Distribution

One specific type of symmetrical distribution. This is also known as a bell-shaped distribution.

Skewed
A distribution in which values are more spread out on one side of the center than on the other.
Right Skewed

A distribution in which the higher values (towards the right on a number line) are more spread out than the lower values. This is also known as positively skewed.

Left Skewed

A distribution in which the lower values (towards the left on a number line) are more spread out than the higher values. This is also known as negatively skewed.

2.2.4 - Measures of Central Tendency

Quantitative variables are often summarized using numbers to communicate their central tendency. The mean, median, and mode are three of the most commonly used measures of central tendency.

Mean

The numerical average; calculated as the sum of all of the data values divided by the number of values.

The sample mean is represented as $\overline{x}$ ("x-bar") and the population mean is denoted as the Greek letter $\mu$ ("mu"). The formula is the same for the sample mean and the population mean.

Population Mean
$\mu=\dfrac{\Sigma x}{N}$
Sample Mean
$\overline {x} = \dfrac{\Sigma x}{n}$
Median
The middle of the distribution that has been ordered from smallest to largest; for distributions with an even number of values, this is the mean of the two middle values.
Mode
The most frequently occurring value(s) in the distribution; may be used with quantitative or categorical variables.

Example: Hours Spent Studying

A professor asks a sample of 7 students how many hours they spent studying for the final. Their responses are: 5, 7, 8, 9, 9, 11, and 13.

Mean

$\overline{x} = \dfrac{\sum x}{n} =\dfrac{5+7+8+9+9+11+13}{7} =\dfrac{62}{7} =8.857$

The mean is 8.857 hours.

Median

The observations are already in order from smallest to largest. The middle observation is 9 hours. The median is 9 hours.

Mode

The most frequently occurring observation was 9 hours. The mode is 9 hours.

In this example, the mean, median, and mode are all similar. Recall from our discussion of shape that the mean, median, and mode are all equal when a distribution is symmetric. This distribution of hours spent studying is probably close to symmetrical.

Example: Test Scores

A teacher wants to examine students’ test scores. Their scores are: 74, 88, 78, 90, 94, 90, 84, 90, 98, and 80.

Mean

$\overline{x}\: =\: \dfrac{\sum x}{n} = \dfrac{74+88+78+90+94+90+84+90+98+80}{10} = \dfrac{866}{10}=86.6$

The mean score was 86.6.

Median

First, we need to put the scores in order from lowest to highest: 74, 78, 80, 84, 88, 90, 90, 90, 94, 98

Because there is an even number of scores, the median will be the mean of the middle two values. The middle two values are 88 and 90. $\frac{88+90}{2}=89$

The median is 89.

Mode

The most frequently occurring score was 90. There were 3 students who scored a 90; this is the mode. Because this distribution has one mode, it is unimodal.

In this example, the mean is slightly lower than the median, which is slightly lower than the mode. Recall from our discussion of shape that this occurs when a distribution is skewed to the left. This distribution is probably slightly skewed to the left.

Example: Household Size

A group of children are asked how many people live in their household. The following data are collected: 4, 3, 6, 2, 2, 4, 3.

Mean

$\overline{x} = \dfrac{\sum x}{n}=\dfrac{4+3+6+2+2+4+3}{7}=\dfrac{24}{7}=3.429$

The mean household size in this group of children is 3.429 people.

Median

First, we need to put all of the values in order from smallest to largest: 2, 2, 3, 3, 4, 4, 6

The value in the middle of this distribution is 3. The median is 3.

Mode

In this distribution, the most common values are 2, 3, and 4. Each of these values occurs twice. There are 3 modes: 2, 3, and 4. This distribution is multimodal.
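
The three worked examples above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module rather than Minitab Express; the dictionary keys and the helper name `central_tendency` are illustrative choices, not part of the lesson.

```python
from statistics import mean, median, multimode

# Data copied from the three examples above.
examples = {
    "hours_studying": [5, 7, 8, 9, 9, 11, 13],
    "test_scores": [74, 88, 78, 90, 94, 90, 84, 90, 98, 80],
    "household_size": [4, 3, 6, 2, 2, 4, 3],
}


def central_tendency(values):
    """Return (mean, median, sorted list of modes) for a list of numbers."""
    # multimode returns every value tied for the highest frequency,
    # so it handles unimodal and multimodal data alike.
    return mean(values), median(values), sorted(multimode(values))


for name, values in examples.items():
    m, med, modes = central_tendency(values)
    print(f"{name}: mean={m:.3f}, median={med}, modes={modes}")
```

For the household size data the list of modes contains 2, 3, and 4, matching the multimodal result found by hand.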

2.2.4.1 - Skewness & Central Tendency

The preferred measure of central tendency often depends on the shape of the distribution. In a symmetrical distribution, the mean, median, and mode are all equal. In these cases, the mean is often the preferred measure of central tendency.

For distributions that are strongly skewed or have outliers, the median is often the most appropriate measure of central tendency because in skewed distributions the mean is pulled out toward the tail. The median is more resistant to outliers than the mean. Of these three measures of central tendency, the mean is most influenced by outliers. The direction of skewness also affects the order of the mean, median, and mode: in a left-skewed distribution the mean is typically lower than the median, which is lower than the mode, and in a right-skewed distribution the order is typically reversed.
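
To see how much more an outlier moves the mean than the median, consider the sketch below. The data are hypothetical (they are not from the STAT 200 survey), and the helper name `summarize` is an illustrative choice.

```python
from statistics import mean, median

# Hypothetical weekly TV hours for six students (made up for illustration).
hours = [2, 3, 4, 5, 6, 7]
# The same sample with one extreme value added on the high end.
hours_with_outlier = hours + [70]


def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)


mean_before, median_before = summarize(hours)
mean_after, median_after = summarize(hours_with_outlier)

print(f"without outlier: mean={mean_before:.3f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.3f}, median={median_after}")
```

Adding the single value of 70 raises the mean from 4.5 to about 13.857, while the median only moves from 4.5 to 5. The mean is pulled toward the right tail, which is why the median is often preferred for skewed data.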

Variance and standard deviation are measures of variability. The standard deviation is the most commonly used measure of variability when data are quantitative and approximately normally distributed. When computing the standard deviation by hand, it is necessary to first compute the variance. The standard deviation is equal to the square root of the variance. Here, you will learn how to compute these values by hand. After this lesson, you will always be computing standard deviation using software such as Minitab Express.

Standard Deviation
Roughly the average difference between individual data values and the mean. The standard deviation of a sample is denoted as $s$. The standard deviation of a population is denoted as $\sigma$.
Sample Standard Deviation
$s=\sqrt{\dfrac{\sum (x-\overline{x})^{2}}{n-1}}$

To compute the standard deviation for a sample, we first compute the deviations. The sum of the squared deviations (SS) divided by $n-1$ is the variance ($s^2$).

The square root of the variance is the standard deviation: $\sqrt{s^2}=s$.

Deviation
An individual score minus the mean.
Sum of Squared Deviations
Deviations squared and added together. This is also known as the sum of squares or SS.
Variance
Approximately the average of all of the squared deviations; for a sample, this is represented as $s^{2}$.
Sum of Squares
$SS={\sum (x-\overline{x})^{2}}$
Sample Variance
$s^{2}=\dfrac{\sum (x-\overline{x})^{2}}{n-1}$

There are a number of methods for calculating the standard deviation. If you look through different textbooks or search online, you may find different formulas and procedures. To compute the standard deviation for a sample, we will use the formulas above and the following steps:

Step 1: Compute the sample mean: $\overline{x} = \frac{\sum x}{n}$.

Step 2: Subtract the sample mean from each individual value: $x-\overline{x}$. These are the deviations.

Step 3: Square each deviation: $(x-\overline{x})^{2}$. These are the squared deviations.

Step 4: Add the squared deviations: $\sum (x-\overline{x})^{2}$. This is the sum of squares.

Step 5: Divide the sum of squares by $n-1$: $\frac{\sum (x-\overline{x})^{2}}{n-1}$. This is the sample variance $(s^{2})$.

Step 6: Take the square root of the sample variance: $\sqrt{\frac{\sum (x-\overline{x})^{2}}{n-1}}$. This is the sample standard deviation.

Example: Hours Spent Studying

A professor asks a sample of 7 students how many hours they spent studying for the final. Their responses are: 5, 7, 8, 9, 9, 11, and 13.

Step 1: Compute the mean

$\overline{x} = \dfrac{\sum x}{n}=\dfrac{5+7+8+9+9+11+13}{7}=8.857$

Step 2: Compute the deviations

$x$ $x - \overline{x}$
5 $5 - 8.857 = -3.857$
7 $7 - 8.857 = -1.857$
8 $8 - 8.857 = -0.857$
9 $9 - 8.857 = 0.143$
9 $9 - 8.857 = 0.143$
11 $11 - 8.857 = 2.143$
13 $13 - 8.857 = 4.143$

Step 3: Square the deviations

$x$ $x - \overline{x}$ $(x-\overline{x})^{2}$
5 $5 - 8.857 = -3.857$ $(-3.857)^{2} = 14.876$
7 $7 - 8.857 = -1.857$ $(-1.857)^{2} = 3.448$
8 $8 - 8.857 = -0.857$ $(-0.857)^{2} = 0.734$
9 $9 - 8.857 = 0.143$ $0.143^{2} = 0.020$
9 $9 - 8.857 = 0.143$ $0.143^{2} = 0.020$
11 $11 - 8.857 = 2.143$ $2.143^{2} = 4.592$
13 $13 - 8.857 = 4.143$ $4.143^{2} = 17.164$

Step 4: Sum the squared deviations

$SS=\sum (x-\overline{x})^{2}=14.876+3.448+0.734+0.020+0.020+4.592+17.164=40.854$

The sum of squares is 40.854.

Step 5: Divide by n - 1 to compute the variance

$s^{2}=\dfrac{\sum (x-\overline{x})^{2}}{n-1}=\dfrac{40.854}{7-1}=6.809$

The variance is 6.809.

Step 6: Take the square root of the variance

$s=\sqrt{s^{2}}=\sqrt{6.809}=2.609$

The standard deviation is 2.609.
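
The six steps can also be sketched in code. The function below is an illustration written for this example (the name `sample_std_by_steps` is hypothetical); it follows the by-hand procedure and then compares the result with `statistics.stdev` from Python's standard library, which also uses the $n-1$ denominator.

```python
from math import sqrt
from statistics import stdev


def sample_std_by_steps(values):
    """Follow the six by-hand steps; return (SS, variance, standard deviation)."""
    n = len(values)
    x_bar = sum(values) / n                      # Step 1: sample mean
    deviations = [x - x_bar for x in values]     # Step 2: deviations
    squared = [d ** 2 for d in deviations]       # Step 3: squared deviations
    ss = sum(squared)                            # Step 4: sum of squares
    variance = ss / (n - 1)                      # Step 5: sample variance
    return ss, variance, sqrt(variance)          # Step 6: standard deviation


hours = [5, 7, 8, 9, 9, 11, 13]
ss, variance, s = sample_std_by_steps(hours)
print(f"SS={ss:.3f}, variance={variance:.3f}, s={s:.3f}")
print(f"statistics.stdev gives {stdev(hours):.3f}")
```

Because the code does not round the mean to 8.857 before squaring, its results can differ from the hand calculation in the third decimal place. For example, the unrounded sum of squares is about 40.857 rather than 40.854.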

2.2.6 - Minitab Express: Central Tendency & Variability

Minitab Express may be used to compute descriptive statistics such as the mean, median, mode, standard deviation, and variance.

Note that these are the default settings in Minitab Express:

If you want the mode or variance, you will need to select them under the Statistics tab.

MinitabExpress – Central Tendency

To obtain measures of central tendency and variability in Minitab Express:

1. Open the data set:
2. On a PC: from the menu select STATISTICS > Describe
On a Mac: from the menu select Statistics > Summary Statistics > Descriptive Statistics
3. Double click the variable Height in the box on the left to insert the variable into the Variable box
4. Click on the Statistics tab and select the descriptive statistics that you want displayed
5. Click OK

This should result in the following output:

Descriptive Statistics: Height
Statistics
Variable N N* Mean SE Mean StDev Minimum Q1 Median Maximum
Height 525 0 67.0090 0.1947 4.4616 51.0000 64.0000 67.0000 82.0000
2.2.7 - The Empirical Rule

A normal distribution is symmetrical and bell-shaped.

The Empirical Rule is a statement about normal distributions. Your textbook uses an abbreviated form of this, known as the 95% Rule, because 95% is the most commonly used interval. The 95% Rule states that approximately 95% of observations fall within two standard deviations of the mean on a normal distribution.

Normal Distribution
A specific type of symmetrical distribution, also known as a bell-shaped distribution
Empirical Rule

On a normal distribution about 68% of data will be within one standard deviation of the mean, about 95% will be within two standard deviations of the mean, and about 99.7% will be within three standard deviations of the mean

95% Rule
On a normal distribution approximately 95% of data will fall within two standard deviations of the mean; this is an abbreviated form of the Empirical Rule

Example: Pulse Rates

Suppose the distribution of pulse rates of 200 college men is bell-shaped with a mean of 72 and a standard deviation of 6.

• About 68% of the men have pulse rates in the interval $72\pm1(6)=[66, 78]$.
• About 95% of the men have pulse rates in the interval $72\pm2(6)=[60, 84]$.
• About 99.7% of the men have pulse rates in the interval $72\pm 3(6)=[54, 90]$.

Example: IQ Scores

IQ scores are normally distributed with a mean of 100 and a standard deviation of 15.

• About 68% of individuals have IQ scores in the interval $100\pm 1(15)=[85,115]$.
• About 95% of individuals have IQ scores in the interval $100\pm 2(15)=[70,130]$.
• About 99.7% of individuals have IQ scores in the interval $100\pm 3(15)=[55,145]$.
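
The intervals in both examples come from the same arithmetic: the mean plus or minus 1, 2, or 3 standard deviations. A small sketch, assuming the distribution is approximately normal (the function name `empirical_rule_intervals` is illustrative):

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate percentage to its (lower, upper) interval.

    Only meaningful when the distribution is approximately normal.
    """
    # Number of standard deviations -> approximate percent of observations.
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in coverage.items()}


for label, (mean, sd) in {"pulse rates": (72, 6), "IQ scores": (100, 15)}.items():
    print(label)
    for pct, (low, high) in empirical_rule_intervals(mean, sd).items():
        print(f"  about {pct}% within [{low}, {high}]")
```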

2.2.8 - z-scores

Often we want to describe an observation in relation to the distribution of all observations. We can do this using a z-score. By converting observations to z-scores, we can compare observations from different distributions.

z-score

Distance between an individual score and the mean in standard deviation units; also known as a standardized score.

z-score
$z=\dfrac{x - \overline{x}}{s}$

$x$ = original data value
$\overline{x}$ = mean of the original distribution
$s$ = standard deviation of the original distribution

This equation could also be rewritten in terms of population values: $z=\frac{x-\mu}{\sigma}$

Later in the course, we will learn more about the z-distribution, which is a special case of the normal distribution.

z-distribution

A bell-shaped distribution with a mean of 0 and standard deviation of 1, also known as the standard normal distribution.

Example: Milk

A study of 66,831 dairy cows found that the mean milk yield was 12.5 kg per milking with a standard deviation of 4.3 kg per milking (data from Berry, et al., 2013).

A cow produces 18.1 kg per milking. What is this cow’s z-score?

$z=\frac{x-\overline{x}}{s} =\frac{18.1-12.5}{4.3}=1.302$

This cow’s z-score is 1.302; her milk production was 1.302 standard deviations above the mean.

A cow produces 12.5 kg per milking. What is this cow’s z-score?

$z=\frac{x-\overline{x}}{s} =\frac{12.5-12.5}{4.3}=0$

This cow’s z-score is 0; her milk production was the same as the mean.

A cow produces 8 kg per milking. What is this cow’s z-score?

$z=\frac{x-\overline{x}}{s} =\frac{8-12.5}{4.3}=-1.047$

This cow’s z-score is -1.047; her milk production was 1.047 standard deviations below the mean.

Berry, D. P., Coyne, J., Coughlan, B., Burke, M., McCarthy, J., Enright, B., Cromie, A. R., & McParland, S. (2013). Genetics of milking characteristics in dairy cows. Animal, 7(11), 1750-1758.

Example: Comparing Test Scores

SAT-Math scores are normally distributed with a mean of 500 and standard deviation of 100. ACT-Math scores are normally distributed with a mean of 18 and standard deviation of 6. A student has taken both tests. They scored 600 on the SAT-Math and 22 on the ACT-Math. Which score is more impressive?

We can't directly compare the student's SAT and ACT scores because they are on different scales. We can convert these test scores into z-scores so we can directly compare them.

$z_{SAT}=\frac{600-500}{100}=1$

This student scored 1 standard deviation above the mean on the SAT-Math.

$z_{ACT}=\frac{22-18}{6}=0.667$

This student scored 0.667 standard deviations above the mean on the ACT-Math.

The student's SAT-Math score is more impressive than their ACT-Math score because the z-score is higher. They scored better than a larger proportion of other test takers on the SAT-Math.
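
The z-score calculations in the milk and test score examples can be reproduced with one small function. This is a sketch; the function name `z_score` and the variable names are illustrative.

```python
def z_score(x, mean, sd):
    """Distance between x and the mean, in standard deviation units."""
    return (x - mean) / sd


# Milk example: mean 12.5 kg, standard deviation 4.3 kg per milking.
for yield_kg in (18.1, 12.5, 8):
    print(f"{yield_kg} kg -> z = {z_score(yield_kg, 12.5, 4.3):.3f}")

# Comparing scores from tests that are on different scales.
z_sat = z_score(600, 500, 100)
z_act = z_score(22, 18, 6)
higher = "SAT-Math" if z_sat > z_act else "ACT-Math"
print(f"z_SAT = {z_sat:.3f}, z_ACT = {z_act:.3f}; the higher z-score is on the {higher}")
```

Converting both scores to the same standardized scale is what makes the comparison meaningful. The raw scores of 600 and 22 cannot be compared directly.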

2.2.9 - Percentiles

There are slightly different definitions of percentiles, and different statistical software and textbooks may use different formulas. In this course, we will be using the definition from your textbook:

Percentile
Proportion of a distribution less than a given value.

Example: Test Scores

Test scores are often reported in terms of percentiles. For example, if a student scores in the 90th percentile on a test then he or she scored better than 90% of students who took the test.
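
Under the definition above, a value's percentile in a sample can be found by counting how many observations fall below it. The sketch below reuses the ten test scores from the earlier central tendency example. The function name `percentile_of` is illustrative, and other software may use a different percentile formula.

```python
def percentile_of(value, data):
    """Percent of observations in data that are strictly less than value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


# Test scores from the earlier example in this lesson.
scores = [74, 88, 78, 90, 94, 90, 84, 90, 98, 80]

print(percentile_of(94, scores))  # 8 of the 10 scores are below 94
print(percentile_of(90, scores))  # 5 of the 10 scores are below 90
```

In this sample, a score of 94 is at the 80th percentile because 8 of the 10 scores are lower.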

2.2.10 - Five Number Summary

Five Number Summary
Minimum, Q1, Median, Q3, Maximum

Q1 is the first quartile; this is the 25th percentile.
Q3 is the third quartile; this is the 75th percentile.

Five number summaries are used to describe some of the key features of a distribution. Using the values in a five number summary we can also compute the range and interquartile range.

Range
The difference between the maximum and minimum values.
Range
$Range = Maximum - Minimum$
Note:
The range is heavily influenced by outliers. For this reason, the interquartile range is often preferred because it is resistant to outliers.
Interquartile range (IQR)
The difference between the first and third quartiles.
Interquartile Range
$IQR = Q_3 - Q_1$

Example: Hours Spent Studying

A professor asks a sample of students how many hours they spent studying for the final. The five number summary for their responses is (5, 7, 9, 11, 13).

Range

The maximum is 13 and the minimum is 5.

$Range = 13 - 5 = 8$

Interquartile Range

The third quartile is 11 and the first quartile is 7.

$IQR = Q_3 - Q_1 = 11 - 7 = 4$

Example: Test Scores

A teacher wants to examine students’ test scores. The five number summary for their scores is (74, 80, 89, 90, 98).

Range

The highest score is 98. The lowest score is 74.

$Range = 98 - 74 = 24$

Interquartile Range

The third quartile is 90 and the first quartile is 80.

$IQR = Q_3 - Q_1 = 90 - 80 = 10$
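
The sketch below builds a five number summary from raw data and then computes the range and IQR. Quartile conventions vary, so this code makes an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, excluding the middle value when the sample size is odd. That rule was chosen only because it reproduces the two summaries given above; Minitab Express and other software may report slightly different quartiles. The function names are illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the middle position
    upper = data[half + n % 2:]    # values above the middle position
    return data[0], median(lower), median(data), median(upper), data[-1]


def range_and_iqr(summary):
    """Return (range, IQR) from a five number summary tuple."""
    minimum, q1, _, q3, maximum = summary
    return maximum - minimum, q3 - q1


hours = [5, 7, 8, 9, 9, 11, 13]
scores = [74, 88, 78, 90, 94, 90, 84, 90, 98, 80]

for name, values in (("hours", hours), ("scores", scores)):
    summary = five_number_summary(values)
    print(name, summary, range_and_iqr(summary))
```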

2.3 - Lesson 2 Summary

Objectives

Upon successful completion of this lesson, you will be able to:

• Compute and interpret a basic proportion/risk/probability and odds
• Select and interpret the appropriate visual representations for one categorical variable, two categorical variables, and one quantitative variable
• Use Minitab Express to construct frequency tables, pie charts, bar charts, two-way tables, clustered bar charts, histograms, and dotplots
• Compute and interpret complements, intersections, unions, and conditional probabilities given a two-way table
• Identify outliers on a histogram or dotplot
• Interpret the shape of a distribution
• Compute and interpret the mean, median, mode, and standard deviation
• Compute and interpret percentiles and z scores
• Apply the Empirical Rule
• Interpret a five number summary

In this lesson, you learned how to display and summarize data concerning one categorical variable, two categorical variables, and one quantitative variable. Review the learning objectives above. You should be able to successfully complete each of these tasks before moving on. If you have any questions, post them on the Lesson 2 Discussion Board in Canvas.

In Lesson 3 we will build on this as we examine how to display and summarize data concerning the relationship between a categorical and a quantitative variable and two quantitative variables.
