Lesson 2: Describing Data, Part 1
Objectives
 Compute and interpret a basic proportion/risk/probability and odds
 Select and interpret the appropriate visual representations for one categorical variable, two categorical variables, and one quantitative variable
 Use Minitab Express to construct frequency tables, pie charts, bar charts, two-way tables, clustered bar charts, histograms, and dotplots
 Compute and interpret complements, intersections, unions, and conditional probabilities given a two-way table
 Identify outliers on a histogram or dotplot
 Interpret the shape of a distribution
 Compute and interpret the mean, median, mode, and standard deviation
 Compute and interpret percentiles and z scores
 Apply the Empirical Rule
 Interpret a five-number summary
This lesson corresponds to Sections 2.1-2.3 and P.1 in the Lock5 textbook.
Recall from Lesson 1 that variables can be classified as categorical or quantitative:
 Categorical
 Names or labels (i.e., categories) with no logical order or with a logical order but inconsistent differences between groups, also known as qualitative.
 Quantitative
 Numerical values with magnitudes that can be placed in a meaningful order with consistent intervals, also known as numerical.
The graphs, descriptive statistics, and inferential statistics that are appropriate depend on the nature of the variable(s) in a given scenario. Before beginning this lesson, you should be able to classify variables as categorical or quantitative. If you are having difficulties with this, go back to review Lesson 1 or speak with your instructor.
2.1  Categorical Variables
Categorical variables are discussed in Sections 2.1 and P.1 of the Lock5 textbook.
Variables can be classified as categorical or quantitative. In this section of the lesson, we will be focusing on categorical variables. Categorical variables are those that provide groupings that may have no logical order, or a logical order with inconsistent difference between groups (e.g., the difference between 1 and 2 is not equivalent to the difference between 3 and 4).
This course includes many examples and practice problems for you. Many of these will apply the concepts that we learn to experiments involving rolling a die or randomly selecting a card from a standard 52-card deck. If you are unfamiliar with either of these, take a moment here to review.
 Die

A standard die has 6 sides: 1, 2, 3, 4, 5, 6
 52-Card Deck

A standard 52-card deck of playing cards has 13 Hearts, 13 Diamonds, 13 Spades, and 13 Clubs. Hearts (♥) and Diamonds (♦) are red suits. Spades (♠) and Clubs (♣) are black suits. For each suit, there is a 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King, and Ace. Jacks, Queens, and Kings are "face cards."
2.1.1  One Categorical Variable
Data concerning one categorical variable can be summarized using a proportion.
 Proportion
 \(Proportion=\dfrac{Number\;in\;the\;category}{Total\;number}\)
The symbol for a sample proportion is \(\widehat{p}\) and is read as "p-hat." The symbol for a population proportion is \(p\).
The formula for a sample proportion may also be written as \(\widehat p = \frac{x}{n}\) where \(x\) is the number in the sample with the trait of interest and \(n\) is the sample size.
A proportion must be between 0 and 1.00.
Example: Black Cards
A standard 52-card deck contains \(26\) red cards and \(26\) black cards. What proportion of cards are black?
\(p=\dfrac{26}{52}=0.50\)
The symbol \(p\) was used because this is the proportion of all cards (i.e., the population) that are black.
Example: World Campus Undergraduate Students
In the Fall 2014 semester, there were \(82,382\) undergraduate students enrolled in Penn State. Of those, \(6,245\) were World Campus students. What proportion of all Penn State undergraduate students were World Campus students?
\(p=\dfrac{6245}{82382}=0.076\)
The symbol \(p\) was used because this is the proportion of all Penn State undergraduate students (i.e., the population) that are World Campus students.
Example: Broken Cookies
In a sample of \(30\) randomly selected packages of chocolate chip cookies, \(18\) contained broken cookies. What proportion of these selected packages had broken cookies?
\(\widehat{p}=\dfrac{18}{30}=0.60\)
These data were collected from a sample so the symbol \(\widehat{p}\) was used to denote a sample proportion.
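The three examples above can be reproduced with a few lines of Python. This is only a sketch: the function name `proportion` and the variable names are illustrative choices, not part of the lesson or of Minitab Express.

```python
def proportion(count_in_category, total):
    """Return the proportion of cases in a category (a value from 0 to 1)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= count_in_category <= total:
        raise ValueError("count must be between 0 and total")
    return count_in_category / total


# Population proportions (p): every card, every undergraduate student.
p_black_cards = proportion(26, 52)
p_world_campus = proportion(6245, 82382)

# Sample proportion (p-hat): 30 randomly selected packages.
p_hat_broken = proportion(18, 30)

print(round(p_black_cards, 3), round(p_world_campus, 3), round(p_hat_broken, 3))
```

The validation step mirrors the rule that a proportion must fall between 0 and 1: a count larger than the total is rejected rather than silently producing an impossible value.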
2.1.1.1  Risk and Odds
You may have heard the terms risk and odds before. They are both ways to communicate the likelihood of an event.
Risk and odds are often confused with one another. The formulas for computing risk and odds are different and their interpretations are different.
In statistics, the word risk communicates the likelihood of an event occurring. This is synonymous with probability or proportion (i.e., the formulas are the same).
 Risk
 The probability that an event will occur. It may be written as a decimal, a fraction, or a percent.
 Risk
 \(Risk= \dfrac{number \;with \;the\; outcome}{total\;number\;of\;outcomes}\)
Example: Asthma Risk
\(60\) out of \(1000\) teens have asthma.
\(risk=\dfrac{60}{1000}=0.06\)
This means that \(6\%\) of teens experience asthma.
Example: Flu Risk
\(45\) out of \(100\) children get the flu each year.
\(risk=\dfrac{45}{100}=0.45\) or \(45\%\)
Odds
 Odds
 Express risk by comparing the likelihood of an event happening to the likelihood it does not happen.
 Odds

\(odds = \dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}\)
OR
\(odds=\dfrac{risk}{1-risk}\)
We often interpret odds in relation to the value of 1. For example, if the odds of a game are in favor of the house 2 to 1, that means for every 2 games the house wins it will lose 1.
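The two odds formulas give the same value. A short derivation, writing \(x\) for the number with the outcome and \(n\) for the total number (symbols borrowed from the sample proportion formula, so that \(risk=\frac{x}{n}\)):

\(odds=\dfrac{x}{n-x}=\dfrac{x/n}{(n-x)/n}=\dfrac{x/n}{1-x/n}=\dfrac{risk}{1-risk}\)

Going the other way, \(risk=\dfrac{odds}{1+odds}\). For example, odds of 2 to 1 correspond to a risk of \(\dfrac{2}{1+2}\approx 0.667\).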
Example: Passing Odds
In one large class, 850 students passed an exam while 150 students failed. Because we have the raw counts, we can use the first odds formula.
\(odds=\dfrac {number \;with \;the\; outcome}{number \;without \;the \;outcome}=\dfrac{850}{150}=5.667\)
The odds of passing were 5.667 to 1. In other words, for every 5.667 students who passed the exam there was 1 who failed.
Example: Flu Odds
The risk of a child getting the flu is \(45\%\) which can also be written as \(0.45\). Because we have the risk, we can use the second odds formula.
\(odds=\dfrac{risk}{1-risk}=\dfrac{0.45}{1-0.45}=\dfrac{0.45}{0.55}=0.818\)
The odds of a child getting the flu are \(0.818\) to \(1\).
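As a rough check on the risk and odds examples, the formulas can be written as small Python functions. The function names below are hypothetical helpers chosen for this sketch.

```python
def risk(with_outcome, total):
    """Risk = number with the outcome / total number."""
    return with_outcome / total


def odds_from_counts(with_outcome, without_outcome):
    """Odds = number with the outcome / number without the outcome."""
    return with_outcome / without_outcome


def odds_from_risk(r):
    """Odds = risk / (1 - risk); undefined when the risk is exactly 1."""
    if not 0 <= r < 1:
        raise ValueError("risk must be at least 0 and less than 1")
    return r / (1 - r)


asthma_risk = risk(60, 1000)               # Asthma Risk example
passing_odds = odds_from_counts(850, 150)  # Passing Odds example
flu_odds = odds_from_risk(0.45)            # Flu Odds example

# Both odds formulas agree: 850 passing out of 1000 is a risk of 0.85.
passing_odds_via_risk = odds_from_risk(risk(850, 850 + 150))

print(round(asthma_risk, 2), round(passing_odds, 3), round(flu_odds, 3))
```

Note that risk and odds are close only when the event is rare; for the passing example the risk is 0.85 while the odds are about 5.667.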
2.1.1.2  Visual Representations
Frequency tables, pie charts, and bar charts are the most appropriate graphical displays for categorical variables. Below are a frequency table, a pie chart, and a bar chart for data concerning Penn State’s undergraduate enrollments by campus in Fall 2017.
Note that in the bar chart, the bars are separated by a space. The spaces between the bars signify that this is a categorical variable. On the following pages you will learn how to make these graphs using Minitab Express.
 Frequency Table

A table containing the counts of how often each category occurs.
Campus  Count  Percent

University Park  40835  48.5%
Commonwealth Campuses  29388  34.9%
PA College of Technology  5465  6.5%
World Campus  8513  10.1%
Total  84201  100.0%
Penn State Fall 2017 Undergraduate Enrollments
 Pie chart

Graphical representation for categorical data in which a circle is partitioned into “slices” on the basis of the proportions of each category.
 Bar chart

Graphical representation for categorical data in which vertical (or sometimes horizontal) bars are used to depict the number of experimental units in each category; bars are separated by space.
Penn State Fall 2017 Undergraduate Enrollments
Pie charts tend to work best when there are only a few categories. If a variable has many categories, a pie chart may be more difficult to read. In those cases, a frequency table or bar chart may be more appropriate.
When selecting a visual display for your data you should first determine how many variables you are going to display and whether they are categorical or quantitative. Then, you should think about what you are trying to communicate. Each visual display has its own strengths and weaknesses. When first starting out, you may need to make a few different types of displays to determine which best communicates your data.
2.1.1.2.1  Minitab Express: Frequency Tables
The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state. To get an idea of the pattern of variation of a categorical variable such as region, we can display the information with a frequency table, pie chart, or bar graph.
MinitabExpress – Frequency Table
To create a frequency table in Minitab Express:
 Open the data set:
 On a PC: In the menu bar select STATISTICS > Describe > Tally
On a Mac: In the menu bar select Statistics > Summary Statistics > Tally
 Double click the variable Region in the box on the left to insert the variable into the Variable box
 Under Statistics, check Counts and Percents
 Click OK
This should result in the following frequency table:
Region  Count  Percent 

ENC  5  9.8039% 
ESC  4  7.8431% 
MA  3  5.8824% 
MTN  8  15.6863% 
NE  6  11.7647% 
PAC  5  9.8039% 
SA  9  17.6471% 
WNC  7  13.7255% 
WSC  4  7.8431% 
N=  51 
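For readers without Minitab Express, the same tally can be sketched in Python with `collections.Counter`. The raw data set is not reproduced in this lesson text, so the list of region labels below is rebuilt from the counts in the table above; that reconstruction is an assumption made only for illustration.

```python
from collections import Counter

# Counts copied from the frequency table above. The raw file has one row per
# state, so we recreate one region label per state from these counts.
region_counts = {"ENC": 5, "ESC": 4, "MA": 3, "MTN": 8, "NE": 6,
                 "PAC": 5, "SA": 9, "WNC": 7, "WSC": 4}
regions = [name for name, k in region_counts.items() for _ in range(k)]

tally = Counter(regions)   # count how often each category occurs
n = sum(tally.values())    # total number of cases

for region in sorted(tally):
    count = tally[region]
    percent = 100 * count / n
    print(f"{region:<4}{count:>3}{percent:>10.4f}%")
print(f"N = {n}")
```

Each percent is simply the category's proportion (count divided by N) multiplied by 100.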
2.1.1.2.2  Minitab Express: Pie Charts
The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state.
MinitabExpress – Pie Chart (Raw Data)
To create a pie chart in Minitab Express:
 Open the data set:
 On a PC or Mac: Select Graphs > Pie Chart
 Select Counts of Unique Values
 Double click the variable Region in the box on the left to insert the variable into the Categorical variable box
 Click OK
This should result in the pie chart below:
Summarized Data
In the examples above raw data were used. In other words, the dataset contained one row for each case. It is also possible to use Minitab Express to construct a pie chart given summarized data, for example, if you had your counts in a frequency table. If this were the case, in step 3 you would select Summarized Data and enter the names of the categories in the Category names box and the frequency counts in the Summary values box.
2.1.1.2.3  Minitab Express: Bar Charts
The following data set (from College Board) contains the mean SAT scores for each of the 50 states and Washington, DC, as well as the participation rates and geographic region of each state.
MinitabExpress – Bar Chart (Raw Data)
To create a bar graph in Minitab Express:
 Open the data set
 On a PC or Mac: Select Graphs > Bar Chart
 In the Bars represent drop-down menu, keep the default option, Counts of unique values in a categorical variable
 Select Simple
 Double click the variable Region in the box on the left to insert the variable into the Categorical variable box
 Click OK
This should result in the bar graph below:
Summarized Data
In the examples above raw data were used. In other words, the Minitab Express file consisted of one row for each case. We can also use Minitab Express to construct a bar chart with summarized data, for example, if you had data in a frequency table. To do this, in the third step shown above, change the Bars represent drop-down menu to Summarized values for each category in a table. You will still select Simple. The Summary variable will be the numerical values and the Categorical variable will be the names of the categories.
2.1.2  Two Categorical Variables
Data concerning two categorical variables may be communicated using a two-way table, also known as a contingency table. Data concerning two categorical variables can be visualized using a segmented bar chart or a clustered bar chart. A clustered bar chart is also known as a side-by-side bar chart.
 Two-Way Table
 A display of counts for two categorical variables in which the rows represent one variable and the columns represent a second variable. Also known as a contingency table.
Example: World Campus Enrollments by Sex
We will use the two-way table below of Penn State World Campus enrollments by biological sex and level to walk through a few examples of how to read a two-way table. These data are from the Penn State Factbook for Fall 2015.
Female  Male  Total  

Undergraduate  3814  3428  7242 
Graduate  2213  2787  5000 
Total  6027  6215  12242 
Proportion Undergraduate
What proportion of the population of World Campus students is undergraduate?
\(p=\frac{7242}{12242}=0.592\)
Proportion of Females who are Undergraduates
What proportion of females in this population are undergraduate students?
\(p=\frac{3814}{6027}=0.633\)
Later in this lesson, you will learn that this is known as a conditional probability.
Proportion of Undergraduates who are Female
What proportion of undergraduate students in this population are female?
\(p=\frac{3814}{7242}=0.527\)
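The three questions above differ only in which total goes in the denominator. A minimal Python sketch, using a nested dictionary as an assumed stand-in for the two-way table (the variable names are illustrative):

```python
# Counts from the Fall 2015 World Campus two-way table above.
table = {
    "Undergraduate": {"Female": 3814, "Male": 3428},
    "Graduate": {"Female": 2213, "Male": 2787},
}

# Row totals (by level), column totals (by sex), and the grand total.
row_totals = {level: sum(cells.values()) for level, cells in table.items()}
col_totals = {}
for cells in table.values():
    for sex, count in cells.items():
        col_totals[sex] = col_totals.get(sex, 0) + count
grand_total = sum(row_totals.values())

# Denominator = everyone.
p_undergrad = row_totals["Undergraduate"] / grand_total
# Denominator = females only.
p_undergrad_among_females = table["Undergraduate"]["Female"] / col_totals["Female"]
# Denominator = undergraduates only.
p_female_among_undergrads = table["Undergraduate"]["Female"] / row_totals["Undergraduate"]

print(round(p_undergrad, 3),
      round(p_undergrad_among_females, 3),
      round(p_female_among_undergrads, 3))
```

The numerator 3814 is shared by the last two calculations; swapping the denominator between the column total and the row total is what changes the question being answered.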
 Segmented Bar Chart

Also known as a stacked bar chart. One categorical variable is represented on the x-axis while the second categorical variable is denoted by segments within the bars. Minitab Express will not construct a stacked bar chart, but other software will. The segmented bar chart below was constructed using Excel.
 Clustered Bar Chart

Each bar represents one combination of the two categorical variables (i.e., one cell in a contingency table). This is also known as a side-by-side bar chart.
2.1.2.1  Minitab Express: TwoWay Table
This dataset consists of STAT 200 students' responses to a survey. We can construct a two-way table showing the relationship between Smoke Cigarettes (row variable) and Biological Sex (column variable) using Minitab Express.
MinitabExpress – Two-Way Table
To create a two-way table in Minitab Express:
 Open the data set:
 On a PC: Select STATISTICS > Cross Tabulation and Chi-square
On a Mac: Select Statistics > Tables > Cross Tabulation and Chi-Square
 Select Raw data (categorical variable) from the drop down menu
 Double click the variable Smoke Cigarettes in the box on the left to insert the variable into the Rows box
 Double click the variable Biological Sex in the box on the left to insert the variable into the Columns box
 Click OK
This should result in the two-way table below:
Female  Male  All  

No  120  89  209 
Yes  7  10  17 
All  127  99  226 
Cell Contents: Count 
2.1.2.1.1  Video Example: Reading a Two-Way Table
The video example below uses the dataset:
2.1.2.2  Minitab Express: Clustered Bar Chart
We are going to use the Class Survey data set in this example again:
MinitabExpress – Clustered Bar Chart
To create a clustered bar chart in Minitab Express:
 Open the data set:
 On a PC or Mac: Select Graphs > Bar Chart
 In this example we have a datafile with the responses from each case so for Bars represent select Counts of unique values in a categorical variable
 Select Clustered
 Double click the variables Biological Sex and Smoke Cigarettes in the box on the left to insert the variables into the Categorical variables box
 Click OK
This should result in the clustered bar chart below:
Note: The order in which the variables are entered into the Categorical variables box in Minitab Express determines how the bars will be clustered. For example, if we entered Smoke Cigarettes and then Biological Sex, the result would be the following clustered bar chart:
2.1.3  Probability Rules
The probability rules covered in this lesson can be found in Section P.1 of the Lock5 textbook.
Earlier in this lesson you were introduced to proportions. We used the notation: \(Proportion=\frac{Number\;in\;the\;category}{Total\;number}\).
When we discuss probabilities, we will use the notation below where \(P(A)\) is the probability of event \(A\) occurring. Probabilities are typically written in decimal form but may also be translated to percentages.
Note that the formula below is the same one that you learned earlier in Lesson 2.1.1 for a proportion.
 Probability of Event A
 \(P(A)=\dfrac{Number\;in\;group\;A}{Total\;number}\)
Example: Spades
What is the probability that a randomly selected card from a standard 52-card deck will be a spade? There are 13 spades in the deck of 52.
\(P(spade)=\dfrac{13}{52}=0.25\)
The probability of pulling a spade is 0.25. We could also say that there is a 25% chance of pulling a spade.
Example: Odd Numbers
If you roll a six-sided die, what is the probability of getting an odd number? There are three odd numbers on the die (1, 3, 5).
\(P(odd)=\dfrac{3}{6}=0.50\)
The probability of rolling an odd number is 0.50. We could also say that there is a 50% chance of rolling an odd number.
Example: Raffle
There are a total of 500 raffle tickets and you have purchased 10. What is the probability that one of your tickets will be randomly selected to win the raffle?
\(P(winning)=\dfrac{10}{500}=0.02\)
The probability of you winning is 0.02. We could also say that there is a 2% chance that you will win.
2.1.3.1  Range of Probabilities
The probability of an impossible event is 0 and the probability of a certain event is 1. The range of possible probabilities is: \(0 \leq P(A) \leq 1\). It is not possible to have a probability less than 0 or greater than 1.
Example: Rolling an 8
It is impossible to roll an eight on a six-sided die.
\(P(rolling\; 8)= \dfrac{0}{6} = 0\)
Example: Blue Cards
In a standard 52-card deck all cards are black or red. There are no blue cards.
\(P(blue)=\dfrac{0}{52}=0\)
Example: Rolling a Value Between 1 and 6
A six-sided die contains the values 1, 2, 3, 4, 5, and 6. All rolls will result in a value between 1 and 6.
\(P(rolling \;1 \;to\; 6)=\dfrac{6}{6}=1.00\)
2.1.3.2  Combinations of Events
In situations with two or more categorical variables there are a number of different ways that combinations of events can be described: intersections, unions, complements, and conditional probabilities. Each of these combinations of events is covered in your textbook. However, note that your textbook does not use the symbols that are most commonly used when discussing these combinations of events. The symbols that we will be using are in the table below. In this section, you will also learn about disjoint events and independent events.
Combination  Symbol  Definition 

Intersection  \(P(A\cap B)\)  Probability of A and B 
Union  \(P(A\cup B)\)  Probability of A or B (note: this includes the possibility of both A and B) 
Complement  \(P(A^C)\)  The probability of NOT A 
Conditional  \(P(A\mid B)\)  The probability of A given B 
2.1.3.2.1  Disjoint & Independent Events
Note that disjoint events and independent events are different. Events are considered disjoint if they never occur at the same time; these are also known as mutually exclusive events. Events are considered independent if they are unrelated.
 Disjoint Events

Two events that do not occur at the same time. These are also known as mutually exclusive events.
In the Venn diagram below event A and event B are disjoint events because the two do not overlap.
 Venn diagram
 A visual representation in which the sample space is depicted as a box and events are represented as circles within the sample space.
 Independent Events
 Unrelated events. The outcome of one event does not impact the outcome of the other event.
Example: Freshmen & Sophomores
Let's consider undergraduate class status. A student can be classified as a freshman, sophomore, junior, or senior.
Being a freshman and being a sophomore are disjoint events because an individual cannot be classified as both at the same time.
Being a freshman is not independent of being a sophomore. If I know that an individual is a freshman then the probability that they are a sophomore is 0; knowing that the student was a freshman provided information that influenced my prediction of them being a sophomore.
Example: Class Status & Gender
Assume that there is no relationship between gender and class status. This means that within each class (freshmen, sophomores, juniors, seniors) the proportion of students who are men is consistent. It also means that within each gender the proportion of students who are freshmen, sophomores, juniors, and seniors is consistent.
In this case, we could say that the events of being a man and being a senior are independent events. Knowing that a student is a man does not influence the likelihood of him being a senior. Knowing that a student is a senior does not change the likelihood of them being a man.
There are some men who are seniors, so these events are not disjoint.
2.1.3.2.2  Intersections
 Intersection

The overlap of two or more events is symbolized by the character \(\cap\).
\(P(A \cap B)\) is read as "the probability of A and B."
Example: Red King
What is the probability of randomly selecting a card from a standard 52-card deck that is a red card and a king?
There are 2 kings that are red cards: the king of hearts and the king of diamonds.
\(P(red \cap king)=\dfrac{2}{52}=.0385\)
Example: Female Undergraduate Students
The twoway table below displays the World Campus enrollment from Fall 2015 in terms of level (undergraduate and graduate) and biological sex. What proportion of World Campus students were female and undergraduate students?
Female  Male  Total  

Undergraduate  3814  3428  7242 
Graduate  2213  2787  5000 
Total  6027  6215  12242 
There are 3814 students who are females and undergraduates out of a total of 12242 students.
\(P(F \cap U)=\dfrac{3814}{12242}=0.312\)
2.1.3.2.3  Unions
 Union

A union contains the area in A or B and is symbolized by \(\cup\). Note that this also includes the overlap of A and B (i.e., the intersection).
\(P(A \cup B)\) is read as "the probability of A or B."
 Union
 \(P(A\cup B) = P(A)+P(B)-P(A\cap B)\)
Example: Hearts or Spades
What is the probability of randomly selecting a card from a standard 52-card deck that is a heart or spade?
There are 13 cards that are hearts, 13 cards that are spades, and no cards that are both a heart and a spade.
\(P(heart \cup spade)=\dfrac{13}{52}+\dfrac{13}{52}-\dfrac{0}{52}= \dfrac {26}{52}=0.5\)
Example: Hearts or Aces
What is the probability of randomly selecting a card from a standard 52-card deck that is a heart or an ace?
There are 13 cards that are hearts and 4 cards that are aces. There is one ace of hearts, so one of those 4 aces has already been counted.
\(P(heart \cup ace)=\dfrac{13}{52}+\dfrac{4}{52}-\dfrac{1}{52}=\dfrac{16}{52}=0.308\)
Example: Female or Undergraduate
The twoway table below displays the World Campus enrollment from Fall 2015 in terms of level (undergraduate and graduate) and biological sex. What proportion of World Campus students were female or undergraduate students?
Female  Male  Total  

Undergraduate  3814  3428  7242 
Graduate  2213  2787  5000 
Total  6027  6215  12242 
When we have a contingency table we can take the appropriate values from the table as opposed to using the formula given above. There are 3814 female undergraduate students, 3428 male undergraduate students, 2213 female graduate students, and a total of 12242 students.
\(P(F \cup U)=\dfrac{3814+3428+2213}{12242}=\dfrac{9455}{12242}=0.772\)
Note that the final answer would be the same if we had used the formula:
\(P(F \cup U) = \dfrac{6027}{12242}+\dfrac{7242}{12242}-\dfrac{3814}{12242}= \dfrac{9455}{12242}=0.772\)
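The union rule can also be checked by brute force. The sketch below builds a 52-card deck as a list of (rank, suit) pairs and compares a direct count of "heart or ace" with the addition formula. The deck representation and the helper names (`prob`, `is_heart`, `is_ace`) are assumptions made for this illustration.

```python
from fractions import Fraction

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
DECK = [(rank, suit) for suit in SUITS for rank in RANKS]


def prob(event):
    """P(event) for one card drawn at random; `event` maps a card to True/False."""
    favorable = sum(1 for card in DECK if event(card))
    return Fraction(favorable, len(DECK))


def is_heart(card):
    return card[1] == "Hearts"


def is_ace(card):
    return card[0] == "Ace"


# Count the union directly: cards that are a heart, an ace, or both.
p_union_direct = prob(lambda card: is_heart(card) or is_ace(card))

# Rebuild it from the formula P(A or B) = P(A) + P(B) - P(A and B).
p_union_formula = (prob(is_heart) + prob(is_ace)
                   - prob(lambda card: is_heart(card) and is_ace(card)))

print(p_union_direct, round(float(p_union_direct), 3))
```

Using `Fraction` keeps the arithmetic exact, so the two results can be compared with `==`; the fraction 16/52 is reduced automatically to 4/13.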
2.1.3.2.4  Complements
 Complement

The probability that the event does not occur. The complement of \(P(A)\) is \(P(A^C)\). This may also be written as \(P(A')\).
In the diagram below we can see that \(A^{C}\) is everything in the sample space that is not A.
 Complement of A
 \(P(A^{C})=1-P(A)\)
Example: Coin Flip
When flipping a coin, one can flip heads or tails. Thus, \(P(Tails^{C})=P(Heads)\) and \(P(Heads^{C})=P(Tails)\)
Example: Hearts
If you randomly select a card from a standard 52-card deck, you could pull a heart, diamond, spade, or club. The complement of pulling a heart is pulling a diamond, spade, or club. In other words: \(P(Heart^{C})=P(Diamond\;or\;Spade\;or\;Club)\)
The probability of the complement of any event is equal to one minus the probability of that event. In other words: \(P(A^{C})=1-P(A)\)
It is also true then that: \(P(A)=1-P(A^{C})\)
Example: Rain
According to the weather report, there is a 30% chance of rain today: \(P(Rain) = .30\)
Raining and not raining are complements.
\(P(Not \:rain)=P(Rain^{C})=1-P(Rain)=1-.30=.70\)
There is a 70% chance that it will not rain today.
Example: Winning
The probability that your team will win their next game is calculated to be .45, in other words:
\(P(Winning)=.45\)
Winning and losing are complements of one another. Therefore the probability that they will lose is:
\(P(Losing)=P(Winning^{C})=1-.45=.55\)
The probabilities of all possible outcomes, when those outcomes are disjoint and cover every possibility, sum to 1.
Example: Cards
In a standard 52-card deck there are 26 black cards and 26 red cards. All cards are either black or red.
\(P(red)+P(black)=\frac{26}{52}+\frac{26}{52}=1\)
Example: Dominant Hand
Of individuals with two hands, it is possible to be right-handed, left-handed, or ambidextrous. Assuming that these are the only three possibilities and that there is no overlap between any of these possibilities:
\(P(right\;handed)+P(left\;handed)+P(ambidextrous) = 1\)
2.1.3.2.5  Conditional Probability
 Conditional Probability

The probability of one event occurring given that it is known that a second event has occurred. This is communicated using the symbol \(\mid\) which is read as "given."
For example, \(P(A\mid B)\) is read as "Probability of A given B."
Example: PA Resident given Undergraduate
The two-way table below displays the World Campus enrollment from Fall 2019 in terms of level (undergraduate and graduate) and residency (Pennsylvania and non-Pennsylvania). Given that an individual is an undergraduate student, what is the probability that the student is a Pennsylvania resident?
Pennsylvania  Non-Pennsylvania  Total  

Undergraduate  3757  4603  8360 
Graduate  2253  4074  6327 
Total  6010  8677  14687 
We know that the individual is an undergraduate student so we will only look at the 8360 undergraduate students. Of those 8360 undergraduate students, 3757 were Pennsylvania residents.
\(P(PA \mid Undergrad) = \dfrac{3757}{8360}=0.449\)
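Reading a conditional probability from a table amounts to restricting attention to the "given" group first and then counting within it. A small Python sketch of that idea follows; the function `conditional_probability` and the way cells are keyed by (level, residency) are illustrative assumptions.

```python
# Cell counts from the Fall 2019 World Campus table above,
# keyed by (level, residency).
counts = {
    ("Undergraduate", "Pennsylvania"): 3757,
    ("Undergraduate", "Non-Pennsylvania"): 4603,
    ("Graduate", "Pennsylvania"): 2253,
    ("Graduate", "Non-Pennsylvania"): 4074,
}


def conditional_probability(counts, event, given):
    """Return P(event | given) from cell counts.

    `event` and `given` take a (level, residency) key and return True
    when that cell belongs to the event.
    """
    given_total = sum(n for cell, n in counts.items() if given(cell))
    if given_total == 0:
        raise ValueError("the conditioning event has no cases")
    both = sum(n for cell, n in counts.items() if given(cell) and event(cell))
    return both / given_total


def is_pa(cell):
    return cell[1] == "Pennsylvania"


def is_undergrad(cell):
    return cell[0] == "Undergraduate"


p_pa_given_undergrad = conditional_probability(counts, is_pa, is_undergrad)
p_undergrad_given_pa = conditional_probability(counts, is_undergrad, is_pa)

print(round(p_pa_given_undergrad, 3), round(p_undergrad_given_pa, 3))
```

The second value is included to show that the order matters: the probability of being a Pennsylvania resident given undergraduate status uses the 8360 undergraduates as the denominator, while the reverse uses the 6010 Pennsylvania residents.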
2.1.3.2.5.1  Advanced Conditional Probability Applications
Advanced Formulas
Conditional probabilities can also be computed using the following formulas. Note that these two formulas are identical, but A and B are switched. Again, if the contingency table is available it is usually most efficient to take the appropriate values from the table, as shown above, as opposed to using these formulas.
 Conditional Probability of A Given B
 \(P(A\mid B)=\dfrac{P(A \: \cap\: B)}{P(B)}\)
 Conditional Probability of B Given A
 \(P(B\mid A)=\dfrac{P(A \: \cap\: B)}{P(A)}\)
Example: Clubs
In a standard 52-card deck, there are 26 black cards including 13 clubs. All clubs are black, therefore there are 13 black clubs.
What is the probability that a randomly selected card is a club given that it is a black card?
We are given that \(P(club)=\frac{13}{52}=0.25\), \(P(black)=\frac{26}{52}=0.50\), and \(P(club \: \cap\: black)=\frac{13}{52}=0.25\)
\(P(club\mid black)=\dfrac{P(club \: \cap\: black)}{P(black)}=\dfrac{0.25}{0.50}=0.50\)
Given that a randomly selected card is black, there is a 50% chance that it's a club.
Independent Events Written as Conditional Probabilities
If events A and B are independent then \(P(A) = P(A \mid B)\). In other words, whether or not event B occurs does not change the probability of event A occurring.
Example: Checking for Independence, Aces and Hearts
A card is randomly drawn from a 52-card deck. Are the events of drawing an ace and drawing a heart independent?
In a standard 52-card deck, there are 4 aces and 13 hearts. Therefore \(P(ace)=\frac{4}{52}\) and \(P(heart)=\frac{13}{52}\). Out of 13 hearts, 1 is an ace, which translates to \(P(ace \mid heart) = \frac{1}{13}\).
To determine if these two events are independent we can compare \(P(A)\) to \(P(A\mid B)\). If we call being an ace event A and being a heart event B, then we're comparing \(P(ace)\) to \(P(ace \mid heart)\).
\(P(ace)=\frac{4}{52}=0.0769\)
\(P(ace \mid heart) = \frac{1}{13}=0.0769\)
These values are identical, therefore we can conclude that the events of drawing an ace and drawing a heart are independent.
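The comparison of \(P(A)\) with \(P(A \mid B)\) can be sketched in Python by counting cards. The deck representation and helper names are assumptions for this illustration, and the heart-versus-red comparison at the end is an extra contrast case that does not appear in the lesson.

```python
from fractions import Fraction

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
SUITS = ["Hearts", "Diamonds", "Spades", "Clubs"]
DECK = [(rank, suit) for suit in SUITS for rank in RANKS]


def prob(event, given=lambda card: True):
    """P(event | given) by counting cards; plain P(event) when `given` is omitted."""
    pool = [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))


def is_ace(card):
    return card[0] == "Ace"


def is_heart(card):
    return card[1] == "Hearts"


def is_red(card):
    return card[1] in ("Hearts", "Diamonds")


# Ace vs. heart: P(ace) equals P(ace | heart), so the events are independent.
p_ace = prob(is_ace)
p_ace_given_heart = prob(is_ace, given=is_heart)
ace_heart_independent = p_ace == p_ace_given_heart

# Contrast (not from the lesson): knowing a card is red changes P(heart).
p_heart = prob(is_heart)
p_heart_given_red = prob(is_heart, given=is_red)
heart_red_independent = p_heart == p_heart_given_red

print(p_ace, p_ace_given_heart, ace_heart_independent)
print(p_heart, p_heart_given_red, heart_red_independent)
```

Exact fractions are used so that the equality test is not affected by rounding, which matters when a check like this is applied to decimals such as 0.0769.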
2.2  One Quantitative Variable
One quantitative variable is covered in Sections 2.2 and 2.3 of the Lock5 textbook. In these sections, you will learn how to describe the distribution of a quantitative variable in terms of shape, central tendency, and variability. You will be introduced to the normal distribution, z scores, percentiles, graphs, and the five-number summary.
2.2.1  Graphs: Dotplots and Histograms
Dotplots and histograms are both graphical displays that can be used with one quantitative variable. In both of these plots the horizontal axis represents the values of the variable. The number of dots in a dotplot, or the height of the bars in a histogram, represents the number of cases with each value or range of values.
 Dotplot
 Histogram
MinitabExpress – Dotplot
To create a dotplot in Minitab Express:
 Open the data set:
 On a PC or Mac: Select GRAPHS > Dotplot
 Select Simple
 Double click the variable Verbal SAT (2005) in the box on the left to insert the variable into the Y variable box
 Click OK
This should result in the following dotplot:
MinitabExpress – Histogram
To create a histogram in Minitab Express:
 Open the data set:
 On a PC or Mac: Select GRAPHS > Histogram
 Select Simple
 Double click the variable Verbal SAT(2005) in the box on the left to insert the variable into the Y variable box
 Click OK
This should result in the following histogram:
2.2.2  Outliers
Some observations within a set of data may fall outside the general scope of the other observations. Such observations are called outliers. Outliers can be identified by looking at a dotplot or histogram. In Lesson 3 you'll learn about boxplots which can also be used to identify outliers. When constructing a boxplot, Minitab Express identifies outliers using mathematical methods that you will see next week. This week we will identify outliers by making a relatively subjective judgement from a given list of data points, a dotplot, or a histogram.
Example: Dotplot of Hours Watching TV
A sample of STAT 200 students was surveyed and asked how many hours per week they watch television. A dotplot was constructed using these data.
The rightmost dot is definitely an outlier because it is much higher than any other points. The other higher points, around 55, 50, and 46, may be outliers. Next week we will learn about some mathematical methods for identifying outliers that can help us make decisions in cases like this where it is not obvious which values are outliers.
Example: Histogram of Best Marriage Age
This sample of students was also asked what they believed was the best age to get married. A histogram was constructed using these data.
There appear to be three outliers in this sample, all on the higher end.
2.2.3  Shape
Quantitative variables are often discussed in terms of their shape. Both dotplots and histograms can be used to interpret a distribution's shape. A distribution may be described in terms of symmetry and skewness.
 Symmetrical Distribution

A distribution that is similar on both sides of the center.
 Normal Distribution

One specific type of symmetrical distribution. This is also known as a bell-shaped distribution.
 Skewed
 A distribution in which values are more spread out on one side of the center than on the other.
 Right Skewed

A distribution in which the higher values (towards the right on a number line) are more spread out than the lower values. This is also known as positively skewed.
 Left Skewed

A distribution in which the lower values (towards the left on a number line) are more spread out than the higher values. This is also known as negatively skewed.
2.2.4  Measures of Central Tendency
Quantitative variables are often summarized using numbers to communicate their central tendency. The mean, median, and mode are three of the most commonly used measures of central tendency.
 Mean

The numerical average; calculated as the sum of all of the data values divided by the number of values.
The sample mean is represented as \(\overline{x}\) ("x-bar") and the population mean is denoted as the Greek letter \(\mu\) ("mu"). The formula is the same for the sample mean and the population mean.
 Population Mean
 \(\mu=\dfrac{\Sigma x}{N}\)
 Sample Mean
 \(\overline {x} = \dfrac{\Sigma x}{n}\)
 Median
 The middle of the distribution that has been ordered from smallest to largest; for distributions with an even number of values, this is the mean of the two middle values.
 Mode
 The most frequently occurring value(s) in the distribution; may be used with quantitative or categorical variables.
Example: Hours Spent Studying
A professor asks a sample of 7 students how many hours they spent studying for the final. Their responses are: 5, 7, 8, 9, 9, 11, and 13.
Mean
\(\overline{x} = \dfrac{\sum x}{n} =\dfrac{5+7+8+9+9+11+13}{7} =\dfrac{62}{7} =8.857\)
The mean is 8.857 hours.
Median
The observations are already in order from smallest to largest. The middle observation is 9 hours. The median is 9 hours.
Mode
The most frequently occurring observation was 9 hours. The mode is 9 hours.
In this example, the mean, median, and mode are all similar. Recall from our discussion of shape, the mean, median, and mode are all equal when a distribution is symmetric. This distribution of hours spent studying is probably close to symmetrical.
Example: Test Scores
A teacher wants to examine students’ test scores. Their scores are: 74, 88, 78, 90, 94, 90, 84, 90, 98, and 80.
Mean
\(\overline{x}\: =\: \dfrac{\sum x}{n} = \dfrac{74+88+78+90+94+90+84+90+98+80}{10} = \dfrac{866}{10}=86.6\)
The mean score was 86.6.
Median
First, we need to put the scores in order from lowest to highest: 74, 78, 80, 84, 88, 90, 90, 90, 94, 98
Because there is an even number of scores, the median will be the mean of the middle two values. The middle two values are 88 and 90. \(\frac{88+90}{2}=89\)
The median is 89.
Mode
The most frequently occurring score was 90. There were 3 students who scored a 90; this is the mode. Because this distribution has one mode, it is unimodal.
In this example the mean is slightly lower than the median which is slightly lower than the mode. Recall from our discussion of shape that this occurs when a distribution is skewed to the left. This distribution is probably slightly skewed to the left.
Example: Household Size
A group of children are asked how many people live in their household. The following data are collected: 4, 3, 6, 2, 2, 4, 3.
Mean
\(\overline{x} = \dfrac{\sum x}{n}=\dfrac{4+3+6+2+2+4+3}{7}=\dfrac{24}{7}=3.429\)
The mean household size in this group of children is 3.429 people.
Median
First, we need to put all of the values in order from smallest to largest: 2, 2, 3, 3, 4, 4, 6
The value in the middle of this distribution is 3. The median is 3.
Mode
In this distribution, the most common values are 2, 3, and 4. Each of these values occurs twice. There are 3 modes: 2, 3, and 4. This distribution is multimodal.
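The three worked examples above can also be checked in code. The following is a minimal sketch using Python's standard `statistics` module; the helper name `central_tendency` and the dictionary of examples are illustrative choices, not part of the course materials, and the course itself uses Minitab Express for these calculations.

```python
from statistics import mean, median, multimode

# Data copied from the three examples above.
examples = {
    "hours studying": [5, 7, 8, 9, 9, 11, 13],
    "test scores": [74, 88, 78, 90, 94, 90, 84, 90, 98, 80],
    "household size": [4, 3, 6, 2, 2, 4, 3],
}

def central_tendency(values):
    """Return (mean, median, sorted list of modes) for a list of numbers."""
    # multimode returns every value tied for the highest frequency,
    # so it handles unimodal and multimodal data alike.
    return mean(values), median(values), sorted(multimode(values))

for name, values in examples.items():
    m, med, modes = central_tendency(values)
    print(f"{name}: mean = {m:.3f}, median = {med}, mode(s) = {modes}")
```

Using `multimode` rather than `mode` is a deliberate choice here: it reports all three modes for the household size data instead of only one.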
2.2.4.1  Skewness & Central Tendency
The preferred measure of central tendency often depends on the shape of the distribution. In a symmetrical distribution with a single peak, the mean, median, and mode are all equal. In these cases, the mean is often the preferred measure of central tendency.
For distributions that are strongly skewed or have outliers, the median is often the most appropriate measure of central tendency because in skewed distributions the mean is pulled out toward the tail. The median is more resistant to outliers compared to the mean. Of these three measures of central tendency, the mean is most influenced by outliers. Below you will see how the direction of skewness impacts the order of the mean, median, and mode.
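To see how an outlier pulls the mean but not the median, here is a small sketch in Python. The first list is the Hours Spent Studying data from this lesson; the second list is a hypothetical variation, made up for illustration, in which the largest value is replaced by 40.

```python
from statistics import mean, median

hours = [5, 7, 8, 9, 9, 11, 13]               # data from the lesson
hours_with_outlier = [5, 7, 8, 9, 9, 11, 40]  # hypothetical: 13 replaced by 40

def summarize(values):
    """Return (mean rounded to 3 decimals, median)."""
    return round(mean(values), 3), median(values)

print(summarize(hours))               # mean and median are close together
print(summarize(hours_with_outlier))  # mean moves toward the tail; median does not
```

In this sketch the median stays at 9 in both cases, while the mean rises from about 8.857 to about 12.714.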
2.2.5  Measures of Spread
Variance and standard deviation are measures of variability. The standard deviation is the most commonly used measure of variability when data are quantitative and approximately normally distributed. When computing the standard deviation by hand, it is necessary to first compute the variance. The standard deviation is equal to the square root of the variance. Here, you will learn how to compute these values by hand. After this lesson, you will always be computing standard deviation using software such as Minitab Express.
 Standard Deviation
 Roughly the average difference between individual data values and the mean. The standard deviation of a sample is denoted as \(s\). The standard deviation of a population is denoted as \(\sigma\).
 Sample Standard Deviation
 \(s=\sqrt{\dfrac{\sum (x-\overline{x})^{2}}{n-1}}\)
To compute the standard deviation for a sample, we first compute the deviations. The sum of the squared deviations (SS) divided by \(n-1\) is the variance (\(s^2\)).
The square root of the variance is the standard deviation: \(\sqrt{s^2}=s\).
 Deviation
 An individual score minus the mean.
 Sum of Squared Deviations
 Deviations squared and added together. This is also known as the sum of squares or SS.
 Variance
 Approximately the average of all of the squared deviations; for a sample represented as \(s^{2}\).
 Sum of Squares
 \(SS={\sum (x-\overline{x})^{2}}\)
 Sample Variance
 \(s^{2}=\dfrac{\sum (x-\overline{x})^{2}}{n-1}\)
There are a number of methods for calculating the standard deviation. If you look through different textbooks or search online, you may find different formulas and procedures. To compute the standard deviation for a sample, we will use the formulas above and the following steps:
Step 1: Compute the sample mean: \(\overline{x} = \frac{\sum x}{n}\).
Step 2: Subtract the sample mean from each individual value: \(x-\overline{x}\), these are the deviations.
Step 3: Square each deviation: \((x-\overline{x})^{2}\), these are the squared deviations.
Step 4: Add the squared deviations: \(\sum (x-\overline{x})^{2}\), this is the sum of squares.
Step 5: Divide the sum of squares by \(n-1\): \(\frac{\sum (x-\overline{x})^{2}}{n-1}\), this is the sample variance \((s^{2})\).
Step 6: Take the square root of the sample variance: \(\sqrt{\frac{\sum (x-\overline{x})^{2}}{n-1}}\), this is the sample standard deviation.
Example: Hours Spent Studying
A professor asks a sample of 7 students how many hours they spent studying for the final. Their responses are: 5, 7, 8, 9, 9, 11, and 13.
Step 1: Compute the mean
\(\overline{x} = \dfrac{\sum x}{n}=\dfrac{5+7+8+9+9+11+13}{7}=8.857\)
Step 2: Compute the deviations
\(x\)  \(x - \overline{x}\)

5  \(5 - 8.857 = -3.857\)
7  \(7 - 8.857 = -1.857\)
8  \(8 - 8.857 = -0.857\)
9  \(9 - 8.857 = 0.143\)
9  \(9 - 8.857 = 0.143\)
11  \(11 - 8.857 = 2.143\)
13  \(13 - 8.857 = 4.143\)
Step 3: Square the deviations
\(x\)  \(x - \overline{x}\)  \((x-\overline{x})^{2}\)

5  \(5 - 8.857 = -3.857\)  \((-3.857)^{2} = 14.876\)
7  \(7 - 8.857 = -1.857\)  \((-1.857)^{2} = 3.448\)
8  \(8 - 8.857 = -0.857\)  \((-0.857)^{2} = 0.734\)
9  \(9 - 8.857 = 0.143\)  \(0.143^{2} = 0.020\)
9  \(9 - 8.857 = 0.143\)  \(0.143^{2} = 0.020\)
11  \(11 - 8.857 = 2.143\)  \(2.143^{2} = 4.592\)
13  \(13 - 8.857 = 4.143\)  \(4.143^{2} = 17.164\)
Step 4: Sum the squared deviations
\(SS=\sum (x-\overline{x})^{2}=14.876+3.448+0.734+0.020+0.020+4.592+17.164=40.854\)
The sum of squares is 40.854
Step 5: Divide by n - 1 to compute the variance
\(s^{2}=\dfrac{\sum (x-\overline{x})^{2}}{n-1}=\dfrac{40.854}{7-1}=6.809\)
The variance is 6.809
Step 6: Take the square root of the variance
\(s=\sqrt{s^{2}}=\sqrt{6.809}=2.609\)
The standard deviation is 2.609
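The six steps can be sketched in Python as a check on the hand calculation. The function name `sample_sd_steps` and the dictionary keys are illustrative assumptions, not from the course. Because the code keeps full precision instead of rounding each deviation to three decimals, its results may differ from the hand-computed values in the last decimal place (for example, a sum of squares of about 40.857 rather than 40.854).

```python
import math
from statistics import stdev

def sample_sd_steps(values):
    """Follow the six by-hand steps and return each intermediate result."""
    n = len(values)
    x_bar = sum(values) / n                      # Step 1: sample mean
    deviations = [x - x_bar for x in values]     # Step 2: deviations
    squared = [d ** 2 for d in deviations]       # Step 3: squared deviations
    ss = sum(squared)                            # Step 4: sum of squares
    variance = ss / (n - 1)                      # Step 5: sample variance
    sd = math.sqrt(variance)                     # Step 6: sample standard deviation
    return {"mean": x_bar, "ss": ss, "variance": variance, "sd": sd}

hours = [5, 7, 8, 9, 9, 11, 13]
result = sample_sd_steps(hours)
for name, value in result.items():
    print(f"{name}: {value:.3f}")

# Cross-check against the standard library's sample standard deviation.
print(f"statistics.stdev: {stdev(hours):.3f}")
```

The final line compares the step-by-step result with `statistics.stdev`, which also divides by \(n-1\).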
2.2.6  Minitab Express: Central Tendency & Variability
Minitab Express may be used to compute descriptive statistics such as the mean, median, mode, standard deviation, and variance.
Note that these are the default settings in Minitab Express:
If you want the mode or variance, you will need to select them under the Statistics tab.
MinitabExpress – Central Tendency
To obtain measures of central tendency and variability in Minitab Express:
 Open the data set:
 On a PC: from the menu select STATISTICS > Describe
 On a Mac: from the menu select Statistics > Summary Statistics > Descriptive Statistics
 Double click the variable Height in the box on the left to insert the variable into the Variable box
 Click on the Statistics tab and select the descriptive statistics that you want displayed
 Click OK
This should result in the following output:
Variable  N  N*  Mean  SE Mean  StDev  Minimum  Q1  Median  Maximum 

Height  525  0  67.0090  0.1947  4.4616  51.0000  64.0000  67.0000  82.0000 
2.2.7  The Empirical Rule
A normal distribution is symmetrical and bell-shaped.
The Empirical Rule is a statement about normal distributions. Your textbook uses an abbreviated form of this, known as the 95% Rule, because 95% is the most commonly used interval. The 95% Rule states that approximately 95% of observations fall within two standard deviations of the mean on a normal distribution.
 Normal Distribution
 A specific type of symmetrical distribution, also known as a bell-shaped distribution
 Empirical Rule

On a normal distribution about 68% of data will be within one standard deviation of the mean, about 95% will be within two standard deviations of the mean, and about 99.7% will be within three standard deviations of the mean
 95% Rule
 On a normal distribution approximately 95% of data will fall within two standard deviations of the mean; this is an abbreviated form of the Empirical Rule
Example: Pulse Rates
Suppose the distribution of pulse rates of 200 college men is bell-shaped with a mean of 72 and standard deviation of 6.
 About 68% of the men have pulse rates in the interval \(72\pm1(6)=[66, 78]\).
 About 95% of the men have pulse rates in the interval \(72\pm2(6)=[60, 84]\).
 About 99.7% of the men have pulse rates in the interval \(72\pm 3(6)=[54, 90]\).
Example: IQ Scores
IQ scores are normally distributed with a mean of 100 and a standard deviation of 15.
 About 68% of individuals have IQ scores in the interval \(100\pm 1(15)=[85,115]\).
 About 95% of individuals have IQ scores in the interval \(100\pm 2(15)=[70,130]\).
 About 99.7% of individuals have IQ scores in the interval \(100\pm 3(15)=[55,145]\).
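The intervals in both examples follow the same pattern: mean plus or minus 1, 2, or 3 standard deviations. A minimal Python sketch of that arithmetic is below; the function name `empirical_rule_intervals` is an illustrative assumption, and the percentages only apply when the distribution is approximately normal.

```python
def empirical_rule_intervals(mean, sd):
    """Return the intervals within 1, 2, and 3 standard deviations of the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Approximate share of a normal distribution inside each interval.
coverage = {1: "68%", 2: "95%", 3: "99.7%"}

for label, (mean, sd) in {"Pulse rates": (72, 6), "IQ scores": (100, 15)}.items():
    print(label)
    for k, (low, high) in empirical_rule_intervals(mean, sd).items():
        print(f"  about {coverage[k]} within [{low}, {high}]")
```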
2.2.8  z-scores
Often we want to describe an observation in relation to the distribution of all observations. We can do this using a z-score. By converting observations to z-scores, we can compare observations from different distributions.
 z-score

Distance between an individual score and the mean in standard deviation units; also known as a standardized score.
 z-score
 \(z=\dfrac{x - \overline{x}}{s}\)

\(x\) = original data value
\(\overline{x}\) = mean of the original distribution
\(s\) = standard deviation of the original distribution
This equation could also be rewritten in terms of population values: \(z=\frac{x-\mu}{\sigma}\)
Later in the course, we will learn more about the z-distribution, which is a special case of the normal distribution.
 z-distribution

A bell-shaped distribution with a mean of 0 and standard deviation of 1, also known as the standard normal distribution.
Example: Milk
A study of 66,831 dairy cows found that the mean milk yield was 12.5 kg per milking with a standard deviation of 4.3 kg per milking (data from Berry, et al., 2013).
A cow produces 18.1 kg per milking. What is this cow’s z-score?
\(z=\frac{x-\overline{x}}{s} =\frac{18.1-12.5}{4.3}=1.302\)
This cow’s z-score is 1.302; her milk production was 1.302 standard deviations above the mean.
A cow produces 12.5 kg per milking. What is this cow’s z-score?
\(z=\frac{x-\overline{x}}{s} =\frac{12.5-12.5}{4.3}=0\)
This cow’s z-score is 0; her milk production was the same as the mean.
A cow produces 8 kg per milking. What is this cow’s z-score?
\(z=\frac{x-\overline{x}}{s} =\frac{8-12.5}{4.3}=-1.047\)
This cow’s z-score is -1.047; her milk production was 1.047 standard deviations below the mean.
Berry, D. P., Coyne, J., Boughlan, B., Burke, M., McCarthy, J., Enright, B., Cromie, A. R., McParland, S. (2013). Genetics of milking characteristics in dairy cows. Animal, 7(11), 1750-1758.
Example: Comparing Test Scores
SAT-Math scores are normally distributed with a mean of 500 and standard deviation of 100. ACT-Math scores are normally distributed with a mean of 18 and standard deviation of 6. A student has taken both tests. They scored 600 on the SAT-Math and 22 on the ACT-Math. Which score is more impressive?
We can't directly compare the student's SAT and ACT scores because they are on different scales. We can convert these test scores into z-scores so we can directly compare them.
\(z_{SAT}=\frac{600-500}{100}=1\)
This student scored 1 standard deviation above the mean on the SAT-Math.
\(z_{ACT}=\frac{22-18}{6}=0.667\)
This student scored 0.667 standard deviations above the mean on the ACT-Math.
The student's SAT-Math score is more impressive than their ACT-Math score because the z-score is higher. They scored better than a larger proportion of other test takers on the SAT-Math.
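The z-score formula is short enough to sketch directly in Python. The function name `z_score` is an illustrative assumption; the inputs are the values from the Milk and Comparing Test Scores examples.

```python
def z_score(x, mean, sd):
    """Distance between x and the mean, in standard deviation units."""
    return (x - mean) / sd

# Milk example: mean 12.5 kg, standard deviation 4.3 kg.
for yield_kg in (18.1, 12.5, 8):
    print(f"{yield_kg} kg -> z = {z_score(yield_kg, 12.5, 4.3):.3f}")

# Comparing test scores on different scales.
z_sat = z_score(600, 500, 100)
z_act = z_score(22, 18, 6)
print(f"SAT-Math z = {z_sat:.3f}, ACT-Math z = {z_act:.3f}")
```

A negative result indicates a value below the mean, and the larger of `z_sat` and `z_act` identifies the score that is further above its own mean.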
2.2.9  Percentiles
There are slightly different definitions of percentiles, and different statistical software and textbooks may use different formulas. In this course, we will be using the definition from your textbook:
 Percentile
 Proportion of a distribution less than a given value.
Example: Test Scores
Test scores are often reported in terms of percentiles. For example, if a student scores in the 90th percentile on a test then he or she scored better than 90% of students who took the test.
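As a rough sketch of this definition, the Python function below reports the percentage of values in a data set that fall strictly below a given value. The function name `percentile_of` is an illustrative assumption, and the data are the ten test scores from the central tendency example earlier in this lesson. Other software may use a different formula, so results will not always match.

```python
def percentile_of(values, x):
    """Percentage of the values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

scores = [74, 88, 78, 90, 94, 90, 84, 90, 98, 80]

# Eight of the ten scores are below 94.
print(percentile_of(scores, 94))
```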
2.2.10  Five Number Summary
 Five Number Summary
 Minimum, Q_{1}, Median, Q_{3}, Maximum
Q_{1} is the first quartile; this is the 25th percentile.
Q_{3} is the third quartile; this is the 75th percentile.
Five number summaries are used to describe some of the key features of a distribution. Using the values in a five number summary we can also compute the range and interquartile range.
 Range
 The difference between the maximum and minimum values.
 Range
 \(Range = Maximum - Minimum\)
 Note:
 The range is heavily influenced by outliers. For this reason, the interquartile range is often preferred because it is resistant to outliers.
 Interquartile range (IQR)
 The difference between the first and third quartiles.
 Interquartile Range
 \(IQR = Q_3 - Q_1\)
Example: Hours Spent Studying
A professor asks a sample of students how many hours they spent studying for the final. The five number summary for their responses is (5, 7, 9, 11, 13).
Range
The maximum is 13 and the minimum is 5.
\(Range = 13 - 5 = 8\)
Interquartile Range
The third quartile is 11 and the first quartile is 7.
\(IQR = Q_3 - Q_1 = 11 - 7 = 4\)
Example: Test Scores
A teacher wants to examine students’ test scores. The five number summary for their scores is (74, 80, 89, 90, 98).
Range
The highest score is 98. The lowest score is 74.
\(Range = 98 - 74 = 24\)
Interquartile Range
The third quartile is 90 and the first quartile is 80.
\(IQR = Q_3 - Q_1 = 90 - 80 = 10\)
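Both calculations use only the values in the five number summary, which the short Python sketch below makes explicit. The function name `range_and_iqr` is an illustrative assumption. The quartiles are taken as given from the examples rather than computed from raw data, since software packages can use different quartile formulas.

```python
def range_and_iqr(summary):
    """Return (range, IQR) from a (min, Q1, median, Q3, max) tuple."""
    minimum, q1, _median, q3, maximum = summary
    return maximum - minimum, q3 - q1

print(range_and_iqr((5, 7, 9, 11, 13)))     # Hours Spent Studying
print(range_and_iqr((74, 80, 89, 90, 98)))  # Test Scores
```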
2.3  Lesson 2 Summary
Objectives
 Compute and interpret a basic proportion/risk/probability and odds
 Select and interpret the appropriate visual representations for one categorical variable, two categorical variables, and one quantitative variable
 Use Minitab Express to construct frequency tables, pie charts, bar charts, two-way tables, clustered bar charts, histograms, and dotplots
 Compute and interpret complements, intersections, unions, and conditional probabilities given a two-way table
 Identify outliers on a histogram or dotplot
 Interpret the shape of a distribution
 Compute and interpret the mean, median, mode, and standard deviation
 Compute and interpret percentiles and z scores
 Apply the Empirical Rule
 Interpret a five number summary
In this lesson, you learned how to display and summarize data concerning one categorical variable, two categorical variables, and one quantitative variable. Review the learning objectives above. You should be able to successfully complete each of these tasks before moving on. If you have any questions, post them on the Lesson 2 Discussion Board in Canvas.
In Lesson 3 we will build on this as we examine how to display and summarize data concerning the relationship between a categorical and a quantitative variable and two quantitative variables.