# Lesson 1: The Big Picture

## Overview

In this lesson, our primary aim is to get a big picture of the entire course that lies ahead of us. Along the way, we'll also learn some basic concepts to help us begin to build our probability toolbox.

## Objectives

- Learn the distinction between a population and a sample.
- Learn how to define an outcome (sample) space.
- Learn how to identify the different types of data: discrete, continuous, categorical, binary.
- Learn how to summarize quantitative data graphically using a histogram.
- Learn how statistical packages construct histograms for discrete data.
- Learn how statistical packages construct histograms for continuous data.
- Learn the distinction between frequency histograms, relative frequency histograms, and density histograms.
- Learn how to "read" information from the three types of histograms.
- Learn the big picture of the course, that is, put the material of sections 1 through 5 here on-line, and chapters 1 through 5 in the text, into a framework for the course.

# 1.1 - Some Research Questions

Research studies are conducted in order to answer some kind of research question(s). For example, the researchers in the Vegan Health Study define at least eight primary questions that they would like answered about the health of people who eat an entirely animal-free diet (no meat, no dairy, no eggs). Another research study was recently conducted to determine whether people who take the pain medications Vioxx or Celebrex are at a higher risk for heart attacks than people who don't take them. The list goes on. Researchers are working every day to answer their research questions.

What do you think about these research questions?

- What percentage of college students feel sleep-deprived?
- What is the probability that a randomly selected PSU student gets more than seven hours of sleep each night?
- Do women typically cry more than men?
- What is the typical number of credit cards owned by Stat 414 students?

Assuming that the above questions don't float your boat, can you formulate a few research questions that do interest you?

If we were to attempt to answer our research questions, we would soon learn that we couldn't ask every person in the population if they feel sleep-deprived, how often they cry, or the number of credit cards they have.

## Try It!

How can we answer our research question if we can't ask every person in the population our research question?

We could take a **random sample** from the **population**, and use the resulting sample to learn something about the population.

# 1.2 - Populations and Random Samples

In trying to answer each of our research questions, whether yours or mine, we unfortunately can't ask every person in the population. Instead, we take a random sample from the population, and use the resulting sample to learn something, or **make an inference**, about the population.

For the research question "what percentage of college students feel sleep-deprived?", the population of interest is all college students. Therefore, assuming we are restricting the population to be U.S. college students, a random sample might consist of 1300 randomly selected students from all of the possible colleges in the United States.

For the research question "what is the probability that a randomly selected Penn State student gets more than 7 hours of sleep each night?", the population of interest is a little narrower, namely only Penn State students. In this case, a random sample might consist of, say, 300 randomly selected Penn State students.

For the research question "what is the typical number of credit cards owned by Stat 414 students?", the population of interest is even narrower, namely only the students enrolled in Stat 414. Ahhhh! If we are only interested in students currently enrolled in Stat 414, we have no need for taking a random sample. Instead, we can conduct a census, in which all of the students are polled.
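The difference between drawing a random sample and conducting a census can be sketched in a few lines of Python. This is only an illustration: the population of 40,000 student ID numbers, the class roster of 60, and the seed are made-up assumptions, not figures from the course.

```python
import random

# Hypothetical population: ID numbers standing in for 40,000 enrolled students.
population = list(range(1, 40001))

rng = random.Random(414)  # fixed seed (arbitrary) so the sketch is reproducible

# Simple random sample without replacement: 300 distinct students,
# with every student equally likely to be chosen.
sample = rng.sample(population, k=300)

# A census polls everyone, which is practical only for a small population
# such as a single class (a roster of 60 is an assumption).
stat414_roster = list(range(1, 61))
census = list(stat414_roster)

print(len(sample), "sampled out of", len(population))
print(len(census), "polled out of", len(stat414_roster))
```

With the sample, anything we conclude about the population is an inference; with the census, we simply describe the whole population directly.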

## Try It!

Now, for each of the research questions you previously defined, identify the population of interest and describe a potential random sample.

The answers (or data) we get to our research questions of course depend on who ends up in our random sample. We can't possibly predict the possible outcomes with certainty, but we can at least create a list of possible outcomes.

# 1.3 - Sample Spaces

- Sample Space
- The **sample space** (or **outcome space**), denoted \(\mathbf{S}\), is the collection of all possible outcomes of a random study.

In order to answer my first research question, we would need to take a random sample of U.S. college students, and ask each one "Do you feel sleep-deprived?" Each student should reply either "yes" or "no." Therefore, we would write the sample space as:

\(\mathbf{S} = \{\text{yes}, \text{no}\}\)

In order to answer my second research question, we would need to know how many hours of sleep a random sample of college students gets each night. One way of getting this information is to ask each selected student to record the number of hours of sleep they had last night. In this case, if we let *h* denote the number of hours slept, we would write the sample space as:

\(\mathbf{S} = \{h: h \ge 0 \text{ hours}\}\)

Hmmm, if we conducted a random study to answer my third research question, how would we define our sample space? Well, of course, it depends on how we went about trying to answer the question. If we asked a random sample of men and women "on how many days did you cry last month?", we would write the sample space as:

\(\mathbf{S} = \{0, 1, 2, ..., 31\}\)

Finally, if we were interested in learning about students who took Stat 414 in the past decade when trying to answer my fourth research question, we might ask all current Stat 414 students "how many credit cards do you have?" In that case, we would write our sample space as:

\(\mathbf{S} = \{0, 1, 2, ...\}\)
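One way to make the four sample spaces above concrete is to write them down in code. The sketch below is an assumption about representation only: finite sample spaces are listed as Python sets, while the infinite ones are described by a membership rule. The function and variable names are hypothetical.

```python
# Finite sample spaces can be listed outright.
S_sleep_deprived = {"yes", "no"}        # first research question
S_days_cried = set(range(0, 32))        # third research question: 0, 1, ..., 31

# Infinite sample spaces cannot be listed, so a membership rule stands in.
def in_S_hours(h):
    """Second research question: any real number of hours h >= 0."""
    return isinstance(h, (int, float)) and h >= 0

def in_S_credit_cards(c):
    """Fourth research question: any nonnegative integer 0, 1, 2, ..."""
    return isinstance(c, int) and c >= 0

print("maybe" in S_sleep_deprived)   # is "maybe" a possible outcome?
print(in_S_hours(7.5))               # is 7.5 hours a possible outcome?
print(in_S_credit_cards(2.5))        # is 2.5 credit cards a possible outcome?
```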

There is not always just one way of obtaining an answer to a research question. For my second research question, how would we define the sample space if we instead asked a random sample of college students "did you get more than seven hours of sleep last night?"

For each of the research questions you created:

- Formulate the question you would ask (or describe the measurement technique you would use).
- Define the resulting sample space.

Once we collect sample data, we need to do something with it. Summarizing it would be a good start! How we summarize data depends on the type of data we collect.

# 1.4 - Types of data

## Example 1-1

Your instructor asked a random sample of 20 college students "do you consider yourself to be sleep-deprived?" Their replies were:

yes | yes | yes | no | no | no | yes | yes | yes | yes |

yes | no | no | yes | yes | no | yes | yes | yes | yes |

Of course, it would be good to summarize the students' responses. What we do with the data though depends on the type of data collected. For our purposes, we will primarily be concerned with three types of data:

- discrete
- continuous
- categorical

Now, for their definitions!

- Discrete Data
- Quantitative data are called **discrete** if the sample space contains a finite or countably infinite number of values.

Recall that a set is countably infinite if its elements can be put into one-to-one correspondence with the positive integers. My third research question yields discrete data, because its sample space:

\(\mathbf{S} = \{0, 1, 2, ..., 31\}\)

contains a finite number of values. And my fourth research question yields discrete data, because its sample space:

\(\mathbf{S} = \{0, 1, 2, ...\}\)

contains a countably infinite number of values.

- Continuous Data
- Quantitative data are called **continuous** if the sample space contains an interval or continuous span of real numbers.

My second research question yields continuous data, because its sample space:

\(\mathbf{S} = \{h: h \ge 0 \text{ hours}\}\)

is the entire nonnegative real line. For continuous data, there is theoretically an infinite number of possible outcomes; the measurement tool is the restricting factor. For example, if I were to ask how much each student in the class weighed (in pounds), I would most likely get responses such as 126, 172, and 210. The responses are seemingly discrete. But, are they? If I report that I weigh 118 pounds, am I exactly 118 pounds? Probably not; I'm perhaps 118.0120980335927... pounds. It's just that the scale that I get on in the morning tells me that I weigh 118 pounds. Again, the measurement tool is the restricting factor, something you always have to think about when trying to distinguish between discrete and continuous data.

- Categorical Data
- Qualitative data are called **categorical** if the sample space contains objects that are grouped or categorized based on some qualitative trait. When there are only two such groups or categories, the data are considered **binary**.

My first research question yields binary data because its sample space is:

\(\mathbf{S} = \{\text{yes}, \text{ no}\}\)

Two other examples of categorical data are eye color (brown, blue, hazel, and so on) and semester standing (freshman, sophomore, junior and senior).
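Categorical data are usually summarized by counting how many responses fall in each category. As a minimal sketch, here is one way to tally the 20 replies from Example 1-1 in Python; the variable names are made up for illustration.

```python
from collections import Counter

# The 20 replies from Example 1-1, in the order listed.
replies = ("yes yes yes no no no yes yes yes yes "
           "yes no no yes yes no yes yes yes yes").split()

n = len(replies)
counts = Counter(replies)                       # frequency of each category
proportions = {answer: count / n for answer, count in counts.items()}

for answer in ("yes", "no"):
    print(answer, counts[answer], proportions[answer])
```

The proportions describe only this sample of 20 students; a different random sample would most likely give somewhat different numbers.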

# 1.5 - Summarizing Quantitative Data Graphically

## Example 1-2

As discussed previously, how we summarize a set of data depends on the type of data. Let's take a look at an example. A sample of 40 female statistics students were asked how many times they cried in the previous month. Their replies were as follows:

9 | 5 | 3 | 2 | 6 | 3 | 2 | 2 | 3 | 4 | 2 | 8 | 4 | 4 |

5 | 0 | 3 | 0 | 2 | 4 | 2 | 1 | 1 | 2 | 2 | 1 | 3 | 0 |

2 | 1 | 3 | 0 | 0 | 2 | 2 | 3 | 4 | 1 | 1 | 5 |

That is, one student reported having cried nine times in the month, while five students reported not having cried at all. It's pretty hard to draw too many conclusions about the frequency of crying for female statistics students without summarizing the data in some way.

Of course, a common way of summarizing such discrete data is by way of a **histogram**.

Here's what a **frequency histogram** of these data looks like:

As you can see, a histogram gives a nice picture of the "**distribution**" of the data. And, in many ways, it's pretty self-explanatory. What are the notable features of the data? Well, the picture tells us:

- The most common number of times that the women cried in the month was two (called the "**mode**").
- The numbers ranged from 0 to 9 (that is, the "**range**" of the data is 9).
- A majority of women (22 out of 40) cried two or fewer times, but a few cried as much as six or more times.

Can you think of anything else that the frequency histogram tells us? If we took another sample of 40 female students, would a frequency histogram of the new data look the same as the one above? No, of course not — that's what variability is all about.

Can you create a series of steps that a person would have to take in order to make a frequency histogram such as the one above? Does the following set of steps seem reasonable?

## To create a frequency histogram of (finite) discrete data

- Determine the number, \(n\), in the sample.
- Determine the frequency, \(f_i\), of each outcome \(i\).
- Center a rectangle with base of length 1 at each observed outcome \(i\) and make the height of the rectangle equal to the frequency.
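The three steps above can be sketched in Python using the crying data from Example 1-2. This is an illustration only, not how Minitab or any other package does it internally, and the bars are drawn as rows of text rather than as rectangles.

```python
from collections import Counter

# The 40 replies from Example 1-2.
cried = [9, 5, 3, 2, 6, 3, 2, 2, 3, 4, 2, 8, 4, 4,
         5, 0, 3, 0, 2, 4, 2, 1, 1, 2, 2, 1, 3, 0,
         2, 1, 3, 0, 0, 2, 2, 3, 4, 1, 1, 5]

n = len(cried)            # step 1: the number in the sample
freq = Counter(cried)     # step 2: the frequency f_i of each outcome i

# Step 3, drawn as text: one row per outcome, bar length equal to the frequency.
# A Counter returns 0 for outcomes that were never observed (here, 7).
for i in range(min(cried), max(cried) + 1):
    print(f"{i:2d} | {'#' * freq[i]} ({freq[i]})")
```

Turning the printed rows on their side gives the usual picture, with outcomes on the horizontal axis and frequencies on the vertical axis.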

For our crying (out loud) data, we would first tally the frequency of each outcome:

and then we'd use the first column for the horizontal-axis and the third column for the vertical-axis to draw our frequency histogram:

Well, of course, in practice, we'll not need to create histograms by hand. Instead, we'll just let statistical software (such as Minitab) create histograms for us.

Okay, so let's use the above frequency histogram to answer a few more questions:

- What percentage of the surveyed women reported not crying at all in the month?
- What percentage of the surveyed women reported crying two times in the month? and three times?

Clearly, the frequency histogram is not 100% user-friendly. To answer these types of questions, it would be better to use a **relative frequency histogram**:

Now, the answers to the questions are a little more obvious — about 12% reported not crying at all; about 28% reported crying two times; and about 18% reported crying three times.

## To create a relative frequency histogram of (finite) discrete data

- Determine the number, \(n\), in the sample.
- Determine the frequency, \(f_i\), of each outcome \(i\).
- Calculate the relative frequency (proportion) of each outcome \(i\) by dividing the frequency of outcome \(i\) by the total number in the sample \(n\) — that is, calculate \(\frac{f_i}{n}\) for each outcome \(i\).
- Center a rectangle with base of length 1 at each observed outcome \(i\) and make the height of the rectangle equal to the relative frequency.
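Applied to the crying data from Example 1-2, the steps above amount to dividing each frequency by \(n = 40\). The following is a minimal sketch in Python, with variable names chosen only for illustration:

```python
from collections import Counter

# The 40 replies from Example 1-2.
cried = [9, 5, 3, 2, 6, 3, 2, 2, 3, 4, 2, 8, 4, 4,
         5, 0, 3, 0, 2, 4, 2, 1, 1, 2, 2, 1, 3, 0,
         2, 1, 3, 0, 0, 2, 2, 3, 4, 1, 1, 5]

n = len(cried)                                      # the number in the sample
freq = Counter(cried)                               # frequency f_i of each outcome i
rel_freq = {i: freq[i] / n for i in sorted(freq)}   # relative frequency f_i / n

for i, h in rel_freq.items():
    print(f"{i}: {h:.3f}")
```

Because every observation falls in exactly one outcome, the relative frequencies add up to 1.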

While using a relative frequency histogram to summarize discrete data is a worthwhile pursuit in and of itself, my primary motive here in addressing such histograms is to motivate the material of the course. In our example, if we

- let X = the number of times (days) a randomly selected student cried in the last month, and
- let x = 0, 1, 2, ..., 31 be the possible values

Then \(h_0=\frac{f_0}{n}\) is the relative frequency (or proportion) of students, in a sample of size \(n\), crying \(x_0\) times. You can imagine that for really small samples \(\frac{f_0}{n}\) is quite unstable (think \(n = 5\), for example). However, as the sample size \(n\) increases, \(\frac{f_0}{n}\) tends to stabilize and approach some limiting probability \(p_0=f(x_0)\) (think \(n = 1000\), for example). You can think of the relative frequency histogram as serving as a sample estimate of the true probabilities of the population.

It is this \(f(x_0)\), called a (discrete) probability mass function, that will be the focus of our attention in Section 2 of this course.
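A small simulation can make the stabilizing behavior of \(\frac{f_0}{n}\) easier to see. The sketch below assumes a made-up set of population probabilities (they are not estimates from the crying data) and a fixed, arbitrary seed; the function name is hypothetical.

```python
import random

# Hypothetical population probabilities, made up purely for illustration.
outcomes = [0, 1, 2, 3, 4]
probs = [0.15, 0.20, 0.30, 0.20, 0.15]   # so p_0 = f(2) = 0.30 when x_0 = 2

rng = random.Random(1)   # fixed seed so the sketch is reproducible

def relative_frequency(x0, n):
    """Draw n outcomes at random and return the proportion equal to x0."""
    draws = rng.choices(outcomes, weights=probs, k=n)
    return draws.count(x0) / n

# The relative frequency of x_0 = 2 for increasingly large samples.
for n in (5, 50, 1000, 100000):
    print(n, relative_frequency(2, n))
```

For small \(n\) the printed proportions can land far from 0.30, while for large \(n\) they should settle close to it.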

## Example 1-3

Let's take a look at another example. The following numbers are the measured nose lengths (in millimeters) of 60 students:

38 | 50 | 38 | 40 | 35 | 52 | 45 | 50 | 40 | 32 | 40 | 47 | 70 | 55 | 51 |

43 | 40 | 45 | 45 | 55 | 37 | 50 | 45 | 45 | 55 | 50 | 45 | 35 | 52 | 32 |

45 | 50 | 40 | 40 | 50 | 41 | 41 | 40 | 40 | 46 | 45 | 40 | 43 | 45 | 42 |

45 | 45 | 48 | 45 | 45 | 35 | 45 | 45 | 40 | 45 | 40 | 40 | 45 | 35 | 52 |

How would we create a histogram for these data? The numbers look discrete, but they are technically continuous. The measuring tools, which consisted of a piece of string and a ruler, were the limiting factors in getting more refined measurements. Do you also notice that, in most cases, nose lengths come in five-millimeter increments... 35, 40, 45, 55...? Of course not, silly me... that's, again, just measurement error. In any case, if we attempted to use the guidelines for creating a histogram for discrete data, we'd soon find that the large number of disparate outcomes would prevent us from creating a meaningful summary of the data. Let's instead follow these guidelines:

## To create a histogram of continuous data (or discrete data with many possible outcomes)

The major difference is that you first have to group the data into a set of classes, typically of equal length. There are many, many sets of rules for defining the classes. For our purposes, we'll just rely on our common sense — having too few classes is as bad as having too many.

- Determine the number, \(n\), in the sample.
- Define \(k\) class intervals \((c_0, c_1], (c_1, c_2], ..., (c_{k-1}, c_k]\).
- Determine the frequency, \(f_i\), of each class \(i\).
- Calculate the relative frequency (proportion) of each class by dividing the class frequency by the total number in the sample — that is, \(\frac{f_i}{n}\).
- For a **frequency histogram**: draw a rectangle for each class with the class interval as the base and the height equal to the frequency of the class.
- For a **relative frequency histogram**: draw a rectangle for each class with the class interval as the base and the height equal to the relative frequency of the class.
- For a **density histogram**: draw a rectangle for each class with the class interval as the base and the height equal to \(h(x)=\dfrac{f_i}{n(c_i-c_{i-1})}\) for \(c_{i-1}<x \leq c_i\), \(i = 1, 2,..., k\).

Here's what the work would look like for our nose length example if we used 5 mm classes centered at 30, 35, ..., 70:

For example, the relative frequency for the first class (27.5 to 32.5) is 2/60 or 0.033, whereas the height of the rectangle for the first class in a density histogram is (2/60)/5, or about 0.0067. Here is what the density histogram would look like in its entirety:

Note that a density histogram is just a modified relative frequency histogram. That is, a density histogram is defined so that:

- the area of each rectangle equals the relative frequency of the corresponding class, and
- the area of the entire histogram equals 1.
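Both properties follow from the definition of the height: multiplying \(\dfrac{f_i}{n(c_i-c_{i-1})}\) by the class width \(c_i-c_{i-1}\) leaves \(\dfrac{f_i}{n}\), and the relative frequencies add up to 1. The sketch below checks this numerically for the nose length data of Example 1-3, assuming the 5 mm classes centered at 30, 35, ..., 70 (that is, class boundaries 27.5, 32.5, ..., 72.5). The variable names are chosen only for illustration.

```python
# The 60 nose lengths (in millimeters) from Example 1-3.
nose = [38, 50, 38, 40, 35, 52, 45, 50, 40, 32, 40, 47, 70, 55, 51,
        43, 40, 45, 45, 55, 37, 50, 45, 45, 55, 50, 45, 35, 52, 32,
        45, 50, 40, 40, 50, 41, 41, 40, 40, 46, 45, 40, 43, 45, 42,
        45, 45, 48, 45, 45, 35, 45, 45, 40, 45, 40, 40, 45, 35, 52]

n = len(nose)
width = 5
edges = [27.5 + width * j for j in range(10)]   # c_0 = 27.5, ..., c_9 = 72.5

# Frequency of class i: observations x with c_{i-1} < x <= c_i.
freq = [sum(1 for x in nose if lo < x <= hi)
        for lo, hi in zip(edges[:-1], edges[1:])]

rel_freq = [f / n for f in freq]              # heights, relative frequency histogram
density = [f / (n * width) for f in freq]     # heights, density histogram

# Area of each rectangle is height times width; the areas should sum to 1.
total_area = sum(d * width for d in density)

for (lo, hi), f, r, d in zip(zip(edges[:-1], edges[1:]), freq, rel_freq, density):
    print(f"({lo}, {hi}]  f={f:2d}  rel={r:.3f}  density={d:.4f}")
print("total area:", total_area)
```

The same check works for classes of unequal width, as long as each height is divided by its own class width.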

Again, while using a density histogram to summarize continuous data is a worthwhile pursuit in and of itself, my primary motive here in addressing such histograms is to motivate the material of the course. As the sample size \(n\) increases, we can imagine our density histogram approaching some limiting continuous function \(f(x)\), say. It is this continuous curve \(f(x)\) that we will come to know in Section 3 as a (continuous) probability density function.

So, in Section 2, we'll learn about discrete probability mass functions (**p.m.f.**s). In Section 3, we'll learn about continuous probability density functions (**p.d.f.**s). In Section 4, we'll learn about p.m.f.s and p.d.f.s for two (random) variables (instead of one). In Section 5, we'll learn how to find the probability distribution for functions of two or more (random) variables. Wow! That's a lot of work. Before we can take it on, however, we will first spend some time in this Section 1 filling up our probability toolbox with some basic probability rules and tools.