Lesson 1: Collecting Data
Objectives
 Identify cases and variables in a research study
 Classify variables as categorical or quantitative
 Identify explanatory and response variables in a research study
 Distinguish between a sample and a population
 Determine whether a given sample is representative of the intended population
 Identify simple random sampling and convenience sampling methods
 Use Minitab Express to draw a simple random sample from a known population
 Identify potential nonresponse and response bias
 Distinguish between experimental and observational designs
 Identify confounding variables
 Identify randomized experiments
 Determine when causal conclusions (as opposed to associations) can be made
 Classify samples as being independent or paired
 Identify control groups, placebos, and blinding in research studies and explain why each is used
In this lesson, you will learn about how data are collected. You will be introduced to the terminology that will be used throughout the course. At the end of this lesson, there are flash cards that you can use to review these terms. You may also want to make your own flash cards by hand to review these terms throughout the semester.
1.1  Cases & Variables
When conducting a research study, information is collected concerning cases. Cases are also sometimes known as units or experimental units.
A variable is a characteristic that is measured and can take on different values. In other words, it is something that can vary. This is in contrast to a constant, which is the same for all cases in a study.
 Case
 An experimental unit from which data are collected
 Variable
 Characteristic of cases that can take on different values (in other words, something that can vary)
Example: Student Data
Data are collected from a sample of STAT 200 students. Each student's major, quiz score, and lab assignment score is recorded.
The cases are the STAT 200 students. There are three variables: major, quiz score, and lab assignment score.
Example: Study Time & Grades
A third grade teacher wants to know if students who spend more time studying at home get higher homework and exam grades.
The third grade students are the cases. There are three variables: the amount of time spent studying at home, the homework grades, and the exam grades.
Example: Dog Food
A researcher wants to know if dogs who are fed only canned food have different body mass indexes (BMI) than dogs who are fed only hard food. They collect BMI data from 50 dogs who eat only canned food and 50 dogs who eat only hard food.
The cases are the dogs. There are two variables: type of food and BMI.
Example: Chocolate Chip Cookie
Research question: What is the average weight of a chocolate chip cookie?
The cases are the cookies. The variable is weight.
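One way to picture cases and variables is as the rows and columns of a data table. The sketch below is a hypothetical illustration in Python (the course itself uses Minitab Express); the student records and their values are invented for demonstration and are not course data.

```python
# Hypothetical records for three STAT 200 students (values invented).
# Each dictionary is one case; each key is one variable.
students = [
    {"major": "Biology", "quiz_score": 8, "lab_score": 18},
    {"major": "Nursing", "quiz_score": 9, "lab_score": 20},
    {"major": "History", "quiz_score": 7, "lab_score": 17},
]

n_cases = len(students)        # number of cases (rows)
variables = list(students[0])  # names of the variables (columns)

print(n_cases)    # 3
print(variables)  # ['major', 'quiz_score', 'lab_score']
```

Adding a student adds a case; recording something new about every student adds a variable.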
1.1.1  Categorical & Quantitative Variables
Variables can be classified as categorical or quantitative. Categorical variables are those that provide groupings that may have no logical order, or a logical order with inconsistent differences between groups (e.g., the difference between 1st place and 2nd place in a race is not equivalent to the difference between 3rd place and 4th place). Quantitative variables have numerical values with consistent intervals.
 Categorical variable
 Names or labels (i.e., categories) with no logical order or with a logical order but inconsistent differences between groups (e.g., rankings), also known as qualitative.
 Quantitative variable
 Numerical values with magnitudes that can be placed in a meaningful order with consistent intervals, also known as numerical.
Example: Weight
A team of medical researchers weighs participants in kilograms. Weight in kilograms is a quantitative variable because it takes on numerical values with meaningful magnitudes and equal intervals.
Example: Favorite Ice Cream Flavor
A teacher conducts a poll in her class. She asks her students if they would prefer chocolate, vanilla, or strawberry ice cream at their class party. Preferred ice cream flavor is a categorical variable because the different flavors are categories with no meaningful order of magnitudes.
Example: Birth Location
A survey asks “On which continent were you born?” This is a categorical variable because the different continents represent categories without a meaningful order of magnitudes.
Example: Children per Household
A census asks every household in a city how many children under the age of 18 reside there. Number of children in a household is a quantitative variable because it has a numerical value with a meaningful order and equal intervals.
Example: Highway Mile Markers
When a car breaks down on the highway, the emergency dispatcher may ask for the nearest mile marker. Highway mile marker value is a quantitative variable because it is numeric with a meaningful order of magnitudes and equal intervals.
Example: Running Distance
A runner records the distance he runs each day in miles. Distance in miles is a quantitative variable because it takes on numerical values with meaningful magnitudes and equal intervals.
Example: Highest Level of Education
A census asks residents for the highest level of education they have obtained: less than high school, high school, 2-year degree, 4-year degree, master's degree, doctoral/professional degree. This is a categorical variable. While there is a meaningful order of educational attainment, the differences between each category are not consistent. For example, the difference between high school and 2-year degree is not the same as the difference between a master's degree and a doctoral/professional degree. Because there are not equal intervals, this variable cannot be classified as quantitative.
Example: Online Courses Taught
A survey designed for online instructors asks, "How many online courses have you taught?" Three options are given: "none," "some," or "many." While there is a meaningful order of magnitudes, there are not equal intervals. This is a categorical variable.
If the survey had asked, "How many online courses have you taught? Enter a number." this would be a quantitative variable. Here, participants are answering with the number of online courses they have taught. This is a numerical value with a meaningful order of magnitudes and equal intervals.
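The distinction matters in practice because it determines which summaries make sense. The following is a minimal sketch, assuming made-up responses for two of the examples above: categories are summarized with counts, while quantitative values support arithmetic such as a mean.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses (invented for illustration).
flavors = ["chocolate", "vanilla", "chocolate", "strawberry", "chocolate"]
children = [0, 2, 1, 3, 2]

# Categorical variable: count how many cases fall in each category.
flavor_counts = Counter(flavors)

# Quantitative variable: arithmetic such as a mean is meaningful.
mean_children = mean(children)

print(flavor_counts["chocolate"])  # 3
print(mean_children)               # 1.6
```

Averaging the flavor labels would be meaningless, which is one quick check for whether a variable is categorical.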
1.1.2  Explanatory & Response Variables
In some research studies one variable is used to predict or explain differences in another variable. In those studies, the explanatory variable is used to predict or explain differences in the response variable. In an experimental study, the explanatory variable is the variable that is manipulated by the researcher.
 Explanatory Variable

Also known as the independent or predictor variable, it explains variations in the response variable; in an experimental study, it is manipulated by the researcher
 Response Variable

Also known as the dependent or outcome variable, its value is predicted or its variation is explained by the explanatory variable; in an experimental study, this is the outcome that is measured following manipulation of the explanatory variable
Example: Panda Fertility Treatments
A team of veterinarians wants to compare the effectiveness of two fertility treatments for pandas in captivity. The two treatments are in vitro fertilization and male fertility medications. This experiment has one explanatory variable: type of fertility treatment. The response variable is a measure of fertility rate.
Example: Public Speaking Approaches
A public speaking teacher has developed a new lesson that she believes decreases student anxiety in public speaking situations more than the old lesson. She designs an experiment to test if her new lesson works better than the old lesson. Public speaking students are randomly assigned to receive either the new or old lesson; their anxiety levels during a variety of public speaking experiences are measured. This experiment has one explanatory variable: the lesson received. The response variable is anxiety level.
Example: Coffee Bean Origin
A researcher believes that the origin of the beans used to make a cup of coffee affects hyperactivity. He wants to compare coffee from three different regions: Africa, South America, and Mexico. The explanatory variable is the origin of coffee bean; this has three levels: Africa, South America, and Mexico. The response variable is hyperactivity level.
Example: Height & Age
A group of middle school students wants to know if they can use height to predict age. They take a random sample of 50 people at their school, both students and teachers, and record each individual's height and age. This is an observational study. The students want to use height to predict age so the explanatory variable is height and the response variable is age.
Example: Gender & Height
Research question: Do third grade boys tend to be taller than third grade girls?
This is an observational study. The researcher wants to use gender to explain differences in height. The explanatory variable is gender. The response variable is height.
1.2  Samples & Populations
We often have questions concerning large populations. Gathering information from the entire population is not always possible due to barriers such as time, accessibility, or cost. Instead of gathering information from the whole population, we often gather information from a smaller subset of the population, known as a sample.
Values concerning a sample are referred to as sample statistics while values concerning a population are referred to as population parameters.
 Population
 The entire set of possible cases
 Sample
 A subset of the population from which data are collected
 Statistic
 A measure concerning a sample (e.g., sample mean)
 Parameter
 A measure concerning a population (e.g., population mean)
The process of using sample statistics to make conclusions about population parameters is known as inferential statistics. In other words, data from a sample are used to make an inference about a population.
 Inferential Statistics
 Statistical procedures that use data from an observed sample to make a conclusion about a population
Example: Student Housing
A survey is carried out at Penn State Altoona to estimate the proportion of all undergraduate students living at home during the current term. Of the 3,838 undergraduate students enrolled at the campus, a random sample of 100 was surveyed.
 Population: All 3,838 undergraduate students at Penn State Altoona
 Sample: The 100 undergraduate students surveyed
We can use the data collected from the sample of 100 students to make inferences about the population of all 3,838 students.
Example: Polling Teachers
Educational policy researchers randomly selected 400 teachers from the National Science Teachers Association database of members and asked them whether or not they believed that evolution should be taught in public schools. They received responses from 252 teachers.
 Population: All National Science Teachers Association members
 Sample: The 252 respondents
The researchers can use the data collected from the 252 teachers who responded to the survey to make inferences about the population of all National Science Teachers Association members.
Example: Flipping a Coin
A fair coin is flipped 500 times and the number of heads is recorded.
 Population: All flips of this coin
 Sample: The 500 flips recorded in this study
We can use data from these 500 flips to make inferences about the population of all flips of this coin.
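The coin example can be simulated to show the difference between a statistic and a parameter. This is a minimal sketch using Python's standard `random` module; the seed value is an arbitrary choice made only so the run is repeatable.

```python
import random

rng = random.Random(200)  # fixed seed (arbitrary) so the run is repeatable

# Sample: 500 recorded flips of a fair coin (True = heads).
flips = [rng.random() < 0.5 for _ in range(500)]

heads = sum(flips)
sample_proportion = heads / len(flips)  # statistic: describes the sample
population_proportion = 0.5             # parameter: true for a fair coin

print(sample_proportion)
```

The sample proportion will usually be close to, but not exactly, 0.5. Inferential statistics is about reasoning from the former to the latter.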
1.2.1  Sampling Bias
Recall that the entire group of individuals of interest is called the population. It may be unrealistic or even impossible to gather data from the entire population. The subset of the population from which data are actually gathered is the sample. A sample should be selected from a population randomly; otherwise it may be prone to bias. Our goal is to obtain a sample that is representative of the population.
 Representative Sample
 A subset of the population from which data are collected that accurately reflects the population
 Bias
 The systematic favoring of certain outcomes
 Sampling Bias
 Systematic favoring of certain outcomes due to the methods employed to obtain the sample
Example: Weight Loss Study Volunteers
A medical research center is testing a new weight loss treatment. They advertise on a social media site that they are looking for volunteers to participate. There is sampling bias because the sample will be limited to people who use the social media site where they advertised. The individuals who choose to participate may be different from the overall population. For example, volunteers may be individuals who are already actively trying to lose weight. This is not a representative sample because the sample may have characteristics that are different from the population of interest.
Example: NYC Advertising Study
The marketing department for a large retail chain wants to survey their customers about a new advertising plan. They go into one of their largest New York City stores on a Tuesday morning and survey the first 50 people who make a purchase. There is sampling bias for a number of reasons. They are only sampling at one store, in New York City; there may be differences between the customers at this store and those that shop at their other locations. By conducting their survey on a Tuesday morning they are limiting themselves to individuals who are out shopping at that time; the sample may lack people who work during the day. Finally, they only survey people who make a purchase; individuals who do not make a purchase, perhaps because they are not satisfied with the store, will not be included in their sample. This is not a representative sample because the sample selected may be different from the population of interest.
1.2.2  Sampling Methods
There are many different ways to select a sample from a population. Some of these methods are probability-based, such as the simple random sampling method that you'll read about below and in your textbook. Other probability-based methods include cluster sampling methods and stratified sampling methods. You may learn more about these if you take a research methods course in the future. Other sampling methods are not probability-based, such as convenience sampling methods, which you'll read about below.
To prevent sampling bias and obtain a representative sample, a sample should be selected using a probability-based sampling design which gives each individual a known chance of being selected. The most common probability-based sampling method is the simple random sampling method.
Using this method, a sample is selected without replacement. This means that once an individual has been selected to be a part of the sample they cannot be selected a second time. If multiple samples are being taken (e.g., when constructing a sampling distribution in Lesson 4), an individual can appear in more than one sample, but only once in each sample.
 Simple Random Sampling
 A method of obtaining a sample from a population in which every member of the population has an equal chance of being selected
Example: Community Service Attitudes
An institutional researcher is conducting a study of World Campus students’ attitudes toward community service. He takes a list of all 12,242 World Campus students and uses a random number generator to select 30 students whom he contacts to complete the survey. This researcher used simple random sampling because participants were selected from the overall population in a way that each individual had an equal chance of being selected.
Example: Languages
A student wants to learn more about the languages spoken in her town. She has access to the census forms submitted by all 3,500 households in her town. It would take too long for her to go through all 3,500 forms, so she uses a random number generator to select 100 households. She finds those 100 census forms and records data concerning the languages spoken in those households. This is a simple random sample because the sample of 100 households was selected in a way that each of the 3,500 households had an equal chance of being selected.
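The random number generator step in the World Campus example can be sketched in Python. This is an illustration only: the ID numbers are hypothetical stand-ins for the student list, and the seed is arbitrary. `random.sample` draws without replacement, which matches the description of simple random sampling above.

```python
import random

rng = random.Random(1)  # fixed seed (arbitrary) for a repeatable illustration

# Hypothetical ID numbers standing in for the list of all 12,242 students.
population_ids = range(1, 12243)

# random.sample draws without replacement, so no ID can appear twice.
sample_ids = rng.sample(population_ids, 30)

print(len(sample_ids))       # 30
print(len(set(sample_ids)))  # 30, every selected ID is distinct
```

Removing the fixed seed gives a different set of 30 IDs on each run, and every student has the same chance of being selected.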
Convenience Sampling
While probability-based sampling methods are considered better because they can prevent sampling bias, there are times when it is not possible to use one of these methods. For example, a researcher may not have access to the entire population. In cases where probability-based sampling methods are not practical, convenience samples are often used.
 Convenience Sampling
 A method of obtaining a sample from a population by ease of accessibility; such a sample is not random and may not be representative of the intended population.
Example: Weight Loss Supplements
A weight loss company wants to compare how much weight adult women lose on their supplement versus a competitor's supplement. To recruit participants they post an advertisement in a newspaper asking for women who want to lose weight. This is an example of a volunteer sample which is a convenience sampling method. The researchers are using a sample of individuals who volunteer to participate.
Example: Chocolate Preferences
A chocolate company wants to know if customers prefer their dark chocolate with or without peanuts. They set up a table in a grocery store on a Monday morning, offer customers samples of their dark chocolate with and without peanuts, and ask which they prefer. This is an example of a convenience sampling method. The sample is not being selected using any probability-based method and may not be representative of the company's intended population. People who grocery shop may be a special subset of the population. For example, people who do not work traditional full-time jobs may be more likely to grocery shop at that time. The researchers are using a sample of individuals who happen to be grocery shopping on a Monday morning and who volunteer to eat their chocolate.
1.2.2.1  Minitab Express: Simple Random Sampling
Using simple random sampling methods, each member of the population has an equal chance of being selected. We can use statistical software to select a simple random sample.
In the example below we will randomly select 10 names from a class list.
Minitab Express – Random Sampling from a Column
We could place those names into a column in Minitab Express. Open the data set containing the class list, then randomly select 10 names as follows:
 On a PC or Mac, select DATA > Sample from Columns
 Double-click on the variable Name in the box to the left to insert it into the "Take a sample from the following columns" box.
 In the box labeled "Number of rows in each sample", enter 10.
 By default, leave the method as "Sample without replacement".
 Click OK.
The result should be the following output:
Input
 Source data column: Name
 Number of rows sampled: 10
 Method: Without replacement
Output
 Sampled data column: C2
10 rows were sampled from Name and stored in C2.
Along with a random sample of the names in the second column in the data worksheet:
      C1          C2
      Name        Sample From Name
 1    Beckman     Qi
 2    Beeson      Song
 3    Boone       Walia
 4    Botero      Gruver
 5    Brooks      Corey
 6    Brown       Cingolani
 7    Campbell    Farooq
 8    Cao         Yan
 9    Chen        O'Donnell
10    Chen        Wang
11    Chung
Since we are using simple random sampling procedures, the results will be different each time due to random sampling variation. Try these steps a few times; you should see that you get a different set of 10 names each time.
1.3  Other Sources of Bias
On the previous pages you learned about sampling bias and how simple random sampling methods can be used to avoid sampling bias. Here, we will discuss two other sources of bias: nonresponse bias and response bias. These are both problems that should be prevented in the design of a research study.
 Nonresponse Bias
 Systematic favoring of certain outcomes that occurs when the individuals who choose to participate in a study differ from the individuals who choose to not participate
 Response Bias
 Systematic favoring of certain outcomes that occurs when participants do not respond truthfully; they may do so to align with social norms or to appease the researcher
Example: Restaurant Experience Survey
A restaurant invited their recent customers to complete an online survey. Customers who had really strong feelings about their experience, either positive or negative, were very likely to complete the survey while customers who had a neutral experience were much less likely to complete the survey. This is an example of nonresponse bias because the individuals who chose to participate differed from those who chose to not participate.
Example: Retail Store Hours
A retail store was considering expanding their operating hours. To determine if this was a need perceived by their customers, they conducted a survey over the telephone to obtain data. Research assistants called the phone numbers of customers who were randomly selected to participate between the hours of 9AM and 4PM. Individuals who were at work were less likely to answer their phone call or agree to participate in the study than individuals who were at home at that time. This is an example of nonresponse bias because the individuals who responded to the survey were different from individuals who did not respond in terms of their work schedule.
Example: Sexual Activity Survey
A psychologist is conducting a research study concerning sexual activities. The survey is administered over the phone and many of the questions are personal. Some participants feel uncomfortable and do not answer honestly due to embarrassment. This is an example of response bias because the participants are not responding truthfully; instead their responses are biased toward what they perceive as being socially acceptable.
Example: Cheating in Class
Using an anonymous online survey, a professor asks his students “Have you cheated on an exam in my class?” Many of the students who have cheated still answered “no.” This is an example of response bias because the participants are not responding truthfully; instead their responses are biased toward responses that are less likely to get them in trouble.
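A small simulation can show how nonresponse bias distorts results in a situation like the restaurant survey. This is a sketch under stated assumptions: the customer population, the response rates, and the seed are all invented for illustration and are not taken from any real study.

```python
import random

rng = random.Random(7)  # fixed seed (arbitrary) for repeatability

# Hypothetical customer population: equal thirds positive, neutral, negative.
population = ["positive", "neutral", "negative"] * 3000

# Assumed response rates: strong feelings respond far more often than neutral.
response_rate = {"positive": 0.6, "neutral": 0.1, "negative": 0.6}

# A customer responds when a random draw falls below their response rate.
respondents = [c for c in population if rng.random() < response_rate[c]]

def share_neutral(group):
    """Proportion of a group whose experience was neutral."""
    return sum(1 for c in group if c == "neutral") / len(group)

print(share_neutral(population))   # one third of all customers
print(share_neutral(respondents))  # noticeably smaller among respondents
```

Even though every customer was invited, the respondents understate how common neutral experiences are because willingness to respond is related to the outcome being measured.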
1.4  Research Study Design
Experimental and Observational Designs
Research studies are often classified in terms of their designs. Here, we will make the distinction between experimental and observational research designs.
 Experimental Research Design

A study in which the researcher manipulates the treatments received by subjects and collects data; also known as a scientific study
 Observational Research Design

A study in which the researcher collects data without performing any manipulations; also known as a nonexperimental study
Example: Caffeinated Coffee Studies
An organization wants to know if drinking caffeinated coffee causes hyperactivity in college students. To test their research question, they select a sample of college students and give them a survey concerning their intake of caffeinated coffee and their hyperactivity levels. This is an observational study because the researchers are not making any manipulations. They are observing what is happening without intervening. This is not an experiment because no treatment was imposed by the researchers.
Another organization also wants to know if drinking caffeinated coffee causes hyperactivity in college students. They design a different study. They select a random sample of college students and randomly assign them to drink coffee with or without caffeine. The researchers observe the students' behaviors. This is an experimental study because a treatment is being imposed. The researchers are manipulating the treatment that each participant receives.
On Your Own
A team of researchers wants to know if Advil or Tylenol is more effective.
Think about how they could collect data using an observational design and how they could collect data using an experimental design.
1.4.1  Confounding Variables
Experimental studies are typically preferred over observational studies because they allow for more control. A common problem with observational studies is that there may be other variables influencing the results that the researchers were not able to take into account. These are known as confounding variables.
 Confounding Variable

Characteristic that varies between cases and is related to both the explanatory and response variables; also known as a lurking variable or a third variable
Example: Ice Cream & Home Invasions
There is a positive relationship between ice cream sales and home invasions (i.e., as ice cream sales increase throughout the year so do home invasions). It is clear that increases in ice cream sales do not cause home invasions to increase, and home invasions do not cause an increase in ice cream sales. There is a third variable at play here: outdoor temperature. When the weather is warmer both ice cream sales and home invasions increase. In this case, outdoor temperature is a confounding variable.
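The ice cream example can be simulated to show how a confounding variable produces an association between two variables that do not influence each other. In this sketch all numbers (temperature range, slopes, noise levels, seed) are assumptions chosen for illustration; the correlation is computed by hand so the code relies only on the standard library.

```python
import random

rng = random.Random(3)  # fixed seed (arbitrary) for repeatability

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical daily temperatures; both outcomes depend on temperature only.
temperature = [rng.uniform(0, 35) for _ in range(365)]
ice_cream_sales = [10 * t + rng.gauss(0, 40) for t in temperature]
home_invasions = [0.5 * t + rng.gauss(0, 3) for t in temperature]

# Neither outcome appears in the other's formula, yet they are correlated.
r = pearson_r(ice_cream_sales, home_invasions)
print(round(r, 2))
```

The positive correlation arises entirely from temperature, which is related to both variables.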
1.4.2  Causal Conclusions
In order to control for confounding variables, participants can be randomly assigned to different levels of the explanatory variable. This act of randomly assigning cases to different levels of the explanatory variable is known as randomization. An experiment that involves randomization may be referred to as a randomized experiment or randomized comparative experiment. When cases are randomly assigned to different conditions, a causal conclusion can be made; in other words, we can say that differences in the response variable are caused by differences in the explanatory variable. Without randomization, an association can be noted, but a causal conclusion cannot be made.
Note that randomization and random sampling are different concepts. Randomization refers to the random assignment of experimental units to different conditions (e.g., different treatment groups). Random sampling refers to probability-based methods for selecting a sample from a population.
 Randomization
 The act of randomly assigning cases to different levels of the explanatory variable
 Causation
 Changes in one variable can be attributed to changes in a second variable
 Association
 A relationship between variables
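Randomization can be carried out with a shuffle-and-split procedure. The sketch below assumes 20 hypothetical participant IDs and two conditions (borrowing the new and old lessons from the public speaking example); the seed is arbitrary. Note that this assigns an existing sample to groups, which is separate from how the sample was selected.

```python
import random

rng = random.Random(42)  # fixed seed (arbitrary) for repeatability

# Hypothetical participant IDs.
participants = list(range(1, 21))

# Shuffle, then split: chance alone decides who receives which lesson.
shuffled = participants[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
new_lesson_group = shuffled[:half]
old_lesson_group = shuffled[half:]

print(len(new_lesson_group), len(old_lesson_group))  # 10 10
```

Because assignment depends only on chance, potential confounding variables tend to be balanced across the two groups.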
1.4.3  Independent and Paired Samples
In both observational and experimental studies, we often want to compare two or more groups. When comparing two or more groups, cases may be independent or paired.
 Independent Groups
 Cases in each group are unrelated to one another.
 Paired Groups

Cases in each group are meaningfully matched with one another; also known as dependent samples or matched pairs
Example: Exam Scores
An instructor wants to compare students' scores on the midterm and final exam. This is most often done by obtaining a sample of students and recording each student's midterm exam score and final exam score. In other words, there would be two measurements for each student. This is an example of a matched pairs design because data would be paired by student.
Example: Shoes
A shoe company is studying how many shoes Italian men and women own. In one research study they take a random sample of 500 Italian adults and ask each individual if they identify as a man or woman and how many pairs of shoes they own. The men and women in this study are in two independent groups.
In a second study the researchers use a different design. This time they take a random sample of 250 heterosexual married couples in Italy (i.e., 250 husbands and 250 wives). They record the number of shoes owned by each husband and each wife. This is an example of a matched pairs design. Data are paired by couple.
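With paired data, each case contributes a linked pair of measurements, so the natural quantity to examine is the within-pair difference. The sketch below uses hypothetical midterm and final exam scores (invented for illustration) in the spirit of the exam scores example.

```python
# Hypothetical exam scores; position i in both lists is the same student.
midterm = [78, 85, 62, 90, 71]
final = [82, 88, 70, 91, 75]

# Paired design: compute one difference per student, then summarize.
differences = [f - m for m, f in zip(midterm, final)]
mean_difference = sum(differences) / len(differences)

print(differences)      # [4, 3, 8, 1, 4]
print(mean_difference)  # 4.0
```

With independent groups there is no such pairing, so differences cannot be computed case by case and the groups are instead compared through their separate summaries.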
1.4.4  Control and Placebo Groups
A control group is an experimental condition that does not receive the actual treatment and may serve as a baseline. A control group may receive a placebo or they may receive no treatment at all. A placebo is something that appears to the participants to be an active treatment, but does not actually contain the active treatment. For example, a placebo pill is a sugar pill that participants may take not knowing that it does not contain any active medicine. This can lead to a psychological phenomenon called the placebo effect, which occurs when participants who are given a placebo treatment experience a change even though they are not receiving any active treatment. Researchers use placebos in the control group to determine if any differences between groups are due to the active medicine or the participants' perceptions (the placebo effect).
 Control Group
 A level of the explanatory variable that does not receive an active treatment; they may receive no treatment or a placebo
 Placebo Group
 A group that receives what, to them, appears to be a treatment, but actually is neutral and does not contain any active treatment (e.g., a sugar pill in a medication study)
Example: Vitamin B Energy Study
Researchers want to know if adult women who consume a drink that is high in vitamin B12 have increased energy. They obtain a representative sample of adult women. All of the women are given a drink that they are told to consume every morning. They are not told what is in the drink. Half of the women are given a drink that is high in vitamin B12 while the other half are given a drink that tastes the same but contains no vitamin B12.
The women who received the drink with no vitamin B12 are the placebo group. The purpose of the placebo group in this study is to make the two groups equivalent except for the presence of the vitamin B12. By comparing these two groups, the researchers will be able to determine what impact the vitamin B12 had on the response variable. We could also say that this served as a control group because this group did not receive any active ingredients.
1.4.5  Blinding
Blinding techniques are also used to avoid bias. In a single-blind study the participants do not know what treatment groups they are in, but the researchers interacting with them do know. In a double-blind study, the participants do not know what treatment groups they are in and neither do the researchers who are interacting with them directly. Double-blind studies are used to prevent researcher bias.
 Blinding
 Procedure employed in research to prevent bias in which the participants and/or the researchers interacting with the participants do not know which treatment each case is receiving
 Single-Blind Study
 Research study in which the participants do not know the treatment group that they have been assigned to
 Double-Blind Study
 Research study in which neither the participants nor the researchers interacting with them know which cases have been assigned to which treatment groups
Example: Yogurt Tasting
Researchers are comparing a low-fat blueberry yogurt to a high-fat blueberry yogurt. Participants are randomly assigned to receive one type of yogurt. After tasting it, they complete an online survey. The researchers know which yogurt containers are low-fat and which are high-fat, but participants are not told. This is an example of a single-blind study because the researchers know which participants are in the low-fat and high-fat groups but the participants do not know. A double-blind study may not be necessary in this case since the researchers have only minimal contact with the participants.
Example: Caffeine Energy Study
Researchers want to know if adult males who consume high amounts of caffeine interact more energetically. They obtain a representative sample and randomly assign half of the participants to take a caffeine pill and half to take a placebo pill. The pills are randomly numbered and coded so at the time the researchers do not know which participants have been given caffeine and which have been given the placebo. All participants are told that they may have been given a caffeine pill. After taking the pill, researchers observe the participants interacting with one another and rate the interactions in terms of level of energy.
This is a double-blind study because neither the researchers nor the participants know who is in which group at the time the data are collected. After the data are collected, researchers can look at the pill codes to determine which groups the participants were in to conduct their analyses. A double-blind study is necessary here because the researchers are observing and rating the participants. If the researchers know who is in the caffeine group they may be more likely to rate their levels of energy as very high because that is consistent with their hypothesis.
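The idea of keeping treatment assignments hidden until after data collection can be sketched in code. This is a simplified illustration, not the procedure used in any particular study: the participant IDs, group sizes, and seed are hypothetical, and the lookup table stands in for the coded pills described above.

```python
import random

rng = random.Random(11)  # fixed seed (arbitrary) for repeatability

participants = [f"P{i:02d}" for i in range(1, 11)]  # hypothetical IDs

# Randomly assign half to caffeine and half to placebo.
treatments = ["caffeine"] * 5 + ["placebo"] * 5
rng.shuffle(treatments)

# Sealed key: only consulted after all ratings are collected.
sealed_key = dict(zip(participants, treatments))

# What raters see during data collection: IDs only, no treatment labels.
rater_view = sorted(sealed_key)

print(rater_view[:3])  # ['P01', 'P02', 'P03']
```

Because the raters work only from the ID list, their energy ratings cannot be influenced by knowing who received caffeine.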
1.5  Lesson 1 Summary
Lesson 1: Learning Objectives
 Identify cases and variables in a research study
 Classify variables as categorical or quantitative
 Identify explanatory and response variables in a research study
 Distinguish between a sample and a population
 Determine whether a given sample is representative of the intended population
 Identify simple random sampling and convenience sampling methods
 Use Minitab Express to draw a simple random sample from a known population
 Identify potential nonresponse and response bias
 Distinguish between experimental and observational designs
 Identify confounding variables
 Identify randomized experiments
 Determine when causal conclusions (as opposed to associations) can be made
 Classify samples as being independent or paired
 Identify control groups, placebos, and blinding in research studies and explain why each is used
In this lesson you learned about how data are collected. You were introduced to terminology that will be used throughout the course and you examined different types of research study designs (experimental and observational), sampling methods, and sources of bias. You learned that in order to make generalizations from a sample to a population the sample must be representative of the population; ideally the sample should be randomly selected using a probability-based sampling method such as simple random sampling. In order to make a causal conclusion, randomization is required.