# 1.1 - Collecting Data

Collecting data is an important first step in statistical analysis. The goal of statistics is to make inferences about a population based on a sample. How we collect the data is important. If the sample is not representative of the whole population, we cannot make inferences about the population from that sample.

The following are a few frequently used methods for collecting data:

• Personal Interview: People usually respond when asked by a person, but their answers may be influenced by the interviewer.
• Telephone Interview: Cost-effective, but it needs to be kept short since respondents tend to be impatient.
• Self-Administered Questionnaire: Cost-effective, but the response rate is lower and the respondents may be a biased sample.
• Direct Observation: For certain quantities of interest, one may be able to measure them directly from the sample.
• Web-Based Survey: Can only target the population that uses the web.

# 1.1.1 - Types of Bias

Whenever data is collected, there is a risk that the sample is biased. Here are some potential types of bias.

### Types of Bias

Non-Response Bias
When a large percentage of those sampled do not respond or participate.
Response Bias
When study participants either do not respond truthfully or give answers they feel the researcher wants to hear. For example, when students are asked if they have ever cheated on an exam, even those who have may respond with "no."
Selection Bias
When the sample selected does not reflect the population of interest. For instance, you are interested in the attitudes of female students regarding campus safety, but when sampling you also include males. In this case, your population of interest was female students, but your sample included subjects not in that population (i.e., males).

Students interested in pursuing topics related to the design of experiments might explore STAT 503: Design of Experiments. STAT 503 includes extensive coverage of the implementation and analysis of a wide range of experimental designs.

# 1.1.2 - Strategies for Collecting Data

How can we get data? How do we select observations or measurements for a study?

There are two types of methods for collecting data: non-probability methods and probability methods.

Non-probability Methods

These might include:

Convenience sampling (haphazard): Collecting data from subjects who are conveniently obtained.
• Example: surveying students as they pass by in the university's student union building.
Gathering volunteers: Collecting data from subjects who volunteer to provide data.
• Example: using an advertisement in a magazine or on a website inviting people to complete a form or participate in a study.

Probability Methods

These might include:

• Simple random sample: making selections from a population so that each subject has an equal chance of being selected and every possible sample of the chosen size is equally likely.
• Stratified random sample: after identifying the population of interest, you divide it into strata (groups) based on some characteristic (e.g. sex, geographic region), then take a simple random sample from each stratum.
• Cluster sample: the population of interest is divided into clusters, a random sample of clusters is taken, and every subject in the selected clusters is included. For instance, if we were to estimate the average salary for faculty members at Penn State - University Park Campus, we could take a simple random sample of departments and find the salary of each faculty member within the sampled departments. This would be our cluster sample.
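
The three probability methods above can be sketched in Python using only the standard library. This is a minimal illustration rather than a full survey-sampling implementation: the function names, the `stratum_of` and `cluster_of` callbacks, and the faculty roster are all hypothetical and exist only for this example.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, rng):
    """Draw n subjects so that every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Split the population into strata, then take a simple random sample from each."""
    strata = defaultdict(list)
    for subject in population:
        strata[stratum_of(subject)].append(subject)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(population, cluster_of, n_clusters, rng):
    """Randomly pick whole clusters and keep every subject in the chosen clusters."""
    clusters = defaultdict(list)
    for subject in population:
        clusters[cluster_of(subject)].append(subject)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [subject for c in chosen for subject in clusters[c]]

# Hypothetical roster: 4 departments with 5 faculty members each.
faculty = [(f"dept{d}", f"prof{d}_{i}") for d in range(4) for i in range(5)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

srs = simple_random_sample(faculty, 6, rng)
strat = stratified_sample(faculty, lambda s: s[0], 2, rng)
clus = cluster_sample(faculty, lambda s: s[0], 2, rng)
```

Note the difference in what is randomized: the stratified sample draws a few subjects from every department, while the cluster sample draws a few departments and keeps everyone in them.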

There are advantages and disadvantages to both types of methods. Non-probability methods are often easier and cheaper to carry out. However, when non-probability methods are used, it is often the case that the sample is not representative of the population. If it is not representative, you can make generalizations only about the sample, not the population. The primary benefit of using probability sampling methods is the ability to make inferences. We can assume that by using random sampling we attain a representative sample of the population. The results can be “extended” or “generalized” to the population from which the sample came.

## Example 1-1: Survey Methods

#### Airline Company Survey of Passengers

Let's say that you are the owner of a large airline company and you live in Los Angeles. You want to survey your L.A. passengers on what they like and dislike about traveling on your airline. For each of the methods, determine if a non-probability method or a probability method is used. Then determine the type of sampling.

1. Since you live in L.A. you go to the airport and just interview passengers as they approach your ticket counter.
Non-probability method; convenience sampling
2. You have your ticket counter personnel distribute a questionnaire to each passenger requesting they complete the survey and return it at the end of the flight.
Non-probability method; volunteer sampling
3. You randomly select a set of passengers flying on your airline and question those that you have selected.
Probability method; simple random sampling
4. You group your passengers by the class they fly (first, business, economy), and then take a random sample from each of these groups.
Probability method; stratified sampling
5. You group your passengers by the class they fly (first, business, economy), randomly select some of these classes from various flights, and survey each passenger in every class and flight selected.
Probability method; cluster sampling

## Think About it!

In predicting the 2008 Iowa Caucus results a phone survey said that Hillary Clinton would win, but instead, Obama won. Where did they go wrong?

The survey was based on landline phones, which skewed the sample toward older people, who tended to support Clinton. However, many younger people got involved in this election and voted for Obama. These younger people could only be reached by cell phone.

Students interested in pursuing topics related to sampling might explore STAT 506: Sampling Theory. STAT 506 covers sampling design and analysis methods that are useful for research and management in many fields. A well-designed sampling procedure ensures that we can summarize and analyze data with a minimum of assumptions and complications.

# 1.1.3 - Types of Studies

Now that we know how to collect data, the next step is to determine the type of study. The type of study will determine what type of relationship we can conclude.

There are predominantly two different types of studies:

Observational
A study where a researcher records observations or measurements without manipulating any variables. These studies can show that there may be a relationship, but not necessarily a cause and effect relationship.
Experimental
A study that involves some random assignment of a treatment; researchers can draw cause and effect (or causal) conclusions. An experimental study may also be called a scientific study or an experiment.

Note! Random selection (a probability method of sampling) is not random assignment (as in an experiment). In an ideal world you would have a completely randomized experiment; one that incorporates random sampling and random assignment.

## Example 1-2: Types of Studies

#### Quiz and Exam Score Studies

Let's say that there is an option to take quizzes throughout this class. In an observational study, we may find that better students tend to take the quizzes and do better on exams. Consequently, we might conclude that there may be a relationship between quizzes and exam scores.

In an experimental study, we would randomly assign quizzes to specific students to look for improvements. In other words, we would look to see whether taking quizzes causes higher exam scores.
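
The contrast between the two designs can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the "ability" variable, the score formula, and the 2-point true quiz effect are invented, not taken from any real class. In the observational version, stronger students are more likely to choose the quizzes; in the experimental version, a coin flip decides.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # hypothetical exam points gained by taking the quizzes

def exam_score(ability, took_quiz, rng):
    # Hypothetical model: ability and quizzes both raise the score, plus noise.
    return 70 + 10 * ability + TRUE_EFFECT * took_quiz + rng.gauss(0, 5)

def mean_difference(records):
    """Mean exam score of quiz takers minus mean score of non-takers."""
    quiz = [score for took, score in records if took]
    no_quiz = [score for took, score in records if not took]
    return mean(quiz) - mean(no_quiz)

def simulate(n, randomized, rng):
    records = []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        if randomized:
            # Experimental study: a coin flip assigns the quizzes.
            took = rng.random() < 0.5
        else:
            # Observational study: better students tend to opt in.
            took = ability + rng.gauss(0, 1) > 0
        records.append((took, exam_score(ability, took, rng)))
    return mean_difference(records)

rng = random.Random(1)  # fixed seed for reproducibility
observational_gap = simulate(20000, randomized=False, rng=rng)
experimental_gap = simulate(20000, randomized=True, rng=rng)
print(f"observational gap: {observational_gap:.1f}")
print(f"experimental gap:  {experimental_gap:.1f}")
```

Under these assumptions the observational gap mixes the quiz effect with the ability difference between the two groups, so it overstates the effect, while the randomized gap should land close to `TRUE_EFFECT`.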

## Causation

It is very important to distinguish between observational and experimental studies, since one has to be very skeptical about drawing cause and effect conclusions from observational studies. The random assignment of treatments (i.e. what distinguishes an experimental study from an observational study) is what allows one to draw cause and effect conclusions.

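Ethics is an important aspect of experimental design to keep in mind, because some treatments cannot ethically be assigned. For example, the original relationship between smoking and lung cancer was based on an observational study rather than on an assignment of smoking behavior, since it would be unethical to assign subjects to smoke.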

## Try It!

We want to decide whether Advil or Tylenol is more effective in reducing fever.

#### Method 1

Ask the subjects which one they use and ask them to rate the effectiveness. Is this an observational study or experimental study?
This is an observational study since we just observe the data and have no control over which subject uses which treatment.

#### Method 2

Randomly assign half of the subjects to take Tylenol and the other half to take Advil. Ask the subjects to rate the effectiveness. Is this an observational study or experimental study?
This is an experimental study since we decide, at random, which subject uses which treatment. Thus the self-selection bias will be eliminated.
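
As a minimal sketch of the random assignment in Method 2, one could shuffle the subject list and deal subjects out to the treatments in turn. The helper name `randomly_assign` and the subject labels are hypothetical, chosen only for this illustration.

```python
import random

def randomly_assign(subjects, treatments, rng):
    """Shuffle the subjects, then deal them out to the treatments in turn,
    so that group sizes differ by at most one."""
    shuffled = list(subjects)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical study with 20 subjects.
subjects = [f"subject{i}" for i in range(1, 21)]
groups = randomly_assign(subjects, ["Tylenol", "Advil"], random.Random(42))
```

Because chance alone decides who lands in each group, the subjects' own preferences cannot influence which drug they take.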

# 1.1.4 - Variables

There may be many variables in a study. The variables may play different roles in the study. Variables can be classified as either explanatory or response variables.

Variable
A variable is any characteristic, number, or quantity that can be measured, counted, or observed for record.
Response Variable
The variable about which the researcher is posing the question. May also be called the outcome or the dependent variable.
Explanatory Variable
Variables that serve to explain changes in the response. They may also be called the predictor or independent variables.
Note! A variable can serve as an explanatory variable in one study but as the response variable in another.

## Example 1-3: Response and Explanatory Variables

Consider the variables Sex (Female, Male) and Height (in inches). Which variable do you believe explains the other? In other words, would it make more sense to say a person's sex more likely explains that person's height, or to say a person's height explains that person's sex?
In this case, Sex would explain Height, making Sex the explanatory variable and Height the response.
Consider the variables Height and Weight. Which is the response? Which is the explanatory?
In this case, a person's height would more likely explain their weight than the other way around, making Height the explanatory variable and Weight the response.

## Other Variables

Other types of variables include:

Lurking variable
A variable that is neither the explanatory variable nor the response variable but has a relationship (e.g. may be correlated) with the response and the explanatory variable. It is not considered in the study but could influence the relationship between the variables in the study.
Confounding variable
A variable that is in the study and is related to the other study variables, thus having an effect on the relationship between these variables.

A lurking variable, if included in the study, could have a confounding effect and then be classified as a confounding variable.

## Example 1-4: Lurking and Confounding Variables

Suppose you teach a class where students must submit weekly homework and then take a weekly quiz. You want to see if there is a relationship between the scores on the two assignments (i.e. higher homework scores are aligned with higher quiz scores). As you look at the data, you begin to consider whether the submission date of the homework has an effect on the quiz grades. That is, do students who submit the homework several days before taking the quiz perform better overall on the quiz than students who leave little or no time gap between the two assignments (e.g. they do both on the same day)? The rationale is that students who allow time between the homework and the quiz to study may perform better than the other group.

In this example, “days between submission of homework and quiz” would be a lurking variable, as it was not included in the study. Now suppose you obtained that information and re-examined the relationship between the two assignments, taking the time gap into consideration. If the relationship changed once the time gap was included in the analysis, then “days between submission” would be considered a confounding variable.

In an experiment where treatments are randomly assigned, one assumes these variables get evenly shared across the groups with the intention that any influence they may have on the outcome is negated or reduced.

# 1.1.5 - Principles of Experimental Design

The following principles of experimental design have to be followed to enable a researcher to conclude that differences in the results of an experiment, not reasonably attributable to chance, are likely caused by the treatments.

Control
The researcher needs to control for effects due to factors other than the ones of primary interest.
Randomization
Subjects should be randomly divided into groups to avoid unintentional selection bias in the groups.
Replication
A sufficient number of subjects should be used to ensure that randomization creates groups that resemble each other closely and to increase the chances of detecting differences among the treatments when such differences actually exist.
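
The replication principle, that larger randomized groups resemble each other more closely, can be checked with a short simulation. This is only a sketch under invented assumptions: the "age" characteristic, its uniform range of 18 to 80, and the function name `average_imbalance` are hypothetical.

```python
import random
from statistics import mean

def average_imbalance(n_subjects, n_trials, rng):
    """Average absolute gap in mean age between two randomly formed groups."""
    gaps = []
    for _ in range(n_trials):
        # Hypothetical subject characteristic that is not the treatment.
        ages = [rng.uniform(18, 80) for _ in range(n_subjects)]
        rng.shuffle(ages)  # random assignment: first half vs. second half
        half = n_subjects // 2
        gaps.append(abs(mean(ages[:half]) - mean(ages[half:])))
    return mean(gaps)

rng = random.Random(7)  # fixed seed for reproducibility
small = average_imbalance(10, 200, rng)    # 5 subjects per group
large = average_imbalance(1000, 200, rng)  # 500 subjects per group
print(f"average age gap with 10 subjects:   {small:.2f}")
print(f"average age gap with 1000 subjects: {large:.2f}")
```

With only a handful of subjects, chance alone can leave the two groups noticeably different in age; with many subjects the groups become much more alike, so a difference in outcomes is easier to attribute to the treatment.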

The benefits to randomization are:

1. If treatments are randomly assigned, then significant results support causal (cause and effect) conclusions; that is, that the treatment caused the result. The treatment can be referred to as the explanatory variable and the result as the response variable.
2. If subjects are randomly selected from some population, then the results can be extended to that population. Random assignment is required for an experiment. When both random assignment and random selection are part of the study, we have a completely randomized experiment. Without random assignment (i.e. in an observational study), the treatment can only be referred to as being related to the outcome.
