1.2 - An Overview of Sampling
Why take samples?
You want to understand certain things and have some objective in mind. In each case, there is a target population.
The goal of many research projects is to learn more about that population. For instance, if you were a conservation officer, you might be interested in the number of deer in central Pennsylvania. Here the population is the deer in that region, and the quantity of interest is how many there are. What steps can we take to understand the population better?
What we can do is take a sample. The difficulty that then arises, and it is one of the major objectives of statistics, is inference.
One important objective of statistics is to make inferences about a population from the information contained in a sample.
We should always keep in mind that we perform sampling because we want to make this inference. Because of this inference, we begin to talk about things like confidence intervals and hypothesis testing. A good picture to represent this situation follows:
We can draw a sample from the population. How do we do this? What type of scheme do we use to draw a sample? This is very important since the inferences that can be made will strongly depend upon how you do the sampling.
Examples of Sampling
Sampling is useful in many different fields, but different sampling problems can arise in each of these areas.
- Economic: We might want to estimate the average household income in Centre County. This would be important in harder economic times or as it relates to taxes or the assessment of property values.
- Geologic: We might want to estimate the total pyrite content of the rocks at the I-99 construction site at Skytop Mountain here in Centre County.
- Marketing Research: We might want to estimate the total market size for electric cars.
- Engineering: We might want to estimate the failure rate of a certain electronic component.
To deal with all of these problems one thing we have to decide is:
How are we going to select a sample?
There are many ways to take a sample. Which method we choose depends on the problem: the more we know about the research problem, the better we can determine which sampling method makes the most sense. Therefore, we will talk about sampling design.
Sampling design is the procedure by which the sample is selected. There are two very broad categories of sampling designs.
- Probability Sampling
- (all designs we will discuss in detail fall into this type)
- When we use probability sampling, randomness is built into the sampling design so that the properties of the estimators can be assessed probabilistically. Examples include simple random sampling, stratified sampling, cluster sampling, systematic sampling, and network sampling.
- Quota Sampling
- This is what people used to do before 1948. Sampling here is based upon quotas. For instance, each interviewer is given quotas chosen to be representative of the population, but the selection of respondents within those quotas is left to the subjective judgment of the interviewer. That is the weakness: the choice of respondent is in the hands of the interviewer. How can you ensure that the sample you have selected is indeed representative? If you are subjective when it comes to which individuals are sampled, then this is an example of quota sampling.
Let's illustrate this point a bit more. Suppose you were going to select and interview people that visit Penn State University Park's Hetzel Student Union Building (HUB). If you are just selecting people by walking around and picking them subjectively to interview based upon those you met, or that just walked by, this involves human subjectivity.
Interviewers in probability sampling are given specific sampling procedures to follow, or names and addresses already selected by a randomization scheme, so that respondents are selected without human subjectivity. For example, you might sample every third person who walks in the door of the HUB, regardless of who they are.
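The "every third person" rule can be sketched in a few lines of Python. This is only an illustration: the visitor list, the function names, and the random starting point are assumptions made here, not part of the original example. A simple random sample is shown alongside for comparison.

```python
import random

def systematic_sample(units, k, start=None, rng=random):
    """Take every k-th unit, beginning at position `start`.

    If `start` is not given, it is chosen at random from 0..k-1, so the
    interviewer has no say in who ends up in the sample.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if start is None:
        start = rng.randrange(k)
    return units[start::k]

def simple_random_sample(units, n, rng=random):
    """Draw n distinct units so that every subset of size n is equally likely."""
    return rng.sample(units, n)

# Hypothetical list of people in the order they walked into the building.
visitors = [f"visitor_{i}" for i in range(1, 31)]

rng = random.Random(506)  # fixed seed so the sketch is reproducible
every_third = systematic_sample(visitors, k=3, rng=rng)
srs = simple_random_sample(visitors, n=10, rng=rng)

print(every_third)
print(srs)
```

In both functions the selection is driven by the random number generator rather than by the interviewer's judgment, which is the property that separates probability sampling from quota sampling.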
The main difference between these two approaches is that probability sampling removes human subjectivity. Probability sampling does not depend on your subjective judgment for determining samples.
This is an important distinction that you need to be able to make.
Example 1-1: Sample Results for the 1948 Washington State Presidential Poll
Here are the results of this poll. Using quota sampling, the poll gave Dewey 52% of the votes and Truman 45.3% of the votes.
The Gallup poll pioneered probability sampling. They used probability sampling to do this survey. Their results gave 46% of the votes to Dewey and 50.5% of the votes to Truman. The actual results of the election are given in the last column. Notice that in this case the quota sampling approach was off by quite a bit. From this time on, probability sampling became the norm.
When you choose your respondents, use objective criteria. The major reason for poor results from quota sampling is the subjectivity involved in the selection of subjects. As soon as we introduce this type of bias, we introduce problems with our data, some of which we cannot get rid of even by collecting a larger sample.
Basic Idea of Sampling and Estimation
One interesting and important fact to note is that in most useful sampling schemes, variability from sample to sample can be estimated using the single sample selected.
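As a minimal sketch of this idea, assume a simple random sample drawn without replacement from a population of known size N. Under that assumption, the variance of the sample mean is commonly estimated by (1 - n/N) s²/n, where s² is the sample variance. The function name and the income figures below are hypothetical and are used only for illustration.

```python
import math
import random
import statistics

def srs_mean_and_se(sample, population_size):
    """Return the sample mean and its estimated standard error.

    Assumes a simple random sample drawn without replacement from a
    finite population of known size N (an assumption of this sketch).
    """
    n = len(sample)
    if n < 2 or n > population_size:
        raise ValueError("need 2 <= n <= N")
    mean = statistics.mean(sample)
    s2 = statistics.variance(sample)   # sample variance, divisor n - 1
    fpc = 1 - n / population_size      # finite population correction
    return mean, math.sqrt(fpc * s2 / n)

# Hypothetical population of 5,000 household incomes (made-up numbers).
rng = random.Random(12)
population = [rng.lognormvariate(11, 0.5) for _ in range(5000)]

# Only this one sample is used to estimate the sample-to-sample variability.
sample = rng.sample(population, 200)
mean, se = srs_mean_and_se(sample, len(population))
print(f"estimated mean income: {mean:.0f}, estimated SE: {se:.0f}")
```

The standard error is computed from the one sample in hand; no repeated sampling from the population is needed.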
Using the sample we collect, we can construct estimates for the parameter of the population that we are interested in. Usually, there are many ways to construct estimates. Thus, we need some guidelines to determine which estimates are desirable.
Some desirable properties for estimators are:
- Unbiased or nearly unbiased.
- Have a low MSE (Mean Square Error), or a low variance when the estimator is unbiased. [MSE measures, on average, how far the estimate is from the parameter of interest in squared distance, whereas variance measures how far the estimate is from its own mean. Thus, when an estimator is unbiased, its MSE is the same as its variance.]
- Robust - so your answer does not fluctuate too much with respect to extreme values.
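The relationship between MSE, variance, and bias mentioned in the list above can be written as the standard decomposition below. The symbols are notation assumed here for illustration: \hat{\theta} is an estimator and \theta is the population parameter it targets.

```latex
\begin{align*}
\operatorname{Bias}(\hat{\theta}) &= E[\hat{\theta}] - \theta, \\
\operatorname{MSE}(\hat{\theta}) &= E\big[(\hat{\theta} - \theta)^2\big]
  = \operatorname{Var}(\hat{\theta}) + \big[\operatorname{Bias}(\hat{\theta})\big]^2 .
\end{align*}
```

When the bias is zero, the second term vanishes and the MSE equals the variance.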
Sampling and Nonsampling Errors
- Sampling error
- error that arises because a sample, rather than the whole population, is used
- Nonsampling error
- nonresponse, variables measured with error, etc.