1 Estimating Population Mean and Total under SRS
Overview
This lesson begins with an introduction to the course and an overview of sampling. We discuss the distinction between probability sampling and quota sampling, and then show how to estimate the population mean and population total under simple random sampling.
Lesson 1: Ch. 1, 2.1-2.6, 3 of Sampling by Steven Thompson, 3rd Edition.
Objectives
Upon completion of this lesson, you should be able to:
- Recognize that sample design influences the type of estimation procedure,
- Distinguish between quota sampling and probability sampling,
- Identify and explain the desirable properties of estimators,
- Distinguish between sampling errors and nonsampling errors,
- Use Minitab to generate an SRS,
- Compute the point estimate of a population mean and estimate the variance of that estimate, and
- Compute the point estimate of a population total and estimate the variance of that estimate.
1.1 Introduction to the Course
This course covers sampling design and analysis methods useful for research and management in many fields. A well-designed sampling procedure ensures that we can summarize and analyze the data with a minimum of assumptions or complications. In this course, we’ll cover the basic methods of sampling and estimation and then explore selected topics and recent developments including:
- simple random sampling with associated estimation and confidence interval methods,
- selecting sample sizes,
- estimating proportions,
- unequal probability sampling,
- ratio and regression estimation,
- stratified sampling,
- cluster and systematic sampling,
- multistage designs, and
- double sampling.
One important point to consider as we move forward is that for different sampling designs the estimation procedure will depend on the sample design. Being able to identify what to use under different sampling designs is one of the things that you will learn in this course.
1.2 An Overview of Sampling
Why take samples?
You want to understand certain things and have some objective in mind. In each case, there is a target population.
The goal of many research projects is to learn more about a population of interest. For instance, if you were a conservation officer you might be interested in the number of deer in central Pennsylvania. In this case, you have a certain goal in mind. What steps can we take to understand the population better?
What we can do is take a sample. The difficulty that now arises, and a major objective of statistics, is inference.
One important objective of statistics is to make inferences about a population from the information contained in a sample.
We should always keep in mind that we perform sampling because we want to make this inference. Because of this inference, we begin to talk about things like confidence intervals and hypothesis testing. A good picture to represent this situation follows:
We can draw a sample from the population. How do we do this? What type of scheme do we use to draw a sample? This is very important since the inferences that can be made will strongly depend upon how you do the sampling.
Examples of Sampling
Sampling is useful in many different fields; however, different sampling problems can arise in each area.
- Economic: We might want to estimate the average household income in Centre County. This would be important in harder economic times or as it relates to taxes or the assessment of property values.
- Geologic: We might want to estimate the total pyrite content of the rocks at the I-99 construction site at Skytop Mountain in Centre County.
- Marketing Research: We might want to estimate the total market size for electrical cars.
- Engineering: We might want to estimate the failure rate of a certain electronic component.
To deal with all of these problems one thing we have to decide is:
How are we going to select a sample?
There are many ways to take a sample, and which method to choose depends on the problem. Knowing more about the research problem helps determine which sampling design makes the most sense. Therefore, we will talk about sampling design.
Sampling Design
Sampling design is the procedure by which the sample is selected. There are two very broad categories of sampling designs.
Definition 1.1 (Probability Sampling) (all designs we will discuss in detail fall into this type)
When we use probability sampling, randomness will be built into the sampling designs so that properties of the estimators can be assessed probabilistically, e.g., simple random sampling, stratified sampling, cluster sampling, systematic sampling, network sampling, etc.
Definition 1.2 (Quota Sampling) This is what people used to do before 1948. Sampling here is based upon quotas: each interviewer samples according to quotas designed to be representative of the population, but the selection of respondents is left up to the subjective judgment of the interviewers. The drawback is that the selection of respondents is in the hands of the interviewers. How can you ensure that the sample you have selected is indeed representative? If you are subjective about which individuals are sampled, then you are doing quota sampling.
Let’s illustrate this point a bit more. Suppose you were going to select and interview people who visit Penn State University Park’s Hetzel Student Union Building (HUB). If you simply walk around and subjectively pick people to interview, based on whoever you meet or whoever happens to walk by, the selection involves human subjectivity.
Interviewers in probability sampling are given specific sampling procedures to follow or names and addresses already selected by a randomization scheme, selected without human subjectivity. For example, if you were to sample every third person that walked in the door of the HUB regardless of who they are.
The main difference between these two approaches is that probability sampling removes human subjectivity. Probability sampling does not depend on your subjective judgment for determining samples.
This is an important distinction that you need to be able to make.
Example 1.1 (Sample Results for the 1948 Washington State Presidential Poll) Here are the results of this poll. Under quota sampling, Dewey received 52.0% of the votes and Truman 45.3%.
Candidate | Quota Sample (%) | Probability Sample (%) | Actual Result (%) |
---|---|---|---|
Dewey (Rep) | 52.0 | 46.0 | 42.7 |
Truman (Dem) | 45.3 | 50.5 | 52.6 |
The Gallup poll pioneered probability sampling and used it for this survey. Their results gave 46.0% of the votes to Dewey and 50.5% to Truman. The actual results of the election are given in the last column. Note that the quota sampling result was off by quite a bit. From this time on, probability sampling became the norm.
When you choose your respondents, use objective criteria. The major reason for poor results from quota sampling is the subjectivity involved in the selection of subjects. As soon as we introduce this type of bias, we introduce problems into our data, some of which cannot be fixed even by acquiring additional samples.
Try It!
Use Google to search for quota sampling and read what you can find on this topic.
Basic Idea of Sampling and Estimation
One interesting and important fact to note is that in most useful sampling schemes, variability from sample to sample can be estimated using the single sample selected.
Using the sample we collect, we can construct estimates for the parameter of the population that we are interested in. Usually, there are many ways to construct estimates. Thus, we need some guidelines to determine which estimates are desirable.
Some desirable properties for estimators are:
- Unbiased or nearly unbiased.
- A low MSE (Mean Square Error), or a low variance when the estimator is unbiased. [MSE measures how far the estimate is from the parameter of interest, whereas variance measures how far the estimate is from the mean of that estimate; in general, \(\text{MSE} = \text{variance} + \text{bias}^2\). Thus, when an estimator is unbiased, its MSE equals its variance.]
- Robust, so your answer does not fluctuate too much with respect to extreme values.
Sampling and Nonsampling Errors
Definition 1.3 (Sampling Error) The error that arises because a sample, rather than the whole population, is used.
Definition 1.4 (Nonsampling Error) Errors from other sources, such as nonresponse or variables measured with error.
1.3 Estimating Population Mean and Total under SRS
Definition 1.5 (Simple Random Sampling) Random sampling without replacement such that every possible sample of \(n\) units has the same probability of selection.
Example 1.2 (Patient Records) A hospital has 1,125 patient records. How can one randomly select 120 records to review?
Answer
Assign a number from 1 to 1,125 to each record and randomly select 120 numbers from 1 to 1,125 without replacement.
In Minitab use the following commands:
- Calc > Make Patterned Data > Simple Set of Numbers
- Then, Calc > Random Data > Sample From Columns… (without replacement is the default)
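If you prefer code to Minitab, the same selection can be sketched in Python using only the standard library (the variable names are our own, not from the lesson):

```python
import random

random.seed(1)  # fixed seed so the example is reproducible; any seed works

N = 1125  # number of patient records
n = 120   # number of records to review

# Label the records 1..N, then draw n labels without replacement (an SRS)
sample = random.sample(range(1, N + 1), n)

print(sorted(sample)[:10])  # the first few selected record numbers
```

Because `random.sample` draws without replacement, every record appears at most once, just as in the Minitab workflow above.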
Example 1.3 (Total Number of Beetles) Suppose we want to estimate the total number of beetles in an agricultural field. Subdivide the field into 100 equally sized units, take a simple random sample of eight units, and count the number of beetles in those eight units. In Minitab, column C1 holds the unit numbers and column C2 holds the eight unit numbers selected without replacement:
- | C1 | C2 |
---|---|---|
1 | 1 | 46 |
2 | 2 | 100 |
3 | 3 | 51 |
4 | 4 | 15 |
5 | 5 | 30 |
6 | 6 | 91 |
7 | 7 | 94 |
8 | 8 | 73 |
9 | 9 | - |
10 | 10 | - |
11 | 11 | - |
12 | 12 | - |
13 | 13 | - |
Notation
Let \(y_i\) denote the number of beetles in the \(i\)th unit, and let \(N\) denote the number of units in the population.
Variable of interest: \(y_1, y_2, \dots, y_N\)
\(\mu=\dfrac{y_1+y_2+\ldots +y_N}{N}\) (the population mean)
\(\tau=y_1+y_2+\ldots +y_N=N \times \mu\) (the population total)
\(\text{sample mean}=\bar{y}=\hat{\mu}=\dfrac{y_1+y_2+\ldots +y_n}{n}\)
\(\text{estimate for population total}=\hat{\tau}=N \times \bar{y}\) (expansion estimator)
Example 1.4 (Total Number of Beetles: Continued) For the beetle example, the observed counts in the eight sampled units are: 234, 256, 128, 245, 211, 240, 202, 267
- \(\bar{y}=222.875\)
- \(s^2=1932.657\)
- \(s=43.962\)
The estimate for the population total is:
\[\begin{align} \hat{\tau} &= N \times \bar{y} \\ &= 100 \times 222.875 \\ &= 22287.5 \end{align}\]
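These computations are easy to verify directly. A minimal Python sketch using only the standard library follows; note that a direct computation gives \(s^2 \approx 1932.70\), while the lesson's 1932.657 comes from squaring the rounded \(s = 43.962\):

```python
# Beetle counts observed on the n = 8 sampled units (Example 1.4)
y = [234, 256, 128, 245, 211, 240, 202, 267]
N = 100     # number of units in the population
n = len(y)  # sample size

y_bar = sum(y) / n                                 # sample mean
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)  # sample variance
tau_hat = N * y_bar                                # expansion estimator

print(y_bar, round(s2, 3), tau_hat)  # → 222.875 1932.696 22287.5
```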
Properties of \(\bar{y}\) when one uses random sampling
unbiased
\[\begin{align} E(\bar{y}) &= E\left( \dfrac{y_1+y_2+\ldots +y_n}{n} \right) \\ &= \dfrac{E(y_1)+E(y_2)+\ldots+E(y_n)}{n} \\ &= \dfrac{\mu+\mu+\ldots +\mu}{n} \\ &= \mu \end{align}\]
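This unbiasedness can also be checked empirically: averaging \(\bar{y}\) over many simple random samples should settle near \(\mu\). A sketch, using a small made-up population purely for illustration:

```python
import random

random.seed(7)

# A small hypothetical finite population of unit values y_1, ..., y_N
population = [46, 100, 51, 15, 30, 91, 94, 73, 12, 58]
N, n = len(population), 4
mu = sum(population) / N  # true population mean (57.0 for these values)

# Average the sample mean over many simple random samples;
# by unbiasedness this average should land close to mu
draws = 20000
avg = sum(sum(random.sample(population, n)) / n for _ in range(draws)) / draws

print(mu, round(avg, 2))  # the two values should be close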
Under simple random sampling, the variance of \(\bar{y}\) is:
\[\operatorname{Var}(\bar{y})=\dfrac{N-n}{N} \cdot \dfrac{\sigma^2}{n}\]
Note that \(\frac{N-n}{N}=1-\frac{n}{N}\) is called the finite population correction (fpc).
See Section 2.6 of the textbook for the proof that \(\operatorname{Var}(\bar{y})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n}\).
To estimate \(\operatorname{Var}(\bar{y})\) from a single sample, replace \(\sigma^2\) with \(s^2\) in the formula. The estimate is denoted \(\hat{\operatorname{Var}}(\bar{y})\), where \(\hat{\operatorname{Var}}(\bar{y})=\dfrac{N-n}{N}\cdot\dfrac{s^2}{n}\).
Try It!
Estimate \(\operatorname{Var}(\bar{y})\) for the data in Example 1.3.
\[\begin{align} \hat{\operatorname{Var}}(\bar{y})&= \dfrac{N-n}{N}\cdot \dfrac{s^2}{n} \\ &= \dfrac{100-8}{100}\cdot \dfrac{1932.657}{8} \\ &= 222.256 \end{align}\]
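In code, the Try It! computation looks like the following sketch, with the lesson's numbers plugged in:

```python
# Estimated variance of the sample mean under SRS without replacement
N, n, s2 = 100, 8, 1932.657  # values from the beetle example

fpc = (N - n) / N            # finite population correction
var_hat_ybar = fpc * s2 / n

print(round(var_hat_ybar, 3))  # → 222.256
```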
Properties of \(\hat{\tau}\) under simple random sampling:
unbiased
\[\begin{align} E(\hat{\tau})&= E(N \times \bar{y}) \\ &= N \times \mu \\ &= \tau \end{align}\]
The formula for \(\operatorname{Var}(\hat{\tau})\) is:
\[\begin{align} \operatorname{Var}(\hat{\tau})&= \operatorname{Var}(N \times \bar{y}) \\ &= N^2 \cdot \operatorname{Var}(\bar{y}) \\ &= N^2 \cdot \dfrac{N-n}{N} \cdot \dfrac{\sigma^2}{n} \\ &= N \cdot (N-n) \cdot \dfrac{\sigma^2}{n} \end{align}\]
The estimate for \(\operatorname{Var}(\hat{\tau})\) is thus: \(\hat{\operatorname{Var}}(\hat{\tau})=N(N-n)\dfrac{s^2}{n}\)
Try It!
Estimate the variance of \(\hat{\tau}\) for the data on the number of beetles.
\[\begin{align} \hat{\operatorname{Var}}(\hat{\tau})&= 100 \cdot (100-8) \cdot \frac{1932.657}{8} \\ &= 2222560 \\ &= N^2 \cdot \hat{\operatorname{Var}}(\bar{y}) \end{align}\]
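The same computation in Python confirms the identity \(\hat{\operatorname{Var}}(\hat{\tau}) = N^2 \cdot \hat{\operatorname{Var}}(\bar{y})\); the 2222560 above reflects rounding \(\hat{\operatorname{Var}}(\bar{y})\) to 222.256 first:

```python
# Estimated variance of the expansion estimator tau_hat = N * y_bar
N, n, s2 = 100, 8, 1932.657  # values from the beetle example

var_hat_ybar = (N - n) / N * s2 / n
var_hat_tau = N * (N - n) * s2 / n  # equivalently N**2 * var_hat_ybar

print(var_hat_tau)  # about 2.22 million
```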
1.4 Confidence Intervals and the Central Limit Theorem
The idea behind confidence intervals is that the sample mean alone is not enough to estimate the population mean: it is a single point and by itself gives no indication of how good the estimate is.
If we want to assess the accuracy of this estimate we will use confidence intervals which provide us with information as to how good our estimation is.
A confidence interval, viewed before the sample is selected, is the interval that has a pre-specified probability of containing the parameter. To obtain this confidence interval you need to know the sampling distribution of the estimate. Once we know the distribution, we can talk about confidence.
We want to be able to say something about \(\theta\), or rather \(\hat{\theta}\) because \(\hat{\theta}\) should be close to \(\theta\).
So the type of statement that we want to make will look like this:
\[P(|\hat{\theta}-\theta|<d)=1-\alpha\]
Thus, we need to know the distribution of \(\hat{\theta}\). In certain cases this distribution can be stated easily, though in general it may take many forms; the normal distribution is the most convenient case to work with.
Central Limit Theorem
When we talk about the Central Limit Theorem for the sample mean, what are we talking about?
The finite population Central Limit Theorem for the sample mean: What happens when \(n\) gets large?
\(\bar{y}\) has mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\). Since we do not know \(\sigma\), we use \(s\) to estimate it, so the estimated standard deviation of \(\bar{y}\) is \(\dfrac{s}{\sqrt{n}}\).
The \(\sqrt{n}\) in the denominator helps us: as \(n\) gets larger, the standard deviation of \(\bar{y}\) gets smaller.
The distribution of \(\bar{y}\) can be very complicated when the sample size is small. When the sample size is large, there is more regularity and the distribution is approximately normal.
We want to find a confidence interval for \(\mu\). If we go about picking samples we can determine a \(\bar{y}\) and from here we can construct an interval about the mean. However, there is a slight complication that comes out of \(\frac{\sigma}{\sqrt{n}}\). We have two unknowns, \(\mu\) and \(\sigma\). What do you do now?
We will estimate \(\sigma\) by \(s\), now \(\frac{\bar{y}-\mu}{s/\sqrt{n}}\) does not have a normal distribution but a \(t\) distribution with \(n-1\) degrees of freedom.
Thus, a \(100(1-\alpha)\%\) confidence interval for \(\mu\) is:
\[\bar{y} \pm t_{\alpha/2} \sqrt{\hat{\operatorname{Var}}(\bar{y})}\]
Since we are sampling without replacement, the \(100(1-\alpha)\%\) confidence interval for \(\mu\) becomes:
\[\bar{y} \pm t_{\alpha/2} \sqrt{\left(\dfrac{N-n}{N}\right)\left(\dfrac{s^2}{n}\right)}\]
Note that the finite population correction \(\dfrac{N-n}{N}\) makes this formula more precise than the one used in introductory courses, which omits it.
What you now have above is the confidence interval for \(\mu\); the confidence interval for \(\tau\) is \(\hat{\tau} \pm t_{\alpha/2}\sqrt{\hat{\operatorname{Var}}(\hat{\tau})}\), illustrated in the example below.
When to Apply the Confidence Interval Formulas
Be careful now, when can we use these? In what situation are these confidence intervals applicable?
These approximate intervals above are good when \(n\) is large (because of the Central Limit Theorem), or when the observations \(y_1, y_2, \dots, y_n\) are normal.
Sample size 30 or greater
When the sample size is 30 or more, we consider the sample size to be large and by Central Limit Theorem, \(\bar{y}\) will be normal even if the sample does not come from a Normal Distribution. Thus, when the sample size is 30 or more, there is no need to check whether the sample comes from a Normal Distribution. We can use the \(t\)-interval.
Sample size 8 to 29
When the sample size is 8 to 29, we would usually use a normal probability plot to check whether the data come from a normal distribution. If the plot does not show a violation of the normality assumption, then we can go ahead and use the \(t\)-interval.
Sample size 7 or less
When the sample size is 7 or less, a normal probability plot may fail to detect non-normality simply because there are too few observations. The examples in these lessons and in the textbook often use small sample sizes, but this might give the wrong impression: these small samples are for illustration purposes only. With a sample size of 5, you really do not have enough power to conclude that the distribution is normal, and we would use nonparametric methods instead of the \(t\)-interval.
Example 1.5 (Total Number of Beetles: Revisited) For the beetle example, an approximate 95% CI for \(\mu\) is:
\[\bar{y} \pm t_{\alpha/2} \sqrt{(\dfrac{N-n}{N})(\dfrac{s^2}{n})}\]
Note that the \(t\)-value for \(\alpha = 0.025\) and at \(n - 1 = 8 - 1 = 7\) df can be found by using the (t-table) to be 2.365.
\[\begin{align} \bar{y} \pm t_{\alpha/2} \sqrt{\left(\dfrac{N-n}{N}\right)\left(\dfrac{s^2}{n}\right)} &= 222.875 \pm 2.365\sqrt{222.256} \\ &=222.875 \pm 2.365 \times 14.908 \\ &= 222.875 \pm 35.258 \end{align}\]
And, an approximate 95% CI for \(\tau\) is then:
\[\begin{align} \hat{\tau} \pm t_{\alpha/2}\sqrt{\hat{\operatorname{Var}}(\hat{\tau})} &=22287.5 \pm 2.365 \sqrt{2222560} \\ &=22287.5 \pm 3525.802 \end{align}\]
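The whole Example 1.5 calculation fits in a few lines of Python (standard library only; the critical value 2.365 is taken from the t-table as in the text):

```python
from math import sqrt

# 95% confidence intervals for mu and tau from the beetle example
N, n = 100, 8
y_bar, s2 = 222.875, 1932.657
t_crit = 2.365  # t-table value for alpha/2 = 0.025 at 7 df

margin_mu = t_crit * sqrt((N - n) / N * s2 / n)   # half-width for mu
margin_tau = t_crit * sqrt(N * (N - n) * s2 / n)  # half-width for tau

print(f"mu:  {y_bar} +/- {margin_mu:.3f}")      # → mu:  222.875 +/- 35.258
print(f"tau: {N * y_bar} +/- {margin_tau:.1f}")  # → tau: 22287.5 +/- 3525.8
```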