Lesson 6: Stratified Sampling
Overview
In Section 6.1, we discuss when and why to use stratified sampling. Estimators of the population mean and total under stratified sampling are provided. An example shows how to compute these estimates and their standard errors. Confidence intervals for these estimates are then discussed.
In Section 6.2, the optimal allocation of the sample size under different conditions is given. Then we discuss post-stratification. It is important to note that the variance of estimates under post-stratification differs from the variance under stratification. In Section 6.3, we use an example to illustrate that a stratified sample may not be better than a simple random sample if the variable one stratifies on is not related to the response. At the end of Section 6.3, we discuss stratified sampling for proportions.
Lesson 6: Ch. 11.1-11.6 of Sampling by Steven Thompson, 3rd edition
Objectives
- Identify the appropriate reasons and situations for using stratified sampling,
- Estimate mean and total when stratified sampling is used,
- Compute confidence intervals for the stratified mean and stratified total,
- Determine the optimal allocation of sample sizes,
- Compute estimates when post-stratification is used,
- Compute the variance for the estimates when post-stratification is used, and
- Estimate population proportions when stratified sampling is used.
6.1 - How to Use Stratified Sampling
In stratified sampling, the population is partitioned into non-overlapping groups, called strata, and a sample is selected by some design within each stratum.
For example, geographical regions can be stratified into similar regions by means of some known variables such as habitat type, elevation, or soil type. Another example might be to determine the proportions of defective products being assembled in a factory. In this case, sampling may be stratified by production lines, factories, etc.
Can you think of a couple of additional examples where stratified sampling would make sense? Look for opportunities when the measurements within the strata are more homogeneous.
The principal reasons for using stratified random sampling rather than simple random sampling include:
- Stratification may produce a smaller error of estimation than would be produced by a simple random sample of the same size. This result is particularly true if measurements within strata are very homogeneous.
- The cost per observation in the survey may be reduced by stratification of the population elements into convenient groupings.
- Estimates of population parameters may be desired for subgroups of the population. These subgroups should then be identified.
Example 6-1: Average Hours Watching TV Per Week
Reference p.121 of Scheaffer, Mendenhall, and Ott
An advertising firm, interested in determining how much to emphasize television advertising in a certain county, decides to conduct a sample survey to estimate the average number of hours each week that households within that county watch television. The county has two towns, A and B, and a rural area C. Town A is built around a factory, and most households contain factory workers with school-aged children. Town B contains mainly retirees, and the residents of rural area C are mainly farmers.
There are 155 households in town A, 62 in town B and 93 in rural area C. The firm decides to select 20 households from Town A, 8 households from Town B, and 12 households from the rural area. The results are given in the following table:
Stratum | Hours of TV watched per week | Stratum size |
---|---|---|
Town A | 35, 43, 36, 39, 28, 28, 29, 25, 38, 27, 26, 32, 29, 40, 35, 41, 37, 31, 45, 34 | \(N_1\) = 155 |
Town B | 27, 15, 4, 41, 49, 25, 10, 30 | \(N_2\) = 62 |
Rural Area C | 8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24 | \(N_3\) = 93 |
Here is the Minitab output that describes the data from each stratum (N in the output denotes the number of observations):
Variable | N | Mean | StDev | SE Mean |
---|---|---|---|---|
Town A | 20 | 33.90 | 5.95 | 1.33 |
Town B | 8 | 25.12 | 15.25 | 5.39 |
Rural ar | 12 | 19.00 | 9.36 | 2.70 |
Usually, a sample is selected by some probability design from each of the L strata in the population, with selections in different strata independent of each other. The special case where from each stratum a simple random sample is drawn is called a stratified random sample.
Notation
- \(L\) = the number of strata
- \(N_h\) = the number of units in stratum h
- \(n_h\) = the number of units sampled from stratum h
- \(N\) = the total number of units in the population, i.e., \(N_1 + N_2 + \dots + N_L\)
For our "Watching TV" example, the values are:
L = 3, \(N_1\) = 155, \(N_2\) = 62, \(N_3\) = 93, N = 155 + 62 + 93 = 310
Estimating the Population Total
\(\hat{\tau}_{st}=\sum\limits_{h=1}^L \hat{\tau}_h\)
The estimated totals from the strata are added up, where \(\hat{\tau}_h\) is an unbiased estimator of the stratum total \(\tau_h\).
Since selections in different strata are independent, the variance is:
\(Var(\hat{\tau}_{st})=\sum\limits_{h=1}^L Var(\hat{\tau}_h)\), and
\(\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L \hat{V}ar(\hat{\tau}_h)\)
The variance is computed differently according to the sampling scheme within each stratum. For stratified random sampling, i.e., a simple random sample taken within each stratum:
\(\hat{\tau}_h=N_h \bar{y}_h\)
\(\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L N_h \cdot (N_h-n_h)\cdot \dfrac{s^2_h}{n_h}\)
\(s^2_h=\dfrac{1}{n_h-1}\sum\limits_{i=1}^{n_h}(y_{hi}-\bar{y}_h)^2\)
These formulas are easy to remember, and the estimates for the population mean follow directly by dividing by \(N\):
\(\hat{\mu}_{st}=\dfrac{\hat{\tau}_{st}}{N}\)
\(\hat{V}ar(\hat{\mu}_{st})=\dfrac{1}{N^2}\hat{V}ar(\hat{\tau}_{st})\)
For stratified random sampling:
\(\bar{y}_{st}=\dfrac{1}{N} \sum\limits_{h=1}^L N_h \bar{y}_h\)
\(\hat{V}ar(\bar{y}_{st})=\sum\limits_{h=1}^L \left(\dfrac{N_h}{N}\right)^2 \left(\dfrac{N_h-n_h}{N_h}\right) \dfrac{s^2_h}{n_h}\)
\(s_h\) is the sample standard deviation of stratum h, as given in the Minitab output.
For the TV example:
\begin{align}
\bar{y}_{st} &=\dfrac{1}{N}(N_1\bar{y}_1+N_2\bar{y}_2+N_3\bar{y}_3)\\
&= \dfrac{1}{155+62+93} [(155 \times 33.9)+ (62 \times 25.12)+(93 \times 19.0)]\\
&= 27.7\\
\end{align}
\begin{align}
\hat{V}ar(\bar{y}_{st}) &=\sum\limits_{h=1}^3 \left(\dfrac{N_h}{N}\right)^2 \left(\dfrac{N_h-n_h}{N_h}\right) \dfrac{s^2_h}{n_h}\\
&=\dfrac{1}{(310)^2}\left[\left((155)^2\cdot \dfrac{(155-20)}{155}\cdot \dfrac{(5.95)^2}{20}\right)+\left((62)^2\cdot \dfrac{(62-8)}{62}\cdot \dfrac{(15.25)^2}{8}\right) \right.\\
&\left.+\left((93)^2\cdot \dfrac{(93-12)}{93}\cdot \dfrac{(9.36)^2}{12}\right)\right]\\
&= 1.97\\
\end{align}
For the total hours watching TV example:
\(\hat{\tau}_{st}=N\cdot \bar{y}_{st}=310 \times 27.7=8587\)
\begin{align}
\hat{V}ar(\hat{\tau}_{st})&= N^2 \hat{V}ar(\bar{y}_{st})\\
&= (310)^2 \times 1.97=189317\\
\end{align}
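The estimates above can also be reproduced from the raw data. The following is a minimal Python sketch, assuming the sample values and stratum sizes listed in Example 6-1; the function name `stratified_estimates` and the dictionary layout are illustrative choices, not part of the course materials.

```python
from statistics import mean, variance

# Sample data from Example 6-1 (hours of TV per week), keyed by stratum.
samples = {
    "Town A": [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
               26, 32, 29, 40, 35, 41, 37, 31, 45, 34],
    "Town B": [27, 15, 4, 41, 49, 25, 10, 30],
    "Rural Area C": [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24],
}
# Stratum sizes N_h from the example.
stratum_sizes = {"Town A": 155, "Town B": 62, "Rural Area C": 93}


def stratified_estimates(samples, stratum_sizes):
    """Return (mean, var_mean, total, var_total) for a stratified random sample."""
    N = sum(stratum_sizes.values())
    tau_hat = 0.0   # estimated population total
    var_tau = 0.0   # estimated variance of the total
    for name, y in samples.items():
        N_h, n_h = stratum_sizes[name], len(y)
        tau_hat += N_h * mean(y)
        # statistics.variance is the sample variance s_h^2 (divisor n_h - 1).
        var_tau += N_h * (N_h - n_h) * variance(y) / n_h
    return tau_hat / N, var_tau / N**2, tau_hat, var_tau


y_st, var_y_st, tau_st, var_tau_st = stratified_estimates(samples, stratum_sizes)
print(f"mean  = {y_st:.3f}, variance = {var_y_st:.3f}")
print(f"total = {tau_st:.2f}, variance = {var_tau_st:.1f}")
```

Because this sketch works with unrounded stratum means and variances, the total comes out near 8579 rather than 8587; the hand calculation rounds the mean to 27.7 before multiplying by \(N\).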
Confidence Intervals
When the stratum sample sizes are small, an approximate 100(1-\(\alpha\))% CI for \(\tau\) is:
\(\hat{\tau}_{st} \pm t\sqrt{\hat{V}ar(\hat{\tau}_{st})}\)
However, when all of the stratum sample sizes are at least 30, z can be used in place of t.
What are the degrees of freedom for the t used in this confidence interval? Intuitively, we would want this to be \((n_1-1)+(n_2-1)+...+(n_L-1)\), and this is correct when the variances of all strata are the same. When this is not the case and we cannot pool the degrees of freedom, we need to use the Satterthwaite approximation for the degrees of freedom:
\(d=\left(\sum\limits_{h=1}^L a_h s^2_h\right)^2/\sum\limits_{h=1}^L \dfrac{(a_h s^2_h)^2}{(n_h-1)}\)
where, \(a_h=\dfrac{N_h(N_h-n_h)}{n_h}\)
In particular, when the \(N_h\) are all equal, the \(n_h\) are all equal, and the \(s^2_h\) are all equal, the d.f. = \(n - L\), where \(n\) is the total sample size.
For the TV example:
\(a_1=\dfrac{N_1(N_1-n_1)}{n_1}=\dfrac{155(155-20)}{20}=1046.25\)
\(a_2=\dfrac{N_2(N_2-n_2)}{n_2}=\dfrac{62(62-8)}{8}=418.5\)
\(a_3=\dfrac{N_3(N_3-n_3)}{n_3}=\dfrac{93(93-12)}{12}=627.75\)
\begin{align}
d&= \dfrac{(a_1s^2_1+a_2s^2_2+a_3s^2_3)^2}{\dfrac{(a_1s^2_1)^2}{n_1-1}+\dfrac{(a_2s^2_2)^2}{n_2-1}+\dfrac{(a_3s^2_3)^2}{n_3-1}}\\
&= \dfrac{(1046.25\cdot(5.95)^2+418.5\cdot(15.25)^2+627.75\cdot(9.36)^2)^2}{\dfrac{(1046.25\cdot(5.95)^2)^2}{20-1}+\dfrac{(418.5\cdot(15.25)^2)^2}{8-1}+\dfrac{(627.75\cdot(9.36)^2)^2}{12-1}}\\
&=21.09\\
\end{align}
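The degrees-of-freedom calculation is easy to get wrong by hand, so here is a minimal Python sketch of the Satterthwaite approximation as defined above. It assumes the summary statistics from the Minitab output; the function name `satterthwaite_df` is a hypothetical helper, not taken from the course's R code.

```python
def satterthwaite_df(N, n, s):
    """Approximate degrees of freedom for a stratified random sample.

    N: stratum sizes N_h, n: stratum sample sizes n_h,
    s: stratum sample standard deviations s_h.
    """
    # a_h = N_h (N_h - n_h) / n_h
    a = [N_h * (N_h - n_h) / n_h for N_h, n_h in zip(N, n)]
    # a_h * s_h^2 for each stratum
    terms = [a_h * s_h**2 for a_h, s_h in zip(a, s)]
    numerator = sum(terms) ** 2
    denominator = sum(t**2 / (n_h - 1) for t, n_h in zip(terms, n))
    return numerator / denominator


# TV example: summary statistics as reported in the Minitab output.
d = satterthwaite_df(N=[155, 62, 93], n=[20, 8, 12], s=[5.95, 15.25, 9.36])
print(f"d = {d:.2f}")
```

As a quick check on the special case mentioned above, calling the function with equal stratum sizes, equal sample sizes, and equal standard deviations returns \(n - L\).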
We will use t with df=21, hence a 95% CI for \(\mu\) is:
\(\bar{y}_{st} \pm t\sqrt{\hat{V}ar(\bar{y}_{st})}\)
\begin{array}{lcl}
& = & 27.7 \pm 2.08 \times \sqrt{1.97} \\
& = & 27.7 \pm 2.92
\end{array}
Similarly, a 95% CI for \(\tau\) is:
\(\hat{\tau}_{st} \pm t\sqrt{\hat{V}ar(\hat{\tau}_{st})}\)
\begin{array}{lcl}
& = & 8587 \pm 2.08 \times \sqrt{189317} \\
& = & 8587 \pm 905.0
\end{array}
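The two intervals can be sketched in a few lines of Python. This is only an illustration: the t value 2.080 is the tabled 97.5th percentile for 21 degrees of freedom and is hard-coded as an assumption rather than computed, and `t_interval` is a hypothetical helper name.

```python
from math import sqrt


def t_interval(estimate, variance_estimate, t_value):
    """Return (lower, upper, half_width) for estimate +/- t * sqrt(variance)."""
    half_width = t_value * sqrt(variance_estimate)
    return estimate - half_width, estimate + half_width, half_width


# Assumption: t quantile for a 95% CI with 21 d.f., taken from a t table.
T_21 = 2.080

# Rounded estimates from the worked TV example.
mean_ci = t_interval(27.7, 1.97, T_21)
total_ci = t_interval(8587, 189317, T_21)

print(f"mean:  ({mean_ci[0]:.2f}, {mean_ci[1]:.2f}), margin {mean_ci[2]:.2f}")
print(f"total: ({total_ci[0]:.1f}, {total_ci[1]:.1f}), margin {total_ci[2]:.1f}")
```

Passing the t value in as an argument keeps the sketch free of external libraries; with a statistics package available, the quantile could be computed for the fractional degrees of freedom instead.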
Using R
Here is the code for R for this example:
Datafile: TVhour.txt
R code: Chapter6_TVhour.R.txt
6.2 - The Stratification Principle
If the only objective of stratification is to produce estimators with small variances, then we want to stratify so that, within each stratum, the units are as similar as possible. In a survey of a human population, stratification may be based on socioeconomic factors or geographic regions.
For example, to estimate the average starting income for recent Penn State graduates, it would make sense to stratify by the department since the starting income for graduates of the same department would be similar.
Allocation in Stratified Random Sampling
The question is, given a total sample size of n, how do we allocate these among L strata?
The best allocation scheme is affected by the following three factors:
- the total number of elements in each stratum,
- the variability of the measurements within each stratum, and
- the cost associated with obtaining an observation from each stratum.
If we do not have all of this information but we know the number of units in each stratum, we can use a simple scheme: proportional allocation, which maintains the same sampling fraction \(n_h/N_h = n/N\) in every stratum.
\(n_h=\dfrac{n\cdot N_h}{N}\)
This does not take into consideration the variability within each stratum and is generally not the optimal choice.
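As a small illustration of proportional allocation, the following Python sketch applies the formula to the stratum sizes of the TV example with a total sample size of 40. The function name `proportional_allocation` is an illustrative choice.

```python
def proportional_allocation(n, stratum_sizes):
    """Allocate a total sample size n in proportion to the stratum sizes N_h."""
    N = sum(stratum_sizes)
    return [n * N_h / N for N_h in stratum_sizes]


# TV example: N_1 = 155, N_2 = 62, N_3 = 93, total sample size n = 40.
allocation = proportional_allocation(40, [155, 62, 93])
print(allocation)
```

For these stratum sizes the result is 20, 8, and 12 households, which are the sample sizes the firm used in Example 6-1. In general the values are not whole numbers and must be rounded.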
If the cost of sampling from each stratum is the same, then the optimal allocation (the allocation that gives the estimator the lowest variance for a fixed total sample size n) is:
\(n_h=\dfrac{n \cdot N_h \sigma_h}{\sum\limits_{k=1}^L N_k \sigma_k}\)
(See Section 11.8 of the text for a proof.)
However, suppose the cost of sampling differs from stratum to stratum and the total cost is:
\(c=c_0+c_1n_1+c_2n_2+...+c_Ln_L\)
where \(c_0\) is the overhead cost and \(c_h\) is the cost per unit for stratum h. Then, for a fixed total cost \(c\), the optimal allocation is:
\(n_h=\dfrac{(c-c_0)N_h \sigma_h/\sqrt{c_h}}{\sum\limits_{k=1}^L N_k \sigma_k \sqrt{c_k}}\)
Note!
- the sample size is directly proportional to \(N_h\) and \(\sigma_h\), i.e., allocate a larger sample size to the larger and more variable stratum.
- the sample size is inversely proportional to \(\sqrt{c_h}\), i.e., this allocates smaller sample sizes to the more expensive stratum.
In order to use the optimal allocation, one must be able to estimate \(\sigma_h\).
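To see how unequal costs shift the allocation, here is a minimal Python sketch of the cost-based formula. The per-unit costs, budget, and overhead below are hypothetical values invented for illustration; only the stratum sizes and the rough standard deviations come from the TV example.

```python
from math import sqrt


def optimal_allocation_with_costs(budget, overhead, N, sigma, cost):
    """Optimal n_h for a fixed total cost c = c_0 + sum(c_h * n_h)."""
    denom = sum(N_k * s_k * sqrt(c_k) for N_k, s_k, c_k in zip(N, sigma, cost))
    return [
        (budget - overhead) * N_h * s_h / sqrt(c_h) / denom
        for N_h, s_h, c_h in zip(N, sigma, cost)
    ]


N = [155, 62, 93]      # stratum sizes from the TV example
sigma = [5, 15, 10]    # rough stratum standard deviations
cost = [9, 9, 16]      # HYPOTHETICAL cost per interview in each stratum
budget, overhead = 500, 50  # HYPOTHETICAL total budget c and overhead c_0

n_h = optimal_allocation_with_costs(budget, overhead, N, sigma, cost)
spent = sum(c * n for c, n in zip(cost, n_h))
print([round(x, 2) for x in n_h], "variable cost:", round(spent, 2))
```

The variable cost of the resulting allocation equals the budget minus the overhead, which is a useful sanity check on any implementation of this formula.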
Let's take a look at this in the context of the TV Example...
Optimal allocation:
\(n_h=\dfrac{n \cdot N_h \sigma_h}{\sum\limits_{k=1}^L N_k \sigma_k}\)
where \(n = 40\) and the stratum standard deviations are taken to be approximately:
\(N_1=155, \sigma_1=5\)
\(N_2=62, \sigma_2=15\)
\(N_3=93, \sigma_3=10\)
Then,
\(n_1=\dfrac{40 \times 155 \times 5}{155 \times 5+62 \times 15+93 \times 10}=11.7647\)
\(n_2=\dfrac{40 \times 62 \times 15}{155 \times 5+62 \times 15+93 \times 10}=14.1176\)
\(n_3=\dfrac{40 \times 93 \times 10}{155 \times 5+62 \times 15+93 \times 10}=14.1176\)
Thus we will choose \(n_1=12, n_2=14\) and \(n_3=14\).
Remember, it is important that \(n_1+n_2+n_3=40\) in this case.
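The allocation above can be checked with a short Python sketch. It assumes equal sampling costs and the rough standard deviations used in the example; `neyman_allocation` is an illustrative name for the equal-cost optimal allocation.

```python
def neyman_allocation(n, stratum_sizes, sigmas):
    """Optimal allocation of n when the cost per unit is the same in every stratum."""
    weights = [N_h * s_h for N_h, s_h in zip(stratum_sizes, sigmas)]
    total = sum(weights)
    return [n * w / total for w in weights]


exact = neyman_allocation(40, [155, 62, 93], [5, 15, 10])
rounded = [round(x) for x in exact]

print([f"{x:.4f}" for x in exact])
print(rounded, "sum =", sum(rounded))
```

Simple rounding happens to preserve the total of 40 here. That is not guaranteed in general, so the rounded sizes should always be checked against the intended total and adjusted if necessary.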
6.3 - Poststratification and further topics on stratification
Sometimes, we would like to stratify on a key variable but cannot place the units into their correct strata until the units are sampled. For instance, in a telephone interview, a respondent cannot be placed into a male or female stratum until after the respondent has been contacted.
Poststratification (stratification after the sample has been selected by simple random sampling) is often appropriate when a simple random sample does not represent the strata in proportion to their shares of the population.
Here is an example. We want to estimate the average weight and take a simple random sample of 100 people. Here is what was obtained.
Male | Female |
---|---|
\(n_1=20\) | \(n_2=80\) |
\(\bar{y}_1=180\) lbs. | \(\bar{y}_2=120\) lbs. |
\(\bar{y}\) = the overall sample mean = 132
The sample is clearly not balanced with respect to gender, and the overall sample mean is likely an underestimate because males are underrepresented in the data. How can we account for this?
In the population \(\dfrac{N_1}{N}=0.5\) and \(\dfrac{N_2}{N}=0.5\).
Thus,
\begin{align}
\bar{y}_{st} &= \dfrac{N_1}{N} \bar{y}_1+\dfrac{N_2}{N} \bar{y}_2\\
&= 0.5\cdot 180+0.5 \cdot 120\\
&= 150\\
\end{align}
The poststratification estimator \(\bar{y}_{st}\) will not have the same variance as the stratified sample mean, since the sample sizes \(n_h\) are random. The variance of the poststratified \(\bar{y}_{st}\) is approximately the sum of two terms: the variance of the stratified mean \(\bar{y}_{st}\) under proportional allocation, \(n_h = nN_h/N\), and a term that shows the increase one expects from post- rather than pre-stratification.
\(Var(\text{post}-\text{stratified }\bar{y}) \approx \dfrac{N-n}{nN}\sum\limits_{h=1}^L \left(\dfrac{N_h}{N}\right)\sigma^2_h + \dfrac{1}{n^2}\left(\dfrac{N-n}{N-1}\right)\sum\limits_{h=1}^L \dfrac{N-N_h}{N}\sigma^2_h\)
Example 6-2: Accounts Receivable
A firm knows that 40% of its accounts receivable are wholesale and 60% are retail. However, it is difficult to identify the type of an account without pulling its file and looking at it. An auditor randomly sampled 100 accounts without replacement. Here are the results of the sampling:
Wholesale | Retail |
---|---|
\(n_1=70\) | \(n_2=30\) |
\(\bar{y}_1=520\) | \(\bar{y}_2=280\) |
\(s_1=210\) | \(s_2=90\) |
The poststratified estimate of the mean is:
\begin{align}
\bar{y}_{st} &= \dfrac{N_1}{N} \bar{y}_1+\dfrac{N_2}{N} \bar{y}_2\\
&= 0.4\times 520+0.6 \times 280\\
&= 376\\
\end{align}
Given that the firm has a very large number of accounts receivable, we can ignore the finite population correction factor.
\begin{align}
\hat{V}ar(\text{post}-\text{stratified }\bar{y}) & \approx \dfrac{1}{n}\left(\dfrac{N_1}{N}s^2_1+\dfrac{N_2}{N}s^2_2\right)+\dfrac{1}{n^2}\left[\left(1-\dfrac{N_1}{N}\right) s^2_1 + \left(1-\dfrac{N_2}{N}\right) s^2_2 \right]\\
&= \dfrac{1}{100}[0.4 \times (210)^2+ 0.6 \times (90)^2]+ \dfrac{1}{100^2}[0.6 \times (210)^2+ 0.4 \times (90)^2]\\
&= 225+2.97\\
&= 227.97\\
\end{align}
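The poststratified mean and its approximate variance for this example can be sketched in Python as follows. The sketch assumes, as in the calculation above, that the finite population correction can be ignored; the function names are illustrative.

```python
def poststratified_mean(weights, means):
    """Weighted mean using known population shares W_h = N_h / N."""
    return sum(W * ybar for W, ybar in zip(weights, means))


def poststratified_variance(n, weights, sds):
    """Approximate variance of the poststratified mean, ignoring the fpc."""
    # Term 1: variance under proportional allocation.
    first = sum(W * s**2 for W, s in zip(weights, sds)) / n
    # Term 2: expected increase due to stratifying after sampling.
    second = sum((1 - W) * s**2 for W, s in zip(weights, sds)) / n**2
    return first + second


# Example 6-2: 40% wholesale, 60% retail, n = 100 sampled accounts.
weights = [0.4, 0.6]
y_post = poststratified_mean(weights, [520, 280])
v_post = poststratified_variance(100, weights, [210, 90])
print(f"mean = {y_post:.1f}, variance = {v_post:.2f}")
```

The same `poststratified_mean` function reproduces the weight example earlier in this section when called with shares of 0.5 and 0.5 and stratum means of 180 and 120.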
Note! Further Topic on Stratified Sampling
It is not true that stratified random sampling always produces an estimator with a smaller variance than that from simple random sampling.
Example 6-3: Student Weights
The principal of a prep school for boys wants to estimate the average weight of the 7th-grade boys in the school. There are 4 classes, with 24 students in class 1, 36 in class 2, 30 in class 3, and 30 in class 4.
For administrative ease, he decides to use stratified sampling with each class as a stratum. The principal has enough time and money to obtain data for 20 students, and because the cost of sampling is the same in each stratum, he decides to use proportional allocation, which gives \(n_1=4, n_2=6, n_3=5\) and \(n_4=5\). The data (in lbs.) are given in the following table:
Class | Weight of the student (in lbs.) |
---|---|
Class 1 | 94, 90, 102, 110 |
Class 2 | 91, 99, 93, 105, 111, 101 |
Class 3 | 108, 96, 100, 93, 93 |
Class 4 | 92, 110, 94, 91, 113 |
Here is the Minitab output that describes the data from each stratum:
Variable | N | Mean | StDev | SE Mean |
---|---|---|---|---|
Class 1 | 4 | 99.00 | 8.87 | 4.43 |
Class 2 | 6 | 100.00 | 7.46 | 3.04 |
Class 3 | 5 | 98.00 | 6.28 | 2.81 |
Class 4 | 5 | 100.00 | 10.61 | 4.74 |
All | 20 | 99.30 | 7.73 | 1.73 |
To estimate the average weight of the 7th-grade boys, using the Minitab output:
\(\bar{y}_{st}=\sum\limits_{h=1}^L \dfrac{N_h}{N}\bar{y}_h=99.3\)
\begin{align}
\hat{V}ar(\bar{y}_{st}) &= \dfrac{1}{N^2}\sum\limits_{h=1}^4 N^2_h \left(\dfrac{N_h-n_h}{N_h}\right)\dfrac{s^2_h}{n_h}\\
&= \dfrac{1}{120^2}\left[\left((24)^2\cdot \dfrac{5}{6} \cdot \dfrac{(8.87)^2}{4}\right)+\left((36)^2\cdot \dfrac{5}{6} \cdot \dfrac{(7.46)^2}{6}\right) \right.\\
&\left.+\left((30)^2\cdot \dfrac{5}{6} \cdot \dfrac{(6.28)^2}{5}\right)+\left((30)^2\cdot \dfrac{5}{6} \cdot \dfrac{(10.61)^2}{5}\right)\right]\\
&= 2.93\\
\end{align}
For a 95% CI, we need Satterthwaite's formula to get the degrees of freedom:
\(d=\dfrac{\left(\sum\limits_{h=1}^L a_h s^2_h \right)^2}{\sum\limits_{h=1}^L \dfrac{(a_h s^2_h)^2}{n_h-1}}\)
\(a_h=\dfrac{N_h(N_h-n_h)}{n_h}\)
Plugging the values into the formula gives d = 13.7576.
To be conservative, we round down and use d.f. = 13.
Then, an approximate 95% CI is:
\(99.3 \pm 2.160\sqrt{2.93}\)
\(=99.3 \pm 3.697\)
Looking back at the data, if we had used simple random sampling, would our CI have been tighter or looser?
Usually, stratified random sampling performs better overall, because we usually use it when the units within each stratum are more homogeneous than the population as a whole.
Here, however, there is no reason to expect the classes to be more homogeneous in weight, and therefore no reason why this stratified random sample should be any better than a simple random sample.
If we treat the same 20 observations as a simple random sample from the \(N = 120\) students:
\begin{align}
\hat{V}ar(\bar{y})&= \left(\dfrac{N-n}{N}\right) \left(\dfrac{s^2}{n}\right)\\
&= \left(\dfrac{120-20}{120}\right) \left(\dfrac{(7.73)^2}{20}\right)\\
&= 2.49\\
\end{align}
Then, with df = 19, an approximate 95% CI is:
\(99.3 \pm 2.093\sqrt{2.49}\)
\(=99.3 \pm 3.30\)
Thus the margin of error is smaller and the confidence interval narrower.
Since the data were collected by stratified sampling, treating them as a simple random sample is the wrong way to compute the variance for this problem: how the variance is computed depends on how the sample was taken. We did the computation only to show that if, hypothetically, the same data had been collected by simple random sampling, the margin of error would have been smaller.
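The comparison in this example can be reproduced from the raw data. The following Python sketch computes the estimated variance of the mean under the stratified design and, purely for the hypothetical comparison, under a simple random sampling formula. Variable names are illustrative.

```python
from statistics import variance

# Example 6-3: class size N_h and sampled weights (lbs.) for each class.
classes = {
    "Class 1": (24, [94, 90, 102, 110]),
    "Class 2": (36, [91, 99, 93, 105, 111, 101]),
    "Class 3": (30, [108, 96, 100, 93, 93]),
    "Class 4": (30, [92, 110, 94, 91, 113]),
}
N = sum(N_h for N_h, _ in classes.values())            # 120 students
pooled = [w for _, y in classes.values() for w in y]   # all 20 weights
n = len(pooled)

# Stratified random sampling: sum of weighted within-class variances.
var_stratified = sum(
    (N_h / N) ** 2 * ((N_h - len(y)) / N_h) * variance(y) / len(y)
    for N_h, y in classes.values()
)

# Hypothetical only: the same data treated as one simple random sample.
var_srs = ((N - n) / N) * variance(pooled) / n

print(f"stratified: {var_stratified:.2f}, treated as SRS: {var_srs:.2f}")
```

The stratified variance is the larger of the two here because class membership explains almost none of the variation in weight, so stratifying only costs degrees of freedom in the within-class variance estimates.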
Moral of this example:
Stratifying on class, which is not related to weight, does not result in smaller variances within the strata. On the other hand, if stratification had other purposes such as to estimate the parameters of each subgroup, it still makes sense to stratify, though the purpose is not to get estimates with smaller variance. For this particular example, the stratification to estimate the average weight for each class may be relevant.
Stratified sampling to estimate population proportion
\(\hat{p}_{st}=\dfrac{1}{N}\sum\limits_{h=1}^L N_h \hat{p}_h\)
\begin{align}
\hat{V}ar(\hat{p}_{st})&= \dfrac{1}{N^2}\sum\limits_{h=1}^L N^2_h \hat{V}ar(\hat{p}_h)\\
&= \dfrac{1}{N^2}\sum\limits_{h=1}^L N^2_h \left(\dfrac{N_h-n_h}{N_h}\right)\cdot \dfrac{\hat{p}_h(1-\hat{p}_h)}{n_h-1}\\
\end{align}
Example 6-4: TV Show Viewership
The advertising firm wants to estimate the proportion of households in the county that view the television show "American Idol".
\(N_1=155,N_2=62, N_3=93\). As before, we stratify by town and the sample results are:
Stratum | Sample Size | \(\hat{p}_h\) |
---|---|---|
Town A | \(n_1=20\) | 16/20 = 0.80 |
Town B | \(n_2=8\) | 2/8 = 0.25 |
Rural Area C | \(n_3=12\) | 6/12 = 0.50 |
Plugging in the values gives:
\begin{align}
\hat{p}_{st}&=\dfrac{1}{N}\sum\limits_{h=1}^L N_h \hat{p}_h\\
&= \dfrac{155}{310}\cdot 0.8 +\dfrac{62}{310}\cdot 0.25+\dfrac{93}{310}\cdot 0.5\\
&= 0.6\\
\end{align}
The following displays show the estimated variance for each stratum:
\begin{align}
\hat{V}ar(\hat{p}_1)&= \left(\dfrac{N_1-n_1}{N_1}\right)\cdot \dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}\\
&= \left(\dfrac{155-20}{155}\right)\cdot \dfrac{0.8(0.2)}{19}\\
&= 0.007\\
\end{align}
\begin{align}
\hat{V}ar(\hat{p}_2)&= \left(\dfrac{N_2-n_2}{N_2}\right)\cdot \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}\\
&= \left(\dfrac{62-8}{62}\right)\cdot \dfrac{0.25(0.75)}{7}\\
&= 0.023\\
\end{align}
\begin{align}
\hat{V}ar(\hat{p}_3)&= \left(\dfrac{N_3-n_3}{N_3}\right)\cdot \dfrac{\hat{p}_3(1-\hat{p}_3)}{n_3-1}\\
&= \left(\dfrac{93-12}{93}\right)\cdot \dfrac{0.5(0.5)}{11}\\
&= 0.020\\
\end{align}
Combining the stratum variances:
\begin{align}
\hat{V}ar(\hat{p}_{st})&= \dfrac{1}{(310)^2}[(155)^2(0.007)+(62)^2(0.023)+(93)^2(0.020)]\\
&= 0.0045\\
\end{align}
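The proportion estimate and its variance can be sketched in Python using the counts from Example 6-4. The function name `stratified_proportion` is an illustrative choice, and the sketch works with unrounded stratum variances, so intermediate values may differ slightly in the third decimal place from a hand calculation.

```python
def stratified_proportion(N, n, successes):
    """Return (p_st, var_p_st) for a stratified random sample of proportions.

    N: stratum sizes, n: stratum sample sizes,
    successes: number of sampled units with the attribute in each stratum.
    """
    N_total = sum(N)
    p = [a / n_h for a, n_h in zip(successes, n)]
    p_st = sum(N_h * p_h for N_h, p_h in zip(N, p)) / N_total
    var_p_st = sum(
        N_h**2 * ((N_h - n_h) / N_h) * p_h * (1 - p_h) / (n_h - 1)
        for N_h, n_h, p_h in zip(N, n, p)
    ) / N_total**2
    return p_st, var_p_st


# Example 6-4: 16 of 20, 2 of 8, and 6 of 12 sampled households watch the show.
p_st, var_p_st = stratified_proportion(
    N=[155, 62, 93], n=[20, 8, 12], successes=[16, 2, 6]
)
print(f"p_st = {p_st:.3f}, variance = {var_p_st:.4f}")
```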