Lesson 6: Stratified Sampling
Overview
In Section 6.1, we discuss when and why to use stratified sampling. Estimators of the population mean and total under stratified sampling are given, along with a worked example that computes the estimates and their standard deviations. Confidence intervals for these estimates are then discussed.
In Section 6.2, the optimal allocation of the sample size under different conditions is given. In Section 6.3, we discuss poststratification; it is important to note that the variance of an estimate under poststratification differs from its variance under stratification. We then use an example to illustrate that a stratified sample may not be better than a simple random sample if the variable one stratifies on is not related to the response. At the end of Section 6.3, we discuss stratified sampling for proportions.
Lesson 6: Ch. 11.1-11.6 of Sampling by Steven Thompson, 3rd edition
Objectives
 know why and when to use stratified sampling,
 know how to estimate mean and total when stratified sampling is used,
 compute confidence intervals for these estimates,
 determine the optimal allocation of sample sizes,
 compute estimates when poststratification is used,
 compute the variance for the estimates when poststratification is used, and
 estimate a population proportion from a stratified sample.
6.1  How to Use Stratified Sampling
In stratified sampling, the population is partitioned into nonoverlapping groups, called strata, and a sample is selected by some design within each stratum.
For example, a geographical area can be stratified into similar regions by means of some known variable such as habitat type, elevation, or soil type. As another example, to determine the proportion of defective products assembled by a manufacturer, sampling may be stratified by production line, factory, etc.
Can you think of a couple of additional examples where stratified sampling would make sense? Look for opportunities when the measurements within the strata are more homogeneous.
The principal reasons for using stratified random sampling rather than simple random sampling include:
 Stratification may produce a smaller error of estimation than would be produced by a simple random sample of the same size. This result is particularly true if measurements within strata are very homogeneous.
 The cost per observation in the survey may be reduced by stratification of the population elements into convenient groupings.
 Estimates of population parameters may be desired for subgroups of the population. These subgroups should then be identified.
Example 6-1: Average Hours Watching TV Per Week
Reference p.121 of Scheaffer, Mendenhall and Ott
An advertising firm, interested in determining how much to emphasize television advertising in a certain county, decides to conduct a sample survey to estimate the average number of hours each week that households within that county watch television. The county has two towns, A and B, and a rural area C. Town A is built around a factory, and most households contain factory workers with school-aged children. Town B contains mainly retirees, and the households in rural area C are mainly farmers.
There are 155 households in town A, 62 in town B and 93 in the rural area, C. The firm decides to select 20 households from Town A, 8 households from Town B and 12 households from the rural area. The results are given in the following table:
Stratum  Hours watched per week (sampled households)  Stratum size 
Town A  35, 43, 36, 39, 28, 28, 29, 25, 38, 27, 26, 32, 29, 40, 35, 41, 37, 31, 45, 34  \(N_1\) = 155 
Town B  27, 15, 4, 41, 49, 25, 10, 30  \(N_2\) = 62 
Rural Area C  8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24  \(N_3\) = 93 

Here is output from Minitab that describes the data from each stratum (N in the output denotes the number of observations):
Variable  N  Mean  StDev  SE Mean 

Town A  20  33.90  5.95  1.33 
Town B  8  25.12  15.25  5.39 
Rural ar  12  19.00  9.36  2.70 
Usually a sample is selected by some probability design from each of the L strata in the population, with selections in different strata independent of each other. The special case where from each stratum a simple random sample is drawn is called a stratified random sample.
Notation
 \(L\) = the number of strata
 \(N_h\) = the number of units in stratum h
 \(n_h\) = the number of units sampled from stratum h
 \(N\) = the total number of units in the population, i.e., \(N = N_1 + N_2 + \dots + N_L\)
For our TV-watching example, these values are:
L = 3, \(N_1\) = 155, \(N_2\) = 62, \(N_3\) = 93, N = 155 + 62 + 93 = 310
Estimating the Population Total
\(\hat{\tau}_{st}=\sum\limits_{h=1}^L \hat{\tau}_h\)
The estimated total is the sum of the stratum estimates, where \(\hat{\tau}_h\) is an unbiased estimator of the stratum total \(\tau_h\).
Since selections in different strata are independent, the variance is:
\(Var(\hat{\tau}_{st})=\sum\limits_{h=1}^L Var(\hat{\tau}_h)\), and
\(\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L \hat{V}ar(\hat{\tau}_h)\)
The formulas depend on the sampling scheme used within each stratum. For stratified random sampling, i.e., taking a simple random sample within each stratum:
\(\hat{\tau}_h=N_h \bar{y}_h\)
\(\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L N_h \cdot (N_h-n_h)\cdot \dfrac{s^2_h}{n_h}\)
\(s^2_h=\dfrac{1}{n_h-1}\sum\limits_{i=1}^{n_h}(y_{hi}-\bar{y}_h)^2\)
These formulas are easy to remember, and the estimates for the population mean follow directly from them:
\(\hat{\mu}_{st}=\dfrac{\hat{\tau}_{st}}{N}\)
\(\hat{V}ar(\hat{\mu}_{st})=\dfrac{1}{N^2}\hat{V}ar(\hat{\tau}_{st})\)
For stratified random sampling:
\(\bar{y}_{st}=\dfrac{1}{N} \sum\limits_{h=1}^L N_h \bar{y}_h\)
\(\hat{V}ar(\bar{y}_{st})=\sum\limits_{h=1}^L \left(\dfrac{N_h}{N}\right)^2 \left(\dfrac{N_h-n_h}{N_h}\right) \dfrac{s^2_h}{n_h}\)
\(s_h\) is the sample standard deviation of stratum h, as given in the Minitab output.
For the TV example:
\begin{align}
\bar{y}_{st} &=\dfrac{1}{N}(N_1\bar{y}_1+N_2\bar{y}_2+N_3\bar{y}_3)\\
&= \dfrac{1}{155+62+93} [(155 \times 33.9)+ (62 \times 25.12)+(93 \times 19.0)]\\
&= 27.7\\
\end{align}
\begin{align}
\hat{V}ar(\bar{y}_{st}) &=\sum\limits_{h=1}^3 \left(\dfrac{N_h}{N}\right)^2 \left(\dfrac{N_h-n_h}{N_h}\right) \dfrac{s^2_h}{n_h}\\
&=\dfrac{1}{(310)^2}\left[\left((155)^2\cdot \dfrac{(155-20)}{155}\cdot \dfrac{(5.95)^2}{20}\right)+\left((62)^2\cdot \dfrac{(62-8)}{62}\cdot \dfrac{(15.25)^2}{8}\right) \right.\\
&\left.+\left((93)^2\cdot \dfrac{(93-12)}{93}\cdot \dfrac{(9.36)^2}{12}\right)\right]\\
&= 1.97\\
\end{align}
For the total number of hours of TV watched in the county:
\(\hat{\tau}_{st}=N\cdot \bar{y}_{st}=310 \times 27.7=8587\)
\begin{align}
\hat{V}ar(\hat{\tau}_{st})&= N^2 \hat{V}ar(\bar{y}_{st})\\
&= (310)^2 \times 1.97=189317\\
\end{align}
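The hand computation above can be checked numerically. The following is a minimal Python sketch, not the course's R script; the function name `stratified_estimates` and the data layout are assumptions made for illustration. It applies the stratified random sampling formulas for the mean, the total, and their estimated variances to the raw TV data.

```python
from statistics import mean, variance

def stratified_estimates(strata):
    """Estimate the population mean and total from a stratified random sample.

    strata: list of (N_h, sample) pairs, where N_h is the stratum size and
    sample is the list of observations drawn from that stratum.
    Returns (mean_hat, var_mean_hat, total_hat, var_total_hat).
    """
    N = sum(N_h for N_h, _ in strata)
    total_hat = 0.0
    var_total_hat = 0.0
    for N_h, sample in strata:
        n_h = len(sample)
        # tau_hat_h = N_h * ybar_h
        total_hat += N_h * mean(sample)
        # N_h (N_h - n_h) s_h^2 / n_h, with s_h^2 the sample variance
        var_total_hat += N_h * (N_h - n_h) * variance(sample) / n_h
    return total_hat / N, var_total_hat / N**2, total_hat, var_total_hat

# TV example: (stratum size, hours watched per week in the sampled households)
tv_strata = [
    (155, [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
           26, 32, 29, 40, 35, 41, 37, 31, 45, 34]),       # Town A
    (62, [27, 15, 4, 41, 49, 25, 10, 30]),                 # Town B
    (93, [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24]),  # Rural area C
]

y_st, var_y_st, tau_st, var_tau_st = stratified_estimates(tv_strata)
print(f"mean  = {y_st:.2f}, estimated variance = {var_y_st:.2f}")
print(f"total = {tau_st:.1f}, estimated variance = {var_tau_st:.0f}")
```

Because the sketch works from the raw data rather than from rounded summaries, its results can differ slightly from the hand computation; for instance, the estimated total is about 8579 rather than 8587, since the latter uses the rounded mean 27.7.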
Confidence Intervals
When all of the stratum sample sizes are small, an approximate \(100(1-\alpha)\)% CI for \(\tau\) is:
\(\hat{\tau}_{st} \pm t\sqrt{\hat{V}ar(\hat{\tau}_{st})}\)
However, when the stratum sample sizes are at least 30, use z to approximate t.
What are the degrees of freedom for the t used in this confidence interval? Intuitively, we would want \((n_1-1)+(n_2-1)+\dots+(n_L-1)\), and this is correct when the variances of all strata are the same. When this is not the case and we cannot pool the degrees of freedom, we need to use the Satterthwaite approximation for the degrees of freedom:
\(d=\left(\sum\limits_{h=1}^L a_h s^2_h\right)^2/\sum\limits_{h=1}^L \dfrac{(a_h s^2_h)^2}{n_h-1}\)
where \(a_h=\dfrac{N_h(N_h-n_h)}{n_h}\).
In particular, when the \(N_h\) are all equal, the \(n_h\) are all equal, and the \(s^2_h\) are all equal, d.f. = \(n-L\).
For the TV example:
\(a_1=\dfrac{N_1(N_1-n_1)}{n_1}=\dfrac{155(155-20)}{20}=1046.25\)
\(a_2=\dfrac{N_2(N_2-n_2)}{n_2}=\dfrac{62(62-8)}{8}=418.5\)
\(a_3=\dfrac{N_3(N_3-n_3)}{n_3}=\dfrac{93(93-12)}{12}=627.75\)
\begin{align}
d&= \dfrac{(a_1s^2_1+a_2s^2_2+a_3s^2_3)^2}{\dfrac{(a_1s^2_1)^2}{n_1-1}+\dfrac{(a_2s^2_2)^2}{n_2-1}+\dfrac{(a_3s^2_3)^2}{n_3-1}}\\
&= \dfrac{(1046.25\cdot(5.95)^2+418.5\cdot(15.25)^2+627.75\cdot(9.36)^2)^2}{\dfrac{(1046.25\cdot(5.95)^2)^2}{20-1}+\dfrac{(418.5\cdot(15.25)^2)^2}{8-1}+\dfrac{(627.75\cdot(9.36)^2)^2}{12-1}}\\
&=21.09\\
\end{align}
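The Satterthwaite calculation is tedious by hand, so a short Python sketch may help. The function name `satterthwaite_df` is hypothetical; the inputs are the stratum sizes, the stratum sample sizes, and the rounded sample standard deviations from the Minitab output.

```python
def satterthwaite_df(N_h, n_h, s2_h):
    """Approximate degrees of freedom for a stratified random sample.

    N_h: stratum sizes, n_h: stratum sample sizes, s2_h: sample variances.
    """
    # a_h = N_h (N_h - n_h) / n_h
    a_h = [N * (N - n) / n for N, n in zip(N_h, n_h)]
    numerator = sum(a * s2 for a, s2 in zip(a_h, s2_h)) ** 2
    denominator = sum((a * s2) ** 2 / (n - 1)
                      for a, s2, n in zip(a_h, s2_h, n_h))
    return numerator / denominator

# TV example, using the standard deviations 5.95, 15.25, and 9.36
d = satterthwaite_df([155, 62, 93], [20, 8, 12],
                     [5.95**2, 15.25**2, 9.36**2])
print(f"d = {d:.2f}")
```

When the stratum sizes, sample sizes, and variances are all equal, the same function returns n - L, which is a convenient sanity check.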
We will use t with df=21, hence a 95% CI for \(\mu\) is:
\(\bar{y}_{st} \pm t\sqrt{\hat{V}ar(\bar{y}_{st})}\)
\begin{array}{lcl}
& = & 27.7 \pm 2.08 \times \sqrt{1.97} \\
& = & 27.7 \pm 2.92
\end{array}
Similarly, a 95% CI for \(\tau\) is:
\(\hat{\tau}_{st} \pm t\sqrt{\hat{V}ar(\hat{\tau}_{st})}\)
\begin{array}{lcl}
& = & 8587 \pm 2.08 \times \sqrt{189317} \\
& = & 8587 \pm 905.02
\end{array}
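As a rough check of the interval arithmetic, here is a small Python sketch. The helper name `confidence_interval` is hypothetical, and the critical value 2.08 is simply the t value quoted in the text for 21 degrees of freedom. For the case where every stratum sample size is at least 30, the sketch also shows how the standard normal quantile can be obtained from the standard library.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(estimate, var_hat, critical_value):
    """Return (lower, upper) for estimate +/- critical_value * sqrt(var_hat)."""
    half_width = critical_value * sqrt(var_hat)
    return estimate - half_width, estimate + half_width

# 95% CI for the mean, using the rounded values from the text
lower, upper = confidence_interval(27.7, 1.97, 2.08)
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")

# z critical value, for use when all stratum sample sizes are at least 30
z = NormalDist().inv_cdf(0.975)
print(f"z = {z:.3f}")
```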
Using R
Here is the R code for this example:
Datafile: TVhour.txt
R code: Chapter6_TVhour.R.txt
6.2  The Stratification Principle
If the only objective of stratification is to produce estimators with small variances, then we want to stratify so that the units within each stratum are as similar as possible. In a survey of a human population, stratification may be based on socioeconomic factors or geographic regions.
For example, to estimate the average starting income for recent Penn State graduates, it would make sense to stratify by department since the starting income for graduates of the same department would be similar.
Allocation in Stratified Random Sampling
The question is: given a total sample size of n, how do we allocate it among the L strata?
The best allocation scheme is affected by the following three factors:
 the total number of elements in each stratum,
 the variability of the measurements within each stratum, and
 the cost associated with obtaining an observation from each stratum.
If we do not have all of this information but we know the stratum sizes, we can use a simple allocation: proportional allocation, which maintains a constant sampling fraction throughout the population.
\(n_h=\dfrac{n\cdot N_h}{N}\)
This does not take into consideration the variability within each stratum, so it is generally not the optimal choice.
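Proportional allocation is simple enough to sketch in a few lines of Python. The function name `proportional_allocation` is an assumption made for illustration; the inputs are the total sample size of 40 and the stratum sizes from the TV example.

```python
def proportional_allocation(n, N_h):
    """Allocate a total sample size n in proportion to the stratum sizes N_h."""
    N = sum(N_h)
    return [n * N_k / N for N_k in N_h]

# TV example: n = 40 households, stratum sizes 155, 62, and 93
print(proportional_allocation(40, [155, 62, 93]))
```

For these inputs the allocation reproduces the sample sizes 20, 8, and 12 that the firm used in the TV example.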
If the cost of sampling from each stratum is the same, then the optimal allocation (the allocation that gives the estimator with the smallest variance) is:
\(n_h=\dfrac{n \cdot N_h \sigma_h}{\sum\limits_{k=1}^L N_k \sigma_k}\)
See Section 11.8 of the text for a proof.
However, if the cost of sampling differs from stratum to stratum and the total cost is:
\(c=c_0+c_1n_1+c_2n_2+...+c_Ln_L\)
where \(c_0\) is the overhead cost and \(c_h\) is the cost per unit in stratum h, then the optimal allocation is:
\(n_h=\dfrac{(c-c_0)N_h \sigma_h/\sqrt{c_h}}{\sum\limits_{k=1}^L N_k \sigma_k \sqrt{c_k}}\)
Under this allocation:
 the sample size is directly proportional to \(N_h\) and \(\sigma_h\), i.e., larger samples are allocated to the larger and more variable strata;
 the sample size is inversely proportional to \(\sqrt{c_h}\), i.e., smaller samples are allocated to the more expensive strata.
In order to use the optimal allocation, one must be able to estimate \(\sigma_h\).
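To see how the cost-based formula behaves, here is a minimal Python sketch. The function name `cost_optimal_allocation` is hypothetical, and the standard deviations, per-unit costs, budget, and overhead below are made-up values chosen only for illustration; they do not come from the text.

```python
from math import sqrt

def cost_optimal_allocation(budget, overhead, N_h, sigma_h, cost_h):
    """n_h = (c - c_0) (N_h sigma_h / sqrt(c_h)) / sum_k N_k sigma_k sqrt(c_k)."""
    denom = sum(N * s * sqrt(c) for N, s, c in zip(N_h, sigma_h, cost_h))
    return [(budget - overhead) * N * s / sqrt(c) / denom
            for N, s, c in zip(N_h, sigma_h, cost_h)]

# Hypothetical inputs, for illustration only
N_h = [155, 62, 93]       # stratum sizes
sigma_h = [5, 15, 10]     # assumed stratum standard deviations
cost_h = [4, 9, 16]       # assumed cost per sampled unit in each stratum
budget, overhead = 500, 100

n_h = cost_optimal_allocation(budget, overhead, N_h, sigma_h, cost_h)
spent = overhead + sum(c * n for c, n in zip(cost_h, n_h))
print([round(n, 2) for n in n_h], round(spent, 2))
```

The allocation spends exactly the available budget, and in practice the resulting sample sizes would still need to be rounded to integers.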
Let's take a look at this in the context of the TV Example...
Optimal allocation:
\(n_h=\dfrac{n \cdot N_h \sigma_h}{\sum\limits_{k=1}^L N_k \sigma_k}\)
where \(n=40\) and
\(N_1=155, \sigma_1=5\)
\(N_2=62, \sigma_2=15\)
\(N_3=93, \sigma_3=10\)
Then,
\(n_1=\dfrac{40 \times 155 \times 5}{155 \times 5+62 \times 15+93 \times 10}=11.7647\)
\(n_2=\dfrac{40 \times 62 \times 15}{155 \times 5+62 \times 15+93 \times 10}=14.1176\)
\(n_3=\dfrac{40 \times 93 \times 10}{155 \times 5+62 \times 15+93 \times 10}=14.1176\)
Thus we will choose \(n_1=12, n_2=14\) and \(n_3=14\).
Remember, it is important that \(n_1+n_2+n_3=40\) in this case.
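The allocation above can be reproduced with a short Python sketch. The function names are hypothetical, and the largest-remainder rounding used to force the integer sample sizes to sum to n is an assumption added here; the text simply rounds the three values by inspection.

```python
from math import floor

def neyman_allocation(n, N_h, sigma_h):
    """Optimal allocation when the sampling cost is the same in every stratum."""
    weights = [N * s for N, s in zip(N_h, sigma_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

def round_to_total(raw, n):
    """Round down, then give the leftover units to the largest remainders."""
    sizes = [floor(x) for x in raw]
    leftover = n - sum(sizes)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes

raw = neyman_allocation(40, [155, 62, 93], [5, 15, 10])
print([round(x, 4) for x in raw])
print(round_to_total(raw, 40))
```

Rounding each value independently can leave the total one unit above or below n, which is why the sketch rounds down first and then distributes the remaining units.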
6.3  Poststratification and further topics on stratification
Sometimes we would like to stratify on a key variable but cannot place the units into their correct strata until the units are sampled. For instance, in a telephone interview, a respondent cannot be placed into the male or female stratum until after the respondent is contacted.
Poststratification (stratification after the sample has been selected by simple random sampling) is often appropriate when a simple random sample does not represent the strata in proportion to their sizes in the population.
Here is an example. We want to estimate the average weight in a population and take a simple random sample of 100 people. Here is what was obtained:
Stratum  Sample size  Sample mean 
Males  \(n_1=20\)  \(\bar{y}_1=180\) lbs. 
Females  \(n_2=80\)  \(\bar{y}_2=120\) lbs. 
\(\bar{y}\) = the overall sample mean = 132 lbs.
This sample is clearly not balanced with respect to gender, and the overall sample mean is likely an underestimate because males are underrepresented in the sample. How can we account for this?
In the population \(\dfrac{N_1}{N}=0.5\) and \(\dfrac{N_2}{N}=0.5\).
Thus,
\begin{align}
\bar{y}_{st} &= \dfrac{N_1}{N} \bar{y}_1+\dfrac{N_2}{N} \bar{y}_2\\
&= 0.5\cdot 180+0.5 \cdot 120=150\\
\end{align}
The poststratification estimator \(\bar{y}_{st}\) does not have the same variance as the stratified sample mean, because the stratum sample sizes \(n_h\) are random. Its variance is approximately the sum of two terms: the variance of the stratified sample mean under proportional allocation, \(n_h=nN_h/N\), and a term that reflects the increase one expects from stratifying after, rather than before, sampling.
\(Var(\text{poststratified }\bar{y}) \approx \dfrac{N-n}{nN}\sum\limits_{h=1}^L \left(\dfrac{N_h}{N}\right)\sigma^2_h + \dfrac{1}{n^2}\left(\dfrac{N-n}{N-1}\right)\sum\limits_{h=1}^L \dfrac{N-N_h}{N}\sigma^2_h\)
Example 6-2: Accounts Receivable
A firm knows that 40% of its accounts receivable are wholesale and 60% are retail. However, it is difficult to identify the type of an account without pulling its file and looking at it. An auditor randomly sampled 100 accounts without replacement. Here are the results of the sampling:
Stratum  Sample size  Sample mean  Sample standard deviation 
Wholesale  \(n_1=70\)  \(\bar{y}_1=520\)  \(s_1=210\) 
Retail  \(n_2=30\)  \(\bar{y}_2=280\)  \(s_2=90\) 
The poststratified estimate of the mean is:
\begin{align}
\bar{y}_{st} &= \dfrac{N_1}{N} \bar{y}_1+\dfrac{N_2}{N} \bar{y}_2\\
&= 0.4\times 520+0.6 \times 280\\
&= 376\\
\end{align}
Given that the firm has a very large number of accounts receivable, we can ignore the finite population correction factor.
\begin{align}
\hat{V}ar(\text{poststratified }\bar{y}) & \approx \dfrac{1}{n}\left(\dfrac{N_1}{N}s^2_1+\dfrac{N_2}{N}s^2_2\right)+\dfrac{1}{n^2}\left[\left(1-\dfrac{N_1}{N}\right) s^2_1 + \left(1-\dfrac{N_2}{N}\right) s^2_2 \right]\\
&= \dfrac{1}{100}[0.4 \times (210)^2+ 0.6 \times (90)^2]+ \dfrac{1}{100^2}[0.6 \times (210)^2+ 0.4 \times (90)^2]\\
&= 225+2.97\\
&= 227.97\\
\end{align}
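The poststratified estimate and its approximate variance can also be sketched in Python. The function name `poststratified_estimates` and its optional `N` argument are assumptions made for illustration: when `N` is omitted the finite population correction is ignored, as in the accounts example; when `N` is supplied the sketch uses the general approximation with the correction factors.

```python
def poststratified_estimates(W_h, ybar_h, s2_h, n, N=None):
    """Poststratified mean and its approximate estimated variance.

    W_h: known population proportions N_h / N, ybar_h: stratum sample means,
    s2_h: stratum sample variances, n: total sample size,
    N: population size (None means the finite population correction is ignored).
    """
    y_post = sum(W * y for W, y in zip(W_h, ybar_h))
    if N is None:
        c1, c2 = 1 / n, 1 / n**2
    else:
        c1 = (N - n) / (n * N)
        c2 = (1 / n**2) * (N - n) / (N - 1)
    var_hat = (c1 * sum(W * s2 for W, s2 in zip(W_h, s2_h))
               + c2 * sum((1 - W) * s2 for W, s2 in zip(W_h, s2_h)))
    return y_post, var_hat

# Accounts example: 40% wholesale, 60% retail, n = 100
y_post, var_post = poststratified_estimates(
    [0.4, 0.6], [520, 280], [210**2, 90**2], n=100)
print(f"mean = {y_post:.0f}, estimated variance = {var_post:.2f}")
```

The second term is small relative to the first here, which shows that the penalty for stratifying after sampling is modest when n is reasonably large.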
Further Topics on Stratified Sampling
It is not true that stratified random sampling always produces an estimator with a smaller variance than that from simple random sampling.
Example 6-3: Student Weights
The principal of a prep school for boys wants to estimate the average weight of the 7th-grade boys in the school. There are 4 classes, with 24 students in class 1, 36 in class 2, 30 in class 3, and 30 in class 4.
For administrative ease, he decides to use stratified sampling with each class as a stratum. The principal has enough time and money to obtain data for 20 students, and because the cost of sampling is the same in each stratum, he decides to use proportional allocation, which gives \(n_1=4, n_2=6, n_3=5\) and \(n_4=5\). The data (in lbs.) are given in the following table:
Class  Weights of the sampled students (in lbs.) 
Class 1  94, 90, 102, 110 
Class 2  91, 99, 93, 105, 111, 101 
Class 3  108, 96, 100, 93, 93 
Class 4  92, 110, 94, 91, 113 
Here is the Minitab output that describes the data from each stratum:
Variable  N  Mean  StDev  SE Mean 

Class 1  4  99.00  8.87  4.43 
Class 2  6  100.00  7.46  3.04 
Class 3  5  98.00  6.28  2.81 
Class 4  5  100.00  10.61  4.74 
All  20  99.30  7.73  1.73 
To estimate the average weight of the 7th grade boys, using the Minitab output:
\(\bar{y}_{st}=\sum\limits_{h=1}^L \dfrac{N_h}{N}\bar{y}_h=99.3\)
\begin{align}
\hat{V}ar(\bar{y}_{st}) &= \dfrac{1}{N^2}\sum\limits_{h=1}^4 N^2_h \left(\dfrac{N_h-n_h}{N_h}\right)\dfrac{s^2_h}{n_h}\\
&= \dfrac{1}{120^2}\left[\left((24)^2\cdot \dfrac{5}{6} \cdot \dfrac{(8.87)^2}{4}\right)+\left((36)^2\cdot \dfrac{5}{6} \cdot \dfrac{(7.46)^2}{6}\right) \right.\\
&\left.+\left((30)^2\cdot \dfrac{5}{6} \cdot \dfrac{(6.28)^2}{5}\right)+\left((30)^2\cdot \dfrac{5}{6} \cdot \dfrac{(10.61)^2}{5}\right)\right]\\
&= 2.93\\
\end{align}
For a 95% CI, we need Satterthwaite's formula to get the degrees of freedom:
\(d=\dfrac{\left(\sum\limits_{h=1}^L a_h s^2_h \right)^2}{\sum\limits_{h=1}^L \dfrac{(a_h s^2_h)^2}{n_h-1}}\)
\(a_h=\dfrac{N_h(N_h-n_h)}{n_h}\)
Plugging the values into the formula, we get d = 13.7576.
Round it down to 13, to be more conservative, and use d.f. = 13.
Then, an approximate 95% CI is:
\(99.3 \pm 2.160\sqrt{2.93}\)
\(=99.3 \pm 3.697\)
Looking back at the data, if we had used simple random sampling, would our CI have been tighter or looser?
Usually, stratified random sampling performs better, because we usually stratify when the units within each stratum are more homogeneous than the population as a whole.
Here, however, there is no reason to expect the students within a class to be more homogeneous in weight, and therefore no reason to expect this stratified random sample to do any better than a simple random sample. Treating the 20 observations as a simple random sample gives:
\begin{align}
\hat{V}ar(\bar{y})&= \left(\dfrac{N-n}{N}\right) \left(\dfrac{s^2}{n}\right)\\
&= \left(\dfrac{120-20}{120}\right) \left(\dfrac{(7.73)^2}{20}\right)\\
&= 2.49\\
\end{align}
Then, with df = 19, an approximate 95% CI is:
\(99.3 \pm 2.093\sqrt{2.49}\)
\(=99.3 \pm 3.30\)
Thus the margin of error is smaller and the confidence interval is narrower than under stratified sampling.
Since the data were collected by stratified sampling, treating them as a simple random sample is the wrong way to compute the variance for this problem; how the variance is computed depends on the method by which the sample was taken. We did the computation only to show that if, hypothetically, the same data had been collected by simple random sampling, the margin of error would have been smaller.
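The two variance computations can be placed side by side in a short Python sketch that works from the raw class data. The variable names are assumptions made for illustration, and the simple-random-sampling figure is hypothetical in the same sense as in the text: it pretends that the same 20 observations came from a simple random sample.

```python
from statistics import variance

# Class weights example: (class size, sampled weights in lbs.)
classes = [
    (24, [94, 90, 102, 110]),
    (36, [91, 99, 93, 105, 111, 101]),
    (30, [108, 96, 100, 93, 93]),
    (30, [92, 110, 94, 91, 113]),
]

N = sum(N_h for N_h, _ in classes)

# Stratified random sampling: (1/N^2) sum N_h^2 ((N_h - n_h)/N_h) s_h^2 / n_h
var_strat = sum(N_h**2 * (N_h - len(y)) / N_h * variance(y) / len(y)
                for N_h, y in classes) / N**2

# Hypothetical simple random sample: ((N - n)/N) s^2 / n on the pooled data
pooled = [w for _, y in classes for w in y]
n = len(pooled)
var_srs = (N - n) / N * variance(pooled) / n

print(f"stratified: {var_strat:.2f}, simple random sample: {var_srs:.2f}")
```

With these data the stratified variance estimate comes out larger than the simple-random-sampling one, in line with the hand computation.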
Moral of this example:
Stratifying on class, which is not related to weight, does not result in smaller variances within the strata. On the other hand, if stratification had other purposes such as to estimate the parameters of each subgroup, it still makes sense to stratify, though the purpose is not to get estimates with smaller variance. For this particular example, the stratification to estimate the average weight for each class may be relevant.
Stratified sampling to estimate population proportion
\(\hat{p}_{st}=\dfrac{1}{N}\sum\limits_{h=1}^L N_h \hat{p}_h\)
\begin{align}
\hat{V}ar(\hat{p}_{st})&= \dfrac{1}{N^2}\sum\limits_{h=1}^L N^2_h \hat{V}ar(\hat{p}_h)\\
&= \dfrac{1}{N^2}\sum\limits_{h=1}^L N^2_h \left(\dfrac{N_h-n_h}{N_h}\right)\cdot \dfrac{\hat{p}_h(1-\hat{p}_h)}{n_h-1}\\
\end{align}
Example 6-4: TV Show Viewership
The advertising firm wants to estimate the proportion of households in the county that view the television show "American Idol".
As before, we stratify by town, with \(N_1=155\), \(N_2=62\), and \(N_3=93\). The sample results are:
Stratum  Sample Size  \(\hat{p}_h\) 
Town A  \(n_1=20\)  16/20 = 0.80 
Town B  \(n_2=8\)  2/8 = 0.25 
Rural Area C  \(n_3=12\)  6/12 = 0.50 
We plug in the values and we can get the following:
\begin{align}
\hat{p}_{st}&=\dfrac{1}{N}\sum\limits_{h=1}^L N_h \hat{p}_h\\
&= \dfrac{155}{310}\cdot 0.8 +\dfrac{62}{310}\cdot 0.25+\dfrac{93}{310}\cdot 0.5\\
&= 0.6\\
\end{align}
The estimated variances for each stratum are:
\begin{align}
\hat{V}ar(\hat{p}_1)&= \left(\dfrac{N_1-n_1}{N_1}\right)\cdot \dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}\\
&= \left(\dfrac{155-20}{155}\right)\cdot \dfrac{0.8(0.2)}{19}\\
&= 0.007\\
\end{align}
\begin{align}
\hat{V}ar(\hat{p}_2)&= \left(\dfrac{N_2-n_2}{N_2}\right)\cdot \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}\\
&= \left(\dfrac{62-8}{62}\right)\cdot \dfrac{0.25(0.75)}{7}\\
&= 0.023\\
\end{align}
\begin{align}
\hat{V}ar(\hat{p}_3)&= \left(\dfrac{N_3-n_3}{N_3}\right)\cdot \dfrac{\hat{p}_3(1-\hat{p}_3)}{n_3-1}\\
&= \left(\dfrac{93-12}{93}\right)\cdot \dfrac{0.5(0.5)}{11}\\
&= 0.02\\
\end{align}
Combining these:
\begin{align}
\hat{V}ar(\hat{p}_{st})&= \dfrac{1}{(310)^2}[(155)^2(0.007)+(62)^2(0.023)+(93)^2(0.02)]\\
&= 0.0045\\
\end{align}
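The proportion formulas follow the same pattern as those for the mean, as the following Python sketch illustrates. The function name `stratified_proportion` is hypothetical; the inputs are the stratum sizes, sample sizes, and sample proportions from the viewership example.

```python
def stratified_proportion(N_h, n_h, p_h):
    """Stratified estimate of a population proportion and its estimated variance."""
    N = sum(N_h)
    p_st = sum(N_k * p for N_k, p in zip(N_h, p_h)) / N
    # (1/N^2) sum N_h^2 ((N_h - n_h)/N_h) p_h (1 - p_h) / (n_h - 1)
    var_hat = sum(N_k**2 * (N_k - n_k) / N_k * p * (1 - p) / (n_k - 1)
                  for N_k, n_k, p in zip(N_h, n_h, p_h)) / N**2
    return p_st, var_hat

p_st, var_p_st = stratified_proportion(
    [155, 62, 93], [20, 8, 12], [16 / 20, 2 / 8, 6 / 12])
print(f"p_st = {p_st:.2f}, estimated variance = {var_p_st:.4f}")
```

Because the sketch does not round the stratum variances before combining them, its result may differ from the hand computation in the last digit.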