In stratified sampling, the population is partitioned into nonoverlapping groups, called strata, and a sample is selected by some design within each stratum.
For example, a geographical area can be stratified into similar regions by means of some known variable such as habitat type, elevation, or soil type. Another example might be estimating the proportion of defective products being assembled by a manufacturer. In this case, sampling may be stratified by production line, factory, etc.
Can you think of a couple of additional examples where stratified sampling would make sense? Look for opportunities when the measurements within the strata are more homogeneous.
The principal reasons for using stratified random sampling rather than simple random sampling include:
 Stratification may produce a smaller error of estimation than would be produced by a simple random sample of the same size. This result is particularly true if measurements within strata are very homogeneous.
 The cost per observation in the survey may be reduced by stratification of the population elements into convenient groupings.
 Estimates of population parameters may be desired for subgroups of the population. These subgroups should then be identified.
Example 6-1: Average Hours Watching TV Per Week
Reference: p. 121 of Scheaffer, Mendenhall, and Ott
An advertising firm, interested in determining how much to emphasize television advertising in a certain county, decides to conduct a sample survey to estimate the average number of hours each week that households within that county watch television. The county has two towns, A and B, and a rural area, C. Town A is built around a factory, and most households contain factory workers with school-aged children. Town B contains mainly retirees, and the residents of rural area C are mainly farmers.
There are 155 households in town A, 62 in town B and 93 in rural area C. The firm decides to select 20 households from Town A, 8 households from Town B, and 12 households from the rural area. The results are given in the following table:
Stratum  Hours watched per week  Stratum size
Town A  35, 43, 36, 39, 28, 28, 29, 25, 38, 27, 26, 32, 29, 40, 35, 41, 37, 31, 45, 34  \(N_1\) = 155
Town B  27, 15, 4, 41, 49, 25, 10, 30  \(N_2\) = 62
Rural Area C  8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24  \(N_3\) = 93
Here is the Minitab output that describes the data from each stratum (N in the output denotes the number of observations):
Variable  N  Mean  StDev  SE Mean 

Town A  20  33.90  5.95  1.33 
Town B  8  25.12  15.25  5.39 
Rural ar  12  19.00  9.36  2.70 
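The summary statistics above can be reproduced from the raw data. The following is a minimal Python sketch using only the standard library. Python is not used in the original lesson, and the dictionary `hours` and the other names are illustrative assumptions.

```python
import math
import statistics

# Weekly TV hours per sampled household, copied from the data table above.
hours = {
    "Town A": [35, 43, 36, 39, 28, 28, 29, 25, 38, 27,
               26, 32, 29, 40, 35, 41, 37, 31, 45, 34],
    "Town B": [27, 15, 4, 41, 49, 25, 10, 30],
    "Rural Area C": [8, 14, 12, 15, 30, 32, 21, 20, 34, 7, 11, 24],
}

summary = {}
for stratum, y in hours.items():
    n = len(y)
    mean = statistics.mean(y)
    sd = statistics.stdev(y)   # sample standard deviation (divisor n - 1)
    se = sd / math.sqrt(n)     # standard error of the mean, no finite population correction
    summary[stratum] = (n, mean, sd, se)
    print(f"{stratum:<13} {n:>3} {mean:>7.2f} {sd:>7.2f} {se:>7.2f}")
```

The columns printed are N, Mean, StDev, and SE Mean, in the same order as the Minitab output.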
Usually, a sample is selected by some probability design from each of the L strata in the population, with selections in different strata independent of each other. The special case in which a simple random sample is drawn from each stratum is called a stratified random sample.
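As an illustration of how a stratified random sample might be drawn, here is a minimal Python sketch using the stratum sizes and sample sizes of the TV example. The household labels 1, ..., \(N_h\) and the seed are hypothetical and serve only to make the sketch concrete and reproducible.

```python
import random

# Stratum sizes and sample sizes from the TV example.
N_h = {"Town A": 155, "Town B": 62, "Rural Area C": 93}
n_h = {"Town A": 20, "Town B": 8, "Rural Area C": 12}

rng = random.Random(506)  # arbitrary seed, used only for reproducibility

# Households in each stratum are labeled 1..N_h (hypothetical labels).
# A simple random sample without replacement is drawn within each stratum,
# separately from the draws made in the other strata.
sample_ids = {
    stratum: sorted(rng.sample(range(1, N_h[stratum] + 1), n_h[stratum]))
    for stratum in N_h
}

for stratum, ids in sample_ids.items():
    print(stratum, ids)
```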
Notation
 \(L\) = the number of strata
 \(N_h\) = the number of units in stratum \(h\)
 \(n_h\) = the number of units sampled from stratum \(h\)
 \(N\) = the total number of units in the population, i.e., \(N = N_1 + N_2 + \dots + N_L\)
For our "Watching TV" example, the values are:
\(L\) = 3, \(N_1\) = 155, \(N_2\) = 62, \(N_3\) = 93, \(N\) = 155 + 62 + 93 = 310
Estimating the Population Total
\(\hat{\tau}_{st}=\sum\limits_{h=1}^L \hat{\tau}_h\)
That is, the estimated stratum totals are added up, where \(\hat{\tau}_h\) is an unbiased estimator of the stratum total \(\tau_h\).
Since selections in different strata are independent, the variance is:
\(Var(\hat{\tau}_{st})=\sum\limits_{h=1}^L Var(\hat{\tau}_h)\), and
\(\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L \hat{V}ar(\hat{\tau}_h)\)
The form of these estimators depends on the sampling scheme used within each stratum. For stratified random sampling, i.e., when a simple random sample is taken within each stratum:
\(\hat{\tau}_h=N_h \bar{y}_h\)
\(\hat{V}ar(\hat{\tau}_{st})=\sum\limits_{h=1}^L N_h (N_h-n_h) \dfrac{s^2_h}{n_h}\)
\(s^2_h=\dfrac{1}{n_h-1}\sum\limits_{i=1}^{n_h}(y_{hi}-\bar{y}_h)^2\)
These formulas are easy to remember, and the corresponding estimates for the population mean follow directly by dividing by \(N\):
\(\hat{\mu}_{st}=\dfrac{\hat{\tau}_{st}}{N}\)
\(\hat{V}ar(\hat{\mu}_{st})=\dfrac{1}{N^2}\hat{V}ar(\hat{\tau}_{st})\)
For stratified random sampling:
\(\bar{y}_{st}=\dfrac{1}{N} \sum\limits_{h=1}^L N_h \bar{y}_h\)
\(\hat{V}ar(\bar{y}_{st})=\sum\limits_{h=1}^L \left(\dfrac{N_h}{N}\right)^2 \left(\dfrac{N_h-n_h}{N_h}\right) \dfrac{s^2_h}{n_h}\)
Here \(s_h\) is the sample standard deviation of stratum \(h\), as reported in the Minitab output.
For the TV example:
\begin{align}
\bar{y}_{st} &=\dfrac{1}{N}(N_1\bar{y}_1+N_2\bar{y}_2+N_3\bar{y}_3)\\
&= \dfrac{1}{155+62+93} [(155 \times 33.9)+ (62 \times 25.12)+(93 \times 19.0)]\\
&= 27.7\\
\end{align}
\begin{align}
\hat{V}ar(\bar{y}_{st}) &=\sum\limits_{h=1}^3 \left(\dfrac{N_h}{N}\right)^2 \left(\dfrac{N_h-n_h}{N_h}\right) \dfrac{s^2_h}{n_h}\\
&=\dfrac{1}{(310)^2}\left[\left((155)^2\cdot \dfrac{(155-20)}{155}\cdot \dfrac{(5.95)^2}{20}\right)+\left((62)^2\cdot \dfrac{(62-8)}{62}\cdot \dfrac{(15.25)^2}{8}\right) \right.\\
&\left.+\left((93)^2\cdot \dfrac{(93-12)}{93}\cdot \dfrac{(9.36)^2}{12}\right)\right]\\
&= 1.97\\
\end{align}
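The hand computation above can be checked with a few lines of code. This is a minimal Python sketch (not part of the original lesson; the list names are illustrative) that takes the stratum means and standard deviations from the Minitab output as inputs.

```python
# Stratum sizes, sample sizes, sample means, and sample SDs (from the Minitab output).
N_h = [155, 62, 93]
n_h = [20, 8, 12]
ybar_h = [33.90, 25.12, 19.00]
s_h = [5.95, 15.25, 9.36]

N = sum(N_h)

# Stratified estimate of the mean: stratum means weighted by N_h / N.
ybar_st = sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N

# Estimated variance: sum over strata of (N_h/N)^2 * (N_h - n_h)/N_h * s_h^2 / n_h.
var_ybar_st = sum(
    (Nh / N) ** 2 * ((Nh - nh) / Nh) * s ** 2 / nh
    for Nh, nh, s in zip(N_h, n_h, s_h)
)

print(round(ybar_st, 1), round(var_ybar_st, 2))
```

The factor \((N_h - n_h)/N_h\) is the finite population correction for stratum \(h\); each stratum contributes to the variance in proportion to the square of its weight \(N_h/N\).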
For the total hours watching TV example:
\(\hat{\tau}_{st}=N\cdot \bar{y}_{st}=310 \times 27.7=8587\)
\begin{align}
\hat{V}ar(\hat{\tau}_{st})&= N^2 \hat{V}ar(\bar{y}_{st})\\
&= (310)^2 \times 1.97=189317\\
\end{align}
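The total and its estimated variance can also be computed directly from the stratum-level formulas instead of scaling the rounded mean and variance. The following minimal Python sketch (not part of the original lesson; names are illustrative) does this. Because it does not round the intermediate values to 27.7 and 1.97, its results differ slightly from the hand computation above.

```python
# Stratum sizes, sample sizes, sample means, and sample SDs (from the Minitab output).
N_h = [155, 62, 93]
n_h = [20, 8, 12]
ybar_h = [33.90, 25.12, 19.00]
s_h = [5.95, 15.25, 9.36]

# Estimated total: sum of the estimated stratum totals N_h * ybar_h.
tau_hat_st = sum(Nh * yb for Nh, yb in zip(N_h, ybar_h))

# Estimated variance of the total: sum of N_h (N_h - n_h) s_h^2 / n_h.
var_tau_hat_st = sum(
    Nh * (Nh - nh) * s ** 2 / nh for Nh, nh, s in zip(N_h, n_h, s_h)
)

print(round(tau_hat_st, 2), round(var_tau_hat_st, 1))
```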
Confidence Intervals
When the stratum sample sizes are small, an approximate \(100(1-\alpha)\)% CI for \(\tau\) is:
\(\hat{\tau}_{st} \pm t\sqrt{\hat{V}ar(\hat{\tau}_{st})}\)
However, when the stratum sample sizes are at least 30, use z to approximate t.
What are the degrees of freedom for the t used in this confidence interval? Intuitively, we would want this to be \((n_1-1)+(n_2-1)+\dots+(n_L-1)\), and this is correct when the variances of all strata are the same. When this is not the case and we cannot pool the degrees of freedom, we need to use the Satterthwaite approximation for the degrees of freedom:
\(d=\left(\sum\limits_{h=1}^L a_h s^2_h\right)^2 \Big/ \sum\limits_{h=1}^L \dfrac{(a_h s^2_h)^2}{n_h-1}\)
where \(a_h=\dfrac{N_h(N_h-n_h)}{n_h}\).
In particular, when the \(N_h\) are all equal, the \(n_h\) are all equal, and the \(s^2_h\) are all equal, the d.f. = \(n - L\), where \(n = n_1 + \dots + n_L\) is the total sample size.
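To see why the degrees of freedom reduce to \(n - L\) in that special case, write \(a\), \(s^2\), and \(n_0\) for the common values of \(a_h\), \(s^2_h\), and \(n_h\), so that \(n = L n_0\). These three symbols are introduced here only for this illustration. Then:
\begin{align}
d &= \dfrac{(L\, a\, s^2)^2}{L \cdot \dfrac{(a\, s^2)^2}{n_0-1}}\\
&= L(n_0-1)\\
&= n-L
\end{align}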
For the TV example:
\(a_1=\dfrac{N_1(N_1-n_1)}{n_1}=\dfrac{155(155-20)}{20}=1046.25\)
\(a_2=\dfrac{N_2(N_2-n_2)}{n_2}=\dfrac{62(62-8)}{8}=418.5\)
\(a_3=\dfrac{N_3(N_3-n_3)}{n_3}=\dfrac{93(93-12)}{12}=627.75\)
\begin{align}
d&= \dfrac{(a_1s^2_1+a_2s^2_2+a_3s^2_3)^2}{\dfrac{(a_1s^2_1)^2}{n_1-1}+\dfrac{(a_2s^2_2)^2}{n_2-1}+\dfrac{(a_3s^2_3)^2}{n_3-1}}\\
&= \dfrac{(1046.25\cdot(5.95)^2+418.5\cdot(15.25)^2+627.75\cdot(9.36)^2)^2}{\dfrac{(1046.25\cdot(5.95)^2)^2}{20-1}+\dfrac{(418.5\cdot(15.25)^2)^2}{8-1}+\dfrac{(627.75\cdot(9.36)^2)^2}{12-1}}\\
&=21.09\\
\end{align}
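The Satterthwaite degrees of freedom for the TV example can be checked numerically. This is a minimal Python sketch (not part of the original lesson; names are illustrative) that follows the formula for \(d\) term by term.

```python
# Stratum sizes, sample sizes, and sample SDs (from the Minitab output).
N_h = [155, 62, 93]
n_h = [20, 8, 12]
s_h = [5.95, 15.25, 9.36]

# a_h = N_h (N_h - n_h) / n_h
a_h = [Nh * (Nh - nh) / nh for Nh, nh in zip(N_h, n_h)]

# a_h * s_h^2 is the contribution of stratum h to the estimated variance of the total.
terms = [a * s ** 2 for a, s in zip(a_h, s_h)]

# d = (sum of terms)^2 / sum of term^2 / (n_h - 1)
d = sum(terms) ** 2 / sum(t ** 2 / (nh - 1) for t, nh in zip(terms, n_h))

print(a_h, round(d, 2))
```

Town B has the largest term and the smallest sample size, so it dominates the denominator. This is why \(d\) falls well below the pooled value \(n - L = 37\).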
We will use t with df=21, hence a 95% CI for \(\mu\) is:
\(\bar{y}_{st} \pm t\sqrt{\hat{V}ar(\bar{y}_{st})}\)
\begin{array}{lcl}
& = & 27.7 \pm 2.08 \times \sqrt{1.97} \\
& = & 27.7 \pm 2.92
\end{array}
Similarly, a 95% CI for \(\tau\) is:
\(\hat{\tau}_{st} \pm t\sqrt{\hat{V}ar(\hat{\tau}_{st})}\)
\begin{array}{lcl}
& = & 8587 \pm 2.08 \times \sqrt{189317} \\
& = & 8587 \pm 905.02
\end{array}
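Both intervals have the form estimate \(\pm\ t \times\) standard error, so a single helper covers them. The sketch below is a minimal Python illustration (not part of the original lesson). The function name `conf_int` is hypothetical, and the inputs are the rounded values 27.7, 1.97, 8587, and 189317 obtained earlier, together with the tabled value \(t = 2.08\) for 21 degrees of freedom.

```python
import math

# Rounded values from the worked example; t = 2.08 is the tabled t value for 21 d.f.
t = 2.08
ybar_st, var_ybar_st = 27.7, 1.97
tau_hat_st, var_tau_hat_st = 8587, 189317


def conf_int(estimate, variance, t_value):
    """Return (lower, upper) for estimate +/- t_value * sqrt(variance)."""
    half_width = t_value * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


ci_mean = conf_int(ybar_st, var_ybar_st, t)
ci_total = conf_int(tau_hat_st, var_tau_hat_st, t)

print("95% CI for the mean: ", ci_mean)
print("95% CI for the total:", ci_total)
```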
Using R
Here is the R code for this example:
Datafile: TVhour.txt
R code: Chapter6_TVhour.R.txt