10.2 - Double Sampling for Stratification

In some sampling situations, units can be assigned to strata only after the sample is selected.

Try it!

Provide two examples where the units can be assigned to strata only after the sample is selected...

Two possible examples:

  • A phone survey, where the respondent's gender is known only after the person has been contacted.
  • A survey in which respondents can be stratified by political affiliation only after they have been surveyed.

The method of post-stratification is useful only if the relative proportion of each stratum in the population \(W_h=\dfrac{N_h}{N}\) is known for each stratum h. If these proportions are not known, double sampling may be used, with an initial (large) sample used to classify the units into strata and then a stratified sample selected from the initial sample.

The two steps of double sampling for stratification:

Step 1: An initial simple random sample of n' units is selected from a population of N units. These units are classified into strata, with \(n'_h\) observed to be in stratum h. The population proportion \(W_h=\dfrac{N_h}{N}\) is estimated by the sample proportion: \(w_h=\dfrac{n'_h}{n'}\), h = 1, ... , L.

Step 2: A second sample is then selected by stratified random sampling from the first sample, with \(n_h\) units selected from the \(n'_h\) first-phase units in stratum h. The measurement \(y_{hi}\) is recorded for each unit in the second sample.
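The two steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name `double_sample`, the `stratum_of` callback, and the toy population of 1000 labeled units are hypothetical and not part of the lesson. If a stratum turns up fewer first-phase units than requested, the sketch simply takes all of them.

```python
import random
from collections import defaultdict

def double_sample(population, stratum_of, n_first, n_second, seed=0):
    """Sketch of two-phase selection (hypothetical helper).

    population : list of units
    stratum_of : function mapping a unit to its stratum label,
                 assumed observable only once the unit is sampled
    n_first    : first-phase sample size n'
    n_second   : dict mapping stratum label -> second-phase size n_h
    """
    rng = random.Random(seed)

    # Step 1: simple random sample (without replacement) of n' units,
    # then classify the sampled units into strata.
    first_phase = rng.sample(population, n_first)
    strata = defaultdict(list)
    for unit in first_phase:
        strata[stratum_of(unit)].append(unit)

    # Estimated stratum proportions w_h = n'_h / n'.
    weights = {h: len(units) / n_first for h, units in strata.items()}

    # Step 2: simple random sample of n_h units within each observed stratum.
    second_phase = {
        h: rng.sample(units, min(n_second.get(h, 0), len(units)))
        for h, units in strata.items()
    }
    return weights, second_phase

# Toy population: every fourth unit belongs to stratum "A", the rest to "B".
population = [(i, "A" if i % 4 == 0 else "B") for i in range(1000)]
weights, second_phase = double_sample(
    population, lambda unit: unit[1], n_first=100, n_second={"A": 5, "B": 10}
)
```

Only the units in `second_phase` would then be measured, which is where the cost saving of double sampling comes from.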

We denote the sample mean in stratum h in the second sample as: \(\bar{y}_h=\sum\limits_{i=1}^{n_h} y_{hi}/n_h\)

An estimate of the population mean is thus: \(\bar{y}_d=\sum\limits_{h=1}^L w_h \bar{y}_h\)

Note that \(\bar{y}_d\) is an unbiased estimator of the population mean.
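One way to see the unbiasedness is to condition on the first-phase sample. The sketch below assumes that every stratum observed in the first phase contributes at least one unit to the second phase. The symbols \(\bar{y}'_h\) (the mean of y over all \(n'_h\) first-phase units in stratum h), \(\bar{y}'\) (the mean of y over all n' first-phase units), and \(\mu\) (the population mean) are notation introduced only for this sketch.

\(E(\bar{y}_d \mid \text{first sample})=\sum\limits_{h=1}^L w_h E(\bar{y}_h \mid \text{first sample})=\sum\limits_{h=1}^L \dfrac{n'_h}{n'}\bar{y}'_h=\bar{y}'\)

\(E(\bar{y}_d)=E\left[E(\bar{y}_d \mid \text{first sample})\right]=E(\bar{y}')=\mu\)

The first line holds because, within stratum h, the second sample is a simple random sample from the \(n'_h\) first-phase units. The second line holds because the first sample is a simple random sample from the population.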

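The variance of a two-phase estimator can be decomposed by conditioning on the first phase.

Let \(s_1\) denote the first-phase sample, then

\(Var(\bar{y}_d)=Var\left[E(\bar{y}_d \mid s_1)\right]+E\left[Var(\bar{y}_d \mid s_1)\right]\)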

Thus, the variance of the estimate for population mean is:

\(Var(\bar{y}_d)=\dfrac{N-n'}{N} \times \dfrac{\sigma^2}{n'}+E\sum\limits_{h=1}^L \left[\left(\dfrac{n'_h}{n'}\right)^2 \left(\dfrac{n'_h-n_h}{n'_h}\right) \dfrac{\sigma^2_{h(s_1)}}{n_h} \right]\)

where \(\sigma^2\) is the overall population variance and \(\sigma^2_{h(s_1)}\) is the population variance within stratum h for the particular first–phase sample \(s_1\).

An unbiased estimate for the variance of the estimate is:

\(\hat{V}ar(\bar{y}_d)=\dfrac{N-n'}{N} \times \dfrac{1}{n'-1}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+ \dfrac{N-1}{N}\sum\limits_{h=1}^L \left[\left(\dfrac{n'_h-1}{n'-1}-\dfrac{n_h-1}{N-1}\right) \dfrac{w_h s^2_h}{n_h}\right]\)

where \(s_h^2\) is the stratum sample variance from the second sample.
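As a sketch of how the estimator and its variance estimate might be computed together, the hypothetical function below takes the population size N, the first-phase stratum counts \(n'_h\), and the second-phase measurements. The function name and the toy numbers are assumptions made for illustration, and each stratum needs at least two second-phase units so that \(s_h^2\) is defined.

```python
from statistics import mean, variance

def double_sampling_estimate(N, first_counts, second_values):
    """Return (y_d, estimated Var(y_d)) for double sampling for stratification.

    N             : population size
    first_counts  : dict stratum -> n'_h (first-phase counts)
    second_values : dict stratum -> list of y values from the second sample
    """
    n_prime = sum(first_counts.values())
    w = {h: count / n_prime for h, count in first_counts.items()}
    ybar = {h: mean(values) for h, values in second_values.items()}

    # Point estimate: y_d = sum_h w_h * ybar_h
    y_d = sum(w[h] * ybar[h] for h in w)

    # First term: between-stratum part with finite population correction.
    between = sum(w[h] * (ybar[h] - y_d) ** 2 for h in w)
    term1 = (N - n_prime) / N * between / (n_prime - 1)

    # Second term: within-stratum part.
    term2 = 0.0
    for h in w:
        n_h = len(second_values[h])
        s2_h = variance(second_values[h])  # sample variance, needs n_h >= 2
        coef = (first_counts[h] - 1) / (n_prime - 1) - (n_h - 1) / (N - 1)
        term2 += coef * w[h] * s2_h / n_h
    term2 *= (N - 1) / N

    return y_d, term1 + term2

# Toy inputs, invented for illustration only.
y_d, var_hat = double_sampling_estimate(
    N=50,
    first_counts={"a": 6, "b": 4},
    second_values={"a": [1, 2, 3], "b": [10, 12]},
)
```

With these toy inputs the weights are 0.6 and 0.4 and the stratum means are 2 and 11, so the point estimate is 5.6.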

Example 10-2: Double Sampling for Stratification

A shoe store wants to estimate the average number of pairs of shoes owned by the students who live in a certain college town neighborhood. The store thinks that a sample stratified by gender is a good approach but does not know the gender makeup of that neighborhood. It also does not know a respondent's gender until after contacting them. So, the store uses double sampling, first contacting 160 randomly selected students in that neighborhood and asking them their gender. It turns out that 64 are male and 96 are female. The store then randomly samples 8 of the males and 12 of the females and gives each a $10.00 incentive to go home, count their pairs of shoes, and report the number.

The data are given in the table below:

Gender   Pairs of shoes owned
Male     5, 6, 9, 5, 9, 7, 5, 8
Female   17, 19, 13, 16, 8, 11, 15, 19, 12, 13, 33, 20

Summary statistics:

Variable   n    Mean    StDev
male       8    6.750   1.753
female     12   16.33   6.37


To estimate the average number of pairs of shoes, they use double sampling:

  1. Step 1:

    160 students are randomly sampled to find out their gender. Result: 64 male, 96 female.

  2. Step 2:

    stratify by gender and randomly sample 8 males and 12 females.

male: \(n'_1\) = 64 , female: \(n'_2\) = 96


Try it!

Compute \(\bar{y}_d\) and its estimated standard deviation.

\(w_1=\dfrac{64}{160}=0.4\) , \(w_2=\dfrac{96}{160}=0.6\)

\(n_1=8,\ n_2=12\)

\(\bar{y}_1=6.75\), \(\bar{y}_2=16.33\)

\(\bar{y}_d=\sum\limits_{h=1}^L w_h \bar{y}_h=(0.4 \times 6.75)+(0.6\times 16.33)=12.498\)

In this example, N is not known but we know it is large. Thus, approximately,

\(\hat{V}ar(\bar{y}_d)=\dfrac{1}{n'-1}\sum\limits_{h=1}^L w_h(\bar{y}_h-\bar{y}_d)^2+\sum\limits_{h=1}^L \dfrac{n'_h-1}{n'-1}\dfrac{w_h s^2_h}{n_h}\)

where \(s^2_1=(1.753)^2=3.073\) and \(s^2_2=(6.37)^2=40.5769\)

\begin{align}
\hat{V}ar(\bar{y}_d) &= \dfrac{1}{160-1}[0.4(6.75-12.498)^2+0.6(16.33-12.498)^2]\\
& + \dfrac{64-1}{160-1}\left[ 0.4 \times \dfrac{3.073}{8}\right]+\dfrac{96-1}{160-1}\left[0.6\times \dfrac{40.5769}{12}\right]\\
&= 0.1385+1.2731\\
&= 1.4116
\end{align}

The estimated standard deviation of \(\bar{y}_d\) is therefore \(\sqrt{1.4116}\approx 1.188\).
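The hand calculation can be checked with a short Python sketch that starts from the raw data and applies the large-N approximation used in this example. The variable names are illustrative. Because the code works with unrounded stratum means and variances, its results differ slightly from the hand calculation in the later decimal places (for example, the estimate comes out at about 12.5 rather than 12.498).

```python
from math import sqrt
from statistics import mean, variance

# Second-phase data from the example.
second = {
    "male": [5, 6, 9, 5, 9, 7, 5, 8],
    "female": [17, 19, 13, 16, 8, 11, 15, 19, 12, 13, 33, 20],
}
# First-phase counts n'_h from the 160 students contacted.
first_counts = {"male": 64, "female": 96}

n_prime = sum(first_counts.values())
w = {h: first_counts[h] / n_prime for h in first_counts}   # w_h = n'_h / n'
ybar = {h: mean(second[h]) for h in second}                # stratum means

# Point estimate y_d = sum_h w_h * ybar_h
y_d = sum(w[h] * ybar[h] for h in w)

# Large-N approximation of the variance estimate: N is unknown but large,
# so the finite population factors are dropped.
between = sum(w[h] * (ybar[h] - y_d) ** 2 for h in w) / (n_prime - 1)
within = sum(
    (first_counts[h] - 1) / (n_prime - 1)
    * w[h] * variance(second[h]) / len(second[h])
    for h in w
)
var_hat = between + within
se = sqrt(var_hat)   # estimated standard deviation of y_d

print(f"y_d = {y_d:.3f}, var_hat = {var_hat:.4f}, se = {se:.3f}")
```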