# 6.3 Poststratification and further topics on stratification

Unit Summary

- Poststratification
- Variance for poststratification
- Example showing that a stratified sample is sometimes not better than simple random sampling
- Stratified sample for proportion

Sometimes we would like to stratify on a key variable but cannot place the units into their correct strata until the units are sampled. For instance, in a telephone interview a respondent cannot be placed into a male or female stratum until after the respondent is contacted.

Poststratification (stratification after the selection of a sample) is often appropriate when a simple random sample turns out not to be balanced, that is, when some groups are over- or underrepresented relative to the population.

Here is an example. We want to estimate the average weight and take a simple random sample of 100 people. Here is what was obtained.

| Male | Female |
|---|---|
| $n_1 = 20$ | $n_2 = 80$ |
| $\bar{y}_1=180$ lbs. | $\bar{y}_2=120$ lbs. |

$\bar{y}$ = the overall sample mean = 132

This sample is obviously not balanced with respect to gender, and the overall mean is likely an underestimate because males are underrepresented in the data. How can we account for this?

In the population $\dfrac{N_1}{N}=0.5$ and $\dfrac{N_2}{N}=0.5$.

Thus,

\begin{align}
\bar{y}_{st} &= \dfrac{N_1}{N} \bar{y}_1+\dfrac{N_2}{N} \bar{y}_2\\
&= 0.5\cdot 180+0.5 \cdot 120=150
\end{align}
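The contrast between the two estimates can be reproduced with a minimal Python sketch. The numbers are the ones given above; the dictionary keys and variable names are illustrative only.

```python
# Sample results from the weight example (values taken from the text).
n = {"male": 20, "female": 80}            # sample counts per group
ybar = {"male": 180.0, "female": 120.0}   # sample mean weight (lbs.) per group
W = {"male": 0.5, "female": 0.5}          # known population shares N_h / N

# Ordinary sample mean: each group is weighted by its share of the sample.
n_total = sum(n.values())
srs_mean = sum(n[g] * ybar[g] for g in n) / n_total

# Poststratified mean: each group is weighted by its share of the population.
post_mean = sum(W[g] * ybar[g] for g in W)

print(srs_mean)   # 132.0
print(post_mean)  # 150.0
```

The only difference between the two lines is the set of weights: 0.2 and 0.8 from the sample versus 0.5 and 0.5 from the population.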

The poststratification estimator $\bar{y}_{st}$ does not have the same variance as the ordinary stratified sample mean, because the stratum sample sizes $n_h$ are random rather than fixed in advance. Its variance is approximately the variance of the stratified mean $\bar{y}_{st}$ under proportional allocation ($n_h = nN_h/N$) plus a term that reflects the increase one expects from post- rather than pre-stratification.

$Var(\text{post-stratified }\bar{y}) \approx \dfrac{N-n}{nN}\sum\limits_{h=1}^L \left(\dfrac{N_h}{N}\right)\sigma^2_h + \dfrac{1}{n^2}\left(\dfrac{N-n}{N-1}\right)\sum\limits_{h=1}^L \dfrac{N-N_h}{N}\sigma^2_h$
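As a rough reading of this formula, assume the population is much larger than the sample, so that both finite population corrections are close to 1, and write $W_h = N_h/N$ (a shorthand introduced here, not used elsewhere in these notes). The approximation then simplifies to:

\begin{align}
Var(\text{post-stratified }\bar{y}) &\approx \dfrac{1}{n}\sum\limits_{h=1}^L W_h\sigma^2_h + \dfrac{1}{n^2}\sum\limits_{h=1}^L (1-W_h)\sigma^2_h
\end{align}

The first term is the variance under proportional allocation. The second term is of order $1/n^2$, so the price of stratifying after sampling is typically small when n is large.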

Example

A firm knows that 40% of its accounts receivable are wholesale and 60% are retail. However, it is difficult to tell which type an account is without pulling its file and looking at it. An auditor randomly sampled 100 accounts without replacement. Here are the results of the sampling:

| Wholesale | Retail |
|---|---|
| $n_1 = 70$ | $n_2 = 30$ |
| $\bar{y}_1=520$ | $\bar{y}_2=280$ |
| $s_1 = 210$ | $s_2 = 90$ |

#### Application Exercise

Compute the post-stratified mean and the variance of the post-stratified mean.

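One way to work through this exercise is sketched below in Python. The population size N is not given, so the sketch assumes N is much larger than n and sets both finite population corrections to 1. It also plugs the sample variances $s_h^2$ in for $\sigma_h^2$. The function names are illustrative.

```python
def poststratified_mean(weights, means):
    """Weighted mean using known population shares W_h = N_h / N."""
    return sum(w * m for w, m in zip(weights, means))


def poststratified_variance(weights, variances, n, N=None):
    """Approximate variance of the poststratified mean.

    weights   : population shares W_h = N_h / N
    variances : stratum variances (sample s_h^2 used in place of sigma_h^2)
    n         : total sample size
    N         : population size; None means N is treated as very large,
                so both finite population corrections are set to 1
                (an assumption, since N is not stated in the example).
    """
    if N is None:
        fpc_main, fpc_extra = 1.0, 1.0
    else:
        fpc_main, fpc_extra = (N - n) / N, (N - n) / (N - 1)
    # Variance under proportional allocation.
    main = fpc_main / n * sum(w * v for w, v in zip(weights, variances))
    # Extra term due to stratifying after the sample is drawn.
    extra = fpc_extra / n**2 * sum((1 - w) * v for w, v in zip(weights, variances))
    return main + extra


weights = [0.4, 0.6]        # wholesale, retail shares of all accounts
means = [520.0, 280.0]      # sample means by account type
sds = [210.0, 90.0]         # sample standard deviations by account type
n = 100

post_mean = poststratified_mean(weights, means)
post_var = poststratified_variance(weights, [s**2 for s in sds], n)

print(round(post_mean, 2))  # 376.0
print(round(post_var, 2))   # 227.97
```

Under these assumptions the poststratified mean is 376 and the estimated variance is about 228, almost all of which comes from the first term (225).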

#### Further Topic on Stratified Sampling

It is not true that stratified random sampling always produces an estimator with a smaller variance than that from simple random sampling.

Example: The principal of a prep school for boys wants to estimate the average weight of the 7th grade boys in the school. There are 4 classes: 24 students in class 1, 36 in class 2, 30 in class 3, and 30 in class 4.

For administrative ease, he decides to use stratified sampling with each class as a stratum. The principal has enough time and money to obtain data for 20 students, and because the cost of sampling is the same in each stratum, he decides to use proportional allocation, which gives n1 = 4, n2 = 6, n3 = 5, and n4 = 5. The data (in lbs.) are given in the following table:

| | Weight of the student (in lbs.) |
|---|---|
| Class 1 | 94, 90, 102, 110 |
| Class 2 | 91, 99, 93, 105, 111, 101 |
| Class 3 | 108, 96, 100, 93, 93 |
| Class 4 | 92, 110, 94, 91, 113 |

Here is Minitab output that describes the data from each stratum:

Activity: Calculate the stratified estimator $\bar{y}_{st}$ and the estimated variance of $\bar{y}_{st}$.

To estimate the average weight of the 7th grade boys, using the Minitab output:

$\bar{y}_{st}=\sum\limits_{h=1}^L \dfrac{N_h}{N}\bar{y}_h=99.3$

\begin{align}
\hat{V}ar(\bar{y}_{st}) &= \dfrac{1}{N^2}\sum\limits_{h=1}^4 N^2_h \left(\dfrac{N_h-n_h}{N_h}\right)\dfrac{s^2_h}{n_h}\\
&= \dfrac{1}{120^2}\left[\left((24)^2\cdot \dfrac{5}{6} \cdot \dfrac{(8.87)^2}{4}\right)+\left((36)^2\cdot \dfrac{5}{6} \cdot \dfrac{(7.46)^2}{6}\right) \right.\\
&\left.+\left((30)^2\cdot \dfrac{5}{6} \cdot \dfrac{(6.28)^2}{5}\right)+\left((30)^2\cdot \dfrac{5}{6} \cdot \dfrac{(10.61)^2}{5}\right)\right]\\
&= 2.93
\end{align}
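The arithmetic above can be checked with a short Python sketch. It recomputes the class means and variances from the raw weights rather than from rounded summary values; the variable names are illustrative.

```python
from statistics import mean, variance

# Class sizes N_h and sampled weights (lbs.), copied from the data table.
N_h = [24, 36, 30, 30]
samples = [
    [94, 90, 102, 110],
    [91, 99, 93, 105, 111, 101],
    [108, 96, 100, 93, 93],
    [92, 110, 94, 91, 113],
]

N = sum(N_h)                                # 120 boys in total
ybar_h = [mean(s) for s in samples]         # class sample means
s2_h = [variance(s) for s in samples]       # sample variances (divisor n_h - 1)

# Stratified estimator: class means weighted by N_h / N.
ybar_st = sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N

# Estimated variance with the finite population correction in each class.
var_st = sum(
    Nh**2 * (Nh - len(s)) / Nh * s2 / len(s)
    for Nh, s, s2 in zip(N_h, samples, s2_h)
) / N**2

print(round(ybar_st, 2))  # 99.3
print(round(var_st, 2))   # 2.93
```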

For a 95% CI, we use Satterthwaite's formula to get the degrees of freedom:

$d=\dfrac{\left(\sum\limits_{h=1}^L a_h s^2_h \right)^2}{\sum\limits_{h=1}^L \dfrac{(a_h s^2_h)^2}{n_h-1}}$

$a_h=\dfrac{N_h(N_h-n_h)}{n_h}$

$a_1=\dfrac{24(24-4)}{4}=120$, $a_2=\dfrac{36(36-6)}{6}=180$, $a_3=\dfrac{30(30-5)}{5}=150$, $a_4=\dfrac{30(30-5)}{5}=150$

Plugging these values into the formula gives d = 13.7576.

Round it down to 13, to be more conservative, and use d.f. = 13.

Then, an approximate 95% CI is:

$99.3 \pm 2.160\sqrt{2.93}$
$=99.3 \pm 3.697$
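The degrees of freedom and the margin of error can be recomputed with the following Python sketch. It works from the raw weights, so d comes out close to 13.76 rather than exactly 13.7576 (which is based on rounded standard deviations); the rounded-down value is still 13. The t quantile 2.160 is hard-coded from the interval above rather than computed.

```python
import math
from statistics import variance

N_h = [24, 36, 30, 30]
samples = [
    [94, 90, 102, 110],
    [91, 99, 93, 105, 111, 101],
    [108, 96, 100, 93, 93],
    [92, 110, 94, 91, 113],
]
n_h = [len(s) for s in samples]
s2_h = [variance(s) for s in samples]

# a_h = N_h (N_h - n_h) / n_h, giving 120, 180, 150, 150.
a_h = [Nh * (Nh - nh) / nh for Nh, nh in zip(N_h, n_h)]

# Satterthwaite's approximation to the degrees of freedom.
num = sum(a * s2 for a, s2 in zip(a_h, s2_h)) ** 2
den = sum((a * s2) ** 2 / (nh - 1) for a, s2, nh in zip(a_h, s2_h, n_h))
d = num / den
df = math.floor(d)          # round down to be conservative

# The estimated variance of the stratified mean equals sum(a_h * s_h^2) / N^2.
N = sum(N_h)
var_st = sum(a * s2 for a, s2 in zip(a_h, s2_h)) / N**2

T_CRIT_13 = 2.160           # t quantile for 13 d.f., as used in the text
margin = T_CRIT_13 * math.sqrt(var_st)

print(f"d = {d:.2f}, d.f. = {df}")
print(f"margin of error = {margin:.2f}")
```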

Looking back at the data, if we had used simple random sampling, would our CI have been tighter or looser?

Stratified random sampling usually performs better overall, because we usually stratify when the units within each stratum are more homogeneous than the population as a whole.

Here there is no reason to expect the classes to be more homogeneous in weight, and therefore no reason why this stratified random sample should be any better than a simple random sample.

#### Application Exercise

Find a 95% CI for the population mean based on the sample mean, treating the data as a simple random sample. Is it wider or narrower than the one based on the stratified estimate?

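A sketch of this computation in Python, assuming the 20 weights are treated as one simple random sample from the N = 120 boys. The t quantile 2.093 for 19 degrees of freedom is a standard table value, not a number taken from these notes.

```python
import math
from statistics import mean, variance

N = 120
# All 20 sampled weights (lbs.), pooled across the four classes.
weights = [
    94, 90, 102, 110,
    91, 99, 93, 105, 111, 101,
    108, 96, 100, 93, 93,
    92, 110, 94, 91, 113,
]
n = len(weights)

ybar = mean(weights)
s2 = variance(weights)                 # pooled sample variance, divisor n - 1

# Estimated variance of the sample mean under simple random sampling.
var_srs = (N - n) / N * s2 / n

T_CRIT_19 = 2.093                      # t quantile for n - 1 = 19 d.f. (table value)
margin = T_CRIT_19 * math.sqrt(var_srs)

print(f"{ybar:.1f} +/- {margin:.2f}")
```

The margin of error comes out near 3.30, which is smaller than the 3.697 obtained from the stratified estimate.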

Since the data were collected by stratified sampling, treating them as a simple random sample, as in the exercise above, is the wrong way to compute the variance for this problem. How the variance is computed depends on the method by which the sample was taken. The computation is done only to show that if, hypothetically, the same data had been collected by simple random sampling, the margin of error would be smaller.

Moral of this example:

Stratifying on class, which is not related to weight, does not result in smaller variances within the strata. On the other hand, if stratification serves another purpose, such as estimating the parameters of each subgroup, it still makes sense to stratify, even though the goal is not a smaller variance. In this particular example, stratifying in order to estimate the average weight for each class may be relevant.

Stratified sampling to estimate population proportion:

$\hat{p}_{st}=\dfrac{1}{N}\sum\limits_{h=1}^L N_h \hat{p}_h$

\begin{align}
\hat{V}ar(\hat{p}_{st})&= \dfrac{1}{N^2}\sum\limits_{h=1}^L N^2_h \hat{V}ar(\hat{p}_h)\\
&= \dfrac{1}{N^2}\sum\limits_{h=1}^L N^2_h \left(\dfrac{N_h-n_h}{N_h}\right)\cdot \dfrac{\hat{p}_h(1-\hat{p}_h)}{n_h-1}\\
\end{align}

Example: The advertising firm wants to estimate the proportion of households in the county that view the television show "American Idol".

N1 = 155, N2 = 62, N3 = 93. As before, we stratify by town, and the sample results are:

| Stratum | Sample Size | $\hat{p}_h$ |
|---|---|---|
| Town A | $n_1 = 20$ | 16/20 = 0.80 |
| Town B | $n_2 = 8$ | 2/8 = 0.25 |
| Rural Area C | $n_3 = 12$ | 6/12 = 0.50 |

Plugging these values into the formulas above gives the following.

#### Application Exercise

Compute the estimator for the population proportion.

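A minimal Python sketch of this computation, using the stratum sizes and sample counts given above (the variable names are illustrative):

```python
N_h = [155, 62, 93]        # households in Town A, Town B, Rural Area C
n_h = [20, 8, 12]          # sampled households per stratum
viewers = [16, 2, 6]       # sampled households that view the show

# Sample proportion in each stratum.
p_hat = [x / n for x, n in zip(viewers, n_h)]

# Stratified estimator: stratum proportions weighted by N_h / N.
N = sum(N_h)
p_st = sum(Nh * p for Nh, p in zip(N_h, p_hat)) / N

print(round(p_st, 3))      # 0.6
```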

The following displays the estimated variance for each stratum:

\begin{align}
\hat{V}ar(\hat{p}_1)&= \left(\dfrac{N_1-n_1}{N_1}\right)\cdot \dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}\\
&= \left(\dfrac{155-20}{155}\right)\cdot \dfrac{0.8(0.2)}{19}\\
&= 0.007\\
\end{align}

\begin{align}
\hat{V}ar(\hat{p}_2)&= \left(\dfrac{N_2-n_2}{N_2}\right)\cdot \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}\\
&= \left(\dfrac{62-8}{62}\right)\cdot \dfrac{0.25(0.75)}{7}\\
&= 0.023\\
\end{align}

\begin{align}
\hat{V}ar(\hat{p}_3)&= \left(\dfrac{N_3-n_3}{N_3}\right)\cdot \dfrac{\hat{p}_3(1-\hat{p}_3)}{n_3-1}\\
&= \left(\dfrac{93-12}{93}\right)\cdot \dfrac{0.5(0.5)}{11}\\
&= 0.02\\
\end{align}

#### Application Exercise

Compute the estimated variance of the stratified proportion.

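The per-stratum variances and the variance of the stratified proportion can be sketched in Python as follows. The helper name `var_p_hat` is hypothetical; it implements the formula with the finite population correction and the $n_h-1$ divisor given above.

```python
N_h = [155, 62, 93]                 # stratum sizes
n_h = [20, 8, 12]                   # stratum sample sizes
p_hat = [16 / 20, 2 / 8, 6 / 12]    # stratum sample proportions


def var_p_hat(N, n, p):
    """Estimated variance of a stratum sample proportion."""
    return (N - n) / N * p * (1 - p) / (n - 1)


var_h = [var_p_hat(N, n, p) for N, n, p in zip(N_h, n_h, p_hat)]

# Combine the stratum variances with weights (N_h / N)^2.
N = sum(N_h)
var_p_st = sum(Nh**2 * v for Nh, v in zip(N_h, var_h)) / N**2

print([round(v, 4) for v in var_h])  # [0.0073, 0.0233, 0.0198]
print(round(var_p_st, 5))            # 0.00455
```

The estimated variance is about 0.0045, which corresponds to a standard error of roughly 0.067 for the estimated proportion.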