# Lesson 9: Multi-stage Designs

## Overview

In Section 9.1, we introduce multi-stage designs and give a few practical examples. We then set out the notation for the two-stage design. The unbiased estimator for the two-stage design with simple random sampling at each stage is discussed first, followed by the ratio estimator for the same design.

In Section 9.2, we discuss the two-stage design with primary units selected with probability proportional to size and secondary units selected with simple random sampling. The Hansen-Hurwitz (H-H) estimator is computed for this situation. We also show how to compute the estimated variance of the H-H estimator.

*Sampling* by Steven Thompson, 3rd edition.

## Objectives

- know why and when to use multi-stage sampling,
- compute the unbiased estimator and its estimated variance for the two-stage design when SRS is used at each stage,
- compute the ratio estimator and its estimated variance for the two-stage design when SRS is used at each stage, and
- compute the Hansen-Hurwitz estimator and its estimated variance when primary units are selected with probability proportional to size and secondary units are selected with SRS.

# 9.1 - Multi-Stage Sampling: Two Stages with S.R.S at Each Stage

We have learned about cluster sampling, where one selects primary units and then observes all of the secondary units within each selected primary unit. With multi-stage sampling, we select only some of the secondary units within each selected primary unit.

For example, in two-stage sampling:

- 1st stage: sample *n* primary units
- 2nd stage: for the *i*th primary unit, select \(m_i\) (not all) secondary units
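The two stages listed above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions: the population (a dictionary mapping made-up primary-unit labels to lists of secondary units), the function name `two_stage_sample`, and the fixed second-stage sampling fraction are all hypothetical and chosen only for the example.

```python
import random

def two_stage_sample(population, n, fraction, seed=None):
    """Draw a two-stage sample with SRS (without replacement) at each stage.

    population: dict mapping a primary-unit label to its list of secondary units
    n:          number of primary units selected in the first stage
    fraction:   share of secondary units selected within each chosen primary unit
                (an assumption of this sketch; m_i could be fixed any other way)
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of n primary units.
    primaries = rng.sample(sorted(population), n)
    sample = {}
    for unit in primaries:
        secondaries = population[unit]
        # Stage 2: simple random sample of m_i secondary units (at least one).
        m_i = max(1, round(fraction * len(secondaries)))
        sample[unit] = rng.sample(secondaries, m_i)
    return sample

# Hypothetical population: 6 cartons, each holding a different number of filters.
population = {
    f"carton{i}": [f"c{i}-filter{j}" for j in range(5 + 2 * i)]
    for i in range(1, 7)
}
sample = two_stage_sample(population, n=3, fraction=0.4, seed=1)
for unit, chosen in sample.items():
    # Print the label, M_i (unit size) and m_i (number sampled).
    print(unit, len(population[unit]), len(chosen))
```

Setting `fraction=1.0` would select every secondary unit in each chosen primary unit, and setting `n` equal to the number of primary units would select every primary unit.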

Multistage designs are used in many practical cases. These are just a few:

- **Large surveys involving the sampling of housing units:** The U.S. Census Bureau selects geographical areas within each state and then selects housing units within each selected geographical area.
- **Practical quality control problems** often involve two (or more) stages of sampling. For example, Ford wants to inspect the quality of a supplier of air filters. They first sample some cartons and then inspect some air filters inside these selected cartons.
- **Gallup poll** samples approximately 300 election districts. At the second stage, they select 5 households per district.

#### Notation:

- *N*: number of primary units in the population
- \(M_i\): number of secondary units in the *i*th primary unit
- \(y_i=\sum\limits_{j=1}^{M_i}y_{ij}\)
- **population total:** \(\tau=\sum\limits_{i=1}^N \sum\limits_{j=1}^{M_i}y_{ij}\)
- \(\mu=\dfrac{\tau}{M}\) where \(M=\sum\limits_{i=1}^N M_i\)
- *n*: number of primary units selected in the first stage
- \(m_i\): number of secondary units selected in the second stage

#### Try it!

1. If \(m_i = M_i\) (all secondary units are selected), it reduces to cluster sampling.

2. If *n* = *N* (all primary units are selected), it reduces to stratified random sampling.

#### Multistage Design

Multi-stage designs arise quite often in practice, so we need to understand how this type of sampling design is implemented. Most applications involve two stages of sampling with simple random sampling at each stage.

*[Figures omitted: two graphs, each showing an example of a two-stage sample. N = 50 for both graphs.]*

#### Two-stage cluster sampling with simple random sampling at each stage

We will discuss two possible estimators for this sampling design: an unbiased estimator and a ratio estimator.

## A. Unbiased Estimator

Since simple random sampling is used in the second stage, an unbiased estimator of the total *y*-value for the *i*th primary unit is:

\(\hat{y}_i=M_i \dfrac{\sum\limits_{j=1}^{m_i}y_{ij}}{m_i}=M_i \bar{y}_i\) where \(\bar{y}_i=\dfrac{\sum\limits_{j=1}^{m_i}y_{ij}}{m_i}\)

This estimator, which scales the sample mean \(\bar{y}_i\) up by the unit size \(M_i\), is also known as the expansion estimator.

Also, since simple random sampling is used in the first stage, an unbiased estimator for the population total is:

\(\hat{\tau}=N\cdot \dfrac{\sum\limits_{i=1}^n\hat{y}_i}{n}=N \cdot \dfrac{\sum\limits_{i=1}^n M_i \bar{y}_i}{n}\)

Now we have the expansion estimators from each stage. The next thing we need is the variance.

The estimated variance of \(\hat{\tau}\) is:

\(\hat{V}ar(\hat{\tau})=N(N-n)\dfrac{s^2_u}{n}+\dfrac{N}{n}\sum\limits_{i=1}^n M_i (M_i-m_i) \dfrac{s^2_i}{m_i}\)

where \(s_u^2\) is the sample variance among the estimated primary unit totals and \(s_i^2\) is the sample variance within the *i*th primary unit:

\(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n \left(\hat{y}_i-\dfrac{\sum\limits_{i=1}^n \hat{y}_i}{n}\right)^2\), and \(s^2_i=\dfrac{1}{m_i-1}\sum\limits_{j=1}^{m_i}(y_{ij}-\bar{y}_i)^2\)

To estimate the population mean \(\mu=\tau/M\), the estimator and its estimated variance are:

\(\hat{\mu}=\dfrac{N}{M}\cdot \dfrac{\sum\limits_{i=1}^n \hat{y}_i}{n}\), and \(\hat{V}ar(\hat{\mu})=\dfrac{1}{M^2}\hat{V}ar(\hat{\tau})\)

Let's take a look at an example where we can compute both the estimates and their variances.

## Example 9-1: Restaurant Employee Satisfaction

A restaurant chain wants to estimate the average employee satisfaction with their job (the scale is from 1 to 7). The chain has 120 restaurants, and the total number of employees in the chain is 6860. They use simple random sampling to sample 10 restaurants. They then use simple random sampling to sample and interview about 20% of the employees in those restaurants. The data are given as follows.

Restaurant | \(M_i\) | \(m_i\) | Employee Satisfaction | \(\bar{y}_i\) | \(s_i\) |
---|---|---|---|---|---|
1 | 54 | 10 | 5, 7, 6, 5, 4, 7, 6, 6, 4, 5 | 5.50 | 1.08 |
2 | 48 | 10 | 7, 7, 7, 6, 5, 4, 7, 7, 6, 6 | 6.20 | 1.03 |
3 | 68 | 14 | 5, 6, 5, 6, 4, 5, 6, 5, 4, 5, 4, 6, 5, 6 | 5.14 | 0.77 |
4 | 70 | 14 | 6, 5, 7, 6, 7, 6, 5, 7, 5, 7, 6, 5, 7, 6 | 6.07 | 0.83 |
5 | 52 | 10 | 4, 5, 4, 5, 5, 6, 5, 4, 4, 4 | 4.60 | 0.70 |
6 | 62 | 12 | 5, 7, 6, 7, 4, 3, 1, 5, 4, 6, 4, 5 | 4.75 | 1.71 |
7 | 41 | 8 | 7, 6, 7, 7, 6, 6, 5, 7 | 6.38 | 0.74 |
8 | 53 | 11 | 6, 6, 5, 4, 6, 7, 5, 5, 7, 6, 5 | 5.64 | 0.92 |
9 | 64 | 12 | 7, 6, 5, 4, 6, 5, 7, 4, 3, 6, 5, 7 | 5.42 | 1.31 |
10 | 43 | 9 | 7, 6, 6, 5, 7, 3, 5, 4, 5 | 5.33 | 1.32 |

**Minitab output:**

Mi | mi | yibar | yihat |
---|---|---|---|
54 | 10 | 5.50 | 297.00 |
48 | 10 | 6.20 | 297.60 |
68 | 14 | 5.14 | 349.52 |
70 | 14 | 6.07 | 424.90 |
52 | 10 | 4.60 | 239.20 |
62 | 12 | 4.75 | 294.50 |
41 | 8 | 6.38 | 261.58 |
53 | 11 | 5.64 | 298.92 |
64 | 12 | 5.42 | 346.88 |
43 | 9 | 5.33 | 229.19 |

##### Descriptive Statistics for yibar and yihat

Variable | N | Mean | StDev |
---|---|---|---|
yibar | 10 | 5.503 | 0.591 |
yihat | 10 | 303.9 | 58.1 |

Here we have output from Minitab that provides the descriptive statistics that you will need to compute the estimators and variance.

#### Try it!

The unbiased estimator is:

\begin{align}
\hat{\tau}&= N \cdot \dfrac{\sum\limits_{i=1}^n M_i\bar{y}_i}{n}\\
&= 120 \cdot \dfrac{(54\times 5.50)+(48\times 6.20)+\ldots+(43\times 5.33)}{10}\\
&= 36471.5
\end{align}

This might be thought of as the total satisfaction score. If we divided this by the total number of employees we would get the average score. If *M* is given to be 6860 then

\(\hat{\mu}=\dfrac{36471.5}{6860}=5.32\)

The estimated variance of the unbiased estimator is then:

\(\hat{V}ar(\hat{\tau})=N(N-n)\dfrac{s^2_u}{n}+\dfrac{N}{n}\sum\limits_{i=1}^n M_i (M_i-m_i) \dfrac{s^2_i}{m_i}\)

\(s_u^2\) is the sample variance of \(\hat{y}_1,\ \hat{y}_2,\cdots,\ \hat{y}_{10}\). From the Minitab output, \(s_u^2 = (58.1)^2 = 3375.61\).

\(s_i^2\) is the sample variance within the primary unit.

\(s^2_i=\dfrac{1}{m_i-1}\sum\limits_{j=1}^{m_i}(y_{ij}-\bar{y}_i)^2\)

\(s_i\) has been computed and given in the table.

#### Try it!

\begin{align}
\hat{V}ar(\hat{\tau})&= 120 \times (120-10)\times \dfrac{3375.61}{10}+\dfrac{120}{10}\times \left(54(54-10)\dfrac{1.08^2}{10}+\ldots+43(43-9)\dfrac{1.32^2}{9}\right)\\
&= 4455805.2+32451.6\\
&= 4488256.8
\end{align}

\(\hat{V}ar(\hat{\mu})=\dfrac{4488256.8}{6860^2}=0.095\)
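The hand computation above can be cross-checked with a short Python sketch. It uses only the rounded summary values \(M_i\), \(m_i\), \(\bar{y}_i\) and \(s_i\) from the table; the variable names are illustrative, not part of the lesson. Because \(s_u^2\) is computed here directly from the \(\hat{y}_i\) rather than from the rounded standard deviation 58.1, the results should agree with the figures above only up to rounding.

```python
from statistics import mean, variance

N, M, n = 120, 6860, 10  # restaurants, employees in the chain, sampled restaurants

# Summary rows from the table: (M_i, m_i, ybar_i, s_i)
rows = [
    (54, 10, 5.50, 1.08), (48, 10, 6.20, 1.03), (68, 14, 5.14, 0.77),
    (70, 14, 6.07, 0.83), (52, 10, 4.60, 0.70), (62, 12, 4.75, 1.71),
    (41, 8, 6.38, 0.74), (53, 11, 5.64, 0.92), (64, 12, 5.42, 1.31),
    (43, 9, 5.33, 1.32),
]

# Expansion estimate of each sampled restaurant's total: yhat_i = M_i * ybar_i
y_hat = [M_i * ybar for M_i, _, ybar, _ in rows]

tau_hat = N * mean(y_hat)  # unbiased estimate of the population total
mu_hat = tau_hat / M       # estimate of the population mean (requires known M)

s_u2 = variance(y_hat)     # sample variance of the yhat_i (divisor n - 1)
between = N * (N - n) * s_u2 / n
within = (N / n) * sum(
    M_i * (M_i - m_i) * s_i**2 / m_i for M_i, m_i, _, s_i in rows
)
var_tau = between + within
var_mu = var_tau / M**2

print(f"tau_hat = {tau_hat:.1f}, mu_hat = {mu_hat:.2f}")
print(f"between = {between:.1f}, within = {within:.1f}")
print(f"var_tau = {var_tau:.1f}, var_mu = {var_mu:.3f}")
```

The between-restaurant term dominates the estimated variance here, which suggests that sampling more restaurants would reduce the variance more than interviewing more employees per restaurant.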

**Note!** If *M* is unknown, we cannot use the unbiased estimator \(\hat{\mu}\).

If the cluster total is proportional to the cluster size, then the ratio estimator is appropriate. We discuss the ratio estimator next.

## B. Ratio Estimator

For the population total, the ratio estimator and its estimated variance are:

\(\hat{\tau}_r=\dfrac{\sum\limits_{i=1}^n \hat{y}_i}{\sum\limits_{i=1}^n M_i}\cdot M=\hat{r}M\)

\(\hat{V}ar(\hat{\tau}_r)=\dfrac{N(N-n)}{n}\cdot \dfrac{1}{n-1}\sum\limits_{i=1}^n(\hat{y}_i-M_i\hat{r})^2+\dfrac{N}{n}\sum\limits_{i=1}^n M_i(M_i-m_i)\dfrac{s^2_i}{m_i}\)

For the population mean, the ratio estimator and its estimated variance are:

\(\hat{\mu}_r=\hat{r}\)

\(\hat{V}ar(\hat{\mu}_r)=\dfrac{1}{M^2}\hat{V}ar(\hat{\tau}_r)\)

#### Try it!

\(\hat{\mu}_r=\dfrac{\sum\limits_{i=1}^n M_i\bar{y}_i}{\sum\limits_{i=1}^n M_i}=\dfrac{54\times 5.50+\ldots+43\times 5.33}{54+48+\ldots+43}=\dfrac{3039.3}{555}=5.48\)

\begin{align}
\hat{V}ar(\hat{\mu}_r)&=\dfrac{1}{M^2}\left[ \dfrac{N(N-n)}{n}\cdot \dfrac{1}{n-1}\sum\limits_{i=1}^n(\hat{y}_i-M_i\hat{r})^2+\dfrac{N}{n}\sum\limits_{i=1}^n M_i(M_i-m_i)\dfrac{s^2_i}{m_i} \right]\\
&= \dfrac{1}{6860^2}\left[ \dfrac{120(120-10)}{10}\cdot \dfrac{1}{9}((54 \times 5.50-54\times 5.48)^2+ \ldots + (43\times 5.33-43\times 5.48)^2)+32451.6 \right]\\
&= 0.029
\end{align}
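The ratio estimate and its estimated variance can be checked in the same way. The sketch below is a minimal illustration that again assumes the rounded summary values from the Example 9-1 table; the variable names are hypothetical. It keeps \(\hat{r}\) at full precision instead of rounding it to 5.48, so small differences from the hand computation are expected.

```python
N, M, n = 120, 6860, 10  # restaurants, employees in the chain, sampled restaurants

# Summary rows from the table: (M_i, m_i, ybar_i, s_i)
rows = [
    (54, 10, 5.50, 1.08), (48, 10, 6.20, 1.03), (68, 14, 5.14, 0.77),
    (70, 14, 6.07, 0.83), (52, 10, 4.60, 0.70), (62, 12, 4.75, 1.71),
    (41, 8, 6.38, 0.74), (53, 11, 5.64, 0.92), (64, 12, 5.42, 1.31),
    (43, 9, 5.33, 1.32),
]

sizes = [M_i for M_i, _, _, _ in rows]
y_hat = [M_i * ybar for M_i, _, ybar, _ in rows]  # estimated unit totals

r_hat = sum(y_hat) / sum(sizes)  # ratio estimate of the mean, mu_r
tau_r = r_hat * M                # ratio estimate of the total

# First-stage term: spread of the estimated totals around M_i * r_hat
resid_ss = sum((y - M_i * r_hat) ** 2 for y, M_i in zip(y_hat, sizes))
between = N * (N - n) / n * resid_ss / (n - 1)

# Second-stage term: the same within-unit component as for the unbiased estimator
within = (N / n) * sum(
    M_i * (M_i - m_i) * s_i**2 / m_i for M_i, m_i, _, s_i in rows
)

var_tau_r = between + within
var_mu_r = var_tau_r / M**2

print(f"mu_r = {r_hat:.2f}, tau_r = {tau_r:.1f}, var_mu_r = {var_mu_r:.3f}")
```

For these data the estimated variance of the ratio estimator is well below the 0.095 obtained for the unbiased estimator, which is consistent with restaurant totals being roughly proportional to restaurant size.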

**Note!** If *M* is unknown, one can use \(\hat{\mu}_r\) and estimate *M* by \(\dfrac{\sum\limits_{i=1}^n M_i}{n}\times N\).

Recall: \(M=\sum\limits_{i=1}^N M_i\)
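As a worked instance, take the ten sampled restaurants of Example 9-1, for which \(\sum M_i = 555\). Writing the estimate of *M* as \(\hat{M}\) (a symbol introduced here only for illustration), the arithmetic would be:

\begin{align}
\hat{M}&=\dfrac{\sum\limits_{i=1}^n M_i}{n}\times N\\
&=\dfrac{555}{10}\times 120\\
&=6660
\end{align}

This is close to, but not the same as, the known value \(M=6860\) given in the example.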

# 9.2 - Two Stages with Primary Units Selected by Probability Proportional to Size and Secondary Units Selected with S.R.S.

This section covers the multi-stage design with primary units selected with probability proportional to size (p.p.s.) and secondary units selected with simple random sampling.

Using the Hansen-Hurwitz estimator, we get the following:

To estimate the population total:

\(\hat{\tau}_p=\dfrac{M}{n}\sum\limits_{i=1}^n \dfrac{\hat{y}_i}{M_i}=M \dfrac{\sum \bar{y}_i}{n}\), where \(\bar{y}_i=\dfrac{\hat{y}_i}{M_i}\)

\(\hat{V}ar(\hat{\tau}_p)=\dfrac{M^2}{n(n-1)} \sum (\bar{y}_i-\hat{\mu}_p)^2\)
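The simplification in \(\hat{\tau}_p\) can be spelled out step by step. The sketch below assumes that primary units are drawn with replacement with selection probability \(p_i=M_i/M\) on each draw (the symbol \(p_i\) is introduced here for illustration) and that the Hansen-Hurwitz estimator takes its usual form \(\frac{1}{n}\sum \hat{y}_i/p_i\):

\begin{align}
\hat{\tau}_p&=\dfrac{1}{n}\sum\limits_{i=1}^n \dfrac{\hat{y}_i}{p_i}\\
&=\dfrac{1}{n}\sum\limits_{i=1}^n \dfrac{M_i\bar{y}_i}{M_i/M}\\
&=\dfrac{M}{n}\sum\limits_{i=1}^n \bar{y}_i
\end{align}

Under these assumptions the sizes \(M_i\) cancel, so each sampled primary unit enters the estimator only through its sample mean \(\bar{y}_i\). This is why the variance estimate depends only on the spread of the \(\bar{y}_i\) around \(\hat{\mu}_p\).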

To estimate the population mean:

\(\hat{\mu}_p=\dfrac{\sum \bar{y}_i}{n}\)

[since \(\hat{\mu}_p=\left(\dfrac{\hat{\tau}_p}{M}\right)\) and thus it becomes \(\dfrac{\sum \bar{y}_i}{n}\)]

\(\hat{V}ar(\hat{\mu}_p)=\dfrac{1}{n(n-1)}\sum (\bar{y}_i-\hat{\mu}_p)^2\)

**Example**

There are 36 departments in a small liberal arts college. One wants to estimate the average amount of money the students spent on textbooks last semester. Since the sizes of the departments vary considerably, a two-stage cluster sample is taken, with the primary units (departments) selected with probability proportional to size. The results are listed in the table below.

Department | \(M_i\) | \(m_i\) | Textbook expenses in $ for last semester |
---|---|---|---|
1 | 10 | 4 | 326, 400, 423, 443 |
2 | 20 | 8 | 278, 312, 450, 350, 227, 438, 512, 403 |
3 | 30 | 12 | 512, 256, 332, 402, 512, 309, 411, 610, 422, 630, 550, 470 |
4 | 15 | 6 | 426, 312, 512, 440, 342, 533 |

**Minitab output:**

##### Descriptive Statistics: dept1, dept2, dept3, dept4

Variable | Mean | SE Mean | StDev | Variance |
---|---|---|---|---|
dept1 | 398.0 | 25.6 | 51.1 | 2612.7 |
dept2 | 371.3 | 34.1 | 96.3 | 9277.4 |
dept3 | 451.3 | 33.9 | 117.6 | 13828.8 |
dept4 | 427.5 | 36.1 | 88.4 | 7815.9 |

#### Try it!

\(\hat{\mu}_p=\dfrac{\sum \bar{y}_i}{n}=\dfrac{398+371.3+451.3+427.5}{4}=412.025\)

\begin{align}
\hat{V}ar(\hat{\mu}_p) &= \dfrac{1}{n(n-1)}\sum (\bar{y}_i-\hat{\mu}_p)^2\\
&= \dfrac{1}{4\times 3}\left[(398-412.025)^2+(371.3-412.025)^2+(451.3-412.025)^2+(427.5-412.025)^2\right]\\
&= 303.10
\end{align}
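The same computation can be reproduced in a few lines of Python. This is a minimal sketch that takes the four rounded department means from the Minitab output as its input; the variable names and the standard error in the last step are additions for illustration. The estimated variance is the sample variance of the \(\bar{y}_i\) divided by *n*, which is algebraically the same as the formula above.

```python
from statistics import mean, variance

# Sample means from the Minitab output, one per sampled department
ybar = [398.0, 371.3, 451.3, 427.5]
n = len(ybar)

mu_p = mean(ybar)              # Hansen-Hurwitz estimate of the population mean
var_mu_p = variance(ybar) / n  # equals sum((ybar_i - mu_p)^2) / (n * (n - 1))
se_mu_p = var_mu_p ** 0.5      # estimated standard error (not computed in the text)

print(f"mu_p = {mu_p:.3f}, var = {var_mu_p:.2f}, se = {se_mu_p:.2f}")
```

The estimate comes out at about 412 dollars with an estimated variance of roughly 303, so the standard error is a little over 17 dollars. Using the unrounded department means from the raw data would shift these figures slightly.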