7.2 - Estimators for Cluster Sampling When Primary Units Are Selected by Simple Random Sampling

When the primary units are selected by simple random sampling, two frequently used estimators, among the many possible, are:

A. Unbiased estimator

\(\hat{\tau}=N\cdot \bar{y}=\dfrac{N\cdot \sum\limits_{i=1}^n y_i}{n}\)

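Recall that \(y_i\) is the total of the y-values in the \(i\)th primary unit, \(N\) is the number of primary units in the population, and \(n\) is the number of primary units in the sample.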

\(\hat{V}ar(\hat{\tau})=N\cdot (N-n)\dfrac{s^2_u}{n}\)

where \(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n(y_i-\bar{y})^2\)

To estimate the mean per primary unit, \(\tau/N\), the estimator and its estimated variance are:

\(\bar{y}=\dfrac{\hat{\tau}}{N}\), \(\hat{V}ar(\bar{y})=\dfrac{1}{N^2} \hat{V}ar(\hat{\tau})\)

To estimate the mean per secondary unit, \(\mu=\tau/M\), where \(M\) is the total number of secondary units in the population, the estimator and its estimated variance are:

\(\hat{\mu}=\dfrac{\hat{\tau}}{M}\), \(\hat{V}ar(\hat{\mu})=\dfrac{1}{M^2} \hat{V}ar(\hat{\tau})\)
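The formulas for the unbiased estimator translate directly into a few lines of code. The following is a minimal Python sketch, not part of the course materials; the function name `unbiased_cluster_estimates` and the toy numbers are made up for illustration.

```python
from statistics import mean, variance

def unbiased_cluster_estimates(y_totals, N, M):
    """Unbiased estimates when n of N primary units are drawn by SRS.

    y_totals: totals y_i of the sampled primary units
    N: number of primary units in the population
    M: number of secondary units in the population
    """
    n = len(y_totals)
    y_bar = mean(y_totals)              # sample mean of the primary unit totals
    s2_u = variance(y_totals)           # sample variance, divisor n - 1
    tau_hat = N * y_bar                 # estimated population total
    var_tau = N * (N - n) * s2_u / n    # estimated variance of tau_hat
    return {
        "tau_hat": tau_hat,
        "var_tau": var_tau,
        "mean_per_primary": tau_hat / N,
        "var_mean_per_primary": var_tau / N**2,
        "mean_per_secondary": tau_hat / M,
        "var_mean_per_secondary": var_tau / M**2,
    }

# Toy illustration (made-up numbers): 3 of 10 clusters sampled,
# 50 secondary units in the whole population.
est = unbiased_cluster_estimates([10, 20, 30], N=10, M=50)
for name, value in est.items():
    print(name, round(value, 4))
```

Note that the variance depends on the data only through \(s^2_u\), the variability among the primary unit totals.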

B. Ratio Estimator

If the primary unit total \(y_i\) is highly correlated with the primary unit size \(M_i\), a ratio estimator based on size may be efficient.

\(\hat{\tau}_r=r \cdot M,\quad M=\sum\limits_{i=1}^N M_i\)

where \(r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}\) is the sample ratio of total to size. The estimated variance is:

\(\hat{V}ar(\hat{\tau}_r)=\dfrac{N(N-n)}{n(n-1)}\sum\limits_{i=1}^n (y_i-rM_i)^2\)
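As a rough illustration of the ratio estimator, here is a small Python sketch. The function name `ratio_cluster_estimates` and the toy inputs are hypothetical; only the formulas come from the text above.

```python
def ratio_cluster_estimates(y_totals, sizes, N, M):
    """Ratio-to-size estimates when n of N primary units are drawn by SRS.

    y_totals: totals y_i of the sampled primary units
    sizes: sizes M_i of the sampled primary units
    N: number of primary units in the population
    M: number of secondary units in the population
    """
    n = len(y_totals)
    r = sum(y_totals) / sum(sizes)      # sample ratio of total to size
    # Sum of squared deviations of each total from r times its size
    ss_resid = sum((y - r * m) ** 2 for y, m in zip(y_totals, sizes))
    var_tau_r = N * (N - n) / (n * (n - 1)) * ss_resid
    return {
        "r": r,
        "tau_r": r * M,                 # estimated population total
        "var_tau_r": var_tau_r,
        "mu_r": r,                      # estimated mean per secondary unit
        "var_mu_r": var_tau_r / M**2,
    }

# Toy illustration (made-up numbers): totals are nearly proportional to size.
est_r = ratio_cluster_estimates([12, 20, 28], [1, 2, 3], N=10, M=20)
for name, value in est_r.items():
    print(name, round(value, 4))
```

The closer the totals are to being exactly proportional to the sizes, the smaller the deviations \(y_i-rM_i\) and hence the smaller the estimated variance.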

The Basic Principle

Since every secondary unit is observed within a selected primary unit, the within-primary-unit variance does not enter into the variances of the estimators; only the variability among the primary unit totals matters. For example,

\(\hat{V}ar(\hat{\tau})=N(N-n)\cdot \dfrac{s^2_u}{n}\)
where  \(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n (y_i-\bar{y})^2\)

Thus, to obtain estimators with low variance:

  1. Clusters should be formed so that one cluster is similar to another cluster. (Note: this is very different from saying that the units within a cluster are similar to each other.)
  2. Each cluster should contain the full diversity of the population and thus be 'representative'.

With natural populations of spatially distributed plants, animals, or minerals, and with human populations, these conditions are typically satisfied by systematic sampling, where each cluster contains units that are far apart. In practice, cluster sampling is more often carried out for reasons of convenience or practicality than to obtain the lowest variance.
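A toy calculation can make the principle concrete. The sketch below uses a made-up population of twelve values that increase steadily along a line and compares two ways of forming four clusters of three units. The population and the names `contiguous`, `systematic`, and `s2_u` are assumptions for illustration only.

```python
from statistics import variance

# Made-up population: y-values 1..12 increasing along a gradient.
population = list(range(1, 13))

# Clusters of neighbouring units: units within a cluster are alike,
# so the clusters differ strongly from one another.
contiguous = [population[i:i + 3] for i in range(0, 12, 3)]

# Systematic clusters: each takes every 4th unit, so its units are far
# apart and each cluster spans the range of the population.
systematic = [population[i::4] for i in range(4)]

def s2_u(clusters):
    """Variance (divisor n - 1) of the cluster totals."""
    return variance([sum(c) for c in clusters])

print("contiguous totals:", [sum(c) for c in contiguous], "s2_u =", s2_u(contiguous))
print("systematic totals:", [sum(c) for c in systematic], "s2_u =", s2_u(systematic))
```

Both layouts cover the same twelve units, but the systematic clusters have much more similar totals, which is what drives \(s^2_u\), and therefore the variance of the estimator, down.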

Why or when do we use cluster sampling?

Will it give us a more precise estimator? In most cases, the answer is no.

We often use cluster sampling out of necessity, even though it typically gives us a larger variance.

If the objective of sampling is to obtain a specified amount of information about a population parameter at minimum cost, cluster sampling sometimes gives more information per unit cost than simple random sampling, stratified sampling, or systematic sampling, because the cost of sampling units within a cluster may be much lower.

Cluster sampling is an effective design in two different scenarios:

  1. A good frame listing the population elements either is not available or is very costly to obtain, whereas a frame listing clusters is easily obtained.
  2. The cost of obtaining observations increases as the distance separating the elements increases.

Example 7-1: Average Yearly Vacation Budget

Let's look at an example of cluster sampling using a ratio estimator.

A sociologist wants to estimate the average yearly vacation budget per household in a certain city. There are 3,100 households in the city. The sociologist divided the city into 400 blocks and treated them as 400 clusters. He then randomly sampled 24 clusters and interviewed every household in each sampled cluster. The data are given in the table below:

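| Cluster | Number of households \(M_i\) | Total vacation budget per cluster \(y_i\) |
|---|---|---|
| 1 | 7 | 12,000 |
| 2 | 9 | 15,000 |
| 3 | 5 | 8,000 |
| 4 | 8 | 13,000 |
| 5 | 12 | 18,000 |
| 6 | 5 | 7,000 |
| 7 | 4 | 6,000 |
| 8 | 8 | 13,000 |
| 9 | 14 | 22,000 |
| 10 | 6 | 9,800 |
| 11 | 3 | 7,000 |
| 12 | 13 | 18,000 |
| 13 | 8 | 12,340 |
| 14 | 4 | 5,000 |
| 15 | 6 | 8,900 |
| 16 | 9 | 14,000 |
| 17 | 3 | 4,000 |
| 18 | 10 | 11,400 |
| 19 | 4 | 5,000 |
| 20 | 7 | 13,000 |
| 21 | 6 | 8,900 |
| 22 | 5 | 8,700 |
| 23 | 7 | 10,000 |
| 24 | 6 | 9,200 |
| Total | 169 | 259,240 |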

  Using Minitab

To use Minitab to plot the total for cluster versus cluster size:

  1. Graph > Scatterplot
  2. Select 'Total for Cluster' as the Y variable
  3. Select 'Cluster Size' as the X variable

To use Minitab to display descriptive statistics:

Stat > Basic Statistics > Display Descriptive Statistics

Here is a plot of these data, which lets us see whether the total for the cluster is roughly proportional to the cluster size.

[Figure: Minitab scatterplot of total for the cluster versus cluster size]

Minitab output

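The regression equation is

total for the cluster = 648 + 1442 cluster size

Coefficients

| Predictor | Coef | StDev | T | P |
|---|---|---|---|---|
| Constant | 648.0 | 705.9 | 0.92 | 0.369 |
| Cluster size | 1441.94 | 92.59 | 15.57 | 0.000 |

The slope is highly significant, while the intercept is not significantly different from zero (P = 0.369). This is consistent with the cluster total being roughly proportional to the cluster size.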

Descriptive Statistics for the variables:

| Variable | N | Mean | StDev |
|---|---|---|---|
| Cluster size \(M_i\) | 24 | 7.042 | 2.985 |
| \(y_i\) | 24 | 10802 | 4495 |
| \(y_i-rM_i\) | 24 | -0 | 1325 |
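The regression coefficients and descriptive statistics reported by Minitab can be cross-checked from the data table. The following Python sketch uses only the standard library and an ordinary least-squares fit written out by hand; the variable names are illustrative and this is not the course's own R code.

```python
from statistics import mean, stdev

# Data from the table: cluster sizes M_i and cluster totals y_i
sizes = [7, 9, 5, 8, 12, 5, 4, 8, 14, 6, 3, 13,
         8, 4, 6, 9, 3, 10, 4, 7, 6, 5, 7, 6]
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000,
          22000, 9800, 7000, 18000, 12340, 5000, 8900, 14000,
          4000, 11400, 5000, 13000, 8900, 8700, 10000, 9200]

m_bar, y_bar = mean(sizes), mean(totals)

# Least-squares slope and intercept for total = intercept + slope * size
s_xy = sum((m - m_bar) * (y - y_bar) for m, y in zip(sizes, totals))
s_xx = sum((m - m_bar) ** 2 for m in sizes)
slope = s_xy / s_xx
intercept = y_bar - slope * m_bar

# Deviations of each total from r times its size, as used by the ratio estimator
r = sum(totals) / sum(sizes)
residuals = [y - r * m for m, y in zip(sizes, totals)]

print(f"fit: total = {intercept:.1f} + {slope:.2f} * size")
print(f"cluster size: mean {m_bar:.3f}, st.dev. {stdev(sizes):.3f}")
print(f"y_i:          mean {y_bar:.0f}, st.dev. {stdev(totals):.0f}")
print(f"y_i - r*M_i:  st.dev. {stdev(residuals):.0f}")
```

The printed values should agree with the Minitab output up to rounding. The deviations \(y_i-rM_i\) sum to zero by construction, which is why their mean is reported as zero.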

The ratio estimator for cluster sample (ratio-to-size):

If primary unit total \(y_i\) is highly correlated with cluster size \(M_i\), a ratio estimator based on size may be efficient. The ratio estimator of the population total is:

 \(\hat{\tau}_r=r\cdot M \quad \text{where } r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}\)

The ratio estimator is biased but the bias is small when the sample size is large. Here is the variance:

\(\hat{V}ar(\hat{\tau}_r)=\dfrac{N(N-n)}{n(n-1)}\sum\limits_{i=1}^n (y_i-rM_i)^2\)

To estimate the population mean per secondary unit, \(\mu=\tau/M\), the ratio estimator is:

\(\hat{\mu}_r=\dfrac{\hat{\tau}_r}{M}=r\)

\(\hat{V}ar(\hat{\mu}_r)=\dfrac{N(N-n)}{n(n-1)}\cdot \dfrac{1}{M^2} \sum\limits_{i=1}^n (y_i-rM_i)^2\)

Back to the example. To estimate the average yearly vacation budget for each household we will use:

\(\hat{\mu}_r=r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}\)

In this example, we see that N = 400, the total number of blocks, and n = 24. M in this case is as follows:

\(M=\sum\limits_{i=1}^N M_i=3100\)

Try it!

Find the ratio estimator for the average yearly vacation budget for each household in that city. Also, find the estimated variance for the ratio estimator.

\(\hat{\mu}_r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}=\dfrac{259240}{169}=1533.96\)

\(\hat{V}ar(\hat{\mu}_r)=\dfrac{N(N-n)}{n\cdot M^2}\cdot \dfrac{1}{n-1} \sum\limits_{i=1}^n (y_i-rM_i)^2\)

For this example, M = 3100, N = 400, n = 24

\begin{align}
\dfrac{1}{n-1}\sum\limits_{i=1}^n (y_i-rM_i)^2 &=[\text{st.dev. of }(y-rM)]^2\\
&= (1325)^2\\
\end{align}

\begin{align}
\hat{V}ar(\hat{\mu}_r)&=\dfrac{400(400-24)}{24(3100)^2}\cdot (1325)^2\\
&= 1144.84\\
\end{align}
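The ratio estimate and its estimated variance can also be computed directly from the data. This is a minimal Python sketch with illustrative variable names. Because it uses the unrounded sum of squares instead of the rounded standard deviation 1325, the variance it prints may differ slightly from 1144.84 in the last digits.

```python
# Data from the table: cluster sizes M_i and cluster totals y_i
sizes = [7, 9, 5, 8, 12, 5, 4, 8, 14, 6, 3, 13,
         8, 4, 6, 9, 3, 10, 4, 7, 6, 5, 7, 6]
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000,
          22000, 9800, 7000, 18000, 12340, 5000, 8900, 14000,
          4000, 11400, 5000, 13000, 8900, 8700, 10000, 9200]

N = 400          # blocks (clusters) in the city
M = 3100         # households in the city
n = len(sizes)   # sampled clusters (24)

# Ratio estimate of the mean vacation budget per household
mu_r = sum(totals) / sum(sizes)

# Estimated variance: N(N-n) / (n (n-1) M^2) * sum (y_i - r M_i)^2
ss_resid = sum((y - mu_r * m) ** 2 for m, y in zip(sizes, totals))
var_mu_r = N * (N - n) / (n * (n - 1) * M**2) * ss_resid
se_mu_r = var_mu_r ** 0.5

print(f"ratio estimate of mean per household: {mu_r:.2f}")
print(f"estimated variance: {var_mu_r:.2f}")
print(f"estimated standard error: {se_mu_r:.2f}")
```

The standard error, the square root of the estimated variance, is on the same scale as the estimate and is often easier to interpret than the variance itself.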

 If we used the unbiased estimator would our variance be larger or smaller?

For this example, we also want to compute the unbiased estimator for comparison purposes.

Try it!

Find the unbiased estimator for the average yearly vacation budget for each household in that city. Also, find the estimated variance for the unbiased estimator.

\begin{align}
\hat{\mu}&= N \dfrac{\sum\limits_{i=1}^n y_i }{n}\cdot \dfrac{1}{M}\\
&= 400 \cdot \dfrac{259240}{24}\cdot \dfrac{1}{3100}\\
&= 400 \cdot 10802 \cdot \dfrac{1}{3100}\\
&= 1393.81\\
\end{align}

\begin{align}
\hat{V}ar(\hat{\mu})&=\dfrac{N(N-n)}{M^2 \cdot n} \cdot \dfrac{1}{n-1}\sum\limits_{i=1}^n (y_i-\bar{y})^2\\
&= \dfrac{400(400-24)}{(3100)^2 \cdot 24}(\text{st.dev. of }y)^2\\
&= \dfrac{400(400-24)}{(3100)^2 \cdot 24}(4495)^2\\
&= 13175.67\\
\end{align}
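For comparison, the unbiased estimate can be computed from the same data. The Python sketch below uses illustrative variable names and unrounded intermediate values, so its output may differ slightly from the hand calculation above, which rounds the mean to 10802 and the standard deviation to 4495.

```python
from statistics import mean, variance

# Cluster totals y_i from the table (cluster sizes are not needed here)
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000,
          22000, 9800, 7000, 18000, 12340, 5000, 8900, 14000,
          4000, 11400, 5000, 13000, 8900, 8700, 10000, 9200]

N = 400           # blocks (clusters) in the city
M = 3100          # households in the city
n = len(totals)   # sampled clusters (24)

# Unbiased estimate of the total, then of the mean per household
tau_hat = N * mean(totals)
mu_hat = tau_hat / M

# Estimated variance: N(N-n) / (M^2 n) * s_u^2, with s_u^2 using divisor n - 1
s2_u = variance(totals)
var_mu_hat = N * (N - n) / (M**2 * n) * s2_u

print(f"unbiased estimate of mean per household: {mu_hat:.2f}")
print(f"estimated variance: {var_mu_hat:.2f}")
```

Here the variance is driven by the spread of the cluster totals themselves, which is large because the cluster sizes vary from 3 to 14 households.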

Remark 1: This variance is huge, roughly 11.5 times the estimated variance of the ratio estimator (13175.67 versus 1144.84), and we should be very unhappy using the unbiased estimator here. We can thus see that when the cluster total is roughly proportional to the cluster size, it is better to use the ratio estimator than the unbiased estimator.
Remark 2: Can we compute the variance using the formula for simple random sampling? Unfortunately, no! That formula applies only if the data were collected by simple random sampling, and here the households were observed in clusters. Note: it is a big mistake to compute a variance with a formula that does not match the sampling scheme actually used!

Using R

Here is the code for R for this example:

Datafile: Vacation.txt
R code: Chapter7_Vacation Budget.R.txt