7 Part 1 of Cluster and Systematic Sampling
Overview
In Section 7.1, we introduce cluster and systematic sampling and show their similar structure. Graphical representations of primary units and secondary units are given. Notations are introduced.
In Section 7.2, unbiased estimators and ratio estimators are provided for cluster sampling when primary units are selected by SRS. Basic principles for obtaining estimators with low variance are discussed. Then we discuss why and when we use cluster sampling. That is followed by an example showing how to compute the ratio estimator and the unbiased estimator under cluster sampling with primary units selected by SRS.
In Section 7.3, cluster sampling with primary units selected by probabilities proportional to size is discussed. Then an example is given.
Lesson 7: Ch. 12.1-12.3 of Sampling by Steven Thompson, 3rd Edition.
Objectives
Upon completion of this lesson you should be able to:
- Identify the appropriate reasons and situations to use cluster sampling,
- Recognize and use the appropriate notation for cluster and systematic sampling,
- Define and differentiate between primary units and secondary units,
- Compute the unbiased estimator for cluster samples when primary units are selected by SRS,
- Compute the ratio estimator for cluster samples when primary units are selected by SRS, and
- Compute the Hansen-Hurwitz estimator for cluster samples when primary units are selected by PPS.
7.1 Introduction to Cluster and Systematic Sampling
On the surface, systematic and cluster sampling look very different. However, the two designs share the same structure: the population is partitioned into primary units, each primary unit being composed of secondary units. Whenever a primary unit is included in the sample, the \(y\)-values of every secondary unit within it are observed.
In the following two graphs, we provide examples of two configurations of primary units:
Figure 7.2 has 50 primary units (PSU) (the colored rectangle is an example of a primary unit of cluster sampling)
Figure 7.3 has 25 primary units (PSU) (the colored units (collectively) are an example of a primary unit of systematic sampling)
Primary units (PSU) may be different from observation units. One can view systematic sampling as a sampling of primary units. Once the primary units are selected, a cluster of secondary units is also selected.
Advantages of Systematic Sampling
- Easier to perform in the field, especially if a good frame is not available.
- Frequently provides more information per unit cost than simple random sampling, in the sense of smaller variances.
For example, suppose a systematic sample is drawn from a batch of computer chips in production order. The first 400 chips are fine but, due to a fault in the machine, the last 300 chips are defective. Systematic sampling selects uniformly over the defective and non-defective items and so gives a very accurate estimate of the fraction of defective items.
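The chip example can be checked with a short sketch. The batch size of 700 (400 fine chips followed by 300 defective ones) and the sampling interval of 10 are assumptions made for illustration; the text does not give a sampling interval.

```python
# Illustrative sketch: batch size (700) and sampling interval (10) are assumed.
chips = [0] * 400 + [1] * 300   # 0 = fine, 1 = defective, in production order
k = 10                          # take every k-th chip

true_fraction = sum(chips) / len(chips)

# Each random start 0..k-1 defines one possible systematic sample.
sample_fractions = []
for start in range(k):
    sample = chips[start::k]
    sample_fractions.append(sum(sample) / len(sample))

print(f"true fraction defective: {true_fraction:.4f}")
print("estimates over all starts:", [round(f, 4) for f in sample_fractions])
```

Under these assumptions every possible systematic sample contains fine and defective chips in the same proportion as the batch, so the estimate does not depend on the random start.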
Cluster Sampling and Systematic Sampling
A cluster/systematic sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
For Figure 7.4 below, \(N = 50\), \(n = 10\), \(M_i = 8\)
For Figure 7.5 below, \(N = 25\), \(n = 2\), \(M_i= 16\)
Figure 7.4 shows an example of cluster sampling and Figure 7.5 shows an example of systematic sampling. Secondary units of a primary unit of cluster sampling are close together whereas secondary units of a primary unit of systematic sampling are separate.
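The two ways of partitioning a population into primary units can be sketched in code. The 20 x 20 grid, the 2 x 4 block shape, and the spacing of 5 are assumptions chosen only to reproduce the counts above (\(N = 50\), \(M_i = 8\) and \(N = 25\), \(M_i = 16\)); the figures may be laid out differently.

```python
from collections import Counter

ROWS, COLS = 20, 20  # assumed 20 x 20 grid of 400 secondary units

def cluster_psu(row, col):
    # Compact 2 x 4 blocks: 10 block-rows x 5 block-columns = 50 primary units.
    return (row // 2) * 5 + (col // 4)

def systematic_psu(row, col):
    # Every 5th unit in each direction: 5 x 5 = 25 starting positions,
    # each defining one primary unit of 4 x 4 = 16 spread-out units.
    return (row % 5) * 5 + (col % 5)

cells = [(r, c) for r in range(ROWS) for c in range(COLS)]
cluster_sizes = Counter(cluster_psu(r, c) for r, c in cells)
systematic_sizes = Counter(systematic_psu(r, c) for r, c in cells)

# Number of primary units (N) and the set of primary unit sizes (M_i).
print(len(cluster_sizes), set(cluster_sizes.values()))
print(len(systematic_sizes), set(systematic_sizes.values()))
```

In both cases each secondary unit belongs to exactly one primary unit; only the spatial arrangement of the units within a primary unit differs.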
7.2 Estimators for Cluster Sampling when Primary units are selected by simple random sampling
When the primary units are selected by simple random sampling, frequently used estimators among many possible estimators are:
A. Unbiased estimator
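Recall that \(y_i\) is the total of the \(y\)-values in the \(i\)th primary unit. An unbiased estimator of the population total \(\tau\) and an unbiased estimator of its variance are:
\[\hat{\tau}=N\bar{y}=\dfrac{N}{n}\sum\limits_{i=1}^n y_i, \quad \hat{\operatorname{Var}}(\hat{\tau})=N(N-n)\cdot \dfrac{s^2_u}{n}\]
where \(\bar{y}=\dfrac{1}{n}\sum\limits_{i=1}^n y_i\) and \(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n(y_i-\bar{y})^2\)
To estimate the mean per primary unit, \(\tau/ N\), the estimator and its estimated variance are given below:
\[\bar{y}=\dfrac{\hat{\tau}}{N}, \quad \hat{\operatorname{Var}}(\bar{y})=\dfrac{N-n}{N}\cdot \dfrac{s^2_u}{n}\]
To estimate the mean per secondary unit, \(\mu=\tau/M\), where \(M=\sum\limits_{i=1}^N M_i\) is the total number of secondary units, the estimator and its estimated variance are given below: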
\[\hat{\mu}=\dfrac{\hat{\tau}}{M}, \hat{\operatorname{Var}}(\hat{\mu})=\dfrac{1}{M^2} \hat{\operatorname{Var}}(\hat{\tau})\]
B. Ratio Estimator
If the primary unit total \(y_i\) is highly correlated with the primary unit size \(M_i\), a ratio estimator based on size may be efficient.
\[\hat{\tau}_r=r \cdot M \text{, where } r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i} \text{ and } M=\sum\limits_{i=1}^N M_i\]
The Basic Principle
Since every secondary unit is observed within a selected primary unit, the within-primary-unit variance does not enter into the variances of the estimators; only the variation between primary unit totals matters. For example,
\[\hat{\operatorname{Var}}(\hat{\tau})=N(N-n)\cdot \dfrac{s^2_u}{n}\]
\[\text{where }s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n (y_i-\bar{y})^2\]
Thus, to obtain estimators of low variances,
- Clusters should be formed so that one cluster is similar to another cluster. (Note: this is very different from saying that units in the cluster are similar)
- Each cluster should contain the full diversity of the population and thus, is ‘representative’.
With natural populations of spatially distributed plants, animals, or minerals, and human populations, the above condition is typically satisfied by systematic sampling where each cluster contains units that are far apart. Cluster sampling is more often than not carried out for reasons of convenience or practicality rather than to obtain the lowest variances.
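The principle can be illustrated with a tiny made-up population. The sketch below uses hypothetical \(y\)-values 1 to 12 and compares two ways of forming \(N = 4\) clusters of size 3, using the true variance \(N(N-n)\sigma^2_u/n\), where \(\sigma^2_u\) is the finite-population variance of the cluster totals (the population counterpart of \(s^2_u\)).

```python
def var_tau_hat(cluster_totals, n):
    """True variance of tau-hat = N * ybar under SRS of n primary units."""
    N = len(cluster_totals)
    mean = sum(cluster_totals) / N
    sigma2_u = sum((y - mean) ** 2 for y in cluster_totals) / (N - 1)
    return N * (N - n) * sigma2_u / n

population = list(range(1, 13))   # hypothetical y-values 1..12

# Design 1: neighbouring units grouped together (units within a cluster are alike).
compact = [population[i:i + 3] for i in range(0, 12, 3)]
# Design 2: every 4th unit grouped together (each cluster spans the population).
spread = [population[i::4] for i in range(4)]

results = {}
for name, clusters in [("compact", compact), ("spread", spread)]:
    totals = [sum(c) for c in clusters]
    results[name] = var_tau_hat(totals, n=2)
    print(name, totals, results[name])
```

In this toy case the clusters that each span the range of the population have much more similar totals, and the variance of \(\hat{\tau}\) is correspondingly smaller.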
Why or When Do We Use Cluster Sampling?
Will it give us a more precise estimator? In most cases, the answer is no.
We often use cluster sampling out of necessity, even though it tends to give a larger variance.
If the objective of sampling is to obtain a specified amount of information about a population parameter at minimum cost, cluster sampling sometimes gives more information per unit cost than simple random sampling, stratified sampling, or systematic sampling, because the cost of sampling units within a cluster may be much lower.
Cluster sampling is an effective design in two different scenarios:
- A good frame listing the population elements either is not available or is very costly to obtain, whereas a frame listing clusters is easily obtained.
- The cost of obtaining observations increases as the distance separating the elements increases.
Example 7.1 (Average Yearly Vacation Budget) Let’s look at an example of cluster sampling using a ratio estimator.
A sociologist wants to estimate the average yearly vacation budget per household in a certain city. It is given that there are 3,100 households in the city. The sociologist marked off the city into 400 blocks and treated them as 400 clusters. He then took a simple random sample of 24 clusters and interviewed every household in each sampled cluster. The data are given in the table below:
Cluster | Cluster size \(M_i\) | Total per cluster \(y_i\) |
---|---|---|
1 | 7 | 12,000 |
2 | 9 | 15,000 |
3 | 5 | 8,000 |
4 | 8 | 13,000 |
5 | 12 | 18,000 |
6 | 5 | 7,000 |
7 | 4 | 6,000 |
8 | 8 | 13,000 |
9 | 14 | 22,000 |
10 | 6 | 9,800 |
11 | 3 | 7,000 |
12 | 13 | 18,000 |
13 | 8 | 12,340 |
14 | 4 | 5,000 |
15 | 6 | 8,900 |
16 | 9 | 14,000 |
17 | 3 | 4,000 |
18 | 10 | 11,400 |
19 | 4 | 5,000 |
20 | 7 | 13,000 |
21 | 6 | 8,900 |
22 | 5 | 8,700 |
23 | 7 | 10,000 |
24 | 6 | 9,200 |
Total | 169 | 259,240 |
Using Minitab
To use Minitab to plot the total for cluster versus cluster size:
- Graph > Scatterplot
- Select ‘Total per Cluster’ as the \(Y\) variable
- Select ‘Cluster Size’ as the \(X\) variable
To use Minitab to display descriptive statistics:
Stat > Basic Statistics > Display Descriptive Statistics
Plotting these data lets us see whether the total for the cluster is roughly proportional to the cluster size.
Minitab output:
Regression Equation:
total for the cluster = 648 + 1442 cluster size
Coefficients
Predictor | Coef | StDev | T | P |
---|---|---|---|---|
Constant | 648.0 | 705.9 | 0.92 | 0.369 |
Cluster size | 1441.94 | 92.59 | 15.57 | 0.000 |
Descriptive Statistics for the Variables:
Variable | N | Mean | StDev |
---|---|---|---|
Cluster size | 24 | 7.042 | 2.985 |
Yi | 24 | 10802 | 4495 |
Yi -rMi | 24 | -0 | 1325 |
The ratio estimator for cluster sample (ratio-to-size):
If primary unit total \(y_i\) is highly correlated with cluster size \(M_i\), a ratio estimator based on size may be efficient. The ratio estimator of the population total is:
\[\hat{\tau}_r=r\cdot M\text{, where }r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}\]
The ratio estimator is biased, but the bias is small when the sample size is large. An estimator of its variance is:
\[\hat{\operatorname{Var}}(\hat{\tau}_r)=\dfrac{N(N-n)}{n(n-1)}\sum\limits_{i=1}^n (y_i-rM_i)^2\]
To estimate the population mean per secondary unit we have: \(\mu = \tau / M\)
The ratio estimator is:
\[\hat{\mu}_r=\dfrac{\hat{\tau}_r}{M}=r\]
Back to the example. To estimate the average yearly vacation budget for each household we will use:
\[\hat{\mu}_r=r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}\]
In this example, we see that \(N= 400\), the total number of blocks, and \(n = 24\). \(M\) in this case is as follows:
\[M=\sum\limits_{i=1}^N M_i=3100\]
Try It!
Find the ratio estimator for the average yearly vacation budget for each household in that city. Also, find the estimated variance for the ratio estimator.
For this example, \(M = 3100\), \(N = 400\), \(n = 24\)
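One way to carry out this computation is sketched below in Python, using the table from Example 7.1 and the ratio estimator formulas above. The variable names are illustrative choices, and dividing \(\hat{\operatorname{Var}}(\hat{\tau}_r)\) by \(M^2\) to get the variance of \(\hat{\mu}_r=\hat{\tau}_r/M\) is the standard scaling step.

```python
import math

N, n, M = 400, 24, 3100   # blocks in city, sampled blocks, households in city

# Cluster sizes M_i and cluster totals y_i from the table in Example 7.1.
sizes = [7, 9, 5, 8, 12, 5, 4, 8, 14, 6, 3, 13,
         8, 4, 6, 9, 3, 10, 4, 7, 6, 5, 7, 6]
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000,
          22000, 9800, 7000, 18000, 12340, 5000, 8900, 14000,
          4000, 11400, 5000, 13000, 8900, 8700, 10000, 9200]

# Ratio estimate of the mean per household: r = sum(y_i) / sum(M_i).
r = sum(totals) / sum(sizes)

# Var-hat(tau_r) = N(N-n) / (n(n-1)) * sum (y_i - r*M_i)^2, then divide by M^2.
ss_resid = sum((y - r * m) ** 2 for y, m in zip(totals, sizes))
var_tau_r = N * (N - n) / (n * (n - 1)) * ss_resid
var_mu_r = var_tau_r / M ** 2

print(f"mu_r = {r:.2f}")
print(f"estimated variance = {var_mu_r:.1f}, standard error = {math.sqrt(var_mu_r):.2f}")
```

The sum of squared residuals used here corresponds to the `Yi -rMi` row of the Minitab descriptive statistics, whose standard deviation is about 1325.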
If we used the unbiased estimator would our variance be larger or smaller?
For this example, we also want to compute the unbiased estimator for comparison purposes.
Try It!
Find the unbiased estimator for the average yearly vacation budget for each household in that city. Also, find the estimated variance for the unbiased estimator.
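A possible way to compute the unbiased estimator for the same data is sketched below, assuming \(\hat{\tau}=N\bar{y}\), \(\hat{\operatorname{Var}}(\hat{\tau})=N(N-n)s^2_u/n\), and \(\hat{\mu}=\hat{\tau}/M\). Only the cluster totals are needed; the variable names are illustrative.

```python
import math
import statistics

N, n, M = 400, 24, 3100   # blocks in city, sampled blocks, households in city

# Cluster totals y_i from the table in Example 7.1.
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000,
          22000, 9800, 7000, 18000, 12340, 5000, 8900, 14000,
          4000, 11400, 5000, 13000, 8900, 8700, 10000, 9200]

y_bar = statistics.mean(totals)        # mean total per sampled block
s2_u = statistics.variance(totals)     # sample variance with divisor n - 1

tau_hat = N * y_bar                    # estimated total budget for the city
var_tau_hat = N * (N - n) * s2_u / n

mu_hat = tau_hat / M                   # estimated mean per household
var_mu_hat = var_tau_hat / M ** 2

print(f"mu_hat = {mu_hat:.2f}")
print(f"estimated variance = {var_mu_hat:.1f}, standard error = {math.sqrt(var_mu_hat):.2f}")
```

The Minitab output already hints at the comparison with the ratio estimator: the standard deviation of \(y_i\) (4495) is much larger than that of \(y_i - rM_i\) (1325), so the unbiased estimator has the larger estimated variance here.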
Using R
Here is the code for R for this example:
Datafile: Vacation.txt
R code: Chapter7_Vacation Budget.R.txt
7.3 Estimator for Cluster Sampling when Primary units are selected by PPS
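The primary units are selected with replacement, with probabilities proportional to size:
\[p_i=M_i/M\]
Denote the mean per secondary unit in the \(i\)th primary unit by \(\bar{y}_i=\dfrac{y_i}{M_i}\). The Hansen-Hurwitz estimator of the population total is:
\[\hat{\tau}_p=\dfrac{1}{n}\sum\limits_{i=1}^n \dfrac{y_i}{p_i}=\dfrac{M}{n}\sum\limits_{i=1}^n \bar{y}_i\]
It is unbiased for \(\tau\), and an unbiased estimator of its variance is:
\[\hat{\operatorname{Var}}(\hat{\tau}_p)=\dfrac{M^2}{n(n-1)}\sum\limits_{i=1}^n (\bar{y}_i-\hat{\mu}_p)^2\]
where \(\hat{\mu}_p=\dfrac{\hat{\tau}_p}{M}=\dfrac{1}{n}\sum\limits_{i=1}^n \bar{y}_i\) is unbiased for \(\mu\).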
Thus we also see that:
\[\hat{\operatorname{Var}}(\hat{\mu}_p)=\dfrac{1}{n(n-1)}\sum\limits_{i=1}^n (\bar{y}_i-\hat{\mu}_p)^2\]
Try It!
Find the Hansen-Hurwitz estimator for the population mean and also find the variance of the estimator.
\[\begin{align} \hat{\mu}_p &= \dfrac{1}{n} \sum\limits_{i=1}^n \dfrac{y_i}{M_i}\\ &= \dfrac{1}{3}\times \left(\dfrac{420}{650}+\dfrac{1785}{2840}+\dfrac{2198}{3200}\right)\\ &= 0.6538\\ \end{align}\]
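The estimated variance can be obtained from the same three draws. The sketch below takes the \((y_i, M_i)\) pairs from the equation above and applies \(\hat{\operatorname{Var}}(\hat{\mu}_p)=\dfrac{1}{n(n-1)}\sum (\bar{y}_i-\hat{\mu}_p)^2\); the variable names are illustrative.

```python
import math

# (y_i, M_i) pairs taken from the worked equation above; n = 3 draws.
y = [420, 1785, 2198]
M_i = [650, 2840, 3200]
n = len(y)

# Mean per secondary unit within each selected primary unit.
y_bar_i = [yi / mi for yi, mi in zip(y, M_i)]

# Hansen-Hurwitz estimate of the population mean per secondary unit.
mu_p = sum(y_bar_i) / n

# Estimated variance: sum of squared deviations of the unit means / (n(n-1)).
var_mu_p = sum((yb - mu_p) ** 2 for yb in y_bar_i) / (n * (n - 1))

print(f"mu_p = {mu_p:.4f}")
print(f"estimated variance = {var_mu_p:.6f}, standard error = {math.sqrt(var_mu_p):.4f}")
```

With only three draws the variance estimate rests on two degrees of freedom, so it should be read as a rough indication of precision.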