8.2 - Variance and Cost in Cluster and Systematic Sampling versus S.R.S.

For simplicity, suppose that each of N primary units has an equal number \(\overline{M}\) of secondary units. To simplify the variance computations and to explore the relationship between cluster and simple random sampling, we note the identity:

\(\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\mu)^2= \sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2+\overline{M}\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\)

\(\text{where } \bar{y}_i=\sum\limits_{j=1}^{\overline{M}}\dfrac{y_{ij}}{\overline{M}}\)

SST = SSW + SSB

SST: total sum of squares
SSW: within-cluster sum of squares (within-primary units)
SSB: between-cluster sum of squares (between-primary units)

The within-primary-unit variance is:

\(\sigma^2_w=\left\{\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2\right\}/[N(\overline{M}-1)]\)

The between-primary-unit variance is:

\(\sigma^2_b=\left\{\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\right\}/(N-1)\)

The identity can be rewritten as:

\((N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\)
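Each term of the rewritten identity follows from the definitions above. Here \(\sigma^2\) is taken to be the finite-population variance of all \(N\overline{M}\) secondary units, with divisor \(N\overline{M}-1\); the text does not state this definition, so it is an assumption, but it is the one that makes the identity hold:

\(\text{SST}=(N\overline{M}-1)\sigma^2, \qquad \text{SSW}=N(\overline{M}-1)\sigma^2_w, \qquad \text{SSB}=\overline{M}\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2=(N-1)\overline{M}\sigma^2_b\)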

Thus, an unbiased estimator of \(\sigma^2\) from a simple random cluster sample is:

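\(\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}\)

where \(s^2_w\) and \(s^2_b\) are the sample within-cluster and between-cluster variances computed from the n sampled clusters.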

Since the data were obtained by cluster sampling, the ordinary sample variance \(s^2\) is not an appropriate estimator of \(\sigma^2\); we use \(\hat{\sigma}^2\) instead.

The relative efficiency of simple random sampling versus simple random cluster sampling is:

\(\dfrac{Var(\bar{y}_{srs})}{Var(\hat{\mu})}=\dfrac{\overline{M}\sigma^2}{\sigma^2_u}\)

It can be estimated by:

\(\dfrac{\hat{V}ar(\bar{y}_{srs})}{\hat{V}ar(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}\)

Note!

\(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n(y_i-\bar{y})^2=\dfrac{1}{n-1}\sum\limits_{i=1}^n (\overline{M}\bar{y}_i-\overline{M}\hat{\mu})^2={\overline{M}}^2\dfrac{\sum\limits_{i=1}^n(\bar{y}_i-\hat{\mu})^2}{n-1}={\overline{M}}^2 s^2_b\)

Recall: \(Var(\bar{y}_{srs})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n\overline{M}}\) and \(Var(\hat{\mu})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2_u}{n{\overline{M}}^2}\)

where \(\sigma^2_u\) is the finite-population variance of the cluster totals \(y_i\).
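Dividing the two variances recalled above shows where the relative efficiency formula comes from. The finite population correction \((N-n)/N\) appears in both and cancels:

\(\dfrac{Var(\bar{y}_{srs})}{Var(\hat{\mu})}=\dfrac{\sigma^2}{n\overline{M}}\cdot \dfrac{n{\overline{M}}^2}{\sigma^2_u}=\dfrac{\overline{M}\sigma^2}{\sigma^2_u}\)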

Example 8-2: Number of cell phones per household

The marketing research department of a communication company wishes to estimate the average number of cell phones purchased per household in a given community. Therefore, the 4,000 households in the community are listed in 400 geographical clusters of 10 households each, and a simple random sample of 4 clusters is selected to reduce the traveling cost for interviewing each household. The data are given in the following table:

| Cluster | Number of cell phones in each of the 10 households | Total |
|---|---|---|
| 1 | 3 5 6 4 5 6 3 2 4 5 | 43 |
| 2 | 2 0 2 1 1 0 1 1 0 1 | 9 |
| 3 | 3 2 3 2 4 2 2 1 2 2 | 23 |
| 4 | 5 2 3 2 1 1 2 2 4 1 | 23 |

 Using Minitab

Stat > ANOVA > One-way

The data should be entered in two columns: the response column contains all 40 responses, and the factor column indicates whether each response belongs to cluster 1, cluster 2, cluster 3, or cluster 4.

Minitab output

One-Way ANOVA: cluster 1, cluster 2, cluster 3, cluster 4
| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Factor | 3 | 58.70 | 19.57 | 16.31 | 0.000 |
| Error | 36 | 43.20 | 1.20 | | |
| Total | 39 | 101.90 | | | |
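For readers without Minitab, the sums of squares in the ANOVA table can be checked directly from the data. The following is a minimal Python sketch, assuming NumPy is available; the function name `anova_sums_of_squares` is our own illustration and is not part of the original example.

```python
import numpy as np

# Cell phone counts from the example: one row per sampled cluster.
clusters = np.array([
    [3, 5, 6, 4, 5, 6, 3, 2, 4, 5],
    [2, 0, 2, 1, 1, 0, 1, 1, 0, 1],
    [3, 2, 3, 2, 4, 2, 2, 1, 2, 2],
    [5, 2, 3, 2, 1, 1, 2, 2, 4, 1],
])

def anova_sums_of_squares(data):
    """Return (SS factor, SS error, SS total) for equal-sized clusters."""
    data = np.asarray(data, dtype=float)
    n_clusters, m = data.shape
    grand_mean = data.mean()
    cluster_means = data.mean(axis=1)
    # Between-cluster (factor) sum of squares, weighted by cluster size.
    ss_factor = m * np.sum((cluster_means - grand_mean) ** 2)
    # Within-cluster (error) sum of squares.
    ss_error = np.sum((data - cluster_means[:, None]) ** 2)
    # Total sum of squares about the grand mean.
    ss_total = np.sum((data - grand_mean) ** 2)
    return ss_factor, ss_error, ss_total

ss_factor, ss_error, ss_total = anova_sums_of_squares(clusters)
print(round(ss_factor, 2), round(ss_error, 2), round(ss_total, 2))
```

The three values should agree with the SS column of the Minitab output (58.70, 43.20, and 101.90), and the first two should add up to the third.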

Let's find the relative efficiency of simple random sampling versus cluster sampling for the data in this example.

In this example, N = 400, n = 4, and \(\overline{M}=10\).

We need to find \(s_b^2\) and \(s_w^2\).

Note the identity for the population: \((N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\)

The identity for the sample is: \((n\overline{M}-1)s^2=n(\overline{M}-1)s^2_w+(n-1)\overline{M}s^2_b\)

SS total = SS error + SS factor

From the ANOVA table of the example, we can find \(s_b^2\) by:

\(\text{SS factor}=(4-1)10s^2_b=58.70\)

\(s^2_b=\dfrac{58.70}{30}=1.957\)

We can find \(s_w^2\) by:

\(\text{SS error}=4(10-1)s^2_w=43.2\)

\(s^2_w=1.20\)

Try it!

Compute \(\hat{\sigma}^2\).

\(\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}=\dfrac{(400\times 9 \times 1.2)+[(400-1)\times 10 \times 1.957]}{400\times 10 -1}=3.03\)
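The same arithmetic can be scripted. This is a small Python sketch of the estimator, starting from the sums of squares in the ANOVA table; the function name `sigma2_hat` and its argument names are our own and only serve the illustration.

```python
def sigma2_hat(N, M, s2_w, s2_b):
    """Estimate sigma^2 from a simple random cluster sample.

    N: number of clusters in the population
    M: number of secondary units per cluster
    s2_w, s2_b: sample within- and between-cluster variances
    """
    return (N * (M - 1) * s2_w + (N - 1) * M * s2_b) / (N * M - 1)

# Sample variances recovered from the ANOVA sums of squares (n = 4, M = 10).
s2_b = 58.70 / ((4 - 1) * 10)   # SS factor / [(n - 1) * M]
s2_w = 43.20 / (4 * (10 - 1))   # SS error / [n * (M - 1)]

estimate = sigma2_hat(N=400, M=10, s2_w=s2_w, s2_b=s2_b)
print(round(estimate, 2))
```

Because N is large, the weights on \(s^2_w\) and \(s^2_b\) are close to 0.9 and 1.0, so the estimate sits well above the within-cluster variance of 1.20.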

To estimate the relative efficiency of simple random sampling versus cluster sampling, we also need \(s^2_u\), which follows from the relationship \(s^2_u={\overline{M}}^2 s^2_b\) noted earlier:

\(s^2_u={\overline{M}}^2 s^2_b=100 \times 1.957=195.7\)

Try it!

Compute the relative efficiency of simple random sampling versus cluster sampling. What does that tell us?
\(\dfrac{\hat{V}ar(\bar{y}_{srs})}{\hat{V}ar(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}=\dfrac{10 \times 3.03}{195.7}=0.155\)

 

What is this telling us?

Thus, for the same number of sampled households (40), the estimated variance under simple random sampling is only about 15.5% of the variance under cluster sampling. In this example, simple random sampling is more efficient if only the variance is considered.
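As a quick check of the ratio, here is a short Python sketch that plugs the rounded values from the example into the estimated relative efficiency formula. The function name `relative_efficiency` is hypothetical and used only for this illustration.

```python
def relative_efficiency(M, sigma2_hat, s2_b):
    """Estimated Var(srs) / Var(cluster) for clusters of equal size M."""
    s2_u = M ** 2 * s2_b          # variance of cluster totals
    return M * sigma2_hat / s2_u

re = relative_efficiency(M=10, sigma2_hat=3.03, s2_b=1.957)
print(round(re, 3))       # ratio of SRS variance to cluster variance
print(round(1 / re, 2))   # how many times larger the cluster variance is
```

Read the other way around, the reciprocal says that the cluster sample's variance is roughly six and a half times that of a simple random sample of the same 40 households.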

Note! It is a BIG mistake to analyze a cluster sample as if it were a simple random sample. Doing so often yields a reported standard error that is much smaller than it should be, so your conclusions will be far more optimistic than the data justify.
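To see the size of this mistake in the cell phone example, the sketch below compares two standard errors for the estimated mean: one obtained by (incorrectly) plugging the ordinary sample variance \(s^2\) into the simple random sampling formula, and one obtained from the cluster sampling formula with \(s^2_u\). It uses the variance formulas recalled earlier with sample estimates in place of the population variances; the variable names are our own.

```python
import math

N, n, M = 400, 4, 10                 # clusters in population, clusters sampled, cluster size
ss_factor, ss_total = 58.70, 101.90  # from the ANOVA table

s2 = ss_total / (n * M - 1)          # ordinary sample variance of the 40 responses
s2_b = ss_factor / ((n - 1) * M)     # between-cluster sample variance
s2_u = M ** 2 * s2_b                 # sample variance of the cluster totals

fpc = (N - n) / N                    # finite population correction

# Naive analysis: treats the 40 households as a simple random sample.
se_naive = math.sqrt(fpc * s2 / (n * M))
# Cluster analysis: uses the variability between cluster totals.
se_cluster = math.sqrt(fpc * s2_u / (n * M ** 2))

print(round(se_naive, 3), round(se_cluster, 3))
```

With these data the naive standard error comes out well under half of the cluster-based one, which is exactly the over-optimism the note warns about.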