# 8.2 - Variance and Cost in Cluster and Systematic Sampling versus S.R.S.

For simplicity, suppose that each of *N* primary units has an equal number \(\overline{M}\) of secondary units. To simplify the variance computations and to explore the relationship between cluster and simple random sampling, we note the identity:

\(\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\mu)^2= \sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2+\overline{M}\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\)

\(\text{where } \bar{y}_i=\sum\limits_{j=1}^{\overline{M}}\dfrac{y_{ij}}{\overline{M}}\)

SST = SSW + SSB

SST: the total sum of squares

SSW: within-cluster sum of squares (within-primary units)

SSB: between-cluster sum of squares (between-primary units)

The within-primary-unit variance is:

\(\sigma^2_w=\left\{\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2\right\}/[N(\overline{M}-1)]\)

The between-primary-unit variance is:

\(\sigma^2_b=\left\{\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\right\}/(N-1)\)

The identity can be rewritten as:

\((N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\)
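Because SST = SSW + SSB is an algebraic identity, it can be checked numerically. The following Python sketch (with an arbitrary, made-up population of *N* = 3 primary units, chosen only for illustration) verifies the rewritten identity:

```python
# Numerical check of the identity
#   (N*Mbar - 1) * sigma^2 = N*(Mbar - 1) * sigma_w^2 + (N - 1) * Mbar * sigma_b^2
# using a small synthetic population (values are arbitrary, for illustration only).

population = [
    [3, 5, 6, 4],   # primary unit 1
    [2, 0, 2, 1],   # primary unit 2
    [3, 2, 3, 2],   # primary unit 3
]
N = len(population)          # number of primary units
Mbar = len(population[0])    # secondary units per primary unit

mu = sum(sum(c) for c in population) / (N * Mbar)   # overall mean
cluster_means = [sum(c) / Mbar for c in population]

sigma2 = sum((y - mu) ** 2 for c in population for y in c) / (N * Mbar - 1)
sigma2_w = sum((y - yb) ** 2
               for c, yb in zip(population, cluster_means)
               for y in c) / (N * (Mbar - 1))
sigma2_b = sum((yb - mu) ** 2 for yb in cluster_means) / (N - 1)

lhs = (N * Mbar - 1) * sigma2                               # SST
rhs = N * (Mbar - 1) * sigma2_w + (N - 1) * Mbar * sigma2_b  # SSW + SSB
assert abs(lhs - rhs) < 1e-9
```

Any population with equal-sized primary units will satisfy the assertion, since the decomposition holds exactly.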

Thus, an unbiased estimator of \(\sigma^2\) from a simple random cluster sample is:

\(\hat{\sigma}^2=\dfrac{N(\overline{M}-1)S^2_w+(N-1)\overline{M}S^2_b}{N\overline{M}-1}\)

Because the data were obtained by cluster sampling, we cannot use \(s^2\) to estimate \(\sigma^2\); instead, we use \(\hat{\sigma}^2\).

The relative efficiency of simple random sampling versus simple random cluster sampling is:

\(\dfrac{Var(\bar{y}_{srs})}{Var(\hat{\mu})}=\dfrac{\overline{M}\sigma^2}{\sigma^2_u}\)

It can be estimated by:

\(\dfrac{\hat{V}ar(\bar{y}_{srs})}{\hat{V}ar(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}\)

**Note!**

\(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n(y_i-\bar{y})^2=\dfrac{1}{n-1}\sum\limits_{i=1}^n (\overline{M}\bar{y}_i-\overline{M}\hat{\mu})^2={\overline{M}}^2\dfrac{\sum\limits_{i=1}^n(\bar{y}_i-\hat{\mu})^2}{n-1}={\overline{M}}^2 s^2_b\)

**Recall**: \(Var(\bar{y}_{srs})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n\overline{M}}\) and \(Var(\hat{\mu})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2_u}{n{\overline{M}}^2}\)

where \(\sigma^2_u\) is the finite population variance of the cluster totals \(y_i=\overline{M}\bar{y}_i\).

## Example 8-2: Number of cell phones per household

The marketing research department of a communication company wishes to estimate the average number of cell phones purchased per household in a given community. Therefore, the 4,000 households in the community are listed in 400 geographical clusters of 10 households each, and a simple random sample of 4 clusters is selected to reduce the traveling cost for interviewing each household. The data are given in the following table:

| Cluster | Number of cell phones | Total |
|---|---|---|
| 1 | 3, 5, 6, 4, 5, 6, 3, 2, 4, 5 | 43 |
| 2 | 2, 0, 2, 1, 1, 0, 1, 1, 0, 1 | 9 |
| 3 | 3, 2, 3, 2, 4, 2, 2, 1, 2, 2 | 23 |
| 4 | 5, 2, 3, 2, 1, 1, 2, 2, 4, 1 | 23 |
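Before turning to Minitab, the ANOVA sums of squares for these data can be computed directly. A minimal Python sketch (variable names are ours):

```python
# One-way ANOVA sums of squares for the cell-phone data:
# 4 sampled clusters of 10 households each.

clusters = [
    [3, 5, 6, 4, 5, 6, 3, 2, 4, 5],   # cluster 1, total 43
    [2, 0, 2, 1, 1, 0, 1, 1, 0, 1],   # cluster 2, total 9
    [3, 2, 3, 2, 4, 2, 2, 1, 2, 2],   # cluster 3, total 23
    [5, 2, 3, 2, 1, 1, 2, 2, 4, 1],   # cluster 4, total 23
]
n, Mbar = len(clusters), len(clusters[0])

grand_mean = sum(sum(c) for c in clusters) / (n * Mbar)   # 2.45
means = [sum(c) / Mbar for c in clusters]                 # cluster means

# SS factor (between clusters), SS error (within clusters), SS total
ss_factor = Mbar * sum((m - grand_mean) ** 2 for m in means)
ss_error = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)
ss_total = sum((y - grand_mean) ** 2 for c in clusters for y in c)

print(ss_factor, ss_error, ss_total)   # approximately 58.70, 43.20, 101.90
```

These values match the Minitab output shown below.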

#### Using Minitab

`Stat` > `ANOVA` > `One-way`

The data should be entered in two columns: the response column contains all 40 responses, and the factor column indicates whether each observation comes from cluster 1, 2, 3, or 4.

##### Minitab output

##### One-Way ANOVA: cluster 1, cluster 2, cluster 3, cluster 4

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Factor | 3 | 58.70 | 19.57 | 16.31 | 0.000 |
| Error | 36 | 43.20 | 1.20 | | |
| Total | 39 | 101.90 | | | |

Let's find the relative efficiency of simple random sampling versus cluster sampling for the data in this example.

In this example, *N* = 400, *n* = 4, and \(\overline{M}=10\).

We need to find \(s_b^2, s_w^2\).

Note the identity for the population: \((N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\)

The identity for the sample is: \((n\overline{M}-1)s^2=n(\overline{M}-1)s^2_w+(n-1)\overline{M}s^2_b\)

*SS total = SS error + SS factor*

From the ANOVA table of the example, we can find \(s_b^2\) by:

\(\text{SS factor}=(n-1)\overline{M}s^2_b=(4-1)\times 10\times s^2_b=58.70\)

\(s^2_b=\dfrac{58.70}{30}=1.957\)

We can find \(s_w^2\) by:

\(\text{SS error}=n(\overline{M}-1)s^2_w=4\times(10-1)\times s^2_w=43.2\)

\(s^2_w=\dfrac{43.2}{36}=1.20\)

#### Try it!

\(\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}=\dfrac{(400\times 9 \times 1.2)+[(400-1)\times 10 \times 1.957]}{400\times 10 -1}=3.03\)

And now we can determine the relative efficiency of simple random sampling versus cluster sampling by plugging the values into the formula:

\(s^2_u={\overline{M}}^2 s^2_b=100 \times 1.957=195.7\)

#### Try it!

\(\dfrac{\hat{V}ar(\bar{y}_{srs})}{\hat{V}ar(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}=\dfrac{10 \times 3.03}{195.7}=0.155\)

What is this telling us?

Thus, the variance of simple random sampling is only 15.5% of that of cluster sampling when the same number of observations is used. We can see that *in this example* simple random sampling is **more** efficient than cluster sampling if only variance is considered.
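The calculation above can be reproduced end to end. A short Python sketch (variable names are ours) starts from the ANOVA sums of squares and returns \(\hat{\sigma}^2\) and the estimated relative efficiency:

```python
# Relative efficiency of S.R.S. versus cluster sampling for Example 8-2,
# starting from the one-way ANOVA sums of squares.

N, n, Mbar = 400, 4, 10
ss_factor, ss_error = 58.70, 43.20

s2_b = ss_factor / ((n - 1) * Mbar)   # between-cluster variance, ~1.957
s2_w = ss_error / (n * (Mbar - 1))    # within-cluster variance, 1.20

# Unbiased estimate of the population variance sigma^2
sigma2_hat = (N * (Mbar - 1) * s2_w + (N - 1) * Mbar * s2_b) / (N * Mbar - 1)

s2_u = Mbar ** 2 * s2_b               # sample variance of the cluster totals

rel_eff = (Mbar * sigma2_hat) / s2_u  # estimated Var(srs) / Var(cluster)
print(round(sigma2_hat, 2), round(rel_eff, 3))   # 3.03 0.155
```

The relative efficiency of about 0.155 reproduces the 15.5% figure: under equal sample sizes, the simple-random-sampling variance estimate is a small fraction of the cluster-sampling variance estimate for these data.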

**Note!** It is a **BIG** mistake to analyze a cluster sample as if it were a simple random sample: the reported standard error is often much smaller than it should be, so you will end up being far more optimistic about your results than is warranted.