# Lesson 8: Part 2 of Cluster and Systematic Sampling

## Overview

In Section 8.1, we introduce systematic sampling and state why it may be a challenge to estimate the variance when only one primary unit is taken. Then the repeated systematic sampling is introduced so that the variance can be estimated. We then provide an example of repeated systematic sampling.

In Section 8.2, the variance for cluster and systematic sampling is decomposed into between-cluster and within-cluster variances. We then provide an estimate of the relative efficiency of simple random sampling versus simple random cluster sampling, and an example compares the variances of these two sampling methods. Note that it is not uncommon for cluster sampling to be much less efficient than simple random sampling, as this example illustrates.

*Sampling* by Steven Thompson, 3rd edition

## Objectives

- know why and when to use systematic sampling,
- know why and when to use repeated systematic sampling,
- compute the within cluster variance and the between cluster variance, and
- compute relative efficiency of the cluster sampling compared to simple random sampling.

# 8.1 - Systematic Sampling

Suppose you have a number of students lined up in a row:

1 2 3 4 5 6 7 8 9 10 11 12

Here we might take a sample every 4 elements, or 1 in 4 elements from the population. (1, 5, 9) or (2, 6, 10), etc. There are four primary units: (1, 5, 9), (2, 6, 10), (3, 7, 11), (4, 8, 12).
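The four primary units listed above can be enumerated mechanically. The following is a minimal Python sketch; the function name `systematic_primary_units` is an illustrative choice, not something defined in the lesson.

```python
def systematic_primary_units(population, k):
    """Split an ordered population into the k possible 1-in-k systematic samples."""
    # Start at positions 0, 1, ..., k-1 and take every k-th element from there.
    return [population[start::k] for start in range(k)]

students = list(range(1, 13))      # students 1..12 lined up in a row
units = systematic_primary_units(students, 4)
for unit in units:
    print(unit)
```

Each printed list is one primary unit; a 1-in-4 systematic sample amounts to picking one of these four lists at random.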

To sample systematically from a field, the following is one example:

1 | 2 | 3 | 4 |

5 | 6 | 7 | 8 |

9 | 10 | 11 | 12 |

13 | 14 | 15 | 16 |

There are four primary units: (1, 3, 9, 11), (2, 4, 10, 12), (5, 7, 13, 15), (6, 8, 14, 16).
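The same idea extends to the two-dimensional field: taking every second row and every second column gives the four primary units above. This is a small sketch under the assumption that plots are numbered row by row starting from 1; the function `grid_primary_units` is hypothetical.

```python
def grid_primary_units(n_rows, n_cols, step):
    """Group the cells of a grid (numbered row by row from 1) into systematic
    primary units that take every `step`-th row and every `step`-th column."""
    grid = [[r * n_cols + c + 1 for c in range(n_cols)] for r in range(n_rows)]
    units = []
    for r0 in range(step):            # starting row offset
        for c0 in range(step):        # starting column offset
            units.append([grid[r][c]
                          for r in range(r0, n_rows, step)
                          for c in range(c0, n_cols, step)])
    return units

field_units = grid_primary_units(4, 4, 2)
for unit in field_units:
    print(unit)
```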

How do we draw a 1 in *k* systematic sample?

**Example:** Suppose our population is 9,000 students and we want to sample 1,200 students. How do we sample these students systematically?

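Since 9000/1200 = 7.5, we can perform a 1-in-7 systematic sample; that is, we sample every 7th student. We pick a starting point at random from 1 to 600 and sample every 7th student from there on until we have 1,200 students. (With a start of at most 600, the 1,200th student sits at position 600 + 7 × 1199 = 8,993 at the latest, so the sample stays within the 9,000 students.)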
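As a rough illustration of this procedure, the sketch below draws one such sample in Python. The seed and the helper name `one_in_k_sample` are arbitrary choices made for the example.

```python
import random

def one_in_k_sample(start, k, n):
    """Return the n positions start, start + k, ..., start + (n - 1) * k (1-based)."""
    return [start + i * k for i in range(n)]

N, n = 9000, 1200
k = N // n                       # 9000 / 1200 = 7.5, rounded down to 7
rng = random.Random(0)           # arbitrary seed, only for reproducibility
start = rng.randint(1, 600)      # random starting point between 1 and 600
sample = one_in_k_sample(start, k, n)

print(k, len(sample), sample[:3], sample[-1])
```

Whatever starting point is drawn, the last sampled position is `start + 7 * 1199`, which never exceeds 9,000 when the start is at most 600.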

How do we estimate the variance of this single systematic sample?

We cannot use the formula:

\(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n (y_i-\bar{y})^2\)

since *n* = 1: only one primary unit is selected.

If the population is randomly ordered, then there is no problem. We can estimate the variance \(\sigma^2\) by:

\(s^2=\dfrac{\sum\limits_{j=1}^{M_1}(y_{1j}-\bar{y}_1)^2}{M_1-1}\)

However, when the population is **ordered**, systematic sampling is usually **better** than simple random sampling, and the above formula will **overestimate** the variance.

When the population is **periodic**, systematic sampling may be **worse** than simple random sampling, and the above formula will **underestimate** the variance: if the sampling interval *k* is chosen poorly relative to the period, the elements sampled may be too similar to each other.
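Both effects can be seen on toy populations. The sketch below compares the true variance of the systematic sample mean (over all *k* equally likely starts) with the average value of the naive simple-random-sampling formula, including the finite population correction. The two populations are made up for illustration and are not taken from the lesson.

```python
from statistics import mean, pvariance, variance

def compare(population, k):
    """Return (true variance of the systematic sample mean,
    average naive SRS-style variance estimate) over all k possible starts."""
    N = len(population)
    samples = [population[s::k] for s in range(k)]
    n = len(samples[0])
    # Each start is equally likely, so the true variance is the population
    # variance of the k possible sample means.
    true_var = pvariance([mean(s) for s in samples])
    fpc = (N - n) / N
    naive = mean([fpc * variance(s) / n for s in samples])
    return true_var, naive

ordered = list(range(1, 41))      # steadily increasing values (hypothetical)
periodic = [1, 2, 3, 4] * 10      # period 4, equal to the sampling interval

true_ord, naive_ord = compare(ordered, 4)
true_per, naive_per = compare(periodic, 4)
print("ordered :", true_ord, naive_ord)
print("periodic:", true_per, naive_per)
```

For the ordered population the naive estimate is far larger than the true variance, while for the periodic population every systematic sample is constant, so the naive estimate is zero even though the sample means differ widely.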

#### Repeated Systematic Sampling

Unless the population is randomly ordered, we cannot use the naive method to compute the variance (see page 162 of the textbook for more advanced approaches). Thus, we need more than one primary unit.

## Example 8-1: Repeated systematic sampling of ferry cars

(see p. 247 of Scheaffer, Mendenhall and Ott)

A ferry that carries cars across a bay charges a fee by the carload rather than by person. The ferry company wants to estimate the average number of people per car for the month of August. The company knows from last year that 400 cars took the ferry and they want to sample 80 cars. To facilitate the estimation of the variance of the systematic sample, the investigator chooses to use repeated systematic sampling with 10 samples of 8 cars each. Use the data given in the following table to estimate the average number of persons per car and also provide an estimate of the variance.

How do we obtain the random numbers for the repeated systematic sampling?

We will select 10 repeated systematic samples of 8 cars each, so each one is a 1-in-*k* sample with *k* = 400/8 = 50. From the values 1 to 50, 10 numbers are selected without replacement, and each serves as the starting point of one 1-in-50 systematic sample.

The 10 numbers sampled randomly without replacement from 1 to 50 are: 2, 5, 7, 13, 26, 31, 35, 40, 45, 46. In the following table, the car that will be sampled is listed with the number of people per car (the response) in parentheses.

Random starting point | Second element | Third element | Fourth element | Fifth element | Sixth element | Seventh element | Eighth element | \(\bar{y}_i\) mean |
---|---|---|---|---|---|---|---|---|
2(3) | 52(4) | 102(5) | 152(3) | 202(6) | 252(1) | 302(4) | 352(4) | 3.75 |
5(5) | 55(3) | 105(4) | 155(2) | 205(4) | 255(2) | 305(3) | 355(4) | 3.38 |
7(2) | 57(4) | 107(6) | 157(2) | 207(3) | 257(2) | 307(1) | 357(3) | 2.88 |
13(6) | 63(4) | 113(6) | 163(7) | 213(2) | 263(3) | 313(2) | 363(7) | 4.62 |
26(4) | 76(5) | 126(7) | 176(4) | 226(2) | 276(6) | 326(2) | 376(6) | 4.50 |
31(7) | 81(6) | 131(4) | 181(4) | 231(3) | 281(6) | 331(7) | 381(5) | 5.25 |
35(3) | 85(3) | 135(2) | 185(3) | 235(6) | 285(5) | 335(6) | 385(8) | 4.50 |
40(2) | 90(6) | 140(2) | 190(5) | 240(5) | 290(4) | 340(4) | 390(5) | 4.12 |
45(2) | 95(6) | 145(3) | 195(6) | 245(4) | 295(4) | 345(5) | 395(4) | 4.25 |
46(6) | 96(5) | 146(4) | 196(6) | 246(3) | 296(3) | 346(5) | 396(3) | 4.38 |
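The car numbers in the table follow directly from the starting points: each sample takes its start and then every 50th car. A minimal Python sketch is shown below; the seed and the helper `cars_for` are illustrative assumptions, and a fresh random draw will generally give different starts from the ones used in the example.

```python
import random

N_CARS, PER_SAMPLE, N_SAMPLES = 400, 8, 10
k = N_CARS // PER_SAMPLE                     # 1-in-50 sampling

def cars_for(start):
    """Car numbers in the 1-in-k systematic sample beginning at `start`."""
    return [start + i * k for i in range(PER_SAMPLE)]

# Drawing 10 distinct starting points from 1..50 (arbitrary seed).
rng = random.Random(2024)
new_starts = sorted(rng.sample(range(1, k + 1), N_SAMPLES))

# The starting points reported in the example.
example_starts = [2, 5, 7, 13, 26, 31, 35, 40, 45, 46]
example_cars = [cars_for(s) for s in example_starts]

print(new_starts)
print(example_cars[0])   # cars sampled from starting point 2
```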

#### Try it!

- The total number of primary units *N* = ?
- The number of primary units sampled *n* = ?
- The number of secondary units in the *i*th primary unit \(M_i\) = ?
- The total number of secondary units in the population *M* = ?

Answers:

- The total number of primary units *N* = 50
- The number of primary units sampled *n* = 10
- The number of secondary units in the *i*th primary unit \(M_i\) = 8
- The total number of secondary units in the population \(M=\sum\limits_{i=1}^{50}M_i=400\)

To estimate the population mean \(\mu=\tau/M\), we can use the unbiased estimator:

\(\hat{\mu}=\dfrac{\hat{\tau}}{M}=\sum\limits_{i=1}^n \dfrac{\bar{y}_i}{n}=4.16\)

\(\text{where } \bar{y}_i=\dfrac{y_i}{M_i}=\dfrac{\sum\limits_{j=1}^{M_i} y_{ij}}{M_i} \text{for }i=1,2,\ldots,n.\)

In this example, all primary units have the same size: \(\overline{M}=M_1=M_2=\ldots=M_n=8\).

#### Try it!

The estimated variance of \(\hat{\mu}\) is:

\begin{align}
\hat{V}ar(\hat{\mu}) &= \dfrac{M-n\cdot \overline{M}}{M}\cdot \dfrac{1}{n(n-1)} \cdot \sum\limits_{i=1}^n (\bar{y}_i-\hat{\mu})^2\\
&= \dfrac{400-10 \cdot 8}{400} \cdot \dfrac{1}{10(9)}\cdot [(3.75-4.16)^2+\cdots+(4.38-4.16)^2]\\
&= 0.0365
\end{align}
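The estimate and its variance can be checked numerically from the counts in the table. The sketch below uses unrounded primary-unit means, so the variance comes out near 0.0367 rather than the 0.0365 obtained above from means rounded to two decimals; the variable names are illustrative.

```python
# People per car for each of the 10 systematic samples (rows of the table).
counts = [
    [3, 4, 5, 3, 6, 1, 4, 4],
    [5, 3, 4, 2, 4, 2, 3, 4],
    [2, 4, 6, 2, 3, 2, 1, 3],
    [6, 4, 6, 7, 2, 3, 2, 7],
    [4, 5, 7, 4, 2, 6, 2, 6],
    [7, 6, 4, 4, 3, 6, 7, 5],
    [3, 3, 2, 3, 6, 5, 6, 8],
    [2, 6, 2, 5, 5, 4, 4, 5],
    [2, 6, 3, 6, 4, 4, 5, 4],
    [6, 5, 4, 6, 3, 3, 5, 3],
]

N, n, M_bar = 50, 10, 8          # primary units, sampled units, cars per unit
M = N * M_bar                    # 400 cars in total

unit_means = [sum(row) / M_bar for row in counts]
mu_hat = sum(unit_means) / n

# Variance estimate: finite population correction times the variance of
# the primary-unit means divided by n.
ss = sum((yb - mu_hat) ** 2 for yb in unit_means)
var_hat = (M - n * M_bar) / M * ss / (n * (n - 1))

print(round(mu_hat, 2), round(var_hat, 4))
```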

#### Try it!

When using the *t* distribution, what are the degrees of freedom? (Hint: consider how many primary units you have, then compute the degrees of freedom.)

# 8.2 - Variance and Cost in Cluster and Systematic Sampling versus S.R.S.

For simplicity, suppose that each of *N* primary units has an equal number \(\overline{M}\) of secondary units. To simplify the variance computations and to explore the relationship between cluster and simple random sampling, we note the identity:

\(\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\mu)^2= \sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2+\overline{M}\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\)

\(\text{where } \bar{y}_i=\sum\limits_{j=1}^{\overline{M}}\dfrac{y_{ij}}{\overline{M}}\)

In words, SST = SSW + SSB, where:

- SST: the total sum of squares
- SSW: the within-cluster sum of squares (within primary units)
- SSB: the between-cluster sum of squares (between primary units)

The within-primary-unit variance is:

\(\sigma^2_w=\left\{\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2\right\}/[N(\overline{M}-1)]\)

The between-primary-unit variance is:

\(\sigma^2_b=\left\{\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\right\}/(N-1)\)

The identity can be rewritten as:

\((N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\)
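The decomposition and its rewritten form can be verified numerically on any population with equal-sized clusters. The following sketch uses a small made-up population (the values are hypothetical) and takes \(\sigma^2\) to be the finite population variance with divisor \(N\overline{M}-1\), as the identity implies.

```python
# Hypothetical population: N = 4 primary units with M_bar = 3 secondary units each.
clusters = [[1, 2, 3], [4, 4, 7], [0, 5, 10], [2, 2, 2]]
N = len(clusters)
M_bar = len(clusters[0])

all_y = [y for c in clusters for y in c]
mu = sum(all_y) / len(all_y)
means = [sum(c) / M_bar for c in clusters]

sst = sum((y - mu) ** 2 for y in all_y)
ssw = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)
ssb = M_bar * sum((m - mu) ** 2 for m in means)

sigma2 = sst / (N * M_bar - 1)                         # overall variance
sigma2_w = ssw / (N * (M_bar - 1))                     # within-primary-unit variance
sigma2_b = sum((m - mu) ** 2 for m in means) / (N - 1) # between-primary-unit variance

lhs = (N * M_bar - 1) * sigma2
rhs = N * (M_bar - 1) * sigma2_w + (N - 1) * M_bar * sigma2_b
print(sst, ssw + ssb, lhs, rhs)
```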

Thus, an unbiased estimator of \(\sigma^2\) from a simple random cluster sample is:

\(\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}\)

where \(s^2_w\) and \(s^2_b\) are the sample within-primary-unit and between-primary-unit variances. Since the data were obtained by cluster sampling, we cannot use \(s^2\) to estimate \(\sigma^2\), but we can use \(\hat{\sigma}^2\).

The relative efficiency of simple random sampling versus simple random cluster sampling is:

\(\dfrac{Var(\bar{y}_{srs})}{Var(\hat{\mu})}=\dfrac{\overline{M}\sigma^2}{\sigma^2_u}\)

It can be estimated by:

\(\dfrac{\hat{V}ar(\bar{y}_{srs})}{\hat{V}ar(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}\)

**Note!**

\(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n(y_i-\bar{y})^2=\dfrac{1}{n-1}\sum\limits_{i=1}^n (\overline{M}\bar{y}_i-\overline{M}\hat{\mu})^2={\overline{M}}^2\dfrac{\sum\limits_{i=1}^n(\bar{y}_i-\hat{\mu})^2}{n-1}={\overline{M}}^2 s^2_b\)

**Recall**: \(Var(\bar{y}_{srs})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n\overline{M}}\) and \(Var(\hat{\mu})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2_u}{n{\overline{M}}^2}\)

where \(\sigma^2_u\) is the finite population variance of \(y_i\) .
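The relative efficiency formula is just the ratio of the two variance formulas recalled above, with the common factor \((N-n)/(Nn)\) cancelling. A short sketch with made-up inputs (all numeric values and function names here are hypothetical) confirms this:

```python
def var_srs_mean(N, n, M_bar, sigma2):
    """Variance of the mean of a simple random sample of n * M_bar secondary units."""
    return (N - n) / N * sigma2 / (n * M_bar)

def var_cluster_mean(N, n, M_bar, sigma2_u):
    """Variance of the cluster-sampling estimator of the mean per secondary unit."""
    return (N - n) / N * sigma2_u / (n * M_bar ** 2)

def relative_efficiency(M_bar, sigma2, sigma2_u):
    """Var(SRS mean) / Var(cluster mean), as given by the closed-form ratio."""
    return M_bar * sigma2 / sigma2_u

# Hypothetical population quantities, chosen only to exercise the formulas.
N, n, M_bar, sigma2, sigma2_u = 100, 5, 10, 4.0, 250.0

ratio = var_srs_mean(N, n, M_bar, sigma2) / var_cluster_mean(N, n, M_bar, sigma2_u)
closed_form = relative_efficiency(M_bar, sigma2, sigma2_u)
print(ratio, closed_form)
```

A ratio below 1 means the simple random sample has the smaller variance for the same number of secondary units.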

## Example 8-2: Number of cell phones per household

The marketing research department of a communication company wishes to estimate the average number of cell phones purchased per household in a given community. Therefore, the 4,000 households in the community are listed in 400 geographical clusters of 10 households each, and a simple random sample of 4 clusters is selected to reduce the traveling cost for interviewing each household. The data are given in the following table:

Cluster | Number of cell phones | | | | | | | | | | Total |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 5 | 6 | 4 | 5 | 6 | 3 | 2 | 4 | 5 | 43 |
2 | 2 | 0 | 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 9 |
3 | 3 | 2 | 3 | 2 | 4 | 2 | 2 | 1 | 2 | 2 | 23 |
4 | 5 | 2 | 3 | 2 | 1 | 1 | 2 | 2 | 4 | 1 | 23 |

#### Using Minitab

`Stat` > `ANOVA` > `One-way`

The data should be entered in two columns: the response column contains all 40 responses, and the factor column indicates which cluster (1, 2, 3, or 4) each response belongs to.

##### Minitab output

##### One-Way ANOVA: cluster 1, cluster 2, cluster 3, cluster 4

Source | DF | SS | MS | F | P |
---|---|---|---|---|---|
Factor | 3 | 58.70 | 19.57 | 16.31 | 0.000 |
Error | 36 | 43.20 | 1.20 | | |
Total | 39 | 101.90 | | | |
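For readers without Minitab, the sums of squares in this table can be reproduced directly from the data. This is a plain-Python sketch of the one-way ANOVA arithmetic, not Minitab's own procedure; it does not compute the P-value.

```python
# Cell phones per household, one row per sampled cluster.
clusters = [
    [3, 5, 6, 4, 5, 6, 3, 2, 4, 5],
    [2, 0, 2, 1, 1, 0, 1, 1, 0, 1],
    [3, 2, 3, 2, 4, 2, 2, 1, 2, 2],
    [5, 2, 3, 2, 1, 1, 2, 2, 4, 1],
]
n = len(clusters)            # 4 sampled clusters
M_bar = len(clusters[0])     # 10 households per cluster

all_y = [y for c in clusters for y in c]
grand_mean = sum(all_y) / len(all_y)
means = [sum(c) / M_bar for c in clusters]

ss_factor = M_bar * sum((m - grand_mean) ** 2 for m in means)           # between clusters
ss_error = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)  # within clusters
ss_total = sum((y - grand_mean) ** 2 for y in all_y)

df_factor, df_error = n - 1, n * (M_bar - 1)
ms_factor = ss_factor / df_factor
ms_error = ss_error / df_error
f_stat = ms_factor / ms_error

print(round(ss_factor, 2), round(ss_error, 2), round(ss_total, 2), round(f_stat, 2))
```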

Let's find the relative efficiency of simple random sampling versus cluster sampling for the data in this example.

In this example, *N* = 400, *n* = 4, and \(\overline{M}=10\).

We need to find \(s_b^2, s_w^2\).

Note the identity for the population: \((N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\)

The identity for the sample is: \((n\overline{M}-1)s^2=n(\overline{M}-1)s^2_w+(n-1)\overline{M}s^2_b\)

*SS total = SS error + SS factor*

From the ANOVA table of the example, we can find \(s_b^2\) by:

\(\text{SS factor}=(4-1)10s^2_b=58.70\)

\(s^2_b=\dfrac{58.70}{30}=1.957\)

We can find \(s_w^2\) by:

\(\text{SS error}=4(10-1)s^2_w=43.2\)

\(s^2_w=1.20\)
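These two values, together with the sample identity, can be checked from the ANOVA sums of squares. In the sketch below, \(s^2\) is taken to be SS total divided by \(n\overline{M}-1=39\), which is what the sample identity implies.

```python
n, M_bar = 4, 10
ss_factor, ss_error, ss_total = 58.70, 43.20, 101.90   # from the ANOVA table

s2_b = ss_factor / ((n - 1) * M_bar)     # between-cluster sample variance
s2_w = ss_error / (n * (M_bar - 1))      # within-cluster sample variance
s2 = ss_total / (n * M_bar - 1)          # ordinary sample variance of all 40 values

# Sample identity: (n*M_bar - 1) s^2 = n (M_bar - 1) s_w^2 + (n - 1) M_bar s_b^2
lhs = (n * M_bar - 1) * s2
rhs = n * (M_bar - 1) * s2_w + (n - 1) * M_bar * s2_b

print(round(s2_b, 3), round(s2_w, 2), round(lhs, 2), round(rhs, 2))
```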

#### Try it!

\(\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}=\dfrac{(400\times 9 \times 1.2)+[(400-1)\times 10 \times 1.957]}{400\times 10 -1}=3.03\)

And now we can determine the relative efficiency of simple random sampling versus cluster sampling by plugging the values into the formula:

\(s^2_u={\overline{M}}^2 s^2_b=100 \times 1.957=195.7\)

#### Try it!

\(\dfrac{\hat{V}ar(\bar{y}_{srs})}{\hat{V}ar(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}=\dfrac{10 \times 3.03}{195.7}=0.155\)

What is this telling us?

Thus, the variance of simple random sampling is just 15.5% of that of the cluster sampling if the same sample size is used. We can see that *in this example* simple random sampling is **more** efficient if only variance is considered.
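The whole calculation for this example fits in a few lines. The sketch below starts from the ANOVA sums of squares and carries unrounded intermediate values, so the results agree with the figures above after rounding.

```python
N, n, M_bar = 400, 4, 10
ss_factor, ss_error = 58.70, 43.20            # from the ANOVA table

s2_b = ss_factor / ((n - 1) * M_bar)          # about 1.957
s2_w = ss_error / (n * (M_bar - 1))           # 1.20

# Estimate of the population variance sigma^2 from the cluster sample.
sigma2_hat = (N * (M_bar - 1) * s2_w + (N - 1) * M_bar * s2_b) / (N * M_bar - 1)

# Sample variance of the cluster totals, then the estimated relative efficiency.
s2_u = M_bar ** 2 * s2_b
rel_eff = M_bar * sigma2_hat / s2_u

print(round(sigma2_hat, 2), round(s2_u, 1), round(rel_eff, 3))
```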

**Note!** It is a **BIG** mistake to analyze a cluster sample as if it were a simple random sample; the reported standard error is often much smaller than it should be. You will end up being much too optimistic about your results, rather than as conservative as you should be.
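To see the size of this mistake in the cell phone example, the sketch below compares the standard error from the cluster-sampling formula with the one obtained by wrongly treating the 40 households as a simple random sample from the 4,000. The "naive" calculation is shown only to illustrate the error; the variable names are illustrative.

```python
from math import sqrt

N, n, M_bar = 400, 4, 10
ss_factor, ss_total = 58.70, 101.90           # from the ANOVA table

# Correct analysis: clusters are the sampling units.
s2_u = M_bar ** 2 * ss_factor / ((n - 1) * M_bar)
var_cluster = (N - n) / N * s2_u / (n * M_bar ** 2)

# Mistaken analysis: pretend the 40 households are a simple random sample.
s2 = ss_total / (n * M_bar - 1)
var_naive = (N * M_bar - n * M_bar) / (N * M_bar) * s2 / (n * M_bar)

se_cluster, se_naive = sqrt(var_cluster), sqrt(var_naive)
print(round(se_cluster, 3), round(se_naive, 3))
```

Under these formulas the naive standard error is well under half the cluster-based one, which is the over-optimism the note warns about.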