8  Part 2 of Cluster and Systematic Sampling

Overview

In Section 8.1, we introduce systematic sampling and explain why it can be difficult to estimate the variance when only one primary unit is taken. We then introduce repeated systematic sampling, which makes the variance estimable, and provide an example of repeated systematic sampling.

In Section 8.2, the variance for cluster and systematic sampling is decomposed into between-cluster and within-cluster variances. We then provide an estimate of the relative efficiency of simple random sampling versus simple random cluster sampling, and an example comparing the variances of the two sampling methods. Note that it is not uncommon for cluster sampling to be much less efficient than simple random sampling, as this example illustrates.

Lesson 8: Ch. 12.4-12.5 of Sampling by Steven Thompson, 3rd Edition.

Objectives

Upon completion of this lesson you should be able to:

  1. Identify the appropriate reasons and situations to use systematic sampling,
  2. Identify the appropriate reasons and situations to use repeated systematic sampling,
  3. Compute the within-cluster variance and the between-cluster variance, and
  4. Compute the relative efficiency of cluster sampling compared to simple random sampling.

8.1 Systematic Sampling

Suppose you have a number of students lined up in a row:

1 2 3 4 5 6 7 8 9 10 11 12

Here we might take a sample of every 4th element, that is, a 1-in-4 systematic sample from the population: (1, 5, 9) or (2, 6, 10), etc. There are four primary units: (1, 5, 9), (2, 6, 10), (3, 7, 11), and (4, 8, 12).

To sample systematically from a field, the following is one example:

[Figure: a 4-by-4 grid of 16 plots, with each plot shaded in one of four colors to indicate which primary unit it belongs to.]

There are four primary units: (1, 3, 9, 11), (2, 4, 10, 12), (5, 7, 13, 15), and (6, 8, 14, 16).

How do we draw a 1 in \(k\) systematic sample?

Example: Suppose our population is 9,000 students and we want to sample 1,200 students. How do we sample these students systematically?

Since \(9000/1200 = 7.5\) is not a whole number, we can perform a 1-in-7 systematic sample; in other words, we sample every 7th student. We randomly pick a starting point from 1 to 600 and sample every 7th student from there until we have 1,200 students. (A starting point of at most 600 guarantees that the 1,200th selection, the starting point plus \(7 \times 1199\), does not exceed 9,000.)
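
The following is a minimal Python sketch of how such a 1-in-7 systematic draw could be carried out (the use of the random module and the variable names are our own illustration, not part of the text):

import random

N = 9000   # population size (students)
n = 1200   # desired sample size
k = 7      # sampling interval for a 1-in-7 systematic sample

# A starting point of at most 600 guarantees start + 7*1199 <= 9000.
start = random.randint(1, 600)
sample = [start + k * i for i in range(n)]

print(len(sample), sample[:3], sample[-1])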

How do we estimate the variance of this single systematic sample?

We cannot use the formula:

\[s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n (y_i-\bar{y})^2\]

since \(n = 1\). Only one primary unit is selected.

If the population is randomly ordered, then there is no problem. We can estimate the variance \(\sigma^2\) by:

\[s^2=\dfrac{\sum\limits_{j=1}^{M_1}(y_{1j}-\bar{y}_1)^2}{M_1-1}\]

However, when the population is ordered (for example, the values follow a trend), systematic sampling is usually better than simple random sampling, and the above formula will overestimate the variance.

When the population is periodic, systematic sampling may be worse than simple random sampling, and the above formula will underestimate the variance: if the sampling interval \(k\) is chosen poorly relative to the period, the sampled elements may be too similar to one another. For example, with daily data that follow a weekly cycle, a 1-in-7 sample hits the same day of the week every time.

Repeated Systematic Sampling

Unless the population is randomly ordered, we cannot use this naive method to estimate the variance. [Look in the textbook on page 162 for more advanced ways.] Thus, we need more than one primary unit.

Example 8.1 (Repeated systematic sampling of ferry cars)  

(see p.247 of Scheaffer, Mendenhall and Ott)

A ferry that carries cars across a bay charges a fee per carload rather than per person. The ferry company wants to estimate the average number of people per car for August. The company knows from last year that 400 cars took the ferry, and it wants to sample 80 cars. To facilitate estimating the variance of the systematic sample, the investigator chooses repeated systematic sampling with 10 samples of 8 cars each. Use the data given in the following table to estimate the average number of persons per car and also provide an estimate of the variance of that estimate.

How do we obtain the random numbers for repeated systematic sampling?

We will select 10 repeated systematic samples with 8 cars in each. Since \(400/8 = 50\), each is a 1-in-50 systematic sample. From the values 1 to 50, we select 10 numbers at random without replacement, and each selected number becomes the starting point of a 1-in-50 systematic sample.
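
As a small sketch of this selection step in Python (the variable names are our own; the starting points drawn will of course differ from run to run):

import random

M = 400   # cars taking the ferry in August
k = 50    # interval for each 1-in-50 systematic sample

# Choose 10 starting points at random, without replacement, from 1, ..., 50.
starts = sorted(random.sample(range(1, k + 1), 10))

# Each starting point s yields the systematic sample s, s + 50, ..., s + 350 (8 cars).
samples = {s: list(range(s, M + 1, k)) for s in starts}
for s, cars in samples.items():
    print(s, cars)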

The 10 numbers sampled randomly without replacement from 1 to 50 are 2, 5, 7, 13, 26, 31, 35, 40, 45, and 46. In the following table, the car that will be sampled is listed with the number of people per car (the response) in parentheses.

Random starting point Second element Third element Fourth element Fifth element Sixth element Seventh element Eighth element \(\bar{y}_i\) mean
2(3) 52(4) 102(5) 152(3) 202(6) 252(1) 302(4) 352(4) 3.75
5(5) 55(3) 105(4) 155(2) 205(4) 255(2) 305(3) 355(4) 3.38
7(2) 57(4) 107(6) 157(2) 207(3) 257(2) 307(1) 357(3) 2.88
13(6) 63(4) 113(6) 163(7) 213(2) 263(3) 313(2) 363(7) 4.62
26(4) 76(5) 126(7) 176(4) 226(2) 276(6) 326(2) 376(6) 4.50
31(7) 81(6) 131(4) 181(4) 231(3) 281(6) 331(7) 381(5) 5.25
35(3) 85(3) 135(2) 185(3) 235(6) 285(5) 335(6) 385(8) 4.50
40(2) 90(6) 140(2) 190(5) 240(5) 290(4) 340(4) 390(5) 4.12
45(2) 95(6) 145(3) 195(6) 245(4) 295(4) 345(5) 395(4) 4.25
46(6) 96(5) 146(4) 196(6) 246(3) 296(3) 346(5) 396(3) 4.38

Try It!

For the above “Passengers in a car” example, determine the following:

  1. The total number of primary units \(N=\)?
  2. The number of primary units sampled \(n=\)?
  3. The number of secondary units in the \(i\)th primary unit \(M_i =\)?
  4. The total number of secondary units in the population \(M=\)?

Answers:

  1. The total number of primary units: \(N = 50\)
  2. The number of primary units sampled: \(n = 10\)
  3. The number of secondary units in the \(i\)th primary unit: \(M_i = 8\)
  4. The total number of secondary units in the population: \(M=\sum\limits_{i=1}^{50}M_i=400\)

To estimate the population mean \(\mu =\tau/ M\), we can use the unbiased estimator:

\[\hat{\mu}=\dfrac{\hat{\tau}}{M}=\sum\limits_{i=1}^n \dfrac{\bar{y}_i}{n}=4.16\]

where \(\bar{y}_i=\dfrac{y_i}{M_i}=\dfrac{\sum\limits_{j=1}^{M_i} y_{ij}}{M_i}\) for \(i=1,2,\ldots,n\).

In this example, the primary units all have the same size, \(\overline{M}=M_1=M_2=\ldots=M_n=8\), which is why \(\hat{\mu}=\hat{\tau}/M\) reduces to the simple average of the primary-unit means \(\bar{y}_i\).

Try It!

Compute the variance of the above estimator.

\[\begin{align} \hat{\operatorname{Var}}(\hat{\mu}) &= \dfrac{M-n\overline{M}}{M}\cdot \dfrac{1}{n(n-1)} \cdot \sum\limits_{i=1}^n (\bar{y}_i-\hat{\mu})^2\\ &= \dfrac{400-10 \cdot 8}{400} \cdot \dfrac{1}{10(9)}\cdot [(3.75-4.16)^2+\cdots+(4.38-4.16)^2]\\ &= 0.0365 \end{align}\]

Try It!

When we use a \(t\) confidence interval to estimate \(\mu\), what are the degrees of freedom? (Hint: consider how many primary units you have, then compute the degrees of freedom.)

There are 10 primary units, so the degrees of freedom are \(10 - 1 = 9\).
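
The calculations for this example can be reproduced with a short Python sketch using the counts from the table above (a sketch only; the quantile \(t_{0.025,\,9} \approx 2.262\) is hard-coded rather than looked up):

from math import sqrt

# Passengers per car for the 10 repeated systematic samples (rows of the table).
samples = [
    [3, 4, 5, 3, 6, 1, 4, 4],   # start 2
    [5, 3, 4, 2, 4, 2, 3, 4],   # start 5
    [2, 4, 6, 2, 3, 2, 1, 3],   # start 7
    [6, 4, 6, 7, 2, 3, 2, 7],   # start 13
    [4, 5, 7, 4, 2, 6, 2, 6],   # start 26
    [7, 6, 4, 4, 3, 6, 7, 5],   # start 31
    [3, 3, 2, 3, 6, 5, 6, 8],   # start 35
    [2, 6, 2, 5, 5, 4, 4, 5],   # start 40
    [2, 6, 3, 6, 4, 4, 5, 4],   # start 45
    [6, 5, 4, 6, 3, 3, 5, 3],   # start 46
]

n, M_bar, M = 10, 8, 400
ybar = [sum(s) / M_bar for s in samples]              # primary-unit means

mu_hat = sum(ybar) / n                                # about 4.16
ss = sum((y - mu_hat) ** 2 for y in ybar)
var_hat = (M - n * M_bar) / M * ss / (n * (n - 1))    # about 0.0365

t = 2.262                                             # t quantile with 9 df
ci = (mu_hat - t * sqrt(var_hat), mu_hat + t * sqrt(var_hat))
print(round(mu_hat, 2), round(var_hat, 4), ci)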

8.2 Variance and Cost in Cluster and Systematic Sampling versus SRS

For simplicity, suppose that each of \(N\) primary units has an equal number \(\overline{M}\) of secondary units. To simplify the variance computations and to explore the relationship between cluster and simple random sampling, we note the identity:

\[\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\mu)^2= \sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2+\overline{M}\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\]

\[\text{where }\bar{y}_i=\sum\limits_{j=1}^{\overline{M}}\dfrac{y_{ij}}{\overline{M}}\]
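
This identity follows by writing \(y_{ij}-\mu=(y_{ij}-\bar{y}_i)+(\bar{y}_i-\mu)\) and expanding the square; the cross term vanishes because the deviations from \(\bar{y}_i\) sum to zero within each primary unit:

\[\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\mu)^2= \sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2+2\sum\limits_{i=1}^N (\bar{y}_i-\mu)\underbrace{\sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)}_{=\,0}+\overline{M}\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\]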

SST = SSW + SSB

  • SST: the total sum of squares
  • SSW: the within-cluster sum of squares (within primary units)
  • SSB: the between-cluster sum of squares (between primary units)

The within-primary-unit variance is:

\[\sigma^2_w=\left\{\sum\limits_{i=1}^N \sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2\right\}/[N(\overline{M}-1)]\]

The between-primary-unit variance is:

\[\sigma^2_b=\left\{\sum\limits_{i=1}^N (\bar{y}_i-\mu)^2\right\}/(N-1)\]

The identity can be rewritten as:

\[(N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\]

Thus, an unbiased estimator of \(\sigma^2\) from a simple random cluster sample is:

\[\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}\]

Since the data were obtained by cluster sampling, we cannot use the ordinary sample variance \(s^2\) of the observed secondary units to estimate \(\sigma^2\); instead, we use \(\hat{\sigma}^2\).

The relative efficiency of simple random sampling versus simple random cluster sampling is:

\[\dfrac{\operatorname{Var}(\bar{y}_{\text{SRS}})}{\operatorname{Var}(\hat{\mu})}=\dfrac{\overline{M}\sigma^2}{\sigma^2_u}\]

It can be estimated by:

\[\dfrac{\hat{\operatorname{Var}}(\bar{y}_{\text{SRS}})}{\hat{\operatorname{Var}}(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}\]

Note!

\[\begin{align} s^2_u&=\dfrac{1}{n-1}\sum\limits_{i=1}^n(y_i-\bar{y})^2\\ &=\dfrac{1}{n-1}\sum\limits_{i=1}^n (\overline{M}\bar{y}_i-\overline{M}\hat{\mu})^2\\ &={\overline{M}}^2 \dfrac{\sum\limits_{i=1}^n(\bar{y}_i-\hat{\mu})^2}{n-1}\\ &={\overline{M}}^2 s^2_b \end{align}\]

Recall: \(\operatorname{Var}(\bar{y}_{\text{SRS}})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n\overline{M}}\) and \(\operatorname{Var}(\hat{\mu})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2_u}{n{\overline{M}}^2}\)

where \(\sigma^2_u\) is the finite population variance of the primary-unit totals \(y_i\).
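
Dividing the two expressions, the finite population correction and the factor \(n\overline{M}\) cancel, which gives the ratio stated above:

\[\dfrac{\operatorname{Var}(\bar{y}_{\text{SRS}})}{\operatorname{Var}(\hat{\mu})}=\dfrac{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n\overline{M}}}{\dfrac{N-n}{N}\cdot \dfrac{\sigma^2_u}{n{\overline{M}}^2}}=\dfrac{\overline{M}\sigma^2}{\sigma^2_u}\]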

Example 8.2 (Number of cell phones per household) The marketing research department of a communications company wishes to estimate the average number of cell phones purchased per household in a given community. The 4,000 households in the community are listed in 400 geographical clusters of 10 households each, and a simple random sample of 4 clusters is selected to reduce the cost of traveling to interview each household. The data are given in the following table:

Cluster | Number of cell phones | Total
1 | 3 5 6 4 5 6 3 2 4 5 | 43
2 | 2 0 2 1 1 0 1 1 0 1 | 9
3 | 3 2 3 2 4 2 2 1 2 2 | 23
4 | 5 2 3 2 1 1 2 2 4 1 | 23

Using Minitab

Stat > ANOVA > One-way

The data should be entered in two columns: one response column containing all 40 responses, and one factor column indicating whether each response comes from cluster 1, cluster 2, cluster 3, or cluster 4.

Minitab output:

One-Way ANOVA: cluster 1, cluster 2, cluster 3, cluster 4

Source DF SS MS F P
Factor 3 58.70 19.57 16.31 0.000
Error 36 43.20 1.20 - -
Total 39 101.90 - - -

Let’s find the relative efficiency of simple random sampling versus cluster sampling for the data in this example.

In this example, \(N = 400\), \(n = 4\), and \(\overline{M}=10\).

We need to find \(s_b^2\), \(s_w^2\).

Note the identity for the population:

\[(N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\]

The identity for the sample is:

\[(n\overline{M}-1)s^2=n(\overline{M}-1)s^2_w+(n-1)\overline{M}s^2_b\]

This is the familiar ANOVA decomposition SS total = SS error + SS factor, where SS total \(=(n\overline{M}-1)s^2\), SS error \(=n(\overline{M}-1)s^2_w\), and SS factor \(=(n-1)\overline{M}s^2_b\).

From the ANOVA table of the example, we can find \(s_b^2\) by:

\[\text{SS factor}=(4-1)10s^2_b=58.70\]

\[s^2_b=\dfrac{58.70}{30}=1.957\]

We can find \(s_w^2\) by:

\[\text{SS error}=4(10-1)s^2_w=43.2\]

\[s^2_w=\dfrac{43.2}{36}=1.20\]

Try It!

Compute \(\hat{\sigma}^2\).

\[\begin{align} \hat{\sigma}^2&=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}\\ &=\dfrac{(400\times 9 \times 1.2)+[(400-1)\times 10 \times 1.957]}{400\times 10 -1}\\ &=3.03 \end{align}\]

To determine the relative efficiency of simple random sampling versus cluster sampling, we first compute \(s^2_u\):

\[s^2_u={\overline{M}}^2 s^2_b=100 \times 1.957=195.7\]

Try It!

Compute the relative efficiency of simple random sampling versus cluster sampling. What does that tell us?

\[\dfrac{\hat{\operatorname{Var}}(\bar{y}_{\text{SRS}})}{\hat{\operatorname{Var}}(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}=\dfrac{10 \times 3.03}{195.7}=0.155\]

What is this telling us?

Thus, the variance under simple random sampling is only 15.5% of the variance under cluster sampling when the same sample size is used. In this example, simple random sampling is more efficient than cluster sampling if only the variance (and not the cost of sampling) is considered.
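
As a check, the quantities in this example can be reproduced directly from the raw cluster data with a short Python sketch (it uses only the formulas above; the variable names are our own):

# Four sampled clusters of 10 households each (Example 8.2).
clusters = [
    [3, 5, 6, 4, 5, 6, 3, 2, 4, 5],
    [2, 0, 2, 1, 1, 0, 1, 1, 0, 1],
    [3, 2, 3, 2, 4, 2, 2, 1, 2, 2],
    [5, 2, 3, 2, 1, 1, 2, 2, 4, 1],
]

N, n, M_bar = 400, 4, 10

means = [sum(c) / M_bar for c in clusters]
mu_hat = sum(means) / n                      # grand mean of the sample

# ANOVA-style sums of squares.
ss_factor = M_bar * sum((m - mu_hat) ** 2 for m in means)                  # 58.7
ss_error = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)   # 43.2

s2_b = ss_factor / ((n - 1) * M_bar)         # 1.957
s2_w = ss_error / (n * (M_bar - 1))          # 1.20

sigma2_hat = (N * (M_bar - 1) * s2_w + (N - 1) * M_bar * s2_b) / (N * M_bar - 1)  # 3.03
s2_u = M_bar ** 2 * s2_b                     # 195.7

rel_eff = M_bar * sigma2_hat / s2_u          # about 0.155
print(round(sigma2_hat, 2), round(rel_eff, 3))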

Note! It is a big mistake to analyze a cluster sample as if it were a simple random sample; the reported standard error will often be much smaller than it should be, so you will end up being far too optimistic, rather than appropriately conservative, about your results.