8 Part 2 of Cluster and Systematic Sampling
Overview
In Section 8.1, we introduce systematic sampling and state why it may be a challenge to estimate the variance when only one primary unit is taken. Then repeated systematic sampling is introduced so that the variance can be estimated. We then provide an example of repeated systematic sampling.
In Section 8.2, the variance for cluster and systematic sampling is decomposed into between-cluster and within-cluster variances. We then provide an estimate of the relative efficiency of simple random sampling versus simple random cluster sampling. An example is provided to compare the variances of these two sampling methods. Note that it is not uncommon for cluster sampling to be much less efficient than simple random sampling, as this example illustrates.
Lesson 8: Ch. 12.4-12.5 of Sampling by Steven Thompson, 3rd Edition.
Objectives
Upon completion of this lesson you should be able to:
- Identify the appropriate reasons and situations to use systematic sampling,
- Identify the appropriate reasons and situations to use repeated systematic sampling,
- Compute the within-cluster variance and the between-cluster variance, and
- Compute the relative efficiency of cluster sampling compared to simple random sampling.
8.1 Systematic Sampling
Suppose you have a number of students lined up in a row:
1 2 3 4 5 6 7 8 9 10 11 12
Here we might sample every 4th element, that is, 1 in 4 elements of the population: (1, 5, 9) or (2, 6, 10), etc. There are four possible primary units: (1, 5, 9), (2, 6, 10), (3, 7, 11), and (4, 8, 12).
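The same idea extends to two dimensions. For example, to sample systematically from a field laid out as a 4 × 4 grid of plots numbered 1 to 16, we can take every second plot in each direction.
There are four primary units: (1, 3, 9, 11), (2, 4, 10, 12), (5, 7, 13, 15), and (6, 8, 14, 16).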
How do we draw a 1 in \(k\) systematic sample?
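A common procedure is to pick a random starting point between 1 and \(k\) and then take every \(k\)th unit after it. The sketch below assumes the units are labelled 1 to `population_size` and that the population size is a multiple of \(k\); the function name `systematic_sample` is illustrative and not from the textbook.

```python
import random

def systematic_sample(population_size, k, rng=random):
    """Draw a 1-in-k systematic sample of unit labels 1..population_size."""
    # Random start: each of 1..k is equally likely (randint is inclusive).
    start = rng.randint(1, k)
    # Take the start and every k-th unit after it.
    return list(range(start, population_size + 1, k))

# The 12 students lined up in a row, sampled 1 in 4.
sample = systematic_sample(12, 4)
print(sample)
```

Each run returns one of the four primary units listed above, each with probability 1/4.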
Repeated Systematic Sampling
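A single systematic sample is one primary unit (\(n = 1\)), so it carries no information about the variation between primary units. Unless the population is randomly ordered, we can’t use the naive method to compute the variance. [Look in the textbook on page 162 for more advanced ways.] Thus, we need more than one primary unit.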
Example 8.1 (Repeated systematic sampling of ferry cars)
(see p.247 of Scheaffer, Mendenhall and Ott)
A ferry that carries cars across a bay charges a fee by carload rather than by person. The ferry company wants to estimate the average number of people per car for August. The company knows from last year that 400 cars took the ferry, and it wants to sample 80 cars. To facilitate estimation of the variance, the investigator chooses repeated systematic sampling with 10 samples of 8 cars each. Use the data given in the following table to estimate the average number of persons per car, and provide an estimate of the variance of that estimator.
How do we obtain the random numbers for repeated systematic sampling?
We will select 10 repeated systematic samples of 8 cars each, so the sampling interval is \(k = 400/8 = 50\); that is, each sample is a 1-in-50 systematic sample. From the values 1 to 50, 10 numbers are selected at random without replacement, and each one serves as the starting point of a 1-in-50 systematic sample.
The 10 numbers sampled randomly without replacement from 1 to 50 are 2, 5, 7, 13, 26, 31, 35, 40, 45, and 46. In the following table, the car that will be sampled is listed with the number of people per car (the response) in parentheses.
Random starting point | Second element | Third element | Fourth element | Fifth element | Sixth element | Seventh element | Eighth element | \(\bar{y}_i\) mean |
---|---|---|---|---|---|---|---|---|
2(3) | 52(4) | 102(5) | 152(3) | 202(6) | 252(1) | 302(4) | 352(4) | 3.75 |
5(5) | 55(3) | 105(4) | 155(2) | 205(4) | 255(2) | 305(3) | 355(4) | 3.38 |
7(2) | 57(4) | 107(6) | 157(2) | 207(3) | 257(2) | 307(1) | 357(3) | 2.88 |
13(6) | 63(4) | 113(6) | 163(7) | 213(2) | 263(3) | 313(2) | 363(7) | 4.62 |
26(4) | 76(5) | 126(7) | 176(4) | 226(2) | 276(6) | 326(2) | 376(6) | 4.50 |
31(7) | 81(6) | 131(4) | 181(4) | 231(3) | 281(6) | 331(7) | 381(5) | 5.25 |
35(3) | 85(3) | 135(2) | 185(3) | 235(6) | 285(5) | 335(6) | 385(8) | 4.50 |
40(2) | 90(6) | 140(2) | 190(5) | 240(5) | 290(4) | 340(4) | 390(5) | 4.12 |
45(2) | 95(6) | 145(3) | 195(6) | 245(4) | 295(4) | 345(5) | 395(4) | 4.25 |
46(6) | 96(5) | 146(4) | 196(6) | 246(3) | 296(3) | 346(5) | 396(3) | 4.38 |
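The car numbers in the table follow directly from the ten random starts and the interval of 50. The following is a minimal sketch of that bookkeeping; the helper name `repeated_systematic_sample` and the constants are illustrative assumptions, and `random_starts` only shows how fresh starts could be drawn.

```python
import random

N_CARS = 400                 # cars expected, from last year's count
PER_SAMPLE = 8               # cars in each systematic sample
N_SAMPLES = 10               # number of repeated systematic samples
K = N_CARS // PER_SAMPLE     # sampling interval: 400 / 8 = 50

def repeated_systematic_sample(starts, k, per_sample):
    """Map each random start to the car numbers in its 1-in-k sample."""
    return {s: [s + k * j for j in range(per_sample)] for s in starts}

# In practice the starts are drawn at random, without replacement, from 1..K.
random_starts = sorted(random.sample(range(1, K + 1), N_SAMPLES))

# The starts reported in the example.
example_starts = [2, 5, 7, 13, 26, 31, 35, 40, 45, 46]
cars = repeated_systematic_sample(example_starts, K, PER_SAMPLE)
print(cars[35])  # [35, 85, 135, 185, 235, 285, 335, 385]
```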
Try It!
For the above “Passengers in a car” example, determine the following:
- The total number of primary units \(N=\)?
- The number of primary units sampled \(n=\)?
- The number of secondary units in the ith primary unit \(M_i =\)?
- The total number of secondary units in the population \(M=\)?
Answers:
- The total number of primary units \(N = 50\)
- The number of primary units sampled \(n = 10\)
- The number of secondary units in the ith primary unit \(M_i = 8\)
- The total number of secondary units in the population \(M=\sum\limits_{i=1}^{50}M_i=400\)
To estimate the population mean \(\mu =\tau/ M\) we can use the unbiased estimator. The estimator is:
\[\hat{\mu}=\dfrac{\hat{\tau}}{M}=\sum\limits_{i=1}^n \dfrac{\bar{y}_i}{n}=4.16\]
where \(\bar{y}_i=\dfrac{y_i}{M_i}=\dfrac{\sum\limits_{j=1}^{M_i} y_{ij}}{M_i}\) for \(i=1,2,\ldots,n\).
In this example, all primary units have the same size: \(\overline{M}=M_1=M_2=\ldots=M_n=8\).
Try It!
Compute the variance of the above estimator.
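One way to carry out this computation is sketched below. It assumes the usual variance estimator for equal-sized primary units, \(\hat{\operatorname{Var}}(\hat{\mu})=\dfrac{N-n}{N}\cdot\dfrac{s^2_u}{n\overline{M}^2}\), where \(s^2_u\) is the sample variance of the primary unit totals \(y_i\). The variable names are illustrative.

```python
# Persons per car for each of the 10 systematic samples (rows of the table).
counts = [
    [3, 4, 5, 3, 6, 1, 4, 4],
    [5, 3, 4, 2, 4, 2, 3, 4],
    [2, 4, 6, 2, 3, 2, 1, 3],
    [6, 4, 6, 7, 2, 3, 2, 7],
    [4, 5, 7, 4, 2, 6, 2, 6],
    [7, 6, 4, 4, 3, 6, 7, 5],
    [3, 3, 2, 3, 6, 5, 6, 8],
    [2, 6, 2, 5, 5, 4, 4, 5],
    [2, 6, 3, 6, 4, 4, 5, 4],
    [6, 5, 4, 6, 3, 3, 5, 3],
]
N, n, M_bar = 50, 10, 8   # primary units in population, sampled, and their size

totals = [sum(row) for row in counts]        # primary unit totals y_i
mean_total = sum(totals) / n                 # average total per primary unit
mu_hat = mean_total / M_bar                  # estimated persons per car

# Sample variance of the primary unit totals.
s_u2 = sum((y - mean_total) ** 2 for y in totals) / (n - 1)

# Estimated variance of mu_hat, with the finite population correction.
var_mu_hat = (N - n) / N * s_u2 / (n * M_bar ** 2)

print(round(mu_hat, 2), round(var_mu_hat, 4))  # 4.16 0.0367
```

The estimated variance is about 0.0367, so the standard error of \(\hat{\mu}\) is about 0.19 persons per car.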
Try It!
When we use a confidence interval to estimate \(\mu\), we use the \(t\) distribution. What are the degrees of freedom? (Hint: consider how many primary units are sampled.)
There are 10 sampled primary units. Therefore, there are \(n - 1 = 9\) degrees of freedom.
8.2 Variance and Cost in Cluster and Systematic Sampling versus SRS
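For simplicity, suppose that each of the \(N\) primary units has the same number \(\overline{M}\) of secondary units. To simplify the variance computations and to explore the relationship between cluster sampling and simple random sampling, we note the identity:
\[\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{\overline{M}}(y_{ij}-\mu)^2=\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2+\overline{M}\sum\limits_{i=1}^{N}(\bar{y}_i-\mu)^2\]
\[\text{where }\bar{y}_i=\sum\limits_{j=1}^{\overline{M}}\dfrac{y_{ij}}{\overline{M}}\]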
SST = SSW + SSB
- SST: the total sum of squares
- SSW: within-cluster sum of squares (within-primary units)
- SSB: between-cluster sum of squares (between-primary units)
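The within-primary-unit variance is:
\[\sigma^2_w=\dfrac{1}{N(\overline{M}-1)}\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2\]
The between-primary-unit variance is:
\[\sigma^2_b=\dfrac{1}{N-1}\sum\limits_{i=1}^{N}(\bar{y}_i-\mu)^2\]
With \(\sigma^2\) denoting the finite population variance of the \(N\overline{M}\) secondary units, the identity can be rewritten as:
\[(N\overline{M}-1)\sigma^2=N(\overline{M}-1)\sigma^2_w+(N-1)\overline{M}\sigma^2_b\]
Thus, an unbiased estimator of \(\sigma^2\) from a simple random cluster sample is:
\[\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}\]
where \(s^2_w=\dfrac{1}{n(\overline{M}-1)}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{\overline{M}}(y_{ij}-\bar{y}_i)^2\) and \(s^2_b=\dfrac{1}{n-1}\sum\limits_{i=1}^{n}(\bar{y}_i-\hat{\mu})^2\) are the sample counterparts of \(\sigma^2_w\) and \(\sigma^2_b\).
Since the data were obtained by cluster sampling, we cannot use the ordinary sample variance \(s^2\) to estimate \(\sigma^2\), but we can use \(\hat{\sigma}^2\).
The relative efficiency of simple random sampling versus simple random cluster sampling is:
\[\dfrac{\operatorname{Var}(\bar{y}_{\text{SRS}})}{\operatorname{Var}(\hat{\mu})}=\dfrac{\overline{M}\sigma^2}{\sigma^2_u}\]
It can be estimated by:
\[\dfrac{\hat{\operatorname{Var}}(\bar{y}_{\text{SRS}})}{\hat{\operatorname{Var}}(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}\]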
Recall: \(\operatorname{Var}(\bar{y}_{\text{SRS}})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2}{n\overline{M}}\) and \(\operatorname{Var}(\hat{\mu})=\dfrac{N-n}{N}\cdot \dfrac{\sigma^2_u}{n{\overline{M}}^2}\)
where \(\sigma^2_u\) is the finite population variance of the primary unit totals \(y_i\).
Example 8.2 (Number of cell phones per household) The marketing research department of a communication company wishes to estimate the average number of cell phones purchased per household in a given community. Therefore, the 4,000 households in the community are listed in 400 geographical clusters of 10 households each, and a simple random sample of 4 clusters is selected to reduce the traveling cost for interviewing each household. The data are given in the following table:
Cluster | Number of cell phones (one column per household) | | | | | | | | | | Total |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 5 | 6 | 4 | 5 | 6 | 3 | 2 | 4 | 5 | 43 |
2 | 2 | 0 | 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 9 |
3 | 3 | 2 | 3 | 2 | 4 | 2 | 2 | 1 | 2 | 2 | 23 |
4 | 5 | 2 | 3 | 2 | 1 | 1 | 2 | 2 | 4 | 1 | 23 |
Using Minitab
Stat > ANOVA > One-way
The data should be entered in two columns: the response column contains all 40 responses, and the factor column indicates which cluster (1, 2, 3, or 4) each response belongs to.
Minitab output:
One-Way ANOVA: cluster 1, cluster 2, cluster 3, cluster 4
Source | DF | SS | MS | F | P |
---|---|---|---|---|---|
Factor | 3 | 58.70 | 19.57 | 16.31 | 0.000 |
Error | 36 | 43.20 | 1.20 | - | - |
Total | 39 | 101.90 | - | - | - |
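For readers without Minitab, the sums of squares in this table can be reproduced from the raw data. The following is a minimal sketch in plain Python; the variable names are illustrative.

```python
# Cell phones per household in each of the 4 sampled clusters.
clusters = [
    [3, 5, 6, 4, 5, 6, 3, 2, 4, 5],
    [2, 0, 2, 1, 1, 0, 1, 1, 0, 1],
    [3, 2, 3, 2, 4, 2, 2, 1, 2, 2],
    [5, 2, 3, 2, 1, 1, 2, 2, 4, 1],
]
n = len(clusters)            # sampled clusters
M_bar = len(clusters[0])     # households per cluster

all_values = [y for c in clusters for y in c]
grand_mean = sum(all_values) / len(all_values)
cluster_means = [sum(c) / M_bar for c in clusters]

# Between-cluster (Factor) and within-cluster (Error) sums of squares.
ss_factor = M_bar * sum((m - grand_mean) ** 2 for m in cluster_means)
ss_error = sum((y - m) ** 2
               for c, m in zip(clusters, cluster_means) for y in c)
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

ms_factor = ss_factor / (n - 1)             # DF = 3
ms_error = ss_error / (n * (M_bar - 1))     # DF = 36
f_stat = ms_factor / ms_error

print(round(ss_factor, 2), round(ss_error, 2), round(ss_total, 2))
print(round(ms_factor, 2), round(ms_error, 2), round(f_stat, 2))
```

The three sums of squares satisfy SS total = SS factor + SS error, which is the sample version of the decomposition used in this section.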
Let’s find the relative efficiency of simple random sampling versus cluster sampling for the data in this example.
In this example, \(N = 400\), \(n = 4\), and \(\overline{M}=10\).
We need to find \(s_b^2\), \(s_w^2\).
Note the identity for the population:
SST = SSW + SSB
The identity for the sample is:
SS total = SS error + SS factor
From the ANOVA table of the example, we can find \(s_b^2\) by:
\[\text{SS factor}=(4-1)10s^2_b=58.70\]
\[s^2_b=\dfrac{58.70}{30}=1.957\]
We can find \(s_w^2\) by:
\[\text{SS error}=4(10-1)s^2_w=43.2\]
\[s^2_w=1.20\]
Try It!
Compute \(\hat{\sigma}^2\).
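A worked sketch of this computation, assuming the estimator that weights \(s^2_w\) and \(s^2_b\) in the same way the population identity weights the within- and between-primary-unit variances, with \(N = 400\), \(\overline{M}=10\), \(s^2_w=1.20\), and \(s^2_b=1.957\):

\[\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}=\dfrac{400(9)(1.20)+399(10)(1.957)}{3999}=\dfrac{4320+7808.43}{3999}\approx 3.03\]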
We can now determine the relative efficiency of simple random sampling versus cluster sampling. The formula requires \(s^2_u\), the sample variance of the cluster totals, which for equal cluster sizes is:
\[s^2_u={\overline{M}}^2 s^2_b=100 \times 1.957=195.7\]
Try It!
Compute the relative efficiency of simple random sampling versus cluster sampling. What does that tell us?
\[\dfrac{\hat{\operatorname{Var}}(\bar{y}_{\text{SRS}})}{\hat{\operatorname{Var}}(\hat{\mu})}=\dfrac{\overline{M}\hat{\sigma}^2}{s^2_u}=\dfrac{10 \times 3.03}{195.7}=0.155\]
What is this telling us?
The estimated variance under simple random sampling is only 15.5% of that under cluster sampling when the same sample size (40 households) is used. In this example, simple random sampling is more efficient if only variance is considered.
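The whole chain of calculations in this example can be checked with a few lines of code. This is a sketch that starts from the ANOVA sums of squares and assumes the estimator \(\hat{\sigma}^2=\dfrac{N(\overline{M}-1)s^2_w+(N-1)\overline{M}s^2_b}{N\overline{M}-1}\); the variable names are illustrative.

```python
from statistics import variance

N, n, M_bar = 400, 4, 10             # clusters in population, sampled, cluster size
ss_factor, ss_error = 58.70, 43.20   # from the one-way ANOVA table

s_b2 = ss_factor / ((n - 1) * M_bar)     # between-cluster variance
s_w2 = ss_error / (n * (M_bar - 1))      # within-cluster variance

# Estimate of the population variance sigma^2 from the cluster sample.
sigma2_hat = (N * (M_bar - 1) * s_w2 + (N - 1) * M_bar * s_b2) / (N * M_bar - 1)

# Variance of the cluster totals, computed two ways as a cross-check.
s_u2 = M_bar ** 2 * s_b2
s_u2_direct = variance([43, 9, 23, 23])  # sample variance of the totals column

# Estimated relative efficiency of SRS versus cluster sampling.
rel_eff = M_bar * sigma2_hat / s_u2

print(round(sigma2_hat, 2), round(s_u2, 1), round(rel_eff, 3))  # 3.03 195.7 0.155
```

Cost is not part of this comparison: visiting 4 clusters is typically cheaper than visiting 40 households scattered across the community, which is the reason the example gives for using clusters.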