12  Capture-Recapture Sampling, Random Response Model

Overview

In Section 12.1, we introduce capture-recapture sampling and discuss its application to estimating population size. We then provide the formula for the variance of the estimate and work through an example. This is direct capture-recapture sampling, in which both the initial capture size and the second capture size are pre-determined.

In Section 12.2, we discuss inverse sampling for capture-recapture where the initial capture size is pre-determined but we sample until a fixed number of tagged items are recaptured. Here, the second capture size is random. An example is provided to compute the estimate as well as its estimated standard deviation.

In Section 12.3, the random response model is introduced to promote more truthful answers to sensitive questions. An example is given to illustrate how to compute the estimate as well as its estimated standard deviation.

Lesson 12: Ch. 18.1 of Sampling by Steven Thompson, 3rd Edition.

Objectives

Upon completion of this lesson you should be able to:

  1. Apply the capture-recapture sampling method to estimate the size of a population,
  2. Distinguish between the capture-recapture method and the inverse capture-recapture method,
  3. Apply the inverse capture-recapture sampling method to estimate the size of a population, and
  4. Apply the random response model to address the issue of sensitive questions in survey research.

12.1 Capture-Recapture Sampling

One popular method for estimating the total number of individuals in a population is capture-recapture sampling. In capture-recapture sampling, an initial sample is obtained and marked. A second sample is then obtained independently, and the number of marked individuals in that sample is recorded.

Example 1: To estimate the abundance of an animal population such as the deer population in the state of Pennsylvania.

Example 2: To estimate the total number of homeless individuals in a given city.

Single Recapture

Notation:

  • \(X\) - initial sample size captured and marked
  • \(y\) - second sample size recaptured independently
  • \(x\) - number of individuals in the second sample that are marked
  • \(\tau\) - total population size

Question: How do we estimate the total population size?

The proportion of marked individuals in the second sample is likely to be about the same as the proportion of marked individuals in the whole population, so:

\[\dfrac{x}{y}\cong \dfrac{X}{\tau}\]

\[\hat{\tau}=\dfrac{y}{x}\cdot X\]

An estimate of the variance of \(\hat{\tau}\) is:

\[\hat{\operatorname{Var}}(\hat{\tau})=\dfrac{Xy(X-x)(y-x)}{x^3}\]

An approximate \(100(1-\alpha)\%\) confidence interval, where \(z\) is the upper \(\alpha/2\) point of the standard normal distribution, is:

\[\hat{\tau} \pm z\sqrt{\hat{\operatorname{Var}}(\hat{\tau})}\]

To deal with the case when \(x = 0\) and we do not want to estimate \(\tau\) by infinity, a modified estimator for \(\tau\) is:

\[\tilde{\tau}=\dfrac{(X+1)(y+1)}{x+1}-1\]
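The formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `direct_estimate` and `modified_estimate` and the numbers in the usage lines are hypothetical and not part of the lesson.

```python
def direct_estimate(X, y, x):
    """Direct capture-recapture: return (tau_hat, var_hat).

    X: initial sample size (captured and marked)
    y: second sample size
    x: number of marked individuals in the second sample
    """
    if x == 0:
        # tau_hat = (y / x) * X is undefined when no marked individual is recaptured
        raise ValueError("x = 0: use the modified estimator instead")
    tau_hat = y / x * X
    var_hat = X * y * (X - x) * (y - x) / x**3
    return tau_hat, var_hat


def modified_estimate(X, y, x):
    """Modified estimator, which stays finite even when x = 0."""
    return (X + 1) * (y + 1) / (x + 1) - 1


# Hypothetical numbers, chosen only for illustration.
tau_hat, var_hat = direct_estimate(X=100, y=50, x=10)
print(tau_hat, var_hat)
print(modified_estimate(X=100, y=50, x=10))
print(modified_estimate(X=100, y=50, x=0))
```

With a reasonably large \(x\), the two estimators give similar values; the modified one matters mainly when \(x\) is small or zero.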

Try It!

In a free concert given on the Old Main lawn, we want to estimate the number of attendees. How are you going to conduct sampling for this purpose?

At the beginning of the concert, 500 Penn State t-shirts were randomly given out to attendees. 200 attendees are randomly sampled and we find that 40 have the Penn State t-shirt.

Try It!

How many total attendees are at the concert using values given in the answer to the question above?

\[\hat{\tau}=\dfrac{y}{x}\cdot X =\dfrac{200}{40}\cdot 500=2500\]

\[\hat{\operatorname{Var}}(\hat{\tau})=\dfrac{500\times 200(500-40)(200-40)}{40^3}=115000\]

\[\hat{\text{SD}}(\hat{\tau})=\sqrt{115000}=339.12\]

A 95% confidence interval is:

\[2500 \pm 1.96 \times 339.12\] \[2500 \pm 664.67\]
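For readers who want to check the arithmetic, the concert calculation can be reproduced with a short script. This is a minimal sketch; the variable names are illustrative, and \(z = 1.96\) is taken from the calculation above.

```python
from math import sqrt

# Concert example: 500 t-shirts handed out, 200 attendees sampled, 40 wearing the shirt.
X, y, x = 500, 200, 40

tau_hat = y / x * X                           # estimated number of attendees
var_hat = X * y * (X - x) * (y - x) / x**3    # estimated variance of tau_hat
sd_hat = sqrt(var_hat)                        # estimated standard deviation

z = 1.96                                      # normal critical value for a 95% interval
half_width = z * sd_hat
lower, upper = tau_hat - half_width, tau_hat + half_width

print(f"tau_hat = {tau_hat:.0f}, var_hat = {var_hat:.0f}, sd_hat = {sd_hat:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```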

Note! \(y\) can be larger than \(X\)

12.2 Inverse Sampling for Capture-Recapture

What we have covered so far is direct capture-recapture sampling, i.e., both the initial sample (capture) size and the second sample (recapture) size are pre-determined. When the second sample size is not pre-determined, we have:

Inverse Sampling for Capture-Recapture

Again, assume that an initial sample of \(X\) individuals is captured, tagged, and released back into the population. Then, random sampling is conducted until \(x\) tagged individuals are recaptured. If \(y\) denotes the second sample size, then:

\[\hat{\tau}=\dfrac{y}{x}X\]

Note that for inverse sampling, \(x\) is fixed but \(y\) is random. The estimated variance of \(\hat{\tau}\) is:

\[\hat{\operatorname{Var}}(\hat{\tau})=\dfrac{X^2y(y-x)}{x^2(x+1)}\]

Note! Here \(x\) is specified in advance, so we do not need to worry about the case \(x = 0\).

Example 12.1 (Number of Eagles) We want to estimate the total number of eagles in a wildlife preserve. A random sample of 200 eagles is trapped, tagged, and then released. In the same month, a second sample is drawn until 35 tagged eagles are recaptured. The sample size needed to get 35 tagged eagles is 100. (In direct capture-recapture, by contrast, the second sample size of 100 would be fixed in advance and the 35 tagged eagles would be the observed count.)

Try It!

Estimate the total population size of eagles for the above example and find the variance of your estimate.

\[X = 200, x = 35, y = 100\] \[\hat{\tau}=\dfrac{100}{35}\times 200=571.43\]

\[\hat{\operatorname{Var}}(\hat{\tau})=\dfrac{200^2\times 100(100-35)}{35^2(35+1)}=5895.69\]

\[\hat{\text{SD}}(\hat{\tau})=76.78\]
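The inverse-sampling calculation can be checked the same way. The sketch below assumes a hypothetical helper named `inverse_estimate`; only the formulas themselves come from this section.

```python
from math import sqrt


def inverse_estimate(X, y, x):
    """Inverse capture-recapture: return (tau_hat, var_hat).

    X: initial sample size (captured and tagged)
    y: second sample size needed to recapture x tagged individuals (random)
    x: number of tagged recaptures, fixed in advance
    """
    tau_hat = y / x * X
    var_hat = X**2 * y * (y - x) / (x**2 * (x + 1))
    return tau_hat, var_hat


# Eagle example: 200 tagged, sampling stops at 35 tagged recaptures after 100 draws.
tau_hat, var_hat = inverse_estimate(X=200, y=100, x=35)
print(f"tau_hat = {tau_hat:.2f}")
print(f"var_hat = {var_hat:.2f}")
print(f"sd_hat  = {sqrt(var_hat):.2f}")
```

The point estimate has the same form as in direct sampling; only the variance formula changes, because here \(y\) rather than \(x\) is the random quantity.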

12.3 Random Response Model

People may lie about sensitive questions such as: “Have you used cocaine before?”

For these types of questions, a question form that encourages truthful answers and makes people comfortable is useful.

Horvitz (1967), building on an idea from Warner (1965), suggests using two questions, a sensitive question and an unrelated question, with a randomization device determining which question the respondent answers.

Example

Q1: Have you used cocaine before?

Q2: Is the second hand of your watch between 0 and 30?

The respondent flips a coin to decide which question to answer; the interviewer does not know the outcome of the coin flip.

The randomization device can be anything but it must have:

  1. a known probability \(t\) that the person answers the sensitive question and probability \(1 - t\) that the person answers the other question.
  2. a known probability that the person responds yes to the other question.

Example 12.2 (Tax return question) Q1: Have you ever falsified your tax return? Yes or no.

Q2: Flip a book open to a random page and answer: is the page number odd? Yes or no.

The interviewer merely records the answer and does not know whether the respondent is answering Q1 or Q2.

We conduct this survey on \(n\) subjects, and \(n_1\) denotes the number of respondents who answer yes. How do we estimate \(p\), the population proportion who would answer yes to the sensitive question Q1?

Here \(t = 1/2\), and the probability of a yes answer to Q2 is also \(1/2\).

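[Figure: Tree diagram for the tax return question. The first split is the coin flip (H or T), which selects the question; each branch then splits into Yes and No.]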

Here we write out what this tree diagram is expressing in terms of the probability of yes:

\[P(\text{yes})=\dfrac{1}{2}\times p+\dfrac{1}{2}\times\dfrac{1}{2}=\dfrac{p}{2}+\dfrac{1}{4}\]

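Let \(n_1\) denote the number of yes answers among the \(n\) subjects. Setting the observed proportion of yes answers equal to \(P(\text{yes})\) and solving for \(p\) gives: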

\[\dfrac{n_1}{n}=\dfrac{\hat{p}}{2}+\dfrac{1}{4}\]

\[\hat{p}=2\left(\dfrac{n_1}{n}-\dfrac{1}{4}\right)\]

If the sample size is small compared to the population size, the finite population correction factor can be omitted and the estimated variance is:

\[\hat{\operatorname{Var}}(\hat{p})=\dfrac{4}{n}\times \dfrac{n_1}{n}\times \left(1-\dfrac{n_1}{n}\right)\]
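As a brief sketch of where this formula comes from, assume the recorded answers behave like independent draws, so that \(n_1/n\) is an ordinary sample proportion with estimated variance \(\frac{1}{n}\cdot\frac{n_1}{n}\left(1-\frac{n_1}{n}\right)\). Because \(\hat{p}\) is a linear function of \(n_1/n\) with slope 2, the variance is multiplied by \(2^2\):

\[\hat{\operatorname{Var}}(\hat{p})=\hat{\operatorname{Var}}\left(2\left(\dfrac{n_1}{n}-\dfrac{1}{4}\right)\right)=2^2\,\hat{\operatorname{Var}}\left(\dfrac{n_1}{n}\right)=\dfrac{4}{n}\times \dfrac{n_1}{n}\times \left(1-\dfrac{n_1}{n}\right)\]

The factor of 4 is the price paid for privacy: only about half of the respondents answer the sensitive question, so the estimate is less precise than a direct question with truthful answers would be.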

Try It!

If we survey 400 subjects and the number who answer yes to the composite question is 128, estimate the proportion of people who have falsified their tax return.

\[n = 400\] \[n_1=128\] \[\hat{p}=2\left(\dfrac{128}{400}-\dfrac{1}{4}\right)=0.14\]

\[\hat{\operatorname{Var}}(\hat{p})=\dfrac{4}{400}\times \dfrac{128}{400}\times \left(1-\dfrac{128}{400}\right)=0.0022\]

\[\hat{\text{SD}}(\hat{p})=0.047\]
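The same calculation can be sketched in Python for a general randomization device. The helper name `randomized_response` and the parameter `q` (the known probability of a yes answer to the unrelated question) are assumptions introduced here for illustration. The general form \(P(\text{yes}) = t\,p + (1-t)\,q\) follows from the two requirements on the randomization device listed above and reduces to the formulas in this section when \(t = q = 1/2\).

```python
from math import sqrt


def randomized_response(n, n1, t=0.5, q=0.5):
    """Return (p_hat, var_hat) for the two-question random response model.

    n:  number of subjects surveyed
    n1: number of yes answers recorded
    t:  known probability that a subject answers the sensitive question
    q:  known probability of a yes answer to the unrelated question (assumed name)
    """
    yes_rate = n1 / n                          # observed proportion of yes answers
    # Solve yes_rate = t * p + (1 - t) * q for p.
    p_hat = (yes_rate - (1 - t) * q) / t
    # Variance of a sample proportion, scaled by 1 / t**2 (no finite population correction).
    var_hat = yes_rate * (1 - yes_rate) / (n * t**2)
    return p_hat, var_hat


# Tax return example: 400 subjects, 128 yes answers, fair coin, odd/even page number.
p_hat, var_hat = randomized_response(n=400, n1=128)
print(f"p_hat   = {p_hat:.2f}")
print(f"var_hat = {var_hat:.4f}")
print(f"sd_hat  = {sqrt(var_hat):.3f}")
```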