11.2 - Introduction to Bootstrapping

In this section, we start by reviewing the concept of sampling distributions; recall that we can find the sampling distribution of any summary statistic. We then introduce bootstrapping, a resampling method for approximating the sampling distribution of a statistic.

Review of Sampling Distributions

Before looking at the bootstrapping method, we will need to recall the idea of sampling distributions. More specifically, let's look at the sampling distribution of the sample mean, \(\bar{x}\).

Suppose we are interested in estimating the population mean, \(\mu\). To do this, we take a random sample of size \(n\) and calculate the sample mean, \(\bar{x}\). But how do we know how good an estimate \(\bar{x}\) is? To answer this question, we need the standard deviation of the estimate.

Recall that \(\bar{x}\) is calculated from a random sample and is, therefore, a random variable. Let's call the sample mean from above \(\bar{x}_1\). Now suppose we gather another random sample of size \(n\), calculate its sample mean, and denote it \(\bar{x}_2\). Take another sample, and so on. With many of these samples, we can construct a histogram of the sample means.
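A minimal simulation sketch of this idea in R (assuming, purely for illustration, a Normal population with mean 5 and standard deviation 1 and samples of size 30):

# Draw 1000 samples of size 30 from a hypothetical Normal(5, 1) population
# and look at the distribution of the resulting sample means
set.seed(11)
xbar <- replicate(1000, mean(rnorm(30, mean = 5, sd = 1)))
hist(xbar, main = "Histogram of 1000 sample means (n = 30)")
sd(xbar)   # close to 1 / sqrt(30), the theoretical standard deviation of the sample mean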

With theory and the central limit theorem, we have the following summary:

If the sample satisfies at least one of the following:

  • The distribution of the random variable, \(X\), is Normal
  • The sample size is large; rule of thumb is \(n>30\)

...then the sampling distribution of \(\bar{X}\) is approximately Normal with

  • Mean: \(\mu\)
  • Standard deviation: \(\frac{\sigma}{\sqrt{n}}\)
  • Estimated standard error: \(\frac{s}{\sqrt{n}}\) (using the sample standard deviation \(s\) when \(\sigma\) is unknown)

Using the above, we can construct confidence intervals and hypothesis tests for the population mean, \(\mu\).
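For example, when \(\sigma\) is unknown and \(n\) is large, an approximate 95% confidence interval for \(\mu\) is \(\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}\).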

What happens when we do not know the underlying distribution and cannot take repeated samples from the population? How can we estimate the sampling distribution of a statistic? This is what we try to answer in the next section.


11.2.1 - Bootstrapping Methods

Point estimates are helpful for estimating unknown parameters, but in order to make inferences about an unknown parameter, we need interval estimates. Confidence intervals are based on information from the sampling distribution, including the standard error.

What if the underlying distribution is unknown? What if we are interested in a population parameter that is not the mean, such as the median? How can we construct a confidence interval for the population median?

If we have sample data, then we can use bootstrapping methods to build a bootstrap sampling distribution, from which we can construct a confidence interval.

Bootstrapping is a topic that has been studied extensively for many different population parameters and many different situations. There are parametric bootstraps, nonparametric bootstraps, weighted bootstraps, etc. We merely introduce the very basics of the bootstrap method; covering all of these topics would be a course in itself.

 
Bootstrapping
Bootstrapping is a resampling procedure that uses data from one sample to generate a sampling distribution by repeatedly taking random samples from the known sample, with replacement.

Let's show how to bootstrap the sample median. Let the observed sample median be denoted as \(M\).

Steps to create the bootstrap distribution of the sample median:
  1. Replace the population with the sample.
  2. Draw a sample of size \(n\) with replacement from the original sample. Repeat this \(B\) times, where \(B\) is large, say 1000.
  3. Compute the sample median of each resample, \(M_i\).
  4. Use \(M_1, \dots, M_B\) as the approximate sampling distribution of the sample median.

If we have the approximate distribution, we can find an estimate of the standard error of the sample median by finding the standard deviation of \(M_1,...,M_B\).
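A minimal sketch of these steps in R, using a small hypothetical data set (the values below are made up for illustration):

# Hypothetical sample; resample it with replacement B = 1000 times
set.seed(11)
x <- c(3.2, 4.8, 5.1, 5.6, 6.0, 6.4, 7.1, 7.9, 8.3, 9.5)
boot.medians <- replicate(1000, median(sample(x, length(x), replace = TRUE)))
sd(boot.medians)   # estimated standard error of the sample median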

Sampling with replacement is important. If we did not sample with replacement, we would always get the same sample median as the observed value. The sample we get from sampling from the data with replacement is called the bootstrap sample.
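A quick check of why replacement matters, again with hypothetical values: without replacement the resample is just a reordering of the data, so its median is always the observed median.

# Without replacement the resample is a permutation of the data
set.seed(11)
x <- c(2, 5, 7, 8, 12)
median(sample(x, length(x), replace = FALSE))   # always 7, the observed median
median(sample(x, length(x), replace = TRUE))    # varies from resample to resample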

Once we have the bootstrap distribution, we can create a confidence interval. For a 90% confidence interval, for example, we would find the 5th and 95th percentiles of the bootstrap medians.
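For example, using the boot.medians vector from the standard-error sketch above, the quantile() function returns these percentiles directly:

# 90% percentile interval for the median (boot.medians from the sketch above)
quantile(boot.medians, probs = c(0.05, 0.95))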

You can use bootstrapping to find the approximate sampling distribution of any statistic, not just the median. The steps are the same, except you calculate the appropriate statistic instead of the median.

Video: Bootstrapping

  Sampling R Code from the Bootstrapping Video

# Approximates the true sampling distribution of the sample median by
# repeatedly drawing new samples of size n from a Normal(mean, sd) population
sampling.distribution <- function(n = 100, B = 1000, mean = 5, sd = 1, confidence = 0.95) {
  medians <- rep(0, B)
  for (i in 1:B) {
    medians[i] <- median(rnorm(n, mean = mean, sd = sd))
  }
  med.obs <- median(medians)
  c.l <- round((1 - confidence) / 2 * B, 0)      # index of the lower percentile
  c.u <- round(B - (1 - confidence) / 2 * B, 0)  # index of the upper percentile
  l <- sort(medians)[c.l]
  u <- sort(medians)[c.u]
  cat(c.l / B * 100, "-percentile:      ", l, "\n")
  cat("Median: ", med.obs, "\n")
  cat(c.u / B * 100, "-percentile:      ", u, "\n")
  return(medians)
}

# Bootstraps the sample median: resamples the observed data with replacement
# B times and reports a percentile interval for the median
bootstrap.median <- function(data, B = 1000, confidence = 0.95) {
  n <- length(data)
  medians <- rep(0, B)
  for (i in 1:B) {
    medians[i] <- median(sample(data, size = n, replace = TRUE))
  }
  med.obs <- median(medians)
  c.l <- round((1 - confidence) / 2 * B, 0)      # index of the lower percentile
  c.u <- round(B - (1 - confidence) / 2 * B, 0)  # index of the upper percentile
  l <- sort(medians)[c.l]
  u <- sort(medians)[c.u]
  cat(c.l / B * 100, "-percentile:      ", l, "\n")
  cat("Median: ", med.obs, "\n")
  cat(c.u / B * 100, "-percentile:      ", u, "\n")
  return(medians)
}
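As a usage sketch with hypothetical data, the function above can be called as follows (the data, seed, and confidence level are illustrative choices, not from the video):

# Hypothetical example: 90% bootstrap percentile interval for the median
set.seed(1)
x <- rnorm(50, mean = 5, sd = 1)
boot.meds <- bootstrap.median(x, B = 1000, confidence = 0.90)
hist(boot.meds, main = "Bootstrap distribution of the sample median")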

