Lesson 15: Cross-validation, Bootstraps and Consensus

Key Learning Goals for this Lesson:
  • Understanding how resampling can help us understand the variability of estimators.
  • Understanding how prediction error helps us understand how well a prediction model fits the data.
  • Understanding the three types of bootstrap: nonparametric (resampling), semi-parametric (noisy) and parametric.
  • Understanding how the bootstrap or cross-validation samples can be used to improve prediction and classification via consensus (aggregation).

Introduction

Uncertainty and Bias Quantification

Uncertainty and error are two of the central ideas in statistical thinking.  Variability is a measure of how much an estimator or other construct (such as a plot or clustering scheme) changes across random samples drawn from the population.  Bias is a measure of whether a numerical estimator is systematically higher or lower than the target quantity (parameter) being estimated.  More generally, statisticians describe the sampling distribution of the construct as the set of all possible values under different random samples, weighted by the probability of each outcome.  When the construct is numerical, the sampling distribution can be summarized with a histogram, but for complicated constructs such as cluster dendrograms, the distribution is simply the set of all possible values.  When the population is finite, the sampling distribution (for a smaller, finite sample size n) can in principle be constructed exactly by enumerating all possible samples of size n and computing the construct for each, with weight 1/(number of possible samples).
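To make this concrete, here is a minimal Python sketch (with a made-up toy population, not data from the course) that enumerates every possible sample of size n from a small finite population and builds the exact sampling distribution of the sample mean, each sample receiving weight 1/(number of possible samples).

```python
# Exact sampling distribution of the sample mean for a small finite population.
# The population values and sample size are illustrative assumptions.
from itertools import combinations
from statistics import mean

population = [3, 5, 7, 8, 12, 15]   # toy finite population
n = 3                               # sample size

# Every possible sample of size n, each with equal weight 1/(number of samples)
samples = list(combinations(population, n))
sample_means = [mean(s) for s in samples]
weight = 1 / len(samples)

print(f"{len(samples)} possible samples, each with weight {weight:.4f}")
print("Mean of the sampling distribution:", mean(sample_means))
print("Population mean:                  ", mean(population))
```

The mean of the sampling distribution matches the population mean, which is the sense in which the sample mean is unbiased.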

In some cases for numerical summaries, we can compute the sampling distribution (or at least its mean and variance) using statistical theory.  For example, when sampling from a population with finite variance \(\sigma^2\), the sample mean has a sampling distribution whose mean equals the population mean and whose variance is \(\sigma^2/n\).  The bias of an estimator is defined formally as the difference between the mean of the sampling distribution of the estimator and the target value.  When the target is the population mean, the sample mean is unbiased.  However, not every estimator we use is unbiased.  Although the sample variance is an unbiased estimator of the population variance, the sample SD is a biased estimator of the population SD, being systematically too small.
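A small simulation makes the bias of the sample SD visible.  The sketch below (simulated normal data with an assumed \(\sigma\); the settings are illustrative only) shows the average sample variance landing near \(\sigma^2\) while the average sample SD falls below \(\sigma\).

```python
# Simulation illustrating that the sample variance is unbiased for sigma^2
# while the sample SD is, on average, smaller than sigma.
import numpy as np

rng = np.random.default_rng(1)
sigma, n, n_sims = 2.0, 10, 200_000

samples = rng.normal(loc=0.0, scale=sigma, size=(n_sims, n))
s2 = samples.var(axis=1, ddof=1)    # sample variances (divisor n - 1)
s = np.sqrt(s2)                     # sample SDs

print("Average sample variance:", s2.mean(), " target sigma^2 =", sigma**2)
print("Average sample SD:      ", s.mean(),  " target sigma   =", sigma)
# The average sample SD comes out slightly but systematically below sigma.
```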

Another way that we quantify uncertainty is through probability statements.  For example p-values, power, the "confidence" of a confidence interval and Bayesian posterior probabilities are all ways of quantifying uncertainty through assessments of probability.

For complicated analyses, such as analysis pipelines and clustering algorithms, it is very difficult to determine sampling distributions or assess probabilities.  For example, consider hierarchical clustering.  The sampling distribution is a collection of dendrograms, one for each sample that could be drawn from the population.  Often the dendrograms are called "trees" and the sampling distribution is called a "forest".  It is difficult to know how to summarize such a sampling distribution.  However, some summaries can be quantified, such as the probability that two items belong to the same cluster, or that a tree contains a given branch.
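One way to estimate such summaries is to resample the data, recompute the tree each time, and tabulate how often pairs of items end up in the same cluster.  The Python sketch below is a simplified illustration with simulated data; the feature-resampling scheme and the cut into three clusters are assumptions made for the example, not a prescription.

```python
# Resample the features, re-run hierarchical clustering of the items each time,
# and estimate the probability that each pair of items lands in the same cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_features, n_items, k, n_boot = 50, 12, 3, 500

# Toy data: 12 items in 3 latent groups, measured on 50 features
groups = np.repeat([0, 1, 2], 4)
data = rng.normal(loc=groups * 3.0, scale=1.0, size=(n_features, n_items))

co_cluster = np.zeros((n_items, n_items))
for _ in range(n_boot):
    rows = rng.integers(0, n_features, n_features)    # bootstrap the features
    Z = linkage(data[rows].T, method="average")       # cluster the items
    labels = fcluster(Z, t=k, criterion="maxclust")
    co_cluster += labels[:, None] == labels[None, :]

co_cluster /= n_boot   # proportion of trees in which each pair co-clusters
print(np.round(co_cluster, 2))
```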

Classification is also complicated.  In classification we are usually interested in the probability that a newly observed sample will be correctly classified by our algorithm.  However, in many contexts we may not weight all errors equally.  For example, in determining whether a patient has a rare but serious disease, the simple classifier that labels every patient as healthy will likely have a low misclassification rate, but it is not useful.  In developing a prediction rule, we may have enriched our sample of diseased cases to ensure that we have enough cases to construct a rule.  We then need a way to quantify the probability that a patient is correctly classified given that they do or do not have the disease, or conversely, the probability that the patient has the disease given that they were classified to the healthy or diseased group.  Assessments of probability developed from the same training data used to estimate the classification rule are known to be optimistic; that is, they are biased towards smaller estimates of error.  They will also be incorrect if the proportion of each class in the training set differs from the proportions in the population to which the classification rule will be applied.
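The toy calculation below (hypothetical counts, chosen only to illustrate the point) shows why class-conditional error rates matter: with a rare disease, the "classify everyone as healthy" rule achieves high overall accuracy yet never detects a diseased patient.

```python
# Overall accuracy versus class-conditional error rates for the trivial
# "everyone is healthy" rule, using made-up counts: 1000 patients, 20 diseased.
import numpy as np

#                predicted healthy, predicted diseased
true_healthy  = np.array([980, 0])
true_diseased = np.array([ 20, 0])

accuracy    = (true_healthy[0] + true_diseased[1]) / 1000
sensitivity = true_diseased[1] / true_diseased.sum()   # P(classified diseased | diseased)
specificity = true_healthy[0] / true_healthy.sum()     # P(classified healthy | healthy)

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
# accuracy = 0.980 even though the rule never detects a diseased patient.
```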

Another very difficult problem is assessing confidence after feature selection.  How can we develop an estimate of confidence that accounts both for the feature selection and for the estimation of the model parameters, such as regression coefficients or effect sizes, after selection?

Simulation and resampling are two methods that help assess and quantify uncertainty and error when the mathematical theory is too difficult.  Simulation is used to assess and quantify uncertainty under the ideal conditions set up in the simulation study.  Resampling methods, which include permutation tests, cross-validation and the bootstrap, generate new samples from the observed data as a means of estimating the sampling distribution.  They do not work very well for extremely small samples, because the number of "new" samples that can be drawn is too small.  However, they can work surprisingly well when the sample sizes are moderate.
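As a simple example of the idea, the sketch below applies the nonparametric bootstrap to simulated data (the statistic and settings are chosen only for illustration), resampling with replacement to approximate the sampling distribution of the sample median and to estimate its standard error.

```python
# Nonparametric bootstrap: resample the observed data with replacement to
# approximate the sampling distribution of an estimator (here, the median).
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=40)    # a moderate-sized simulated sample

n_boot = 5000
boot_medians = np.array([
    np.median(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)
])

print("observed median:        ", np.median(x))
print("bootstrap SE of median: ", boot_medians.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_medians, [2.5, 97.5]))
```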