STAT 555

Lesson 15: Cross-validation, Bootstraps and Consensus

Key Learning Goals for this Lesson:
  • Understanding how resampling can help us understand the variability of estimators.
  • Understanding how prediction error helps us understand how well a prediction model fits the data.
  • Understanding the three types of bootstrap: nonparametric (resampling), semiparametric (noisy) and parametric.
  • Understanding how bootstrap or cross-validation samples can be used to improve prediction and classification via consensus (aggregation).

Introduction

Uncertainty and Bias Quantification

Uncertainty and error are two of the central ideas in statistical thinking.  Variability is a measure of how much an estimator or other construct (such as a plot or clustering scheme) changes across random samples drawn from the population.  Bias is a measure of whether a numerical estimator is systematically higher or lower than the target quantity (parameter) being estimated.  More generally, statisticians describe the sampling distribution of the construct as the set of all possible values under different random samples, weighted by the probability of each outcome.  When the construct is numerical, the sampling distribution can be summarized with a histogram, but for complicated constructs such as cluster dendrograms, the distribution is simply the set of all possible values.  When the population is finite, the sampling distribution (for a smaller, finite sample size n) can in principle be constructed exactly by forming all possible samples of size n and computing the construct for each, with weight 1/(number of possible samples).

In some cases for numerical summaries, we can compute the sampling distribution (or at least its mean and variance) using statistical theory.  For example, when sampling from a population with finite variance \(\sigma^2\), the sample mean has a sampling distribution with mean equal to the population mean and variance \(\sigma^2/n\).  The bias of an estimator is defined formally as the difference between the mean of the sampling distribution of the estimator and the target value.  When the target is the population mean, the sample mean is unbiased.  However, not every estimator we use is unbiased.  Although the sample variance is an unbiased estimator of the population variance, the sample SD is a biased estimator of the population SD, being systematically too small.
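To make the bias of the sample SD concrete, here is a small simulation sketch in Python (not part of the original course materials; the population SD of 2 and sample size of 10 are arbitrary choices):

import numpy as np

# Approximate the sampling distributions of the sample variance and sample SD
# by drawing many samples of size n from a Normal(0, sigma^2) population.
rng = np.random.default_rng(1)
sigma, n, n_sim = 2.0, 10, 20000

samples = rng.normal(0.0, sigma, size=(n_sim, n))
s2 = samples.var(axis=1, ddof=1)   # sample variance (unbiased for sigma^2)
s = np.sqrt(s2)                    # sample SD (systematically below sigma)

print("true variance:", sigma**2, "  mean of s^2:", round(s2.mean(), 3))
print("true SD:      ", sigma,    "  mean of s:  ", round(s.mean(), 3))

The averaged sample variances land very close to \(\sigma^2\), while the averaged sample SDs fall noticeably below \(\sigma\), which is exactly the bias described above.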

Another way that we quantify uncertainty is through probability statements.  For example p-values, power, the "confidence" of a confidence interval and Bayesian posterior probabilities are all ways of quantifying uncertainty through assessments of probability.

For complicated analyses, such as analysis pipelines and clustering algorithms, it is very difficult to determine sampling distributions or assess probabilities.  For example, consider hierarchical clustering.  The sampling distribution is a collection of dendrograms, each for a different sample drawn from the population.  Often the dendrograms are called "trees" and the sampling distribution is called a "forest".  It is difficult to know how to summarize such a sampling distribution.  However, there are summaries that can be quantified such as the probability that two items will belong to the same cluster, or that a tree will have a given branch.  

Classification is also complicated.  In classification we are usually interested in the probability that a newly observed sample will be correctly classified by our algorithm.  However, in many contexts we may not weight all errors equally.  For example, in determining if a patient has a rare but serious disease, the simple classifier that classifies all patients as healthy will likely have a low misclassification rate, but will not be useful.  In developing a prediction rule, we may have enriched our sample of diseased cases to ensure that we have sufficient cases to construct a rule.  We then need to find a way to quantify the probability that a patient is correctly classified given that they do or do not have the disease, or conversely, the probability that the patient has the disease given that they were classified to the healthy or diseased group.  Assessments of probability developed from the same training data used to estimate the classification rule are known to be optimistic - that is they are biased towards smaller estimates of error. As well, they will be incorrect if the proportion of each class in the training set differs from the proportions in the population to which the classification rule will be applied.

Another very difficult problem is assessing confidence after feature selection.  How can we develop an estimate of confidence that takes into account both the feature selection and the estimation of the model parameters such as regression coefficients or effect sizes after selection? 

Simulation and resampling are two methods that help assess and quantify uncertainty and error when the mathematical theory is too difficult.  Simulation is used to assess and quantify uncertainty under the ideal conditions set up in the simulation study.  Resampling methods, which include permutation tests, cross-validation and the bootstrap are methods which simulate new samples from the data as a means of estimating the sampling distribution.  They do not work very well for extremely small samples, as the number of "new" samples that can be drawn is too small.  However, they can work surprisingly well when the sample sizes are moderate.

15.1 - Prediction Error

We will start the discussion of uncertainty quantification with a problem that is of particular interest in regression and classification: assessing prediction error. In regression we have a continuous response variable and one or more predictor variables (which may be continuous or categorical).  The prediction problem ignores questions of model correctness (e.g. are the regression coefficients the correct order of magnitude?) and focuses on the predicted values: Do we obtain good predictions of the response variable? Similarly, in the classification problem, we have a categorical response variable and one or more predictor variables.  The prediction problem focuses on whether the samples are correctly classified to their category.  The objective is to find a rule that performs well in predicting outcomes or categories for new cases for which the response or category is not known.  For example, we might want to predict the success probability for new graduate students based on the information in their dossier, or categorize tissue samples into normal, benign or cancerous based on their gene expression. In these examples, we are not looking for a model that accurately reflects the underlying process - we just want a good predictor for new samples.

The data on which the prediction or classification rules are developed are called the training sample.  In statistics, we talk about "fitting" the model; in machine learning, we talk about "training" the predictor.  Typically, the fitting step minimizes a measure of prediction error on the training sample.  For example, in least squares regression, the residual is the difference between the observed response variable and its predicted value, and the model is fitted by selecting the parameters which minimize the sum of squared residuals (also called the sum of squared errors or SSE).  In classification, we might count the number of misclassified observations (possibly weighting more serious misclassifications more heavily).

The method seems straightforward, but it has several flaws.  The one we will tackle in this chapter is "optimism": because the prediction rule is fitted to minimize the prediction error in the training set, it underestimates the prediction error it will achieve with new data.  "Optimism" leads to another problem called overfitting, which occurs when the fitted prediction rule fits the noise as well as the systematic aspects of the data.  As a simple example, suppose that there are n observations and p linearly independent features, with p=n. (We cannot have p>n and have linear independence.)  Linear algebra tells us that linear regression can then provide a perfect fit to any response variable, including a categorical response, so that the estimated prediction error will always be zero.  This mathematical fact has nothing to do with underlying structure - we could randomly simulate our p features and still obtain perfect prediction.  Even if the number of predictors is less than the number of observations, we can overfit the model, although perhaps not obtain a perfect fit.  Besides leading to an optimistic assessment of prediction error, overfitting usually leads to rules that have high variability and are very sensitive to noise in the data.
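A tiny simulation makes the p = n point concrete (a sketch with purely random features; the data are simulated and not from the course):

import numpy as np

rng = np.random.default_rng(2)
n = p = 20

X = rng.normal(size=(n, p))    # p = n randomly simulated features
y = rng.normal(size=n)         # response completely unrelated to the features

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta) ** 2)
print("training SSE:", sse)    # essentially zero: a "perfect" fit to pure noise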

Since our objective is to develop a prediction rule that does well with new data, ideally we would collect a sufficiently large set of new observations including the outcome, and assess the prediction rule on those data.  This is called the validation sample. We expect that the assessment of prediction error on the validation sample should be quite good, because the training and validation samples share only their systematic features and not the noise. 

However, collecting new data is expensive - we would not want to do it to assess a rule that might not be very good in the first place.  And even if we can afford to collect new data, we would likely want to use those data to improve our predictor - not just to assess it.  So the question becomes: how can we use the data we have already collected to assess prediction error?  And once we have done that, can we use the results to improve our prediction rule?

Cross-validation (CV) is a method that was developed in the 1970s to assess prediction error while also training the prediction rule.  The bootstrap (which is actually a wide class of methods) is a much more general method developed in the early 1980s which starts by estimating the sampling distribution of the prediction rule (or its parameters) and can also be used to assess aspects of the prediction rule including prediction error.

15.2 - Cross-Validation

Cross-validation is a method that uses the same data to both train the model and obtain a less biased estimate of prediction error than the direct estimate.  The basic idea is to split the training data into two subsets - one subset is used to train the prediction rule and then the other subset is used to assess prediction error.  To use the data efficiently, this is repeated with multiple splits of the data.

The earliest and still most commonly used method is leave-one-out cross-validation.  One of the n observations is set aside for validation and the prediction rule is trained on the other n-1 observations.  The error in predicting the held-out observation is recorded.  This is repeated n times, leaving out each observation once.  For regression, the average of the squared prediction errors (or its square root) is the estimated prediction error.  For classification, a weighted average of the number of misclassified observations is generally used.  Although the method requires fitting the prediction rule n times, there are computationally efficient ways to do this for many commonly used predictors.
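A minimal sketch of leave-one-out cross-validation for least squares regression, written in plain Python/numpy (this is the brute-force version, not one of the computationally efficient shortcuts mentioned above; the simulated X and y are stand-ins for a real training sample):

import numpy as np

def loocv_mse(X, y):
    """Leave-one-out CV estimate of squared prediction error for least squares."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                          # train on all but observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors[i] = y[i] - X[i] @ beta                    # error on the held-out observation
    return np.mean(errors ** 2)

# Illustrative call on simulated data
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=30)
print("LOOCV estimate of prediction MSE:", loocv_mse(X, y))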

Statistical theory and simulation have demonstrated that leave-one-out cross-validation is not a good estimate of prediction error for every type of predictor.  In particular, it does not do well for problems such as determining the number of clusters or feature selection. As well, in some cases there are no known computationally efficient methods.  In these cases, two other cross-validation strategies may be used.  Leave-out-k cross-validation divides the data into a subset of k observations that will be used as the validation set and the other n-k observations that are used for training.  It then proceeds like leave-one-out cross-validation.  Since there are many more subsets of size k than of size 1, often only a random sample of the subsets is used.  There is some statistical theory to guide the choice of k.

k-fold cross-validation divides the entire dataset into k subsets.  In turn, each subset is used as the validation sample, while the other k-1 subsets are combined to use as the training sample.  There is statistical theory showing that the appropriate choice of k depends on n and the type of predictor.  However, in practice, the dependency is weak, and k=10 works well for a large range of sample sizes and in many problems.
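The same idea for k-fold cross-validation, again as a hedged numpy sketch (folds are assigned at random; with the default k=10 each fold holds out roughly 10% of the data):

import numpy as np

def kfold_mse(X, y, k=10, seed=0):
    """k-fold CV estimate of squared prediction error for least squares."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % k   # random fold label 0..k-1 per observation
    sq_errors = []
    for fold in range(k):
        test = folds == fold
        beta, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        sq_errors.append((y[test] - X[test] @ beta) ** 2)    # errors on the held-out fold
    return np.concatenate(sq_errors).mean()

# Illustrative call on simulated data
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=0.3, size=50)
print("10-fold CV estimate of prediction MSE:", kfold_mse(X, y))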

Note that cross-validation is used to estimate prediction error and sometimes aspects of the prediction rule, such as the number of clusters or number of predictors.  The final predictor is then trained on all the training data.

There is a small problem with this method for assessing prediction error. The final predictor is based on all N observations (100% of the sample), but the estimated prediction error is based on predictors developed on smaller samples of size \(N-k < N\). So the cross-validation estimate of prediction error may actually be pessimistic - the final predictor might have slightly better prediction error than the estimate suggests. However, with 10-fold cross-validation the estimate cannot be too far off, because each training set uses at least 90% of the sample.

In some circumstances we want to pick the best of several choices of predictor.  In this case, we use the same cross-validation strategy with each of the predictors and select the predictor with the smallest cross-validation estimate of prediction error.  However, we have now over-used the sample in terms of estimating prediction error.  It is best in this case to have yet another validation sample, called the test or hold-out sample, that is not used in the model development, but which can be used to assess the prediction error of the selected predictor. [1]

[1] Lever, J., Krzywinski, M. & Altman, N., Model Selection and Overfitting. Nature Methods, 13, 703–704 (2016)

15.3 - Bootstrapping

Bootstrapping is a method of sample reuse that is much more general than cross-validation [1].  The idea is to use the observed sample to estimate the population distribution.  Then samples can be drawn from the estimated population and the sampling distribution of any type of estimator can itself be estimated.

[Figure: an overview of bootstrapping]

The steps in bootstrapping are illustrated in the figure above.  Observed quantities are denoted by solid curves and unobserved quantities by dashed curves. The objective is to estimate the true sampling distribution of some quantity T, which may be numeric (such as a regression coefficient) or more complicated (such as a feature cluster dendrogram).  The true sampling distribution is computed by taking new samples from the true population, computing T for each and accumulating all of the values of T into the sampling distribution.  However, taking new samples is expensive, so instead we take a single sample (1) and use it to estimate the population (2).  We then (3) take samples "in silico" (on the computer) from the estimated population, compute T from each (4) and accumulate all of the values of T into an estimate of the sampling distribution.  From this estimated sampling distribution we can estimate the desired features of the sampling distribution.  For example, if T is quantitative, we are interested in features such as the mean, variance and skewness, and also in confidence intervals for the mean of T.  If T is a cluster dendrogram, we can estimate features such as the proportion of trees in the sampling distribution that include a particular node.

There are three forms of bootstrapping which differ primarily in how the population is estimated. Most people who have heard of bootstrapping have only heard of the so-called nonparametric or resampling bootstrap.

  • Nonparametric (resampling)
  • Semiparametric (adding noise)
  • Parametric (simulation)

Nonparametric (resampling) bootstrap

In the nonparametric bootstrap, a sample of the same size as the data is taken from the data with replacement. What does this mean? It means that if you measured 10 samples, you create a new sample of size 10 by replicating some of the samples that you've already seen and omitting others. At first this might not seem to make sense compared to cross-validation, which may seem more principled. However, it turns out that this process has good statistical properties.
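Here is a minimal sketch of the resampling bootstrap for a simple numeric statistic (the median of a small made-up sample x); the statistic and the data are purely illustrative:

import numpy as np

def nonparametric_bootstrap(x, stat, n_boot=1000, seed=0):
    """Resample x with replacement and return the bootstrap distribution of stat."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    return np.array([stat(rng.choice(x, size=len(x), replace=True))
                     for _ in range(n_boot)])

x = [3.1, 4.7, 2.2, 5.9, 3.8, 4.4, 6.1, 2.9, 5.0, 3.5]   # made-up observed sample
boot_medians = nonparametric_bootstrap(x, np.median)
print("bootstrap estimate of the SE of the median:", boot_medians.std(ddof=1))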

Semiparametric bootstrap

The resampling bootstrap can only reproduce the items that were in the original sample.  The semiparametric bootstrap assumes that the population includes other items that are similar to the observed sample by sampling from a smoothed version of the sample histogram.  It turns out that this can be done very simply by first taking a sample with replacement from the observed sample (just like the nonparametric bootstrap) and then adding noise. 

Semiparametric bootstrapping works out much better for procedures like feature selection, clustering and classification, in which there is no continuous way to move between quantities. In a nonparametric bootstrap sample there will almost always be some replication of the same sample values due to sampling with replacement.  In the semiparametric bootstrap, this replication is broken up by the added noise.
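A corresponding sketch of the semiparametric (noisy) bootstrap; the amount of added noise (here 20% of the sample SD) is a tuning choice for illustration, not something prescribed by the method:

import numpy as np

def semiparametric_bootstrap(x, n_boot=1000, noise_frac=0.2, seed=0):
    """Resample with replacement, then jitter each value with Normal noise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    noise_sd = noise_frac * x.std(ddof=1)                       # smoothing amount (a tuning choice)
    boot = rng.choice(x, size=(n_boot, len(x)), replace=True)   # resampling step
    return boot + rng.normal(0.0, noise_sd, size=boot.shape)    # add noise to break ties

boot_samples = semiparametric_bootstrap([3.1, 4.7, 2.2, 5.9, 3.8, 4.4])
print(boot_samples.shape)   # (1000, 6): one smoothed pseudo-sample per row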

Parametric bootstrap

Parametric bootstrapping assumes that the data come from a known distribution with unknown parameters. (For example, the data may come from a Poisson or negative binomial distribution for counts, or a Normal distribution for continuous measurements.) You estimate the parameters from the data that you have and then use the estimated distribution to simulate the samples.

All three methods are simulation-based.

Examples

Clustering

The nonparametric bootstrap does not work well for clustering because sampling with replacement produces exact replicates. Samples that are identical will be clustered together, so you do not get much new information.

The semiparametric bootstrap perturbs the data with a bit of noise.  For clustering, instead of taking a bootstrap sample and perturbing it, we might take the entire original sample and perturb it.  This allows us to identify the original data points on the cluster diagram and see whether they remain in the same clusters or move to new clusters.

Obtaining a confidence interval for a Normal mean (a parametric example)

Suppose we have a sample of size n and we believe the population is Normally distributed.  A parametric bootstrap can be done by computing the sample mean \(\bar{x}\) and variance \(s^2\).  The bootstrap samples are generated as random samples of size n from N(\(\bar{x},s^2\)).  After taking 1000 samples or so, the set of 1000 bootstrap sample means should be a good estimate of the sampling distribution of \(\bar{x}\).  A 95% confidence interval for the population mean is then formed by sorting the bootstrap means from lowest to highest, and dropping the 2.5% smallest and 2.5% largest.  The smallest and largest remaining values are the ends of the confidence interval.

How does this compare to the usual confidence interval, \(\bar{x} \pm t_{.975}s/\sqrt{n}\)?  Our interval turns out to approximate \(\bar{x} \pm z_{.975}s/\sqrt{n}\) - that is, it uses the Normal approximation to the t-distribution.  This is because it does not take into account that we have estimated the variance.   There are ways to improve the estimate, but we will not discuss them here.
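The Normal-mean procedure can be written in a few lines; the sample x below is simulated purely for illustration:

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(10.0, 2.0, size=15)                     # stand-in for an observed sample
xbar, s = x.mean(), x.std(ddof=1)

# Parametric bootstrap: simulate from N(xbar, s^2) and recompute the mean each time
boot_means = rng.normal(xbar, s, size=(1000, len(x))).mean(axis=1)
lo, hi = np.quantile(boot_means, [0.025, 0.975])       # drop 2.5% in each tail
print("sample mean:", round(xbar, 2))
print("95% parametric bootstrap CI for the mean:", (round(lo, 2), round(hi, 2)))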

Obtaining a confidence interval for  \(\pi_0\) with RNA-seq data (a complex parametric example)

For an example of using the parametric bootstrap, let's consider computing a confidence interval for \(\pi_0\) in an RNA-seq experiment. In this case we will assume that the data are Poisson. Here is what we would do:

1) First we estimate \(\pi_0\) from all of the data.

2) Now we need to obtain a bootstrap sample from the Poisson distribution.  We will hold the library sizes fixed.

a) For each treatment:

i) In each sample, for each feature, recompute the count as a proportion of the library size.

ii) For each feature, compute the mean proportion over all the samples from that treatment - call this \(g_{i}\) where i is the feature.

iii) For each sample,  multiply the library size \(N_j\) where j is the sample, by \(g_i\) to obtain \(N_jg_i\) the expected count for feature i in sample j.

iv) The bootstrap sample for feature i in sample j is generated as a random Poisson with mean  \(N_jg_i\) .

b) Now that there is a bootstrap "observation" for each feature in each sample, redo the differential expression analysis and estimate \(\pi_0\).

c) Repeat steps (a) and (b) 1000 times. Now you have 1000 different estimates of \(\pi_0\) - this is your estimate of the sampling distribution of the estimate.

3) Your 1000 bootstrap estimates can be used to draw a histogram of the sampling distribution of the estimate of \(\pi_0\).  The central 95% of the histogram is a 95% confidence interval for  \(\pi_0\).  To estimate this interval, it is simplest to use the sorted bootstrap values instead of the histogram.  For example, if you drop the 2.5% smallest and largest values, the remainder are in the 95% confidence interval. To form the ends of the interval, use the smallest and largest of this central 95% of the bootstrap values.

This is a parametric bootstrap confidence interval because the bootstrap samples were generated by estimating the Poisson means and then generating samples from the Poisson distribution.
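A highly simplified sketch of steps 2(a)(i)-(iv) for a single treatment group is given below; counts is a hypothetical features-by-samples matrix of raw counts, and the differential expression analysis and \(\pi_0\) estimation (step 2b) are left as comments because they depend on the pipeline being used:

import numpy as np

def poisson_bootstrap_counts(counts, rng):
    """One parametric bootstrap replicate of a features x samples count matrix
    for a single treatment group, holding the library sizes fixed."""
    lib_sizes = counts.sum(axis=0)                  # N_j: library size of each sample
    proportions = counts / lib_sizes                # per-sample proportion of the library
    g = proportions.mean(axis=1, keepdims=True)     # g_i: mean proportion for each feature
    expected = g * lib_sizes                        # N_j * g_i: expected count for feature i, sample j
    return rng.poisson(expected)                    # Poisson draw with those means

# Hypothetical tiny example: 4 features, 3 samples in one treatment group
rng = np.random.default_rng(6)
counts = np.array([[10,  12,   9],
                   [100, 80, 120],
                   [0,    2,   1],
                   [50,  55,  40]])
print(poisson_bootstrap_counts(counts, rng))
# In the full procedure this is done for each treatment group, the differential
# expression analysis and pi0 estimate are recomputed on the bootstrap counts,
# and the whole cycle is repeated ~1000 times to build the sampling distribution.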

[1] Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics, 38. Society for Industrial and Applied Mathematics. ISBN 0-89871-179-7.

15.4 - Consensus Methods

When you do cross-validation (CV) or the bootstrap you produce many estimates or predictors - one for each of the pseudo-samples.  These can be used to improve the final estimate or prediction rule.  In machine learning, this is called the "agreement of experts" or "ensemble learning".  In bioinformatics it is called "consensus".  There are many ways to form consensus rules.

To understand some of the methods, consider the problem of classifying the bone marrow samples into 4 classes: normal, MGUS, SMM or MM. 

Strict consensus makes a prediction only when all of the CV or bootstrap samples make the same prediction.  This can be useful in classification, as it helps determine which classifications are undisputed and which may be questionable.  It is useful in tree-based methods, as it indicates when all the trees have the same node and the same branches at that node. However, you may end up with many samples that are not classified because at least one of the CV or bootstrap samples gave it a different classification from the others.

Majority rule makes the prediction that gets the most votes. Of course, it is not guaranteed that there is a majority prediction.  For example, if you do 100 bootstraps, a sample might be classified equally often into each of the 4 classes.  Alternatively, it could be classified as normal 24 times, as MGUS and SMM 25 times each and as MM 26 times, and then majority rule would assign it to MM.  Majority rule can be problematic because it treats a vote of 0, 0, 0, 100 the same as 24, 25, 25, 26.

Strict majority rule uses the most frequent prediction only if it receives more than 50% of the votes.  Samples with more varied predictions do not get classified.

As we apply these approaches to classification with B bootstrap or CV samples, this is what you might see.

Strict consensus – you classify only the samples for which all B predictions agree; the remainder are not classified.

Majority rule – you classify samples into the most frequently chosen class.

Strict Majority rule – you classify samples into the most frequently chosen class if chosen more than 50% of the time; otherwise they are not classified.
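These three rules can be sketched directly from the vector of votes for a single sample (here votes holds the class assigned by each of B=100 bootstrap predictors; the 24/25/25/26 split is the example from above):

import numpy as np

def consensus_class(votes, rule="majority"):
    """Aggregate B bootstrap/CV class predictions for one sample.
    rule: 'strict' (unanimous), 'majority' (plurality) or 'strict_majority' (>50%).
    Returns None when the rule leaves the sample unclassified."""
    votes = np.asarray(votes)
    classes, counts = np.unique(votes, return_counts=True)
    top = classes[counts.argmax()]                      # most frequently chosen class
    if rule == "strict":
        return top if len(classes) == 1 else None
    if rule == "strict_majority":
        return top if counts.max() > len(votes) / 2 else None
    return top

votes = ["MM"] * 26 + ["SMM"] * 25 + ["MGUS"] * 25 + ["normal"] * 24
for rule in ("strict", "majority", "strict_majority"):
    print(rule, "->", consensus_class(votes, rule))
# strict -> None, majority -> MM, strict_majority -> None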

We are going to discuss three consensus methods - bagging, consensus clustering and random forest (for consensus classification using classification trees).

15.5 - Aggregated Prediction

Bootstrapping was developed around 1982. Around 1994, the idea of using the bootstrap samples to improve prediction was proposed [1,2]. Bootstrap aggregation (shortened to "bagging") computes a predictor from each of the bootstrap samples, then aggregates them into a consensus predictor by either voting or averaging.  Random forest is a similar method using classification trees.

Example

This example comes from an observational study of cardiovascular risk. We cannot randomly assign people to low and high risk environments.  One interesting design is to study populations which originate in low-risk environments but migrate to high-risk environments. These data come from a study of Peruvian Indians moving from the highlands of Peru (low risk) into a major city (westernized diet and lifestyle leading to higher risk). The researchers measured a number of physical attributes that are thought to be related to cardiovascular disease:

  • Age,
  • Years (in city),
  • Height,
  • Pulse,
  • Systolic BP, and
  • 3 skinfolds: Chin, Forearm, Calf

The three skinfold measurements are a measure of fat under the skin and are highly correlated with one another and with weight.  This high correlation causes problems in prediction, so often variable selection is used to pick the features that are the good predictors.

Bagging - Aggregating multiple linear regression with variable selection

Multiple linear regression was used to develop a measure of weight predicted from all of the other variables.  Variable selection (stepwise regression) was used to select a smaller set of predictors, because of the problems of high multicollinearity of the predictors.

Now let's see how bagging would work for this example.  A large number of nonparametric bootstrap samples were taken from the data. For each bootstrap sample, variables were selected using stepwise regression, and the resulting multiple linear regression equation was used to predict the weight of all the individuals (including those not selected in the bootstrap sample).  Bagging was done by averaging the predictions across all the prediction equations and is shown in black in the plot below.  For comparison, we also predicted the weight using all the other variables, shown in red on the plot.

[Figure: bagging / bootstrap aggregation plot]

You can see that the red predicted weights are not well correlated with the true weight, while the bagged predictions are highly correlated. 
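A stripped-down sketch of this kind of bagging is shown below (ordinary least squares refit on each bootstrap sample, without the stepwise selection step, and with simulated stand-ins for the Peru predictors and weights):

import numpy as np

def bagged_predictions(X, y, n_boot=500, seed=0):
    """Fit least squares to each bootstrap sample and return all predictions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample (with replacement)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        preds[b] = X @ beta                              # predict for ALL individuals
    return preds

# Simulated stand-in data: 39 individuals, intercept plus 5 predictors
rng = np.random.default_rng(7)
Z = rng.normal(size=(39, 5))
X = np.column_stack([np.ones(39), Z])
y = X @ np.array([60.0, 2.0, 1.5, 0.5, -1.0, 3.0]) + rng.normal(scale=4.0, size=39)

preds = bagged_predictions(X, y)
bagged = preds.mean(axis=0)      # the bagged (consensus) prediction for each individual
print(bagged[:3])
# The spread of preds[:, i] across the bootstrap fits also gives the per-individual
# uncertainty assessment described below.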

As well, the bagged estimates come with some bonus uncertainty assessments.  For each individual in the study, each of the bootstrap samples gives us a prediction. Therefore, if we delete the 2.5% smallest and largest predicted values for an individual, we have a 95% confidence interval for individuals with the same measurements that does not rely on a Normality assumption.

As well, we can assess how important each of these variables is as a predictor by counting how many times each variable was selected.

  Variable:        Age   Calf   Chin   Forearm   Height   Pulse   BP    Years
  Times selected:  773   372    848    862       1000     194     993   945

Height was always selected - this is not surprising because taller people tend to be heavier.  On the other hand, some of the information is surprising.  For example, pulse is not selected often, but blood pressure is, suggesting a correlation between weight and BP in this group.  Calf skinfold is selected only about 1/3 of the time - this might suggest a lack of correlation between calf skinfold and weight, or it may be due to the multicollinearity with the other skinfold measures.  In any case, these counts might give a researcher some important clues about the epidemiological questions under consideration.

Random Forest

Although a method called regression trees can be used similarly to classification trees for prediction, we will focus on classification.  (The random forest idea can be used with regression trees very similarly to bagging.)  

Suppose that each individual is classified into Normal or Overweight based on whether their body mass index (weight/height\(^2\)) is over or under 25.  (There is also an obese category - over 30 - but there is only one individual in this sample in that category.)  We then use recursive partitioning to classify the samples into these two classes, using all of the variables except for weight.  We obtain the following tree:

[Figure: recursive partitioning tree for the Peru data]

Next we bootstrap, creating a tree for each bootstrap sample.  This is called a random forest.  Each observation is classified by each tree and the final classification is by majority rule.
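A minimal hand-rolled version of this idea (one classification tree per bootstrap sample, then majority vote) is sketched below using scikit-learn's DecisionTreeClassifier as the base learner; X and y stand for the predictor matrix and the Normal/Overweight labels and are simulated here. A full random forest implementation (e.g. the R randomForest package or sklearn.ensemble.RandomForestClassifier) additionally randomizes the features considered at each split.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_forest_predict(X, y, n_trees=500, seed=0):
    """Grow one tree per bootstrap sample and classify every observation by majority vote."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    votes = np.empty((n_trees, n), dtype=object)
    for b in range(n_trees):
        idx = rng.integers(0, n, size=n)                           # bootstrap sample
        tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
        votes[b] = tree.predict(X)                                 # each tree classifies everyone
    # Majority vote over trees for each observation
    return np.array([max(set(votes[:, i]), key=list(votes[:, i]).count)
                     for i in range(n)])

# Illustrative call with simulated predictors and labels
rng = np.random.default_rng(8)
X = rng.normal(size=(39, 6))
y = np.where(X[:, 0] + rng.normal(scale=0.5, size=39) > 0, "Overweight", "Normal")
print(bootstrap_forest_predict(X, y)[:5])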

Here are the two confusion matrices:

Recursive partitioning (without bootstrapping)

                          True class
  Predicted          Normal    Overweight
  Normal                14          1
  Overweight             5         19

Random forest

                          True class
  Predicted          Normal    Overweight
  Normal                15          6
  Overweight             4         14

In most cases, random forest is a better classifier, but this example is one of the exceptions.  It is not clear why, but it might be due to the sharp cut-point used for BMI, as 1/3 of the sample has BMI between 24 and 26.

References

[1] Breiman, Leo (1996). "Bagging predictors". Machine Learning 24 (2): 123–140. doi:10.1007/BF00058655. CiteSeerX: 10.1.1.121.7654

[2] Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324

15.6 - Consensus Clustering

Consensus Clustering is another idea for using bootstrap sampling.

We start by clustering our data using whatever method we prefer - e.g. complete linkage hierarchical clustering.  To use consensus clustering, we will need to break the tree into clusters by some method that we can repeat with other samples, such as choosing a fixed number of clusters or a fixed cut height.

We then do bootstrap sampling.  In clustering, identical items will automatically be clustered together, so the replicates that one always obtains with the nonparametric bootstrap are problematic.  There are two ways around this.  Often the semiparametric bootstrap is used, using all of the original observations and adding noise.  This makes it possible to identify the original sample in each bootstrap cluster.  Another alternative is to sample without replacement the expected number of unique items.  For large sample sizes, this is about \((1-1/e)N\), where e is the base of the natural logarithm and N is the sample size.  \((1-1/e)\approx 0.632\), and this is sometimes called the .632 bootstrap.

For each bootstrap sample we obtain the clusters.  Then we seek consensus.

The simplest consensus approach creates a matrix whose entries \(C_{ij}\) count the number of times items i and j are in the same cluster.  Items that clearly cluster together should be in the same cluster often, while those that are clearly distant may never be in the same cluster.  A function of this matrix is used as the distance measure - e.g. distance(i,j) = 1/\(C_{ij}\).  The consensus cluster is formed by clustering (again with any method desired) with this new distance measure.
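A sketch of this counting approach using scipy's hierarchical clustering is given below (complete linkage, a fixed number of clusters, and the semiparametric noisy bootstrap described above; the two-group simulated data and the use of 1 - proportion rather than 1/\(C_{ij}\) as the distance are illustrative choices):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_matrix(data, k=3, n_boot=200, noise_frac=0.1, seed=0):
    """Proportion of noisy-bootstrap replicates in which each pair of items co-clusters."""
    rng = np.random.default_rng(seed)
    n = len(data)
    counts = np.zeros((n, n))
    noise_sd = noise_frac * data.std(axis=0, ddof=1)
    for _ in range(n_boot):
        noisy = data + rng.normal(0.0, noise_sd, size=data.shape)   # perturb every item
        labels = fcluster(linkage(noisy, method="complete"), t=k, criterion="maxclust")
        counts += labels[:, None] == labels[None, :]                # same-cluster indicator
    return counts / n_boot

# Two well-separated simulated groups of 10 items each, 4 features
rng = np.random.default_rng(9)
data = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(4, 1, (10, 4))])

C = consensus_matrix(data, k=2)
dist = 1.0 - C                                        # items often together -> small distance
final = fcluster(linkage(squareform(dist), method="complete"), t=2, criterion="maxclust")
print(final)                                          # consensus cluster labels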

