
We will start the discussion of uncertainty quantification with a problem that is of particular interest in regression and classification: assessing prediction error. In regression we have a continuous response variable and one or more predictor variables (which may be continuous or categorical). The prediction problem ignores questions of model correctness (e.g., are the regression coefficients of the correct order of magnitude?) and focuses on the predicted values: do we obtain good predictions of the response variable? Similarly, in the classification problem, we have a categorical response variable and one or more predictor variables. The prediction problem focuses on whether the samples are correctly classified to their categories. The objective is to find a rule that performs well in predicting outcomes or categories for new cases for which the response or category is not known. For example, we might want to predict the probability of success for new graduate students based on the information in their dossiers, or categorize tissue samples as normal, benign, or cancerous based on their gene expression. In these examples, we are not looking for a model that accurately reflects the underlying process - we just want a good predictor for new samples.

The data on which the prediction or classification rules are developed are called the training sample. In statistics, we talk about "fitting" the model; in machine learning, we talk about "training" the predictor. Typically, the fitting step minimizes a measure of prediction error on the training sample. For example, in least squares regression the residual is the difference between the observed response and its predicted value, and the model is fitted by selecting the parameters that minimize the sum of squared residuals (also called the sum of squared errors, or SSE). In classification, we might instead count the number of misclassified observations (possibly weighting more serious misclassifications more heavily).
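As a concrete illustration of this fitting step, here is a minimal least-squares sketch in NumPy; the simulated data and variable names are assumptions made purely for illustration, not part of the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training sample: n observations, p continuous predictors.
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)          # response = signal + noise

# Fit by ordinary least squares (intercept included).
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Residuals and the training-sample SSE that least squares minimizes.
residuals = y - X1 @ beta_hat
sse = np.sum(residuals ** 2)
print(f"training-sample SSE: {sse:.2f}")
```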

The method seems straightforward, but it has several flaws. The one we will tackle in this chapter is "optimism": because the prediction rule is fitted to minimize the prediction error on the training set, it underestimates the prediction error it will achieve with new data. Optimism is closely tied to another problem called overfitting, which occurs when the fitted prediction rule fits the noise as well as the systematic aspects of the data. As a simple example, suppose there are n observations and p linearly independent features, with p = n. (We cannot have p > n and still have linearly independent features.) Linear algebra tells us that linear regression can then provide a perfect fit to any response variable, including a categorical response, so that the estimated prediction error will always be zero. This mathematical fact has nothing to do with underlying structure - we could randomly simulate our p features and still obtain perfect prediction. Even if the number of predictors is less than the number of observations, we can overfit the model, although perhaps not obtain a perfect fit. Besides leading to an optimistic assessment of prediction error, overfitting usually produces rules that have high variability and are very sensitive to noise in the data.
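The p = n claim is easy to check by simulation. The sketch below is purely illustrative (both the features and the 0/1 "response" are random noise): the training error is essentially zero, yet the fitted rule is useless on new data.

```python
import numpy as np

rng = np.random.default_rng(1)

# n observations and p = n purely random features: no structure at all.
n = 20
X = rng.normal(size=(n, n))
y = rng.integers(0, 2, size=n).astype(float)   # even a random 0/1 "response"

# Linear regression fits the training data perfectly...
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
train_sse = np.sum((y - X @ beta_hat) ** 2)

# ...but it has learned only noise, so it fails on new random data.
X_new = rng.normal(size=(n, n))
y_new = rng.integers(0, 2, size=n).astype(float)
new_sse = np.sum((y_new - X_new @ beta_hat) ** 2)

print(f"training SSE: {train_sse:.1e}")   # essentially zero
print(f"new-data SSE: {new_sse:.2f}")     # far from zero
```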

Since our objective is to develop a prediction rule that does well with new data, ideally we would collect a sufficiently large set of new observations, including the outcome, and assess the prediction rule on those data. This is called the validation sample. We expect the assessment of prediction error on the validation sample to be honest rather than optimistic, because the training and validation samples share only the systematic features of the data, not the noise.
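If a validation sample were available, the assessment itself would be simple. The sketch below simulates a training sample and a separate validation sample (both hypothetical, drawn from the same assumed linear model) and compares the training-sample error with the validation-sample error.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n, beta, rng):
    """Simulate n observations from a hypothetical linear model with noise."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, beta.size - 1))])
    y = X @ beta + rng.normal(size=n)
    return X, y

beta_true = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 0.0])

# Training sample: fit the rule here.
X_train, y_train = simulate(50, beta_true, rng)
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Validation sample: new observations from the same process, including the outcome.
X_valid, y_valid = simulate(500, beta_true, rng)

mse_train = np.mean((y_train - X_train @ beta_hat) ** 2)
mse_valid = np.mean((y_valid - X_valid @ beta_hat) ** 2)
print(f"training MSE:   {mse_train:.3f}")   # optimistic (too small)
print(f"validation MSE: {mse_valid:.3f}")   # an honest estimate of prediction error
```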

However, collecting new data is expensive - we would not want to do it to assess a rule that might not be very good in the first place. And even if we can afford to collect new data, we would likely want to use those data to improve our predictor, not just to assess it. So the question becomes: how can we use the data we have already collected to assess prediction error? And once we have done that, can we use the results to improve our prediction rule?

Cross-validation (CV) is a method, developed in the 1970s, for assessing prediction error while also training the prediction rule. The bootstrap (actually a wide class of methods), developed in the early 1980s, is much more general: it starts by estimating the sampling distribution of the prediction rule (or its parameters) and can also be used to assess aspects of the prediction rule, including prediction error.
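To make these two ideas concrete before the details, here is a minimal sketch of each, hand-rolled in NumPy rather than taken from any particular library; the simulated data, the choice of K = 5 folds, and B = 500 bootstrap resamples are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: n observations, p predictors, linear signal plus noise.
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

# K-fold cross-validation: every observation is predicted by a rule
# fitted without it, so the error estimate is not optimistic.
K = 5
folds = np.array_split(rng.permutation(n), K)
cv_sse = 0.0
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    cv_sse += np.sum((y[test] - X[test] @ beta_hat) ** 2)
print(f"cross-validated MSE estimate: {cv_sse / n:.3f}")

# Bootstrap: resample observations with replacement to estimate the
# sampling distribution of the fitted coefficients (here, their standard errors).
B = 500
boot_betas = np.empty((B, p + 1))
for b in range(B):
    idx = rng.integers(0, n, size=n)          # rows drawn with replacement
    boot_betas[b], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
print("bootstrap standard errors of the coefficients:",
      boot_betas.std(axis=0).round(3))
```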