2.2 - Cross Validation
In the previous subsection we mentioned that cross-validation is a technique for measuring the predictive performance of a model. Here we explain the different methods of cross-validation (CV) and their peculiarities.
Holdout Sample: Training and Test Data
The data is split into two groups. The training set is used to train the learner. The test set is used to estimate the error rate of the trained model. This method has two basic drawbacks. First, in a sparse data set one may not have the luxury of setting aside a reasonable portion of the data for testing. Second, since it is a single repetition of the train-and-test experiment, the error estimate is not stable: if we happen to get a 'bad' split, the estimate is not reliable.
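The instability of a single holdout estimate can be seen in a minimal sketch. The helper `holdout_split`, the 25% test fraction, the simulated data and the mean-only predictor are all assumptions chosen for illustration, not part of the course material.

```python
import random

def holdout_split(n, test_fraction=0.25, seed=0):
    """Randomly split indices 0..n-1 into (train, test) index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def holdout_mse(y, seed):
    """Fit a mean-only model on the training part; return its test MSE."""
    train, test = holdout_split(len(y), seed=seed)
    mu = sum(y[i] for i in train) / len(train)
    return sum((y[i] - mu) ** 2 for i in test) / len(test)

# Toy response values (assumed, for illustration only).
data_rng = random.Random(123)
y = [data_rng.gauss(0.0, 1.0) for _ in range(40)]

# Each seed gives a different split, hence a different error estimate.
estimates = [holdout_mse(y, seed) for seed in range(5)]
print(estimates)
```

The spread of the printed values is the variability that a single train-and-test experiment hides.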
Three-way Split: Training, Validation and Test Data
The available data is partitioned into three sets: a training set, a validation set and a test set. The prediction model is trained on the training set and evaluated on the validation set. For example, in the case of a neural network, the training set is used to find the optimal weights with the back-propagation rule. The validation set may be used to find the optimum number of hidden layers or to determine a stopping rule for the back-propagation algorithm. (Neural networks are not covered in this course.) Training and validation may be iterated a few times until a 'best' model is found. The final model is assessed using the test set.
A typical split is 50% for the training data and 25% each for the validation set and the test set.
With a three-way split, model selection and estimation of the true error rate can be carried out in the same experiment. The error rate of the final model on the validation data is a biased estimate (smaller than the true error rate), since the validation set was used to select the final model. Hence a third, independent part of the data, the test data, is required.
After assessing the final model on the test set, the model must not be fine-tuned any further.
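The three-way procedure can be sketched as follows, assuming the 50/25/25 split mentioned above. The simulated data, the nearest-neighbour averaging model and the candidate values of `k` are hypothetical choices made only to have something to select among.

```python
import random

def three_way_split(n, seed=0):
    """Shuffle indices and cut them 50% / 25% / 25%."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train, n_val = n // 2, n // 4
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def knn_predict(x0, train, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, eval_set, k):
    """Mean squared error of the k-NN predictor on eval_set."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in eval_set) / len(eval_set)

# Simulated (x, y) pairs, assumed for illustration only.
data_rng = random.Random(42)
data = []
for _ in range(200):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, x * x + data_rng.gauss(0.0, 0.1)))

train_idx, val_idx, test_idx = three_way_split(len(data), seed=0)
train = [data[i] for i in train_idx]
val = [data[i] for i in val_idx]
test = [data[i] for i in test_idx]

# Model selection uses the validation set only.
candidates = [1, 3, 5, 9, 15]
val_errors = {k: mse(train, val, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# The test set is touched once, after the model has been chosen.
test_error = mse(train, test, best_k)
print(best_k, val_errors[best_k], test_error)
```

Because `best_k` was picked to minimise the validation error, `val_errors[best_k]` is an optimistic figure; `test_error` is the one to report.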
Unfortunately, data insufficiency often does not allow a three-way split.
The limitations of the holdout or three-way split can be overcome with a family of resampling methods at the expense of higher computational cost.
Among the methods available for estimating prediction error, the most widely used is cross-validation (Stone, 1974). Essentially, cross-validation comprises techniques that split the sample into multiple training and test data sets.
Random subsampling performs K data splits of the entire sample. For each data split, a fixed number of observations is chosen without replacement from the sample and set aside as the test data. The prediction model is fitted to the training data from scratch for each of the K splits, and an estimate of the prediction error (PE) is obtained from each test set. Let the estimated PE in the i-th test set be denoted by E_i. The estimate of the true error is the average of the separate estimates: E = (E_1 + E_2 + ... + E_K) / K.
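A minimal sketch of random subsampling follows. The function names, the simulated data and the mean-only predictor are assumptions for illustration; any model that can be refitted on each training set would take its place.

```python
import random

def random_subsampling_splits(n, n_test, K, seed=0):
    """Yield K (train, test) index splits. Each test set is drawn without
    replacement from the full sample, independently of the other splits."""
    rng = random.Random(seed)
    for _ in range(K):
        test = rng.sample(range(n), n_test)
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

def subsampling_error(y, n_test, K, seed=0):
    """Return the averaged error estimate and the K per-split estimates E_i."""
    errors = []
    for train, test in random_subsampling_splits(len(y), n_test, K, seed):
        mu = sum(y[i] for i in train) / len(train)  # refit from scratch
        errors.append(sum((y[i] - mu) ** 2 for i in test) / len(test))
    return sum(errors) / K, errors

# Toy response values (assumed, for illustration only).
data_rng = random.Random(7)
y = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

estimate, per_split = subsampling_error(y, n_test=10, K=20, seed=1)
print(estimate)
```

Since the test sets are drawn independently, an observation may appear in several test sets or in none, which is the main difference from a K-fold partition.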
In K-fold cross-validation, a K-fold partition of the sample is created. The original sample is randomly partitioned into K equal sized (or almost equal sized) subsamples. Of the K subsamples, a single subsample is retained as the test set for estimating the PE, and the remaining K-1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the test set. The K error estimates from the folds can then be averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
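The K-fold procedure can be sketched as below. The helper names and the mean-only predictor are illustrative assumptions; the partition step itself is the part that matters.

```python
import random

def kfold_partition(n, K, seed=0):
    """Randomly partition indices 0..n-1 into K folds whose sizes differ by at most 1."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::K] for i in range(K)]

def kfold_error(y, K, seed=0):
    """Average test MSE over the K folds, refitting a mean-only model each time."""
    folds = kfold_partition(len(y), K, seed)
    errors = []
    for i, test in enumerate(folds):
        # Training data: every fold except the i-th.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        mu = sum(y[j] for j in train) / len(train)
        errors.append(sum((y[j] - mu) ** 2 for j in test) / len(test))
    return sum(errors) / K

folds = kfold_partition(23, K=5, seed=0)
print([len(f) for f in folds])  # [5, 5, 5, 4, 4]

# Toy response values (assumed, for illustration only).
data_rng = random.Random(11)
y = [data_rng.gauss(0.0, 1.0) for _ in range(23)]
print(kfold_error(y, K=5))
```

Every index lands in exactly one fold, so each observation is used for testing exactly once and for training K-1 times.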
For classification problems, one typically uses stratified K-fold cross-validation, in which the folds are selected so that each fold contains roughly the same proportions of class labels.
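One simple way to build stratified folds is to shuffle each class separately and deal its members out to the folds in turn. This is a sketch under that assumption; the function name `stratified_kfold` and the toy labels are hypothetical, and library implementations may balance the folds differently.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, K, seed=0):
    """Return K folds of indices with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(K)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal this class's observations round-robin across the folds.
        for pos, i in enumerate(members):
            folds[pos % K].append(i)
    return folds

# Toy labels with a 3:1 class imbalance (assumed, for illustration only).
labels = ["a"] * 30 + ["b"] * 10
folds = stratified_kfold(labels, K=5, seed=0)
for fold in folds:
    counts = {lab: sum(labels[i] == lab for i in fold) for lab in ("a", "b")}
    print(counts)  # {'a': 6, 'b': 2} for every fold
```

With an unstratified partition, a fold could by chance contain very few observations of the smaller class, which makes its error estimate erratic.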
In repeated cross-validation, the cross-validation procedure is repeated m times, each time with a new random partition of the original sample. The m results are again averaged (or otherwise combined) to produce a single estimate.
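Repetition amounts to an outer loop around K-fold cross-validation, drawing a fresh partition on each pass. The sketch below assumes simple averaging of the m results; the function names, the simulated data and the mean-only predictor are illustrative.

```python
import random

def kfold_error(y, K, rng):
    """One K-fold CV run using a fresh random partition drawn from rng."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::K] for i in range(K)]
    errors = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        mu = sum(y[j] for j in train) / len(train)
        errors.append(sum((y[j] - mu) ** 2 for j in test) / len(test))
    return sum(errors) / K

def repeated_kfold_error(y, K, m, seed=0):
    """Repeat K-fold CV m times and average the m estimates."""
    rng = random.Random(seed)
    runs = [kfold_error(y, K, rng) for _ in range(m)]
    return sum(runs) / m, runs

# Toy response values (assumed, for illustration only).
data_rng = random.Random(5)
y = [data_rng.gauss(0.0, 1.0) for _ in range(30)]

estimate, runs = repeated_kfold_error(y, K=10, m=5, seed=0)
print(estimate)
```

Averaging over several partitions reduces the dependence of the estimate on any one random assignment of observations to folds.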
A common choice for K is 10.
With a large number of folds (K large), the bias of the true error rate estimator is small but its variance is large. The computational time may also be very large, depending on the complexity of the models under consideration. With a small number of folds, the variance of the estimator is small but its bias is large: because each model is trained on a smaller portion of the data, the estimate tends to be larger than the true error rate.
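Part of this trade-off is plain arithmetic: each fit in K-fold cross-validation sees roughly n(K-1)/K observations, and K fits are needed. The short sketch below tabulates both quantities for an assumed sample size of n = 100; the function name is hypothetical.

```python
def cv_cost_summary(n, K):
    """Return (smallest training-set size, number of model fits) for K-fold CV."""
    largest_fold = -(-n // K)  # ceil(n / K)
    return n - largest_fold, K

n = 100
for K in (2, 3, 5, 10, n):
    train_size, fits = cv_cost_summary(n, K)
    print(f"K={K:>3}: train on {train_size} of {n} observations, {fits} fits")
```

As K grows, the training sets approach the full sample, which reduces the bias, while the number of fits and hence the computing time grow in proportion to K.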
In practice, the choice of the number of folds depends on the size of the data set. For a large data set, a smaller K (e.g. 3) may yield quite accurate results. For sparse data sets, leave-one-out (LOO or LOOCV) may need to be used.
LOO is the degenerate case of K-fold cross-validation where K = n for a sample of size n. That means that, n separate times, the prediction function is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The leave-one-out cross-validation estimate has low bias, but it can be very expensive to compute because the model must be fitted n times.
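A minimal sketch of leave-one-out cross-validation with squared-error loss is given below. The `fit` and `predict` callables are a hypothetical interface introduced here so that any model can be plugged in; the mean-only model and the three data points are toy assumptions.

```python
def loo_cv_error(x, y, fit, predict):
    """Leave-one-out CV: n fits, each on all observations except one."""
    n = len(y)
    total = 0.0
    for i in range(n):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(x_train, y_train)
        total += (y[i] - predict(model, x[i])) ** 2
    return total / n

def fit_mean(x_train, y_train):
    """Toy model: ignore x and remember the mean response."""
    return sum(y_train) / len(y_train)

def predict_mean(model, x_new):
    """Predict the stored mean for any new point."""
    return model

x = [0.0, 1.0, 2.0]
y = [1.0, 2.0, 3.0]
print(loo_cv_error(x, y, fit_mean, predict_mean))  # 1.5
```

Here the three held-out squared errors are 2.25, 0 and 2.25, giving an average of 1.5. The loop calls `fit` once per observation, which is where the computational cost comes from when n is large or the model is slow to train.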