In the past, we have focused on statistical learning procedures that produce a single set of results. For example:
- A regression equation, with one set of regression coefficients or smoothing parameters.
- A classification or regression tree with one set of leaf nodes.
Model selection is often required, and it relies on a measure of fit associated with each candidate model.
The Aggregating Procedure:
Here the discussion shifts to statistical learning building on many sets of outputs that are aggregated to produce results. The aggregating procedure makes a number of passes over the data.
On each pass, inputs X are linked with outputs Y just as before. However, of interest now is the collection of all the results from all passes over the data. Aggregated results have several important benefits:
- Averaging over a collection of fitted values can help to avoid overfitting. It tends to cancel out the uncommon features of the data captured by a specific model. Therefore, the aggregated results are more stable.
- A large number of fitting attempts can produce very flexible fitting functions.
- Putting the averaging and the flexible fitting functions together has the potential to break the bias-variance tradeoff.
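The aggregating idea can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the procedure of any particular library: each pass draws a bootstrap resample (one common way to make repeated passes over the data), fits a one-split regression tree to it, and the final prediction is the average over all passes. The names `fit_stump`, `predict_stump`, and `bagged_predict` are hypothetical helpers introduced here.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by minimizing squared error.

    Returns (threshold, left_mean, right_mean); threshold is None when no
    split is possible (all x values identical).
    """
    pairs = sorted(zip(xs, ys))
    overall = mean(ys)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    """Predict with a fitted stump."""
    threshold, lm, rm = stump
    if threshold is None:
        return lm
    return lm if x <= threshold else rm


def bagged_predict(xs, ys, x_new, n_passes=50, seed=0):
    """Average stump predictions over n_passes bootstrap resamples.

    Assumption: each "pass over the data" is a bootstrap resample.
    """
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_passes):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(predict_stump(stump, x_new))
    return mean(preds)
```

A single stump can only produce a two-level step function. Because each resample may place its split at a different threshold, the average over many stumps can take many distinct values, which is the sense in which aggregation yields a more flexible yet more stable fit.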
Any attempt to summarize patterns in a dataset risks overfitting. All fitting procedures adapt to the data on hand, so even if the results are applied to a new sample from the same population, fit quality will likely decline. Hence, generalization can be somewhat risky.
"optimism increases linearly with the number of inputs or basis functions ..., but decreases as the training sample size increases." -- Hastie, Tibshirani and Friedman (unjustified).
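For reference, the quoted statement corresponds to a standard result for a linear fit with d inputs or basis functions. This is a sketch under the assumption of an additive error model Y = f(X) + ε with error variance σ_ε², where N is the training sample size, err-bar is the training error, and Err_in is the in-sample prediction error. The symbols are introduced here for illustration and are not defined elsewhere in these notes.

```latex
\omega \;\equiv\; \mathrm{E}_y\!\left(\mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}\right)
       \;=\; \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}\!\left(\hat{y}_i,\, y_i\right),
\qquad
\sum_{i=1}^{N}\mathrm{Cov}\!\left(\hat{y}_i,\, y_i\right) = d\,\sigma_\varepsilon^{2}
\;\;\Longrightarrow\;\;
\omega = \frac{2\,d\,\sigma_\varepsilon^{2}}{N}.
```

The expected optimism ω therefore grows in proportion to d and shrinks in proportion to 1/N, which matches the quotation.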
Decision Tree Example:
Consider decision trees as a key illustration. Overfitting often increases with (1) the number of possible splits for a given predictor; (2) the number of candidate predictors; and (3) the number of stages, which is typically represented by the number of leaf nodes.
When overfitting occurs in a classification tree, the classification error is underestimated, and the model may have a structure that will not generalize well. For example, a tree may include one or more predictors that really do not belong.
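The effect of the number of candidate predictors can be seen in a small simulation. This is a sketch with made-up data, not taken from the notes: the labels are pure noise, so no predictor truly belongs in the tree, yet the best single split looks better on the training data as more noise predictors are offered. The function names `best_stump_accuracy` and `noise_experiment` are hypothetical.

```python
import random


def best_stump_accuracy(columns, labels):
    """Highest training accuracy over all single-split rules on any column."""
    n = len(labels)
    ones = sum(labels)
    best = max(ones, n - ones) / n  # majority-class baseline (no split at all)
    for col in columns:
        for threshold in set(col):
            # Rule: predict 1 when the value exceeds the threshold, else 0.
            hits = sum((v > threshold) == bool(y) for v, y in zip(col, labels))
            # The flipped rule (predict 0 above the threshold) gets n - hits right.
            best = max(best, hits / n, (n - hits) / n)
    return best


def noise_experiment(n_rows=40, n_predictors=(1, 5, 25), seed=1):
    """Apparent accuracy of the best stump when labels and predictors are noise.

    The predictor sets are nested, so the result cannot decrease as more
    candidate predictors are searched.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    columns = [[rng.random() for _ in range(n_rows)]
               for _ in range(max(n_predictors))]
    return {p: best_stump_accuracy(columns[:p], labels) for p in n_predictors}
```

Calling `noise_experiment()` returns the best training accuracy for each number of candidate predictors. Since the labels are unrelated to every predictor, any accuracy above the majority-class baseline is an underestimate of the true classification error, not real predictive structure.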
Ideally, one would have two random samples from the same population: a training dataset and a test dataset. The fit measure from the test data would be a better indicator of how accurate the classification is.
Often there is only a single dataset. In that case, cross-validation can be used: the data are split into several randomly chosen, non-overlapping partitions of about the same size, and each partition in turn serves as the test data while the remaining partitions form the training data. With ten partitions, each would be part of the training data in nine analyses and serve as the test data in one analysis. The following figure illustrates 2-fold cross-validation for estimating the cross-validation prediction error for model A and model B. Model selection is based on choosing the model with the smallest cross-validation prediction error.
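The partitioning scheme described above can be sketched as follows. The two candidate models are placeholders chosen for illustration, since the notes do not say what model A and model B are: `fit_mean` always predicts the training mean and `fit_line` is a least-squares straight line. All function names are hypothetical, and squared error is assumed as the prediction error.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k random, non-overlapping folds of similar size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Placeholder 'model A': always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Placeholder 'model B': simple least-squares line (xs must not all be equal)."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return lambda x: ybar + slope * (x - xbar)


def cv_error(fit, xs, ys, k=10, seed=0):
    """Cross-validation estimate of mean squared prediction error."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the other k-1 folds, then score on the held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(mean([(ys[i] - model(xs[i])) ** 2 for i in test_idx]))
    return mean(fold_errors)
```

To select between the two candidates, compute `cv_error(fit_mean, xs, ys, k=2)` and `cv_error(fit_line, xs, ys, k=2)` with the same `seed`, so that both models are scored on identical partitions, and keep the model with the smaller value.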