
A good learner is one that has good prediction accuracy; in other words, one that has the smallest prediction error.

Let us try to understand the prediction problem intuitively. Consider the simple case of fitting a linear regression model to the observed data. A model is a good fit if it provides a high $R^2$ value. However, note that the model has used all the observed data, and only the observed data. How it will perform when predicting for a new set of input values (a new predictor vector) is therefore not clear. The assumption is that a model with a high $R^2$ value is expected to predict well for data observed in the future.

Suppose now the model is more complex than a linear model, and a spline smoother or a polynomial regression needs to be considered. What would be the proper complexity of the model? Would a fifth-degree polynomial be needed, or would a cubic spline suffice? Many modern classification and regression models are highly adaptable and are capable of formulating complex relationships. At the same time, they may overemphasize patterns that are not reproducible. Without a methodological approach to evaluating models, the problem will not be detected until the next set of samples is predicted. Note that this is not a matter of the sample used to develop the model being of poor quality!

The data at hand is to be used to find the best predictive model. Almost all predictive modeling techniques have tuning parameters that enable the model to flex to find the structure in the data. Hence, we must use the existing data to identify settings for the model’s parameters that yield the best and most realistic predictive performance for future data (known as model tuning). Traditionally, this has been achieved by splitting the existing data into training and test sets. The training set is used to build and tune the model, and the test set is used to estimate the model’s predictive performance. Modern approaches to model building split the data into multiple training and test sets, which has often been shown to yield better tuning parameters and a more accurate representation of the model’s predictive performance. More on data splitting is discussed in the next subsection.
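As a concrete, purely illustrative sketch of a single training/test split, the snippet below simulates a small regression data set, fits a linear model on the training portion only, and reports the error on both portions. It assumes the NumPy and scikit-learn libraries; all variable names and settings are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                      # three simulated predictors
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(0, 1.0, size=200)

# Hold out a quarter of the data: build the model on the training set,
# estimate predictive performance on the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("training MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE:    ", mean_squared_error(y_test, model.predict(X_test)))
```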

Let us consider the general regression problem. The training data,

\[D^{training} = \{(X_i, Y_i ),  i = 1, 2, ..., n\}\]

is used to regress Y on X, and then a new response, $Y_{new}$, is estimated by applying the fitted model to a brand-new set of predictors, $X_{new}$, from the test set $D^{test}$. The prediction for $Y_{new}$ is obtained by multiplying the new predictor values by the regression coefficients already estimated from the training set.

The resulting prediction is compared with the actual response value.
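A minimal sketch of this prediction step, using NumPy only: the coefficient vector $\hat{\beta}$ is estimated from the training set by least squares and then multiplied by a brand-new predictor vector. The true coefficients, noise level, and sample sizes are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X_train = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + two predictors
beta_true = np.array([1.0, 2.0, -1.0])
y_train = X_train @ beta_true + rng.normal(0, 0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)        # coefficients from training data

x_new = np.array([1.0, 0.3, -0.8])          # brand-new predictor vector (with intercept term)
y_new_hat = x_new @ beta_hat                # prediction = new predictors times fitted coefficients
y_new = x_new @ beta_true + rng.normal(0, 0.5)
print("predicted:", y_new_hat, " actual:", y_new)
```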

Prediction Error

The Prediction Error, PE, is defined as the mean squared error in predicting $Y_{new}$ using $\hat{f}(X_{new})$.

$PE = E[(Y_{new} - \hat{f} (X_{new}))^2]$, where the expectation is taken over $(X_{new},Y_{new})$.

We can estimate PE by:

\[\frac{1}{n}\sum_{i=1}^{n}\left( Y_{(new)i}-\hat{f}(X_{(new)i}) \right)^2\]

Note that this is not the same quantity as calculated from the training data. The latter is a misleadingly optimistic value because it estimates the predictive ability of the fitted model from the same data that was used to fit that model.
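The contrast can be seen numerically in the sketch below, which computes the same average-squared-error formula twice for one fitted model: once on the training data (optimistic) and once on new data (an honest estimate of PE). The true function, noise level, and choice of a polynomial fit are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(x)                                   # "true" relationship (illustrative)
x_tr = rng.uniform(0, 3, 30); y_tr = f(x_tr) + rng.normal(0, 0.3, 30)
x_new = rng.uniform(0, 3, 30); y_new = f(x_new) + rng.normal(0, 0.3, 30)

coefs = np.polyfit(x_tr, y_tr, deg=8)                     # a fairly flexible fitted model, f-hat
f_hat = lambda x: np.polyval(coefs, x)

pe_train = np.mean((y_tr - f_hat(x_tr)) ** 2)             # computed on the data used to fit
pe_test  = np.mean((y_new - f_hat(x_new)) ** 2)           # computed on new data: estimates PE
print("training error:", pe_train, " estimated PE:", pe_test)
```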

The dilemma in developing a statistical learning algorithm is clear. The model can be made very accurate on the observed data. However, since the model is evaluated on its predictive ability for unseen observations, there is no guarantee that the model closest to the observed data will have the highest predictive accuracy for future data! In fact, more often than not, it will NOT.

Training and Test Error as a Function of Model Complexity

Let us again go back to the multiple regression problem. The fit of a model improves with the complexity of the model; i.e., as more predictors are included in the model, the $R^2$ value is expected to improve. If predictors truly capture the main features behind the data, then they are retained in the model. The trick to building an accurate predictive model is not to overfit the model to the training data.

Overfitting a Model

If a learning technique learns the structure of the training data too well, then when the model is applied to the data on which it was built, it correctly predicts every sample value. In the extreme case, the model admits no error on the training data. In addition to learning the general patterns in the data, the model has also learned the characteristics of each training data point's unique noise. Such a model is said to be over-fit and will usually have poor accuracy when predicting a new sample. (Why?)
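A small simulated example of this behaviour, under illustrative assumptions (a truly linear relationship, ten training points, and a degree-9 polynomial that can pass through every training point):

```python
import numpy as np

rng = np.random.default_rng(3)
x_tr = np.linspace(0, 1, 10)
y_tr = 2 * x_tr + rng.normal(0, 0.2, 10)                   # truth is linear plus noise

coefs = np.polyfit(x_tr, y_tr, deg=9)                       # interpolates the training data

x_new = rng.uniform(0, 1, 100)
y_new = 2 * x_new + rng.normal(0, 0.2, 100)

print("training MSE:", np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2))    # essentially zero
print("test MSE:    ", np.mean((y_new - np.polyval(coefs, x_new)) ** 2))  # much larger
```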

Bias-Variance Trade-off

Since this course deals with multiple linear regression and several other regression methods, let us concentrate on the inherent problem of the bias-variance trade-off in that context. The problem, however, is completely general and is at the core of building a good predictive model.

When the outcome is quantitative (as opposed to qualitative), the most common method for characterizing a model’s predictive capability is the root mean squared error (RMSE). This metric is a function of the model residuals, which are the observed values minus the model predictions. The mean squared error (MSE) is calculated by squaring the residuals and averaging them; the RMSE is its square root. The RMSE is usually interpreted as how far, on average, the residuals are from zero, or as the average distance between the observed values and the model predictions.
If we assume that the data points are statistically independent and that the residuals have a theoretical mean of zero and a constant variance $\sigma^2$, then

\[E[MSE] = \sigma^2 + (\text{Model Bias})^2 + \text{Model Variance}\]

The first term, $\sigma^2$, is the irreducible error and cannot be eliminated by modeling. The second term is the squared bias of the model. It reflects how close the functional form of the model is to the true relationship between the predictors and the outcome; if the true functional form in the population is parabolic and a linear model is used, the model is biased. This is part of the systematic error in the model. The third term is the model variance. It quantifies how strongly the model depends on the particular data points used to create it: if changing a small portion of the data results in a substantial change in the estimated model parameters, the model is said to have high variance.
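The decomposition can be checked by simulation. The sketch below repeatedly redraws a training set from an assumed parabolic truth, refits a straight-line (hence biased) model each time, and compares $\sigma^2 + \text{bias}^2 + \text{variance}$ with the directly estimated expected squared prediction error at a single point $x_0$. Every setting here (the true function, $\sigma$, the sample size, the point $x_0$) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x ** 2                        # true (parabolic) relationship
sigma = 0.5                                 # sd of the irreducible noise
x0 = 0.8                                    # point at which prediction is assessed

preds = []
for _ in range(2000):                       # many hypothetical training sets
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    coefs = np.polyfit(x, y, deg=1)         # a straight-line (biased) model
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias_sq  = (preds.mean() - f(x0)) ** 2      # squared model bias at x0
variance = preds.var()                      # model variance at x0

y0 = f(x0) + rng.normal(0, sigma, 2000)     # fresh responses at x0
mse = np.mean((y0 - preds) ** 2)            # empirical expected squared prediction error

print("sigma^2 + bias^2 + variance:", sigma**2 + bias_sq + variance)
print("empirical expected MSE:     ", mse)
```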

The best learner is the one that can balance the bias and the variance of a model.

A biased model typically has low variance. An extreme example is when a polynomial regression model is estimated by a constant value equal to the sample median. Changing a handful of observations has no effect at all on this flat line; however, the bias of this model is excessively high, and naturally it is not a good model to consider. At the other extreme, suppose a model is constructed so that the regression curve passes through all the data points, or through as many of them as possible. This model has very high variance: if even a single observed value is changed, the model changes.

Thus it is possible that, when an intentional bias is introduced into a regression model, the prediction error becomes smaller than that of an unbiased regression model; ridge regression and the Lasso are examples of this. While a simple model has high bias, increasing model complexity causes the model variance to increase. An ideal predictor is one that learns all the structure in the data but none of the noise. While PE on the training data decreases monotonically with increasing model complexity, the same is not true for the test data. Bias and variance move in opposing directions, and at a suitable bias-variance combination the PE on the test data reaches its minimum. The model that achieves this lowest possible PE is the best prediction model. The following figure is a graphical representation of that fact.

Cross-validation is a comprehensive set of data-splitting techniques that helps to estimate the model complexity at which the test-data PE reaches its minimum.

[Figure: prediction error in the training and test data as a function of model complexity]
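The same pattern can be reproduced numerically. In the sketch below (with an assumed sinusoidal truth and polynomial fits of increasing degree), the training error keeps falling as complexity grows, while the error on new data is smallest at an intermediate degree:

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(0, 1, 40);  y_tr = f(x_tr) + rng.normal(0, 0.3, 40)
x_te = rng.uniform(0, 1, 200); y_te = f(x_te) + rng.normal(0, 0.3, 200)

for deg in [1, 3, 5, 9, 15]:                                 # increasing model complexity
    coefs = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)      # training PE: keeps decreasing
    te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)      # test PE: smallest in between
    print(f"degree {deg:2d}: training PE {tr:.3f}, test PE {te:.3f}")
```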