3  Statistical Learning and Model Selection

Prediction
Cross Validation
Overfitting
K-fold Cross-Validation

Overview

Statistical learning theory was introduced in the late 1960s, but until the 1990s it was treated simply as a problem of function estimation from a given collection of data. In the middle of the 1990s, new types of learning algorithms (e.g., support vector machines) based on the developed theory were proposed. This made statistical learning theory not only a tool for theoretical analysis but also a tool for creating practical algorithms for estimating multidimensional functions.

Statistical learning plays a key role in many areas of science, finance, and industry. A few examples were already considered in Lesson 1. Some more examples of learning problems are:

  • Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.
  • Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.
  • Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person’s blood.
  • Identify the risk factors for prostate cancer, based on clinical and demographic variables.

The science of learning plays a key role in the fields of statistics, data mining, and artificial intelligence, intersecting with areas of engineering and other disciplines. The abstract learning theory of the 1960s established more generalized conditions compared to those discussed in classical statistical paradigms. Understanding these conditions inspired new algorithmic approaches to function estimation problems.

In essence, a statistical learning problem is one of learning from data. In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a Training Set, in which the outcome and feature measurements are observed for a set of objects. Using these data we build a Prediction Model, or a Statistical Learner, which enables us to predict the outcome for new, unseen objects.

A good learner is one that accurately predicts such an outcome.

The examples considered above are all supervised learning problems.

All statistical learning problems may be formulated as the minimization of an expected loss. Mathematically, the problem of learning is that of choosing, from a given set of functions, the one that predicts the response in the best possible way. In order to choose the best available function, a risk functional is minimized in a situation where the joint distribution of the predictors and the response is unknown and the only available information is contained in the training data.
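In symbols, if \(L(Y, f(X))\) denotes the loss incurred when \(f(X)\) is used to predict \(Y\), the quantity to be minimized is the risk functional, of which the training data provide only an empirical counterpart:

\[R(f) = E\left[L(Y, f(X))\right], \qquad R_{emp}(f) = \dfrac{1}{n}\sum_{i=1}^{n} L(Y_i, f(X_i)).\]

Since the joint distribution of \((X, Y)\) is unknown, \(R(f)\) cannot be computed directly; learning procedures work with \(R_{emp}(f)\), or with estimates of \(R(f)\) obtained from held-out data.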

The formulation of the learning problem is quite general. However, two main types of problems are that of

  • Regression Estimation
  • Classification

In the current course only these two are considered. The problem of regression estimation is that of minimizing the risk functional with the squared-error loss function. When the problem is one of classification, the loss function is an indicator function, and the problem becomes that of finding a function that minimizes the misclassification error.
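In the notation above, the two problems correspond to two choices of the loss function:

\[R(f) = E\left[(Y - f(X))^2\right] \quad \text{(regression estimation)}, \qquad R(f) = E\left[I(Y \neq f(X))\right] = P(Y \neq f(X)) \quad \text{(classification)}.\]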

There are several aspects of the model-building process, that is, the process of finding an appropriate learning function. How the data are allocated to tasks such as model building and performance evaluation is an important aspect of modeling. How much data should be allocated to the training and test sets? It generally depends on the situation. If the pool of data is small, the data-splitting decisions can be critical; large data sets reduce the criticality of these decisions.

Before evaluating a model's predictive performance on the test data, quantitative assessment of the model using resampling techniques helps us understand how alternative models are expected to perform on new data. Simple visualization, like a residual plot in the case of a regression, also helps.

It is always a good practice to try out alternative models. There is no single model that will always do better than any other model for all datasets. Because of this, a strong case can be made to try a wide variety of techniques, then determine which model to focus on. Cross-validation, as well as the performance of a model on the test data, help to make the final decision.

Objectives

Upon successful completion of this lesson, you should be able to:


  • Recognize how supervised learning works and review the underlying statistical framework.
  • Recognize that the performance of statistical learning models depends on their prediction capability on independent test data.
  • Get familiar with various methods for model performance assessment.

3.1 Prediction Accuracy

A good learner is one that has good prediction accuracy; in other words, one that has the smallest prediction error.

Let us try to understand the prediction problem intuitively. Consider the simple case of fitting a linear regression model to the observed data. A model is a good fit if it provides a high \(R^{2}\) value. However, note that the model has used all the observed data and only the observed data. Hence, how it will perform when predicting for a new set of input values (a new predictor vector) is not clear. The assumption is that, with a high \(R^{2}\) value, the model is expected to predict well for data observed in the future.

Suppose now the model is more complex than a linear model, and a spline smoother or a polynomial regression needs to be considered. What would be the proper complexity of the model? Would a fifth-degree polynomial be needed, or would a cubic spline suffice? Many modern classification and regression models are highly adaptable and capable of formulating complex relationships. At the same time, they may overemphasize patterns that are not reproducible. Without a methodological approach to evaluating models, the problem will not be detected until the next set of samples is predicted. And here we are not even talking about poor data quality in the sample used to develop the model!

The data at hand is to be used to find the best predictive model. Almost all predictive modeling techniques have tuning parameters that enable the model to flex to find the structure in the data. Hence, we must use the existing data to identify settings for the model's parameters that yield the best and most realistic predictive performance for the future (known as model tuning). Traditionally, this has been achieved by splitting the existing data into training and test sets. The training set is used to build and tune the model, and the test set is used to estimate the model's predictive performance. Modern approaches to model building split the data into multiple training and test sets, which has often been shown to yield better choices of tuning parameters and to give a more accurate representation of the model's predictive performance. More on data splitting is discussed in the next subsection.

Let us consider the general regression problem. The training data,

\[D^{training} = \{(X_i, Y_i),\ i = 1, 2, \ldots, n\}\]

is used to regress Y on X, and then a new response, \(Y_{new}\), is estimated by applying the fitted model to a brand-new set of predictors, \(X_{new}\), from the test set \(D^{test}\). Prediction for \(Y_{new}\) is done by multiplying the new predictor values by the regression coefficients already obtained from the training set.

The resulting prediction is compared with the actual response value.

Prediction Error

The Prediction Error, PE, is defined as the mean squared error in predicting \(Y_{new}\) using \(\hat{f} (X_{new})\).

\(PE = E[(Y_{new} - \hat{f} (X_{new}))^2]\), where the expectation is taken over \((X_{new},Y_{new})\).

We can estimate PE by:

\[\dfrac{1}{n}\sum_{i=1}^{n}\left(Y_{(new)i}-\hat{f}(X_{(new)i}) \right)^2\]

Note that this is not the same quantity as calculated from the training data. The latter is a misleadingly optimistic value because it estimates the predictive ability of the fitted model from the same data that was used to fit that model.
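As a small illustration (a simulated sketch, not part of the Wage example used later in this lesson), the MSE computed on the training data is systematically smaller than the test-set estimate of PE:

## Training MSE vs. test-set estimate of PE (simulated data)
    set.seed(1)
    n <- 200
    x <- runif(n, 0, 10)
    y <- 2 + 0.5*x + rnorm(n, sd = 2)            # true relationship is linear
    train <- sample(n, n/2)                      # half the data used for fitting
    fit <- lm(y ~ poly(x, 5), subset = train)    # a deliberately flexible fit
    train.mse <- mean((y[train] - predict(fit, data.frame(x = x[train])))^2)
    test.mse  <- mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2)
    c(training = train.mse, test = test.mse)     # the training value is typically the smaller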

The dilemma of developing a statistical learning algorithm is clear. The model can be made very accurate based on the observed data. However since the model is evaluated on its predictive ability on unseen observations, there is no guarantee that the closest model to the observed data will have the highest predictive accuracy for future data! In fact, more often than not, it will NOT be.

Training and Test Error as a Function of Model Complexity

Let us again go back to the multiple regression problem. The fit of a model improves with the complexity of the model, i.e. as more predictors are included in the model the \(R^{2}\) value is expected to improve. If predictors truly capture the main features behind the data, then they are retained in the model. The trick to building an accurate predictive model is not to overfit the model to the training data.

Overfitting a Model

If a learning technique learns the structure of the training data too well, then when the model is applied to the data on which it was built, it correctly predicts every sample value. In the extreme case, the model admits no error on the training data. In addition to learning the general patterns in the data, the model has also learned the characteristics of each training data point's unique noise. Such a model is said to be over-fit and will usually have poor accuracy when predicting a new sample. (Why?)
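A minimal simulated sketch of this behavior: a very flexible polynomial fitted to a handful of noisy points reproduces the training observations almost exactly, yet predicts new observations from the same mechanism poorly.

## Overfitting a small noisy sample with a high-degree polynomial (simulated)
    set.seed(10)
    x <- seq(0, 1, length.out = 15)
    y <- sin(2*pi*x) + rnorm(15, sd = 0.3)       # noisy observations of a smooth curve
    fit <- lm(y ~ poly(x, 12))                   # nearly as many parameters as observations
    mean(residuals(fit)^2)                       # training MSE: very small
    x.new <- runif(200)                          # fresh observations from the same mechanism
    y.new <- sin(2*pi*x.new) + rnorm(200, sd = 0.3)
    mean((y.new - predict(fit, data.frame(x = x.new)))^2)   # prediction error: much larger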

Bias-Variance Trade-off

Since this course deals with multiple linear regression and several other regression methods, let us concentrate on the inherent problem of bias-variance trade-off in that context. However, the problem is completely general and is at the core of coming up with a good predictive model.

When the outcome is quantitative (as opposed to qualitative), the most common method for characterizing a model's predictive capability is the root mean squared error (RMSE). This metric is a function of the model residuals, which are the observed values minus the model predictions. The mean squared error (MSE) is calculated by squaring the residuals and averaging them; the RMSE is its square root. The RMSE is usually interpreted either as how far, on average, the residuals are from zero, or as the average distance between the observed values and the model predictions.
If we assume that the data points are statistically independent and that the residuals have a theoretical mean of zero and a constant variance \(\sigma ^2\), then

\[E[MSE] = \sigma^{2} + (\text{Model Bias})^{2} + \text{Model Variance}\]

The first term, \(\sigma ^2\), is the irreducible error and cannot be eliminated by modeling. The second term is the squared bias of the model. This reflects how close the functional form of the model is to the true relationship between the predictors and the outcome. If the true functional form in the population is parabolic and a linear model is used, then the model is a biased model; this is part of the systematic error in the model. The third term is the model variance. It quantifies how strongly the model depends on the particular data points used to create it. If a change in a small portion of the data results in a substantial change in the estimates of the model parameters, the model is said to have high variance.
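A small simulation (a sketch with an assumed parabolic truth, not course data) makes the decomposition concrete: a straight-line fit has low variance but substantial bias, while a high-degree polynomial fit is nearly unbiased but has much higher variance.

## Bias and variance of two fits at a single test point x0 (simulated)
    set.seed(42)
    f <- function(x) 1 + 2*x - 3*x^2             # true (parabolic) regression function
    x0 <- 0.9                                    # test point at which the fits are evaluated
    pred <- sapply(1:500, function(s) {          # refit both models on 500 fresh samples
      x <- runif(50); y <- f(x) + rnorm(50, sd = 0.5)
      c(linear = unname(predict(lm(y ~ x), data.frame(x = x0))),
        poly9  = unname(predict(lm(y ~ poly(x, 9)), data.frame(x = x0))))
    })
    rowMeans(pred) - f(x0)    # estimated bias at x0: substantial for the straight line
    apply(pred, 1, var)       # estimated variance at x0: noticeably larger for the 9th-degree fit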

The best learner is one that balances the bias and the variance of the model.

A biased model typically has low variance. An extreme example is when a regression function is estimated by a constant equal to the sample median: the fitted horizontal line barely moves if a handful of observations are changed, but the bias of this model is excessively high, and it is naturally not a good model to consider. At the other extreme, suppose a model is constructed so that the regression curve passes through all the data points, or through as many of them as possible. This model has very high variance: if even a single observed value is changed, the model changes. Thus it is possible that when intentional bias is introduced into a regression model, the prediction error becomes smaller than that of an unbiased regression model; ridge regression and the lasso are examples of this.

While a simple model has high bias, increasing model complexity causes the model variance to increase. An ideal predictor is one that learns all the structure in the data but none of the noise. As model complexity increases, the PE in the training data decreases monotonically, but the same is not true for the test data. Bias and variance move in opposing directions, and at a suitable bias-variance combination the PE in the test data is at its minimum. The model that achieves this lowest possible PE is the best prediction model. The following figure is a graphical representation of this fact.

Cross-validation is a comprehensive set of data-splitting techniques that helps to estimate the model complexity at which the test PE attains this minimum.

Fig 3.1: Training and test prediction error as a function of model complexity

3.2 Cross Validation

In the previous subsection, we mentioned that cross-validation is a technique to measure the predictive performance of a model. Here we will explain the different methods of cross-validation (CV) and their peculiarities.

Holdout Sample: Training and Test Data

Data is split into two groups. The training set is used to train the learner. The test set is used to estimate the error rate of the trained model. This method has two basic drawbacks. In a sparse data set, one may not have the luxury of setting aside a reasonable portion of the data for testing. And since this is a single repetition of the train-and-test experiment, the error estimate is not stable: if we happen to get a 'bad' split, the estimate is unreliable.

Three-way Split: Training, Validation and Test Data

The available data is partitioned into three sets: training, validation and test set. The prediction model is trained on the training set and is evaluated on the validation set. For example, in the case of a neural network, the training set is used to find the optimal weights with the back-propagation rule. The validation set may be used to find the optimum number of hidden layers or to determine a stopping rule for the back-propagation algorithm. (NN is not covered in this course). Training and validation may be iterated a few times till a ‘best’ model is found. The final model is assessed using the test set.

A typical split is 50% for the training data and 25% each for validation set and test set.
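A minimal sketch of such a split in base R (the data frame and proportions are illustrative; mtcars is used only as a stand-in for the available data):

## Three-way split: 50% training, 25% validation, 25% test
    set.seed(1)
    mydata <- mtcars                             # any data frame; mtcars is only a stand-in
    n <- nrow(mydata)
    idx <- sample(n)                             # random permutation of the row indices
    train <- mydata[idx[1:floor(0.50*n)], ]                  # build/tune candidate models here
    valid <- mydata[idx[(floor(0.50*n)+1):floor(0.75*n)], ]  # compare candidate models here
    test  <- mydata[idx[(floor(0.75*n)+1):n], ]              # assess the final model here, once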

With a three-way split, the model selection and the true error rate computation can be carried out simultaneously. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model. Hence a third independent part of the data, the test data, is required.

After assessing the final model on the test set, the model must not be fine-tuned any further.

Unfortunately, data insufficiency often does not allow a three-way split.

The limitations of the holdout or three-way split can be overcome with a family of resampling methods at the expense of higher computational cost.

Cross-Validation

Among the methods available for estimating prediction error, the most widely used is cross-validation (Stone, 1974). Essentially cross-validation includes techniques to split the sample into multiple training and test datasets.

Random Subsampling

Random subsampling performs K data splits of the entire sample. For each data split, a fixed number of observations is chosen without replacement from the sample and kept aside as the test data. The prediction model is fitted to the training data from scratch for each of the K splits and an estimate of the prediction error is obtained from each test set. Let the estimated PE in the \(i^{th}\) test set be denoted by \(E_i\). The true error estimate is obtained as the average of the separate estimates \(E_i\).

\[\dfrac{1}{K}\sum_{i=1}^{K} E_{i}\]
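A sketch of random subsampling for a simple regression model (the data, model, and split size are illustrative):

## Random subsampling: K random train/test splits, PE estimated from each
    set.seed(1)
    x <- runif(200); y <- 1 + 2*x + rnorm(200, sd = 0.5)
    dat <- data.frame(x, y)
    K <- 25                                      # number of random splits
    E <- rep(NA, K)
    for (i in 1:K){
      test <- sample(nrow(dat), 50)              # 50 rows held out, chosen without replacement
      fit <- lm(y ~ x, data = dat[-test, ])      # model refitted from scratch on each split
      E[i] <- mean((dat$y[test] - predict(fit, dat[test, ]))^2)
    }
    mean(E)                                      # the averaged estimate of the prediction error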

K-fold Cross-Validation

A K-fold partition of the sample space is created. The original sample is randomly partitioned into K equal-sized (or almost equal sized) subsamples. Of the K subsamples, a single subsample is retained as the test set for estimating the PE, and the remaining K-1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the test set. The K error estimates from the folds can then be averaged to produce a single estimation. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
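The procedure can be written out directly in base R; the sketch below (illustrative data and model) follows the description above:

## K-fold cross-validation written out explicitly (simulated data, linear model)
    set.seed(1)
    x <- runif(200); y <- 1 + 2*x + rnorm(200, sd = 0.5)
    dat <- data.frame(x, y)
    K <- 10
    folds <- sample(rep(1:K, length.out = nrow(dat)))   # each row randomly assigned to one fold
    cv.err <- rep(NA, K)
    for (k in 1:K){
      fit <- lm(y ~ x, data = dat[folds != k, ])        # train on the other K-1 folds
      holdout <- dat[folds == k, ]                      # the k-th fold is the test set
      cv.err[k] <- mean((holdout$y - predict(fit, holdout))^2)
    }
    mean(cv.err)                                        # average of the K fold estimates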

For classification problems, one typically uses stratified K-fold cross-validation, in which the folds are selected so that each fold contains roughly the same proportions of class labels.
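One simple way to build stratified folds is to assign fold labels separately within each class; the helper below (the function name and arguments are illustrative) is a sketch of this idea.

## Stratified fold assignment: fold labels drawn separately within each class
    stratified.folds <- function(class.label, K = 10){
      cl <- factor(class.label)
      folds <- integer(length(cl))
      for (lev in levels(cl)){
        idx <- which(cl == lev)
        folds[idx] <- sample(rep(1:K, length.out = length(idx)))
      }
      folds                                      # fold number (1..K) for every observation
    }
    ## table(stratified.folds(cl), cl) would show that each fold is roughly class-balanced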

In repeated cross-validation, the cross-validation procedure is repeated m times, yielding m random partitions of the original sample. The m results are again averaged (or otherwise combined) to produce a single estimation.

A common choice for K is 10.

With a large number of folds (K large), the bias of the true error rate estimator is small but its variance is large; the computation time may also be very long, depending on the complexity of the models under consideration. With a small number of folds, the variance of the estimator is small but the bias is large; the estimate may then exceed the true error rate.

In practice, the choice of the number of folds depends on the size of the data set. For a large data set, a smaller K (e.g., 3) may yield quite accurate results. For sparse data sets, leave-one-out cross-validation (LOO or LOOCV) may need to be used.

Leave-One-Out Cross-Validation

LOO is the degenerate case of K-fold cross-validation where K = n for a sample of size n. That means that n separate times, the prediction function is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross-validation error is good, but sometimes it may be very expensive to compute.
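For linear models fitted by least squares, there is a well-known shortcut: the leave-one-out prediction errors can be recovered from a single fit using the leverage (hat) values, so the n refits are not actually needed. A sketch on simulated data:

## LOOCV for a least-squares fit via the hat-value shortcut (single model fit)
    set.seed(1)
    x <- runif(100); y <- 1 + 2*x + rnorm(100, sd = 0.5)
    fit <- lm(y ~ x)
    loocv.short <- mean(((y - fitted(fit)) / (1 - hatvalues(fit)))^2)
    ## the same quantity by brute force: refit n times, leaving one point out each time
    loocv.brute <- mean(sapply(1:100, function(i){
      f.i <- lm(y ~ x, subset = -i)
      (y[i] - predict(f.i, data.frame(x = x[i])))^2
    }))
    c(shortcut = loocv.short, brute.force = loocv.brute)   # the two values agree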


3.3 An Example

Wage Data Revisited

The Wage data is available in the ISLR package in R. The data was used before, in Lesson 1, where it was noted that there are 3000 observations, of which 2480 are White and the rest Black, Asian, or Other.

The objective is to predict accurately whether a worker will have a raw wage above 100 or not. Of the predictors included in the data, year, age, education, and job class are considered as having an impact on wage. The learning algorithm used for prediction is the K-nearest neighbor (KNN) algorithm, which will be considered later in the course. Briefly, KNN is a simple classifier which classifies a new observation based on a similarity measure computed among its 'nearest neighbors'. The predictors are used to compute the similarity. K can take any value from 1 onwards, depending on the size of the data set. If K = 1, the new observation is assigned the same class as its nearest neighbor. If K > 1, then the most frequent class among the K nearest neighbors is assigned to the new observation.

For K = 1 in the training data, there is always overfitting. As a result, the error in the training data is small, but the test error is expected to be high. For a large K, the prediction is not expected to be accurate, since the bias will be high. For accurate prediction, an optimum value of K is to be determined.

Case I: Holdout sample: Training and Test

Only White data is considered for prediction. Note that Job class is a binary variable and Education is an ordinal variable with 5 levels. Age and Year are assumed continuous.

  • Sample R code for Training and Test Split 50:50
## Training and Test Split 50:50
    library(ISLR)
    library(ggplot2)
    library(class)
    data(Wage)
    WageWhite<-Wage[Wage$race=="1. White",]
    set.seed(679)
    M <- 50
    attach(WageWhite)
    ieducation<-as.integer(WageWhite$education)   # education recoded as an integer score (1-5)
    ijobclass<-as.integer(jobclass)               # job class recoded as an integer (1-2)
    data<-cbind(ijobclass,ieducation,year,age)    # predictor matrix used by KNN
    train <- sample(nrow(WageWhite), nrow(WageWhite)*0.5, replace = FALSE)
    test <- -train
    train.X=data[train,]
    test.X=data[test,]
    train.wage=I(wage>100)[train]
    train.err<-rep(NA,M)
    test.err<-rep(NA,M)
    for (i in 1:M){
      knn.pred=knn(train.X,test.X,train.wage,k=i)        # classify the test data using i nearest neighbors
      test.err[i]<-mean(knn.pred != I(wage[test]>100))   # test error rate for this i
      knn.pred=knn(train.X,train.X,train.wage,k=i)       # classify the training data itself
      train.err[i]<-mean(knn.pred != I(wage[train]>100)) # training error rate for this i
    }
    df <-data.frame(c(rep("Training",M),rep("Test",M)),rep(seq(1:M),2),c(train.err,test.err))
    colnames(df)<-c("Data","K","ErrorRate")
    ggplot(data=df, aes(x=K, y=ErrorRate, group=Data, colour=Data)) +
      geom_line() +
      geom_point() + 
      ggtitle("Training and Test Split 50:50") + theme(plot.title = 
    element_text(color="black", face="bold", size=16))
Training and Test Split 50:50
Fig 3.2: Training and Test Split 50:50

The blue line is the error rate in the training sample. When K = 1, the error rate in the training sample is the lowest. As K increases the error rate increases. After a large value of K, the rate of increase becomes negligible.

When K = 1, the predictive power of the algorithm is the least, which is corroborated by the test error rate shown. As K increases, test error rate decreases, indicating that the predictive power of the algorithm gets better.

Over all the K values considered, the minimum test error is reached at K = 8. It is also clear that for large K the training and test error rates are close: the training error rate increases monotonically while the test error rate has stabilized.

Important Observation: How the training and test data are created has an impact on the training and test error rates. For the figure above, a random seed value of 679 was used; for a different seed value, a different behavior of the training and test errors may be observed. You may verify this yourself with the R code provided.

  • Sample R code for Training and Test Split 90:10
## Training and Test Split 90:10
    library(ISLR)
    library(ggplot2)
    library(class)
    data(Wage)
    WageWhite<-Wage[Wage$race=="1. White",]
    set.seed(679)
    M <- 50
    attach(WageWhite)
    ieducation<-as.integer(WageWhite$education)
    ijobclass<-as.integer(jobclass)
    data<-cbind(ijobclass,ieducation,year,age)
    train <- sample(nrow(WageWhite), nrow(WageWhite)*0.9, replace = FALSE)
    test <- -train
    train.X=data[train,]
    test.X=data[test,]
    train.wage=I(wage>100)[train]
    train.err<-rep(NA,M)
    test.err<-rep(NA,M)
    for (i in 1:M){
      knn.pred=knn(train.X,test.X,train.wage,k=i)
      test.err[i]<-mean(knn.pred != I(wage[test]>100))
      knn.pred=knn(train.X,train.X,train.wage,k=i)
      train.err[i]<-mean(knn.pred != I(wage[train]>100))
    }
    df <-data.frame(c(rep("Training",M),rep("Test",M)),rep(seq(1:M),2),c(train.err,test.err))
    colnames(df)<-c("Data","K","ErrorRate")
    ggplot(data=df, aes(x=K, y=ErrorRate, group=Data, colour=Data)) +
      geom_line() +
      geom_point() + 
      ggtitle("Training and Test Split 90:10") + theme(plot.title = 
    element_text(color="black", face="bold", size=16))
Training and Test Split 90:10
Fig 3.3: Training and Test Split 90:10

The split between the training and test data may also have an impact on the error rate. If the test data is a small proportion of the sample, the error-rate curves are not smooth, even though the minimum test error is attained at a reasonably low value of K (K = 6).

Case II: 10-Fold Cross-validation

Only the White data is considered, along with the same 4 predictors. The objective is to find an optimal K. The data is split into 10 partitions of the sample space. All values of K from 1 to 50 are considered. For each value of K, 9 folds are used as the training data to develop the model and the remaining fold is used as the test data. By rotation, each fold serves as part of the training data and, once, as the test data.

  • Sample R code for 10-Fold Cross-validation
## 10-Fold Cross-validation
## installing dependencies to use with KODAMA

    library(ISLR)
    library(ggplot2)
    library(KODAMA)
    library(class)
    data(Wage)
    WageWhite<-Wage[Wage$race=="1. White",]
    set.seed(679)
    M <- 50
    attach(WageWhite)
    ieducation<-as.integer(WageWhite$education)
    ijobclass<-as.integer(jobclass)
    data<-cbind(ijobclass,ieducation,year,age)
    err<-rep(NA,M)
    for (i in 1:M){
      a<-knn.cv(data,I(wage>100),1:length(I(wage>100)),k=i)
      err[i]<- mean(a!= I(wage>100))
    }
    df<-data.frame(seq(1:M),err)
    colnames(df)<-c("K","ErrorRate")
    ggplot(data=df, aes(x=K, y=ErrorRate)) +
    geom_line() +
    geom_point()+ ggtitle("10-Fold CV")
10-Fold Cross-validation
Fig 3.4: 10-Fold Cross-validation

Case III: LOOCV

At every cycle, one observation is left out and its class membership is predicted using the rest of the observations. This is also a robust method of error estimation, but the computation time may be on the higher side.

  • Sample R code for LOOCV
## LOOCV
    library(ISLR)
    library(ggplot2)
    library(class)
    data(Wage)
    WageWhite<-Wage[Wage$race=="1. White",]
    set.seed(679)
    M <- 50
    attach(WageWhite)
    ieducation<-as.integer(WageWhite$education)
    ijobclass<-as.integer(jobclass)
    data<-cbind(ijobclass,ieducation,year,age)
    set.seed(679)
    err<-rep(NA,M)
    for (i in 1:M){
      a<-knn.cv(data,I(wage>100),k=i)
      err[i]<-mean(a!= I(wage>100))
    }
    df<-data.frame(seq(1:M),err)
    colnames(df)<-c("K","ErrorRate")
    ggplot(data=df, aes(x=K, y=ErrorRate)) +
    geom_line() +
    geom_point()+ ggtitle("LOOCV")
LOOCV
Fig 3.5: LOOCV

The optimum values of K for 10-fold CV and LOOCV are very close: in the former it is 18, while in the latter it is 20. The error distribution over different K values follows a parabolic curve with a single minimum. The error curve is smoother than in the hold-out sample situation. It is true that a unique minimum is reached in the hold-out sample situations as well, but the variability of the error curve there is much higher.

Note also that, depending on the cross-validation mechanism, the minimum error may be achieved at a different value of the tuning parameter.

There is a lot of scope for experimentation here. Different seed values, different proportional splits of the data, and a number of other choices affect the estimation accuracy of PE.