2.3 - An Example

Wage Data Revisited Section

Wage data is available in the ISLR package in R. The data has been used before in Lesson 1 and it was noted that there are 3000 observations of which 2480 are White and the rest Black, Asian or Other.

The objective is to predict accurately whether a worker will have raw wage above 100 or not. Of the predictors included in the data year, age, education, and job class are considered as having an impact on the wage. The learning algorithm used for prediction is the K-nearest neighbor algorithm which will be considered later in the course. Briefly, KNN is a simple classifier which classifies a new observation based on similarity measure computed amongst 'nearest neighbors'. The predictors are used to compute the similarity. K can take any value from 1 onwards, depending on the size of the data set. If K = 1, the new observation is classified as having the same class as the nearest neighbor. If K > 1, then the most frequent class among the nearest neighbors is assigned to the new observation.

For K = 1 in the training data, there is always overfitting. As a result error in the training data is small, but the test error is expected to be high. For a large K, the prediction is not expected to be accurate, since the bias will be high. For an accurate prediction, an optimum value of K is to be determined.

Case I: Holdout sample: Training and Test

Only White data is considered for prediction. Note that Job class is a binary variable and Education is an ordinal variable with 5 levels. Age and Year are assumed continuous.

  Sample R code for Training and Test Split 50:50

## Training and Test Split 50:50
library(ISLR)
library(ggplot2)
library(class)
data(Wage)
WageWhite<-Wage[Wage$race=="1. White",]
set.seed(679)
M <- 50
attach(WageWhite)
ieducation<-as.integer(WageWhite$education)
ijobclass<-as.integer(jobclass)
data<-cbind(ijobclass,ieducation,year,age)
train <- sample(nrow(WageWhite), nrow(WageWhite)*0.5, replace = FALSE)
test <- -train
train.X=data[train,]
test.X=data[test,]
train.wage=I(wage>100)[train]
train.err<-rep(NA,M)
test.err<-rep(NA,M)
for (i in 1:M){
  knn.pred=knn(train.X,test.X,train.wage,k=i)
  test.err[i]<-mean(knn.pred != I(wage[test]>100))
  
  knn.pred=knn(train.X,train.X,train.wage,k=i)
  train.err[i]<-mean(knn.pred != I(wage[train]>100))
}

df <-data.frame(c(rep("Training",M),rep("Test",M)),rep(seq(1:M),2),c(train.err,test.err))
colnames(df)<-c("Data","K","ErrorRate")

ggplot(data=df, aes(x=K, y=ErrorRate, group=Data, colour=Data)) +
  geom_line() +
  geom_point() + 
  ggtitle("Training and Test Split 50:50") + theme(plot.title = 
element_text(color="black", face="bold", size=16))

training and test split: 50:50

The blue line is the error rate in the training sample. When K = 1, the error rate in the training sample is the lowest. As K increases the error rate increases. After a large value of K, the rate of increase becomes negligible.

When K = 1, the predictive power of the algorithm is the least, which is corroborated by the test error rate shown. As K increases, test error rate decreases, indicating that the predictive power of the algorithm gets better.

For all the K values considered, a minimum is reached at K = 8.  It is also clear that for large K, the training and test error rates are close, the training error rate is monotonically increasing but the test error rate has stabilized.

Important Observation: The training and test data creation has an impact on the training and test error rates. For the figure above a random seed value, 679 was used. However, for a different seed value, a different behavior of training and test errors is noted. In the R codes provided, you may verify this yourself.

  Sample R code for Training and Test Split 90:10

## Training and Test Split 90:10
library(ISLR)
library(ggplot2)
library(class)
data(Wage)
WageWhite<-Wage[Wage$race=="1. White",]
set.seed(679)
M <- 50
attach(WageWhite)
ieducation<-as.integer(WageWhite$education)
ijobclass<-as.integer(jobclass)
data<-cbind(ijobclass,ieducation,year,age)
train <- sample(nrow(WageWhite), nrow(WageWhite)*0.9, replace = FALSE)
test <- -train
train.X=data[train,]
test.X=data[test,]
train.wage=I(wage>100)[train]
train.err<-rep(NA,M)
test.err<-rep(NA,M)
for (i in 1:M){
  knn.pred=knn(train.X,test.X,train.wage,k=i)
  test.err[i]<-mean(knn.pred != I(wage[test]>100))
  
  knn.pred=knn(train.X,train.X,train.wage,k=i)
  train.err[i]<-mean(knn.pred != I(wage[train]>100))
}

df <-data.frame(c(rep("Training",M),rep("Test",M)),rep(seq(1:M),2),c(train.err,test.err))
colnames(df)<-c("Data","K","ErrorRate")

ggplot(data=df, aes(x=K, y=ErrorRate, group=Data, colour=Data)) +
  geom_line() +
  geom_point() + 
  ggtitle("Training and Test Split 90:10") + theme(plot.title = 
element_text(color="black", face="bold", size=16))

training and test split: 90:10

The split between training and test data also may have an impact on the error rate. If the test data is a small proportion of the sample, the error rates will not be smooth, even though the minimum test error is attained at a reasonable lower value of K (K = 6).

Case II: 10-Fold Cross-validation

Only White data is considered along with the same 4 predictors. The objective is to find an optimal K. The data is split into 10 partitions of the sample space. All values of K from 1 to 50 is considered. For each value of K, 9 folds are used as the training data to develop the model and the residual part is considered as the test data. By rotation, each fold is considered as part of training data and test data.

  Sample R code for 10-Fold Cross-validation

## 10-Fold Cross-validation
install.packages(c("plsgenomics", "e1071", "class", "spam", "maps"))  ## installing dependencies to use with KODAMA
install.packages( "https://cran.r-project.org/src/contrib/Archive/KODAMA/KODAMA_0.0.1.tar.gz", repos=NULL, type='source') ## installing KODAMA v.0.0.1 from source files
library(ISLR)
library(ggplot2)
library(KODAMA)
data(Wage)
WageWhite<-Wage[Wage$race=="1. White",]
set.seed(679)
M <- 50
attach(WageWhite)
ieducation<-as.integer(WageWhite$education)
ijobclass<-as.integer(jobclass)
data<-cbind(ijobclass,ieducation,year,age)
err<-rep(NA,M)
for (i in 1:M){
a<-KNN.CV(data,I(wage>100),1:length(I(wage>100)),kn=i)
err[i]<- mean(a!= I(wage>100))
}
df<-data.frame(seq(1:M),err)
colnames(df)<-c("K","ErrorRate")
ggplot(data=df, aes(x=K, y=ErrorRate)) +
geom_line() +
geom_point()+ ggtitle("10-Fold CV")

 

10-Fold CV

Case III: LOOCV

At every cycle, one observation is left out and predicted for class membership using the rest of the observations. This method is also a robust method of error prediction but computation time might be on the higher side.

  Sample R code for LOOCV

## LOOCV
library(ISLR)
library(ggplot2)
library(class)
data(Wage)
WageWhite<-Wage[Wage$race=="1. White",]
set.seed(679)
M <- 50
attach(WageWhite)
ieducation<-as.integer(WageWhite$education)
ijobclass<-as.integer(jobclass)
data<-cbind(ijobclass,ieducation,year,age)
set.seed(679)
err<-rep(NA,M)
for (i in 1:M){
a<-knn.cv(data,I(wage>100),k=i)
err[i]<-mean(a!= I(wage>100))
}
df<-data.frame(seq(1:M),err)
colnames(df)<-c("K","ErrorRate")
ggplot(data=df, aes(x=K, y=ErrorRate)) +
geom_line() +
geom_point()+ ggtitle("LOOCV")

LOOCV

The optimum values of K for 10-fold CV as well as LOOCV are very close. In the former, it is 18 while in the latter the value is 20. The error distribution for different K values follows a parabolic curve with a single minimum. The error curve is smoother compared to the hold-out sample situation. It is true that a unique minimum is reached for hold-out sample situations, but the variability of the error curve is much higher.

Also to note that depending on the cross-validation mechanism, the minimum error may be achieved at a different value of the tuning parameter.

There is a lot of scope of experimenting here. Different seed values, different proportional split in the data and a number of other criteria impact the estimation accuracy of PE.