CD.3: 10-Fold Cross-validation

  Sample R code for Cross-Validation

########### Creating Cross-validation Dataset  ###########################

CVInd <- sample(1:5372, 5372, replace=FALSE) # random permutation of row indices
CVInds <- CVInd[1:4833]
dim(CVInds) <- c(537,9)        # first 9 folds: 537 observations each
CVInds10 <- CVInd[4834:5372]   # 10th fold: remaining 539 observations
TXTrain1  <- TextureMasterPCScore[-CVInds[,1],] # Training Set 1
TXTest1   <- TextureMasterPCScore[CVInds[,1],]  # Test Set 1
TXTrain2  <- TextureMasterPCScore[-CVInds[,2],] # Training Set 2
TXTest2   <- TextureMasterPCScore[CVInds[,2],]  # Test Set 2
TXTrain3  <- TextureMasterPCScore[-CVInds[,3],] # Training Set 3
TXTest3   <- TextureMasterPCScore[CVInds[,3],]  # Test Set 3
TXTrain4  <- TextureMasterPCScore[-CVInds[,4],] # Training Set 4
TXTest4   <- TextureMasterPCScore[CVInds[,4],]  # Test Set 4
TXTrain5  <- TextureMasterPCScore[-CVInds[,5],] # Training Set 5
TXTest5   <- TextureMasterPCScore[CVInds[,5],]  # Test Set 5
TXTrain6  <- TextureMasterPCScore[-CVInds[,6],] # Training Set 6
TXTest6   <- TextureMasterPCScore[CVInds[,6],]  # Test Set 6
TXTrain7  <- TextureMasterPCScore[-CVInds[,7],] # Training Set 7
TXTest7   <- TextureMasterPCScore[CVInds[,7],]  # Test Set 7
TXTrain8  <- TextureMasterPCScore[-CVInds[,8],] # Training Set 8
TXTest8   <- TextureMasterPCScore[CVInds[,8],]  # Test Set 8
TXTrain9  <- TextureMasterPCScore[-CVInds[,9],] # Training Set 9
TXTest9   <- TextureMasterPCScore[CVInds[,9],]  # Test Set 9
TXTrain10 <- TextureMasterPCScore[-CVInds10,]   # Training Set 10
TXTest10  <- TextureMasterPCScore[CVInds10,]    # Test Set 10
# TextureMasterPCScore (5372 rows, 9 columns) is assumed to be loaded beforehand
TextureMasterPCScore$TxClassLib <- as.factor(TextureMasterPCScore$TxClassLib)

Since the data set is large enough, 10-fold cross-validation is applied to evaluate model performance. After removing the outliers, 5372 observations remain in the master data, and the first 8 principal components are used for prediction. For each observation (row), a score corresponding to each PC is computed; these scores are the predictor values used to evaluate model performance. The master data therefore has 5372 rows (observations) and 9 columns: 8 predictors plus one response indicating the class each observation belongs to.
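The PC-score construction can be sketched as follows. This is a minimal illustration on simulated stand-in data, since the original texture features are not reproduced here; the object names and dimensions are hypothetical:

```r
# Building a PC-score data set (simulated stand-in for the texture data)
set.seed(1)
n <- 200; p <- 20                          # hypothetical sizes, not the real ones
X <- matrix(rnorm(n * p), n, p)            # stand-in for the texture features
cls <- factor(sample(letters[1:4], n, replace = TRUE))  # stand-in class labels

pc <- prcomp(X, scale. = TRUE)             # principal component analysis
scores <- pc$x[, 1:8]                      # scores on the first 8 PCs

# master data: 8 PC-score predictors plus the class response (9 columns)
PCScore <- data.frame(scores, TxClassLib = cls)
dim(PCScore)
```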

The data are randomly divided into 10 sets, of which 9 contain 537 observations each and the last contains 539. Each Training set is formed by taking 9 of the sets together, leaving the remaining set out as the Test set; this yields 10 different Training/Test combinations. A technique is applied and evaluated on each Training/Test pair, and the final evaluation of the technique is the average misclassification probability over the 10 Test sets.
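The fold construction and evaluation loop described above can be sketched compactly. The fit and predict steps are left as commented placeholders, since they depend on the technique being evaluated:

```r
# Loop form of the 10-fold split: 9 folds of 537 and one fold of 539
set.seed(1)
n <- 5372
CVInd <- sample(1:n, n, replace = FALSE)
folds <- split(CVInd, rep(1:10, c(rep(537, 9), 539)))

testerr <- numeric(10)
for (k in 1:10) {
  testidx <- folds[[k]]
  # train <- PCScore[-testidx, ]; test <- PCScore[testidx, ]   # hypothetical names
  # fit   <- ...fit the technique on train...
  # testerr[k] <- ...misclassification proportion on test...
}
sapply(folds, length)   # fold sizes: 537 (x9) and 539
```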

The following table shows the class proportions in the Master data as well as in the Training data sets. The distribution of the classes is almost identical across the data sets, and all categories have almost uniform representation.

texture classification table
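A proportion table like the one above can be produced with `table` and `prop.table`. The sketch below uses simulated labels, since the real response is not reproduced here; `TxClassLib` and the training-set size of 4835 (5372 − 537) stand in for the actual objects:

```r
# Class proportions in the master data versus one training set (simulated labels)
set.seed(2)
TxClassLib <- factor(sample(paste0("class", 1:11), 5372, replace = TRUE))
trainidx <- sample(1:5372, 4835)      # one hypothetical training set

round(prop.table(table(TxClassLib)), 3)             # master data proportions
round(prop.table(table(TxClassLib[trainidx])), 3)   # training-set proportions
```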

Linear Discriminant Analysis

  Sample R code for Discriminant Analysis

library(MASS)   # provides lda()
ldafit4 <- lda(TxClassLib ~ ., data=TXTrain4)
# ldafit4
lda.pred4 <- predict(ldafit4, TXTrain4)          # predictions on Training Set 4
tab4 <- table(lda.pred4$class, TXTrain4$TxClassLib)
propmisTrain4 <- 1 - sum(diag(tab4))/length(lda.pred4$class)
cat("Proportion of Misclassification in Training Set 4:", propmisTrain4, "\n")
lda.Testpred4 <- predict(ldafit4, TXTest4)       # predictions on Test Set 4
testtab4 <- table(lda.Testpred4$class, TXTest4$TxClassLib)
propmisTest4 <- 1 - sum(diag(testtab4))/length(lda.Testpred4$class)
cat("Proportion of Misclassification in Test Set 4:", propmisTest4, "\n")

Since the PCs are linear combinations of the original variables, they may also be assumed to follow a multivariate normal distribution. For each Training set, a linear discriminant function is developed using all 8 PCs. The prior probability distributions of the Training sets are very similar, as given in the table above.
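The fit/predict/misclassification pattern above can be checked end-to-end on a built-in data set. Here the `iris` data stands in for the texture PC scores; note `fit$prior` gives the estimated prior probabilities mentioned above:

```r
library(MASS)   # lda(); MASS ships with standard R installations

# Same pattern as above, demonstrated on the built-in iris data
set.seed(3)
testidx <- sample(1:nrow(iris), 30)
train <- iris[-testidx, ]
test  <- iris[testidx, ]

fit  <- lda(Species ~ ., data = train)
fit$prior                                   # estimated prior probabilities
pred <- predict(fit, test)$class
tab  <- table(pred, test$Species)
1 - sum(diag(tab)) / nrow(test)             # test misclassification proportion
```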

Details are given for the Training Data 1:

group means table

coefficients of linear discriminants

proportion of trace

Results from the other Training sets are very similar and are not shown here. The following table shows the misclassification probabilities in the Training and Test sets created for the 10-fold cross-validation.

mis-classification probability

Therefore the overall misclassification probability of the 10-fold cross-validation is 2.55%, the mean misclassification probability over the Test sets.

Note that for Sets 5, 7, 8 and 9 the misclassification probability in the Test set is lower than in the corresponding Training set. This may seem fallacious; however, several points should be noted. Each Training set is much larger than its Test set, and with 11 classes in a Test set of about 537 observations, some classes have fewer than 40 observations. This makes the standard error of the estimated misclassification probability relatively high, which in turn can produce such apparently counter-intuitive results. The average Training-set error is 2.54%.
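The standard-error point can be made concrete. Treating each classification as a Bernoulli trial, the standard error of an estimated misclassification probability p based on n cases is sqrt(p(1 − p)/n), so the estimate from a 537-observation Test set is noticeably less precise than one from a 4835-observation Training set:

```r
# Binomial standard error of an estimated misclassification probability
se <- function(p, n) sqrt(p * (1 - p) / n)
se(0.0255, 537)    # Test set: 537 observations
se(0.0255, 4835)   # Training set: 4835 observations
```

The Test-set standard error is about three times the Training-set one, so Test estimates occasionally falling below the Training estimate is unsurprising.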

Overall, the results indicate accurate and stable classification rules.