Analysis of Classification Data

In this example of data mining for knowledge discovery we consider a classification problem in which a large number of objects are to be classified on the basis of many attributes. A set of 40 characteristics, or attributes, is measured on 5500 items belonging to 11 categories of texture. The textures include a grass lawn, pressed calf leather, handmade paper, cotton canvas, etc. All of the attributes are measured on a continuous scale. Data are obtained from https://sci2s.ugr.es/keel/dataset.php?cod=72#sub2 [1].

Objective of the Analysis

Pattern recognition and Classification of 5500 objects into 11 classes based on 40 attributes

Data files for this case:

Texture.csv [2] - full dataset

TXTrain1.csv [3]
TXTrain2.csv [4]
TXTrain3.csv [5]
TXTrain4.csv [6]
TXTrain5.csv [7]
TXTrain6.csv [8]
TXTrain7.csv [9]
TXTrain8.csv [10]
TXTrain9.csv [11]
TXTrain10.csv [12]

TXTest1.csv [13]
TXTest2.csv [14]
TXTest3.csv [15]
TXTest4.csv [16]
TXTest5.csv [17]
TXTest6.csv [18]
TXTest7.csv [19]
TXTest8.csv [20]
TXTest9.csv [21]
TXTest10.csv [22]

Texture.zip [23] - all data files above together in a .zip file for convenience

Overview of Classification Problem and Cross-Validation

A classification problem may be treated as a special type of regression problem in which, based on the values of the predictors, each observation is placed into one and only one of the categories. For each object i = 1, …, n, the probabilities of belonging to the individual categories sum to 1 over the categories. These probabilities generally differ from class to class, and the object is assigned to the class with the highest probability.

The performance of a classification rule is measured by its misclassification probability. The following classification techniques are applied here:

Linear Discriminant Analysis
K Nearest Neighbour
Classification Tree
Random Forest
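
The assignment rule and the error measure described above can be sketched in a few lines of R. This is a minimal illustration on made-up numbers, not output from the Texture data: the matrix `post`, the vector `truth` and the three class names are hypothetical.

```r
# Hypothetical class-membership probabilities for 4 objects and 3 classes.
# Each row sums to 1, as required of a probability distribution over classes.
post <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1,
                 0.3, 0.3, 0.4,
                 0.6, 0.3, 0.1),
               nrow = 4, byrow = TRUE,
               dimnames = list(NULL, c("grass", "leather", "paper")))
stopifnot(all(abs(rowSums(post) - 1) < 1e-12))

# Assign each object to the class with the largest probability.
pred <- colnames(post)[max.col(post, ties.method = "first")]

# Hypothetical true classes; the last object is misclassified.
truth <- c("grass", "leather", "paper", "leather")

# Confusion table and misclassification proportion
# (one minus the share of counts on the diagonal).
lev <- colnames(post)
conf_tab <- table(Predicted = factor(pred, lev), Actual = factor(truth, lev))
misclass <- 1 - sum(diag(conf_tab)) / sum(conf_tab)
misclass
```

Each of the four techniques listed above produces predicted classes in its own way, but the misclassification proportion is computed from the confusion table in the same manner.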

CD.1: Exploratory Data Analysis (EDA) and Data Pre-processing

There are 40 predictors in the data. Univariate descriptive statistics and the box-plots are shown below.

R codes for Data Preparation and Exploratory Data Analysis

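library(psych)                       # provides describe() and tr(), used below
Texture <- read.csv("Texture.csv")   # full dataset; assumes a header row and the class label in column 41
TexturePred <- Texture[,-41]         # the 40 predictors
TextureClass <- Texture[,41]         # class labels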
PCTexture <- princomp(TexturePred, cor=T)
screeplot(PCTexture, col=c("royal blue","blue","dark blue","light blue", "purple","grey"))
summary(PCTexture)
par(mfrow=c(1,10), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
# Ten boxplots per page; the outer caption is drawn once after each group
for (i in 1:10) {
  boxplot(TexturePred[,i], pch=19, col="moccasin")
}
mtext("Boxplots of A1 - A10", cex=0.8, side=1, line=0, outer=T)
for (i in 11:20) {
  boxplot(TexturePred[,i], pch=19, col="aliceblue")
}
mtext("Boxplots of A11 - A20", cex=0.8, side=1, line=0, outer=T)
for (i in 21:30) {
  boxplot(TexturePred[,i], pch=19, col="bisque1")
}
mtext("Boxplots of A21 - A30", cex=0.8, side=1, line=0, outer=T)
for (i in 31:40) {
  boxplot(TexturePred[,i], pch=19, col="lightblue1")
}
mtext("Boxplots of A31 - A40", cex=0.8, side=1, line=0, outer=T)

DescTex <- describe(TexturePred)     # psych::describe() returns 13 summary columns
Qm <- matrix(0, nrow(DescTex), 3)    # three extra columns for Q1, Q3 and IQR
DescTex <- cbind(DescTex, Qm)
for (i in 1:nrow(DescTex)){
  DescTex[i,14] <- quantile(TexturePred[,i], 0.25)   # Q1
  DescTex[i,15] <- quantile(TexturePred[,i], 0.75)   # Q3
  DescTex[i,16] <- DescTex[i,15]-DescTex[i,14]       # IQR
}
write.csv(DescTex, "C:/Users/sbasu/Desktop/Stat 897D/DescTex.csv")

######### Outlier rejection ##############

limoutL <- rep(0,40)
limoutU <- rep(0,40)
for (i in 1:40){
t1 <- quantile(TexturePred[,i], 0.75)
t2 <- quantile(TexturePred[,i], 0.25)
t3 <- IQR(TexturePred[,i])
limoutU[i] <- t1 + 3*t3
limoutL[i] <- t2 - 3*t3
}
TxPrdIndexU <- matrix(0, 5500, 40)
TxPrdIndexL <- matrix(0, 5500, 40)
for (i in 1:5500)
for (j in 1:40){
if (TexturePred[i,j] > limoutU[j]) TxPrdIndexU[i,j] <- 1
if (TexturePred[i,j] < limoutL[j]) TxPrdIndexL[i,j] <- 1
}
TxIndU <- apply(TxPrdIndexU, 1, sum)
TxIndL <- apply(TxPrdIndexL, 1, sum)
TxInd <- TxIndU + TxIndL
TxPrdTemp <- cbind(TxInd, TexturePred)

Indexes <- which(TxInd > 0)           # rows with at least one extreme value (128 rows here)
TxPredLib <- TexturePred[-Indexes,]   # predictors of rows within [Q1-3IQR, Q3+3IQR]
TxClassLib <- TextureClass[-Indexes]  # class column for the same rows
############ Correlation #############
TexturePredCor <- cor(TxPredLib)
write.csv(TexturePredCor, "C:/Users/sbasu/Desktop/Stat 897D/CorrTex.csv")

Descriptive statistics

Boxplots

From the univariate summaries above, it is clear that several of the 40 attributes have outliers in varying proportions. To retain as many rows as possible while eliminating the extreme outliers, only those data points (rows) were kept that contain no outlier in any of the 40 predictors, an outlier being defined as a value outside the limits [Q1-3IQR, Q3+3IQR]. This eliminates 128 rows. A stricter approach defines an outlier as a point outside the limits [Q1-1.5IQR, Q3+1.5IQR]; in that case 1060 rows would have been removed.
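
The row-wise outlier rule described above can also be written without explicit loops. The following is a minimal sketch on simulated data, not the Texture data; the matrix `X`, the planted extreme values and the helper `outlier_rows()` are hypothetical names introduced only for illustration.

```r
# Hypothetical data: 3 predictors, 200 rows, with a few planted extreme values.
set.seed(1)
X <- matrix(rnorm(600), ncol = 3)
X[c(5, 50), 1] <- c(25, -30)   # planted extreme values in column 1
X[120, 3] <- 40                # planted extreme value in column 3

# Return the rows that have any value outside [Q1 - k*IQR, Q3 + k*IQR].
outlier_rows <- function(X, k) {
  q1  <- apply(X, 2, quantile, probs = 0.25)
  q3  <- apply(X, 2, quantile, probs = 0.75)
  iqr <- q3 - q1
  below <- sweep(X, 2, q1 - k * iqr, "<")   # TRUE where a value is under the lower limit
  above <- sweep(X, 2, q3 + k * iqr, ">")   # TRUE where a value is over the upper limit
  which(rowSums(below | above) > 0)
}

extreme <- outlier_rows(X, k = 3)     # the rule used in this analysis
mild    <- outlier_rows(X, k = 1.5)   # the stricter alternative
c(k3 = length(extreme), k1.5 = length(mild))

# setdiff() keeps all rows when nothing is flagged, unlike negative indexing
# with an empty index vector.
X_clean <- X[setdiff(seq_len(nrow(X)), extreme), , drop = FALSE]
```

Every row flagged with k = 3 is also flagged with k = 1.5, which is why the stricter rule always removes at least as many rows.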

On inspecting the correlation matrix, a very high degree of dependency was found among the predictor variables. The cells highlighted in red all show a very high degree of dependency, in every case in the positive direction.

CD.2: Principal Components Analysis

R codes for Principal Component Analysis

PCTexture <- prcomp(TxPredLib, scale=T)

print(PCTexture)

summary(PCTexture)

par(mfrow=c(1,1), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)

plot(PCTexture, type="line", col=c("dark blue"), main="", pch=19) ## Scree plot

mtext("Screeplot of Texture", side=1, line=3, cex=0.8)

biplot(PCTexture, pch=19, cex=0.6, col=c("olivedrab1", "blue"))

TextureScore <- PCTexture$x[,1:8]

TextureMasterPCScore <- cbind(TextureScore, TxClassLib)

# Used for model fitting and 10-fold cross-validation

With such a high degree of dependency among the predictors, it is recommended that a PCA be performed on the data and that only the top few components be used for classification.

principal components analysis table

scree plot

From the table and the screeplot above it is clear that it is sufficient to consider only the first 8 PCs. They are given below.

principal components analysis table

For classification, therefore, only the first 8 PCs will be used, instead of all the 40 attributes.
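
How the number of retained components follows from the variance table can be sketched as below. This is an illustration on simulated, strongly correlated data rather than the Texture data, and the 90% threshold is an arbitrary assumption made for the sketch; the analysis above chose 8 PCs from its own table and scree plot.

```r
# Hypothetical data: 6 strongly correlated predictors driven by 2 latent factors.
set.seed(2)
n <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(f1 + 0.1 * rnorm(n), f1 + 0.1 * rnorm(n), f1 + 0.1 * rnorm(n),
           f2 + 0.1 * rnorm(n), f2 + 0.1 * rnorm(n), f2 + 0.1 * rnorm(n))
colnames(X) <- paste0("A", 1:6)

pc <- prcomp(X, scale. = TRUE)   # PCA on standardized predictors

# Proportion and cumulative proportion of variance explained by each PC.
prop_var <- pc$sdev^2 / sum(pc$sdev^2)
cum_var  <- cumsum(prop_var)
round(cum_var, 3)

# Smallest number of PCs whose cumulative proportion reaches the threshold.
n_keep <- which(cum_var >= 0.90)[1]
scores <- pc$x[, seq_len(n_keep), drop = FALSE]   # PC scores used as new predictors
```

Because the six simulated predictors are driven by only two factors, two components carry almost all of the variance, which mirrors on a small scale the reduction from 40 attributes to 8 PCs.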

CD.3: 10-Fold Cross-validation

R codes for Cross-Validation

########### Creating Cross-validation Dataset  ###########################

# Convert to a data frame with a factor response first, so that the
# $ operator used below also works on the Training and Test subsets
TextureMasterPCScore <- as.data.frame(TextureMasterPCScore)
TextureMasterPCScore$TxClassLib <- as.factor(TextureMasterPCScore$TxClassLib)

CVInd <- sample.int(5372, 5372, replace=FALSE) # random permutation of the 5372 rows
CVInds <- CVInd[1:4833]
dim(CVInds) <- c(537,9)       # folds 1 to 9: one column of 537 row indices each
CVInds10 <- CVInd[4834:5372]  # fold 10: the remaining 539 row indices

TXTrain1 <- TextureMasterPCScore[-CVInds[,1],] # Training Set 1
TXTest1  <- TextureMasterPCScore[CVInds[,1],]  # Test Set 1
TXTrain2 <- TextureMasterPCScore[-CVInds[,2],] # Training Set 2
TXTest2  <- TextureMasterPCScore[CVInds[,2],]  # Test Set 2
TXTrain3 <- TextureMasterPCScore[-CVInds[,3],] # Training Set 3
TXTest3  <- TextureMasterPCScore[CVInds[,3],]  # Test Set 3
TXTrain4 <- TextureMasterPCScore[-CVInds[,4],] # Training Set 4
TXTest4  <- TextureMasterPCScore[CVInds[,4],]  # Test Set 4
TXTrain5 <- TextureMasterPCScore[-CVInds[,5],] # Training Set 5
TXTest5  <- TextureMasterPCScore[CVInds[,5],]  # Test Set 5
TXTrain6 <- TextureMasterPCScore[-CVInds[,6],] # Training Set 6
TXTest6  <- TextureMasterPCScore[CVInds[,6],]  # Test Set 6
TXTrain7 <- TextureMasterPCScore[-CVInds[,7],] # Training Set 7
TXTest7  <- TextureMasterPCScore[CVInds[,7],]  # Test Set 7
TXTrain8 <- TextureMasterPCScore[-CVInds[,8],] # Training Set 8
TXTest8  <- TextureMasterPCScore[CVInds[,8],]  # Test Set 8
TXTrain9 <- TextureMasterPCScore[-CVInds[,9],] # Training Set 9
TXTest9  <- TextureMasterPCScore[CVInds[,9],]  # Test Set 9
TXTrain10 <- TextureMasterPCScore[-CVInds10,]  # Training Set 10
TXTest10  <- TextureMasterPCScore[CVInds10,]   # Test Set 10

# Class proportions in the master data and, below, in each Training set
prop.table(table(TextureMasterPCScore$TxClassLib))

prop.table(table(TXTrain1$TxClassLib))

prop.table(table(TXTrain2$TxClassLib))

prop.table(table(TXTrain3$TxClassLib))

prop.table(table(TXTrain4$TxClassLib))

prop.table(table(TXTrain5$TxClassLib))

prop.table(table(TXTrain6$TxClassLib))

prop.table(table(TXTrain7$TxClassLib))

prop.table(table(TXTrain8$TxClassLib))

prop.table(table(TXTrain9$TxClassLib))

prop.table(table(TXTrain10$TxClassLib))

Since the data set is large enough, 10-fold cross-validation is applied to evaluate model performance. After removing the outliers, 5372 observations remain in the master data, and the first 8 principal components are used for prediction. For each observation (row) a score on each PC is computed, and these scores are the predictor values used to evaluate model performance. Hence the master data has 5372 rows (observations) and 9 columns: 8 predictors and one response indicating the class to which each observation belongs.

The data are divided randomly into 10 sets, of which 9 have 537 observations and the last has 539. Each Training set is formed by taking 9 sets at a time and leaving one set out as the Test set. Hence 10 different combinations of Training and Test sets are formed. Each technique is applied to and evaluated on every Training and Test pair. The final evaluation of a technique is the average misclassification probability over the 10 Test sets.
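
The same scheme can be expressed as a loop over folds. The sketch below uses a simulated stand-in for the master data and a deliberately trivial placeholder classifier; `master`, `folds` and `majority_error()` are hypothetical names, and a real run would replace the placeholder with LDA, k-NN, a tree or a random forest.

```r
# Hypothetical master data with the same number of rows as in the text.
set.seed(3)
n <- 5372
k <- 10
master <- data.frame(PC1 = rnorm(n),
                     Class = factor(sample(1:11, n, replace = TRUE)))

# Randomly permute the row indices, then cut them into 10 folds:
# 9 folds of 537 rows and a last fold of 539 rows.
perm    <- sample.int(n)
sizes   <- c(rep(537, 9), 539)
fold_id <- rep(seq_len(k), times = sizes)
folds   <- split(perm, fold_id)

# Placeholder error function: predicts the most frequent Training class for
# every Test row and returns the Test misclassification proportion.
majority_error <- function(train, test) {
  guess <- names(which.max(table(train$Class)))
  mean(as.character(test$Class) != guess)
}

# One Test error per fold; the Training set is everything outside the fold.
test_err <- sapply(folds, function(idx) {
  majority_error(train = master[-idx, ], test = master[idx, ])
})
cv_error <- mean(test_err)   # average misclassification over the 10 Test sets
```

Looping over a list of folds avoids writing out twenty separate Training and Test objects, at the cost of not keeping each split under its own name.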

The following table shows the class proportions in the Master data as well as in the Training data sets. The distribution of the classes is almost identical across the data sets. Moreover, all categories have almost uniform representation.

texture classification table

Linear Discriminant Analysis

R codes for Discriminant Analysis

library(MASS)   # provides lda()

ldafit4 <-lda(TxClassLib ~ ., data=TXTrain4)

# ldafit4

lda.pred4 <- predict(ldafit4, TXTrain4)   # predict() for lda takes newdata, not data

tab4 <- table(lda.pred4$class, TXTrain4$TxClassLib)

propmisTrain4 <- 1 - tr(tab4)/length(lda.pred4$class)

cat("Proportion of Misclassification in Training Set 4:", propmisTrain4)

lda.Testpred4 <- predict(ldafit4, TXTest4)

testtab4 <- table(lda.Testpred4$class, TXTest4$TxClassLib)

propmisTest4 <- 1-tr(testtab4)/length(lda.Testpred4$class)

cat("Proportion of Misclassification in Test Set 4:", propmisTest4)

Since PCs are linear combinations of the original variables, they may also be assumed to follow a multivariate normal distribution whenever the original variables do. For each Training set a linear discriminant function is developed using all 8 PCs. The prior probability distribution for each Training set is very similar, as given in the table above.

Details are given for the Training Data 1:

group means table

coefficients of linear discriminants

proportion of trace

Results from other Training sets are also very similar and are not shown here. In the following table misclassification probabilities in Training and Test sets created for the 10-fold cross-validation are shown.

mis-classification probability

Therefore the overall misclassification probability of the 10-fold cross-validation is 2.55%, which is the mean misclassification probability over the Test sets.

Note that for Sets 5, 7, 8 and 9 the misclassification probability in the Test set is lower than in the corresponding Training set. This may seem counter-intuitive; however, several points should be noted. The Training set is much larger than the Test set. With 11 classes in each Test set, a class sometimes has fewer than 40 observations. The standard error of the estimated misclassification probability is therefore relatively high in the Test sets, which can produce such apparently counter-intuitive results. The average error over the Training sets is 2.54%.
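
The size of the standard error mentioned above can be gauged with a rough calculation. The sketch below assumes a simple binomial model in which observations are misclassified independently with a common probability; this is an assumption made for illustration, not part of the original analysis. The function name `se_prop` is hypothetical, and the error rates and set sizes are those quoted in the text.

```r
# Binomial standard error of an estimated misclassification proportion.
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

se_test  <- se_prop(0.0255, 537)    # Test set: 537 rows, mean Test error 2.55%
se_train <- se_prop(0.0254, 4835)   # Training set: 5372 - 537 = 4835 rows

# Standard errors expressed in percentage points.
round(100 * c(test = se_test, train = se_train), 2)
```

Under this rough model the Test error has a standard error of about 0.7 percentage points, roughly three times that of the Training error, so a Test error slightly below the Training error in some folds is well within sampling variation.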

Overall results indicate accurate and stable classification rules.

CD.4: K Nearest Neighbour

For this method k = 7 is used, i.e. the 7 nearest neighbours are used to predict the class membership of each observation in the Test set. The misclassification error rates for the Test sets are as follows.

mis-classification rate for test set

Overall error rate is 3.0%. This value is comparable to the LDA error rate.
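
No code is listed for this step, so the following is only a minimal sketch of how a k-nearest-neighbour rule with k = 7 works, written in base R and run on the built-in iris data rather than the Texture PC scores. The function `knn_predict()` is a hypothetical helper; in practice a packaged implementation such as knn() in the class package would normally be used, and nothing here is claimed to be the code behind the results above.

```r
# Minimal k-nearest-neighbour classifier: Euclidean distance and majority vote.
knn_predict <- function(train_x, train_y, test_x, k = 7) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  train_y <- factor(train_y)
  pred <- apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))    # distance to every training row
    nearest <- order(d)[seq_len(k)]             # indices of the k closest rows
    names(which.max(table(train_y[nearest])))   # most frequent class; ties go to the first level
  })
  factor(pred, levels = levels(train_y))
}

# Illustration on the built-in iris data (not the Texture data).
set.seed(4)
test_idx <- sample(nrow(iris), 30)
pred <- knn_predict(iris[-test_idx, 1:4], iris$Species[-test_idx],
                    iris[test_idx, 1:4], k = 7)

# Test misclassification proportion.
knn_err <- mean(as.character(pred) != as.character(iris$Species[test_idx]))
knn_err
```

Because the rule depends on distances, the scale of the predictors matters; PC scores from a standardized PCA, as used in this case study, are one way of putting the predictors on a comparable footing.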

CD.5: Decision Tree

R codes for Tree Based Algorithms


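library(tree)   # provides tree(), misclass.tree() and prune.misclass()

TXTrain1$TxClassLib <- as.factor(TXTrain1$TxClassLib)

# A factor response makes tree() fit a classification tree;
# method="class" is an rpart argument and is not needed here
TXTrainTree1 <- tree(TxClassLib ~ ., data=TXTrain1)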

plot(TXTrainTree1, col="dark red")

text(TXTrainTree1, pretty=0, cex=0.6, col="dark red")

mtext("Decision Tree (Unpruned) for Training Set 1", side=3, line = 2, cex=0.8, col="dark red")

m <- misclass.tree(TXTrainTree1)

propmisTrain1 <- m / length(TXTrainTree1$y)

cat("Proportion of Misclassification in Training Set 1:", propmisTrain1)

TXTest1Treefit1 <- predict(TXTrainTree1, TXTest1, type="class")

Tab1 <- table(TXTest1Treefit1, TXTest1$TxClassLib)

propmisTest1 <- 1-tr(Tab1)/length(TXTest1Treefit1)

cat("Proportion of Misclassification in Test Set 1 =", propmisTest1)

TXTrainPruneTree1 <- prune.misclass(TXTrainTree1, best=20)

m <- misclass.tree(TXTrainPruneTree1)

m / length(TXTrainPruneTree1$y)

plot(TXTrainPruneTree1, col="dark red")

text(TXTrainPruneTree1, pretty=0, cex=0.6, col="dark red")

mtext("Decision Tree for Training Set 1", side=3, line = 2, cex=0.8, col="dark red")

TXTest1PruneTreefit1 <- predict(TXTrainPruneTree1, TXTest1, type="class")

Tab1 <- table(TXTest1PruneTreefit1, TXTest1$TxClassLib)

propmisTest1 <- 1-tr(Tab1)/length(TXTest1PruneTreefit1)

cat("Proportion of Misclassification in Test Set 1 =", propmisTest1)

################### Random Forest ###################

library(randomForest)

# x = the 8 PC scores (columns 1 to 8), y = the class factor (column 9)
TXTrainRF1 <- randomForest(TXTrain1[,1:8],TXTrain1[,9], ntree=100, importance=T, proximity=T)

# TXTrainRF1 <- randomForest(TXTrain1[,1:8],TXTrain1[,9], xtest=TXTest1[,1:8], ytest=TXTest1[,9], ntree=100, importance=T, proximity=T)

plot(TXTrainRF1, main="OOB Error Rate: Set 1", cex=0.4)

TXTrainRF1

varImpPlot(TXTrainRF1,  pch=19, col="dark red", main="Variable Importance: Set 1", cex=0.8)

The tree algorithm is applied to all Training sets, and the misclassification probability is calculated for both the Training and Test sets. All the Training sets give rise to very similar decision trees. Three representative trees are shown below as examples.

decision tree

Following table summarizes the misclassification probabilities for Tree classification

mis-classification probability for tree

Therefore overall mis-classification probability of the 10-fold cross-validation is 17.9%, which is the mean mis-classification probability of the Test sets.

Pruning was tried for this decision tree, but it did not improve the result.

At first glance this high error rate compared to k-NN and LDA looks surprising. LDA uses a single linear classifier over the whole sample space, whereas the Tree procedure recursively partitions the sample space to reduce misclassification error. One might therefore expect the Tree procedure to give better results than LDA.

However, LDA takes into account linear combinations of the predictors, whereas a Tree always divides the sample space with splits parallel to the axes. If the separation between classes lies along any other direction, a Tree cannot capture it with a single split and must approximate it with many axis-parallel steps. This is exactly what is happening here.
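
The difference between axis-parallel splits and a split on a linear combination can be seen in a small constructed example. The data below are hypothetical: two classes on a regular grid, separated exactly by the line x1 + x2 = 0. The helper `best_stump_error()` is introduced only for this sketch and considers a single split, which is a simplification of a full tree.

```r
# Hypothetical two-class data on a 20 x 20 grid; the true boundary is x1 + x2 = 0.
g <- seq(-9.5, 9.5, by = 1)
d <- expand.grid(x1 = g, x2 = g)
d$class <- ifelse(d$x1 + d$x2 > 0, "A", "B")

# Error of the best single split on one variable: every observed value is
# tried as a threshold, with either side allowed to be labelled "A".
best_stump_error <- function(x, class) {
  cuts <- sort(unique(x))
  errs <- sapply(cuts, function(cut) {
    e <- mean(ifelse(x > cut, "A", "B") != class)
    min(e, 1 - e)
  })
  min(errs)
}

err_x1      <- best_stump_error(d$x1, d$class)          # split on x1 alone
err_x2      <- best_stump_error(d$x2, d$class)          # split on x2 alone
err_oblique <- best_stump_error(d$x1 + d$x2, d$class)   # split on a linear combination
c(x1 = err_x1, x2 = err_x2, oblique = err_oblique)
```

A single split on either axis misclassifies a quarter of the points, while one split on x1 + x2 separates the classes perfectly. A deeper tree would reduce the error by building a staircase of axis-parallel splits, but it needs many splits to do what one linear combination achieves.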

CD.6: Random Forest

The high misclassification error rate of the single tree is corrected to a large extent by using Random Forest. The random forest method is applied to each Training set, and both the Out-of-Bag error rate and the Test error rate are calculated for each of the 11 categories. Variable importance plots are also shown. All results are for ntree = 100. The number of variables tried at each split is 2.
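
Two of the quantities mentioned above can be illustrated with simple arithmetic. The sketch assumes that the forest uses the common default of floor(sqrt(p)) candidate variables per split for classification, which is consistent with the value of 2 reported above for 8 predictors, and it uses the Training set size 5372 - 537 = 4835. The variable names are hypothetical.

```r
p <- 8                             # number of predictors (the 8 PCs)
mtry_default <- floor(sqrt(p))     # candidate variables per split under the assumed default

n <- 4835                          # Training set size for Sets 1 to 9
oob_expected <- (1 - 1 / n)^n      # chance that a given row is left out of one bootstrap sample

# One simulated bootstrap sample of row indices and its Out-of-Bag share.
set.seed(5)
boot <- sample.int(n, n, replace = TRUE)
oob_observed <- 1 - length(unique(boot)) / n

c(mtry = mtry_default, expected = oob_expected, observed = oob_observed)
```

Roughly 37% of the Training rows are left out of each tree's bootstrap sample, and the Out-of-Bag error is computed by predicting each row only with the trees that did not see it.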

random forest error rate

The convergence of the errors is shown for two Sets only, since the behaviour of the error is very similar across all 10 Sets to which the random forest technique was applied. For Set 1 both the Out-of-Bag error rate and the Test error rate are shown.

oob error rate plot

However, there are slight differences in the variable importance plots. All the plots are shown below. It is clear from the plots below that PC1, PC5 and PC6 are the primary influential variables, but they appear in different order in different cross-validation sets.

variable importance - plots

CD.7: Conclusion

All 11 categories can be separated by the various classification methods, most of which give similar misclassification error rates. The single Classification Tree shows less than optimal performance, but this can be improved by applying Random Forest or an Oblique Tree method.