Analysis of Classification Data


In this example of data mining for knowledge discovery, we consider a classification problem in which a large number of objects must be classified based on many attributes. A set of 40 characteristics, or attributes, is measured on 5500 items, which belong to 11 different categories of textures. Textures include a grass lawn, pressed calf leather, handmade paper, cotton canvas, etc. All of the attributes are measured on a continuous scale. The data were obtained from the KEEL dataset repository (https://sci2s.ugr.es/keel/dataset.php?cod=72#sub2).

Objective of the Analysis

Pattern recognition and classification of 5500 objects into 11 classes based on 40 attributes.

Data Files for this case (right-click and "save as"):

Texture.csv - full dataset

TXTrain1.csv
TXTrain2.csv
TXTrain3.csv
TXTrain4.csv
TXTrain5.csv
TXTrain6.csv
TXTrain7.csv
TXTrain8.csv
TXTrain9.csv
TXTrain10.csv
TXTest1.csv
TXTest2.csv
TXTest3.csv
TXTest4.csv
TXTest5.csv
TXTest6.csv
TXTest7.csv
TXTest8.csv
TXTest9.csv
TXTest10.csv

Texture.zip - all data files above together in a .zip file for convenience

Overview of Classification Problem and Cross-Validation

A classification problem may be treated as a special type of regression problem in which, based on the values of the predictors, each observation is placed into one and only one of the categories. For each object i = 1, …, n, the probabilities of belonging to the individual categories sum to 1. Each object generally has a different probability of belonging to each class and is assigned to the class for which this probability is largest.
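The assignment rule described above can be illustrated with a minimal Python sketch. The probability values below are made up for illustration and are not estimated from the texture data; `assign_class` is a hypothetical helper name, not part of any library.

```python
import math

def assign_class(probabilities):
    """Return the class label with the largest estimated probability.

    `probabilities` maps each class label to the estimated probability
    that the object belongs to that class; the values must sum to 1.
    """
    total = sum(probabilities.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"class probabilities sum to {total}, expected 1")
    # Pick the label whose probability is largest (first label wins ties).
    return max(probabilities, key=probabilities.get)

# Illustrative (made-up) probabilities for one object over three classes.
example = {"grass lawn": 0.15, "handmade paper": 0.70, "cotton canvas": 0.15}
predicted = assign_class(example)
print(predicted)
```

In practice the probabilities would come from a fitted classifier, and the same rule would be applied to each of the n objects in turn.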

The performance of a classification rule is measured through its misclassification probability, that is, the probability that an object is assigned to the wrong class. The following classification techniques are applied here:

  • Linear Discriminant Analysis
  • K Nearest Neighbour
  • Classification Tree
  • Random Forest
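As a rough illustration of how a misclassification rate is estimated on held-out data, the sketch below applies one of the listed techniques, k nearest neighbours, to a tiny made-up data set with two attributes and two classes. The data, the choice k = 3, and the helper names `knn_predict` and `misclassification_rate` are assumptions for illustration only; this is not the analysis carried out on the texture data.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to x.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def misclassification_rate(y_true, y_pred):
    """Fraction of objects whose predicted class differs from the true class."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

# Toy training and test sets (made up, not the texture data).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["A", "A", "A", "B", "B", "B"]
test_X = [(0.05, 0.1), (1.0, 0.95), (0.4, 0.4), (0.7, 0.7)]
test_y = ["A", "B", "B", "B"]

pred = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
rate = misclassification_rate(test_y, pred)
print(pred, rate)
```

A cross-validated estimate would repeat this calculation on each training/test pair and average the resulting rates.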