Analysis of Classification Data
In this example of data mining for knowledge discovery, we consider a classification problem in which a large number of objects are to be classified based on many attributes. A set of 40 characteristics, or attributes, is measured on 5500 items that belong to 11 different categories of texture. Textures include a grass lawn, pressed calf leather, handmade paper, cotton canvas, etc. All of the attributes are measured on a continuous scale. The data are obtained from the KEEL dataset repository (https://sci2s.ugr.es/keel/dataset.php?cod=72#sub2).
Objective of the Analysis
Pattern recognition and classification of 5500 objects into 11 classes based on 40 attributes.
Data files for this case (right-click and "save as"):
- Texture.csv - full dataset
- Texture.zip - all data files above together in a .zip file for convenience
Overview of Classification Problem and Cross-Validation
A classification problem may be treated as a special type of regression problem in which, based on the values of the predictors, each observation is placed into one and only one of the categories. For the ith object, i = 1, …, n, the probabilities of belonging to each of the J categories sum to 1. These probabilities generally differ from class to class, and the object is assigned to the class for which its probability is largest.
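The assignment rule described above can be illustrated with a minimal sketch. The probability values and the helper name assign_class are hypothetical and chosen only for illustration; they are not taken from the texture data.

```python
# Hypothetical class probabilities for 3 objects over 4 classes
# (illustrative numbers, not estimated from the texture data).
posteriors = [
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.05, 0.45],
]


def assign_class(probs):
    """Return the index of the class with the largest probability."""
    return max(range(len(probs)), key=lambda j: probs[j])


# For every object, the probabilities over the classes must sum to 1.
for row in posteriors:
    assert abs(sum(row) - 1.0) < 1e-9

# Each object goes to the class that maximizes its probability.
labels = [assign_class(row) for row in posteriors]
print(labels)  # [1, 0, 3]
```

In practice the probabilities would come from a fitted classifier; only the final step, picking the most probable class, is shown here.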
The performance of a classification rule is measured by its misclassification probability. The following classification techniques are applied here:
- Linear Discriminant Analysis
- K Nearest Neighbour
- Classification Tree
- Random Forest
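As a rough illustration of how a misclassification probability might be estimated by cross-validation, the sketch below applies a small k-nearest-neighbour classifier to synthetic data. Everything in it is an assumption made for the example: the function names knn_predict and cv_misclassification_rate, the choice of 5 folds and k = 3, and the two-class simulated data standing in for the texture measurements. It is not the procedure used in the analysis itself.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Squared Euclidean distance from x to every training point.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_misclassification_rate(X, y, n_folds=5, k=3, seed=0):
    """Estimate the misclassification rate by n_folds-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into n_folds roughly equal parts.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        # Predict each held-out object using only the remaining folds.
        for i in fold:
            if knn_predict(train_X, train_y, X[i], k) != y[i]:
                errors += 1
    # Every object is held out exactly once, so divide by the total count.
    return errors / len(X)


# Synthetic stand-in data: two well-separated classes in two dimensions.
rng = random.Random(1)
X = [[rng.gauss(c * 5.0, 1.0), rng.gauss(c * 5.0, 1.0)]
     for c in (0, 1) for _ in range(30)]
y = [c for c in (0, 1) for _ in range(30)]

rate = cv_misclassification_rate(X, y)
print(f"estimated misclassification rate: {rate:.3f}")
```

Each object is classified only by a rule built without it, so the resulting error rate is a less optimistic estimate than one computed on the training data. The same loop could wrap any of the other classifiers listed above in place of knn_predict.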