Lesson 14: Classification
|Key Learning Goals for this Lesson:|
We have already discussed clustering, which is a means of finding patterns in our data to find sets of similar features or of similar samples. Often, however, we have already clustered or selected our samples by phenotype and would like to either determine which (genomic) features define the clusters or use genomic features to assign new samples to the correct cluster. This is the classification problem. For example, we might classify tissue biopsy samples into normal or cancerous. In the computer science and machine learning literature, classification is sometimes called "supervised learning" because the algorithm "learns" the estimated parameters of the classification rule from the data with guidance from the known classes. By contrast, clustering is sometimes called "unsupervised learning" because the algorithm must learn both the classes and the classification rule from the data.
The typical starting point is that we have some samples with known classification as well as other measurements. The purpose is to use the known samples, which is called the training set, to get a classification rule for new samples whose class is not known but for whom the other measurements are available. As a simple example, we might think of determining whether a tissue sample is normal or cancer based on features seen under a microscope. Typically for this course however, the measurements will be high throughput measurements such as gene expression or methylation.
Since we will use the training set to tune the classification rule, we expect to do better classifying the training samples than new samples using our rule. For this reason, to assess how well our classification rule works, or to select among several competing rules, we also maintain another set of samples, the validation set, for which we also know the correct classification. The validation set is not used for tuning the classification rule; the elements of the validation set are classified using the classification rule and these classifications are then compared with the correct classifications to assess the misclassification rate. When choosing between different classifiers, we usually choose the one with the lowest misclassification rate on the validation set.  We will discuss this more thoroughly in the lesson on cross-validation and bootstraps.
Classification using genomic features can be useful to replace or enhance other methods. If a small set of genes associated with different types of cancer can be identified, then an inexpensive microarray might be able to diagnose whether a tissue sample is benign or cancerous more rapidly and accurately than a pathologist.
However, there is a problem with our agenda - overfitting. Whenever the number of features is larger than the number of samples, we methods like linear discriminant analysis classify the training sample perfectly - even if the features are not actually associated with the phenotype. This means that with n samples almost any set of n+1 features can achieve perfect classification of the training set. Although we can assess how well the classification rule works using the validation set, we can end up with millions of candidates as possible "good" classification rules. Both statisticians and computer scientists have worked on this problem, attempting to find ways to find a close to optimal set of features for classification .
We are not going to look at the methods for selecting a small number of features to use for classification. However, we will look at 3 popular methods for developing classification rules once the features have been selected. Recursive partitioning, which searches through the features at each step (that's why it is "recursive") can also be considered a feature selection method as usually only a subset of the features will actually be used in the classification rule.
- Recursive Partitioning (partition trees)
- Discriminant Analysis
- Support Vector Machines (SVM)
The first two of these methods have been around for a while. SVM, while not new, is the newest method of the three. There is a large literature on developing classification rules in both statistics and machine learning. These three methods are very popular in both the statistics and machine learning worlds.
 Lever, J., Krzywinski, M. & ., Altman, N. (2016). Points of Significance: Classification Evaluation. Nature Methods, 13(8), 603-604. doi:10.1038/nmeth.3945
 Lever, J., Krzywinski, M. & ., Altman, N. (2016). Points of Significance: Model Selection and Overfitting. Nature Methods, 13(9), 703-704. doi:10.1038/nmeth.3968