
Lesson 14: Classification

Key Learning Goals for this Lesson:
  • Understanding classification as a supervised rule learning method
  • Understanding the differences between the training and validation samples and the new data to be classified
  • Learning about 3 classification methods: recursive partitioning, linear discriminant analysis and support vector machines.

Introduction

We have already discussed clustering, which is a means of finding patterns in our data by grouping similar features or similar samples. Often, however, we have already clustered or selected our samples by phenotype and would like to either determine which (genomic) features define the clusters or use genomic features to assign new samples to the correct cluster. This is the classification problem. For example, we might classify tissue biopsy samples as normal or cancerous.  In the computer science and machine learning literature, classification is sometimes called "supervised learning" because the algorithm "learns" the estimated parameters of the classification rule from the data with guidance from the known classes.  By contrast, clustering is sometimes called "unsupervised learning" because the algorithm must learn both the classes and the classification rule from the data.

The typical starting point is that we have some samples with known classification as well as other measurements. The purpose is to use these known samples, called the training set, to build a classification rule for new samples whose class is not known but for which the other measurements are available.  As a simple example, we might think of determining whether a tissue sample is normal or cancerous based on features seen under a microscope.  Typically for this course, however, the measurements will be high throughput measurements such as gene expression or methylation.  

Since we will use the training set to tune the classification rule, we expect to do better classifying the training samples than new samples using our rule.  For this reason, to assess how well our classification rule works, or to select among several competing rules, we also maintain another set of samples, the validation set, for which we also know the correct classification.  The validation set is not used for tuning the classification rule; the elements of the validation set are classified using the classification rule and these classifications are then  compared with the correct classifications to assess the misclassification rate.  When choosing between different classifiers, we usually choose the one with the lowest misclassification rate on the validation set. [1]  We will discuss this more thoroughly in the lesson on cross-validation and bootstraps.
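To make the workflow concrete, here is a minimal sketch (in Python with scikit-learn; the course's own software may differ, and the data objects below are random placeholders rather than the real data) of holding out a validation set and estimating the misclassification rate on it:

# A minimal sketch of the training/validation split and the misclassification rate.
# 'expr' and 'marrow_type' are hypothetical stand-ins for real data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
expr = rng.normal(size=(99, 10))                    # placeholder expression matrix
marrow_type = rng.choice(["N", "G", "S", "M"], 99)  # placeholder class labels

# Split the labelled samples into a training set and a validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    expr, marrow_type, test_size=0.3, random_state=1, stratify=marrow_type)

# Tune the classification rule on the training set only.
rule = DecisionTreeClassifier(min_samples_leaf=5).fit(X_train, y_train)

# Assess the rule on the validation set, which played no role in the tuning.
misclassification_rate = np.mean(rule.predict(X_valid) != y_valid)
print(misclassification_rate)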

Classification using genomic features can be useful to replace or enhance other methods.  If a small set of genes associated with different types of cancer can be identified, then an inexpensive microarray might be able to diagnose whether a tissue sample is benign or cancerous more rapidly and accurately than a pathologist.  

However, there is a problem with our agenda - overfitting.  Whenever the number of features is larger than the number of samples, methods like linear discriminant analysis can classify the training sample perfectly - even if the features are not actually associated with the phenotype.  This means that with n samples almost any set of n+1 features can achieve perfect classification of the training set.  Although we can assess how well the classification rule works using the validation set, we can end up with millions of candidate feature sets as possible "good" classification rules.  Both statisticians and computer scientists have worked on this problem, attempting to find ways to select a close-to-optimal set of features for classification [2].

We are not going to look at the methods for selecting a small number of features to use for classification.  However, we will look at 3 popular methods for developing classification rules once the features have been selected.  Recursive partitioning, which searches through all of the features at each split and applies the same splitting procedure recursively to each resulting subgroup (which is why it is "recursive"), can also be considered a feature selection method, as usually only a subset of the features will actually be used in the classification rule.

  • Recursive Partitioning (partition trees)
  • Discriminant Analysis
  • Support Vector Machines (SVM)

The first two of these methods have been around for a while. SVM, while not new, is the newest method of the three. There is a large literature on developing classification rules in both statistics and machine learning.  These three methods are very popular in both the statistics and machine learning worlds.

References:

[1] Lever, J., Krzywinski, M. & Altman, N. (2016). Points of Significance: Classification Evaluation. Nature Methods, 13(8), 603-604. doi:10.1038/nmeth.3945

[2] Lever, J., Krzywinski, M. & Altman, N. (2016). Points of Significance: Model Selection and Overfitting. Nature Methods, 13(9), 703-704. doi:10.1038/nmeth.3968

14.1 - Example: Bone Marrow Cancer Data

Example

We will explore classification using the 10 most differentially expressed genes from a study using 99 Affymetrix microarrays from GSE47552, which focused on the differences in gene expression among normal, transitioning and cancerous bone marrow samples.  The number of samples for each of the 4 bone marrow types is summarized below:

Normal (N)   5
MGUS (G)    20
SMM (S)     33
MM (M)      41

To give some idea of the challenge of finding a classification rule for these 99 samples, we start with a cluster diagram based on the 120 most differentially expressed genes.  As can be seen below, even with 120 genes, the samples do not cluster cleanly by marrow type.  We will see how well we can do with classifying the samples using only 10 genes.  

dendrogram

As always, before starting an analysis, we should explore the data.  I usually use scatterplots.  Since our objective is to classify the samples using the gene expression, let's start by looking at all the samples using a scatterplot matrix based on the genes.  Here I display only the 5 most differentially expressed genes, due to space limitations.  The sample types are plotted in color, where Normal ("N") is green, "G" is black, "M" is red and "S" is blue.  We can see that we should be able to do well in classifying the Normal samples, and reasonably well in classifying the "G" samples, but distinguishing between "M" and "S" might be difficult.

Top 5 genes used

We can see that these highly differentially expressed genes are also highly correlated with each other.  As well, we can see that the Normal group (green) is fairly well separated from the other samples using any pair of genes, but that the cancerous samples are intermingled.

Note as well that the usefulness of a gene set for classification (or prediction) is not the same as the usefulness of the gene set for understanding the biological process which created  the condition.  The genes useful for classification need only be associated with the phenotype - they are not necessarily causal.

14.2 - Recursive Partitioning

Recursive Partitioning for Classification

Recursive partitioning is based on a very simple idea; it is, in a sense, the inverse of hierarchical clustering.

In hierarchical clustering, we start with individual items and cluster those that are closest in some metric.  In recursive partitioning, we start with a single cluster and then split it into clusters that have the smallest within cluster distances in some metric.  When we are doing recursive partitioning for classification, the "within cluster distance" is just a measure of how homogeneous the cluster is with respect to the known classes of the objects in it.  For example, a cluster with all the same tissue type is more homogeneous than a cluster with a mix of tissue types.

There are several measures of cluster homogeneity.  Denote the proportion of the node coming from type i as \( p(i|node)\).  The most popular homogeneity measure is the Gini index: \(\Sigma_{i \ne j} p(i|node)p(j|node)\), where i, j are the classes. If a node is homogeneous (contains only one class), its Gini index is zero. The impurity (information) index is a similar idea based on information theory: \(-\Sigma_{i} p(i|node) \log p(i|node)\).
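As a small illustration (a Python sketch, not part of the original notes), both impurity measures can be computed directly from the class proportions in a node:

# Node impurity measures computed from the class proportions p(i|node).
import numpy as np

def gini(p):
    # Gini index: sum over i != j of p_i * p_j, which equals 1 - sum_i p_i^2.
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Information (entropy) impurity: -sum_i p_i log(p_i), with 0*log(0) taken as 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(gini([1.0, 0.0, 0.0]))      # 0.0 for a pure node
print(gini([0.5, 0.25, 0.25]))    # larger for a mixed node
print(entropy([1.0, 0.0, 0.0]))   # 0.0 for a pure node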

For gene expression data, we start with a set of preselected genes and some measure, such as gene expression, for each gene. These may have been selected based on network analysis, or more simply based on a differential expression analysis.  We start with a single node (cluster) containing all the samples, and recursively split into increasingly homogeneous clusters.  At each step, we select a node to split and split it independently of other nodes and any splits already performed.

The measures on each gene are the splitting variables.  For each possible value of each splitting variable, we partition the node into two groups - samples above the cut-off or less than or equal to the cut-off.  The algorithm simply considers all possible splits for all possible variables and selects the variable and split that create the most homogeneous clusters.  (Note that if a variable includes two values, say A and B, with no values between them, then any cut-off between A and B produces the same 2 clusters, so the algorithm only needs to search through a finite number of splits - the midpoints between adjacent observed values.)

We stop splitting when the leaves are all pure or have fewer than r elements (where r is chosen in advance).

Hence this process creates a recursive binary tree.
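The search over splits described above can be sketched in a few lines of code (a Python sketch; X and y are hypothetical placeholders for the feature matrix and the class labels).  For each variable, only the midpoints between adjacent observed values need to be tried, and each candidate split is scored by the size-weighted Gini impurity of the two resulting children:

# One splitting step: try every variable and every midpoint between adjacent
# observed values; keep the split whose children have the smallest weighted Gini impurity.
import numpy as np

def node_gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    best = (np.inf, None, None)      # (weighted impurity, variable index, cut-off)
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        cutoffs = (values[:-1] + values[1:]) / 2   # midpoints between adjacent values
        for c in cutoffs:
            left = y[X[:, j] <= c]
            right = y[X[:, j] > c]
            score = (len(left) * node_gini(left) + len(right) * node_gini(right)) / n
            if score < best[0]:
                best = (score, j, c)
    return best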

recursive partitioning process

When splitting stops, the samples in each node are classified according to a majority rule.  So the left nodes above are declared red (upper) and yellow (lower) leading to 3 misclassified samples.  The right nodes are declared either black or red (upper) and black (lower) leading to one misclassified sample.

It is quite easy to overfit when using this method.  Overfitting occurs when we fit the noise as well as the signal.  In this case, it would mean adding a branch that works well for this particular training sample only by chance.  An overfitted classifier works well with the training sample, but when new data are classified it makes more mistakes than a classifier with fewer branches.

Determining when to stop splitting is more of an art than a science.  It is usually not useful to partition down to single elements as this increases overfitting. Cross-validation, which will be discussed in the next lesson, is often used to help tune the tree selection without over-fitting.

How do you classify a new sample for which the category is unknown? The new sample will have the same set of variables (measured features). You start at the top of the tree and "drop" the sample down the tree.  The classification for the new sample is the majority type in the terminal node. Sometimes, instead of a "hard" classification into a type, the proportion of training samples in the node is used to obtain a probabilistic (fuzzy) classification.  If the leaf is homogeneous, the new sample is assigned to that type with probability one.  If the leaf is a mixture, then you assign the new sample to each component of the mixture with probability equal to the mixture proportion.
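For example, with a fitted tree the hard and fuzzy classifications of new samples might be obtained as follows (a Python/scikit-learn sketch; the training data below are random placeholders, not the course data):

# Classifying new samples with a fitted partition tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X_train = rng.normal(size=(99, 10))                 # placeholder training data
y_train = rng.choice(["N", "G", "S", "M"], 99)
tree = DecisionTreeClassifier(min_samples_leaf=5).fit(X_train, y_train)

X_new = rng.normal(size=(3, 10))          # new samples with the same 10 features
print(tree.predict(X_new))                # "hard" classification: majority type in the leaf
print(tree.predict_proba(X_new))          # "fuzzy" classification: leaf class proportions
print(tree.classes_)                      # column order of the probabilities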

A more recent (and now more popular) method is to grow multiple trees using random subsets of the data.  This is called a "random forest".  The final classifier uses all of the tree rules.  To classify a new sample, it is dropped down all the trees (following the tree rules).  This gives it a final classification by "voting" - each sample is classified into the type obtained from the majority of trees.  This consensus method will be discussed in more detail in the lesson on cross-validation, bootstraps and consensus.
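A random forest sketch in the same style (Python/scikit-learn, again with placeholder data rather than the course data); the predicted probabilities are just the fractions of trees voting for each type:

# Random forest: grow many trees on random subsets and classify by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X_train = rng.normal(size=(99, 10))                 # placeholder training data
y_train = rng.choice(["N", "G", "S", "M"], 99)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

X_new = rng.normal(size=(2, 10))
print(forest.predict(X_new))         # majority vote over the 500 trees
print(forest.predict_proba(X_new))   # fraction of trees voting for each type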

Recursive partitioning bases the split points on each variable, looking for a split that increases the homogeneity of the resulting subsets.  If two splits are equally good, the algorithm selects one.  The "goodness" of the split does not depend on any measure of distance between the nodes, only the within node homogeneity.  All the split points are based on one variable at a time; therefore they are represented by vertical and horizontal partitioning of the scatterplot. We will demonstrate the splitting algorithm using the two most differentially expressed genes as seen below.

Gene 1 vs Gene 2

The first split uses gene 2 and splits into two groups based on log2(expression) above or below 7.406.  

Split 1

Then the two groups are each split again.  They could be split on either gene, but both are split on gene 1.  This gives 4 groups: 

gene 2 below 7.406, gene 1 above/below 3.35

gene 2 above 7.406, gene 1 above/below 5.6.  (note that the "above" group is a bit less homogeneous than it would be with a higher split, but the "below" group is more homogeneous)

Split 3

The entire tree grown using just these 2 genes is shown below:

Tree using 2 genes

Each of the terminal (leaf) nodes is labelled with the category making up the majority.  The numbers below each leaf are the numbers of samples in each class, in the order G\M\N\S.  We can see that none of these leaf nodes is purely one category, so there are lots of misclassified samples.  However, we can distinguish reasonably well between the normal and cancerous samples, and between G and the other two cancer types.  How well can we do with the 10 best genes?  That tree is displayed below:

Tree using 10 genes

Only 5 of the 10 most differentially expressed genes are used in the classification.  We do a bit better than with only two genes, but still have a lot of misclassified samples.

Notice that we have misclassified two tumor samples as normal.  We might be less worried about misclassifying the cancer samples among themselves and more worried about misclassifying them as normal, since a classification as cancerous will likely lead to further testing.  Weighting the classification cost, to make it more costly to misclassify cancer samples as normal, can provide a classification rule that does better at separating the normal from the cancer samples.  This is readily done by providing misclassification weights.
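As an illustration of the idea (not the analysis actually run in this lesson), a tree can be fit with class weights that make errors on the cancer classes more costly; the weights below are arbitrary and purely for demonstration (Python/scikit-learn sketch with placeholder data):

# Class weights are one way to encode asymmetric misclassification costs:
# here errors on the cancer classes are treated as 5 times as costly as errors on Normal.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X_train = rng.normal(size=(99, 10))                 # placeholder training data
y_train = rng.choice(["N", "G", "S", "M"], 99)

weights = {"N": 1.0, "G": 5.0, "S": 5.0, "M": 5.0}  # illustrative weights only
weighted_tree = DecisionTreeClassifier(min_samples_leaf=5,
                                       class_weight=weights).fit(X_train, y_train)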

14.3 - Discriminant Analysis

Discriminant analysis is the oldest of the three classification methods. It was originally developed for multivariate normally distributed data. Actually, for linear discriminant analysis to be optimal, the data as a whole need not be normally distributed, but within each class the data should be normally distributed.  This means that if you could plot the data, each class would form an ellipsoid, but the means would differ.

The simplest type of discriminant analysis is called linear discriminant analysis or LDA.  The LDA model is appropriate when the ellipsoids have the same orientation as shown below.  A related method called quadratic discriminant analysis is appropriate when the ellipsoids have different orientations.

To understand how the method works, consider the 2 ellipses in Figure 1a, which have the same orientation and spread.  Now consider drawing any line on the plot, and putting the centers of the ellipses on the line by drawing an orthogonal to the line (Figure 1b shows 2 examples).  Now consider putting all the data on the line also by drawing an orthogonal from each data point to the line (Figure 1c).  This is called the orthogonal projection.

For each ellipse, the projected data has a Normal distribution (Figure 1d) with mean equal to the projected center and some variance.  As we change the orientation of the line, the means and variances change.  Note that only the orientation of the line matters for the distance between the means and for the 2 variances.  For a given line, call the two means \(\mu_1\) and \(\mu_2\) and the two variances \(\sigma^2_1\) and \(\sigma^2_2\).  Then, similar to the two-sample t-test, we consider the difference between the means measured in SDs:

\[\frac{\mu_1- \mu_2}{\sqrt{\sigma^2_1+\sigma^2_2}}\]

Now consider a point M on the line which minimizes \((\mu_1-M)^2/\sigma_1^2 + (\mu_2-M)^2/\sigma_2^2\).  When the variances are the same, M is just the midpoint between the two means.  The line through M orthogonal to the projection direction is called a separating line (Figure 1e).  (When there are more than two dimensions it is called a separating hyperplane.)  The best separating line (hyperplane) is the one through M that is orthogonal to the projection direction maximizing the standardized difference in means above.  Data points are classified according to which side of the hyperplane they lie on or, alternatively, to the class whose projected mean is closest to the projected point.

When the two data ellipses have the same orientation but different sizes (Figure 1f), M moves closer to the mean of the class with the smaller variance. Otherwise the computation is the same.

lda diagram

Figure 1: Linear discriminant analysis
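For completeness, here is the short calculation behind M.  Setting the derivative of \((\mu_1-M)^2/\sigma_1^2 + (\mu_2-M)^2/\sigma_2^2\) with respect to M equal to zero gives

\[\frac{M-\mu_1}{\sigma_1^2}+\frac{M-\mu_2}{\sigma_2^2}=0 \quad\Rightarrow\quad M=\frac{\mu_1/\sigma_1^2+\mu_2/\sigma_2^2}{1/\sigma_1^2+1/\sigma_2^2}=\frac{\sigma_2^2\,\mu_1+\sigma_1^2\,\mu_2}{\sigma_1^2+\sigma_2^2}.\]

When \(\sigma_1^2=\sigma_2^2\) this is just the midpoint \((\mu_1+\mu_2)/2\); when \(\sigma_1^2<\sigma_2^2\), M is pulled toward \(\mu_1\), the mean of the less variable class, exactly as in Figure 1f.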

It turns out that this method maximizes the ratio of the between group variance of the means to the within group variance among the points. Unlike recursive partitioning, there is both a concept of separating the groups AND of maximizing the distance between them. 

To account for the orientation and differences in spread, a weighted distance called Mahalanobis distance is usually used with LDA, because it provides a more justifiable idea of distance in terms of SD along the line joining the point to the mean.  We expect data from the cluster to be closer to the center if it is in one of the "thin" directions than if it is in one of the "fat" directions.  One effect of using Mahalanobis distance is to move the split point between the means closer to the less variable class as we have already seen in Figure 1f.

It is not important to know the exact equation for Mahalanobis distance, but for those who are interested, if two data points are denoted by x and y (and each is a vector) and if S is a symmetric positive definite matrix:

The Euclidean distance between x and y is \(\sqrt{(x-y)^\top(x-y)}\).

The S-weighted distance between x and y is \(\sqrt{(x-y)^\top S^{-1}(x-y)}\).

The Mahalanobis distance between x and the center \(c_i\) of class i is the S-weighted distance where S is the estimated variance-covariance matrix of the class.
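As a small numeric sketch (Python/NumPy; the numbers are arbitrary illustrations), the Euclidean and Mahalanobis distances differ only in the weighting matrix:

# Euclidean vs. Mahalanobis (S-weighted) distance from a point to a class center.
import numpy as np

x = np.array([5.0, 2.0])                 # a data point
c = np.array([3.0, 1.0])                 # the class center
S = np.array([[2.0, 0.6],                # estimated variance-covariance matrix
              [0.6, 0.5]])               # of the class (symmetric positive definite)

euclidean = np.sqrt((x - c) @ (x - c))
mahalanobis = np.sqrt((x - c) @ np.linalg.solve(S, x - c))   # uses S^{-1} as the weight
print(euclidean, mahalanobis)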

When there are more than 2 classes and the ellipses are oriented in the same direction, the discriminant directions are the eigenvectors of the ratio of the between to within variance matrices, and the separating hyperplanes are perpendicular to them.  These are not readily visualized on the scatterplot (and in any case, with multiple features it is not clear which scatterplot should be used), so instead the data are often plotted by their projections onto the eigenvectors.  This produces the directions in which the ratio of between class to within class variance is maximized, so we should be able to see the clusters on the plot, as shown in the example.  Basically, the plot is a rotation to new axes in the directions of greatest separation between the classes.
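A sketch of computing these projections (Python/scikit-learn; the course itself may use different software, and X and y below are random placeholders for the expression matrix and marrow types):

# Project the samples onto the discriminant directions; with K classes and p
# features there are at most min(p, K-1) of them.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X = rng.normal(size=(99, 10))                  # placeholder feature matrix
y = rng.choice(["N", "G", "S", "M"], 99)       # placeholder class labels

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)                 # columns are LD1, LD2, LD3 for 4 classes
print(scores.shape)                       # (99, 3)
print(np.mean(lda.predict(X) != y))       # training misclassification rate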

If the classes are elliptical but have different orientations, the optimal boundary turns out to be a quadratic surface rather than a linear one.  The method is then called quadratic discriminant analysis or QDA.

LDA and QDA often do a good job of classification even if the classes are not elliptical but neither can do a good job if the classes are not convex.  (See the sketch below).  A newer method called kernel discriminant analysis (KDA) uses an even more general weighting scheme than S-weights, and can follow very curved boundaries.  The ideas behind KDA are the same as the ideas behind the kernel support vector machine and will be illustrated in the next section.  

 

example plots

 

With "omics" data, we have a huge multiplicity problem.  When you have more features or variables than the number of samples we can get perfect separation even if the data are just noise! The solution is to firstly select a small set of features based on other criteria such as differential expression.  As with recursive partitioning, we use a training sample and validation sample and/or cross validation  to mimic how well the classifier will work with new samples.

Bone Marrow Data

Let's see how linear discriminant analysis works for our bone marrow data.  Recall that the scatterplot matrix of the five most differentially expressed genes indicates that we should be able to do well in classifying the Normal samples, and reasonably well in classifying the "G" samples, but distinguishing between "M" and "S" might be difficult.

Let's start by looking at the two genes we selected for recursive partitioning.  We have already seen the plot of the two genes, which is repeated below.

Gene 1 vs Gene 2

 

We can see that we might be able to draw a line separating the "N" samples from most of the others, and another separating the "G" samples from most of the others, but the "M" and "S" samples are not clearly separated.  Since we are using only 2 features, we have only 2 discriminant functions.  Projecting each sample onto the two directions and then plotting gives the plot below.  

LD2 plot

We can see that direction LD1 does a good job of separating out the  "N" and "G" groups, with the former having values less than -2 and the latter having values between -2 and 2.  The utility of direction LD2 is less clear, but it might help us distinguish between "G" and "M".  Direction LD1 is essentially the trend line on the scatterplot.

Using the 10 most differentially expressed genes allows us to find 3 discriminant functions.  Plotting them pairwise gives the plot below:

LD 10 plot

Note that LD1 on this plot is not the same as on the previous plot, as it is sitting in a 10-dimensional space instead of a 2-dimensional space.  However, by concentrating on the first column of the scatterplot matrix (LD1 is the x-axis for both plots) we can see that it is a very similar direction.  Samples with LD1<-2 are mainly "N" and between -2 and 2 are mainly "G".  Directions LD2 and LD3 are disentangling the "G", "M" and "S" samples, but not well.

We saw from the scatterplot matrix of the features that at least the first 5 features have a very similar pattern across the samples.  We might do better in classification if we could find genes that are differentially expressed but have different patterns than these.  There are a number of ways to do this.  One method is variable selection - start with a larger set of genes and then sequentially add genes that do a good job of classification.  However, when selecting variables we are very likely to overfit, so methods like the bootstrap and cross-validation must be used to provide less biased estimates of the quality of the classifier.

14.4 - Support Vector Machine

The idea behind the basic support vector machine (SVM) is similar to LDA - find a hyperplane separating the classes. In LDA, the separating hyperplane separates the means of the samples in each class, which is suitable when the data are sitting inside an ellipsoid or other convex set.  However, intuitively what we really want is a rule that separates the samples in the different classes which appear to be closest.  When the classes are separable (i.e. can be separated by a hyperplane without misclassification), the support vector machine (SVM) finds the hyperplane that maximizes the minimum distance from the hyperplane to the nearest point in each set.  The nearest points are called the support vectors. 

In discriminant analysis, the assumption is that the samples cluster more tightly around the mean of the within class distribution.  Using this method the focus is on separating the ellipsoids containing the bulk of the samples within each class.

discriminant analysis example

Using an SVM, the focus is on separating the closest points in different classes.  The first step is to determine which point or points are closest to the other classes. These points are called support vectors because each sample is described by its vector of values on the genes (e.g. expression values).  After identifying the support vectors (circled below) for each pair of classes, we compute the separating hyperplane by computing the shortest distance from each support vector to the line.  We then select the line that maximizes the sum of these distances.  If it is impossible to completely separate the classes, then a penalty is applied for each misclassification - the selected line then maximizes the sum of distances minus the penalties, which balances the distances from the support vectors to the line against the misclassifications.

svm example plots

The separating hyperplane used in LDA is sensitive to all the data in each class because you are using the entire ellipse of data and its center. The separating hyperplane used in SVM is sensitive only to the data on the boundary, which provides computational savings and makes it more robust to outliers.

This is how it is applied:

Suppose the data vector for the ith sample is \(x_i\), which is the expression vector for several genes. Also, suppose we have only 2 classes, denoted by \(y_i= +/- 1\). The hyperplane is defined by a vector W and a scalar b such that

\(y_i(x_i'W+b)-1\ge 0\) for all i, with equality for the support vectors, and \(||W||\) is minimal. After computing W and b using the training set, a new sample x is classified by the sign of \(x'W+b\), i.e. by determining which value of y satisfies the inequality.

If the clusters overlap, the "obvious" extension is to find the hyperplane that still separates the bulk of the samples as well as possible, while penalizing misclassified samples.

svm example plots

This is the solution to the optimization problem: \(y_i(x_i'W+b)-1+z_i \ge 0\) for all i, with equality for the support vectors, where all \(z_i \ge 0\) and \(||W|| + C\Sigma_i z_i\) is minimal for some chosen constant C.

The idea is that some points can be on the wrong side of the separating hyperplane, but there is a penalty for each misclassification.   The hyperplane is now found by optimizing a criterion that still separates the support vectors as much as possible in the direction orthogonal to the hyperplane, but also penalizes the misclassifications.
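A sketch of a soft-margin linear SVM (Python/scikit-learn, on simulated data rather than the course data); the tuning parameter C plays the role of the penalty constant above, so a large C tolerates few misclassifications while a small C allows more:

# Soft-margin linear SVM on two overlapping simulated groups.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=0.0, size=(30, 2)),
               rng.normal(loc=2.5, size=(30, 2))])   # two overlapping groups
y = np.array([-1] * 30 + [+1] * 30)                  # class labels y_i = +/- 1

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_.shape)   # the support vectors found on the boundary
print(svm.coef_, svm.intercept_)    # W and b defining the separating hyperplane
print(svm.predict([[1.0, 1.0]]))    # classify a new point by its side of the hyperplane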

It is not clear that SVM does much better than linear discriminant analysis when the groups are actually elliptical. When the groups are not elliptical because of outliers SVM does better because the criterion is only sensitive to the boundaries and not sensitive to the entire shape.  However, like LDA, SVM requires groups that can be separated by planes.  Neither method will do well for an example like the one below. However, there is a method called the kernel method (sometimes called the kernel trick) that works for more general examples.

svm example plots

 

The idea is simple - the implementation is very clever.  

Suppose that we suspected that the clusters were in concentric circles (or spheres) as seen above.  If we designate the very center of the points as the origin, we can compute the distance from each point to the center as a new variable.  In this case, the clusters separate very nicely on the radius, creating a circular boundary.

The general idea is to add more variables that are nonlinear functions of the original variables until we have enough dimensions to separate the groups.
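Here is a sketch of doing this explicitly for the concentric-circle situation (Python/scikit-learn; make_circles simply simulates such data and is not the course data).  Adding the squared radius as a third variable makes the two classes linearly separable:

# Adding a nonlinear feature by hand: the squared radius r^2 = x1^2 + x2^2
# turns concentric circles into linearly separable classes.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

r2 = np.sum(X ** 2, axis=1, keepdims=True)       # the new variable: squared radius
X_aug = np.hstack([X, r2])                       # original 2 variables plus r^2

linear_on_original = SVC(kernel="linear").fit(X, y)
linear_on_augmented = SVC(kernel="linear").fit(X_aug, y)
print(np.mean(linear_on_original.predict(X) == y))       # poor: circles are not linearly separable
print(np.mean(linear_on_augmented.predict(X_aug) == y))  # essentially perfect with the r^2 feature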

The kernel trick uses the fact that for both LDA and SVM, only the dot products of the data vectors \(x_i'x_j\) are used in the computations.  Let's call \(K(x_i,x_j)=x_i'x_j\).  We replace \(K\) with a more general function called a kernel function and use \(K(x_i,x_j)\) as if it were the dot product in our algorithm.  (For those who are more mathematically inclined, the kernel function needs to be positive definite, and we will actually be using the dot product in the basis described by its eigenfunctions.)  

The most popular kernels are the radial kernel and polynomial kernel.  The radial kernel (sometimes called the Gaussian kernel) is  \(k(x_i, x_j)= \text{exp}(-||x_i-x_j||^2/2\sigma^2)\).

Here is an animation that shows how "kernel" SVM works with the concentric circles using the radial kernel.

Here is another example of essentially the same problem using a polynomial kernel, \(k(x_i, x_j)=(x_i'x_j+1)^2\), instead of the radial (Gaussian) kernel.
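A sketch of fitting kernel SVMs with these two kernels (Python/scikit-learn, on simulated concentric circles rather than the course data; note that scikit-learn writes the radial kernel as \(\text{exp}(-\gamma||x_i-x_j||^2)\), so \(\gamma\) corresponds to \(1/(2\sigma^2)\) above):

# Kernel SVMs with the radial (Gaussian) kernel and the quadratic polynomial kernel.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
poly_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y)  # (x_i'x_j + 1)^2

print(np.mean(rbf_svm.predict(X) == y))    # the radial kernel follows the circular boundary
print(np.mean(poly_svm.predict(X) == y))   # so does the quadratic polynomial kernel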

The plot below shows the results of ordinary (linear) SVM on our cancer data.  I believe the blue "tail" is an artifact of the plotting routine - the boundaries between the groups should be straight lines.  The data points marked with "X" are the support points - there are a lot of them because the data are so strung out.

Linear SVM classification plot

I also used SVM with the radial kernel (below).  The plot looks nicer, but there are actually more support points (15 instead of 12) and the same 2 samples are misclassified.

Radial SVM classification plot

