STAT 555

Lesson 10: Clustering

Key Learning Goals for this Lesson:
  • Understanding some clustering algorithms and how they are used
  • Understanding how the choice of distance measure affects the interpretation of the clusters
  • Learning how to assess the quality of a clustering

Introduction

Clustering is a set of methods that are used to explore our data and to assist in interpreting the inferences we have made.  In the machine learning literature it is one of a set of methods referred to as "unsupervised learning" - "unsupervised" because we are not guided by a priori ideas of which features or samples belong in which clusters, and "learning" because the machine algorithm "learns" how to cluster.  Another name for this set of techniques is "pattern recognition".

Clustering Genes and Samples 

Often we use sample clustering as part of our initial exploration of our data.  We usually have some idea about which samples should be most similar - technical replicates, biological replicates of the same treatments and so on - and we do the clustering to verify the expected sample clusters, and to make sure that there are not unusual outlier samples.  Occasionally mislabelled samples or outliers are found.

Feature clustering is usually done after a set of genes have been selected, to try to interpret what we have found.  In gene expression studies, we usually cluster based on how the genes express under several conditions, and expect the clusters to consist of genes with similar expression patterns.  Clustering could, in principle, be done with your entire set of genes, but there are two problems. The first problem is tied to computer memory and computer time - the algorithms most commonly used can readily handle a few hundred genes, but not a few thousand.  That is readily overcome using specialized software, but visualization of the results can still be problematic.  However, the second problem is more serious - many of the measures of pattern similarity consider only the shape of the pattern, not the magnitude.  So random variation in the sample means can be mistaken for a significant expression pattern unless genes with no significant differences among treatments are first removed.

In clustering we are interested in whether there are groups of genes or groups of samples that have similar gene expression patterns. The first thing that we have to do is to articulate what we mean by similarity or dissimilarity as expressed by a measure of distance. We can then use this measure to cluster genes or samples that are similar.

This lesson will talk about two methods: hierarchical clustering and k-means clustering (although we will demonstrate with a variant of k-means called k-medoids that seems to work a little better when there are a few extreme values).

10.1 - Hierarchical Clustering

Hierarchical clustering is a set of methods that recursively cluster two items at a time. There are basically two different types of algorithms, agglomerative and partitioning. In partitioning algorithms, the entire set of items starts in a cluster which is partitioned into two more homogeneous clusters.  Then the algorithm restarts with each of the new clusters, partitioning each into more homogeneous clusters until each cluster contains only identical items (possibly only 1 item).  If there is time towards the end of the course we may discuss partitioning algorithms.

In agglomerative algorithms, each item starts in its own cluster and the two most similar items are then clustered.  You continue merging the most similar items or clusters together two at a time until there is one cluster.  For both types of algorithms, the clusters at each step can be displayed in a dendrogram as we have seen with our microarray and RNA-seq data.

Agglomerative Process

  1. Choose a distance function for items  \(d(x_i, x_j)\)
  2. Choose a distance function for clusters \(D(C_i, C_j)\) - for clusters formed by just one point, D should reduce to d.
  3. Start from N clusters, each containing one item.  Then, at each iteration:
    • a) using the current matrix of cluster distances, find two closest clusters.
    • b) update the list of clusters by merging the two closest.
    • c) update the matrix of cluster distances accordingly
  4. Repeat until all items are joined in one cluster.
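As a concrete sketch of this recipe, R's built-in dist and hclust functions carry out steps 1 through 4; the matrix 'expr' below is simulated purely for illustration.

## A minimal sketch of the agglomerative recipe using base R ('expr' is simulated).
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6)))

d  <- dist(t(expr), method = "euclidean")  # step 1: distances between items (here, the samples)
hc <- hclust(d, method = "complete")       # steps 2-4: repeatedly merge the two closest clusters
plot(hc)                                   # dendrogram of the merge sequence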

This is called a greedy algorithm: it looks only at the current state and does the best it can at that stage, without looking ahead to see whether another choice would be better in the long run.  If two items are joined into the same group early on, they stay together even if a cluster that is actually closer to one of them develops later - you never get to 'shuffle' and put an item back into a better group.

A problem with the algorithm occurs when two pairs are equally close at a particular stage.  Only one pair is merged - usually the pair that comes first in the data matrix. After this pair is merged the distance matrix is updated, and it is possible that the second pair is no longer closest. If you had picked the other pair first, you could get a different clustering sequence.  This is typically not a big problem, but it could be if it happens early on. The only way to see whether this has happened is to shuffle the items and redo the clustering to see if you get a different result.

Distance Measures

One of the problems with any kind of clustering method is that clusters are always created but may not always have meaning. One way to think about this is to consider what would happen if we tried to cluster the people in a given room. We could cluster people in any number of ways and these would all be valid clusters. We could cluster by:

  • gender
  • profession
  • country of origin
  • height

These are all valid clusterings but they have different meanings.  The importance of the clustering depends on whether the clustering criterion is associated with the phenomenon under study.

In biological studies, we often don't know what aspects of the data are important. We have to set a distance measure that makes sense for our study. A set of commonly used distance measures is in the table below.  The vectors x and y are either samples, in which case their components are the expression measure for each feature, or features, in which case their components are the expression of the feature either in each sample or (better) the mean expression in each treatment.

'euclidean': Usual square distance between the two vectors (2 norm).
'maximum': Maximum distance between two components of x and y (supremum norm).
'manhattan': Absolute distance between the two vectors (1 norm).
'canberra': \(\sum(|x_i - y_i| / |x_i + y_i|)\). Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.
'minkowski': The p norm, the pth root of the sum of the pth powers of the differences of the components.
'correlation': 1 - r, where r is the Pearson or Spearman correlation.
'absolute correlation': 1 - |r|.

For most common hierarchical clustering software, the default distance measure is the Euclidean distance, the square root of the sum of the squared differences.  However, for gene expression, correlation distance is often used; the distance between two vectors is 0 when they are perfectly correlated.  The absolute correlation distance may be used when we consider genes to be close to one another either when they go up and down together or in opposition, i.e. when one gene is over-expressed wherever the other is under-expressed and vice versa. Absolute correlation distance is unlikely to be a sensible distance when clustering samples.  R's dist function computes distances between the rows of a matrix and offers several of these distance measures.
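As a sketch of how these distances can be computed in R, using a small simulated matrix x whose rows are genes (note that dist works on rows):

## Distances from the table above, computed with base R; 'x' is simulated.
x <- matrix(rnorm(20 * 8), nrow = 20)          # 20 genes x 8 samples

d.euc <- dist(x, method = "euclidean")
d.max <- dist(x, method = "maximum")
d.man <- dist(x, method = "manhattan")
d.can <- dist(x, method = "canberra")
d.min <- dist(x, method = "minkowski", p = 3)  # the p norm, here with p = 3

## Correlation distance is not built into dist(); cor() correlates columns,
## so transpose to get gene-by-gene (row) correlations.
d.cor    <- as.dist(1 - cor(t(x), method = "pearson"))
d.abscor <- as.dist(1 - abs(cor(t(x), method = "spearman")))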

To understand the differences between some of the distance measures, take a look at the graphs below, each showing the expression pattern of a particular gene (black) versus another gene on the log2 scale over 25 treatments (or 25 samples). Which gene's expression pattern is closest to the black gene's?

clustering graphs

Our visual assessment is closest to Euclidean distance - we tend to focus on the differences between the expression values, and so might consider the solid red gene to be the furthest and the solid blue to be the closest.  However, in clustering gene expression, we are more interested in the pattern than in the overall mean expression.

If you subtract the mean, which is called centering, you find that you can't even see the difference between the solid red and black genes.  The green, blue and dashed red genes already had the same mean as the black gene and were not affected by centering.  The similarity between the dashed green and black genes is now more apparent.

clustering graphs - centered
Centered gene expression

Now suppose we take the z-score of expression.  z-scores are computed from the centered data by dividing by the SD; this is called scaling.  The Euclidean distance between z-scores is equivalent to correlation distance (a short derivation follows the figure below).  We now see that all the genes except the green and dashed red genes are identical to the black gene after centering and scaling.  The green gene is actually now gone from the plot - because it was constant, its SD was zero and its z-score is undefined.  The dashed red gene has perfect negative correlation with the black gene. Using correlation distance, all of the genes except the red dashed one have distance zero from the black gene.  Using absolute correlation distance, all of the genes have distance zero from each other.

clustering graphs - centered and scaled
Centered and scaled (z-score = (y - mean) / sd)
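To see why Euclidean distance between z-scores behaves like correlation distance, note that for two genes x and y measured on n samples, with z-scores computed using the sample SD (denominator n - 1):

\[ \sum_{i=1}^n z_{x,i}^2 \;=\; \sum_{i=1}^n z_{y,i}^2 \;=\; n-1, \qquad \sum_{i=1}^n z_{x,i}\, z_{y,i} \;=\; (n-1)\, r , \]
\[ \lVert z_x - z_y \rVert^2 \;=\; \sum_{i=1}^n \left( z_{x,i} - z_{y,i} \right)^2 \;=\; 2\,(n-1)\,(1 - r) , \]

so the squared Euclidean distance between the z-scores is a constant multiple of the correlation distance \(1 - r\).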

It is interesting that the gene expression literature has a fair amount of discussion about what kind of clustering to use, and almost no discussion about what distance measure to use. And yet, as you will see in a later example, the two most common clustering methods produce nearly identical clusters when the same distance measure is used - you can very easily see that they have the same expression patterns. The bottom line is that what really matters is the distance measure.  Correlation works well for gene expression in clustering both samples and genes.

Pearson's correlation is quite sensitive to outliers.  This does not matter when clustering samples, because the correlation is over thousands of genes.  When clustering genes, it is important to be aware of the possible impact of outliers.  This can be mitigated by using Spearman's correlation instead of Pearson's correlation.

Defining Cluster Distance: The Linkage Function

So far we have defined a distance between items.  The linkage function tells you how to measure the distance between clusters. Again, there are many choices.  Typically you consider either a new item that summarizes the items in the cluster, or a summary of the distances between the items in the cluster and the items in other clusters.  Here is a list of four methods.  In each, x is an item in one cluster and y is an item in the other.

Single (string-like, long): \(f = \min(d(x,y))\)
Complete (ball-like, compact): \(f = \max(d(x,y))\)
Average (ball-like, compact): \(f = \text{average}(d(x,y))\)
Centroid (ball-like, compact): \(f = d(\text{ave}(X), \text{ave}(Y))\), where we take the average over all items in each cluster

For example, suppose we have two clusters \(C_1\) and \(C_2\) with elements \(x_{ij}\), where i indexes the cluster and j indexes the item within the cluster.  \(D(C_1, C_2)\) is a function f of the set of pairwise distances \(\{d(x_{1j}, x_{2k})\}\).

Single linkage looks at all the pairwise distances between the items in the two clusters and takes the distance between the clusters to be the minimum distance.

Complete linkage, which is more popular, takes the maximum distance.

Average linkage takes the average, which as it turns out is fairly similar to complete linkage.

Centroid linkage sounds the same as average linkage but instead of using the average distance, it creates a new item which is the average of  all the individual items and then uses the distance between averages. 

Single and complete linkage give the same dendrogram whether you use the raw data, the log of the data or any other transformation of the data that preserves the order because what matters is which ones have the smallest distance. The other methods are sensitive to the measurement scale. The most popular methods for gene expression data are to use log2(expression + 0.25), correlation distance and complete linkage clustering.
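A sketch of that pipeline in R, comparing the four linkage functions on a simulated expression matrix, might look like this:

## Simulated (genes x samples) matrix of non-negative expression values.
expr <- matrix(2^rnorm(200 * 8, mean = 6), nrow = 200)

y <- log2(expr + 0.25)            # log transform with a small offset
d <- as.dist(1 - cor(t(y)))       # correlation distance between genes

hc.complete <- hclust(d, method = "complete")
hc.single   <- hclust(d, method = "single")
hc.average  <- hclust(d, method = "average")
hc.centroid <- hclust(d, method = "centroid")  # centroid linkage is usually paired with squared Euclidean distances

par(mfrow = c(2, 2))
plot(hc.complete); plot(hc.single); plot(hc.average); plot(hc.centroid)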

10.2 - Example: Agglomerative Hierarchical Clustering

Example of Complete Linkage Clustering

Clustering starts by computing a distance between every pair of units that you want to cluster.  A distance matrix will be symmetric (because the distance between x and y is the same as the distance between y and x) and will have zeroes on the diagonal (because every item is distance zero from itself).  The table below is an example of a distance matrix.  Only the lower triangle is shown, because the upper triangle can be filled in by reflection.

distance matrix

Now let's start clustering.  The smallest distance is between items 3 and 5, so they get linked up, or merged, first into the cluster "35".

To obtain the new distance matrix, we need to remove the 3 and 5 entries and replace them with an entry "35".  Since we are using complete linkage clustering, the distance between "35" and every other item is the maximum of the distance between that item and 3 and between that item and 5.  For example, d(1,3)=3 and d(1,5)=11, so D(1,"35")=11.  This gives us the new distance matrix.  The items with the smallest distance get clustered next.  This will be 2 and 4.

distance matrix

Continuing in this way, after 6 steps, everything is clustered. This is summarized below.  On this plot, the y-axis shows the distance between the objects at the time they were clustered.  This is called the cluster height.  Different visualizations use different measures of cluster height.

complete linkage
Complete Linkage

Below is the single linkage dendrogram for the same distance matrix.  It also starts with cluster "35", but the distance between "35" and each item is now the minimum of d(x,3) and d(x,5).  So D(1,"35")=3.

single linkage

Single Linkage
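The original distance matrix appears only as a figure above, so the R sketch below uses a hypothetical 5-item matrix with the same key values (3 and 5 closest, d(1,3)=3, d(1,5)=11, then 2 and 4 next) to reproduce the complete versus single linkage comparison:

## Hypothetical distance matrix for items 1-5, consistent with the values quoted in the text.
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5, byrow = TRUE)
d <- as.dist(D)

par(mfrow = c(1, 2))
plot(hclust(d, method = "complete"), main = "Complete linkage")
plot(hclust(d, method = "single"),   main = "Single linkage")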

Determining clusters

One of the problems with hierarchical clustering is that there is no objective way to say how many clusters there are. 

If we cut the single linkage tree at the point shown below, we would say that there are two clusters.

single linkage with cut

However, if we cut the tree lower we might say that there is one cluster and two singletons.

single linkage with cut

There is no commonly agreed-upon way to decide where to cut the tree.

Let's look at some real data.  In homework 5 we consider gene expression in 4 regions of 3 human and 3 chimpanzee brains.  The RNA was hybridized to Affymetrix human gene expression microarrays.  We normalized the data using RMA and did a differential expression analysis using LIMMA.  Here we selected the 200 most significantly differentially expressed genes from the study. We cluster all these differentially expressed genes based on their mean expression in each of the 8 species by brain region treatments.

Here are the clusters based on Euclidean distance and correlation distance, using complete and single linkage clustering.

Euclidean, Complete    Euclidean, Single

Correlation, Complete    Correlation, Single

We can see that complete linkage tends to create compact clusters of clusters, while single linkage tends to add one point at a time to the cluster, creating long stringy clusters.  As we might expect from our discussion of distances, Euclidean distance and correlation distance produce very different dendrograms.

Hierarchical clustering does not tell us how many clusters there are, or where to cut the dendrogram to form clusters.  In R there is a function cutree which will cut a tree into clusters at a specified height or into a specified number of clusters.  However, based on our visualization, we might prefer to cut the long branches at different heights.  In any case, there is a fair amount of subjectivity in determining which branches should and should not be cut to form separate clusters.
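A sketch of how cutree is used, on simulated data and continuing the correlation-distance, complete-linkage example:

## 'expr' is simulated; 'hc' is a complete linkage tree built from correlation distance.
expr <- matrix(2^rnorm(200 * 8, mean = 6), nrow = 200)
hc   <- hclust(as.dist(1 - cor(t(log2(expr + 0.25)))), method = "complete")

clusters.k <- cutree(hc, k = 6)    # ask for exactly 6 clusters
clusters.h <- cutree(hc, h = 0.5)  # or cut every branch at height 0.5
table(clusters.k)                  # cluster sizes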

Understanding the clusters

To understand the clusters we usually plot the log2(expression) values of the genes in the cluster - in other words, we plot the gene expression over the samples. (The numbering in these graphs is completely arbitrary.)  Even though the treatments are unordered, I usually connect the points coming from a single feature to make the pattern clearer.  These are called profile plots.
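A profile plot can be drawn with matplot; here is a sketch on simulated data, assuming the cluster labels came from cutree as above:

expr <- matrix(2^rnorm(200 * 8, mean = 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("trt", 1:8)))
y  <- log2(expr + 0.25)
hc <- hclust(as.dist(1 - cor(t(y))), method = "complete")
cl <- cutree(hc, k = 6)

## Each line is one gene in cluster 1, drawn across the 8 treatments.
matplot(t(y[cl == 1, , drop = FALSE]), type = "l", lty = 1,
        xlab = "treatment", ylab = "log2 expression",
        main = "Profile plot, cluster 1")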

Here are some of the profile plots from complete linkage clustering when we used Euclidean distance:

hierarchical clustering comparisons, Euclidean distance

These look very tightly packed.  However, clusters 2 and 4 contain genes with different up and down patterns that were grouped together because they have about the same mean expression.  Cluster 2 consists of very highly expressed genes.

Here's what we got when we used correlation distance:

hierarchical clustering comparisons, Correlation distance

These are much looser on the y-axis because correlation focuses on the expression pattern, not the mean.  However, all the genes in the same cluster have a peak or valley in the same treatments (which are brain regions by species combinations).  Clusters 1 and 2 are genes that are respectively higher or lower in the cerebellum compared to other brain regions in both species. 

Selecting a gene list 

In principle it is possible to cluster all the genes, although visualizing a huge dendrogram might be problematic.  Usually, some type of preliminary analysis, such as differential expression analysis is used to select genes for clustering.  There are good reasons to do so, although there are also some caveats.

Typically in gene expression, the distance metric used is correlation distance.  Correlation distance is equivalent to centering and scaling the data and then using Euclidean distance.  When there are systematic treatment effects, we expect the variability of gene expression from treatment to treatment to be a mix of systematic treatment effects and noise.  When there are no treatment effects, the variability of gene expression is just due to noise.  However, centering and scaling the data puts all variability on the same scale.  Hence genes that exhibit a pattern due to chance are not distinguishable from those that have a systematic component.

As we have seen, correlation distance has better biological interpretation than Euclidean distance for gene expression studies, but the same scaling that makes it useful for finding biologically meaningful patterns of gene regulation introduces spurious results for genes that do not differentially express.  Selecting genes based on differential expression analysis removes genes which are likely to have only chance patterns.  This should enhance the patterns found in the gene clusters.

As a caveat, however, consider the effects of gene selection on clustering samples or treatments.  The selected genes are those which test positive in differential expression analysis.  Use of those genes to cluster samples is biased towards clustering the samples by treatment.

10.3 - Heatmaps

Heat maps are ways to simultaneously visualize clusters of samples and  features, in our case genes. First hierarchical clustering is done of both the rows and the columns of the expression matrix. Usually correlation distance is used, but neither the clustering algorithm nor the distance need to be the same for rows and columns. Then the branches of the dendrograms are rotated so that the blocks of 'high' and 'low' expression values are adjacent in the expression matrix. Finally, a color scheme is applied for the visualization and the expression matrix is displayed.

The branches of the trees are rotated to create blocks in which the individual values are close in both directions. These are color-coded by expression values. (But for correlation distance, we should use z-scores.)
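In R, the base heatmap() function follows this recipe; below is a sketch with correlation distance, complete linkage, and row z-scores for the colour display, on simulated data:

expr <- matrix(2^rnorm(200 * 8, mean = 6), nrow = 200)
y    <- log2(expr + 0.25)

heatmap(y,
        distfun   = function(m) as.dist(1 - cor(t(m))),    # correlation distance for rows and columns
        hclustfun = function(d) hclust(d, method = "complete"),
        scale     = "row",                                  # row z-scores, used for the colours only
        col       = heat.colors(64))                        # a red/yellow palette rather than red/green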

Euclidean distance heatmap

Euclidean distance:  Color coding is by mean gene expression

Correlation distance heatmap
Correlation distance: Color coding is by mean gene expression.

And here is the correlation distance heat map after converting to z-scores of the rows (genes).

Correlation distance heatmap, with z-scores
Correlation distance: Color coding after computing z-scores (row scaling)

This looks much better and you can see patterns picked out by the clustering algorithm.

[Please note that the default colors on most of these heat maps are red and green. Given that a certain percentage of the population is red/green color-blind, you may want to consider changing these to other colors that don't have this problem.]

Here are heat maps using yellow and red for colors for the brain study.  In the R lab you will have a look at the readily available color schemes and how to apply them.


Heatmap using Euclidean distance

Heatmap using Correlation distance

There are a number of genes that are high in cerebellum for both species and pretty low in other regions, and there are a number of genes that are low in cerebellum and high in other regions. This is pretty interesting for a number of reasons, not the least of which is that all the individuals in the study died of different causes, which might have affected gene expression in the brain just prior to death.  Nevertheless, a strong pattern is evident for both cerebellum and caudate, which is shared by both species.

10.4 - K-means and K-medoids

K-means and K-medoids clustering are other popular methods for clustering.  They require as input the data, the number K of clusters you expect, and K "centers" which are used to start the algorithm.  The centers have the same format as one of the data vectors.  As the algorithm progresses, the centers are recomputed along with the clusters.

The algorithm proceeds by assigning each data value to the nearest center.  This creates K clusters.  Then a new cluster center is selected based on the data in the cluster.  For the K-means algorithm, the distance is always Euclidean distance and the new center is the component-wise mean of the data in the cluster.  To use correlation distance, the data are input as z-scores.  For the K-medoids algorithm, other distances can be used and the new center is one of the data values in the cluster.

If there are no well-separated clusters in the data (the usual case) then the clusters found by the algorithm may depend on the starting values.  For this reason, it is common to try the algorithm with different starting values.
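A sketch of both algorithms in R (kmeans from base R, pam from the cluster package), using several random starts and simulated data:

library(cluster)                              # provides pam() for K-medoids

y <- matrix(rnorm(200 * 8), nrow = 200)       # e.g. 200 genes x 8 treatment means (simulated)
z <- t(scale(t(y)))                           # row z-scores, so Euclidean distance acts like correlation distance

km   <- kmeans(z, centers = 6, nstart = 25)   # 25 random starts; the best solution is kept
pam1 <- pam(z, k = 6)                         # K-medoids on the same z-scores
pam2 <- pam(dist(y, method = "manhattan"), k = 6)   # K-medoids with a robust distance

table(km$cluster, pam1$clustering)            # compare the two sets of cluster labels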

The K-means method is sensitive to anomalous data points and outliers. If you have an outlier, then whatever cluster it is included in, the centroid of that cluster will be pulled out towards that point. The K-medoids method is robust to outliers when robust distance measures such as Manhattan distance are used.

At each step of the algorithm, all the points may be reassigned to new clusters.  This means that mistaken cluster assignments early in the process are corrected later on.

If two centers are equally close to an observation (and closer than all the others) at a given iteration, we have to choose arbitrarily which one to cluster it with. The problem here is not as serious as in hierarchical clustering because points can move in later iterations.

The algorithm minimizes the total within cluster distance. This is basically like the average distance between the center of a cluster and everything in the cluster.  Clusters tend to be ball shaped with respect to the chosen distance.

Here is a case that is pretty ideal because there are two clearly separated clusters. Suppose we started at the open rectangles.

graph of k means clustering

The points that are closest to one of the open rectangles get put into a cluster defined by that centroid. The points nearest each of the open rectangles are shown in color here.  The centroids were then recomputed based on this current clustering, leading to the dashed rectangles. Data values are then reassigned to clusters and the process is repeated. In this example, only the two points with circles around them are assigned to new clusters in the second step.  The process is repeated until the clusters (and hence the cluster centroids) do not change.

The algorithm doesn't tell you how to pick the number of clusters. There are some measures about how tight the clusters are that could be used for this however.

Two-dimensional sets of points are easier to visualize than, for instance, our example with eight dimensions defined by the treatment means and 200 genes. This makes it much more difficult to see the clusters visually.  However, it turns out that K-means clustering is closely associated mathematically with a method called principal components analysis (PCA).  PCA picks out directions of maximal variance from high dimensional data (e.g. for gene expression, the expression value in one treatment is a dimension; for clustering samples, the expression of each gene in the sample is a dimension).  Scatterplots using the first 2 or 3 principal components can often be used to visualize the data in a way that clearly shows the clusters from K-means.  K-mediods is not mathematically associated with PCA, but the first 2 or 3 principal components still often do a good job of displaying the clusters.
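A sketch of this PCA display in R, colouring the points by their K-medoids cluster (simulated data standing in for the 200-gene by 8-treatment-mean matrix):

library(cluster)

y  <- matrix(rnorm(200 * 8), nrow = 200)      # simulated stand-in for the treatment means
cl <- pam(y, k = 6)$clustering

pc <- prcomp(y, scale. = TRUE)                # principal components of the 8-dimensional data
plot(pc$x[, 1], pc$x[, 2], col = cl, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-medoids clusters on the first two principal components")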

Below we show the PCA visualization of the brain data with the 8 treatment means of the 200 most differentially expressed genes.  We used K-medoid clustering with K=6 clusters and Euclidean distance.  Where clusters overlap on the plot, they might actually be separated if we could display 3 dimensions.  However, even in 2 dimensions we see that the clusters are quite well separated.

Euclidean distance

Using correlation distance for the brain data we obtain:

Correlation distance

Because correlations are squeezed between -1 and 1, the principal components plot of the correlation distance often has this elliptical shape, with most of the data on the boundary of the ellipse.  The clusters are still clearly separated.

However, the plots also show another problem with clustering with selected genes.  In a certain sense we can see that the separation into clusters is arbitrary - we could rotate the elliptical cluster boundaries around the boundary of the main ellipse, and we would still see clustered structure.  What we are doing is what Bryan [1] called 'zip coding'. This is an arbitrary way of putting nearby units into clusters with arbitrary boundaries just as zip-codes are arbitrary geographical boundaries.

Clustering is an exploratory method that helps us visualize the data, but it is only useful if it matches up with something biological that we can interpret. Before data mining became a standard part of machine learning, statisticians used to talk about data mining as a form of p-hacking - looking for patterns that may or may not be real. Clustering is very much in this 'searching for significance' mind-set. If the resulting clusters match up with something interpretable, this is good, but you still have to validate the clustering by taking another sample or using someone else's data to see if you get the same patterns.

We can also look at the cluster contents by looking at the profiles. These graphs, using both K-medoids (top) and complete linkage clustering (bottom), look very much the same here.  Both methods try to find "spherical" clusters, and the cluster patterns look very similar.  This gives us greater confidence that the clusters have picked out some meaningful patterns in our data.

cluster comparison PAM
K-medoids method

cluster comparison complete linkage
Complete linkage hierarchical clustering

10.5 - Other Partitioning Methods

There are many other ways to cluster data.  Some of the popular ones are summarized briefly below.

1. Self Organizing Maps (SOM): adds an underlying 'topology', a neighboring structure on a lattice, that relates cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999).

2. Fuzzy K-means: Instead of partitioning the data into clusters, these methods provide, for each item and each cluster, the probability that the item is a member of the cluster. This provides fuzzy boundaries between clusters and probabilities associated with centroids. This is an interesting method but is harder to visualize. Gasch and Eisen (2002).

3. Mixture Based Clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides 'soft partitioning', and allows for 'modeling' of cluster centroids and shapes. The idea is that each cluster should have only one center, with the items in the cluster distributed around the center.  For example, within each cluster, the data might be multivariate Normal, in which case the density of the data should have a peak at the center and elliptical contours. Mixture based clustering looks at the whole distribution and tries to figure out how many peaks there are and how many pieces you can decompose it into. It is similar to fuzzy K-means methods because for each item it produces a probability of cluster membership for each cluster. Yeung et al. (2001), McLachlan et al. (2002).

10.6 - Assessing the Clusters Computationally

Clustering is Exploratory

Clustering methods are exploratory - they can be used for data quality assessment and to generate hypotheses.  But no matter what goes into the clustering algorithm, clusters come out.  This is a classic "garbage in -- garbage out" situation.  Of course, we hope that what is going into the analysis is not garbage, but that does not guarantee that pearls of wisdom are coming out.

The bottom line is that the clustering is good if it is biologically meaningful. But, this is hard to assess. 

When clustering gene expression, it is better to cluster treatment means than the individual sample values which are noisy. We don't want to cluster on noise -- we want to cluster on the systematic effects that are of interest.  However, if you only have a few treatments you are better off clustering samples so that you have some kind of spread of expression.

There are measures of cluster goodness.  They work on the principle that distances between items in the same cluster should be small and that distances between items in different clusters should be large.  This is an internal check on cluster "tightness" but does not guarantee that the clusters are biologically meaningful.  

Look at the plot below.  Do you see clusters?  Where would you put the cluster boundaries if you planned to have 3 clusters?

initial plot of values to look for clusters

The points in the plot above were completely random, with uniform x and uniform y values.  Below is a cluster dendrogram from 4 rows of completely random data.

cluster dendrogram identifying clusters

Measures of Cluster Goodness

The most popular measure of cluster goodness is called the silhouette.  For each item you compute a, the average distance to the other items in its own cluster, and b, the smallest average distance to the items of any other cluster.  The silhouette width of the item is (b - a) / max(a, b).  What you want is very tight distances within clusters and long distances between clusters, so that a is small relative to b and the silhouette width is close to 1.
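The cluster package computes silhouette widths directly; here is a sketch on simulated data:

library(cluster)

y   <- matrix(rnorm(200 * 8), nrow = 200)     # simulated data
fit <- pam(y, k = 6)

sil <- silhouette(fit)            # one width per item: near 1 = well placed, near 0 or negative = borderline
summary(sil)$avg.width            # average silhouette width for the whole clustering
plot(sil)                         # silhouette plot, one bar per item, grouped by cluster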

Another assessment is a stability analysis. If the clusters are close together and you perturb the data a little bit then you get different clusters. However, if the clusters are far apart and you move the data a little bit, the same clusters appear. How do you perturb the data? There are two commonly used methods: resampling and adding noise.

Resampling (also called the bootstrap method) means that you take a sample from the observed data.  You subject this sample to the same clustering algorithm.  Repeat many times.  Then look at the intersections of the resulting clusters (or use what is called a consensus cluster).  If the clusters are stable, the intersections will be large.  We will look at this method in Chapter 15.

Another way of perturbing the data is by adding a little bit of noise.  This should be a small fraction of the difference between the smallest and largest values, so it should not disturb the clusters too much.  Items which move between clusters are not stable.  Again, we sometimes repeat this method several times and then form a consensus cluster.
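A minimal sketch of the noise-perturbation check in R, on simulated data, with the noise SD taken as a small fraction of the data range:

library(cluster)

y   <- matrix(rnorm(200 * 8), nrow = 200)
cl0 <- pam(y, k = 6)$clustering                     # clusters from the original data

eps <- 0.05 * diff(range(y))                        # noise SD: 5% of the data range
y.p <- y + matrix(rnorm(length(y), sd = eps), nrow = nrow(y))
cl1 <- pam(y.p, k = 6)$clustering                   # clusters from the perturbed data

table(cl0, cl1)                                     # stable clusters show up as large cells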


Source URL: https://onlinecourses.science.psu.edu/stat555/node/11