Lesson 10: Clustering

Key Learning Goals for this Lesson:
  • Understanding some clustering algorithms and how they are used
  • Understanding how the choice of distance measure affects the interpretation of the clusters
  • Learning how to assess the quality of a clustering

Introduction

Clustering is a set of methods used to explore our data and to assist in interpreting the inferences we have made.  In the machine learning literature it is one of a set of methods referred to as "unsupervised learning": "unsupervised" because we are not guided by a priori ideas of which features or samples belong in which clusters, and "learning" because the machine algorithm "learns" how to cluster.  Another name for this set of techniques is "pattern recognition".

Clustering Genes and Samples 

Often we use sample clustering as part of our initial exploration of our data.  We usually have some idea about which samples should be most similar - technical replicates, biological replicates of the same treatments and so on - and we do the clustering to verify the expected sample clusters, and to make sure that there are not unusual outlier samples.  Occasionally mislabelled samples or outliers are found.

Feature clustering is usually done after a set of genes has been selected, to try to interpret what we have found.  In gene expression studies, we usually cluster based on how the genes are expressed under several conditions, and expect the clusters to consist of genes with similar expression patterns.  Clustering could, in principle, be done with your entire set of genes, but there are two problems.  The first problem is tied to computer memory and computer time: the algorithms most commonly used can readily handle a few hundred genes, but not a few thousand.  That can be overcome using specialized software, but visualization of the results can still be problematic.  The second problem is more serious: many of the measures of pattern similarity consider only the shape of the pattern, not the magnitude.  So random variation in the sample means can be mistaken for a significant expression pattern unless genes with no significant differences among treatments are first removed.
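The second problem can be seen in a small numeric sketch.  The Python code below is only an illustration, not part of the course software: the two gene profiles are hypothetical values, and the helper names (`pearson`, `correlation_distance`, `euclidean_distance`) are our own.  It compares a shape-only measure (one minus the Pearson correlation) with Euclidean distance for a gene with large treatment differences and a gene whose differences are tiny but follow the same up-and-down pattern.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """1 - r: small when two profiles share a shape, whatever their scale."""
    return 1.0 - pearson(x, y)

def euclidean_distance(x, y):
    """Straight-line distance: sensitive to magnitude as well as shape."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical treatment means for two genes over four treatments.
responsive_gene = [2.0, 8.0, 4.0, 10.0]   # large differences among treatments
flat_gene = [5.96, 6.02, 5.98, 6.04]      # same shape, differences of noise size

d_shape = correlation_distance(responsive_gene, flat_gene)
d_size = euclidean_distance(responsive_gene, flat_gene)

print(f"correlation distance: {d_shape:.4f}")
print(f"Euclidean distance:   {d_size:.4f}")
```

The correlation distance is essentially zero, so a shape-only measure would place the nearly flat gene in the same cluster as the strongly responsive one, while the Euclidean distance keeps them far apart.  This is why genes with no significant treatment differences are filtered out before clustering on shape.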

In clustering we are interested in whether there are groups of genes or groups of samples that have similar gene expression patterns. The first thing that we have to do is to articulate what we mean by similarity or dissimilarity as expressed by a measure of distance. We can then use this measure to cluster genes or samples that are similar.

This lesson will talk about two methods: hierarchical clustering and k-means clustering (although we will demonstrate with a variant of k-means called k-medoids that seems to work a little better when there are a few extreme values).
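To give a feel for why a medoid can be less sensitive to extreme values than a mean, here is a minimal Python sketch.  It is an assumption-laden illustration rather than the algorithm used in the lesson's demonstrations: it uses a simple alternating scheme (assign points to the nearest medoid, then re-choose each cluster's medoid), one-dimensional hypothetical data, and fixed starting medoids.  The names `assign` and `k_medoids` are our own.

```python
def assign(points, medoids, dist):
    """Label each point with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist(p, points[m])) for p in points]

def k_medoids(points, init_medoids, dist, max_iter=100):
    """Alternate between assigning points and re-choosing each medoid."""
    medoids = list(init_medoids)
    for _ in range(max_iter):
        labels = assign(points, medoids, dist)
        new_medoids = []
        for m in medoids:
            members = [i for i, lab in enumerate(labels) if lab == m]

            def total_distance(candidate):
                # Sum of distances from one member to all members of its cluster.
                return sum(dist(points[candidate], points[j]) for j in members)

            # The medoid is the member closest, in total, to the others.
            new_medoids.append(min(members, key=total_distance))
        if new_medoids == medoids:
            break  # no medoid changed, so the clustering is stable
        medoids = new_medoids
    return medoids, assign(points, medoids, dist)

# Hypothetical one-dimensional data: two tight groups plus one extreme value.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 60.0]
medoids, labels = k_medoids(values, init_medoids=[0, 1],
                            dist=lambda a, b: abs(a - b))

upper = [v for v, lab in zip(values, labels) if lab == medoids[1]]
upper_mean = sum(upper) / len(upper)

print("medoid values:", [values[m] for m in medoids])
print("mean of the upper cluster:", upper_mean)
```

With these values the upper cluster's medoid is 11.0, a point inside the tight group, whereas the mean of the same cluster is 23.25, pulled toward the extreme value 60.  Because a medoid must be one of the observed points, a single outlier moves it much less than it moves the mean.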