
Consensus clustering is another way of using bootstrap sampling.

We start by clustering our data using whatever method we prefer - e.g. complete linkage hierarchical clustering.  To use consensus clustering, we need to break the tree into clusters by a rule that can be repeated on other samples, such as choosing a fixed number of clusters or cutting the tree at a fixed height.

We then do bootstrap sampling.  In clustering, identical items will automatically be clustered together, so the duplicated observations that the nonparametric bootstrap always produces are problematic.  There are two ways around this.  Often the semiparametric bootstrap is used: every original observation is kept and noise is added to it.  Because each original observation appears exactly once in each bootstrap sample, it can be traced to its cluster in that sample.  The alternative is to sample without replacement, drawing the expected number of unique items in a bootstrap sample.  A given item appears in a bootstrap sample of size \(N\) with probability \(1-(1-1/N)^N\), so for large sample sizes the expected number of unique items is about \((1-1/e)N\), where \(e\) is the base of the natural logarithm and \(N\) is the sample size.  Since \(1-1/e\approx 0.632\), this is sometimes called the .632 bootstrap.
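
The two resampling schemes can be sketched as below. This is a minimal illustration using only the Python standard library. The function names `noisy_replicate` and `subsample_indices`, the Gaussian noise, and its scale `sigma` are assumptions made for the example, since the notes do not say which noise distribution to use.

```python
import math
import random

def noisy_replicate(data, sigma, rng):
    """Return a copy of every observation with noise added.

    Assumption: independent N(0, sigma^2) noise on each coordinate.
    Row k of the result still corresponds to original item k.
    """
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in data]

def subsample_indices(n, rng):
    """Draw round((1 - 1/e) * n) distinct item indices without replacement."""
    m = round((1.0 - 1.0 / math.e) * n)
    return sorted(rng.sample(range(n), m))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy data: four items measured on two features (made-up values).
data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]

replicate = noisy_replicate(data, sigma=0.05, rng=rng)  # semiparametric scheme
kept = subsample_indices(100, rng)                      # subsampling scheme
```

With the first scheme every item is present in every bootstrap sample. With the second, only the items listed in `kept` are clustered in that sample, so each pair of items is observed together in only some of the samples.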

For each bootstrap sample we obtain the clusters.  Then we seek consensus.

The simplest consensus approach creates a matrix whose entry \(C_{ij}\) counts the number of bootstrap samples in which items \(i\) and \(j\) are in the same cluster.  Items that clearly cluster together should be in the same cluster often, while those that are clearly distant may never be in the same cluster.  A function of this matrix is used as the distance measure - e.g. \(\text{distance}(i,j)=1/C_{ij}\).  This particular choice is undefined when \(C_{ij}=0\), so pairs that never cluster together need a convention such as an infinite or very large distance.  The consensus clustering is formed by clustering (again with any method desired) with this new distance measure.
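
The counting step might look like the following sketch, which assumes the cluster labels for each bootstrap sample have already been obtained by some clustering method. The function names `co_cluster_counts` and `consensus_distance`, the dictionary representation of a labeling, and the choice of infinity for pairs with \(C_{ij}=0\) are illustrative assumptions rather than part of the method as described.

```python
import math

def co_cluster_counts(n_items, labelings):
    """Count how often each pair of items falls in the same cluster.

    labelings: one dict per bootstrap sample, mapping item index to its
    cluster label. Items not drawn in a sample are absent from its dict.
    """
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in labelings:
        items = list(labels)
        for a in items:
            for b in items:
                if labels[a] == labels[b]:
                    counts[a][b] += 1
    return counts

def consensus_distance(counts):
    """Turn counts into distance(i, j) = 1 / C_ij.

    Assumption: pairs that never co-cluster get an infinite distance,
    and each item is at distance 0 from itself.
    """
    n = len(counts)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                c = counts[i][j]
                dist[i][j] = 1.0 / c if c else math.inf
    return dist

# Three hypothetical bootstrap clusterings of four items; item 2 was not
# drawn in the third sample.
labelings = [
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "a", 1: "a", 2: "a", 3: "b"},
    {0: "x", 1: "x", 3: "y"},
]
C = co_cluster_counts(4, labelings)
D = consensus_distance(C)
```

The matrix `D` can then be handed to any clustering routine that accepts precomputed distances. Some routines will not accept infinite entries, in which case a large finite value is one possible substitute.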