
When you do cross-validation (CV) or the bootstrap, you produce many estimates or predictors - one for each of the pseudo-samples. These can be combined to improve the final estimate or prediction equation. In machine learning, this is called the "agreement of experts" or "ensemble learning". In bioinformatics it is called "consensus". There are many ways to reach consensus rules.

To understand some of the methods, consider the problem of classifying the bone marrow samples into 4 classes: normal, MGUS, SMM or MM. 

Strict consensus makes a prediction only when all of the CV or bootstrap samples make the same prediction. This can be useful in classification, as it separates the undisputed classifications from the questionable ones. It is also useful in tree-based methods, where it indicates that all the trees share the same node with the same branches at that node. However, many samples may end up unclassified because at least one CV or bootstrap sample gave a classification that differs from the rest.
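
As a minimal sketch, strict consensus can be implemented in a few lines of Python; the function name and the representation of the replicate predictions as a simple list of class labels are assumptions for illustration, not part of the original material.

```python
def strict_consensus(votes):
    """Return the common label if all replicates agree, else None."""
    first = votes[0]
    if all(v == first for v in votes):
        return first
    return None  # at least one replicate disagreed; leave unclassified

# All 100 bootstrap replicates agree -> classified
print(strict_consensus(["MM"] * 100))           # MM
# A single dissenting replicate -> not classified
print(strict_consensus(["MM"] * 99 + ["SMM"]))  # None
```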

Majority rule assigns whichever prediction gets the most votes. Of course, a unique winning prediction is not guaranteed. For example, if you do 100 bootstraps, a sample might be classified equally often into each of the 4 classes. Alternatively, it could be classified as normal 24 times, as MGUS and SMM 25 times each, and as MM 26 times, and then majority rule would assign it to MM. Majority rule can be problematic because it treats a vote of 0, 0, 0, 100 the same as a vote of 24, 25, 25, 26, even though the first is unanimous and the second is nearly a four-way tie.
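
A similar sketch for majority rule, using the 24/25/25/26 vote from the example above; again the function name is hypothetical.

```python
from collections import Counter

def majority_rule(votes):
    """Return the label with the most votes (ties broken arbitrarily)."""
    return Counter(votes).most_common(1)[0][0]

# The example from the text: 24 normal, 25 MGUS, 25 SMM, 26 MM
votes = ["normal"] * 24 + ["MGUS"] * 25 + ["SMM"] * 25 + ["MM"] * 26
print(majority_rule(votes))  # MM, even though only 26% of replicates chose it
```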

Strict majority rule uses the majority prediction only if it receives more than 50% of the votes. Samples with more varied predictions are left unclassified.
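
And a sketch of strict majority rule, which differs only in requiring the winning class to take more than half of the votes; the function name is again hypothetical.

```python
from collections import Counter

def strict_majority_rule(votes):
    """Return the top label only if it wins more than half of the votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

print(strict_majority_rule(["MM"] * 60 + ["SMM"] * 40))  # MM (60% > 50%)
print(strict_majority_rule(["normal"] * 24 + ["MGUS"] * 25
                           + ["SMM"] * 25 + ["MM"] * 26))  # None
```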

Applying these approaches to classification with B CV or bootstrap replicates, here is what you might see.

Strict consensus – you classify only those samples for which all B replicates agree; the remainder are not classified.

Majority rule – you classify each sample into its most frequently chosen class.

Strict majority rule – you classify each sample into its most frequently chosen class only if that class is chosen more than 50% of the time; otherwise it is not classified.

We are going to discuss three consensus methods - bagging, consensus clustering, and random forests (for consensus classification using classification trees).