Bagging constructs a large number of trees with bootstrap samples from a dataset. But now, as each tree is constructed, take a random sample of predictors before each node is split. For example, if there are twenty predictors, choose a random five as candidates for constructing the best split. Repeat this process for each node until the tree is large enough. And as in bagging, do not prune.
Random Forests Algorithm
The random forests algorithm is very much like the bagging algorithm. Let N be the number of observations and assume for now that the response variable is binary.
Take a random sample of size N with replacement from the data (bootstrap sample).
Take a random sample without replacement of the predictors.
Construct a split by using predictors selected in Step 2.
Repeat Steps 2 and 3 for each subsequent split until the tree is as large as desired. Do not prune. Each tree is produced from a random sample of cases and at each split a random sample of predictors.
Drop the out-of-bag data down the tree. Store the class assigned to each observation along with each observation's predictor values.
Repeat Steps 1-5 a large number of times (e.g., 500).
For each observation in the dataset, count the number of trees that it is classified in one category over the number of trees.
Assign each observation to a final category by a majority vote over the set of trees. Thus, if 51% of the time over a large number of trees a given observation is classified as a "1", that becomes its classification.
Why Random Forests Work
Variance reduction: the trees are more independent because of the combination of bootstrap samples and random draws of predictors.
- It is apparent that random forests are a form of bagging, and the averaging over trees can substantially reduce instability that might otherwise result. Moreover, by working with a random sample of predictors at each possible split, the fitted values across trees are more independent. Consequently, the gains from averaging over a large number of trees (variance reduction) can be more dramatic.
Bias reduction: a very large number of predictors can be considered, and local feature predictors can play a role in tree construction.
Random forests are able to work with a very large number of predictors, even more, predictors than there are observations. An obvious gain with random forests is that more information may be brought to reduce bias of fitted values and estimated splits.
There are often a few predictors that dominate the decision tree fitting process because on the average they consistently perform just a bit better than their competitors. Consequently, many other predictors, which could be useful for very local features of the data, are rarely selected as splitting variables. With random forests computed for a large enough number of trees, each predictor will have at least several opportunities to be the predictor defining a split. In those opportunities, it will have very few competitors. Much of the time a dominant predictor will not be included. Therefore, local feature predictors will have the opportunity to define a split.
Indeed, random forests are among the very best classifiers invented to date (Breiman, 2001a).
Random forests include 3 main tuning parameters.
Node Size: unlike in decision trees, the number of observations in the terminal nodes of each tree of the forest can be very small. The goal is to grow trees with as little bias as possible.
Number of Trees: in practice, 500 trees is often a good choice.
Number of Predictors Sampled: the number of predictors sampled at each split would seem to be a key tuning parameter that should affect how well random forests perform. Sampling 2-5 each time is often adequate.
Taking Costs into Account
In the example of domestic violence, the following predictors were collected from 500+ households: Household size and number of children; Male / female age (years); Marital duration; Male / female education (years); Employment status and income; The number of times the police had been called to that household before; Alcohol or drug abuse, etc.
Our goal is not to forecast new domestic violence, but only those cases in which there is evidence that serious domestic violence has actually occurred. There are 29 felony incidents which are very small as a fraction of all domestic violence calls for service (4%). And they would be extremely difficult to forecast. When a logistic regression was applied to the data, not a single incident of serious domestic violence was identified.
There is a need to consider the relative costs of false negatives (fail to predict a felony incident) and false positives (predict a case to be a felony incident when it is not). Otherwise, the best prediction would be assuming no serious domestic violence with an error rate of 4%. In random forests, there are two common approaches. They differ by whether costs are imposed on the data before each tree is built, or at the end when classes are assigned.
Weighted Classification Votes: After all of the trees are built, one can differentially weight the classification votes over trees. For example, one vote for classification in the rare category might count the same as two votes for classification in the common category.
Stratified Bootstrap Sampling: When each bootstrap sample is drawn before a tree is built, one can oversample one class of cases for which forecasting errors are relatively more costly. The procedure is much in the same spirit as disproportional stratified sampling used for data collection (Thompson, 2002).
Using a cost ratio of 10 to 1 for false negatives to false positives favored by the police department, random forests correctly identify half of the rare serious domestic violence incidents.
In summary, with forecasting accuracy as a criterion, bagging is in principle an improvement over decision trees. It constructs a large number of trees with bootstrap samples from a dataset. Random forests are in principle an improvement over bagging. It draws a random sample of predictors to define each split.
Package for Random Forests
In R, the random forest procedure can be implemented by the "randomForest" package.
rf=randomForest(x=X.train,y=as.factor(Y.train),importance=T,do.trace=50, ntree=200,classwt=c(5,1)) print(rf)