12 Tree-based Methods
Overview
Textbook reading: Chapter 8: Tree-Based Methods.
Decision trees can be used for both regression and classification problems. Here we focus on classification trees. Classification trees are a very different approach to classification than prototype methods such as k-nearest neighbors. The basic idea of prototype methods is to partition the space and identify representative centroids.
They also differ from linear methods, e.g., linear discriminant analysis, quadratic discriminant analysis, and logistic regression. These methods use hyperplanes as classification boundaries.
Classification trees are a hierarchical way of partitioning the space. We start with the entire space and recursively divide it into smaller regions. In the end, every region is assigned to a class label.
Tree-Structured Classifier
The following textbook presents Classification and Regression Trees (CART):
Reference: Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Chapman & Hall, 1984.
Let’s start with a medical example to get a rough idea about classification trees.
A Medical Example
One big advantage of decision trees is that the classifier generated is highly interpretable. For physicians, this is an especially desirable feature.
In this example, patients are classified into one of two classes: high risk versus low risk. High-risk patients are those predicted not to survive at least 30 days, based on data from the initial 24 hours. There are 19 measurements taken from each patient during the first 24 hours, including blood pressure, age, etc.
Here a tree-structured classification rule is generated and can be interpreted as follows:
First, we look at the minimum systolic blood pressure within the initial 24 hours and determine whether it is above 91. If the answer is no, the patient is classified as high-risk. We don’t need to look at the other measurements for this patient. If the answer is yes, then we can’t make a decision yet. The classifier will then look at whether the patient’s age is greater than 62.5 years old. If the answer is no, the patient is classified as low risk. However, if the patient is over 62.5 years old, we still cannot make a decision and then look at the third measurement, specifically, whether sinus tachycardia is present. If the answer is no, the patient is classified as low risk. If the answer is yes, the patient is classified as high risk.
Only three measurements are looked at by this classifier. For some patients, only one measurement determines the final result. Classification trees operate similarly to a doctor’s examination.
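To see how such a tree-structured rule behaves as a classifier, here is a minimal R sketch of the rule described above; the function and argument names are hypothetical, and the thresholds (91 and 62.5) are the ones quoted in the text.

```r
# A hand-coded version of the medical decision rule above (illustration only).
classify_patient <- function(min_systolic_bp, age, sinus_tachycardia) {
  if (min_systolic_bp <= 91) return("high risk")   # first split: minimum systolic BP over 24 hours
  if (age <= 62.5) return("low risk")              # second split: age
  if (sinus_tachycardia) return("high risk")       # third split: sinus tachycardia present?
  "low risk"
}

classify_patient(min_systolic_bp = 88, age = 70, sinus_tachycardia = FALSE)  # "high risk"
```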
Objectives
Upon successful completion of this lesson, you should be able to:
- Understand the basic idea of decision trees.
- Understand the three elements in the construction of a classification tree.
- Understand the definition of the impurity function and several example functions.
- Know how to estimate the posterior probabilities of classes in each tree node.
- Understand the advantages of tree-structured classification methods.
- Understand the resubstitution error rate and the cost-complexity measure, their differences, and why the cost-complexity measure is introduced.
- Understand weakest-link pruning.
- Understand the fact that the best-pruned subtrees are nested and can be obtained recursively.
- Understand the method based on cross-validation for choosing the complexity parameter and the final subtree.
- Understand the purpose of model averaging.
- Understand the bagging procedure.
- Understand the random forest procedure.
- Understand the boosting approach.
12.1 Construct the Tree
Notation
We will denote the feature space by \(\mathbf{X}\). Normally \(\mathbf{X}\) is a multidimensional Euclidean space. However, sometimes some variables (measurements) may be categorical such as gender, (male or female). CART has the advantage of treating real variables and categorical variables in a unified manner. This is not so for many other classification methods, for instance, LDA.
The input vector \(X \in \mathbf{X}\) contains p features \(X_1, X_2, \cdots , X_p\).
Tree-structured classifiers are constructed by repeated splits of the space X into smaller and smaller subsets, beginning with X itself.
We will also need to introduce a few additional definitions:
node, terminal node (leaf node), parent node, child node.
One thing that we need to keep in mind is that the tree represents the recursive splitting of the space. Therefore, every node of interest corresponds to one region in the original space. Two child nodes will occupy two different regions and if we put the two together, we get the same region as that of the parent node. In the end, every leaf node is assigned with a class and a test point is assigned with the class of the leaf node it lands in.
Additional Notation:
- A node is denoted by t. We will also denote the left child node by \(t_L\) and the right one by \(t_R\) .
- Denote the collection of all the nodes in the tree by T and the collection of all the leaf nodes by \(\tilde{T}\).
- A split will be denoted by s. The set of splits is denoted by S.
Let’s take a look at how these splits can take place.
The whole space is represented by X.
The Three Elements
The construction of a tree involves the following three general elements:
- The selection of the splits, i.e., how do we decide which node (region) to split and how to split it?
- If we know how to make splits or ‘grow’ the tree, how do we decide when to declare a node terminal and stop splitting?
- We have to assign each terminal node to a class. How do we assign these class labels?
In particular, we need to decide upon the following:
- The pool of candidate splits that we might select from involves a set Q of binary questions of the form {Is \(\mathbf{x} \in A\)?}, \(A \subseteq \mathbf{X}\). Basically, we ask whether our input \(\mathbf{x}\) belongs to a certain region, A. We need to pick one A from the pool.
- The candidate split is evaluated using a goodness of split criterion \(\Phi(s, t)\) that can be evaluated for any split s of any node t.
- A stop-splitting rule, i.e., we have to know when it is appropriate to stop splitting. One can ‘grow’ the tree very big. In an extreme case, one could ‘grow’ the tree to the extent that in every leaf node there is only a single data point. Then it makes no sense to split any farther. In practice, we often don’t go that far.
- Finally, we need a rule for assigning every terminal node to a class.
Now, let’s get into the details for each of these four decisions that we have to make…
1) Standard Set of Questions for Suggesting Possible Splits
Let’s say that the input vector \(X = (X_1, X_2, \cdots , X_p)\) contains features of both categorical and ordered types. CART makes things simple because every split depends on the value of only a single variable.
If we have an ordered variable - for instance, \(X_j\) – the question inducing the split is whether \(X_j\) is smaller or equal to some threshold? Thus, for each ordered variable \(X_j\) , Q includes all questions of the form:
Is \(X_j \leq c\)?
for all real-valued c.
There are other ways to partition the space. For instance, you might ask whether \(X_1+ X_2\) is smaller than some threshold. In this case, the split line is not parallel to the coordinates. However, here we restrict our interest to the questions of the above format. Every question involves one of \(X_1, \cdots , X_p\), and a threshold.
Since the training data set is finite, there are only finitely many thresholds c that result in distinct divisions of the data points.
If \(X_j\) is categorical, say taking values from \(\{ 1,2 , \dots , M \}\), the questions Q, are of the form:
Is \(X_j \in A\)?
where A is any subset of \(\{ 1,2 , \dots , M \}\).
The splits or questions for all p variables form the pool of candidate splits.
This first step identifies all the candidate questions. Which one to use at any node when constructing the tree is the next question …
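As a concrete illustration of the candidate pool, the base-R sketch below enumerates the distinct splitting questions for one ordered variable and one categorical variable (the function names are ours, not part of CART).

```r
# Candidate thresholds for an ordered variable: only distinct data values matter,
# so midpoints between consecutive sorted values give all distinct splits.
ordered_split_candidates <- function(x) {
  v <- sort(unique(x))
  if (length(v) < 2) return(numeric(0))
  (v[-1] + v[-length(v)]) / 2
}

# Candidate subsets A for a categorical variable: every non-empty proper subset
# defines a question "Is x in A?" (A and its complement give the same split).
categorical_split_candidates <- function(levels) {
  subsets <- list()
  for (k in 1:(length(levels) - 1)) {
    subsets <- c(subsets, combn(levels, k, simplify = FALSE))
  }
  subsets
}

ordered_split_candidates(c(2.1, 3.5, 3.5, 7.0))   # 2.80 5.25
categorical_split_candidates(c("a", "b", "c"))    # {a}, {b}, {c}, {a,b}, {a,c}, {b,c}
```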
2) Determining Goodness of Split
The way we choose the question, i.e., split, is to measure every split by a ‘goodness of split’ measure, which depends on the split question as well as the node to split. The ‘goodness of split’ in turn is measured by an impurity function.
Intuitively, when we split the points, we want each region corresponding to a leaf node to be "pure", that is, most points in the region come from the same class; in other words, one class dominates.
Look at the following example. We have two classes shown in the plot by x’s and o’s. We could split first by checking whether the horizontal variable is above or below a threshold, shown below:
The split is indicated by the blue line. Remember by the nature of the candidate splits, the regions are always split by lines parallel to either coordinate. For the example split above, we might consider it a good split because the left-hand side is nearly pure in that most of the points belong to the x class. Only two points belong to the o class. The same is true of the right-hand side.
Generating pure nodes is the intuition for choosing a split at every node. If we go one level deeper down the tree, we have created two more splits, shown below:
Now you see that the upper left region, or leaf node, contains only the x class. Therefore, it is 100% pure; there is no class blending in this region. The same is true for the lower right region: it only has o's. Once we have reached this level, further splitting is unnecessary because all the leaf regions are 100% pure. Additional splits will not improve the class separation on the training data, although they might make a difference on unseen test data.
12.2 The Impurity Function
The impurity function measures the extent of purity for a region containing data points from possibly different classes. Suppose the number of classes is K. Then the impurity function is a function of \(p_1, \cdots , p_K\) , the probabilities for any data point in the region belonging to class 1, 2,…, K. During training, we do not know the real probabilities. What we would use is the percentage of points in class 1, class 2, class 3, and so on, according to the training data set.
The impurity function can be defined in different ways, but the bottom line is that it satisfies three properties.
Definition: An impurity function is a function \(\Phi\) defined on the set of all K-tuples of numbers \((p_1, \cdots , p_K)\) satisfying \(p_j \geq 0,\;\; j =1, \cdots , K, \Sigma_j p_j = 1\) with the properties:
- \(\Phi\) achieves its maximum only for the uniform distribution, that is, when all the \(p_j\) are equal.
- \(\Phi\) achieves minimum only at the points (1, 0, … , 0), (0, 1, 0, … , 0), …, (0, 0, … , 0, 1), i.e., when the probability of being in a certain class is 1 and 0 for all the other classes.
- \(\Phi\) is a symmetric function of \(p_1, \cdots , p_K\), i.e., if we permute \(p_j\) , \(\Phi\) remains constant.
Definition: Given an impurity function \(\Phi\), define the impurity measure, denoted as i(t), of a node t as follows: \[i(t)=\Phi(p(1|t), p(2|t), \ldots , p(K|t)) \] where p(j | t) is the estimated posterior probability of class j given a point is in node t. This is called the impurity function or the impurity measure for node t.
Once we have i(t), we define the goodness of split s for node t, denoted by \(\Phi\)(s, t):
\[\Phi(s, t)=\Delta i(s, t) = i(t) - p_Ri(t_R)-p_L i(t_L)\]
Δi(s, t) is the difference between the impurity measure for node t and the weighted sum of the impurity measures for the right child and the left child nodes. The weights, \(p_R\) and \(p_L\), are the proportions of the samples in node t that go to the right child node \(t_R\) and the left child node \(t_L\), respectively.
Look at the graphic again. Suppose the region to the left (shaded purple) is the node being split, the upper portion is the left child node the lower portion is the right child node. You can see that the proportion of points that are sent to the left child node is \(p_L = 8/10\). The proportion sent to the right child is \(p_R = 2/10\).
The classification tree algorithm goes through all the candidate splits to select the best one with maximum Δi(s, t).
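To make the goodness-of-split computation concrete, here is a base-R sketch (an illustration, not the CART implementation). The impurity function is passed in as an argument; the Gini index used in the example is one of the impurity functions defined formally in the next section, and the class labels are made up.

```r
# Delta i(s, t): impurity of the parent node minus the weighted impurities of
# the two child nodes. `impurity` is any function of a vector of class proportions.
goodness_of_split <- function(y_parent, go_left, impurity) {
  props <- function(y) as.vector(table(y)) / length(y)   # estimated p(j | t)
  y_L <- y_parent[go_left]
  y_R <- y_parent[!go_left]
  p_L <- length(y_L) / length(y_parent)
  p_R <- 1 - p_L
  impurity(props(y_parent)) - p_L * impurity(props(y_L)) - p_R * impurity(props(y_R))
}

# Hypothetical node with 10 points, 8 of which go to the left child.
gini <- function(p) 1 - sum(p^2)
y <- factor(c("x", "x", "x", "x", "x", "x", "o", "o", "x", "o"))
go_left <- c(rep(TRUE, 8), FALSE, FALSE)
goodness_of_split(y, go_left, gini)   # decrease in impurity for this split
```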
Next, we will define I(t) = i(t) p(t), that is, the impurity function of node t weighted by the estimated proportion of data that go to node t. p(t) is the probability of data falling into the region corresponding to node t. A simple way to estimate it is to count the number of points that are in node t and divide it by the total number of points in the whole data set.
The aggregated impurity measure of tree T, I(T) is defined by
\[I(T)=\sum_{t\in\tilde{T}}I(t)=\sum_{t\in\tilde{T}}i(t)p(t)\]
This is a sum over all of the leaf nodes. (Remember, not all of the nodes in the tree, just the leaf nodes: \(\tilde{T}\).)
Note for any node t the following equations hold:
\[\begin{align}&p(t_L) + p(t_R)=p(t) \\ &p_L=p(t_L)/p(t) \\ &p_R=p(t_R)/p(t) \\ &p_L+p_R=1 \\ \end {align}\]
The regions covered by the left child node, \(t_L\), and the right child node, \(t_R\), are disjoint and, when combined, form the region of their parent node t. The sum of the probabilities over two disjoint sets is equal to the probability of their union. \(p_L\) then becomes the relative proportion of the left child node with respect to the parent node.
Next, we define the difference between the weighted impurity measure of the parent node and the two child nodes.
\[ \begin {align}\Delta I(s,t)& = I(t)-I(t_L)-I(t_R)\\ & = p(t)i(t) -p(t_L)i(t_L)-p(t_R)i(t_R)\\ & = p(t)i(t) -p_L i(t_L)-p_Ri(t_R) \\ & = p(t)\Delta i(s,t) \\ \end {align} \]
Finally getting to this mystery of the impurity function…
It should be understood that no matter what impurity function we use, the way you use it in a classification tree is the same. The only difference is what specific impurity function to plug in. Once you use this, what follows is the same.
Possible impurity functions:
- Entropy function: \(\sum_{j=1}^{K}p_j \text{ log }\frac{1}{p_j}\). If \(p_j = 0\), use the limit \(\lim_{p_j \rightarrow 0} p_j \text{ log }p_j=0\).
- Misclassification rate: \(1 - \underset{j}{\text{ max }} p_j\).
- Gini index: \(\sum_{j=1}^{K}p_j (1-p_j)=1-\sum_{j=1}^{K}p_{j}^{2}\).
Remember, an impurity function has to 1) achieve its maximum at the uniform distribution, 2) achieve its minimum when one \(p_j = 1\) and all the others are 0, and 3) be symmetric with respect to permutations of \(p_1, \cdots , p_K\).
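A short base-R sketch of these three impurity functions, evaluated at a uniform, a skewed, and a pure class distribution to illustrate the maximum and minimum properties (the function names are ours).

```r
# Each impurity function takes a vector p of class proportions.
entropy_impurity  <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }  # limit handles p_j = 0
misclass_impurity <- function(p) 1 - max(p)
gini_impurity     <- function(p) 1 - sum(p^2)

dists <- list(uniform = c(1/3, 1/3, 1/3),  # maximum impurity for K = 3
              skewed  = c(0.7, 0.2, 0.1),
              pure    = c(1, 0, 0))        # minimum impurity: one class only

sapply(dists, function(p) c(entropy  = entropy_impurity(p),
                            misclass = misclass_impurity(p),
                            gini     = gini_impurity(p)))
```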
Another Approach: The Twoing Rule
Another splitting method is the Twoing Rule. This approach does not have anything to do with the impurity function.
The intuition here is that the class distributions in the two child nodes should be as different as possible and the proportion of data falling into either of the child nodes should be balanced.
The twoing rule: At node t, choose the split s that maximizes:
\[\dfrac{p_L p_R}{4}\left[\sum_{j}|p(j|t_L)-p(j|t_R)|\right]^2\]
When we break one node to two child nodes, we want the posterior probabilities of the classes to be as different as possible. If they differ a lot, each tends to be pure. If instead, the proportions of classes in the two child nodes are roughly the same as the parent node, this indicates the splitting does not make the two child nodes much purer than the parent node and hence not a successful split.
In summary, one can use either the goodness of split defined using the impurity function or the twoing rule. At each node, try all possible splits exhaustively and select the best from them.
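A minimal sketch of the twoing criterion in base R, assuming the class posteriors in the two child nodes are estimated by within-node class proportions as in the text.

```r
# Twoing criterion for splitting node t into t_L (go_left == TRUE) and t_R.
twoing <- function(y_parent, go_left) {
  y   <- factor(y_parent)                 # common set of class levels for both children
  y_L <- y[go_left]
  y_R <- y[!go_left]
  p_L <- length(y_L) / length(y)
  p_R <- 1 - p_L
  pjL <- table(y_L) / length(y_L)         # p(j | t_L)
  pjR <- table(y_R) / length(y_R)         # p(j | t_R)
  (p_L * p_R / 4) * sum(abs(pjL - pjR))^2
}

# A balanced split whose children have completely different class distributions
# achieves the maximum value of 0.25:
y <- c(rep("a", 5), rep("b", 5))
twoing(y, go_left = c(rep(TRUE, 5), rep(FALSE, 5)))   # 0.25
```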
12.3 Estimate the Posterior Probabilities of Classes in Each Node
The impurity function is a function of the posterior probabilities of k classes. In this section, we answer the question, “How do we estimate these probabilities?”
Let’s begin by introducing the notation N, the total number of samples. The number of samples in class j, \(1 \leq j \leq K\), is \(N_j\) . If we add up all the \(N_j\) data points, we get the total number of data points N.
We also denote the number of samples going to node t by \(N(t)\), and, the number of samples of class j going to node t by \(N_j(t)\).
Then for every node t, if we add up over different classes we should get the total number of points back:
\[\sum_{j=1}^{K} N_j(t) =N(t)\] And, if we add the points going to the left and the points going the right child node, we should also get the number of points in the parent node. \[N_j (t_L) + N_j(t_R) = N_j(t)\] For a full tree (balanced), the sum of \(N(t)\) over all the node t’s at the same level is N.
Next, we will denote the prior probability of class j by \(\pi_j\). The prior probabilities very often are estimated from the data by calculating the proportion of data in every class. For instance, to get the prior probability for class 1, we simply compute the ratio of the number of points in class 1 to the total number of points, \(N_1 / N\). These are the so-called empirical frequencies for the classes.
This is one way of getting priors. Sometimes the priors may be pre-given. For instance, in medical studies, researchers collect a large amount of data from patients who have a disease. The percentage of cases with the disease in the collected data may be much higher than that in the population. In this case, it is inappropriate to use the empirical frequencies based on the data. If the data is a random sample from the population, then it may be reasonable to use empirical frequency.
The estimated probability of a sample in class j going to node t is \(p(t | j) = N_j (t) / N_j\) . Obviously, \[p(t_L | j) + p(t_R | j) = p(t | j)\] Next, we can assume that we know how to compute \(p(t | j)\) and then we will find the joint probability of a sample point in class j and in node t.
The joint probability of a sample being in class j and going to node t is as follows: \[p(j, t) = \pi_j p(t | j) = \pi_j N_j (t) / N_j \] Because the prior probability is assumed known (or calculated) and \(p(t | j)\) is computed, the joint probability can be computed. The probability of any sample going to node t regardless of its class is: \[p(t)=\sum_{j=1}^{K}p(j,t)=\sum_{j=1}^{K}\pi_jN_j(t)/N_j\] Note: \(p(t_L) + p(t_R) = p(t)\).
Now, what we really need is \(p(j | t)\). That is if I know a point goes to node t, what is the probability this point is in class j.
(Be careful because we have flipped the condition and the event to compute the probability for!)
The probability of a sample being in class j given that it goes to node t is:
\[p( j | t ) = p(j, t) / p(t)\]
Probabilities on the right-hand side are both solved from the previous formulas.
For any t, \(\sum_{j=1}^{K}p(j|t)=1\).
There is a shortcut if the prior is not pre-given, but estimated by the empirical frequency of class j in the dataset!
When \(\pi_j = N_j / N\) , the simplification is as follows:
- Calculate \(p( j | t ) = N_j (t) / N (t)\), the proportion of class j among the points that land in node t.
- \(p(t) = N (t) / N\)
- \(p(j, t) = N_j (t) / N\)
This is the shortcut equivalent to the previous approach.
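The shortcut estimates can be computed directly from counts; here is a small sketch in base R (names are ours), assuming empirical class frequencies are used as the priors.

```r
# y_node: class labels of training points that land in node t
# y_all : class labels of the whole training set
node_estimates <- function(y_node, y_all) {
  y_node <- factor(y_node, levels = levels(factor(y_all)))  # keep all K classes
  list(
    p_j_given_t = table(y_node) / length(y_node),  # p(j | t) = N_j(t) / N(t)
    p_t         = length(y_node) / length(y_all),  # p(t)     = N(t)   / N
    p_j_and_t   = table(y_node) / length(y_all)    # p(j, t)  = N_j(t) / N
  )
}

y_all <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c")
node_estimates(y_all[c(1, 2, 4, 6)], y_all)   # 4 of the 10 points land in node t
```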
3) Determining Stopping Criteria
When we grow a tree, there are two basic types of calculations needed. First, for every node, we compute the posterior probabilities for the classes, that is, \(p( j | t )\) for all j and t. Then we have to go through all the possible splits and exhaustively search for the one with the maximum goodness. Suppose we have identified 100 candidate splits (i.e., splitting questions), to split each node, 100 class posterior distributions for the left and right child nodes each are computed, and 100 goodness measures are calculated. In the end, one split is chosen and only for this chosen split, the class posterior probabilities in the right and left child nodes are stored.
For the moment let’s assume we will leave off our discussion of pruning for later and that we will grow the tree until some sort of stopping criteria is met.
A simple criterion is as follows. We will stop splitting a node t when: \[\underset{s \in S}{\text{ max }} \Delta I(s,t) <\beta\] where \(\Delta I\) (defined before) is the decrease in the impurity measure weighted by the percentage of points going to node t, s is the optimal split and \(\beta\) is a pre-determined threshold.
We must note, however, the above stopping criterion for deciding the size of the tree is not a satisfactory strategy. The reason is that the tree growing method is greedy. The split at every node is ‘nearsighted’. We can only look one step forward. A bad split in one step may lead to very good splits in the future. The tree growing method does not consider such cases.
This process of growing the tree is greedy because it looks only one step ahead as it goes. Going one step forward and making a bad decision doesn’t mean that it is always going to end up bad. If you go a few steps more you might actually gain something. You might even perfectly separate the classes. A response to this might be: what if we looked two steps ahead? How about three steps? You can do this, but it raises the question, “How many steps forward should we look?”
No matter how many steps we look forward, this process will always be greedy. Looking ahead multiple steps will not fundamentally solve this problem. This is why we need pruning.
4) Determining Class Assignment Rules
Finally, how do we decide which class to assign to each leaf node?
The decision tree classifies new data points as follows. We let a data point pass down the tree and see which leaf node it lands in. The class of the leaf node is assigned to the new data point. Basically, all the points that land in the same leaf node will be given the same class. This is similar to k-means or any prototype method.
A class assignment rule assigns a class \(j \in \{1, \cdots , K\}\) to every terminal (leaf) node \(t \in \tilde{T}\). The class assigned to node \(t \in \tilde{T}\) is denoted by \(\kappa(t)\), e.g., if \(\kappa(t)= 2\), all the points in node t would be assigned to class 2.
If we use 0-1 loss, the class assignment rule is very similar to k-means (where we pick the majority class or the class with the maximum posterior probability): \[\kappa(t)= \text{ arg }\underset{j}{\text{ max }}p(j|t)\] Let’s assume for a moment that I have a tree and have the classes assigned for the leaf nodes. Now, I want to estimate the classification error rate for this tree. In this case, we need to introduce the resubstitution estimate \(r(t)\) for the probability of misclassification, given that a case falls into node t. This is: \[r(t)=1-\underset{j}{\text{ max }}p(j|t)=1-p(\kappa(t)|t)\] Denote \(R(t) = r(t) p(t)\). The resubstitution estimation for the overall misclassification rate \(R(T)\) of the tree classifier T is: \[R(T)=\sum_{t \in \tilde{T}}R(t)\] One thing that we should spend some time proving is that if we split a node t into child nodes, the misclassification rate is ensured to improve. In other words, if we estimate the error rate using the resubstitution estimate, the more splits, the better. This also indicates an issue with estimating the error rate using the re-substitution error rate because it is always biased towards a bigger tree.
Let’s go through the proof.
Proposition: For any split of a node t into \(t_L\) and \(t_R\), \[R(t) \geq R(t_L) + R(t_R)\]
Proof: Denote \(j^* = \kappa(t)\).
Let’s take a look at the probability of being in class \(j^*\) given that you are in node t. \[ \begin{align}p(j^* |t)& = p(j^*,t_L |t) + p(j^*,t_R |t) \\ & = p(j^*|t_L) p(t_L|t)+p(j^*|t_R) p(t_R|t) \\ & = p_Lp(j^*|t_L)+p_Rp(j^*|t_R) \\ & \le p_L\underset{j}{\text{ max }}p(j|t_L)+p_R\underset{j}{\text{ max }}p(j|t_R) \\ \end{align} \] Hence, \[ \begin{align}r(t)& = 1-p(j^*|t) \\ & \ge 1-\left( p_L\underset{j}{\text{ max }}p(j|t_L)+p_R\underset{j}{\text{ max }}p(j|t_R) \right) \\ & = p_L(1-\underset{j}{\text{ max }}p(j|t_L))+p_R(1-\underset{j}{\text{ max }}p(j|t_R)) \\ & = p_Lr(t_L)+p_Rr(t_R) \\ \end{align} \] Finally, \[ \begin{align}R(t)& = p(t)r(t) \\ & \ge p(t)p_Lr(t_L)+p(t)p_Rr(t_R) \\ & = p(t_L)r(t_L)+p(t_R)r(t_R) \\ & = R(t_L)+R(t_R) \\ \end{align} \]
12.4 Example: Digit Recognition
Here we have 10 digits, 0 through 9, and as you might see on a calculator, they are displayed by different on-off combinations of seven horizontal and vertical light bars.
In this case, each digit is represented by a 7-dimensional vector of zeros and ones. The \(i^{th}\) sample is \(x_i = (x_{i1} , x_{i2}, \cdots , x_{i7})\). If \(x_{ij} = 1\), the \(j^{th}\) light is on; if \(x_{ij} = 0\), the \(j^{th}\) light is off.
If the calculator works properly, the following table shows which light bars should be turned on or off for each of the digits.
Digit | \(x_1\) | \(x_2\) | \(x_3\) | \(x_4\) | \(x_5\) | \(x_6\) | \(x_7\) |
---|---|---|---|---|---|---|---|
1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
3 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
4 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
5 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
6 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
7 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
9 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Let’s suppose that the calculator is malfunctioning. Each of the seven lights has probability 0.1 of being in the wrong state independently. In the training data set 200 samples are generated according to the specified distribution.
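A sketch of how such a training set could be simulated in R, assuming equal priors over the ten digits (an illustration; this is not the original dataset).

```r
# Ideal 7-segment patterns, one row per digit in the order 1, 2, ..., 9, 0
# (the table above), corrupted by flipping each bit independently with probability 0.1.
ideal <- matrix(c(
  0,0,1,0,0,1,0,   # 1
  1,0,1,1,1,0,1,   # 2
  1,0,1,1,0,1,1,   # 3
  0,1,1,1,0,1,0,   # 4
  1,1,0,1,0,1,1,   # 5
  1,1,0,1,1,1,1,   # 6
  1,0,1,0,0,1,0,   # 7
  1,1,1,1,1,1,1,   # 8
  1,1,1,1,0,1,1,   # 9
  1,1,1,0,1,1,1    # 0
), nrow = 10, byrow = TRUE)
digit <- c(1:9, 0)

set.seed(1)
n     <- 200
row   <- sample(1:10, n, replace = TRUE)              # assumed equal priors over digits
flip  <- matrix(rbinom(n * 7, size = 1, prob = 0.1), nrow = n)
x     <- abs(ideal[row, ] - flip)                     # flip = 1 puts a segment in the wrong state
train <- data.frame(y = factor(digit[row]), x)        # columns y, x1, ..., x7
```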
A classification tree is applied to this dataset. For this data set, each of the seven variables takes only two possible values: 0 and 1. Therefore, for every variable, the only possible candidate question is whether the value is on (1) or off (0). Consequently, we only have seven questions in the candidate pool: Is \(x_{. j}= 0?, j = 1, 2, \cdots , 7\).
In this example, the twoing rule is used in splitting instead of the goodness of split based on an impurity function. Also, the result presented was obtained using pruning and cross-validation.
Classification Performance
Results:
- The error rate estimated by using an independent test dataset of size 5000 is 0.30.
- The error rate estimated by cross-validation using the training dataset which only contains 200 data points is also 0.30. In this case, the cross-validation did a very good job for estimating the error rate.
- The resubstitution estimate of the error rate is 0.29. This is slightly more optimistic than the true error rate.
Here pruning and cross-validation effectively help avoid overfitting. If we don’t prune and grow the tree too big, we might get a very small resubstitution error rate which is substantially smaller than the error rate based on the test data set.
- The Bayes error rate is 0.26. Remember, we know the exact distribution for generating the simulated data. Therefore the Bayes rule using the true model is known.
- There is little room for improvement over the tree classifier.
The tree obtained is shown below:
We can see that the first question checks the on/off status of the fifth light: we ask whether \(x_5 = 0\). If the answer is yes (the light is off), go to the left branch; if the answer is no, take the right branch. For the left branch, we next ask whether the fourth light, \(x_4\), is off. If it is, we then check the first light, \(x_1\): if \(x_1\) is off, the point is classified as digit 1; if \(x_1\) is on, it is classified as digit 7. The square nodes are leaf nodes, and the number in the square is the class label, or digit, assigned to that leaf node.
In general, one class may occupy several leaf nodes and occasionally no leaf node.
Interestingly, in this example, every digit (or every class) occupies exactly one leaf node. There are exactly 10 leaf nodes. But this is just a special case.
Another interesting aspect about the tree in this example is that \(x_6\) and \(x_7\) are never used. This shows that classification trees sometimes achieve dimension reduction as a by-product.
Example: Waveforms
Let’s take a look at another example concerning waveforms. Let’s first define three functions \(h_1(\tau), h_2(\tau), h_3(\tau)\) which are shifted versions of each other, as shown in the figure below:
We specify each \(h_j\) by its values at the integers \(\tau = 1, 2, \ldots, 21\). Therefore, every waveform is characterized by a 21-dimensional vector.
Next, we will create in the following manner three classes of waveforms as random convex combinations of two of these waveforms plus independent Gaussian noise. Each sample is a 21-dimensional vector containing the values of the random waveforms measured at \(\tau = 1, 2, \cdots , 21\).
To generate a sample in class 1, first we generate a random number \(\nu\) uniformly distributed in [0, 1], and then we generate 21 independent random numbers \(\epsilon_1, \epsilon_2, \cdots , \epsilon_{21}\), normally distributed with mean 0 and variance 1 (Gaussian noise). Now, we can create a random waveform as a random combination of \(h_1(j)\) and \(h_2(j)\) with weights given by the random number picked from the interval [0, 1]: \[x_{\cdot j} = \nu h_1(j) + (1 - \nu) h_2(j) + \epsilon_j, \quad j = 1, \cdots , 21\] This is a convex combination, where the weights are nonnegative and add up to one, with Gaussian noise, \(\epsilon_j\), added on top.
Similarly, to generate a sample in class 2, we repeat the above process to generate a random number \(\nu\) and 21 random numbers \(\epsilon_1, \cdots , \epsilon_{21}\) and set \[x_{\cdot j} = \nu h_1(j) + (1 - \nu) h_3(j) + \epsilon_j, \quad j = 1, \cdots , 21\] This is a convex combination of \(h_1(j)\) and \(h_3(j)\) plus noise.
Then, the class 3 vectors are generated by a convex combination of \(h_2(j)\) and \(h_3(j)\) plus noise: \[x_{\cdot j} = \nu h_2(j) + (1 - \nu) h_3(j) + \epsilon_j, \quad j = 1, \cdots , 21\] Below are sample random waveforms generated according to the above description.
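A simulation sketch in R. The text only shows the bump functions in a figure, so the exact formulas below (a triangle centered at \(\tau = 11\) and its shifts by \(\pm 4\), as in the original CART waveform example) are an assumption.

```r
# Assumed shifted bump functions (not given explicitly in the text).
h1 <- function(tau) pmax(6 - abs(tau - 11), 0)
h2 <- function(tau) h1(tau - 4)
h3 <- function(tau) h1(tau + 4)

set.seed(2)
n   <- 300
cls <- sample(1:3, n, replace = TRUE)   # priors (1/3, 1/3, 1/3)
tau <- 1:21

gen_one <- function(class) {
  pair <- switch(class, list(h1, h2), list(h1, h3), list(h2, h3))  # which two bumps to mix
  v <- runif(1)                                                    # random convex weight
  v * pair[[1]](tau) + (1 - v) * pair[[2]](tau) + rnorm(21)        # plus N(0, 1) noise
}

X    <- t(sapply(cls, gen_one))            # 300 x 21 matrix of waveform samples
wave <- data.frame(y = factor(cls), X)     # columns y, X1, ..., X21
```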
Let’s see how a classification tree performs.
Here we have generated 300 random samples using prior probabilities (1/3, 1/3, 1/3) for training.
The set of splitting questions consists of whether \(x_{\cdot j}\) is smaller than or equal to a threshold c:
{Is \(x_{·j} \leq c?\)} for c ranging over all real numbers and j = 1, … , 21
Next, we use the Gini index as the impurity function and compute the goodness of split correspondingly.
Then the final tree is selected by pruning and cross-validation.
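As an illustration (the textbook results were produced with the original CART software), a tree with the Gini index and cost-complexity pruning chosen by cross-validation can be fit with the rpart package, assuming the `wave` data frame simulated in the sketch above.

```r
library(rpart)

# Grow a large tree on the simulated waveform data using the Gini index,
# with 10-fold cross-validation computed internally for each cp value.
fit <- rpart(y ~ ., data = wave, method = "class",
             parms = list(split = "gini"),
             control = rpart.control(cp = 0, xval = 10, minsplit = 5))

printcp(fit)   # table of cp values, tree sizes, and cross-validated error (xerror)

# Prune back to the subtree whose complexity parameter minimizes the CV error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```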
Results:
- The cross-validation estimate of misclassification rate is 0.29.
- The misclassification rate on a separate test dataset of size 5000 is 0.28.
- This is pretty close to the cross-validation estimate!
- The Bayes classification rule can be derived because we know the underlying distribution of the three classes. Applying this rule to the test set yields a misclassification rate of 0.14.
We can see that in this example, the classification tree performs much worse than the theoretical optimal classifier.
Here is the tree. Again, the corresponding question used for every split is placed below the node. Three numbers are put in every node, which indicates the number of points in every class for that node. For instance, in the root node at the top, there are 100 points in class 1, 85 points in class 2, and 115 in class 3. Although the prior probabilities used were all one third, because random sampling is used, there is no guarantee that in the real data set the numbers of points for the three classes are identical.
If we take a look at the internal node to the left of the root node, we see that there are 36 points in class 1, 17 points in class 2 and 109 points in class 3, the dominant class. If we look at the leaf nodes represented by the rectangles, for instance, the leaf node on the far left, it has seven points in class 1, 0 points in class 2 and 20 points in class 3. According to the class assignment rule, we would choose a class that dominates this leaf node, 3 in this case. Therefore, this leaf node is assigned to class 3, shown by the number below the rectangle. In the leaf node to its right, class 1 with 20 data points is most dominant and hence assigned to this leaf node.
We also see numbers on the right of the rectangles representing leaf nodes. These numbers indicate how many test data points in each class land in the corresponding leaf node. For the ease of comparison with the numbers inside the rectangles, which are based on the training data, the numbers based on test data are scaled to have the same sum as that on training.
Also, observe that although we have 21 dimensions, many of these are not used by the classification tree. The tree is relatively small.
12.5 Advantages of the Tree-Structured Approach
As we have mentioned many times, the tree-structured approach handles both categorical and ordered variables in a simple and natural way. Classification trees sometimes do an automatic stepwise variable selection and complexity reduction. They provide an estimate of the misclassification rate for a test point. For every data point, we know which leaf node it lands in and we have an estimation for the posterior probabilities of classes for every leaf node. The misclassification rate can be estimated using the estimated class posterior.
Classification trees are invariant under all monotone transformations of individual ordered variables. The reason is that classification trees split nodes by thresholding. Monotone transformations cannot change the possible ways of dividing data points by thresholding. Classification trees are also relatively robust to outliers and misclassified points in the training set. They do not calculate an average or anything else from the data points themselves. Classification trees are easy to interpret, which is appealing especially in medical applications.
12.6 Variable Combinations
So far, we have assumed that the classification tree only partitions the space by hyperplanes parallel to the coordinate planes. In the two-dimensional case, we only divide the space either by horizontal or vertical lines. How much do we suffer by such restrictive partitions?
Let’s take a look at this example…
In the example below, we might want to make a split using the dotted diagonal line which separates the two classes well. Splits parallel to the coordinate axes seem inefficient for this data set. Many steps of splits are needed to approximate the result generated by one split using a sloped line.
There are classification tree extensions which, instead of thresholding individual variables, perform LDA for every node.
Or we could use more complicated questions, for instance, questions that use linear combinations of variables: \[\sum a_j x_{\cdot j} \le c?\] This would increase the amount of computation significantly. Research seems to suggest that using more flexible questions often does not lead to noticeably better classification results and can even hurt. Overfitting is more likely to occur with more flexible splitting questions. It seems that using the right-sized tree is more important than performing good splits at individual nodes.
12.7 Missing Values
We may have missing values for some variables in some training sample points. For instance, gene-expression microarray data often have missing gene measurements.
Suppose each variable has a 5% chance of being missing independently. Then for a training data point with 50 variables, the probability of having at least one missing value is as high as \(1 - 0.95^{50} \approx 92.3\%\)! In other words, the vast majority of data points will have at least one missing value. Therefore, we cannot simply throw away data points whenever missing values occur.
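The 92.3% figure is just the complement of every variable being observed; a one-line check in R:

```r
1 - 0.95^50   # probability that at least one of 50 variables is missing: about 0.923
```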
A test point to be classified may also have missing variables.
Classification trees have a nice way of handling missing values by surrogate splits.
Suppose the best split for node t is s, which involves a question on \(X_m\). Then think about what to do if this variable is not available. Classification trees tackle the issue by finding a replacement split: they look at all the splits using all the other variables and search for the one yielding a division of the training data points most similar to the optimal split. Along the same line of thought, the second-best surrogate split can be found in case both the best variable and its top surrogate variable are missing, and so forth.
One thing to notice is that to find the surrogate split, classification trees do not try to find the second-best split in terms of goodness measure. Instead, they try to approximate the result of the best split. Here, the goal is to divide data as similarly as possible to the best split so that it is meaningful to carry out the future decisions down the tree, which descend from the best split. There is no guarantee the second best split divides data similarly as the best split although their goodness measurements are close.
12.8 Right Sized Tree via Pruning
In the previous section, we talked about growing trees. In this section, we will discuss pruning trees.
Let the expected misclassification rate of a tree \(T\) be \(R^*(T)\).
Recall we used the resubstitution estimate for \(R^*(T)\). This is: \[R(T)=\sum_{t \in \tilde{T}}r(t)p(t)=\sum_{t \in \tilde{T}}R(t) \] Remember also that r(t) is the probability of making a wrong classification for points in node t. For a point in a given leaf node t, the estimated probability of misclassification is 1 minus the probability of the majority class in node t based on the training data.
To get the probability of misclassification for the whole tree, a weighted sum of the within leaf node error rate is computed according to the total probability formula.
We also mentioned in the last section the resubstitution error rate \(R(T)\) is biased downward. Specifically, we proved the weighted misclassification rate for the parent node is guaranteed to be greater or equal to the sum of the weighted misclassification rates of the left and right child nodes, that is: \[R(t) \geq R(t_L) + R(t_R)\] This means that if we simply minimize the resubstitution error rate, we would always prefer a bigger tree. There is no defense against overfitting.
Let’s look at the digit recognition example that we discussed before.
The biggest tree grown using the training data is of size 71. In other words, it has 71 leaf nodes. The tree is grown until all of the points in every leaf node are from the same class.
No. Terminal Nodes | \(R(T)\) | \(R^{ts}(T)\) |
---|---|---|
71 | 0.00 | 0.42 |
63 | 0.00 | 0.40 |
58 | 0.03 | 0.39 |
40 | 0.10 | 0.32 |
34 | 0.12 | 0.32 |
19 | 0.29 | 0.31 |
10 | 0.29 | 0.30 |
9 | 0.32 | 0.34 |
7 | 0.41 | 0.47 |
6 | 0.46 | 0.54 |
5 | 0.53 | 0.61 |
2 | 0.75 | 0.82 |
1 | 0.86 | 0.91 |
Then a pruning procedure is applied (the details of this process we will get to later). We can see from the above table that the tree is gradually pruned. The tree next to the full tree has 63 leaf nodes, which is followed by a tree with 58 leaf nodes, and so forth, until only one leaf node is left. This minimum tree contains only a single node, the root.
The resubstitution error rate \(R(T)\) becomes monotonically larger when the tree shrinks.
The error rate \(R^{ts}\) based on a separate test data shows a different pattern. It decreases first when the tree becomes larger, hits minimum at the tree with 10 terminal nodes, and begins to increase when the tree further grows.
Comparing with \(R(T)\), \(R^{ts}(T)\) better reflects the real performance of the tree. The minimum \(R^{ts}(T)\) is achieved not by the biggest tree, but by a tree that better balances the resubstitution error rate and the tree size.
12.8.1 Preliminaries for Pruning
First, we would grow the tree to a large size. Denote this maximum size by \(T_{max}\). Stopping criterion is not important here because as long as the tree is fairly big, it doesn’t really matter when to stop. The overgrown tree will be pruned back eventually. There are a few ways of deciding when to stop:
- Keep going until all terminal nodes are pure (contain only one class).
- Keep going until the number of data in each terminal node is no greater than a certain threshold, say 5, or even 1.
- As long as the tree is sufficiently large, the size of the initial tree is not critical.
The key here is to make the initial tree sufficiently big before pruning back.
Notation
Now we need to introduce a notation… Let’s take a look at the following definitions:
Descendant: a node \(t^{'}\) is a descendant of node t if there is a connected path down the tree leading from t to \(t^{'}\).
Ancestor: t is an ancestor of \(t^{'}\) if \(t^{'}\) is its descendant.
A branch \(T_t\) of T with root node \(t \in T\) consists of the node t and all descendants of t in T .
Pruning a branch \(T_t\) from a tree T consists of deleting from T all descendants of t , that is, cutting off all of \(T_t\) except its root node. The tree pruned this way will be denoted by \(T - T_t\) .
If \(T^{'}\) is obtained from T by successively pruning off branches, then \(T^{'}\) is called a pruned subtree of T and denoted by \(T^{'} < T\).
Optimal Subtrees
Even for a moderately sized \(T_{max}\), there is an enormously large number of subtrees, and an even larger number of ways to prune the initial tree down to any one of them. We therefore cannot exhaustively go through all the subtrees to find the best one in some sense. Moreover, we typically do not have a separate test dataset to serve as a basis for selection.
A smarter method is necessary. A feasible method of pruning should ensure the following:
- The subtree is optimal in a certain sense, and
- The search of the optimal subtree should be computationally tractable.
12.8.2 Minimal Cost-Complexity Pruning
As we just discussed, \(R(T)\), is not a good measure for selecting a subtree because it always favors bigger trees. We need to add a complexity penalty to this resubstitution error rate. The penalty term favors smaller trees, and hence balances with \(R(T)\).
The definition for the cost-complexity measure:
For any subtree \(T < T_{max}\), we will define its complexity as \(|\tilde{T}|\), the number of terminal or leaf nodes in T. Let \(\alpha \geq 0\) be a real number called the complexity parameter and define the cost-complexity measure \(R_{\alpha}(T)\) as: \[R_{\alpha}(T)=R(T) +\alpha| \tilde{T}| \] The more leaf nodes the tree contains, the higher the complexity of the tree, because we have more flexibility in partitioning the space into smaller pieces and therefore more possibilities for fitting the training data. There is also the issue of how much importance to put on the size of the tree; the complexity parameter \(\alpha\) adjusts that.
In the end, the cost complexity measure comes as a penalized version of the resubstitution error rate. This is the function to be minimized when pruning the tree.
Which subtree is selected eventually depends on \(\alpha\) . If \(\alpha = 0\) then the biggest tree will be chosen because the complexity penalty term is essentially dropped. As \(\alpha\) approaches infinity, the tree of size 1, i.e., a single root node, will be selected.
In general, given a pre-selected \(\alpha\) , find the subtree \(T(\alpha)\) that minimizes \(R_{\alpha}(T)\), i.e., \[R_{\alpha}(T(\alpha))= \underset{T \preceq T_{max}}{min}R_{\alpha}(T)\] The minimizing subtree for any \(\alpha\) always exists since there are only finitely many subtrees.
Since there are at most a finite number of subtrees of \(T_{max}\), \(R_{\alpha}(T(\alpha))\) yields different values for only finitely many \(\alpha\)’s. \(T(\alpha)\) continues to be the minimizing tree when \(\alpha\) increases, until a jump point is reached.
Two questions:
- Is there a unique subtree \(T < T_{max}\) which minimizes \(R_{\alpha}(T)\)?
- In the minimizing sequence of trees \(T_1, T_2, \cdots\) is each subtree obtained by pruning upward from the previous subtree, i.e., does the nesting \(T_1 > T_2 > \cdots > {t_1}\) hold?
If the optimal subtrees are nested, the computation will be a lot easier. We can first find \(T_1\), and then to find \(T_2\) , we don’t need to start again from the maximum tree, but from \(T_1\), (because \(T_2\) is guaranteed to be a subtree of \(T_1\)). In this way when α increases, we prune based on a smaller and smaller subtree.
Definition: The smallest minimizing subtree \(T_{\alpha}\) for complexity parameter α is defined by the conditions:
- \(R_{\alpha}(T(\alpha)) = \min_{T \preceq T_{max}} R_{\alpha}(T)\)
- If \(R_{\alpha}(T) = R_{\alpha}(T(\alpha))\), then \(T(\alpha) \preceq T\). That is, if another tree achieves the minimum at the same \(\alpha\), it must be at least as big as the smallest minimizing subtree \(T(\alpha)\); in other words, \(T(\alpha)\) is a pruned subtree of it.
By definition, (according to the second requirement above), if the smallest minimizing subtree \(T(\alpha)\) exists, it must be unique. Earlier we argued that a minimizing subtree always exists because there are only a finite number of subtrees. Here we go one step more. We can prove that the smallest minimizing subtree always exists. This is not trivial to show because one tree smaller than another means the former is embedded in the latter. Tree ordering is a partial ordering.
The starting point for the pruning is not \(T_{max}\) , but rather \(T_1 = T(0)\), which is the smallest subtree of \(T_{max}\) satisfying: \[R(T_1) = R(T_{max})\] The way that you get \(T_1\) is as follows:
First, look at the biggest tree, \(T_{max}\) , and for any two terminal nodes descended from the same parent, for instance \(t_L\) and \(t_R\) , if they yield the same re-substitution error rate as the parent node t, prune off these two terminal nodes, that is, if \(R(t) = R(t_L) + R(t_R)\), prune off \(t_L\) and \(t_R\).
This process is applied recursively. After we have pruned one pair of terminal nodes, the tree shrinks a little bit. Then based on the smaller tree, we do the same thing until we cannot find any pair of terminal nodes satisfying this equality. The resulting tree at this point is \(T_1\).
We will use \(T_t\) to denote a branch rooted at t. Then, for \(T_t\), we define \(R(T_t)\), (the resubstitution error rate for this branch ) by: \[R(T_t)=\sum_{t' \in \tilde{T}_t}R(t')\] where \(\tilde{T}_t\) is the set of terminal nodes of \(T_t\).
What we can prove is that, if t is any non-terminal (internal) node of \(T_1\), then its resubstitution error rate is guaranteed to be strictly larger than that of its branch, that is, \(R(t) > R(T_t)\). In other words, if we prune off the branch at t, the resubstitution error rate will strictly increase.
Weakest-Link Cutting
The weakest-link cutting method not only finds the next \(\alpha\) that results in a different optimal subtree, but also finds that optimal subtree.
Remember, we previously defined \(R_\alpha\) for the entire tree. Here, we extend the definition to a node and then for a single branch coming out of a node.
For any node \(t \in T_1\), we can set \(R_\alpha({t}) = R(t) + \alpha\) .
Also, for any branch \(T_t\), we can define \(R_\alpha(T_t) = R(T_t) + \alpha |\tilde{T}_t|\).
We know that when \(\alpha = 0, R_0(T_t) < R_0({t})\). The inequality holds for sufficiently small \(\alpha\).
If we gradually increase \(\alpha\), then because \(R_\alpha(T_t)\) increases faster with \(\alpha\) (the coefficient of \(\alpha\) is \(|\tilde{T}_t|\), which is larger than the coefficient 1 in \(R_\alpha({t})\)), at a certain value of \(\alpha\) we will have \(R_\alpha(T_t) = R_\alpha({t})\).
If α further increases, the inequality sign will be reversed, and we have \(R_\alpha(T_t) > R_\alpha({t})\). Some node t may reach the equality earlier than some other. The node that achieves the equality at the smallest α is called the weakest link. It is possible that several nodes achieve equality at the same time, and hence there are several weakest link nodes.
Solve the inequality \(R_\alpha(T_t) < R_\alpha({t})\) and get \[\alpha < \frac{R(t)-R(T_t)}{|\tilde{T}_t| -1}\] The right hand side is the ratio between the difference in resubstitution error rates and the difference in complexity, which is positive because both the numerator and the denominator are positive.
Define a function \(g_1(t), t \in T_1\) by \[g_1(t)=\left\{\begin{matrix} \frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}, & t \notin \tilde{T}_1 \\ +\infty, & t \in \tilde{T}_1 \end{matrix}\right. \] The weakest link \(\bar{t}_1\) in \(T_1\) achieves the minimum of \(g_1(t)\), \[g_1(\bar{t}_1)=\underset{t \in T_1}{\text{ min }}g_1(t)\] and we put \(\alpha_2=g_1(\bar{t}_1)\). To get the optimal subtree corresponding to \(\alpha_2\), simply remove the branch growing out of \(\bar{t}_1\). When α increases, \(\bar{t}_1\) is the first node that becomes more preferable than the branch \(T_{\bar{t}_1}\) descended from it. If there are several nodes that simultaneously achieve the minimum \(g_1(t)\), we remove the branch grown out of each of these nodes. \(\alpha_2\) is the first value after \(\alpha_1 =0\) that yields an optimal subtree strictly smaller than \(T_1\). For all \(\alpha_1 \leq \alpha < \alpha_2\), the smallest minimizing subtree is the same as \(T_1\).
Let \(T_2=T_1-T_{\bar{t}_1}\).
Repeat the previous steps. Use \(T_2\) instead of \(T_1\) as the starting tree, find the weakest link in \(T_2\), and prune at all the weakest-link nodes to get the next optimal subtree: \[ \begin{align}g_2(t) & = \left\{\begin{matrix} \frac{R(t)-R(T_{2t})}{|\tilde{T}_{2t}|-1}, & t \in T_2, t \notin \tilde{T}_2 \\ +\infty, & t \in \tilde{T}_2 \end{matrix}\right. \\ g_2(\bar{t}_2) & = \underset{t \in T_2}{\text{ min }}g_2(t)\\ \alpha_3 & = g_2(\bar{t}_2) \\ T_3 & = T_2- T_{\bar{t}_2} \\ \end{align} \]
Computation
In terms of computation, we need to store a few values at each node.
- \(R(t)\), the resubstitution error rate for node t. This only needs to be computed once.
- \(R(T_t)\) , the resubstitution error rate for the branch coming out of node t. This may need to be updated after pruning because \(T_t\) may change after pruning.
- \(|T_t|\) , the number of leaf nodes on the branch coming out of node t. This may need to be updated after pruning.
R(t)
In order to compute the resubstitution error rate \(R(t)\), we need the proportion of data points in each class that land in node t. Let’s suppose we compute the class priors by the proportion of points in each class. As we grow the tree, we store the number of points that land in node t, as well as the number of points in each class that land in node t. Given those numbers, we can easily estimate the probability of node t and the class posterior probabilities given that a data point is in node t. \(R(t)\) can then be calculated.
As for the next two quantities, a) the resubstitution error rate for the branch coming out of node t and b) the number of leaf nodes on that branch, both change after pruning. After pruning, we need to update these values because the number of leaf nodes will have been reduced. To be specific, we would need to update the values for all of the ancestor nodes of the pruned branch.
A recursive procedure can be used to compute \(R(T_t)\) and \(|T_t|\) .
To find the number of leaf nodes in the branch coming out of node t, we can do a bottom-up sweep through the tree. The number of leaf nodes for any node is equal to the number of leaf nodes for its left child node plus the number of leaf nodes for its right child node. A bottom-up sweep ensures that the number of leaf nodes is computed for a child node before its parent node. Similarly, \(R(T_t)\) is equal to the sum of the values for the two child nodes of t, and hence a bottom-up sweep works there as well.
Once we have the three values at every node, we compute the ratio \(g(t)\) and find the weakest link. The corresponding ratio at the weakest link is the new α. It is guaranteed that the sequence of α obtained in the pruning process is strictly increasing.
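Here is a self-contained sketch of the bottom-up sweep and the weakest-link search on a toy tree, represented as nested lists. Each node stores its own \(R(t)\); the recursion returns \(R(T_t)\) and \(|\tilde{T}_t|\) and records \(g(t)\) for every internal node. This is an illustration with made-up numbers, not the CART implementation.

```r
weakest_link <- function(node, path = "root", env = new.env()) {
  if (is.null(node$left)) {                        # leaf: R(T_t) = R(t), one leaf node
    return(list(R_branch = node$R, n_leaves = 1))
  }
  L  <- weakest_link(node$left,  paste0(path, ".L"), env)
  Rc <- weakest_link(node$right, paste0(path, ".R"), env)
  R_branch <- L$R_branch + Rc$R_branch             # R(T_t): sum over the two child branches
  n_leaves <- L$n_leaves + Rc$n_leaves
  g <- (node$R - R_branch) / (n_leaves - 1)        # g(t) = (R(t) - R(T_t)) / (|T_t tilde| - 1)
  env$g <- rbind(env$g, data.frame(node = path, g = g))
  list(R_branch = R_branch, n_leaves = n_leaves, g_table = env$g)
}

# Toy tree: a root, two internal children, and four leaves (R values are made up).
toy <- list(R = 0.30,
  left  = list(R = 0.12, left = list(R = 0.05), right = list(R = 0.04)),
  right = list(R = 0.15, left = list(R = 0.10), right = list(R = 0.03)))

res <- weakest_link(toy)
res$g_table   # the internal node with the smallest g(t) is the weakest link;
              # its g value is the next alpha, and its branch is pruned off.
```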
If at any stage, there are multiple weakest links, for instance, if \(g_k(\bar{t}_k)=g_k(\bar{t}'_k)\), then define: \[T_{k+1}=T_k - T_{\bar{t}_k}-T_{\bar{t}'_k}\] In this case, the two branches are either nested or share no node. The pruning procedure gives a sequence of nested subtrees: \[T_1 \succ T_2 \succ T_3 \succ \cdots \succ \{ t_1\}\] each embedded in the previous one.
Theorem for \(\alpha_k\)
The theorem states that the \(\{\alpha_k\}\) are an increasing sequence, that is, \(\alpha_k < \alpha_{k+1}\), \(k \geq 1\), where \(\alpha_1 = 0\).
For any \(k \geq 1, \alpha_k \leq \alpha < \alpha_{k+1}\) , the smallest optimal subtree \(T(\alpha) = T(\alpha_k) = T_k\) , i.e., is the same as the smallest optimal subtree for \(\alpha_k\).
Basically, this means that the smallest optimal subtree \(T_k\) stays optimal for all \(\alpha\)’s from \(\alpha_k\) until \(\alpha\) reaches \(\alpha_{k+1}\). Although we have only a finite sequence of subtrees, they are optimal for a continuum of \(\alpha\).
At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes very quickly. Then pruning becomes slower and slower as the tree becomes smaller, and the algorithm tends to cut off fewer nodes at a time. Let’s look at an example.
Digital Recognition Example
\(T_1\) is the smallest optimal subtree for \(\alpha_1 = 0\). It has 71 leaf nodes. Next, by finding the weakest link, after one step of pruning the tree is reduced to size 63 (8 leaf nodes are pruned off in one step). Next, five leaf nodes are pruned off. From \(T_3\) to \(T_4\), the pruning is significant: 18 leaf nodes are removed. Towards the end, pruning becomes slower.
For classification purpose, we have to select a single \(α\), or a single subtree to use.
12.8.3 Best Pruned Subtree
There are two approaches to choosing the best-pruned subtree:
Use a test sample set
If we have a large test data set, we can compute the error rate using the test data set for all the subtrees and see which one achieves the minimum error rate. However, in practice, we rarely have a large test data set. Even if we have a large test data set, instead of using the data for testing, we might rather use this data for training in order to train a better tree. When data is scarce, we may not want to use too much for testing.
Cross-validation
How do we conduct cross-validation for trees when trees are unstable? If the training data vary a little bit, the resulting tree may be very different. Therefore, we would have difficulty matching the trees obtained in each fold with the tree obtained using the entire data set.
However, although we said that the trees themselves can be unstable, this does not mean that the classifier resulting from the tree is unstable. We may end up with two trees that look very different, but make similar decisions for classification. The key strategy in a classification tree is to focus on choosing the right complexity parameter α. Instead of trying to say which tree is best, a classification tree tries to find the best complexity parameter \(\alpha\).
So, let’s look at this…
Pruning by Cross-Validation
Let’s consider V-fold cross-validation.
We will denote the original learning sample L which is divided randomly into V subsets, \(L_v, \;\; v = 1, \cdots , V\). We will also let the training sample set in each fold be \(L^{(v)} = L - L_v\).
Next, the tree is grown on the original set and we call this \(T_{max}\). Then, we repeat this procedure for every fold in the cross-validation. So, V additional trees \(T^{(v)}_{max}\) are grown on \(L^{(v)}\).
For each value of the complexity parameter \(\alpha\), let \(T(\alpha)\) and \(T^{(v)}(\alpha)\), \(v = 1, \cdots , V\), be the corresponding minimal cost-complexity subtrees of \(T_{max}\) and \(T^{(v)}_{max}\).
For each maximum tree, we obtain a sequence of critical values of \(\alpha\) that is strictly increasing: \[\alpha_1 < \alpha_2< \alpha_3 < \cdots < \alpha_k < \cdots\] Then, to find the corresponding minimal cost-complexity subtree at a given \(\alpha\), we find the \(\alpha_k\) from the list such that \(\alpha_k \leq \alpha < \alpha_{k+1}\); the optimal subtree corresponding to \(\alpha_k\) is the subtree for \(\alpha\).
The cross-validation error rate of \(T(\alpha)\) is computed by this formula: \[R^{CV}(T(\alpha))=\frac{1}{V}\sum_{v=1}^{V}\frac{N^{(v)}_{miss}}{N^{(v)}}\] where \(N^{(v)}\) is the number of samples in the test set \(L_v\) in fold v; and \(N^{(v)}_{miss}\) is the number of misclassified samples in \(L_v\) using the smallest minimizing subtree at \(\alpha\) , \(T^{(v)}(\alpha)\).
Remember: \(T^{(v)}(\alpha)\) is a pruned subtree of \(T^{(v)}_{max}\), which was grown on \(L^{(v)} = L - L_v\).
Although α is continuous, there are only finitely many minimal cost-complexity trees grown on L. Consider each subtree obtained in the pruning of the tree grown on L, and let \(T_k = T(\alpha_{k})\). To compute the cross-validation error rate of \(T_k\), let \(\alpha'_k=\sqrt{\alpha_k \alpha_{k+1}}\).
Then compute the cross-validation error rate using the formula below: \[R^{CV}(T_k)=R^{CV}(T(\alpha^{'}_{k}))\] The \(T_k\) yielding the minimum cross-validation error rate is chosen.
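A small sketch of the geometric-mean trick with made-up \(\alpha\) values; in each fold, the subtree \(T^{(v)}(\alpha'_k)\) would then be evaluated on the held-out set \(L_v\) and the V error rates averaged.

```r
# Critical alpha values from pruning the tree grown on the full data (made up).
alpha <- c(0, 0.004, 0.011, 0.028, 0.090)                # alpha_1 = 0 < alpha_2 < ...

# Representative value used to score subtree T_k: the geometric mean of
# consecutive critical values, alpha'_k = sqrt(alpha_k * alpha_{k+1}).
alpha_rep <- sqrt(alpha[-length(alpha)] * alpha[-1])
alpha_rep   # one value per subtree T_1, ..., T_{K-1}
```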
Computational Cost
How much computation is involved? Let’s take a look at this when using V-fold cross validation.
- Grow \(V + 1\) maximum trees.
- For each of the \(V + 1\) trees, find the sequence of subtrees with minimum cost-complexity.
- Suppose the maximum tree \(T_{max}\) grown on the original data set has K subtrees in its pruning sequence. Then, for each of the \(K - 1\) values \(\alpha'_{k}\), we compute the misclassification rate on each of the V test sample sets, average these error rates, and use the mean as the cross-validation error rate.
- Choose the best subtree of \(T_{max}\), the one with minimum \(R^{CV}(T_k)\).
The main computation occurs in the pruning. Once the trees and their subtrees are obtained, finding the best one among them is computationally light. For programming, it is recommended that, for every fold and every subtree, the error rate of that subtree be computed using the corresponding test data for that fold and stored. This way, the cross-validation error rate for any \(\alpha\) can easily be computed later.
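This bookkeeping can be sketched directly in R with the rpart package. The following is only an illustration of the procedure, not the implementation used by rpart itself (rpart performs an equivalent computation internally through its xval argument); the number of folds, the random fold assignment, and the reuse of the data.train and model.tree objects defined in the R Scripts section below are assumptions made for the sketch.
library(rpart)
set.seed(19)
V <- 10                                   # number of folds (an assumed choice)
folds <- sample(rep(1:V, length.out = nrow(data.train)))
# geometric means of successive cp values from the tree grown on the full data
cp.seq <- model.tree$cptable[, "CP"]
cp.mid <- sqrt(cp.seq[-length(cp.seq)] * cp.seq[-1])
cv.err <- numeric(length(cp.mid))
for (k in seq_along(cp.mid)) {
  err.v <- numeric(V)
  for (v in 1:V) {
    train.v <- data.train[folds != v, ]   # L^(v) = L - L_v
    test.v  <- data.train[folds == v, ]   # L_v
    big <- rpart(Y ~ ., data = train.v, method = "class",
                 control = rpart.control(cp = 0, xval = 0))
    sub.tree <- prune(big, cp = cp.mid[k])          # minimal cost-complexity subtree
    pred <- predict(sub.tree, test.v, type = "class")
    err.v[v] <- mean(pred != test.v$Y)    # N_miss^(v) / N^(v)
  }
  cv.err[k] <- mean(err.v)                # R^CV for the k-th subtree
}
# the subtree whose cp value attains the smallest cv.err is selected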
12.9 Bagging and Random Forests
In the past, we have focused on statistical learning procedures that produce a single set of results. For example:
- A regression equation, with one set of regression coefficients or smoothing parameters.
- A classification or regression tree with one set of leaf nodes.
Model selection is often required: a measure of fit is associated with each candidate model and used to choose among them.
The Aggregating Procedure:
Here the discussion shifts to statistical learning building on many sets of outputs that are aggregated to produce results. The aggregating procedure makes a number of passes over the data.
On each pass, inputs X are linked with outputs Y just as before. However, of interest now is the collection of all the results from all passes over the data. Aggregated results have several important benefits:
Averaging over a collection of fitted values can help to avoid overfitting. It tends to cancel out the uncommon features of the data captured by a specific model. Therefore, the aggregated results are more stable.
A large number of fitting attempts can produce very flexible fitting functions.
Putting the averaging and the flexible fitting functions together has the potential to break the bias-variance tradeoff.
Revisit Overfitting:
Any attempt to summarize patterns in a dataset risks overfitting. All fitting procedures adapt to the data on hand, so even if the results are applied to a new sample from the same population, fit quality will likely decline. Hence, generalization can be somewhat risky.
“optimism increases linearly with the number of inputs or basis functions …, but decreases as the training sample size increases.” – Hastie, Tibshirani and Friedman (unjustified).
Decision Tree Example:
Consider decision trees as a key illustration. Overfitting often increases with (1) the number of possible splits for a given predictor, (2) the number of candidate predictors, and (3) the number of stages, which is typically represented by the number of leaf nodes.
When overfitting occurs in a classification tree, the classification error is underestimated, and the model may have a structure that will not generalize well. For example, one or more predictors may be included in the tree that really do not belong.
Ideally, one would have two random samples from the same population: a training dataset and a test dataset. The fit measure from the test data would be a better indicator of how accurate the classification is. Often there is only a single dataset. The data are then split into several randomly chosen, non-overlapping partitions of about the same size. With ten partitions, each would be part of the training data in nine analyses and serve as the test data in one analysis. The following figure illustrates 2-fold cross-validation for estimating the cross-validation prediction error for model A and model B. The model selection is based on choosing the one with the smallest cross-validation prediction error.
12.10 R Scripts
1. Acquire Data
Diabetes data
The diabetes data set is taken from the UCI machine learning database on Kaggle: Pima Indians Diabetes Database
- 768 samples in the dataset
- 8 quantitative variables
- 2 classes; with or without signs of diabetes
Load data into R as follows:
In RawData, the response variable is the last column; the remaining columns are the predictor variables.
responseY = as.matrix(RawData[,dim(RawData)[2]])
predictorX = as.matrix(RawData[,1:(dim(RawData)[2]-1)])
data.train = as.data.frame(cbind(responseY, predictorX))
names(data.train) = c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8")
2. Classification and Regression Trees
The generation of a tree comprises two steps: to grow a tree and to prune a tree.
2.1 Grow a Tree
In R, the tree library can be used to construct classification and regression trees (see R Lab 8). As an alternative, they can also be generated with the rpart library, whose rpart(formula) function grows a tree from the data. For the method argument, rpart(formula, method="class") specifies that the response is a categorical variable; otherwise, rpart(formula, method="anova") is assumed for a continuous response.
library(rpart)
set.seed(19)
model.tree <- rpart(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data.train, method="class")
To plot the tree, the following code can be executed in R: plot(model.tree, uniform=TRUE) plots the tree with the nodes vertically equally spaced, and text(model.tree, use.n=TRUE) writes the decision rule at each node (use.n=TRUE also adds the class counts at the terminal nodes).
plot(model.tree, uniform=T)
text(model.tree, use.n=T)
In Figure 1, the plot shows the predictor and its threshold used at each node of the tree, and it shows the number of observations in each class at each terminal node. Specifically, the numbers of points in Class 0 and Class 1 are displayed as ·/·.
At each node, we move down the left branch if the decision statement is true and down the right branch if it is false. The functions in the rpart library draw the tree in such a way that left branches are constrained to have a higher proportion of 0 values for the response variable than right branches. So, some of the decision statements contain "less than" (<) symbols and some contain "greater than or equal to" (>=) symbols (whatever is needed to satisfy this constraint). By contrast, the functions in the tree library draw the tree in such a way that all the decision statements contain "less than" (<) symbols. Thus, either branch may have a higher proportion of 0 values for the response variable than the other.
2.2 Prune a Tree
To obtain the right-sized tree and avoid overfitting, the cptable element of the result generated by rpart can be extracted.
model.tree$cptable
The results are shown as follows:
The cptable provides a brief summary of the overall fit of the model. The table is printed from the smallest tree (no splits) to the largest tree. The "CP" column lists the values of the complexity parameter, the number of splits is listed under "nsplit", and the "xerror" column contains cross-validated classification error rates; the standard deviations of the cross-validation error rates are in the "xstd" column. Normally, we select a tree size that minimizes the cross-validated error, which is shown in the "xerror" column of model.tree$cptable.
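The same table can also be printed with rpart's printcp() function, and plotcp() plots the cross-validated error against the complexity parameter, which is convenient when applying a one-standard-error type of rule. For example:
printcp(model.tree)   # print the complexity parameter table
plotcp(model.tree)    # plot xerror (with standard-error bars) against cp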
Selection of the optimal subtree can also be done automatically using the following code:
opt <- model.tree$cptable[which.min(model.tree$cptable[,"xerror"]), "CP"]
opt
stores the optimal complexity parameter. Now, to prune the tree with the chosen complexity parameter, simply do the following. The pruning is performed by the function prune
, which takes the full tree as the first argument and the chosen complexity parameter as the second.
model.ptree <- prune(model.tree, cp = opt)
The pruned tree is shown in Figure 2, produced with the same plotting functions used for Figure 1.
Further information on the pruned tree can be accessed using the summary()
function.
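As a quick, optimistic check, the fitted classes of the pruned tree can be tabulated against the observed responses in the training data; note that this is only the resubstitution performance, and the cross-validated error in the cptable remains the better guide. A minimal sketch:
pred.class <- predict(model.ptree, data.train, type = "class")
table(Predicted = pred.class, Observed = data.train$Y)   # resubstitution confusion table
mean(pred.class != data.train$Y)                          # resubstitution error rate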
12.11 Bagging
There is a very powerful idea in the use of subsamples of the data and in averaging over subsamples through bootstrapping.
Bagging exploits that idea to address the overfitting issue in a more fundamental manner. It was invented by Leo Breiman, who called it "bootstrap aggregating" or simply "bagging" (see the reference: "Bagging Predictors," Machine Learning, 24:123-140, 1996).
In a classification tree, bagging takes a majority vote from classifiers trained on bootstrap samples of the training data.
Algorithm: Consider the following steps in a fitting algorithm with a dataset having N observations and a binary response variable.
1. Take a random sample of size N with replacement from the data (a bootstrap sample).
2. Construct a classification tree as usual, but do not prune.
3. Assign a class to each terminal node, and store the class assigned to each case together with its predictor values.
4. Repeat Steps 1-3 a large number of times.
5. For each observation in the dataset, count the number of trees in which it is classified in each category, relative to the total number of trees.
6. Assign each observation to a final category by a majority vote over the set of trees. Thus, if 51% of the time over a large number of trees a given observation is classified as a "1", that becomes its classification.
Although there remain some important variations and details to consider, these are the key steps to producing "bagged" classification trees. The idea of classifying by averaging over the results from a large number of bootstrap samples generalizes easily to a wide variety of classifiers beyond classification trees.
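The voting scheme can be sketched directly in R with rpart. This is only an illustration of the steps above, not the implementation used by the ipred or randomForest packages; the number of bootstrap replications B and the reuse of data.train (with the response coded 0/1) are assumptions made for the sketch.
library(rpart)
set.seed(19)
B <- 100                                               # number of bootstrap samples (assumed)
votes <- matrix(0, nrow = nrow(data.train), ncol = B)
for (b in 1:B) {
  idx  <- sample(nrow(data.train), replace = TRUE)     # Step 1: bootstrap sample
  boot <- data.train[idx, ]
  fit  <- rpart(Y ~ ., data = boot, method = "class",
                control = rpart.control(cp = 0, xval = 0))    # Step 2: grow, do not prune
  votes[, b] <- as.numeric(as.character(
                  predict(fit, data.train, type = "class")))  # Step 3: store assigned classes
}
# Steps 5-6: majority vote over the B trees (classes coded 0/1)
bagged.class <- as.numeric(rowMeans(votes) > 0.5)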
Margins:
Bagging introduces a new concept, “margins.” Operationally, the “margin” is the difference between the proportion of times a case is correctly classified and the proportion of times it is incorrectly classified. For example, if over all trees an observation is correctly classified 75% of the time, the margin is 0.75 - 0.25 = 0.50.
Large margins are desirable because a more stable classification is implied. Ideally, there should be large margins for all of the observations. This bodes well for generalization to new data.
Out-Of-Bag Observations:
For each tree, observations not included in the bootstrap sample are called "out-of-bag" observations. These "out-of-bag" observations can be treated as a test dataset and dropped down the tree.
To get a better evaluation of the model, the prediction error is estimated only based on the “out-of-bag’’ observations. In other words, the averaging for a given observation is done only using the trees for which that observation was not used in the fitting process.
Why Bagging Works:
The core of bagging's potential is found in the averaging over results from a substantial number of bootstrap samples. As a first approximation, the averaging helps to cancel out the impact of random variation. However, there is more to the story, some details of which are especially useful for understanding a number of topics we will discuss later.
Example: Domestic Violence
Data were collected to help forecast incidents of domestic violence within households. For a sample of households to which sheriff's deputies were dispatched for domestic violence incidents, the deputies collected information on a series of possible predictors of future domestic violence, for example, whether police officers had been called to that household in the recent past.
The following three figures are three classification trees constructed from the same data, but each using a different bootstrap sample.
It is clear that the three figures are very different. Unstable results may be due to any number of common problems: small sample sizes, highly correlated predictors, or heterogeneous terminal nodes. Interpretations from the results of a single tree can be quite risky when a classification tree performs in this manner. However, when a classification tree is used solely as a classification tool, the classes assigned may be relatively stable even if the tree structure is not.
The same phenomenon can be found in conventional regression when predictors are highly correlated. The regression coefficients estimated for particular predictors may be very unstable, but it does not necessarily follow that the fitted values will be unstable as well.
It is not clear how much bias exists in the three trees. But it is clear that the variance across trees is large. Bagging can help with the variance. The conceptual advantage of bagging is to aggregate fitted values from a large number of bootstrap samples. Ideally, many sets of fitted values, each with low bias but high variance, may be averaged in a manner that can effectively reduce the bite of the bias-variance tradeoff. The ways in which bagging aggregates the fitted values are the basis for many other statistical learning developments.
Bagging a Quantitative Response:
Recall that a regression tree maximizes the reduction in the error sum of squares at each split. All of the concerns about overfitting apply, especially given the potential impact that outliers can have on the fitting process when the response variable is quantitative. Bagging works by the same general principles when the response variable is numerical.
- For each tree, each observation is placed in a terminal node and assigned the mean of that terminal node.
- Then, the average of these assigned means over trees is computed for each observation.
- This average value for each observation is the bagged fitted value.
R Package for Bagging
In R, the bagging procedure (i.e., bagging() in the ipred library) can be applied to classification, regression, and survival trees.
The argument nbagg is an integer giving the number of bootstrap replications, and control passes control details to the underlying rpart algorithm.
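A minimal call might look like the following; the argument values (25 bootstrap replications, coob = TRUE to request an out-of-bag error estimate, and unpruned trees via the control argument) are illustrative choices, not recommendations from the text.
library(ipred)
library(rpart)
set.seed(19)
bag.data <- data.train
bag.data$Y <- as.factor(bag.data$Y)    # classification bagging expects a factor response
model.bag <- bagging(Y ~ ., data = bag.data, nbagg = 25, coob = TRUE,
                     control = rpart.control(cp = 0, xval = 0))
print(model.bag)                       # reports the out-of-bag misclassification error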
We can also use the random forest procedure in the “randomForest” package since bagging is a special case of random forests.
12.12 From Bagging to Random Forests
Bagging constructs a large number of trees with bootstrap samples from a dataset. But now, as each tree is constructed, take a random sample of predictors before each node is split. For example, if there are twenty predictors, choose a random five as candidates for constructing the best split. Repeat this process for each node until the tree is large enough. And as in bagging, do not prune.
Random Forests Algorithm
The random forests algorithm is very much like the bagging algorithm. Let N be the number of observations and assume for now that the response variable is binary.
1. Take a random sample of size N with replacement from the data (a bootstrap sample).
2. Take a random sample of the predictors without replacement.
3. Construct a split using only the predictors selected in Step 2.
4. Repeat Steps 2 and 3 for each subsequent split until the tree is as large as desired. Do not prune. Each tree is thus produced from a random sample of cases, and at each split a random sample of predictors is considered.
5. Drop the out-of-bag data down the tree and store the class assigned to each observation along with each observation's predictor values.
6. Repeat Steps 1-5 a large number of times (e.g., 500).
7. For each observation in the dataset, count the number of trees in which it is classified in each category, relative to the total number of trees.
8. Assign each observation to a final category by a majority vote over the set of trees. Thus, if 51% of the time over a large number of trees a given observation is classified as a "1", that becomes its classification.
Why Random Forests Work
Variance reduction: the trees are more independent because of the combination of bootstrap samples and random draws of predictors.
- It is apparent that random forests are a form of bagging, and the averaging over trees can substantially reduce instability that might otherwise result. Moreover, by working with a random sample of predictors at each possible split, the fitted values across trees are more independent. Consequently, the gains from averaging over a large number of trees (variance reduction) can be more dramatic.
Bias reduction: a very large number of predictors can be considered, and local feature predictors can play a role in tree construction.
Random forests are able to work with a very large number of predictors, even more predictors than there are observations. An obvious gain with random forests is that more information may be brought in to reduce the bias of the fitted values and the estimated splits.
There are often a few predictors that dominate the decision tree fitting process because, on average, they consistently perform just a bit better than their competitors. Consequently, many other predictors, which could be useful for very local features of the data, are rarely selected as splitting variables. With random forests computed for a large enough number of trees, each predictor will have at least several opportunities to be the predictor defining a split. In those opportunities, it will have very few competitors, because much of the time a dominant predictor will not be among the sampled candidates. Therefore, local feature predictors will have the opportunity to define a split.
Indeed, random forests are among the very best classifiers invented to date (Breiman, 2001a).
Random forests include 3 main tuning parameters.
Node Size: unlike in decision trees, the number of observations in the terminal nodes of each tree of the forest can be very small. The goal is to grow trees with as little bias as possible.
Number of Trees: in practice, 500 trees is often a good choice.
Number of Predictors Sampled: the number of predictors sampled at each split would seem to be a key tuning parameter that should affect how well random forests perform. Sampling 2-5 each time is often adequate.
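In the randomForest package, these three tuning parameters correspond to the nodesize, ntree, and mtry arguments. A minimal sketch with illustrative values, using the same X.train and Y.train objects that appear in the code at the end of this lesson:
library(randomForest)
set.seed(19)
rf.tuned <- randomForest(x = X.train, y = as.factor(Y.train),
                         ntree = 500,    # number of trees
                         mtry = 3,       # number of predictors sampled at each split
                         nodesize = 1)   # minimum size of terminal nodes
print(rf.tuned)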
Taking Costs into Account
In the example of domestic violence, the following predictors were collected from 500+ households: Household size and number of children; Male / female age (years); Marital duration; Male / female education (years); Employment status and income; The number of times the police had been called to that household before; Alcohol or drug abuse, etc.
Our goal is not to forecast all new domestic violence incidents, but only those cases in which there is evidence that serious domestic violence has actually occurred. There are 29 felony incidents, a very small fraction (4%) of all domestic violence calls for service, and they would be extremely difficult to forecast. When a logistic regression was applied to the data, not a single incident of serious domestic violence was identified.
There is a need to consider the relative costs of false negatives (failing to predict a felony incident) and false positives (predicting a case to be a felony incident when it is not). Otherwise, the best prediction would be to assume no serious domestic violence at all, with an error rate of only 4%. In random forests, there are two common approaches. They differ by whether costs are imposed on the data before each tree is built or at the end when classes are assigned.
Weighted Classification Votes: After all of the trees are built, one can differentially weight the classification votes over trees. For example, one vote for classification in the rare category might count the same as two votes for classification in the common category.
Stratified Bootstrap Sampling: When each bootstrap sample is drawn before a tree is built, one can oversample one class of cases for which forecasting errors are relatively more costly. The procedure is much in the same spirit as disproportional stratified sampling used for data collection (Thompson, 2002).
Using a cost ratio of 10 to 1 for false negatives to false positives favored by the police department, random forests correctly identify half of the rare serious domestic violence incidents.
In summary, with forecasting accuracy as the criterion, bagging is in principle an improvement over decision trees: it constructs a large number of trees with bootstrap samples from a dataset. Random forests are in principle an improvement over bagging: they draw a random sample of predictors to define each split.
R Package for Random Forests
In R, the random forest procedure can be implemented by the “randomForest” package.
rf = randomForest(x=X.train, y=as.factor(Y.train), importance=T, do.trace=50, ntree=200, classwt=c(5,1))
print(rf)
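Here classwt supplies unequal class weights. The stratified bootstrap sampling described above can instead be sketched with the strata and sampsize arguments of randomForest; the per-class sample sizes below are purely illustrative and would need to be chosen to reflect the relative costs in a real analysis.
rf.strat = randomForest(x = X.train, y = as.factor(Y.train), ntree = 200,
                        strata = as.factor(Y.train),
                        sampsize = c(100, 25))   # draws per class, in the order of the factor levels
print(rf.strat)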
12.13 Boosting
Boosting, like bagging, is another general approach for improving prediction results for various statistical learning methods. It is also particularly well suited to decision trees. Section 8.2.3 in the textbook provides details.