11.8 - Right Sized Tree via Pruning

In the previous section, we talked about growing trees. In this section, we will discuss pruning trees.

Let the expected misclassification rate of a tree \(T\) be \(R^*(T)\).

Recall we used the resubstitution estimate for \(R^*(T)\). This is:

\(R(T)=\sum_{t \in \tilde{T}}r(t)p(t)=\sum_{t \in \tilde{T}}R(t)  \)

Remember also that \(r(t)\) is the probability of making a wrong classification for points in node \(t\). For a point in a given leaf node \(t\), the estimated probability of misclassification is 1 minus the probability of the majority class in node \(t\) based on the training data.

To get the probability of misclassification for the whole tree, a weighted sum of the within leaf node error rate is computed according to the total probability formula.
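To make the computation concrete, here is a minimal Python sketch (not part of the original material) that computes \(R(T)\) from the training class counts stored at the leaves; the function names and the leaf representation are illustrative assumptions.

def leaf_error(class_counts, n_total):
    """R(t) = r(t) * p(t) for one leaf, given its training class counts."""
    n_t = sum(class_counts)                  # number of points landing in this leaf
    p_t = n_t / n_total                      # estimated p(t)
    r_t = 1.0 - max(class_counts) / n_t      # 1 minus the majority-class proportion
    return r_t * p_t

def resubstitution_error(leaves, n_total):
    """R(T): weighted sum of within-leaf error rates over all terminal nodes."""
    return sum(leaf_error(counts, n_total) for counts in leaves)

# Example: a tree with three leaves fit to 100 training points.
leaves = [(40, 5), (10, 25), (2, 18)]        # (class 0 count, class 1 count) per leaf
print(resubstitution_error(leaves, 100))     # approximately 0.17 = (5 + 10 + 2) / 100

Note that \(r(t)p(t)\) for a leaf reduces to the number of training points misclassified in that leaf divided by the total number of training points.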

We also mentioned in the last section that the resubstitution error rate \(R(T)\) is biased downward. Specifically, we proved that the weighted misclassification rate for a parent node is guaranteed to be greater than or equal to the sum of the weighted misclassification rates of its left and right child nodes, that is:

\(R(t) \geq R(t_L) + R(t_R)\)

This means that if we simply minimize the resubstitution error rate, we would always prefer a bigger tree. There is no defense against overfitting.

Let's look at the digit recognition example that we discussed before.

The biggest tree grown using the training data is of size 71. In other words, it has 71 leaf nodes. The tree is grown until all of the points in every leaf node are from the same class.

[Table: the sequence of pruned subtrees, with the number of terminal nodes, the resubstitution error rate \(R(T)\), and the test error rate \(R^{ts}(T)\) for each subtree]

Then a pruning procedure is applied (the details of this process will come later). We can see from the above table that the tree is gradually pruned. The tree next to the full tree has 63 leaf nodes, which is followed by a tree with 58 leaf nodes, and so forth, until only one leaf node is left. This minimum tree contains only a single node, the root.

The resubstitution error rate \(R(T)\) becomes monotonically larger when the tree shrinks.

The error rate \(R^{ts}\) based on a separate test data set shows a different pattern. It decreases first as the tree becomes larger, hits its minimum at the tree with 10 terminal nodes, and begins to increase as the tree grows further.

Compared with \(R(T)\), \(R^{ts}(T)\) better reflects the real performance of the tree. The minimum \(R^{ts}(T)\) is achieved not by the biggest tree, but by a tree that better balances the resubstitution error rate and the tree size.


11.8.1 - Preliminaries for Pruning

First, we grow the tree to a large size. Denote this maximum-sized tree by \(T_{max}\). The stopping criterion is not important here because, as long as the tree is fairly big, it does not really matter when we stop. The overgrown tree will be pruned back eventually. There are a few ways of deciding when to stop:

  1. Keep going until all terminal nodes are pure (contain only one class).
  2. Keep going until the number of data points in each terminal node is no greater than a certain threshold, say 5, or even 1 (a sketch of such a rule appears below).
  3. As long as the tree is sufficiently large, the size of the initial tree is not critical.

The key here is to make the initial tree sufficiently big before pruning back.
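As an illustration, here is a small Python sketch of such a stopping rule; the function name, the representation of a node by its class counts, and the threshold of 5 are assumptions chosen to mirror the text, not a prescribed implementation.

def should_stop(class_counts, min_node_size=5):
    """Stop splitting if the node is pure or contains too few training points."""
    n_t = sum(class_counts)
    is_pure = max(class_counts) == n_t       # all points belong to one class
    return is_pure or n_t <= min_node_size

print(should_stop([7, 0]))     # True: the node is pure
print(should_stop([3, 2]))     # True: only 5 points in the node
print(should_stop([30, 12]))   # False: keep splitting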

Notation

Now we need to introduce some notation. Let's take a look at the following definitions:

1. Descendant: a node \(t^{'}\) is a descendant of node \(t\) if there is a connected path down the tree leading from \(t\) to \(t^{'}\).

2. Ancestor: t is an ancestor of \(t^{'}\) if \(t^{'}\) is its descendant.

3. A branch \(T_t\) of T with root node \(t \in T\) consists of the node t and all descendants of t in T .

4. Pruning a branch \(T_t\) from a tree T consists of deleting from T all descendants of t , that is, cutting off all of \(T_t\) except its root node. The tree pruned this way will be denoted by \(T - T_t\) .

5. If \(T^{'}\) is obtained from T by successively pruning off branches, then \(T^{'}\) is called a pruned subtree of T and denoted by \(T^{'} < T\).

[Figures: three tree diagrams illustrating the definitions above]
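To make the notation concrete, the following Python sketch (an illustrative assumption, not code from the source) represents a binary tree with a minimal Node class and implements "descendants" and "pruning a branch" exactly as defined above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None

def descendants(t):
    """All descendants of t, i.e., the branch T_t without its root."""
    out = []
    for child in (t.left, t.right):
        if child is not None:
            out.append(child)
            out.extend(descendants(child))
    return out

def prune_branch(t):
    """Prune T_t from the tree: delete all descendants of t, keeping t as a leaf."""
    t.left, t.right = None, None

# Example: prune the branch rooted at the root's left child.
root = Node(left=Node(left=Node(), right=Node()), right=Node())
print(len(descendants(root)))   # 4 descendants before pruning
prune_branch(root.left)
print(len(descendants(root)))   # 2 after the branch is cut back to its root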

 

 

Optimal Subtrees

Even for a moderately sized \(T_{max}\), there is an enormously large number of subtrees and an even larger number of ways to prune the initial tree down to any of them. We, therefore, cannot exhaustively go through all the subtrees to find out which one is the best in some sense. Moreover, we typically do not have a separate test dataset to serve as a basis for selection.

A smarter method is necessary. A feasible method of pruning should ensure the following:

  • The subtree is optimal in a certain sense, and
  • The search of the optimal subtree should be computationally tractable.

11.8.2 - Minimal Cost-Complexity Pruning

As we just discussed, \(R(T)\) is not a good measure for selecting a subtree because it always favors bigger trees. We need to add a complexity penalty to the resubstitution error rate. The penalty term favors smaller trees, and hence balances with \(R(T)\).

The definition for the cost-complexity measure:

For any subtree \(T < T_{max}\), we will define its complexity as \(|\tilde{T}|\), the number of terminal or leaf nodes in T. Let \(\alpha \geq 0\) be a real number called the complexity parameter and define the cost-complexity measure \(R_{\alpha}(T)\) as:

\(R_{\alpha}(T)=R(T) +\alpha| \tilde{T}|  \)

The more leaf nodes the tree contains, the higher the complexity of the tree, because we have more flexibility in partitioning the space into smaller pieces, and therefore more possibilities for fitting the training data. There is also the issue of how much importance to put on the size of the tree. The complexity parameter \(\alpha\) adjusts that.

In the end, the cost complexity measure comes as a penalized version of the resubstitution error rate. This is the function to be minimized when pruning the tree.

Which subtree is selected eventually depends on \(\alpha\) . If \(\alpha  = 0\) then the biggest tree will be chosen because the complexity penalty term is essentially dropped. As \(\alpha\) approaches infinity, the tree of size 1, i.e., a single root node, will be selected.
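As a quick illustration, the sketch below (with made-up numbers and illustrative names, not from the source) evaluates \(R_{\alpha}(T)\) for a big and a small subtree and shows the preferred subtree flipping as \(\alpha\) grows.

def cost_complexity(resub_error, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T~|, the penalized resubstitution error."""
    return resub_error + alpha * n_leaves

# (R(T), |T~|) pairs for a big tree and a small subtree, made up for illustration.
big, small = (0.05, 40), (0.12, 6)
for alpha in (0.0, 0.01):
    print(alpha, cost_complexity(*big, alpha), cost_complexity(*small, alpha))
# alpha = 0    -> about 0.05 vs 0.12 : the big tree is preferred
# alpha = 0.01 -> about 0.45 vs 0.18 : the penalty now favors the small subtree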

In general, given a pre-selected \(\alpha\) , find the subtree \(T(\alpha)\) that minimizes \(R_{\alpha}(T)\), i.e.,

\(R_{\alpha}(T(\alpha))= \underset{T \preceq T_{max}}{\min} R_{\alpha}(T)\)

The minimizing subtree for any \(\alpha\) always exists since there are only finitely many subtrees.

Since there are at most a finite number of subtrees of \(T_{max}\), \(R_{\alpha}(T(\alpha))\) yields different values for only finitely many \(\alpha\)'s. \(T(\alpha)\) continues to be the minimizing tree when \(\alpha\) increases, until a jump point is reached.

Two questions:

  1. Is there a unique subtree \(T < T_{max}\) which minimizes \(R_{\alpha}(T)\)?
  2. In the minimizing sequence of trees \(T_1, T_2, \cdots\), is each subtree obtained by pruning upward from the previous subtree, i.e., does the nesting \(T_1 > T_2 > \cdots > \{t_1\}\) hold?

If the optimal subtrees are nested, the computation will be a lot easier. We can first find \(T_1\), and then to find \(T_2\), we don't need to start again from the maximum tree, but can start from \(T_1\) (because \(T_2\) is guaranteed to be a subtree of \(T_1\)). In this way, as \(\alpha\) increases, we prune based on a smaller and smaller subtree.

Definition: The smallest minimizing subtree \(T(\alpha)\) for complexity parameter \(\alpha\) is defined by the conditions:

  1. \(R_{\alpha}(T(\alpha)) = \underset{T \preceq T_{max}}{\min} R_{\alpha}(T)\)
  2. If \(R_{\alpha}(T) = R_{\alpha}(T(\alpha))\), then \(T(\alpha) \preceq T\). In other words, if another tree achieves the minimum at the same \(\alpha\), then that tree is guaranteed to be at least as big as the smallest minimizing tree \(T(\alpha)\).

By definition (according to the second requirement above), if the smallest minimizing subtree \(T(\alpha)\) exists, it must be unique. Earlier we argued that a minimizing subtree always exists because there are only a finite number of subtrees. Here we go one step further: we can prove that the smallest minimizing subtree always exists. This is not trivial to show because one tree being smaller than another means the former is embedded in the latter; tree ordering is only a partial ordering.

The starting point for the pruning is not \(T_{max}\) , but rather \(T_1 = T(0)\), which is the smallest subtree of \(T_{max}\) satisfying:

\(R(T_1) = R(T_{max})\)

The way that you get \(T_1\) is as follows:

First, look at the biggest tree, \(T_{max}\). For any two terminal nodes \(t_L\) and \(t_R\) descended from the same parent node \(t\), if they yield the same resubstitution error rate as the parent, that is, if \(R(t) = R(t_L) + R(t_R)\), prune off \(t_L\) and \(t_R\).

This process is applied recursively. After we have pruned one pair of terminal nodes, the tree shrinks a little bit. Then based on the smaller tree, we do the same thing until we cannot find any pair of terminal nodes satisfying this equality. The resulting tree at this point is \(T_1\).
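Below is a hedged Python sketch of this recursive step; the Node class with a precomputed \(R(t)\) field and the function name get_T1 are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    R: float                        # resubstitution error R(t) of this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None

def get_T1(t):
    """Prune in place until no split satisfies R(t) = R(t_L) + R(t_R)."""
    if t.is_leaf():
        return
    get_T1(t.left)
    get_T1(t.right)
    if t.left.is_leaf() and t.right.is_leaf() and \
            abs(t.R - (t.left.R + t.right.R)) < 1e-12:
        t.left, t.right = None, None    # the split does not reduce R, so cut it off

# Example: a split that leaves the resubstitution error unchanged is removed.
root = Node(R=0.10, left=Node(R=0.06), right=Node(R=0.04))
get_T1(root)
print(root.is_leaf())   # True

Processing the children before the parent (a post-order traversal) lets prunings cascade upward, matching the recursive description above.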

Let's take a look at an example using a plot to see how \(T_1\) is obtained from \(T_{max}\).

[Figure: the maximum tree \(T_{max}\)]

 

We will use \(T_t\) to denote a branch rooted at \(t\). Then, for \(T_t\), we define \(R(T_t)\), the resubstitution error rate for this branch, by:

\(R(T_t)=\sum_{t' \in \tilde{T}_t}R(t')\)

where \(\tilde{T}_t\) is the set of terminal nodes of \(T_t\).

What we can prove is that, if \(t\) is any non-terminal (internal) node of \(T_1\), then it is guaranteed to have a larger resubstitution error rate than the branch rooted at it, that is, \(R(t) > R(T_t)\). If we prune off the branch at \(t\), the resubstitution error rate will strictly increase.

 

Weakest-Link Cutting

The weakest link cutting method not only finds the next \(\alpha\) which results in a different optimal subtree, but also finds that optimal subtree.

Remember, we previously defined \(R_\alpha\) for the entire tree. Here, we extend the definition first to a single node and then to a branch coming out of a node.

For any node \(t \in T_1\), we can set \(R_\alpha(\{t\}) = R(t) + \alpha\).

Also, for any branch \(T_t\), we can define \(R_\alpha(T_t) = R(T_t) + \alpha |\tilde{T}_t|\).

We know that when \(\alpha = 0\), \(R_0(T_t) < R_0(\{t\})\). This inequality holds for sufficiently small \(\alpha\).

If we gradually increase \(\alpha\), then because \(R_\alpha(T_t)\) increases faster with \(\alpha\) (the coefficient in front of \(\alpha\), \(|\tilde{T}_t|\), is larger than the coefficient 1 in \(R_\alpha(\{t\})\)), at a certain value of \(\alpha\) (the crossing point in the figure below) we will have \(R_\alpha(T_t) = R_\alpha(\{t\})\).

[Figure: \(R_\alpha(\{t\})\) and \(R_\alpha(T_t)\) plotted against \(\alpha\), crossing at a critical value]

If \(\alpha\) increases further, the inequality sign will be reversed, and we have \(R_\alpha(T_t) > R_\alpha(\{t\})\). Some nodes may reach the equality earlier than others. The node that achieves the equality at the smallest \(\alpha\) is called the weakest link. It is possible that several nodes achieve equality at the same time, and hence there are several weakest link nodes.

Solving the inequality \(R_\alpha(T_t) < R_\alpha(\{t\})\) for \(\alpha\), we get

\(\alpha < \frac{R(t)-R(T_t)}{|\tilde{T}_t| -1}\)

The right hand side is the ratio between the difference in resubstitution error rates and the difference in complexity, which is positive because both the numerator and the denominator are positive.
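Spelling out the algebra behind this bound, starting from the definitions of \(R_\alpha(\{t\})\) and \(R_\alpha(T_t)\) above:

\(\begin{align}
R_\alpha(T_t) & < R_\alpha(\{t\}) \\
R(T_t) + \alpha|\tilde{T}_t| & < R(t) + \alpha \\
\alpha\left(|\tilde{T}_t| - 1\right) & < R(t) - R(T_t) \\
\alpha & < \frac{R(t)-R(T_t)}{|\tilde{T}_t| - 1}
\end{align}\)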

Define a function \(g_1(t), t \in T_1\) by

\(g_1(t)=\left\{\begin{matrix}
\frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}, & t \notin \tilde{T}_1 \\
+\infty, & t \in \tilde{T}_1
\end{matrix}\right. \)

The weakest link \(\bar{t}_1\) in \(T_1\) achieves the minimum of \(g_1(t)\).

\(g_1(\bar{t}_1)=\underset{t \in T_1}{\text{ min }}g_1(t)\)

and put \(\alpha_2=g_1(\bar{t}_1)\). To get the optimal subtree corresponding to \(\alpha_2\) , simply remove the branch growing out of \(\bar{t}_1\). When α increases, \(\bar{t}_1\) is the first node that becomes more preferable than the branch  \(T_{\bar{t}_1}\) descended from it. If there are several nodes that simultaneously achieve the minimum \(g_1(t)\), we remove the branch grown out of each of these nodes. \(\alpha_2\) is the first value after \(\alpha_1 =0\) that yields an optimal subtree strictly smaller than \(T_1\).  For all \(\alpha_1 \leq \alpha < \alpha_2\), the smallest minimizing subtree is the same as \(T_1\).

Let \(T_2=T_1-T_{\bar{t}_1}\).

Repeat the previous steps. Use \(T_2\) instead of \(T_1\) as the starting tree, find the weakest link in \(T_2\), and prune off the branches at all the weakest-link nodes to get the next optimal subtree.

\(\begin{align}
g_2(t) & = \left\{\begin{matrix}
\frac{R(t)-R(T_{2t})}{|\tilde{T}_{2t}|-1}, & t \in T_2, t \notin \tilde{T}_2 \\
+\infty, & t \in \tilde{T}_2
\end{matrix}\right. \\
g_2(\bar{t}_2) & = \underset{t \in T_2}{\text{ min }}g_2(t)\\
\alpha_3 & = g_2(\bar{t}_2) \\
T_3 & = T_2- T_{\bar{t}_2}
\end{align}\)
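A hedged Python sketch of one weakest-link step follows; it assumes every node stores \(R(t)\), \(R(T_t)\), and \(|\tilde{T}_t|\) in fields named R, R_branch, and n_leaves (these names, like the rest of the scaffolding, are illustrative and not from the source).

import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    R: float                        # R(t)
    R_branch: float                 # R(T_t), the sum over the branch's leaves
    n_leaves: int                   # |T~_t|, the branch's number of leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None

def nodes_of(t):
    """All nodes of the subtree rooted at t."""
    if t.is_leaf():
        return [t]
    return [t] + nodes_of(t.left) + nodes_of(t.right)

def g(t):
    """g_k(t): +infinity at terminal nodes, else the critical alpha for collapsing t."""
    if t.is_leaf():
        return math.inf
    return (t.R - t.R_branch) / (t.n_leaves - 1)

def weakest_link_step(root):
    """Prune at every weakest link of the current tree; return the next critical alpha."""
    alpha_next = min(g(t) for t in nodes_of(root))
    for t in nodes_of(root):
        if not t.is_leaf() and math.isclose(g(t), alpha_next):
            t.left, t.right = None, None    # cut off the branch T_t at the weakest link
    # R_branch and n_leaves of all ancestors must now be updated,
    # as discussed under "Computation" below.
    return alpha_next

Calling weakest_link_step repeatedly, and recording the returned critical values, produces the sequence \(\alpha_2, \alpha_3, \cdots\) and the nested subtrees \(T_2, T_3, \cdots\).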

 

Computation

In terms of computation, we need to store a few values at each node.

  • \(R(t)\), the resubstitution error rate for node t. This only needs to be computed once.
  • \(R(T_t)\) , the resubstitution error rate for the branch coming out of node t.  This may need to be updated after pruning because \(T_t\) may change after pruning.
  • \(|\tilde{T}_t|\), the number of leaf nodes on the branch coming out of node t. This may need to be updated after pruning.

 

R(t)

 

In order to compute the resubstitution error rate \(R(t)\), we need the proportion of data points in each class that land in node t. Let's suppose we compute the class priors by the proportion of points in each class. As we grow the tree, we store the number of points that land in node t, as well as the number of points in each class that land in node t. Given those numbers, we can easily estimate the probability of node t and the class posterior probabilities given that a data point is in node t. \(R(t)\) can then be calculated.

The next two numbers, a) the resubstitution error rate for the branch coming out of node t, and b) the number of leaf nodes on that branch, change after pruning. After pruning, we need to update these values because the number of leaf nodes will have been reduced. To be specific, we would need to update the values for all of the ancestor nodes of the pruned branch.

A recursive procedure can be used to compute \(R(T_t)\) and \(|\tilde{T}_t|\).

To find the number of leaf nodes in the branch coming out of node t, we can do a bottom-up sweep through the tree. The number of leaf nodes for any node is equal to the number of leaf nodes for its left child node plus the number of leaf nodes for its right child node. A bottom-up sweep ensures that the number of leaf nodes is computed for a child node before it is computed for the parent node. Similarly, \(R(T_t)\) is equal to the sum of the values for the two child nodes of t. Hence, a bottom-up sweep would do.
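Here is a minimal Python sketch of that bottom-up sweep; the node fields follow the same illustrative layout as the earlier sketches (names assumed, not from the source), and the parent's values are filled in only after both children are done.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    R: float                        # R(t), computed once while growing the tree
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    R_branch: float = 0.0           # R(T_t), filled in by the sweep
    n_leaves: int = 0               # |T~_t|, filled in by the sweep

def bottom_up_sweep(t):
    """Compute R(T_t) and |T~_t| for every node of the subtree rooted at t."""
    if t.left is None and t.right is None:      # a leaf is its own branch
        t.R_branch, t.n_leaves = t.R, 1
    else:
        bottom_up_sweep(t.left)
        bottom_up_sweep(t.right)
        t.R_branch = t.left.R_branch + t.right.R_branch
        t.n_leaves = t.left.n_leaves + t.right.n_leaves

# Example: the branch error of the root is the sum over its two leaves.
root = Node(R=0.10, left=Node(R=0.03), right=Node(R=0.05))
bottom_up_sweep(root)
print(root.R_branch, root.n_leaves)   # approximately 0.08, and 2 leaves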

Once we have the three values at every node, we compute the ratio \(g(t)\) and find the weakest link. The corresponding ratio at the weakest link is the new α. It is guaranteed that the sequence of α obtained in the pruning process is strictly increasing.

If at any stage, there are multiple weakest links, for instance, if \(g_k(\bar{t}_k)=g_k(\bar{t}'_k)\), then define:

\(T_{k+1}=T_k - T_{\bar{t}_k}-T_{\bar{t}'_k}\)

In this case, the two branches are either nested or share no node. The pruning procedure gives a sequence of nested subtrees:

\(T_1 > T_2 > T_3 > \cdots > \{t_1\}\)

Each tree is embedded in the previous one.

 

Theorem for \(\alpha_k\)

The theorem states that the \(\{\alpha_k\}\) form an increasing sequence, that is, \(\alpha_k < \alpha_{k+1}\), \(k \geq 1\), where \(\alpha_1 = 0\).

For any \(k \geq 1\) and \(\alpha_k \leq \alpha < \alpha_{k+1}\), the smallest optimal subtree \(T(\alpha) = T(\alpha_k) = T_k\), i.e., it is the same as the smallest optimal subtree for \(\alpha_k\).

Basically, this means that the smallest optimal subtree \(T_k\) stays optimal for all \(\alpha\) from \(\alpha_k\) up until \(\alpha\) reaches \(\alpha_{k+1}\). Although we have only a finite sequence of subtrees, they are optimal for a continuum of \(\alpha\).
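As a small illustration of this theorem, the sketch below (with made-up critical values) maps any \(\alpha\) to the index of the subtree \(T_k\) that is optimal on the interval \([\alpha_k, \alpha_{k+1})\); bisect is from the Python standard library.

from bisect import bisect_right

def subtree_index(critical_alphas, alpha):
    """Return k (0-based) such that critical_alphas[k] <= alpha < critical_alphas[k+1]."""
    return bisect_right(critical_alphas, alpha) - 1

alphas = [0.0, 0.004, 0.011, 0.03]    # illustrative critical values alpha_1 < alpha_2 < ...
print(subtree_index(alphas, 0.0))     # 0 -> T_1
print(subtree_index(alphas, 0.01))    # 1 -> T_2, since 0.004 <= 0.01 < 0.011
print(subtree_index(alphas, 0.5))     # 3 -> the last (smallest) subtree in this example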

At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes very quickly. Then pruning becomes slower and slower as the tree becomes smaller, and the algorithm tends to cut off fewer nodes at a time. Let's look at an example.

 

Digit Recognition Example

\(T_1\) is the smallest optimal subtree for \(\alpha_1 = 0\). It has 71 leaf nodes. Next, by finding the weakest link, after one step of pruning the tree is reduced to size 63 (8 leaf nodes are pruned off in one step). Next, five leaf nodes are pruned off. From \(T_3\) to \(T_4\), the pruning is significant: 18 leaf nodes are removed. Towards the end, pruning becomes slower.

For classification purposes, we have to select a single \(\alpha\), that is, a single subtree to use.


11.8.3 - Best Pruned Subtree

There are two approaches to choosing the best-pruned subtree:

  • Use a test sample set

If we have a large test data set, we can compute the error rate using the test data set for all the subtrees and see which one achieves the minimum error rate. However, in practice, we rarely have a large test data set. Even if we have a large test data set, instead of using the data for testing, we might rather use this data for training in order to train a better tree. When data is scarce, we may not want to use too much for testing.

  • Cross-validation

How do we conduct cross-validation for trees when trees are unstable? If the training data vary a little bit, the resulting tree may be very different. Therefore, we would have difficulty matching the trees obtained in each fold with the tree obtained using the entire data set.

However, although we said that the trees themselves can be unstable, this does not mean that the classifier resulting from the tree is unstable. We may end up with two trees that look very different but make similar decisions for classification. The key strategy for classification trees is therefore to focus on choosing the right complexity parameter \(\alpha\). Instead of trying to decide which tree is best, we try to find the best complexity parameter \(\alpha\).

So, let's look at this...

 

Pruning by Cross-Validation

Let's consider V-fold cross-validation.

We will denote the original learning sample by L, which is divided randomly into V subsets, \(L_v, \;\;  v = 1, \cdots , V\). We will also let the training sample set in each fold be \(L^{(v)} = L - L_v\).

Next, the tree is grown on the original set and we call this \(T_{max}\). Then, we repeat this procedure for every fold in the cross-validation. So, V additional trees \(T^{(v)}_{max}\) are grown on \(L^{(v)}\).

For each value of the complexity parameter \(\alpha\), we will let \(T(\alpha)\) and \(T^{(v)}(\alpha), \;\; v = 1, \cdots , V\), be the corresponding minimal cost-complexity subtrees of \(T_{max}\) and \(T^{(v)}_{max}\).

For each maximum tree, we obtain a sequence of critical values for \(\alpha\) that are strictly increasing:

\(\alpha_1 < \alpha_2< \alpha_3 \cdots < \alpha_k < \cdots\)

Then, to find the corresponding minimal cost-complexity subtree at \(\alpha\), we will find \(\alpha_k\) from the list such that \(\alpha_k \leq \alpha < \alpha_{k+1}\). The optimal subtree corresponding to \(\alpha_k\) is the subtree for \(\alpha\).

The cross-validation error rate of \(T(\alpha)\) is computed by this formula:

\(R^{CV}(T(\alpha))=\frac{1}{V}\sum_{v=1}^{V}\frac{N^{(v)}_{miss}}{N^{(v)}}\)

where \(N^{(v)}\) is the number of samples in the test set \(L_v\) in fold v; and \(N^{(v)}_{miss}\) is the number of misclassified samples in \(L_v\) using the smallest minimizing subtree at \(\alpha\) , \(T^{(v)}(\alpha)\).

Remember: \(T^{(v)}(\alpha)\) is a pruned subtree of \(T^{(v)}_{max}\), which was trained on \(L^{(v)}\).

Although \(\alpha\) is continuous, there are only finitely many minimum cost-complexity trees grown on L. Consider each subtree obtained in the pruning of the tree grown on L. Let \(T_k = T(\alpha_{k})\). To compute the cross-validation error rate of \(T_k\), let \(\alpha'_k=\sqrt{\alpha_k \alpha_{k+1}}\).

Then compute the cross-validation error rate using the formula below:

\(R^{CV}(T_k)=R^{CV}(T(\alpha^{'}_{k}))\)

The \(T_k\) yielding the minimum cross-validation error rate is chosen.
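Putting the selection step together, here is a hedged Python sketch; fold_error, which stands for the stored per-fold misclassification counts, and the other names are illustrative assumptions rather than any particular package's API.

import math

def cv_error(fold_error, alpha_prime, n_folds):
    """R^CV(T(alpha')): average misclassification rate over the V hold-out sets."""
    rates = []
    for v in range(n_folds):
        n_miss, n_test = fold_error(v, alpha_prime)    # counts on L_v at this alpha'
        rates.append(n_miss / n_test)
    return sum(rates) / n_folds

def select_subtree(critical_alphas, fold_error, n_folds):
    """Pick the T_k whose alpha'_k = sqrt(alpha_k * alpha_{k+1}) minimizes R^CV."""
    best_k, best_err = None, math.inf
    for k in range(len(critical_alphas) - 1):
        alpha_prime = math.sqrt(critical_alphas[k] * critical_alphas[k + 1])
        err = cv_error(fold_error, alpha_prime, n_folds)
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

Here fold_error(v, alpha_prime) is assumed to return the pair \((N^{(v)}_{miss}, N^{(v)})\) for the subtree of \(T^{(v)}_{max}\) selected at alpha_prime, which corresponds to the per-fold error counts discussed under Computational Cost below.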

 

Computational Cost

How much computation is involved? Let's take a look at this when using V-fold cross validation.

  1. Grow \(V + 1\) maximum trees.
  2. For each of the \(V + 1\) trees, find the sequence of subtrees with minimum cost-complexity.
  3. Suppose the maximum tree grown on the original data set, \(T_{max}\), has K subtrees in its pruning sequence. Then, for each of the \(K - 1\) values \(\alpha^{'}_{k}\), compute the misclassification rate on each of the V test sample sets, average the error rates, and use the mean as the cross-validation error rate.
  4. Choose the best subtree of \(T_{max}\), the one with minimum \(R^{CV}(T_k)\).

The main computation occurs in the pruning. Once the trees and the subtrees are obtained, finding the best one among them is computationally light. For programming, it is recommended that, for every fold and every subtree, we compute the error rate of that subtree using the corresponding test data set for that fold and store it. This way, we can later easily compute the cross-validation error rate given any \(\alpha\).


11.8.4 - Related Methods for Decision Trees

Random Forests

Leo Breiman developed an extension of decision trees called random forests, and publicly available software exists for this method.

Random forests train multiple trees. To obtain a single tree, when splitting a node, only a randomly chosen subset of features is considered for thresholding. Leo Breiman did extensive experiments using random forests and compared them with support vector machines. He found that, overall, random forests seem to be slightly better. Moreover, random forests come with many other advantages.

Other extensions

The candidate questions in decision trees ask whether a variable is greater or smaller than a given value. There are some extensions to more complicated questions, or splitting methods, for instance, performing LDA at every node, but the original decision tree method seems to remain the most popular, and there is no strong evidence that the extensions are considerably better in general.

