11.8.2 - Minimal Cost-Complexity Pruning

As we just discussed, \(R(T)\) is not a good measure for selecting a subtree because it always favors bigger trees. We need to add a complexity penalty to the resubstitution error rate. The penalty term favors smaller trees and hence balances against \(R(T)\).

The definition for the cost-complexity measure:

For any subtree \(T \preceq T_{max}\), we will define its complexity as \(|\tilde{T}|\), the number of terminal or leaf nodes in \(T\). Let \(\alpha \geq 0\) be a real number called the complexity parameter and define the cost-complexity measure \(R_{\alpha}(T)\) as:

\(R_{\alpha}(T)=R(T)+\alpha|\tilde{T}|\)

The more leaf nodes a tree contains, the higher its complexity, because the tree partitions the space into more pieces and therefore has more flexibility for fitting the training data. There is also the issue of how much importance to put on the size of the tree; the complexity parameter \(\alpha\) adjusts that.

In the end, the cost-complexity measure is a penalized version of the resubstitution error rate. This is the function to be minimized when pruning the tree.
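To make the measure concrete, here is a minimal sketch in Python of computing \(R_\alpha(T)\) for a toy tree. The Node class and the stored error values are hypothetical illustrations, not part of any particular library:

# Minimal sketch: the cost-complexity measure R_alpha(T) = R(T) + alpha * |T~|.
# "Node" is a hypothetical container; each node stores its own error R(t).
class Node:
    def __init__(self, R, left=None, right=None):
        self.R, self.left, self.right = R, left, right

def leaves(t):
    """Return the terminal (leaf) nodes of the subtree rooted at t."""
    if t.left is None and t.right is None:
        return [t]
    return leaves(t.left) + leaves(t.right)

def cost_complexity(tree, alpha):
    """R(T) is the sum of leaf errors; the penalty is alpha times the leaf count."""
    terminal = leaves(tree)
    return sum(t.R for t in terminal) + alpha * len(terminal)

# Toy tree with two leaves: R(T) = 0.10 + 0.15 = 0.25.
tree = Node(0.30, Node(0.10), Node(0.15))
print(cost_complexity(tree, alpha=0.0))   # 0.25, the plain resubstitution error
print(cost_complexity(tree, alpha=0.05))  # 0.25 + 0.05 * 2 = 0.35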

Which subtree is selected eventually depends on \(\alpha\). If \(\alpha = 0\), then the biggest tree will be chosen because the complexity penalty term is essentially dropped. As \(\alpha\) approaches infinity, the tree of size 1, i.e., a single root node, will be selected.

In general, given a pre-selected \(\alpha\) , find the subtree \(T(\alpha)\) that minimizes \(R_{\alpha}(T)\), i.e.,

\(R_{\alpha}(T(\alpha))= \underset{T \preceq T_{max}}{\min}\ R_{\alpha}(T)\)

The minimizing subtree for any \(\alpha\) always exists since there are only finitely many subtrees.

Since there are at most finitely many subtrees of \(T_{max}\), \(R_{\alpha}(T(\alpha))\) yields different values for only finitely many \(\alpha\)'s. \(T(\alpha)\) continues to be the minimizing tree as \(\alpha\) increases, until a jump point is reached.

Two questions:

  1. Is there a unique subtree \(T \preceq T_{max}\) which minimizes \(R_{\alpha}(T)\)?
  2. In the minimizing sequence of trees \(T_1, T_2, \cdots\), is each subtree obtained by pruning upward from the previous subtree, i.e., does the nesting \(T_1 \succ T_2 \succ \cdots \succ \{t_1\}\) hold, where \(\{t_1\}\) is the root node?

If the optimal subtrees are nested, the computation will be a lot easier. We can first find \(T_1\), and then to find \(T_2\), we don't need to start again from the maximum tree, but can start from \(T_1\) (because \(T_2\) is guaranteed to be a subtree of \(T_1\)). In this way, as \(\alpha\) increases, we prune based on a smaller and smaller subtree.

Definition: The smallest minimizing subtree \(T(\alpha)\) for complexity parameter \(\alpha\) is defined by the conditions:

  1. \(R_{\alpha}(T(\alpha)) = \min_{T \preceq T_{max}} R_{\alpha}(T)\)
  2. If \(R_{\alpha}(T) = R_{\alpha}(T(\alpha))\), then \(T(\alpha) \preceq T\); that is, if another tree achieves the minimum at the same \(\alpha\), then it must contain the smallest minimizing subtree \(T(\alpha)\).

By definition (according to the second requirement above), if the smallest minimizing subtree \(T(\alpha)\) exists, it must be unique. Earlier we argued that a minimizing subtree always exists because there are only finitely many subtrees. Here we go one step further: we can prove that the smallest minimizing subtree always exists. This is not trivial to show, because one tree being smaller than another means the former is embedded in the latter; tree ordering is only a partial ordering.

The starting point for the pruning is not \(T_{max}\) , but rather \(T_1 = T(0)\), which is the smallest subtree of \(T_{max}\) satisfying:

\(R(T_1) = R(T_{max})\)

The way that you get \(T_1\) is as follows:

First, look at the biggest tree \(T_{max}\): for any two terminal nodes \(t_L\) and \(t_R\) descended from the same parent node t, if they yield the same resubstitution error rate as the parent, prune them off. That is, if \(R(t) = R(t_L) + R(t_R)\), prune off \(t_L\) and \(t_R\).

This process is applied recursively: after one pair of terminal nodes is pruned, the tree shrinks a little bit, and we repeat the same check on the smaller tree until no pair of terminal nodes satisfies the equality. The resulting tree at that point is \(T_1\).
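As an illustration, here is a minimal Python sketch of this step, assuming the same hypothetical Node representation as above, in which every node stores its own resubstitution error \(R(t)\):

# Sketch: obtain T_1 from T_max by collapsing every split whose two terminal
# children give no error reduction, i.e., R(t) = R(t_L) + R(t_R).
class Node:
    def __init__(self, R, left=None, right=None):
        self.R, self.left, self.right = R, left, right
    def is_leaf(self):
        return self.left is None and self.right is None

def prune_to_T1(t):
    """Post-order pass: children are pruned first, so collapses can cascade."""
    if t.is_leaf():
        return t
    t.left, t.right = prune_to_T1(t.left), prune_to_T1(t.right)
    if (t.left.is_leaf() and t.right.is_leaf()
            and t.left.R + t.right.R == t.R):  # with real data, compare up to a tolerance
        t.left = t.right = None                # prune off t_L and t_R
    return t

# Toy T_max: the left split gives no error reduction (0.1 + 0.1 = 0.2) and is
# collapsed; the right split (0.05 + 0.05 < 0.15) is kept.
t_max = Node(0.4,
             left=Node(0.2, Node(0.1), Node(0.1)),
             right=Node(0.15, Node(0.05), Node(0.05)))
T1 = prune_to_T1(t_max)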

Let's take a look at an example using a plot to see how \(T_{max}\) is obtained.

[Figure: an example maximal tree \(T_{max}\)]

We will use \(T_t\) to denote a branch rooted at node t. Then, for \(T_t\), we define \(R(T_t)\), the resubstitution error rate for this branch, by:

\(R(T_t)=\sum_{t' \in \tilde{T}_t}R(t')\)

where \(\tilde{T}_t\) is the set of terminal nodes of \(T_t\).

What we can prove is that, if t is any non-terminal or internal node of \(T_1\), then the branch \(T_t\) is guaranteed to have a strictly smaller resubstitution error rate than node t itself, that is, \(R(t) > R(T_t)\). If we prune off the branch at t, the resubstitution error rate will strictly increase.

 

Weakest-Link Cutting

The weakest-link cutting method not only finds the next \(\alpha\) which results in a different optimal subtree, but also finds that optimal subtree itself.

Remember, we previously defined \(R_\alpha\) for the entire tree. Here, we extend the definition to a node and then for a single branch coming out of a node.

For any node \(t \in T_1\), we can set \(R_\alpha(\{t\}) = R(t) + \alpha\), where \(\{t\}\) denotes the subtree consisting of the single node t.

Also, for any branch \(T_t\), we can define \(R_\alpha(T_t) = R(T_t) + \alpha|\tilde{T}_t|\).

We know that when \(\alpha = 0\), \(R_0(T_t) < R_0(\{t\})\) for any internal node t of \(T_1\), because \(R(T_t) < R(t)\) as shown above. The inequality continues to hold for sufficiently small \(\alpha > 0\).

If we gradually increase \(\alpha\), then because \(R_\alpha(T_t)\) increases faster with \(\alpha\) (its coefficient of \(\alpha\) is \(|\tilde{T}_t| > 1\), versus 1 for \(R_\alpha(\{t\})\)), at a certain \(\alpha\) (\(\alpha_1\) in the graph below) we will have \(R_\alpha(T_t) = R_\alpha(\{t\})\).

[Figure: \(R_\alpha(\{t\})\) and \(R_\alpha(T_t)\) plotted as linear functions of \(\alpha\); the two lines cross at \(\alpha_1\)]

If \(\alpha\) increases further, the inequality sign is reversed, and we have \(R_\alpha(T_t) > R_\alpha(\{t\})\). Some nodes may reach this equality earlier than others. The node that achieves equality at the smallest \(\alpha\) is called the weakest link. It is possible that several nodes achieve equality at the same time, in which case there are several weakest-link nodes.

Solving the inequality \(R_\alpha(T_t) < R_\alpha(\{t\})\) for \(\alpha\) gives

\(\alpha < \frac{R(t)-R(T_t)}{|\tilde{T}_t| -1}\)

The right-hand side is the ratio between the reduction in resubstitution error and the increase in complexity achieved by the branch, which is positive because both the numerator and the denominator are positive.
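For instance, with illustrative numbers: if \(R(t) = 0.2\), \(R(T_t) = 0.12\), and \(|\tilde{T}_t| = 5\), the branch remains preferable to the single node only while \(\alpha < (0.2 - 0.12)/(5 - 1) = 0.02\).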

Define a function \(g_1(t)\), \(t \in T_1\), by

\(g_1(t)=\left\{\begin{matrix}
\frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}, & t \notin \tilde{T}_1 \\
+\infty, & t \in \tilde{T}_1
\end{matrix}\right.\)

The weakest link \(\bar{t}_1\) in \(T_1\) achieves the minimum of \(g_1(t)\).

\(g_1(\bar{t}_1)=\underset{t \in T_1}{\text{ min }}g_1(t)\)

and put \(\alpha_2=g_1(\bar{t}_1)\). To get the optimal subtree corresponding to \(\alpha_2\), simply remove the branch growing out of \(\bar{t}_1\). As \(\alpha\) increases, \(\bar{t}_1\) is the first node at which the single node becomes preferable to the branch \(T_{\bar{t}_1}\) descended from it. If several nodes simultaneously achieve the minimum of \(g_1(t)\), we remove the branch grown out of each of them. \(\alpha_2\) is the first value after \(\alpha_1 = 0\) that yields an optimal subtree strictly smaller than \(T_1\); for all \(\alpha_1 \leq \alpha < \alpha_2\), the smallest minimizing subtree is \(T_1\).

Let \(T_2=T_1-T_{\bar{t}_1}\).

Repeat the previous steps: use \(T_2\) instead of \(T_1\) as the starting tree, find the weakest link in \(T_2\), and prune at all the weakest-link nodes to get the next optimal subtree:

\(\begin{align}
g_2(t) & = \left\{\begin{matrix}
\frac{R(t)-R(T_{2t})}{|\tilde{T}_{2t}|-1}, & t \in T_2, t \notin \tilde{T}_2 \\
+\infty, & t \in \tilde{T}_2
\end{matrix}\right. \\
g_2(\bar{t}_2) & = \underset{t \in T_2}{\min}\ g_2(t) \\
\alpha_3 & = g_2(\bar{t}_2) \\
T_3 & = T_2 - T_{\bar{t}_2}
\end{align}\)

where \(T_{2t}\) denotes the branch of \(T_2\) rooted at node t.
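The whole iteration fits in a short Python sketch. As before, the Node class and the stored error values are hypothetical illustrations, and we assume the starting tree is already \(T_1\):

# Sketch of weakest-link pruning: emit the sequence (alpha_k, size of T_k).
class Node:
    def __init__(self, R, left=None, right=None):
        self.R, self.left, self.right = R, left, right
    def is_leaf(self):
        return self.left is None and self.right is None

def leaves(t):
    return [t] if t.is_leaf() else leaves(t.left) + leaves(t.right)

def internal_nodes(t):
    return [] if t.is_leaf() else [t] + internal_nodes(t.left) + internal_nodes(t.right)

def g(t):
    """g_k(t) = (R(t) - R(T_t)) / (|T~_t| - 1) for an internal node t."""
    terminal = leaves(t)
    return (t.R - sum(l.R for l in terminal)) / (len(terminal) - 1)

def weakest_link_sequence(root):
    yield 0.0, len(leaves(root))                 # T_1 at alpha_1 = 0
    while not root.is_leaf():
        alpha = min(g(t) for t in internal_nodes(root))
        for t in internal_nodes(root):           # prune at every weakest link
            if g(t) == alpha:
                t.left = t.right = None
        yield alpha, len(leaves(root))

# Toy T_1: the inner node is the first weakest link, then the root.
T1 = Node(0.6, Node(0.2, Node(0.05), Node(0.05)), Node(0.25))
for a, size in weakest_link_sequence(T1):
    print(f"alpha = {a:.3f}, leaves = {size}")   # the alphas come out increasing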

 

Computation

In terms of computation, we need to store a few values at each node.

  • \(R(t)\), the resubstitution error rate for node t. This only needs to be computed once.
  • \(R(T_t)\), the resubstitution error rate for the branch coming out of node t. This may need to be updated after pruning because \(T_t\) may change.
  • \(|\tilde{T}_t|\), the number of leaf nodes of the branch coming out of node t. This may need to be updated after pruning.

 

\(R(t)\)

 

In order to compute the resubstitution error rate \(R(t)\), we need the proportion of data points in each class that land in node t. Suppose we estimate the class priors by the proportions of points in each class. As we grow the tree, we store the number of points that land in node t, as well as the number of points of each class that land in node t. Given those counts, we can easily estimate the probability of node t and the class posterior probabilities given that a data point is in node t. \(R(t)\) can then be calculated.
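As a small sketch (with hypothetical counts, and assuming unit misclassification costs so that \(R(t) = p(t)\,r(t)\), where \(r(t)\) is the within-node misclassification rate):

# Sketch: R(t) from the counts stored at node t, priors estimated from the data.
n_total = 200                          # training sample size
counts = {"A": 30, "B": 10}            # hypothetical per-class counts in node t

n_t = sum(counts.values())
p_t = n_t / n_total                    # estimated probability of reaching node t
posterior = {c: n / n_t for c, n in counts.items()}  # P(class | node t)
r_t = 1.0 - max(posterior.values())    # everything outside the majority class is misclassified
R_t = p_t * r_t                        # R(t) = p(t) * r(t)
print(R_t)                             # 0.2 * 0.25 = 0.05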

As for the other two quantities, (a) the resubstitution error rate for the branch coming out of node t and (b) the number of leaf nodes on that branch, both change after pruning, so they need to be updated; the number of leaf nodes will have been reduced. To be specific, we need to update these values for all of the ancestor nodes of the pruned branch.

A recursive procedure can be used to compute \(R(T_t)\) and \(|\tilde{T}_t|\).

To find the number of leaf nodes of the branch coming out of node t, we can do a bottom-up sweep through the tree. The number of leaf nodes for any node is equal to the number of leaf nodes of its left child plus the number of leaf nodes of its right child. A bottom-up sweep ensures that the count is computed for the child nodes before the parent node. Similarly, \(R(T_t)\) is equal to the sum of the values for the two child nodes of t, so a bottom-up sweep works here as well.
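A sketch of this sweep in Python, storing both quantities on each node of the same hypothetical Node class:

# Sketch: one bottom-up (post-order) sweep fills in |T~_t| and R(T_t) for
# every node, visiting children before parents.
class Node:
    def __init__(self, R, left=None, right=None):
        self.R, self.left, self.right = R, left, right

def sweep(t):
    if t.left is None and t.right is None:
        t.n_leaves, t.R_branch = 1, t.R                  # a leaf is its own branch
    else:
        sweep(t.left)
        sweep(t.right)
        t.n_leaves = t.left.n_leaves + t.right.n_leaves  # left child + right child
        t.R_branch = t.left.R_branch + t.right.R_branch  # sum over the two children

root = Node(0.4, Node(0.2, Node(0.10), Node(0.05)), Node(0.15))
sweep(root)
print(root.n_leaves, root.R_branch)    # 3 leaves, R(T_root) = 0.10 + 0.05 + 0.15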

Once we have the three values at every node, we compute the ratio \(g(t)\) and find the weakest link. The corresponding ratio at the weakest link is the new \(\alpha\). It is guaranteed that the sequence of \(\alpha\)'s obtained in the pruning process is strictly increasing.

If at any stage, there are multiple weakest links, for instance, if \(g_k(\bar{t}_k)=g_k(\bar{t}'_k)\), then define:

\(T_{k+1}=T_k - T_{\bar{t}_k}-T_{\bar{t}'_k}\)

In this case, the two branches are either nested or share no nodes. The pruning procedure gives a sequence of nested subtrees:

\(T_1 \succ T_2 \succ T_3 \succ \cdots \succ \{t_1\}\)

each tree embedded in the one before it.

 

Theorem for \(\alpha_k\)

The theorem states that the \(\{\alpha_k\}\) form an increasing sequence, that is, \(\alpha_k < \alpha_{k+1}\) for \(k \geq 1\), where \(\alpha_1 = 0\).

For any \(k \geq 1\) and \(\alpha_k \leq \alpha < \alpha_{k+1}\), the smallest optimal subtree \(T(\alpha) = T(\alpha_k) = T_k\), i.e., it is the same as the smallest optimal subtree for \(\alpha_k\).

Basically, this means that the smallest optimal subtree \(T_k\) stays optimal for all \(\alpha\)'s from \(\alpha_k\) up until \(\alpha\) reaches \(\alpha_{k+1}\). Although we have only a finite sequence of subtrees, they are optimal for a continuum of \(\alpha\).
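This sequence can be inspected directly in scikit-learn, whose decision trees implement minimal cost-complexity pruning; cost_complexity_pruning_path is the real API, while the iris data is just a convenient toy example:

# The effective alphas returned by scikit-learn's pruning path form an
# increasing sequence, paired with the total leaf impurity of each subtree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)    # alpha_1 = 0, alpha_2, ... in increasing order
print(path.impurities)    # total leaf impurity of each subtree T_k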

At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes very quickly. Then, as the tree becomes smaller, pruning slows down and the algorithm tends to cut off fewer nodes at a time. Let's look at an example.

 

Digital Recognition Example

\(T_1\) is the smallest optimal subtree for \(\alpha_1 = 0\); it has 71 leaf nodes. Next, by finding the weakest link, one step of pruning reduces the tree to size 63 (8 leaf nodes are pruned off in a single step). In the next step, five more leaf nodes are pruned off. From \(T_3\) to \(T_4\), the pruning is significant: 18 leaf nodes are removed. Towards the end, pruning becomes slower.

For classification purposes, we have to select a single \(\alpha\), or equivalently a single subtree, to use.
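A common way to do this (sketched below with scikit-learn; ccp_alpha is a real parameter of DecisionTreeClassifier) is to cross-validate over the candidate \(\alpha\)'s produced by the pruning path and keep the winner:

# Sketch: choose a single alpha by cross-validating over the candidate values.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

# Score each candidate with 5-fold cross-validation; keep the best-scoring alpha.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)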

