There are two approaches to choosing the best-pruned subtree:

- Use a test sample set

If we have a large test data set, we can compute the error rate using the test data set for all the subtrees and see which one achieves the minimum error rate. However, in practice, we rarely have a large test data set. Even if we have a large test data set, instead of using the data for testing, we might rather use this data for training in order to train a better tree. When data is scarce, we may not want to use too much for testing.

- Cross-validation

How to conduct cross-validation for trees when trees are unstable? If the training data vary a little bit, the resulting tree may be very different. Therefore, we would have difficulty to match the trees obtained in each fold with the tree obtained using the entire data set.

However, although we said that the trees themselves can be unstable, this does not mean that the classifier resulting from the tree is unstable. We may end up with two trees that look very different, but make similar decisions for classification. The key strategy in a classification tree is to focus on choosing the right complexity parameter α. Instead of trying to say which tree is best, a classification tree tries to find the best complexity parameter \(\alpha\).

So, let's look at this...

### Pruning by Cross-Validation

Let's consider *V-*fold cross-validation.

We will denote the original learning sample *L* which is divided randomly into *V* subsets, \(L_v, \;\; v = 1, \cdots , V\). We will also let the training sample set in each fold be \(L^{(v)} = L - L_v\).

Next, the tree is grown on the original set and we call this \(T_{max}\). Then, we repeat this procedure for every fold in the cross-validation. So, *V* additional trees \(T^{(v)}_{max}\) are grown on \(L^{(v)}\).

For each value of the complexity parameter \(\alpha\) , we will let \(T^{(v)}, L^{(v)}(\alpha), \;\; v = 1, \cdots , V\) , be the corresponding minimal cost-complexity subtrees of \(T_{max}T^{(v)}T_{max}\).

For each maximum tree, we obtain a sequence of critical values for \(\alpha\) that are strictly increasing:

\(\alpha_1 < \alpha_2< \alpha_3 \cdots < \alpha_k < \cdots\)

Then, to find the corresponding minimal cost-complexity subtree at \(\alpha\)* _{ }*, we will find \(\alpha_k\)

*from the list such that \(\alpha_k \leq \alpha_{k+1}\). The optimal subtree corresponding to \(\alpha_k\)*

_{ }*is the subtree for \(\alpha\).*

_{ }The cross-validation error rate of \(T(\alpha)\) is computed by this formula:

\(R^{CV}(T(\alpha))=\frac{1}{V}\sum_{v=1}^{V}\frac{N^{(v)}_{miss}}{N^{(v)}}\)

where \(N^{(v)}\) is the number of samples in the test set \(L_v\) in fold *v*; and \(N^{(v)}_{miss}\) is the number of misclassified samples in \(L_v\) using the smallest minimizing subtree at \(\alpha\) , \(T^{(v)}(\alpha)\).

Remember: \(T^{(v)}(\alpha)\) is a pruned tree of \(T^{(v)}_{max}\) trained from* *\(L_v\) *.*

Although α is continuous, there are only finitely many minimum cost-complexity trees grown on *L* . Consider each subtree obtained in the pruning of the tree grown on *L* . Let* *\(T_k = T(\alpha_{k})\). To compute the cross-validation error rate of \(T_k\) , let \(\alpha'_k=\sqrt{\alpha_k \alpha'_{k+1}}\).

Then compute the cross-validation error rate using the formula below:

\(R^{CV}(T_k)=R^{CV}(T(\alpha^{'}_{k})\)

The* *\(T_k\) yielding the minimum cross-validation error rate is chosen.

### Computational Cost

How much computation is involved? Let's take a look at this when using V-fold cross validation.

- Grow \(V + 1\) maximum trees.
- For each of the \(V + 1\) trees, find the sequence of subtrees with minimum cost-complexity.
- Suppose we have obtained the maximum tree grown on the original data set
*K*subtrees. Then, for each of the \( (K - 1) \alpha^{'}_{k}\), we would compute the misclassification rate of each of the*V*test sample set, average the error rates and use the mean as the cross-validation error rate. - Choose the best subtree of \(T_{max}\) with minimum \(R^{CV}(T_k\)).

The main computation occurs with pruning. Once the trees and the subtrees are obtained, to find the best one out of these is computationally light. For programming, it is recommended that under every fold and for every subtree, compute the error rate of this subtree using the corresponding test data set under that fold and store the error rate for that subtree. This way, later we can easily compute the cross-validation error rate given any \(\alpha\).