The impurity function is a function of the posterior probabilities of *k* classes. In this section, we answer the question, "How do we estimate these probabilities?"

Let's begin by introducing the notation *N*, the total number of samples. The number of samples in class *j*, \(1 \leq j \leq K\), is* *\(N_j\) . If we add up all the \(N_j\) data points, we get the total number of data points *N*.

We also denote the number of samples going to node *t* by \(N(t)\), and, the number of samples of class *j* going to node *t* by* *\(N_j(t)\).

Then for every node *t*, if we add up over different classes we should get the total number of points back:

\(\sum_{j=1}^{K} N_j(t) =N(t)\)

And, if we add the points going to the left and the points going the right child node, we should also get the number of points in the parent node.

\(N_j (t_L) + N_j(t_R) = N_j(t)\)

For a full tree (balanced), the sum of \(N(t)\) over all the node *t*'s at the same level is *N*.

Next, we will denote the prior probability of class *j* by \(\pi_j\) . The prior probabilities very often are estimated from the data by calculating the proportion of data in every class. For instance, if we want the prior probability for class 1, we simply compute the ratio between the number of points in class one and the total number of points, \(N_j / N\). These are the so-called empirical frequencies for the classes.

This is one way of getting priors. Sometimes the priors may be pre-given. For instance, in medical studies, researchers collect a large amount of data from patients who have a disease. The percentage of cases with the disease in the collected data may be much higher than that in the population. In this case, it is inappropriate to use the empirical frequencies based on the data. If the data is a random sample from the population, then it may be reasonable to use empirical frequency.

The estimated probability of a sample in class *j *going to node *t* is \(p(t | j) = N_j (t) / N_j\) . Obviously,

\(p(t_L | j) + p(t_R | j) = p(t | j)\)

Next, we can assume that we know how to compute \(p(t | j)\) and then we will find the joint probability of a sample point in class* j *and in node *t*.

The joint probability of a sample being in class *j* and going to node *t* is as follows:

\(p(j, t) = \pi_j p(t | j) = \pi_j N_j (t) / N_j \)

Because the prior probability is assumed known (or calculated) and \(p(t | j)\) is computed, the joint probability can be computed. The probability of any sample going to node *t* regardless of its class is:

\(p(t)=\sum_{j=1}^{K}p(j,t)=\sum_{j=1}^{K}\pi_jN_j(t)/N_j\)

Note: \(p(t_L) + p(t_R) = p(t)\).

Now, what we really need is \(p(j | t)\). That is if I know a point goes to node *t*, what is the probability this point is in class *j*.

(Be careful because we have flipped the condition and the event to compute the probability for!)

The probability of a sample being in class *j* given that it goes to node *t* is:

\(p( j | t ) = p(j, t) / p(t)\)

Probabilities on the right-hand side are both solved from the previous formulas.

For any *t*, \(\sum_{j=1}^{K}p(j|t)=1\).

There is a shortcut if the prior is not pre-given, but estimated by the empirical frequency of class* j* in the dataset!

When \(\pi_j = N_j / N\) , the simplification is as follows:

- calculate \(p( j | t ) = N_j (t) / N (t)\) - for all the points that land in node
*t*. - \(p(t) = N (t) / N\).
- \(p(j, t) = N_j (t) / N\) .

This is the shortcut equivalent to the previous approach.

### 3) Determining Stopping Criteria

When we grow a tree, there are two basic types of calculations needed. First, for every node, we compute the posterior probabilities for the classes, that is, \(p( j | t )\) for all *j *and *t*. Then we have to go through all the possible splits and exhaustively search for the one with the maximum goodness. Suppose we have identified 100 candidate splits (i.e., splitting questions), to split each node, 100 class posterior distributions for the left and right child nodes each are computed, and 100 goodness measures are calculated. In the end, one split is chosen and only for this chosen split, the class posterior probabilities in the right and left child nodes are stored.

For the moment let's assume we will leave off our discussion of pruning for later and that we will grow the tree until some sort of stopping criteria is met.

A simple criterion is as follows. We will stop splitting a node* t* when:

\(\underset{s \in S}{\text{ max }} \Delta I(s,t) <\beta\)

where \(\Delta I\) (defined before) is the decrease in the impurity measure weighted by the percentage of points going to node *t*, *s* is the optimal split and \(\beta\) is a pre-determined threshold.

We must note, however, the above stopping criterion for deciding the size of the tree is not a satisfactory strategy. The reason is that the tree growing method is greedy. The split at every node is 'nearsighted'. We can only look one step forward. A bad split in one step may lead to very good splits in the future. The tree growing method does not consider such cases.

This process of growing the tree is greedy because it looks only one step ahead as it goes. Going one step forward and making a bad decision doesn't mean that it is always going to end up bad. If you go a few steps more you might actually gain something. You might even perfectly separate the classes. A response to this might be that what if we looked two steps ahead? How about three steps? You can do this but it begs the question, "How many steps forward should we look?"

No matter how many steps we look forward, this process will always be greedy. Looking ahead multiple steps will not fundamentally solve this problem. This is why we need pruning.

### 4) Determining Class Assignment Rules

Finally, how do we decide which class to assign to each leaf node?

The decision tree classifies new data points as follows. We let a data point pass down the tree and see which leaf node it lands in. The class of the leaf node is assigned to the new data point. Basically, all the points that land in the same leaf node will be given the same class. This is similar to k-means or any prototype method.

A class assignment rule assigns a class \(j = {1, \cdots , K}\) to every terminal (leaf) node \(t \in \tilde{T}\). The class is assigned to node *t* . \(\tilde{T}\) is denoted by \(\kappa(t)\), e.g., if \(\kappa(t)= 2\), all the points in node *t* would be assigned to class 2.

If we use 0-1 loss, the class assignment rule is very similar to k-means (where we pick the majority class or the class with the maximum posterior probability):

\(\kappa(t)= \text{ arg }\underset{j}{\text{ max }}p(j|t)\)

Let's assume for a moment that I have a tree and have the classes assigned for the leaf nodes. Now, I want to estimate the classification error rate for this tree. In this case, we need to introduce the *resubstitution* estimate \(r(t)\) for the probability of misclassification, given that a case falls into node *t*. This is:

\(r(t)=1-\underset{j}{\text{ max }}p(j|t)=1-p(\kappa(t)|t)\)

Denote \(R(t) = r(t) p(t)\). The resubstitution estimation for the overall misclassification rate \(R(T)\) of the tree classifier *T* is:

\(R(T)=\sum_{t \in \tilde{T}}R(t)\)

One thing that we should spend some time proving is that **if we split a node ***t ***into child nodes, the misclassification rate is ensured to improve**. In other words, if we estimate the error rate using the resubstitution estimate, the more splits, the better. This also indicates an issue with estimating the error rate using the re-substitution error rate because it is always biased towards a bigger tree.

Let's go through the proof.

**Proposition**: For any split of a node *t* into \(t_L\) and \(t_R\),

\(R(t) \geq R(t_L) + R(t_R)\)

**Proof**: Denote \( j* = \kappa(t)\).

Let's take a look at being in class \(j*\) given that you are in node *t*.

\( \begin {align}p(j^* |t)& = p(j^*,t_L |t) + p(j^*,t_R |t) \\

& = p(j^*|t_L) p(t_L|t)+p(j^*|t_R) p(t_R|t) \\

& = p_Lp(j^*|t_L)+p_Rp(j^*|t_R) \\

& \le p_L\underset{j}{\text{ max }}p(j|t_L)+p_R\underset{j}{\text{ max }}p(j|t_R) \\

\end {align} \)

Hence,

\( \begin {align}r(|t)& = 1-p(j^*|t) \\

& \ge 1-\left( p_L\underset{j}{\text{ max }}p(j|t_L)+p_R\underset{j}{\text{ max }}p(j|t_R) \right) \\

& = p_L(1-\underset{j}{\text{ max }}p(j|t_L))+p_R(1-\underset{j}{\text{ max }}p(j|t_R)) \\

& = p_Lr(t_L)+p_Rr(t_R) \\

\end {align} \)

Finally,

\( \begin {align}R(t)& = p(t)r(t) \\

& \ge p(t)p_Lr(t_L)+p(t)p_Rr(t_R) \\

& = p(t_L)r(t_L)+p(t_R)r(t_R) \\

& = R(t_L)+R(t_R) \\

\end {align} \)