We may have missing values for some variables in some training sample points. For instance, gene-expression microarray data often have missing gene measurements.
Suppose each variable has a 5% chance of being missing, independently of the others. Then for a training data point with 50 variables, the probability that at least one variable is missing is \(1 - 0.95^{50} \approx 92.3\%\). In other words, we expect more than 90% of the data points to have at least one missing value. Therefore, we cannot simply throw away data points whenever missing values occur.
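The figure above can be checked with a few lines of Python. This is only a sketch of the arithmetic under the stated independence assumption; the function name `prob_any_missing` is made up for illustration.

```python
def prob_any_missing(p_missing: float, n_vars: int) -> float:
    """Probability that at least one of n_vars variables is missing,
    assuming each is missing independently with probability p_missing."""
    # P(at least one missing) = 1 - P(none missing)
    return 1.0 - (1.0 - p_missing) ** n_vars


p = prob_any_missing(0.05, 50)
print(f"{p:.1%}")  # prints 92.3%
```

Keeping only complete data points would therefore leave, on average, fewer than 8% of them.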
A test point to be classified may also have missing variables.
Classification trees handle missing values neatly through surrogate splits.
Suppose the best split for node t is s, which involves a question on \(X_m\). What should we do if this variable is missing for a data point? Classification trees tackle the issue by finding a replacement split. To find a split based on another variable, they look at all the splits on all the other variables and search for the one that divides the training data points most similarly to the optimal split. Along the same line of thought, a second-best surrogate split can be found in case both the best variable and its top surrogate variable are missing, and so on.
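The search described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular library: it assumes numeric variables, splits of the form "value <= threshold goes left", and measures similarity as the fraction of training points sent to the same side as the primary split. Allowing a surrogate to send its sides in reverse is also an assumption of this sketch. The name `find_surrogates` and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]  # None marks a missing value


def find_surrogates(rows: List[Row], primary_var: int,
                    primary_thr: float) -> List[Tuple[int, float, bool, float]]:
    """Rank one candidate split per other variable by agreement with the primary split.

    Returns tuples (variable index, threshold, reversed, agreement), best first.
    """
    surrogates = []
    for j in range(len(rows[0])):
        if j == primary_var:
            continue
        # Compare only on points where both variables are observed.
        pairs = [(r[primary_var], r[j]) for r in rows
                 if r[primary_var] is not None and r[j] is not None]
        values = sorted({cv for _, cv in pairs})
        best = None
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            same = sum((pv <= primary_thr) == (cv <= thr) for pv, cv in pairs)
            flipped = len(pairs) - same  # agreement if left and right are swapped
            agree = max(same, flipped) / len(pairs)
            if best is None or agree > best[3]:
                best = (j, thr, flipped > same, agree)
        if best is not None:
            surrogates.append(best)
    surrogates.sort(key=lambda s: s[3], reverse=True)
    return surrogates


# Toy data (hypothetical): three variables, primary split is "variable 0 <= 2.5".
data: List[Row] = [
    [1.0, 10.0, 5.0],
    [2.0, 12.0, 1.0],
    [3.0, 20.0, 4.0],
    [4.0, 22.0, 2.0],
    [1.5, 11.0, 3.0],
    [3.5, 21.0, 6.0],
    [None, 15.0, 3.0],  # primary variable missing: not used in the search
]

ranked = find_surrogates(data, primary_var=0, primary_thr=2.5)
for var, thr, rev, agree in ranked:
    print(f"variable {var}: threshold {thr}, reversed={rev}, agreement={agree:.2f}")
# Variable 1 at threshold 16.0 reproduces the primary split exactly on this data.
```

The ranked list gives the order in which surrogates would be tried: if variable 0 is missing for a point, the split on variable 1 is used, and if that is missing too, the split on variable 2.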
One thing to notice is that, to find the surrogate split, classification trees do not look for the second-best split in terms of the goodness measure. Instead, they try to approximate the result of the best split. The goal is to divide the data as similarly as possible to the best split, so that the subsequent decisions down the tree, which descend from the best split, remain meaningful. There is no guarantee that the second-best split divides the data similarly to the best split, even if their goodness measures are close.
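A small constructed example may make the distinction concrete. The data below are hypothetical and chosen so that the two criteria disagree; the sketch assumes binary variables (0 goes left, 1 goes right), the Gini index as the goodness measure, and the fraction of points sent to the same side as the measure of similarity.

```python
from collections import Counter
from typing import Dict, List


def gini(labels: List[str]) -> float:
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(feature: List[int], labels: List[str]) -> float:
    """Weighted Gini impurity after splitting on a 0/1 feature (lower is better)."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def agreement(f: List[int], g: List[int]) -> float:
    """Fraction of points two splits send to the same side, allowing a side swap."""
    same = sum(a == b for a, b in zip(f, g))
    return max(same, len(f) - same) / len(f)


# Hypothetical data: 4 points of class A, 2 of class B, 3 of class C.
labels = ["A"] * 4 + ["B"] * 2 + ["C"] * 3
features: Dict[str, List[int]] = {
    "X0": [0, 0, 0, 0, 1, 1, 1, 1, 1],  # separates A from B and C
    "X1": [0, 0, 0, 0, 0, 0, 1, 1, 1],  # separates C from A and B
    "X2": [0, 0, 0, 0, 1, 1, 0, 1, 1],  # X0 with one C point moved to the left
}

impurity = {name: split_impurity(f, labels) for name, f in features.items()}
best = min(impurity, key=impurity.get)
others = [name for name in features if name != best]

second_best = min(others, key=impurity.get)
surrogate = max(others, key=lambda name: agreement(features[name], features[best]))

print("best split:", best)
print("second-best by goodness:", second_best)
print("best surrogate by agreement:", surrogate)
```

Here X1 is the runner-up in terms of impurity but sends both class B points to the opposite side from X0, whereas X2 scores worse on impurity yet mimics X0 on 8 of the 9 points, so X2 is the better stand-in when X0 is missing.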