10.5  Estimating Misclassification Probabilities
When an unknown specimen is classified according to any decision rule, there is always a possibility that it is wrongly classified. This is unavoidable: it is part of the inherent uncertainty in any statistical procedure. One way to evaluate a discriminant rule is to use it to classify the training data. Because we know which population each unit in the training data comes from, this gives us some idea of the validity of the discrimination procedure.
Confusion Table
The confusion table describes how the discriminant function will classify each observation in the data set. In general, the confusion table takes the form:
Truth  1  2  \(\cdots\)  \(g\)  Total 

1  \(n_{11}\)  \(n_{12}\)  \(\cdots\)  \(n_{1g}\)  \(n_{1\cdot}\) 
2  \(n_{21}\)  \(n_{22}\)  \(\cdots\)  \(n_{2g}\)  \(n_{2\cdot}\) 
\(\vdots\)  \(\vdots\)  \(\vdots\)  \(\vdots\)  \(\vdots\)  \(\vdots\) 
\(g\)  \(n_{g1}\)  \(n_{g2}\)  \(\cdots\)  \(n_{gg}\)  \(n_{g\cdot}\) 
Total  \(n_{\cdot 1}\)  \(n_{\cdot 2}\)  \(\cdots\)  \(n_{\cdot g}\)  \(n_{\cdot \cdot}\) 
Rows 1 through \(g\) are the \(g\) populations to which the items truly belong; the columns show how the items are classified. For the insect data, \(n_{11}\) is the number of insects from species 1 that are correctly classified into species 1, whereas \(n_{12}\) is the number of insects from species 1 that are incorrectly classified into species 2. In general, \(n_{ij}\) is the number of units belonging to population \(i\) that are classified into population \(j\). Ideally this matrix would be diagonal; in practice we hope to see very small off-diagonal elements.
The row totals give the number of individuals belonging to each population or species in the training dataset. The column totals give the number classified into each species. The total number of observations in the dataset is \(n_{\cdot \cdot}\). The dot notation indicates summation: the row totals sum over the second subscript, whereas the column totals sum over the first subscript.
We will let
\(p(i|j)\)
denote the probability that a unit from population \(\pi_j\) is classified into population \(\pi_i\). For \(i \ne j\), these misclassification probabilities are estimated by the number of insects from population \(j\) that are misclassified into population \(i\), divided by the total number of insects in the sample from population \(j\):
\(\hat{p}(i|j) = \dfrac{n_{ji}}{n_{j\cdot}}\)
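As a minimal sketch of the estimate \(n_{ji}/n_{j\cdot}\), the Python snippet below turns a confusion table (rows = truth, columns = assigned population) into a table of estimated classification probabilities. The function name `misclassification_rates` and the three-population counts are hypothetical and serve only as an illustration; they do not come from the insect data.

```python
def misclassification_rates(table):
    """Return p_hat[i][j], the estimated probability that a unit
    from population j is classified into population i.

    table[j][i] holds n_{ji}: units truly from population j (row)
    that were classified into population i (column).
    """
    g = len(table)
    row_totals = [sum(row) for row in table]   # n_{j.}
    p_hat = [[0.0] * g for _ in range(g)]
    for j in range(g):        # true population
        for i in range(g):    # assigned population
            p_hat[i][j] = table[j][i] / row_totals[j]
    return p_hat


# Hypothetical counts for g = 3 populations (illustration only).
counts = [
    [18, 2, 0],
    [3, 15, 2],
    [0, 4, 16],
]
p_hat = misclassification_rates(counts)

# Indices are zero-based: p_hat[1][0] estimates p(2|1) = n_12 / n_1. = 2 / 20.
print(p_hat[1][0])
```

The diagonal entries `p_hat[j][j]` are the estimated probabilities of correct classification, so each column of `p_hat` (fixed true population) sums to one.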
Method 1: Resubstitution
The resubstitution method classifies each observation using the discriminant functions computed from the entire data set, including the observation being classified. Its confusion matrix is output by default in SAS.
From the SAS output, we obtain the following confusion table.
Truth  \(a\)  \(b\)  Total 

\(a\)  10  0  10 
\(b\)  0  10  10 
Total  10  10  20 
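Here, none of the insects were misclassified! The misclassification probabilities are all estimated to be zero. Because the same observations are used both to build and to evaluate the rule, these estimates tend to be optimistic.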
Method 2: Set Aside Method
Step 1: Randomly partition the observations into two "halves".
Step 2: Use one "half" to obtain the discriminant function.
Step 3: Use the discriminant function from Step 2 to classify all members of the second "half" of the data, from which the proportion of misclassified observations is computed.
Advantage: This method yields unbiased estimates of the misclassification probabilities.
Problem: This does not make optimum use of the data, and so, estimated misclassification probabilities are not as precise as possible.
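The three steps of the set aside method can be sketched in Python as follows. This is only an illustration under stated assumptions: a one-variable nearest-mean rule stands in for the discriminant function, and the function names, the `seed` parameter, and the measurements are all hypothetical.

```python
import random


def class_means(xs, labels):
    """Estimate one mean per class (stand-in for fitting a discriminant rule)."""
    means = {}
    for label in sorted(set(labels)):
        values = [x for x, l in zip(xs, labels) if l == label]
        means[label] = sum(values) / len(values)
    return means


def classify(means, x):
    """Assign x to the class whose estimated mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))


def set_aside_error(xs, labels, seed=0):
    """Estimate the misclassification proportion from a random half split."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    rng.shuffle(order)                       # Step 1: random partition
    half = len(order) // 2
    train, holdout = order[:half], order[half:]
    means = class_means([xs[i] for i in train],
                        [labels[i] for i in train])       # Step 2: fit on one half
    wrong = sum(classify(means, xs[i]) != labels[i]
                for i in holdout)                         # Step 3: classify the other half
    return wrong / len(holdout)


# Hypothetical one-variable measurements (illustration only).
xs = [1.0, 1.2, 1.4, 1.6, 2.15, 2.5, 2.7, 2.9, 3.1, 3.3]
labels = ["a"] * 5 + ["b"] * 5
print(set_aside_error(xs, labels, seed=1))
```

Only half of the observations are used to fit the rule and only the other half to evaluate it, which is why the resulting estimate is less precise than one that uses all of the data.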
Method 3: Cross Validation

Step 1: Delete one observation from the data.
Step 2: Use the remaining observations to compute a discriminant function.
Step 3: Use the discriminant function from Step 2 to classify the observation removed in Step 1.
Steps 1 through 3 are repeated for every observation, and the proportion of observations that are misclassified is then computed.
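The leave-one-out loop described above can be sketched in Python as follows, with the resubstitution estimate computed alongside for comparison. As an assumption made for brevity, a one-variable nearest-mean rule stands in for the discriminant function; the function names and the measurements are hypothetical.

```python
def class_means(xs, labels):
    """Estimate one mean per class (stand-in for fitting a discriminant rule)."""
    means = {}
    for label in sorted(set(labels)):
        values = [x for x, l in zip(xs, labels) if l == label]
        means[label] = sum(values) / len(values)
    return means


def classify(means, x):
    """Assign x to the class whose estimated mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))


def resubstitution_error(xs, labels):
    """Fit on all observations, then classify those same observations."""
    means = class_means(xs, labels)
    wrong = sum(classify(means, x) != l for x, l in zip(xs, labels))
    return wrong / len(xs)


def leave_one_out_error(xs, labels):
    """Refit the rule n times, each time classifying the held-out observation."""
    wrong = 0
    for k in range(len(xs)):
        # Step 1: delete observation k.
        train_x = xs[:k] + xs[k + 1:]
        train_l = labels[:k] + labels[k + 1:]
        # Step 2: compute the rule from the remaining observations.
        means = class_means(train_x, train_l)
        # Step 3: classify the deleted observation.
        if classify(means, xs[k]) != labels[k]:
            wrong += 1
    return wrong / len(xs)


# Hypothetical one-variable measurements (illustration only).
xs = [1.0, 1.2, 1.4, 1.6, 2.15, 2.5, 2.7, 2.9, 3.1, 3.3]
labels = ["a"] * 5 + ["b"] * 5
print(resubstitution_error(xs, labels))  # 0.0
print(leave_one_out_error(xs, labels))   # 0.1
```

In this toy data the observation 2.15 pulls the mean of its own class toward itself, so it is classified correctly when it helps fit the rule but incorrectly once it is left out.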
Example 10-5: Insect Data
The confusion table for the cross validation is
Truth  \(a\)  \(b\)  Total 

\(a\)  10  0  10 
\(b\)  2  8  10 
Total  12  8  20 
Here, the estimated misclassification probabilities are:
\(\hat{p}(b|a) = \frac{0}{10} = 0.0\)
for insects belonging to species \(a\), and
\(\hat{p}(a|b) = \frac{2}{10} = 0.2\)
for insects belonging to species \(b\).
Specifying Unequal Priors
Suppose that we have information (from prior experience or from another study) that suggests that 90% of the insects belong to Ch. concinna. Then the score functions for the unidentified specimen are
\begin{align} \hat{s}^L_a(\mathbf{x}) &= \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a}\\[10pt] &= 203.052 + \log{0.9} \\[10pt] &= 202.946\end{align}
and
\begin{align} \hat{s}^L_b(\mathbf{x}) &= \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} \\[10pt] &= 205.912 + \log{0.1} \\[10pt] &= 203.609\end{align}
In this case, we would still classify this specimen into Ch. heikertlingeri with posterior probabilities
\(p(\pi_a | \mathbf{x}) = 0.36\) and \(p(\pi_b | \mathbf{x}) = 0.64\)
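The arithmetic behind the score functions and posterior probabilities can be sketched in Python. The snippet adds the log prior to each estimated linear discriminant value and then converts the scores to posterior probabilities via \(\exp(\hat{s}_i) / \sum_j \exp(\hat{s}_j)\). The function names are hypothetical. Because the inputs are the rounded discriminant values printed above, the computed posteriors need not agree with the reported ones to two decimal places.

```python
import math


def linear_scores(discriminants, priors):
    """Add the log prior to each estimated linear discriminant value."""
    return {k: discriminants[k] + math.log(priors[k]) for k in discriminants}


def posteriors(scores):
    """Convert linear scores to posterior probabilities.

    The largest score is subtracted before exponentiating so that
    exp() does not overflow for scores in the hundreds.
    """
    top = max(scores.values())
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Rounded discriminant values and priors quoted in the text.
d_hat = {"a": 203.052, "b": 205.912}
priors = {"a": 0.9, "b": 0.1}

scores = linear_scores(d_hat, priors)
post = posteriors(scores)

print(scores)                      # linear score for each species
print(post)                        # posterior probability for each species
print(max(post, key=post.get))     # species with the largest posterior
```

Even with a prior of 0.9 on species \(a\), the larger discriminant value for species \(b\) outweighs the prior penalty \(\log 0.1\), so the specimen is still assigned to \(b\).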
These priors can be specified in SAS by adding the priors statement, priors "a" = 0.9 "b" = 0.1;, following the var statement. Note, however, that when the priors statement is added, SAS includes \(\log \hat{p}_i\) as part of the constant term. In other words, SAS then outputs the estimated linear score function, not the estimated linear discriminant function.