10.5 - Estimating Misclassification Probabilities

When an unknown specimen is classified according to any decision rule, there is always a possibility that the specimen is wrongly classified. This is unavoidable: it is part of the inherent uncertainty in any statistical procedure. One way to evaluate a discriminant rule is to use it to classify the training data. Because we know which population each unit in the training data comes from, this gives us some idea of the validity of the discrimination procedure.

Confusion Table

The confusion table describes how the discriminant function will classify each observation in the data set. In general, the confusion table takes the form:

Classified As
Truth	1	2	\(\cdots\)	\(g\)	Total
1	\(n_{11}\)	\(n_{12}\)	\(\cdots\)	\(n_{1g}\)	\(n_{1\cdot}\)
2	\(n_{21}\)	\(n_{22}\)	\(\cdots\)	\(n_{2g}\)	\(n_{2\cdot}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
\(g\)	\(n_{g1}\)	\(n_{g2}\)	\(\cdots\)	\(n_{gg}\)	\(n_{g\cdot}\)
Total	\(n_{\cdot 1}\)	\(n_{\cdot 2}\)	\(\cdots\)	\(n_{\cdot g}\)	\(n_{\cdot \cdot}\)

Rows 1 through \(g\) are the \(g\) populations to which the items truly belong; the columns show how the items are classified. With insect species as the populations, \(n_{11}\) is the number of insects from species 1 that are correctly classified into species 1, while \(n_{12}\) is the number of insects from species 1 that are incorrectly classified into species 2. In general, \(n_{ij}\) is the number of units belonging to population \(i\) that are classified into population \(j\). Ideally, this matrix will be a diagonal matrix; in practice, we hope to see very small off-diagonal elements.

The row totals provide the number of individuals belonging to each of our populations or species in our training dataset. The column totals are the number classified into each of these species. The total number of observations in the dataset is \(n_{\cdot \cdot}\). The dot notation indicates summation: in the row totals we sum over the second subscript, whereas in the column totals we sum over the first subscript.

We will let:

\(p(i|j)\)

denote the probability that a unit from population \(\pi_j\) is classified into population \(\pi_i\). For \(i \ne j\), these are the misclassification probabilities. They are estimated by dividing the number of insects from population \(j\) that are misclassified into population \(i\) by the total number of insects in the sample from population \(j\):

\(\hat{p}(i|j) = \dfrac{n_{ji}}{n_{j\cdot}}\)

Note the order of the subscripts: the true population \(j\) indexes the row of the confusion table, and the assigned population \(i\) indexes the column.

Method 1: Resubstitution

The resubstitution method uses the discriminant functions computed from the entire data set to classify each observation in that same data set. Its confusion matrix is output by default in SAS.

From the SAS output, we obtain the following confusion table.

Classified As
Truth	\(a\)	\(b\)	Total
\(a\)	10	0	10
\(b\)	0	10	10
Total	10	10	20

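Here, none of the insects were misclassified, so all of the misclassification probabilities are estimated to be zero. Because the same observations are used both to build the rule and to evaluate it, resubstitution estimates tend to be too optimistic.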

Method 2: Set Aside Method

Step 1: Randomly partition the observations into two "halves".
Step 2: Use one "half" to obtain the discriminant function.
Step 3: Use the discriminant function from Step 2 to classify all members of the second "half" of the data, from which the proportion of misclassified observations is computed.

Advantage: This method yields unbiased estimates of the misclassification probabilities.

Problem: This does not make optimum use of the data, and so, estimated misclassification probabilities are not as precise as possible.
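
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are hypothetical one-variable measurements, the function names are made up for this example, and a nearest-sample-mean rule stands in for the discriminant function (with one variable, equal priors, and a common variance, the linear discriminant rule reduces to assigning a unit to the group with the closest sample mean).

```python
import random

def fit_group_means(train):
    """Step 2: 'fit' the rule by computing each group's sample mean."""
    totals = {}
    for x, label in train:
        s, n = totals.get(label, (0.0, 0))
        totals[label] = (s + x, n + 1)
    return {label: s / n for label, (s, n) in totals.items()}

def classify(means, x):
    """Assign x to the group whose sample mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

def set_aside_error(data, seed=0):
    """Return the proportion of set-aside observations that are misclassified."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)                 # Step 1: random partition
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    means = fit_group_means(train)        # Step 2: fit on one half
    wrong = sum(1 for x, label in test if classify(means, x) != label)
    return wrong / len(test)              # Step 3: evaluate on the other half

# Hypothetical measurements for two groups (not the insect data).
data = [(float(v), "a") for v in range(1, 11)] + \
       [(float(v), "b") for v in range(21, 31)]
print(set_aside_error(data))
```

Only half of the observations contribute to the fitted rule and only the other half to the error estimate, which is the loss of precision noted above.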

Method 3: Cross Validation

Step 1: Delete one observation from the data.
Step 2: Use the remaining observations to compute a discriminant function.
Step 3: Use the discriminant function from Step 2 to classify the observation removed in Step 1.
Step 4: Repeat Steps 1-3 for every observation, then compute the proportions of observations that are misclassified.
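
The leave-one-out procedure above can be sketched in Python as follows. As an assumption made purely for illustration, a one-variable nearest-sample-mean rule stands in for the discriminant function, the data are hypothetical, and the function names are invented for this example; the output is a confusion table of counts indexed as counts[truth][assigned].

```python
def group_means(sample):
    """Compute the sample mean of each group in a list of (x, label) pairs."""
    totals = {}
    for x, label in sample:
        s, n = totals.get(label, (0.0, 0))
        totals[label] = (s + x, n + 1)
    return {label: s / n for label, (s, n) in totals.items()}

def nearest_mean(means, x):
    """Assign x to the group whose sample mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

def leave_one_out_table(data):
    """Return counts[truth][assigned] from leave-one-out classification."""
    labels = sorted({label for _, label in data})
    counts = {t: {c: 0 for c in labels} for t in labels}
    for k, (x, truth) in enumerate(data):
        rest = data[:k] + data[k + 1:]               # delete observation k
        means = group_means(rest)                    # refit without it
        counts[truth][nearest_mean(means, x)] += 1   # classify the deleted unit
    return counts

# Hypothetical one-variable measurements (not the insect data).
data = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (6.5, "a"),
        (7.0, "b"), (8.0, "b"), (9.0, "b"), (10.0, "b")]
table = leave_one_out_table(data)
for truth, row in table.items():
    print(truth, row)
```

Each observation is classified by a rule that was fitted without it, so every unit is used for both fitting and evaluation, but never for both at the same time.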

Example 10-5: Insect Data

The confusion table for the cross-validation is

Classified As
Truth	\(a\)	\(b\)	Total
\(a\)	10	0	10
\(b\)	2	8	10
Total	12	8	20

Here, the estimated misclassification probabilities are:

\(\hat{p}(b|a) = \frac{0}{10} = 0.0\)

for insects belonging to species A, and

\(\hat{p}(a|b) = \frac{2}{10} = 0.2\)

for insects belonging to species B.
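
As a small check, the estimates above can be reproduced from the cross-validation confusion table with a few lines of Python. The function name and the nested-list layout (rows are the true species, columns are the assigned species, margins omitted) are choices made for this sketch.

```python
def misclassification_estimates(table):
    """Return est[j][i] = n_ji / n_j. for a confusion table of counts.

    table[j][i] is the number of units from population j (row = truth)
    that were classified into population i (column = assignment).
    """
    estimates = []
    for row in table:
        row_total = sum(row)  # n_j. : sample size from population j
        if row_total == 0:
            raise ValueError("each population needs at least one observation")
        estimates.append([count / row_total for count in row])
    return estimates

# Cross-validation counts from the table above: rows a, b; columns a, b.
cv_table = [
    [10, 0],
    [2, 8],
]
est = misclassification_estimates(cv_table)
print("p_hat(b|a) =", est[0][1])
print("p_hat(a|b) =", est[1][0])
```

The diagonal entries of the result are the estimated probabilities of correct classification, and each row sums to one.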

Specifying Unequal Priors

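Suppose that we have information (from prior experience or from another study) suggesting that 90% of the insects belong to Ch. concinna (species \(a\)), so that \(\hat{p}_a = 0.9\) and \(\hat{p}_b = 0.1\). Then the score functions for the unidentified specimen, with \(\log\) denoting the natural logarithm, are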

\begin{align} \hat{s}^L_a(\mathbf{x}) &= \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a}\\[10pt] &= 203.052 + \log{0.9} \\[10pt] &= 202.946\end{align}

and

\begin{align} \hat{s}^L_b(\mathbf{x}) &= \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} \\[10pt] &= 205.912 + \log{0.1} \\[10pt] &= 203.609\end{align}

In this case, we would still classify this specimen into Ch. heikertlingeri with posterior probabilities

\(p(\pi_a|\mathbf{x}) = 0.36\) and \(p(\pi_b|\mathbf{x}) = 0.64\)
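
The arithmetic behind the two score functions can be checked with a short Python sketch. The discriminant values and priors are the ones quoted above; the function and variable names are invented for this illustration, and the results may differ from the printed scores in the last decimal place because the inputs are already rounded.

```python
import math

def linear_scores(discriminants, priors):
    """Add the natural log of each prior to the matching discriminant value."""
    return {g: d + math.log(priors[g]) for g, d in discriminants.items()}

# Estimated linear discriminant values for the unidentified specimen.
d_hat = {"a": 203.052, "b": 205.912}
priors = {"a": 0.9, "b": 0.1}

scores = linear_scores(d_hat, priors)
assigned = max(scores, key=scores.get)  # classify into the largest score

for group, score in scores.items():
    print(f"s_{group} = {score:.3f}")
print("classified into species", assigned)
```

The prior of 0.1 lowers the score for species b by about 2.3, but not by enough to change the classification.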

These priors can be specified in SAS by adding the "priors" statement

priors "a" = 0.9 "b" = 0.1;

following the var statement. However, it should be noted that when the "priors" statement is added, SAS will include \(\log \hat{p}_i\) as part of the constant term. In other words, SAS outputs the estimated linear score function, not the estimated linear discriminant function.