Lesson 10: Discriminant Analysis
Lesson 10: Discriminant AnalysisOverview
Discriminant analysis is a classification problem, where two or more groups or clusters or populations are known a priori and one or more new observations are classified into one of the known populations based on the measured characteristics. Let us look at three different examples.
Example 101: Swiss Banknotes
We have two populations of banknotes, genuine, and counterfeit. Six measures are taken on each note:
 Length
 RightHand Width
 LeftHand Width
 Top Margin
 Bottom Margin
 Diagonal across the printed area
Take a banknote of unknown origin and determine just from these six measurements whether or not it is real or counterfeit. Perhaps this is not as impractical as it might sound. A more modern equivalent is a scanner that would measure the notes automatically and makes a decision.
Example 102: Pottery Data
Pottery shards are sampled from four sites: L) Llanedyrn, C) Caldicot, I) Ilse Thornes, and A) Ashley Rails, and the concentrations of the following chemical constituents were measured at a laboratory
 Al: Aluminum
 Fe: Iron
 Mg: Magnesium
 Ca: Calcium
 Na: Sodium
An archaeologist encounters a pottery specimen of unknown origin. To determine possible trade routes, the archaeologist may wish to classify its site of origin.
Example 103: Insect Data
Data were collected on two species of insects in the genus Chaetocnema, (a) Ch. concinna and (b) Ch. heikertlingeri. Three variables were measured on each insect:
 width of the 1^{st} joint of the tarsus (legs)
 width of the 2^{nd} joint of the tarsus
 width of the aedeagus (reproductive organ)
Our objective is to obtain a classification rule for identifying the insect species based on these three variables. An entomologist can identify these two closely related species, but the differences are so subtle that one has to have considerable experience to be able to tell the difference. If a classification rule may be developed, then this might be a more accurate way to help differentiate between these two different species.
Objectives
 Determine whether linear or quadratic discriminant analysis should be applied to a given data set;
 Be able to carry out both types of discriminant analyses using SAS/Minitab;
 Be able to apply the linear discriminant function to classify a subject by its measurements;
 Understand how to assess the efficacy of discriminant analysis.
10.1  Bayes Rule and Classification Problem
10.1  Bayes Rule and Classification ProblemBayes’ Rule
Consider any two events A and B. To find \(P(BA)\), the probability that B occurs given that A has occurred, Bayes’ Rule states the following:
\(P(BA) = \dfrac{P(A \text{ and } B)}{P(A)}\)
This says that conditional probability is the probability that both A and B occur divided by the unconditional probability that A occurs. This is a simple algebraic restatement of a rule for finding the probability that two events occur together, which is \(P(A\ and\ B) = P(A)P(BA)\).
Bayes’ Rule Applied to the Classification Problem
We are interested in \(P(\pi_{i}  \boldsymbol{x})\), the conditional probability that observation came from population \(\pi_{i}\) given that the observed values of the multivariate vector of variables \(\boldsymbol{x}\). We will classify an observation to the population for which the value of P(\(\pi_{i}  \boldsymbol{x})\) is greatest. This is the most probable group given the observed values of \(\boldsymbol{x}\).
 Suppose that we have g populations (groups) and that the \(i^{th}\) population is denoted as \(\pi_{i}\).
 Let \(p_{i}=P(\pi_{i})\), be the probability that a randomly selected observation is in population \(\pi_{i}\).
 Let \(f(\boldsymbol{x}  \pi_{i}\)) be the conditional probability density function of the multivariate set of variables \(\boldsymbol{x}\), given that the observation came from population \(\pi_{i}\).
Using the notation of Bayes’ Rule above, event A = observing the vector \(\boldsymbol{x}\) and event B = observation came from population \(\pi_{i}\). Thus our probability of interest can be found as...
\(P(\text{member of } \pi_i  \text{ we observed } \mathbf{x}) = \dfrac{P(\text{member of } \pi_i \text{ and we observe } \mathbf{x})}{P(\text{we observe } \mathbf{x})}\)
 The numerator of the expression just given is the likelihood that a randomly selected observation is both from population \(\pi_{i}\) and has the value \(\boldsymbol{x}\). This likelihood = \(p_{i}f(\boldsymbol{x} \pi_{i})\).
 The denominator is the unconditional likelihood (overall populations) that we could observe \(\boldsymbol{x}\). This likelihood = \(\sum_{j=1}^{g} p_j f(\mathbf{x}\pi_j)\)
Thus the posterior probability that an observation is a member of population \(\pi_{i}\) is
\(p(\pi_i\mathbf{x}) = \dfrac{p_i f(\mathbf{x}\pi_i)}{\sum_{j=1}^{g}p_j f(\mathbf{x}\pi_j)}\)
The classification rule is to assign observation \(\boldsymbol{x}\) to the population for which the posterior probability is the greatest.
The denominator is the same for all posterior probabilities (for the various populations) so it is equivalent to say that we will classify an observation to the population for which \(p_{i}f (\boldsymbol{x}\)  \(\pi_{i})\) is greatest.
Two Populations
With only two populations we can express a classification rule in terms of the ratio of the two posterior probabilities. Specifically, we would classify to population 1 when
\(\dfrac{p_1 f(\mathbf{x}\pi_1)}{p_2 f(\mathbf{x}\pi_2)} > 1\)
This can be rewritten to say that we classify to population 1 when
\(\dfrac{ f(\mathbf{x}\pi_1)}{ f(\mathbf{x}\pi_2)} > \dfrac{p_2}{p_1}\)
Decision Rule
We are going to classify the sample unit or subject into the population \(\pi_{i}\) that maximizes the posterior probability p(\(\pi_{i}\)). that is the population that maximizes
\(f(\mathbf{x}\pi_i)p_i\)
We are going to calculate the posterior probabilities for each of the populations. Then we are going to assign the subject or sample unit to that population that has the highest posterior probability. Ideally, that posterior probability is going to be greater than half, the closer to 100% the better!
Equivalently we are going to assign it to the population that maximizes this product:
\(\log f(\mathbf{x}\pi_i)p_i\)
The denominator that appears above does not depend on the population because it involves summing over all the populations. Equivalently all we really need to do is to assign it to the population that has the largest for this product, or equivalently we can maximize the log of that product. A lot of times it is easier to write the log.
10.2  Discriminant Analysis Procedure
10.2  Discriminant Analysis ProcedureDiscriminant analysis is a 7step procedure.
Step 1: Collect training data
Training data are data with known group memberships. Here, we actually know which population contains each subject. For example, in the Swiss Bank Notes, we actually know which of these are genuine notes and which others are counterfeit examples.
Step 2: Prior Probabilities
The prior probability \(p_i\) represents the expected portion of the community that belongs to population \(\pi_{i}\). There are three common choices:
 Equal priors: \(\hat{p}_i = \frac{1}{g}\) This is useful if we believe that all of the population sizes are equal
 Arbitrary priors were selected according to the investigator's beliefs regarding the relative population sizes.
Note! We require:
\(\hat{p}_1 + \hat{p}_2 + \dots + \hat{p}_g = 1\)
 Estimated priors:
\(\hat{p}_i = \dfrac{n_i}{N}\)
where \(n_{i}\) is the number observations from population \(\pi_{i}\) in the training data, and \(N = n _ { 1 } + n _ { 2 } + \ldots + n _ { g }\)
Step 3: Bartlett's test
Use Bartlett’s test to determine if the variancecovariance matrices are homogeneous for all populations involved. The result of this test will determine whether to use Linear or Quadratic Discriminant Analysis.:
Linear discriminant analysis is for homogeneous variancecovariance matrices:
\(\Sigma_1 = \Sigma_2 = \dots = \Sigma_g = \Sigma\)
In this case, the variancecovariance matrix does not depend on the population.
Quadratic discriminant analysis is used for heterogeneous variancecovariance matrices:
\(\Sigma_i \ne \Sigma_j\) for some \(i \ne j\)
This allows the variancecovariance matrices to depend on the population.
Step 4: Estimate the parameters of the conditional probability density functions \(f ( \mathbf{X} \pi_{i})\).
Here, we shall make the following standard assumptions:
 The data from group i has common mean vector \(\boldsymbol{\mu_i}\)
 The data from group i have a common variancecovariance matrix \(\Sigma\).
 Independence: The subjects are independently sampled.
 Normality: The data are multivariate normally distributed.
Step 5: Compute discriminant functions.
This is the rule to classify the new object into one of the known populations.
Step 6: Use crossvalidation to estimate misclassification probabilities.
As in all statistical procedures, it is helpful to use diagnostic procedures to assess the efficacy of the discriminant analysis. We use crossvalidation to assess the classification probability. Typically you are going to have some prior rule as to what is an acceptable misclassification rate. Those rules might involve things like, "what is the cost of misclassification?" This could come up in a medical study where you might be able to diagnose cancer. There are really two alternative costs. The cost of misclassifying someone as having cancer when they don't. This could cause a certain amount of emotional grief! There is also the alternative cost of misclassifying someone as not having cancer when in fact they do have it. The cost here is obviously greater if early diagnosis improves cure rates.
Step 7: Classify observations with unknown group memberships.
The procedure described above assumes that the unit or subject being classified actually belongs to one of the considered populations. If you have a study where you look at two species of insects, A and B, and the insect to classify actually belongs to species C, then it will obviously be misclassified as to belonging to either A or B.
10.3  Linear Discriminant Analysis
10.3  Linear Discriminant AnalysisWe assume that in population \(\pi_{i}\) the probability density function of \(\boldsymbol{x}\) is multivariate normal with mean vector \(\boldsymbol{\mu}_{i}\) and variancecovariance matrix \(\Sigma\) (same for all populations). As a formula, this is...
\(f(\mathbf{x}\pi_i) = \dfrac{1}{(2\pi)^{p/2}\mathbf{\Sigma}^{1/2}}\exp\left(\frac{1}{2}\mathbf{(x\mu_i)'\Sigma^{1}(x\mu_i)}\right)\)
We classify to the population for which \(p _ { i } f ( \mathbf { x }  \pi _ { i } )\) ) is largest.
Because a log transform is monotonic, this is equivalent to classifying an observation to the population for which log( \(p _ { i } f ( \mathbf { x }  \pi _ { i } )\) )) is largest.
Linear discriminant analysis is used when the variancecovariance matrix does not depend on the population. In this case, our decision rule is based on the Linear Score Function, a function of the population means for each of our g populations, \(\boldsymbol{\mu}_{i}\), as well as the pooled variancecovariance matrix.

Linear Score Function
 The Linear Score Function is:
 \(s^L_i(\mathbf{X}) = \dfrac{1}{2}\mathbf{\mu'_i \Sigma^{1}\mu_i + \mu'_i \Sigma^{1}x}+ \log p_i = d_{i0}+\sum_{j=1}^{p}d_{ij}x_j + \log p_i\)

where
\(d_{i0} = \dfrac{1}{2}\mathbf{\mu'_i\Sigma^{1}\mu_i}\)
\(d_{ij} = j\text{th element of } \mu'_i\Sigma^{1}\)
The far lefthand expression resembles a linear regression with intercept term d_{i}_{0} and regression coefficients d_{ij}.

Linear Discriminant Function

\(d^L_i(\mathbf{x}) = \dfrac{1}{2}\mathbf{\mu'_i\Sigma^{1}\mu_i + \mu'_i\Sigma^{1}x} = d_{i0} + \sum_{j=1}^{p}d_{ij}x_j\)
\(d_{i0} = \dfrac{1}{2}\mathbf{\mu'_i\Sigma^{1}\mu_i}\)
Given a sample unit with measurements \(x _ { 1 }, x _ { 2 }, \dots, x _ { p }\), we classify the sample unit into the population that has the largest Linear Score Function. This is equivalent to classifying the population for which the posterior probability of membership is the largest. The linear score function is computed for each population, then we plug in our observation values and assign the unit to the population with the largest score.
However, this is a function of unknown parameters, \(\boldsymbol{\mu}_{i}\) and \(\Sigma\). So, these must be estimated from the data.
Discriminant analysis requires estimates of:
\(p_i = \text{Pr}(\pi_i);\) \(i = 1, 2, \dots, g\)
\(\mathbf{\mu_i} = E(\mathbf{X}\pi_i)\); \(i = 1, 2, \dots, g\)
\(\Sigma = \text{var}(\mathbf{X} \pi_i)\); \(i = 1, 2, \dots, g\)
 Prior probabilities:
 The population means are estimated by the sample mean vectors:
 The variancecovariance matrix is estimated by using the pooled variancecovariance matrix:
Typically, these parameters are estimated from training data, in which the population membership is known.
Conditional Density Function Parameters
Population Means: \(\boldsymbol{\mu}_{i}\) is estimated by substituting in the sample means \(\bar{\mathbf{x}}_i\).
VarianceCovariance matrix: Let S_{i} denote the sample variancecovariance matrix for population i. Then the variancecovariance matrix \(Σ\) is estimated by substituting in the pooled variancecovariance matrix into the Linear Score Function as shown below:
\(\mathbf{S}_p = \dfrac{\sum_{i=1}^{g}(n_i1)\mathbf{S}_i}{\sum_{i=1}^{g}(n_i1)}\)
to obtain the estimated linear score function:
\(\hat{s}^L_i(\mathbf{x}) = \frac{1}{2}\mathbf{\bar{x}'_i S^{1}_p \bar{x}_i +\bar{x}'_i S^{1}_p x } + \log{\hat{p}_i} = \hat{d}_{i0} + \sum_{j=1}^{p}\hat{d}_{ij}x_j + \log{p}_i\)
where
\(\hat{d}_{i0} = \dfrac{1}{2}\mathbf{\bar{x}'_i S^{1}_p \bar{x}_i} \)
and
\( \hat{d}_{ij} = j^{th} \text{ element of} \ \ \bar{x}'_iS^{1}_p \)
This is a function of the sample mean vectors, the pooled variancecovariance matrix, and prior probabilities for g different populations. This is written in a form that looks like a linear regression formula with an intercept term plus a linear combination of response variables, plus the natural log of the prior probabilities.
10.4  Example: Insect Data
10.4  Example: Insect DataExample 104: Insect Data
Data were collected on two species of insects in the genus Chaetocnema, (species a) Ch. concinna and (species b) Ch. heikertlingeri. Three variables were measured on each insect:
 \(X_1\) = Width of the 1^{st} joint of the tarsus (legs)
 \(X_2\) = Width of the 2^{nd} joint of the tarsus
 \(X_3\) = Width of the aedeagus (reproductive organ)
We have ten individuals of each species to make up training data. Data on these ten individuals of each species are used to estimate the model parameters which we will use in the linear score function.
Our objective is to obtain a classification rule for identifying the insect species from these three variables.
Let's begin...
Step 1
Collect the training data. (described above)
Step 2
Specify the prior probabilities. In this case, we do not have any information regarding the relative abundances of the two species. Without any information in order to help specify prior probabilities, equal priors are selected:
\[\hat{p}_1 = \hat{p}_2 = \dfrac{1}{2}\]
Step 3
Test for homogeneity of the variancecovariance matrices using Bartlett's test.
Download the text file containing the data here: insect.csv
Here we will use the SAS program as shown below:
Download the SAS program here: insect.sas
Note: In the upper righthand corner of the code block you will have the option of copying ( ) the code to your clipboard or downloading ( ) the file to your computer.
options ls=78;
title "Discriminant Analysis  Insect Data";
data insect;
infile "D:\Statistics\STAT 505\data\insect.csv" firstobs=2 delimiter=',';
input species $ joint1 joint2 aedeagus;
run;
/* A new data set called 'test' is created to store any new
* values to be classified with our discriminant rule.
* The variables must match the quantitative ones in the training set.
*/
data test;
input joint1 joint2 aedeagus;
cards;
194 124 49
; run;
/* The pool option conducts a test of equal covariance matrices.
* If the results of the test are insignificant (at the 0.10 level), the
* sample covariance matrices are pooled, resulting in a linear discriminant
* function; otherwise, the sample covariance matrices are not pooled,
* resulting in a quadratic discriminant function.
* The crossvalidate option calculates the confusion matrix based on
* the holdout method, where each obs is classified from the other obs only.
* The testdata= option specifies the data set with obs to be classified.
* The testout= option specifies the name of the data set where classification
* results are stored.
* The class statement specifies the variable with groups for classification.
* The var statement specifies the quantitative variables used to estimate
* the mean and covariance matrices of the groups.
*/
proc discrim data=insect pool=test crossvalidate testdata=test testout=a;
class species;
var joint1 joint2 aedeagus;
run;
/* This will print the results of the classifications of the obs
* from the 'test' data set.
*/
proc print data=a;
run;
Performing a discriminant analysis (insect data)
To perform linear discriminant analysis with equal prior probabilities:
 Open the ‘insect’ data set in a new worksheet.
 Stat > Multivariate > Discriminant Analysis
 Highlight and select ‘species’ to move it to the Groups window.
 Highlight and select ‘joint1’, ‘joint2’, and ‘aedeagus’ to move them to the Predictors window
 Select 'OK'. The results are displayed in the results area.
No significant difference between the variancecovariance matrices for the two species (L' = 9.83; d.f. = 6; p = 0.132) is found. Thus the linear discriminant analysis is appropriate for the data.
Step 4
Estimate the parameters of the conditional probability density functions, i.e., the population mean vectors and the population variancecovariance matrices involved. It turns out that all of this is done automatically in the discriminant analysis procedure.
Step 5
The linear discriminant functions for the two species can be obtained directly from the SAS or Minitab output.
Species  Function 

Ch. concinna 
\(\widehat { d } _ { a } ^ { L } ( \mathbf { x } ) =  247.276  1.417 x _ { 1 } + 1.520 x _ { 2 } + 10.954 x _ { 3 }\) 
Ch. heikertlingeri  \(\widehat { d } _ { b } ^ { L } ( \mathbf { x } ) =  193.178  0.738 x _ { 1 } + 1.113 x _ { 2 } + 8.250 x _ { 3 }\) 
Step 6
We will discuss this step in Lesson 10.5.
Step 7
Now, consider an insect with the following measurements. Which species does this belong to?
Variable  Measurement 

Joint 1  194 
Joint 2  124 
Aedeagus  49 
These are responses for the first three variables. The linear discriminant function for species a is obtained by plugging in the values for these three measurements into the equation for species (a):
\(\hat{d}^{L}_a(\textbf{x}) = 247.276  1.417 \times 194 + 1.520 \times 124 + 10.954 \times 49 = 203.052\)
and then for species (b):
\(\hat{d}^{L}_b(\textbf{x}) = 193.178  0.738 \times 194 + 1.113 \times 124 + 8.250 \times 49 = 205.912\)
Add in a log of .5 to obtain the linear score function for species (a):
\(\hat{s}^L_a(\mathbf{x}) = \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a} = 203.052 + \log{0.5} = 202.359\)
and then for species (b):
\(\hat{s}^L_b(\mathbf{x}) = \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} = 205.912 + \log{0.5} = 205.219\)
Conclusion
According to the classification rule the insect is classified into the species with the highest linear discriminant function. Because \(\hat{s}^L_b(\mathbf{x}) > \hat{s}^L_a(\mathbf{x})\), we conclude that the insect belongs to species (b) Ch. heikertlingeri.
Of course, the addition of the log of .5 does not make any difference. Whether we classify on the basis of \(\hat{d}^L_b(\mathbf{x})\) or on the basis of the score function, the decision will remain the same. In case the priors are not equal, this would not hold.
You can think of the priors as a 'penalty' in some sense. If you have a higher prior probability of a given species you will give it very little 'penalty' because you will be taking the log of a number close to one which is not going to subtract much. On the other hand, if there is a low prior probability, then the log of a very small number results in a larger reduction.
Posterior Probabilities
You can also calculate the posterior probabilities. These are used to measure uncertainty regarding the classification of a unit from an unknown group. They will give us some indication of our confidence in our classification of individual subjects.
In this case, the estimated posterior probability that the insect belongs to species (a) Ch. concinna given the observed measurements is:
\begin{align} p(\pi_a\mathbf{x}) &= \frac{\exp\{\hat{s}^L_a(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{202.359\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.05\end{align}
This is a function of the linear score functions for the two species. Here we are looking at the exponential function of the linear score function for species (a) divided by the sum of the exponential functions of the score functions for species (a) and species (b). Using the numbers obtained earlier, this equals 0.05.
Similarly, for species (b), the estimated posterior probability that the insect belongs to Ch. heikertlingeri is:
\begin{align} p(\pi_b\mathbf{x}) &= \frac{\exp\{\hat{s}^L_b(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{205.219\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.95\end{align}
In this case, we are 95% confident that the insect belongs to species (b). This is a pretty high level of confidence with a 5% chance that we might be in error in this classification. You would have to decide what is an acceptable error rate here. For the classification of insects, this might be perfectly acceptable, however, in some situations, it might not be acceptable. For example, looking at the cancer case that we talked about earlier where we were trying to classify someone as having cancer or not having cancer, it may not be acceptable to have a 5% error rate. This is an ethical decision. It is a decision that has nothing to do with statistics and must be tailored to the situation at hand.
10.5  Estimating Misclassification Probabilities
10.5  Estimating Misclassification ProbabilitiesWhen an unknown specimen is classified according to any decision rule, there is always a possibility that the specimen is wrongly classified. This is unavoidable. This is part of the inherent uncertainty in any statistical procedure. One procedure to evaluate the discriminant rule is to classify the training data according to the developed discrimination rule. Because we know which unit comes from which population among the training data, this will give us some idea of the validity of the discrimination procedure.
Confusion Table
The confusion table describes how the discriminant function will classify each observation in the data set. In general, the confusion table takes the form:
Truth  1  2  \(\cdots\)  \(g\)  Total 

1  \(n_{11}\)  \(n_{12}\)  \(\cdots\)  \(n_{1g}\)  \(n_{1\cdot}\) 
2  \(n_{21}\)  \(n_{22}\)  \(\cdots\)  \(n_{2g}\)  \(n_{2\cdot}\) 
\(\vdots\)  \(\vdots\)  \(\vdots\)  \(\vdots\)  \(\vdots\)  \(\vdots\) 
\(g\)  \(n_{g1}\)  \(n_{g2}\)  \(\cdots\)  \(n_{gg}\)  \(n_{g\cdot}\) 
Total  \(n_{\cdot 1}\)  \(n_{\cdot 2}\)  \(\cdots\)  \(n_{\cdot g}\)  \(n_{\cdot \cdot}\) 
Rows 1 through g are g populations to which the items truly belong. Across the columns, we are looking at how they are classified. \(n_{11}\) is the number of insects correctly classified in species (1). But \(n_{12}\) is the number of insects incorrectly classified into species (2). In this case \(n_{ij}\) = the number belonging to population i classified into population j. Ideally, this matrix will be a diagonal matrix; in practice, we hope to see very small offdiagonal elements.
The row totals provide the number of individuals belonging to each of our populations or species in our training dataset. The column totals are the number classified into each of these species. The total number of observations in the dataset is n... The dot notation is used here in the row totals for summing over the second subscript, whereas in the column totals, we are summing over the first subscript.
We will let:
\(p(ij)\)
denote the probability that a unit from population π_{j} is classified into population π_{i}. These misclassification probabilities are estimated by taking the number of insects from population j that are misclassified into population i divided by the total number of insects in the sample from population j as shown here:
\(\hat{p}(ij) = \dfrac{n_{ji}}{n_{j.}}\)
These are the misclassification probabilities.
Method 1: Resubstitution
The resubstitution method uses the same set of discriminate functions computed from the entire data set to classify each observation. Its confusion matrix is output by default in SAS.
From the SAS output, we obtain the following confusion table.
Truth  \(a\)  \(b\)  Total 

\(a\)  10  0  10 
\(b\)  0  10  10 
Total  10  10  20 
Here, none of the insects were misclassified! The misclassification probabilities are all estimated equal to zero.
Method 2: Set Aside Method
 Step 1: Randomly partition the observations into two ”halves”
 Step 2: Use one ”half” to obtain the discriminant function.
 Step 3: Use the discriminant function from Step 2 to classify all members of the second ”half” of the data, from which the proportion of misclassified observations is computed.
Advantage: This method yields unbiased estimates of the misclassification probabilities.
Problem: This does not make optimum use of the data, and so, estimated misclassification probabilities are not as precise as possible.
Method 3: Cross Validation

Step 1: Delete one observation from the data.
 Step 2: Use the remaining observations to compute a discriminant function.
 Step 3: Use the discriminant function from Step 2 to classify the observation removed in Step 1. Steps 13 are repeated for all observations; compute the proportions of observations that are misclassified.
Example 105: Insect Data
The confusion table for the crossvalidation is
Truth  \(a\)  \(b\)  Total 

\(a\)  10  0  10 
\(b\)  2  8  10 
Total  12  8  20 
Here, the estimated misclassification probabilities are:
\(\hat{p}(ba) = \frac{0}{10} = 0.0\)
for insects belonging to species A, and
\(\hat{p}(ab) = \frac{2}{10} = 0.2\)
for insects belonging to species B.
Specifying Unequal Priors
Suppose that we have information (from prior experience or from another study) that suggests that 90% of the insects belong to Ch. concinna. Then the score functions for the unidentified specimen are
\begin{align} \hat{s}^L_a(\mathbf{x}) &= \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a}\\[10pt] &= 203.052 + \log{0.9} \\[10pt] &= 202.946\end{align}
and
\begin{align} \hat{s}^L_b(\mathbf{x}) &= \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} \\[10pt] &= 205.912 + \log{0.1} \\[10pt] &= 203.609\end{align}
In this case, we would still classify this specimen into Ch. heikertlingeri with posterior probabilities
\(p(\pi_a\mathbf{x}) = 0.36\) and \(p(\pi_b\mathbf{x}) = 0.64\)
These priors can be specified in SAS by adding the ”priors” statement: priors ”a” = 0.9 ”b” = 0.1; following the var statement. However, it should be noted that when the "priors" statement is added, SAS will include log p_{i} as part of the constant term. In other words, SAS outputs the estimated linear score function, not the estimated linear discriminant function.
10.6  Quadratic Discriminant Analysis
10.6  Quadratic Discriminant AnalysisLinear Discriminant Analysis is for homogeneous variancecovariance matrices. However, not all cases come from such simplified situations. Quadratic Discriminant Analysis is used for heterogeneous variancecovariance matrices:
\(\Sigma_i \ne \Sigma_j\) for some \(i \ne j\)
Again, this allows the variancecovariance matrices to depend on the population.
Quadratic discriminant analysis calculates a Quadratic Score Function:
\(s^Q_i (\mathbf{x}) = \frac{1}{2}\log{\mathbf{\Sigma_i}}\frac{1}{2}{\mathbf{(x\mu_i)'\Sigma^{1}_i(x  \mu_i)}}+\log{p_i}\)
This is a function of population mean vectors and the variancecovariance matrices for the ith group. Similarly, we will determine a separate quadratic score function for each of the groups.
This is of course a function of the unknown population mean vector for group i and the variancecovariance matrix for group i. These will have to be estimated from the training data. As before, we replace the unknown values of \(\boldsymbol{\mu_i}\),\(\mathbf{\Sigma_i}\), and \(p_i\) by their estimates to obtain the estimated quadratic score function as shown below:
\(\hat{s}^Q_i (\mathbf{x}) = \frac{1}{2}\log{\mathbf{S_i}}\frac{1}{2}{\mathbf{(x\bar{x}_i)'S^{1}_i(x  \bar{x}_i)}}+\log{p_i}\)
All natural logs are used in this function.
\(\hat{s}^Q_i (\mathbf{x}) = \frac{1}{2}\log{\mathbf{S_i}}\frac{1}{2}{\mathbf{(x\bar{x})'S^{1}_i(x \bar{x})}}+\log{p_i}\)
Let's illustrate this using the Swiss Banknotes example...
10.7  Example: Swiss Banknotes
10.7  Example: Swiss BanknotesExample 106: Swiss Banknotes
Recall that we have two populations of notes, genuine and counterfeit, and that six measurements were taken on each note:
 Length
 RightHand Width
 LeftHand Width
 Top Margin
 Bottom Margin
 Diagonal
Priors
In this case, it would not be reasonable to consider equal priors for the two types of banknotes. Equal priors would assume that half the banknotes in circulation are counterfeit and half are genuine. This is a very high counterfeit rate and if it was that bad the Swiss government would probably be bankrupt! We need to consider unequal priors in which the vast majority of banknotes are thought to be genuine. For this example let us assume that no more than 1% of banknotes in circulation are counterfeit and 99% of the notes are genuine. The prior probabilities can then be expressed as:
\(\hat{p}_1 = 0.99\) and \(\hat{p}_2 = 0.01\)
The first step in the analysis is going to carry out Bartlett's test to check for homogeneity of the variancecovariance matrices.
Download the text file with the data here: swiss3.csv
To do this we will use the SAS program shown below:
Download the SAS program here: swiss9.sas
View the video explanation of the SAS code.Note: In the upper righthand corner of the code block you will have the option of copying ( ) the code to your clipboard or downloading ( ) the file to your computer.
options ls=78;
title "Discriminant  Swiss Bank Notes";
data swiss;
infile "D:\Statistics\STAT 505\data\swiss3.csv" firstobs=2 delimiter=',';
input type $ length left right bottom top diag;
run;
/* A new data set called 'test' is created to store any new
* values to be classified with our discriminant rule.
* The variables must match the quantitative ones in the training set.
*/
data test;
input length left right bottom top diag;
cards;
214.9 130.1 129.9 9 10.6 140.5
; run;
run;
/* The pool option conducts a test of equal covariance matrices.
* If the results of the test are insignificant (at the 0.10 level), the
* sample covariance matrices are pooled, resulting in a linear discriminant
* function; otherwise, the sample covariance matrices are not pooled,
* resulting in a quadratic discriminant function.
* The crossvalidate option calculates the confusion matrix based on
* the holdout method, where each obs is classified from the other obs only.
* The testdata= option specifies the data set with obs to be classified.
* The testout= option specifies the name of the data set where classification
* results are stored.
* The class statement specifies the variable with groups for classification.
* The var statement specifies the quantitative variables used to estimate
* the mean and covariance matrices of the groups.
*/
proc discrim data=swiss pool=test crossvalidate testdata=test testout=a;
class type;
var length left right bottom top diag;
priors "real"=0.99 "fake"=0.01;
run;
/* This will print the results of the classifications of the obs
* from the 'test' data set.
*/
proc print data=a;
run;
SAS Notes
By default, SAS will make this decision for you. Let's look at the proc descrim procedure in the SAS Program that we just used.
By including pool=test, SAS will decide what kind of discriminant analysis to carry out based on the results of this test.
If the test fails to reject, then SAS will automatically do a linear discriminant analysis. If the test rejects, then SAS will do a quadratic discriminant analysis.
There are two other options here. If we put pool=yes then SAS will conduct a linear discriminant analysis whether it is warranted or not. It will pool the variancecovariance matrices and do a linear discriminant analysis without reporting Bartlett's test.
If pool=no then SAS will not pool the variancecovariance matrices and perform the quadratic discriminant analysis.
SAS does not actually print out the quadratic discriminant function, but it will use quadratic discriminant analysis to classify sample units into populations.
Performing discriminant analysis (Swiss bank notes data)
To perform quadratic discriminant analysis with unequal prior probabilities:
 Open the ‘swiss3’ data set in a new worksheet.
 Stat > Multivariate > Discriminant Analysis
 Highlight and select ‘type’ to move it to the Groups window.
 Highlight and select all six quantitative variables (‘length’ through ‘diag’) to move them to the Predictors window.
 Choose Quadratic under Discriminant Function.
 Choose Options, and enter the prior probabilities ‘0.99 0.01’ (without quotes) to apply them to the groups ‘a’ and ‘b’, respectively (alphabetical order).
 Choose 'OK' twice. The results are displayed in the results area.
Bartlett's Test finds a significant difference between the variancecovariance matrices of the genuine and counterfeit banknotes \(\left( \mathrm { L } ^ { \prime } = 121.90; \mathrm { d.f. } = 21; \mathrm { p } < 0.0001 \right)\). The variancecovariance matrix for the genuine notes is not equal to the variancecovariance matrix for the counterfeit notes. Because we reject the null hypothesis of equal variancecovariance matrices, this suggests that a linear discriminant analysis is not appropriate for these data. A quadratic discriminant analysis is necessary.
Example 107: Swiss Bank notes
Let us consider a banknote with the following measurements:
Variable

Measurement


Length

214.9

Left Width

130.1

Right Width

129.9

Bottom Margin

9.0

Top Margin

10.6

Diagonal

140.5

Any number of lines of measurement may be considered. Here we are just interested in one set of measurements. It is requested that this banknote be classified as real or genuine. The posterior probability that it is fake or counterfeit is only 0.000002526. So, the posterior probability that it is genuine is very close to one (actually, this posterior probability is 1  0.000002526 = 0.999997474). We are nearly 100% confident that this is a real note and not counterfeit.
Next, consider the results of crossvalidation.
The resulting confusion table is as follows:
Truth  Counterfeit  Genuine  Total 

Counterfeit 
98

2

100

Genuine 
1

99

100

Total 
99

101

200

Here, we can see that 98 out of 100 counterfeit notes are expected to be correctly classified, while 99 out of 100 genuine notes are expected to be correctly classified. Thus, the estimated misclassification probabilities are estimated to be:
\(\hat{p}(\text{real  fake}) = 0.02 \) and \(\hat{p}(\text{fake  real}) = 0.01 \)
The question remains: Are these acceptable misclassification rates?
A decision should be made in advance as to what would be the acceptable levels of error. Here again, you need to think about the consequences of making a mistake. In terms of classifying a genuine note as a counterfeit, one might put an innocent person in jail. If you make the opposite error you might let a criminal go free. What are the costs of these types of errors? And, are the above error rates acceptable? This decision should be made in advance. You should have some prior notion of what you would consider reasonable.
10.8  Summary
10.8  SummaryIn this lesson we learned:
 How to determine which type of discriminant analysis is appropriate, linear or quadratic;
 How the linear discriminant function is used to classify a subject into the appropriate population;
 About issues to consider when selecting prior probabilities that a randomly selected subject belongs to a particular population;
 How to use posterior probabilities to assess the uncertainty of the classification of a particular subject;
 How to use crossvalidation and confusion tables to assess the efficacy of discriminant analysis.