Lesson 10: Discriminant Analysis

Overview

Discriminant analysis is a classification problem, where two or more groups or clusters or populations are known a priori and one or more new observations are classified into one of the known populations based on the measured characteristics. Let us look at three different examples.

Example 10-1: Swiss Banknotes

We have two populations of banknotes, genuine, and counterfeit. Six measures are taken on each note:

Length
Right-Hand Width
Left-Hand Width
Top Margin
Bottom Margin
Diagonal across the printed area

Take a banknote of unknown origin and determine just from these six measurements whether or not it is real or counterfeit. Perhaps this is not as impractical as it might sound. A more modern equivalent is a scanner that would measure the notes automatically and makes a decision.

Example 10-2: Pottery Data

Pottery shards are sampled from four sites: L) Llanedyrn, C) Caldicot, I) Ilse Thornes, and A) Ashley Rails, and the concentrations of the following chemical constituents were measured at a laboratory

Al: Aluminum
Fe: Iron
Mg: Magnesium
Ca: Calcium
Na: Sodium

An archaeologist encounters a pottery specimen of unknown origin. To determine possible trade routes, the archaeologist may wish to classify its site of origin.

Example 10-3: Insect Data

Data were collected on two species of insects in the genus Chaetocnema, (a) Ch. concinna and (b) Ch. heikertlingeri. Three variables were measured on each insect:

width of the 1^st joint of the tarsus (legs)
width of the 2^nd joint of the tarsus
width of the aedeagus (reproductive organ)

Our objective is to obtain a classification rule for identifying the insect species based on these three variables. An entomologist can identify these two closely related species, but the differences are so subtle that one has to have considerable experience to be able to tell the difference. If a classification rule may be developed, then this might be a more accurate way to help differentiate between these two different species.

Objectives

Upon completion of this lesson, you should be able to:

Determine whether linear or quadratic discriminant analysis should be applied to a given data set;
Be able to carry out both types of discriminant analyses using SAS/Minitab;
Be able to apply the linear discriminant function to classify a subject by its measurements;
Understand how to assess the efficacy of discriminant analysis.

10.1 - Bayes Rule and Classification Problem

Bayes’ Rule

Consider any two events A and B. To find \(P(B|A)\), the probability that B occurs given that A has occurred, Bayes’ Rule states the following:

\(P(B|A) = \dfrac{P(A \text{ and } B)}{P(A)}\)

This says that conditional probability is the probability that both A and B occur divided by the unconditional probability that A occurs. This is a simple algebraic restatement of a rule for finding the probability that two events occur together, which is \(P(A\ and\ B) = P(A)P(B|A)\).

Bayes’ Rule Applied to the Classification Problem

We are interested in \(P(\pi_{i} | \boldsymbol{x})\), the conditional probability that observation came from population \(\pi_{i}\) given that the observed values of the multivariate vector of variables \(\boldsymbol{x}\). We will classify an observation to the population for which the value of P(\(\pi_{i} | \boldsymbol{x})\) is greatest. This is the most probable group given the observed values of \(\boldsymbol{x}\).

Suppose that we have g populations (groups) and that the \(i^{th}\) population is denoted as \(\pi_{i}\).
Let \(p_{i}=P(\pi_{i})\), be the probability that a randomly selected observation is in population \(\pi_{i}\).
Let \(f(\boldsymbol{x} | \pi_{i}\)) be the conditional probability density function of the multivariate set of variables \(\boldsymbol{x}\), given that the observation came from population \(\pi_{i}\).

Note! We have to be careful about the word probability in conjunction with our observed vector \(\mathbf{x}\). A probability density function for continuous variables does not give a probability, but instead gives a measure of “likelihood.”

Using the notation of Bayes’ Rule above, event A = observing the vector \(\boldsymbol{x}\) and event B = observation came from population \(\pi_{i}\). Thus our probability of interest can be found as...

\(P(\text{member of } \pi_i | \text{ we observed } \mathbf{x}) = \dfrac{P(\text{member of } \pi_i \text{ and we observe } \mathbf{x})}{P(\text{we observe } \mathbf{x})}\)

The numerator of the expression just given is the likelihood that a randomly selected observation is both from population \(\pi_{i}\) and has the value \(\boldsymbol{x}\). This likelihood = \(p_{i}f(\boldsymbol{x}| \pi_{i})\).
The denominator is the unconditional likelihood (overall populations) that we could observe \(\boldsymbol{x}\). This likelihood = \(\sum_{j=1}^{g} p_j f(\mathbf{x}|\pi_j)\)

Thus the posterior probability that an observation is a member of population \(\pi_{i}\) is

\(p(\pi_i|\mathbf{x}) = \dfrac{p_i f(\mathbf{x}|\pi_i)}{\sum_{j=1}^{g}p_j f(\mathbf{x}|\pi_j)}\)

The classification rule is to assign observation \(\boldsymbol{x}\) to the population for which the posterior probability is the greatest.

The denominator is the same for all posterior probabilities (for the various populations) so it is equivalent to say that we will classify an observation to the population for which \(p_{i}f (\boldsymbol{x}\) | \(\pi_{i})\) is greatest.

Two Populations

With only two populations we can express a classification rule in terms of the ratio of the two posterior probabilities. Specifically, we would classify to population 1 when

\(\dfrac{p_1 f(\mathbf{x}|\pi_1)}{p_2 f(\mathbf{x}|\pi_2)} > 1\)

This can be rewritten to say that we classify to population 1 when

\(\dfrac{ f(\mathbf{x}|\pi_1)}{ f(\mathbf{x}|\pi_2)} > \dfrac{p_2}{p_1}\)

Decision Rule

We are going to classify the sample unit or subject into the population \(\pi_{i}\) that maximizes the posterior probability p(\(\pi_{i}\)). that is the population that maximizes

\(f(\mathbf{x}|\pi_i)p_i\)

We are going to calculate the posterior probabilities for each of the populations. Then we are going to assign the subject or sample unit to that population that has the highest posterior probability. Ideally, that posterior probability is going to be greater than half, the closer to 100% the better!

Equivalently we are going to assign it to the population that maximizes this product:

\(\log f(\mathbf{x}|\pi_i)p_i\)

The denominator that appears above does not depend on the population because it involves summing over all the populations. Equivalently all we really need to do is to assign it to the population that has the largest for this product, or equivalently we can maximize the log of that product. A lot of times it is easier to write the log.

10.2 - Discriminant Analysis Procedure

Discriminant analysis is a 7-step procedure.

Step 1: Collect training data

Training data are data with known group memberships. Here, we actually know which population contains each subject. For example, in the Swiss Bank Notes, we actually know which of these are genuine notes and which others are counterfeit examples.

Step 2: Prior Probabilities

The prior probability \(p_i\) represents the expected portion of the community that belongs to population \(\pi_{i}\). There are three common choices:

Equal priors: \(\hat{p}_i = \frac{1}{g}\) This is useful if we believe that all of the population sizes are equal
Arbitrary priors were selected according to the investigator's beliefs regarding the relative population sizes.
Note! We require:
\(\hat{p}_1 + \hat{p}_2 + \dots + \hat{p}_g = 1\)
Estimated priors:
\(\hat{p}_i = \dfrac{n_i}{N}\)

where \(n_{i}\) is the number observations from population \(\pi_{i}\) in the training data, and \(N = n _ { 1 } + n _ { 2 } + \ldots + n _ { g }\)

Step 3: Bartlett's test

Use Bartlett’s test to determine if the variance-covariance matrices are homogeneous for all populations involved. The result of this test will determine whether to use Linear or Quadratic Discriminant Analysis.:

Case 1: Linear

Linear discriminant analysis is for homogeneous variance-covariance matrices:

\(\Sigma_1 = \Sigma_2 = \dots = \Sigma_g = \Sigma\)

In this case, the variance-covariance matrix does not depend on the population.

Case 2: Quadratic

Quadratic discriminant analysis is used for heterogeneous variance-covariance matrices:

\(\Sigma_i \ne \Sigma_j\) for some \(i \ne j\)

This allows the variance-covariance matrices to depend on the population.

Note! We do not discuss testing whether the means of the populations are different. If they are not, there is no case for DA

Step 4: Estimate the parameters of the conditional probability density functions \(f ( \mathbf{X} |\pi_{i})\).

Here, we shall make the following standard assumptions:

The data from group i has common mean vector \(\boldsymbol{\mu_i}\)
The data from group i have a common variance-covariance matrix \(\Sigma\).
Independence: The subjects are independently sampled.
Normality: The data are multivariate normally distributed.

Step 5: Compute discriminant functions.

This is the rule to classify the new object into one of the known populations.

Step 6: Use cross-validation to estimate misclassification probabilities.

As in all statistical procedures, it is helpful to use diagnostic procedures to assess the efficacy of the discriminant analysis. We use cross-validation to assess the classification probability. Typically you are going to have some prior rule as to what is an acceptable misclassification rate. Those rules might involve things like, "what is the cost of misclassification?" This could come up in a medical study where you might be able to diagnose cancer. There are really two alternative costs. The cost of misclassifying someone as having cancer when they don't. This could cause a certain amount of emotional grief! There is also the alternative cost of misclassifying someone as not having cancer when in fact they do have it. The cost here is obviously greater if early diagnosis improves cure rates.

Step 7: Classify observations with unknown group memberships.

The procedure described above assumes that the unit or subject being classified actually belongs to one of the considered populations. If you have a study where you look at two species of insects, A and B, and the insect to classify actually belongs to species C, then it will obviously be misclassified as to belonging to either A or B.

10.3 - Linear Discriminant Analysis

We assume that in population \(\pi_{i}\) the probability density function of \(\boldsymbol{x}\) is multivariate normal with mean vector \(\boldsymbol{\mu}_{i}\) and variance-covariance matrix \(\Sigma\) (same for all populations). As a formula, this is...

\(f(\mathbf{x}|\pi_i) = \dfrac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{(x-\mu_i)'\Sigma^{-1}(x-\mu_i)}\right)\)

We classify to the population for which \(p _ { i } f ( \mathbf { x } | \pi _ { i } )\) ) is largest.

Because a log transform is monotonic, this is equivalent to classifying an observation to the population for which log( \(p _ { i } f ( \mathbf { x } | \pi _ { i } )\) )) is largest.

Linear discriminant analysis is used when the variance-covariance matrix does not depend on the population. In this case, our decision rule is based on the Linear Score Function, a function of the population means for each of our g populations, \(\boldsymbol{\mu}_{i}\), as well as the pooled variance-covariance matrix.

Linear Score Function

The Linear Score Function is:

\(s^L_i(\mathbf{X}) = -\dfrac{1}{2}\mathbf{\mu'_i \Sigma^{-1}\mu_i + \mu'_i \Sigma^{-1}x}+ \log p_i = d_{i0}+\sum_{j=1}^{p}d_{ij}x_j + \log p_i\)

where

\(d_{i0} = -\dfrac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}\)

\(d_{ij} = j\text{th element of } \mu'_i\Sigma^{-1}\)

The far left-hand expression resembles a linear regression with intercept term d_i₀ and regression coefficients d_ij.

Linear Discriminant Function

\(d^L_i(\mathbf{x}) = -\dfrac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i + \mu'_i\Sigma^{-1}x} = d_{i0} + \sum_{j=1}^{p}d_{ij}x_j\)

\(d_{i0} = -\dfrac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}\)

Given a sample unit with measurements \(x _ { 1 }, x _ { 2 }, \dots, x _ { p }\), we classify the sample unit into the population that has the largest Linear Score Function. This is equivalent to classifying the population for which the posterior probability of membership is the largest. The linear score function is computed for each population, then we plug in our observation values and assign the unit to the population with the largest score.

However, this is a function of unknown parameters, \(\boldsymbol{\mu}_{i}\) and \(\Sigma\). So, these must be estimated from the data.

Discriminant analysis requires estimates of:

\(p_i = \text{Pr}(\pi_i);\) \(i = 1, 2, \dots, g\)

\(\mathbf{\mu_i} = E(\mathbf{X}|\pi_i)\); \(i = 1, 2, \dots, g\)

\(\Sigma = \text{var}(\mathbf{X}| \pi_i)\); \(i = 1, 2, \dots, g\)

Prior probabilities:
The population means are estimated by the sample mean vectors:
The variance-covariance matrix is estimated by using the pooled variance-covariance matrix:

Typically, these parameters are estimated from training data, in which the population membership is known.

Conditional Density Function Parameters

Population Means: \(\boldsymbol{\mu}_{i}\) is estimated by substituting in the sample means \(\bar{\mathbf{x}}_i\).

Variance-Covariance matrix: Let S_i denote the sample variance-covariance matrix for population i. Then the variance-covariance matrix \(Σ\) is estimated by substituting in the pooled variance-covariance matrix into the Linear Score Function as shown below:

\(\mathbf{S}_p = \dfrac{\sum_{i=1}^{g}(n_i-1)\mathbf{S}_i}{\sum_{i=1}^{g}(n_i-1)}\)

to obtain the estimated linear score function:

\(\hat{s}^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i +\bar{x}'_i S^{-1}_p x } + \log{\hat{p}_i} = \hat{d}_{i0} + \sum_{j=1}^{p}\hat{d}_{ij}x_j + \log{p}_i\)

where

\(\hat{d}_{i0} = -\dfrac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i} \)

and

\( \hat{d}_{ij} = j^{th} \text{ element of} \ \ \bar{x}'_iS^{-1}_p \)

This is a function of the sample mean vectors, the pooled variance-covariance matrix, and prior probabilities for g different populations. This is written in a form that looks like a linear regression formula with an intercept term plus a linear combination of response variables, plus the natural log of the prior probabilities.

Decision Rule: Classify the sample unit into the population that has the largest estimated linear score function.

10.4 - Example: Insect Data

Example 10-4: Insect Data

Data were collected on two species of insects in the genus Chaetocnema, (species a) Ch. concinna and (species b) Ch. heikertlingeri. Three variables were measured on each insect:

\(X_1\) = Width of the 1^st joint of the tarsus (legs)
\(X_2\) = Width of the 2^nd joint of the tarsus
\(X_3\) = Width of the aedeagus (reproductive organ)

We have ten individuals of each species to make up training data. Data on these ten individuals of each species are used to estimate the model parameters which we will use in the linear score function.

Our objective is to obtain a classification rule for identifying the insect species from these three variables.

Let's begin...

Step 1

Collect the training data. (described above)

Step 2

Specify the prior probabilities. In this case, we do not have any information regarding the relative abundances of the two species. Without any information in order to help specify prior probabilities, equal priors are selected:

\[\hat{p}_1 = \hat{p}_2 = \dfrac{1}{2}\]

Step 3

Test for homogeneity of the variance-covariance matrices using Bartlett's test.

Download the text file containing the data here: insect.csv

Example
Example

Here we will use the SAS program as shown below:

Download the SAS program here: insect.sas

Note: In the upper right-hand corner of the code block you will have the option of copying () the code to your clipboard or downloading () the file to your computer.

options ls=78;
title "Discriminant Analysis - Insect Data";

data insect;
  infile "D:\Statistics\STAT 505\data\insect.csv" firstobs=2 delimiter=',';
  input species $ joint1 joint2 aedeagus;
  run;

 /* A new data set called 'test' is created to store any new
  * values to be classified with our discriminant rule.
  * The variables must match the quantitative ones in the training set.
  */

data test;
  input joint1 joint2 aedeagus;
  cards;
  194 124 49
  ; run;

 /* The pool option conducts a test of equal covariance matrices.
  * If the results of the test are insignificant (at the 0.10 level), the
  * sample covariance matrices are pooled, resulting in a linear discriminant
  * function; otherwise, the sample covariance matrices are not pooled, 
  * resulting in a quadratic discriminant function.
  * The crossvalidate option calculates the confusion matrix based on 
  * the holdout method, where each obs is classified from the other obs only. 
  * The testdata= option specifies the data set with obs to be classified.
  * The testout= option specifies the name of the data set where classification
  * results are stored.
  * The class statement specifies the variable with groups for classification.
  * The var statement specifies the quantitative variables used to estimate
  * the mean and covariance matrices of the groups.
  */

proc discrim data=insect pool=test crossvalidate testdata=test testout=a;
  class species;
  var joint1 joint2 aedeagus;
  run;

 /* This will print the results of the classifications of the obs
  * from the 'test' data set.
  */

proc print data=a;
  run;

Performing a discriminant analysis (insect data)

To perform linear discriminant analysis with equal prior probabilities:

Open the ‘insect’ data set in a new worksheet.
Stat > Multivariate > Discriminant Analysis
Highlight and select ‘species’ to move it to the Groups window.
Highlight and select ‘joint1’, ‘joint2’, and ‘aedeagus’ to move them to the Predictors window
Select 'OK'. The results are displayed in the results area.

No significant difference between the variance-covariance matrices for the two species (L' = 9.83; d.f. = 6; p = 0.132) is found. Thus the linear discriminant analysis is appropriate for the data.

Step 4

Estimate the parameters of the conditional probability density functions, i.e., the population mean vectors and the population variance-covariance matrices involved. It turns out that all of this is done automatically in the discriminant analysis procedure.

Step 5

The linear discriminant functions for the two species can be obtained directly from the SAS or Minitab output.

Species	Function
Ch. concinna	\(\widehat { d } _ { a } ^ { L } ( \mathbf { x } ) = - 247.276 - 1.417 x _ { 1 } + 1.520 x _ { 2 } + 10.954 x _ { 3 }\)
Ch. heikertlingeri	\(\widehat { d } _ { b } ^ { L } ( \mathbf { x } ) = - 193.178 - 0.738 x _ { 1 } + 1.113 x _ { 2 } + 8.250 x _ { 3 }\)

Step 6

We will discuss this step in Lesson 10.5.

Step 7

Now, consider an insect with the following measurements. Which species does this belong to?

Variable	Measurement
Joint 1	194
Joint 2	124
Aedeagus	49

These are responses for the first three variables. The linear discriminant function for species a is obtained by plugging in the values for these three measurements into the equation for species (a):

\(\hat{d}^{L}_a(\textbf{x}) = -247.276 - 1.417 \times 194 + 1.520 \times 124 + 10.954 \times 49 = 203.052\)

and then for species (b):

\(\hat{d}^{L}_b(\textbf{x}) = -193.178 - 0.738 \times 194 + 1.113 \times 124 + 8.250 \times 49 = 205.912\)

Add in a log of .5 to obtain the linear score function for species (a):

\(\hat{s}^L_a(\mathbf{x}) = \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a} = 203.052 + \log{0.5} = 202.359\)

and then for species (b):

\(\hat{s}^L_b(\mathbf{x}) = \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} = 205.912 + \log{0.5} = 205.219\)

Conclusion

According to the classification rule the insect is classified into the species with the highest linear discriminant function. Because \(\hat{s}^L_b(\mathbf{x}) > \hat{s}^L_a(\mathbf{x})\), we conclude that the insect belongs to species (b) Ch. heikertlingeri.

Of course, the addition of the log of .5 does not make any difference. Whether we classify on the basis of \(\hat{d}^L_b(\mathbf{x})\) or on the basis of the score function, the decision will remain the same. In case the priors are not equal, this would not hold.

You can think of the priors as a 'penalty' in some sense. If you have a higher prior probability of a given species you will give it very little 'penalty' because you will be taking the log of a number close to one which is not going to subtract much. On the other hand, if there is a low prior probability, then the log of a very small number results in a larger reduction.

Note!: SAS by default assumes equal priors. Later on, we will look at an example where we do not assume equal priors - the Swiss Banknotes example.

Posterior Probabilities

You can also calculate the posterior probabilities. These are used to measure uncertainty regarding the classification of a unit from an unknown group. They will give us some indication of our confidence in our classification of individual subjects.

In this case, the estimated posterior probability that the insect belongs to species (a) Ch. concinna given the observed measurements is:

\begin{align} p(\pi_a|\mathbf{x}) &= \frac{\exp\{\hat{s}^L_a(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{202.359\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.05\end{align}

This is a function of the linear score functions for the two species. Here we are looking at the exponential function of the linear score function for species (a) divided by the sum of the exponential functions of the score functions for species (a) and species (b). Using the numbers obtained earlier, this equals 0.05.

Similarly, for species (b), the estimated posterior probability that the insect belongs to Ch. heikertlingeri is:

\begin{align} p(\pi_b|\mathbf{x}) &= \frac{\exp\{\hat{s}^L_b(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{205.219\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.95\end{align}

In this case, we are 95% confident that the insect belongs to species (b). This is a pretty high level of confidence with a 5% chance that we might be in error in this classification. You would have to decide what is an acceptable error rate here. For the classification of insects, this might be perfectly acceptable, however, in some situations, it might not be acceptable. For example, looking at the cancer case that we talked about earlier where we were trying to classify someone as having cancer or not having cancer, it may not be acceptable to have a 5% error rate. This is an ethical decision. It is a decision that has nothing to do with statistics and must be tailored to the situation at hand.

10.5 - Estimating Misclassification Probabilities

When an unknown specimen is classified according to any decision rule, there is always a possibility that the specimen is wrongly classified. This is unavoidable. This is part of the inherent uncertainty in any statistical procedure. One procedure to evaluate the discriminant rule is to classify the training data according to the developed discrimination rule. Because we know which unit comes from which population among the training data, this will give us some idea of the validity of the discrimination procedure.

Confusion Table

The confusion table describes how the discriminant function will classify each observation in the data set. In general, the confusion table takes the form:

Classified As
Truth	1	2	\(\cdots\)	\(g\)	Total
1	\(n_{11}\)	\(n_{12}\)	\(\cdots\)	\(n_{1g}\)	\(n_{1\cdot}\)
2	\(n_{21}\)	\(n_{22}\)	\(\cdots\)	\(n_{2g}\)	\(n_{2\cdot}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
\(g\)	\(n_{g1}\)	\(n_{g2}\)	\(\cdots\)	\(n_{gg}\)	\(n_{g\cdot}\)
Total	\(n_{\cdot 1}\)	\(n_{\cdot 2}\)	\(\cdots\)	\(n_{\cdot g}\)	\(n_{\cdot \cdot}\)

Rows 1 through g are g populations to which the items truly belong. Across the columns, we are looking at how they are classified. \(n_{11}\) is the number of insects correctly classified in species (1). But \(n_{12}\) is the number of insects incorrectly classified into species (2). In this case \(n_{ij}\) = the number belonging to population i classified into population j. Ideally, this matrix will be a diagonal matrix; in practice, we hope to see very small off-diagonal elements.

The row totals provide the number of individuals belonging to each of our populations or species in our training dataset. The column totals are the number classified into each of these species. The total number of observations in the dataset is n... The dot notation is used here in the row totals for summing over the second subscript, whereas in the column totals, we are summing over the first subscript.

We will let:

\(p(i|j)\)

denote the probability that a unit from population π_j is classified into population π_i. These misclassification probabilities are estimated by taking the number of insects from population j that are misclassified into population i divided by the total number of insects in the sample from population j as shown here:

\(\hat{p}(i|j) = \dfrac{n_{ji}}{n_{j.}}\)

These are the misclassification probabilities.

Method 1: Resubstitution

The resubstitution method uses the same set of discriminate functions computed from the entire data set to classify each observation. Its confusion matrix is output by default in SAS.

From the SAS output, we obtain the following confusion table.

Classified As
Truth	\(a\)	\(b\)	Total
\(a\)	10	0	10
\(b\)	0	10	10
Total	10	10	20

Here, none of the insects were misclassified! The misclassification probabilities are all estimated equal to zero.

Method 2: Set Aside Method

Step 1: Randomly partition the observations into two ”halves”
Step 2: Use one ”half” to obtain the discriminant function.
Step 3: Use the discriminant function from Step 2 to classify all members of the second ”half” of the data, from which the proportion of misclassified observations is computed.

Advantage: This method yields unbiased estimates of the misclassification probabilities.

Problem: This does not make optimum use of the data, and so, estimated misclassification probabilities are not as precise as possible.

Method 3: Cross Validation

Step 1: Delete one observation from the data.
Step 2: Use the remaining observations to compute a discriminant function.
Step 3: Use the discriminant function from Step 2 to classify the observation removed in Step 1. Steps 1-3 are repeated for all observations; compute the proportions of observations that are misclassified.

Example 10-5: Insect Data

The confusion table for the cross-validation is

Classified As
Truth	\(a\)	\(b\)	Total
\(a\)	10	0	10
\(b\)	2	8	10
Total	12	8	20

Here, the estimated misclassification probabilities are:

\(\hat{p}(b|a) = \frac{0}{10} = 0.0\)

for insects belonging to species A, and

\(\hat{p}(a|b) = \frac{2}{10} = 0.2\)

for insects belonging to species B.

Specifying Unequal Priors

Suppose that we have information (from prior experience or from another study) that suggests that 90% of the insects belong to Ch. concinna. Then the score functions for the unidentified specimen are

\begin{align} \hat{s}^L_a(\mathbf{x}) &= \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a}\\[10pt] &= 203.052 + \log{0.9} \\[10pt] &= 202.946\end{align}

and

\begin{align} \hat{s}^L_b(\mathbf{x}) &= \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} \\[10pt] &= 205.912 + \log{0.1} \\[10pt] &= 203.609\end{align}

In this case, we would still classify this specimen into Ch. heikertlingeri with posterior probabilities

\(p(\pi_a|\mathbf{x}) = 0.36\) and \(p(\pi_b|\mathbf{x}) = 0.64\)

These priors can be specified in SAS by adding the ”priors” statement: priors ”a” = 0.9 ”b” = 0.1; following the var statement. However, it should be noted that when the "priors" statement is added, SAS will include log p_i as part of the constant term. In other words, SAS outputs the estimated linear score function, not the estimated linear discriminant function.

10.6 - Quadratic Discriminant Analysis

Linear Discriminant Analysis is for homogeneous variance-covariance matrices. However, not all cases come from such simplified situations. Quadratic Discriminant Analysis is used for heterogeneous variance-covariance matrices:

\(\Sigma_i \ne \Sigma_j\) for some \(i \ne j\)

Again, this allows the variance-covariance matrices to depend on the population.

Quadratic discriminant analysis calculates a Quadratic Score Function:

\(s^Q_i (\mathbf{x}) = -\frac{1}{2}\log{|\mathbf{\Sigma_i}|}-\frac{1}{2}{\mathbf{(x-\mu_i)'\Sigma^{-1}_i(x - \mu_i)}}+\log{p_i}\)

This is a function of population mean vectors and the variance-covariance matrices for the ith group. Similarly, we will determine a separate quadratic score function for each of the groups.

This is of course a function of the unknown population mean vector for group i and the variance-covariance matrix for group i. These will have to be estimated from the training data. As before, we replace the unknown values of \(\boldsymbol{\mu_i}\),\(\mathbf{\Sigma_i}\), and \(p_i\) by their estimates to obtain the estimated quadratic score function as shown below:

\(\hat{s}^Q_i (\mathbf{x}) = -\frac{1}{2}\log{|\mathbf{S_i}|}-\frac{1}{2}{\mathbf{(x-\bar{x}_i)'S^{-1}_i(x - \bar{x}_i)}}+\log{p_i}\)

All natural logs are used in this function.

Decision Rule: Our decision rule remains the same as well. We will classify the sample unit into the population that has the largest quadratic score function.

\(\hat{s}^Q_i (\mathbf{x}) = -\frac{1}{2}\log{|\mathbf{S_i}|}-\frac{1}{2}{\mathbf{(x-\bar{x})'S^{-1}_i(x -\bar{x})}}+\log{p_i}\)

Let's illustrate this using the Swiss Banknotes example...

10.7 - Example: Swiss Banknotes

Example 10-6: Swiss Banknotes

Recall that we have two populations of notes, genuine and counterfeit, and that six measurements were taken on each note:

Length
Right-Hand Width
Left-Hand Width
Top Margin
Bottom Margin
Diagonal

Priors

In this case, it would not be reasonable to consider equal priors for the two types of banknotes. Equal priors would assume that half the banknotes in circulation are counterfeit and half are genuine. This is a very high counterfeit rate and if it was that bad the Swiss government would probably be bankrupt! We need to consider unequal priors in which the vast majority of banknotes are thought to be genuine. For this example let us assume that no more than 1% of banknotes in circulation are counterfeit and 99% of the notes are genuine. The prior probabilities can then be expressed as:

\(\hat{p}_1 = 0.99\) and \(\hat{p}_2 = 0.01\)

The first step in the analysis is going to carry out Bartlett's test to check for homogeneity of the variance-covariance matrices.

Download the text file with the data here: swiss3.csv

Example
Example

To do this we will use the SAS program shown below:

Download the SAS program here: swiss9.sas

View the video explanation of the SAS code.

Note: In the upper right-hand corner of the code block you will have the option of copying () the code to your clipboard or downloading () the file to your computer.

options ls=78;
title "Discriminant - Swiss Bank Notes";

data swiss;
  infile "D:\Statistics\STAT 505\data\swiss3.csv" firstobs=2 delimiter=',';
  input type $ length left right bottom top diag;
  run;

 /* A new data set called 'test' is created to store any new
  * values to be classified with our discriminant rule.
  * The variables must match the quantitative ones in the training set.
  */

data test;
  input length left right bottom top diag;
  cards;
  214.9 130.1 129.9 9 10.6 140.5
  ; run;
  run;

 /* The pool option conducts a test of equal covariance matrices.
  * If the results of the test are insignificant (at the 0.10 level), the
  * sample covariance matrices are pooled, resulting in a linear discriminant
  * function; otherwise, the sample covariance matrices are not pooled, 
  * resulting in a quadratic discriminant function.
  * The crossvalidate option calculates the confusion matrix based on 
  * the holdout method, where each obs is classified from the other obs only. 
  * The testdata= option specifies the data set with obs to be classified.
  * The testout= option specifies the name of the data set where classification
  * results are stored.
  * The class statement specifies the variable with groups for classification.
  * The var statement specifies the quantitative variables used to estimate
  * the mean and covariance matrices of the groups.
  */

proc discrim data=swiss pool=test crossvalidate testdata=test testout=a;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
  run;

 /* This will print the results of the classifications of the obs
  * from the 'test' data set.
  */

proc print data=a;
  run;

SAS Notes

By default, SAS will make this decision for you. Let's look at the proc descrim procedure in the SAS Program that we just used.

By including pool=test, SAS will decide what kind of discriminant analysis to carry out based on the results of this test.

If the test fails to reject, then SAS will automatically do a linear discriminant analysis. If the test rejects, then SAS will do a quadratic discriminant analysis.

There are two other options here. If we put pool=yes then SAS will conduct a linear discriminant analysis whether it is warranted or not. It will pool the variance-covariance matrices and do a linear discriminant analysis without reporting Bartlett's test.

If pool=no then SAS will not pool the variance-covariance matrices and perform the quadratic discriminant analysis.

SAS does not actually print out the quadratic discriminant function, but it will use quadratic discriminant analysis to classify sample units into populations.

Performing discriminant analysis (Swiss bank notes data)

To perform quadratic discriminant analysis with unequal prior probabilities:

Open the ‘swiss3’ data set in a new worksheet.
Stat > Multivariate > Discriminant Analysis
Highlight and select ‘type’ to move it to the Groups window.
Highlight and select all six quantitative variables (‘length’ through ‘diag’) to move them to the Predictors window.
Choose Quadratic under Discriminant Function.
Choose Options, and enter the prior probabilities ‘0.99 0.01’ (without quotes) to apply them to the groups ‘a’ and ‘b’, respectively (alphabetical order).
Choose 'OK' twice. The results are displayed in the results area.

Bartlett's Test finds a significant difference between the variance-covariance matrices of the genuine and counterfeit banknotes \(\left( \mathrm { L } ^ { \prime } = 121.90; \mathrm { d.f. } = 21; \mathrm { p } < 0.0001 \right)\). The variance-covariance matrix for the genuine notes is not equal to the variance-covariance matrix for the counterfeit notes. Because we reject the null hypothesis of equal variance-covariance matrices, this suggests that a linear discriminant analysis is not appropriate for these data. A quadratic discriminant analysis is necessary.

Example 10-7: Swiss Bank notes

Let us consider a banknote with the following measurements:

Variable	Measurement
Length	214.9
Left Width	130.1
Right Width	129.9
Bottom Margin	9.0
Top Margin	10.6
Diagonal	140.5

Any number of lines of measurement may be considered. Here we are just interested in one set of measurements. It is requested that this banknote be classified as real or genuine. The posterior probability that it is fake or counterfeit is only 0.000002526. So, the posterior probability that it is genuine is very close to one (actually, this posterior probability is 1 - 0.000002526 = 0.999997474). We are nearly 100% confident that this is a real note and not counterfeit.

Next, consider the results of cross-validation.

Note! Cross-validation yields estimates of the probability that a randomly selected note is correctly classified.

The resulting confusion table is as follows:

Classified As
Truth	Counterfeit	Genuine	Total
Counterfeit	98	2	100
Genuine	1	99	100
Total	99	101	200

Here, we can see that 98 out of 100 counterfeit notes are expected to be correctly classified, while 99 out of 100 genuine notes are expected to be correctly classified. Thus, the estimated misclassification probabilities are estimated to be:

\(\hat{p}(\text{real | fake}) = 0.02 \) and \(\hat{p}(\text{fake | real}) = 0.01 \)

The question remains: Are these acceptable misclassification rates?

A decision should be made in advance as to what would be the acceptable levels of error. Here again, you need to think about the consequences of making a mistake. In terms of classifying a genuine note as a counterfeit, one might put an innocent person in jail. If you make the opposite error you might let a criminal go free. What are the costs of these types of errors? And, are the above error rates acceptable? This decision should be made in advance. You should have some prior notion of what you would consider reasonable.

10.8 - Summary

In this lesson we learned:

How to determine which type of discriminant analysis is appropriate, linear or quadratic;
How the linear discriminant function is used to classify a subject into the appropriate population;
About issues to consider when selecting prior probabilities that a randomly selected subject belongs to a particular population;
How to use posterior probabilities to assess the uncertainty of the classification of a particular subject;
How to use cross-validation and confusion tables to assess the efficacy of discriminant analysis.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility