10.7 - Example: Swiss Banknotes

Example 10-6: Swiss Banknotes

Recall that we have two populations of notes, genuine and counterfeit, and that six measurements were taken on each note:

  • Length
  • Right-Hand Width
  • Left-Hand Width
  • Top Margin
  • Bottom Margin
  • Diagonal

Priors

In this case, it would not be reasonable to consider equal priors for the two types of banknotes. Equal priors would assume that half the banknotes in circulation are counterfeit and half are genuine. This is a very high counterfeit rate, and if it were that bad, the Swiss government would probably be bankrupt! We need to consider unequal priors in which the vast majority of banknotes are thought to be genuine. For this example, let us assume that no more than 1% of banknotes in circulation are counterfeit, so that 99% of the notes are genuine. The prior probabilities can then be expressed as:

\(\hat{p}_1 = 0.99\) and \(\hat{p}_2 = 0.01\)
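
These priors enter the classification rule through the posterior probabilities. As a brief reminder of how this works (writing \(f_i(\mathbf{x})\) for the estimated multivariate normal density of population \(\pi_i\), notation assumed here rather than taken from this page):

\(\hat{P}(\pi_i \mid \mathbf{x}) = \dfrac{\hat{p}_i f_i(\mathbf{x})}{\hat{p}_1 f_1(\mathbf{x}) + \hat{p}_2 f_2(\mathbf{x})}, \quad i = 1, 2\)

With \(\hat{p}_1/\hat{p}_2 = 99\), a note is classified as counterfeit only if \(f_2(\mathbf{x})\) exceeds \(f_1(\mathbf{x})\) by a factor of about 99; the measurements must overcome the heavy prior odds in favor of a genuine note.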

The first step in the analysis is to carry out Bartlett's test to check for homogeneity of the variance-covariance matrices.
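
As a reminder of the form of this test (a standard formulation; the notation here is an assumption, not taken from this page): with \(g\) groups, group sample sizes \(n_i\), sample variance-covariance matrices \(\mathbf{S}_i\), pooled matrix \(\mathbf{S}_p\), and total sample size \(N\), the statistic is

\(L' = c\left\{(N-g)\log|\mathbf{S}_p| - \sum_{i=1}^{g}(n_i - 1)\log|\mathbf{S}_i|\right\}\)

where \(c\) is a correction factor close to one. Under the null hypothesis of equal variance-covariance matrices, \(L'\) is approximately chi-square distributed with \(p(p+1)(g-1)/2\) degrees of freedom, where \(p\) is the number of variables.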

Download the text file with the data here: swiss3.csv

To do this we will use the SAS program shown below:

Download the SAS program here: swiss9.sas


options ls=78;
title "Discriminant - Swiss Bank Notes";

data swiss;
  infile "D:\Statistics\STAT 505\data\swiss3.csv" firstobs=2 delimiter=',';
  input type $ length left right bottom top diag;
  run;

 /* A new data set called 'test' is created to store any new
  * values to be classified with our discriminant rule.
  * The variables must match the quantitative ones in the training set.
  */

data test;
  input length left right bottom top diag;
  cards;
214.9 130.1 129.9 9 10.6 140.5
;
run;

 /* The pool option conducts a test of equal covariance matrices.
  * If the results of the test are insignificant (at the 0.10 level), the
  * sample covariance matrices are pooled, resulting in a linear discriminant
  * function; otherwise, the sample covariance matrices are not pooled, 
  * resulting in a quadratic discriminant function.
  * The crossvalidate option calculates the confusion matrix based on 
  * the holdout method, where each obs is classified from the other obs only. 
  * The testdata= option specifies the data set with obs to be classified.
  * The testout= option specifies the name of the data set where classification
  * results are stored.
  * The class statement specifies the variable with groups for classification.
  * The var statement specifies the quantitative variables used to estimate
  * the mean and covariance matrices of the groups.
  */

proc discrim data=swiss pool=test crossvalidate testdata=test testout=a;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
  run;

 /* This will print the results of the classifications of the obs
  * from the 'test' data set.
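  * The testout= data set includes a posterior probability variable for
  * each level of 'type', as well as the variable _INTO_ holding the
  * assigned group.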
  */

proc print data=a;
  run;

SAS Notes

By default, SAS will make the choice between a linear and a quadratic discriminant analysis for you. Let's look at the proc discrim step in the SAS program that we just used.

By including pool=test, SAS will decide what kind of discriminant analysis to carry out based on the results of this test.

If the test fails to reject, then SAS will automatically do a linear discriminant analysis. If the test rejects, then SAS will do a quadratic discriminant analysis.

There are two other options here. If we specify pool=yes, then SAS will pool the variance-covariance matrices and carry out a linear discriminant analysis whether it is warranted or not, without reporting Bartlett's test.

If we specify pool=no, then SAS will not pool the variance-covariance matrices and will perform a quadratic discriminant analysis. Both forced options are sketched below.
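
For reference, here is a minimal sketch of how each forced choice could be requested. It mirrors the program above (the testdata=, testout=, and subsequent print step are omitted for brevity); only the pool= option changes.

 /* pool=yes: always pool the covariance matrices (linear rule);
  * Bartlett's test is not reported.
  */

proc discrim data=swiss pool=yes crossvalidate;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
  run;

 /* pool=no: never pool the covariance matrices (quadratic rule). */

proc discrim data=swiss pool=no crossvalidate;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
  run;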

SAS does not actually print out the quadratic discriminant function, but it will use quadratic discriminant analysis to classify sample units into populations.

Performing discriminant analysis in Minitab (Swiss bank notes data)

To perform quadratic discriminant analysis with unequal prior probabilities:

  1. Open the ‘swiss3’ data set in a new worksheet.
  2. Stat > Multivariate > Discriminant Analysis
  3. Highlight and select ‘type’ to move it to the Groups window.
  4. Highlight and select all six quantitative variables (‘length’ through ‘diag’) to move them to the Predictors window.
  5. Choose Quadratic under Discriminant Function.
  6. Choose Options, and enter the prior probabilities ‘0.01 0.99’ (without quotes). Minitab applies them to the groups in alphabetical order, so the first value goes to ‘fake’ and the second to ‘real’.
  7. Choose 'OK' twice. The results are displayed in the results area.

Bartlett's test finds a significant difference between the variance-covariance matrices of the genuine and counterfeit banknotes \(\left(L' = 121.90;\ \text{d.f.} = 21;\ p < 0.0001\right)\). The variance-covariance matrix for the genuine notes is not equal to the variance-covariance matrix for the counterfeit notes. Because we reject the null hypothesis of equal variance-covariance matrices, a linear discriminant analysis is not appropriate for these data; a quadratic discriminant analysis is necessary.
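
As a quick check on the reported degrees of freedom: with \(p = 6\) variables and \(g = 2\) groups, \(\text{d.f.} = p(p+1)(g-1)/2 = \frac{6 \cdot 7 \cdot 1}{2} = 21\), in agreement with the value above.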

Example 10-7: Swiss Banknotes

Let us consider a banknote with the following measurements:

Variable         Measurement
Length           214.9
Left Width       130.1
Right Width      129.9
Bottom Margin    9.0
Top Margin       10.6
Diagonal         140.5

Any number of observations could be classified at once, but here we are interested in just this one set of measurements. We wish to classify this banknote as either genuine or counterfeit. The posterior probability that it is counterfeit is only 0.000002526, so the posterior probability that it is genuine is very close to one (1 - 0.000002526 = 0.999997474). We are nearly 100% confident that this is a genuine note and not a counterfeit.
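
Tying this back to the posterior formula given earlier, the reported value corresponds to

\(\hat{P}(\pi_2 \mid \mathbf{x}) = \dfrac{0.01\, f_2(\mathbf{x})}{0.99\, f_1(\mathbf{x}) + 0.01\, f_2(\mathbf{x})} = 0.000002526\)

where \(f_1\) and \(f_2\) are the estimated densities for the genuine and counterfeit populations, evaluated at this note's measurements.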

Next, consider the results of cross-validation.

Note! Cross-validation yields estimates of the probability that a randomly selected note is correctly classified. Here, each of the 200 notes is held out in turn and classified by a rule estimated from the remaining 199 notes, so the error estimates are not biased by using the same data to both build and evaluate the rule.

The resulting confusion table is as follows:

                 Classified As
Truth          Counterfeit   Genuine   Total
Counterfeit         98           2       100
Genuine              1          99       100
Total               99         101       200

Here, we can see that 98 out of 100 counterfeit notes are expected to be correctly classified, and 99 out of 100 genuine notes are expected to be correctly classified. Thus, the misclassification probabilities are estimated to be:

\(\hat{p}(\text{real | fake}) = 0.02 \) and \(\hat{p}(\text{fake | real}) = 0.01 \)
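
These rates follow directly from the cross-validated confusion table above:

\(\hat{p}(\text{real} \mid \text{fake}) = \dfrac{2}{100} = 0.02\) and \(\hat{p}(\text{fake} \mid \text{real}) = \dfrac{1}{100} = 0.01\)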

The question remains: Are these acceptable misclassification rates?

A decision should be made in advance as to what levels of error are acceptable. Here again, you need to think about the consequences of making a mistake. By classifying a genuine note as counterfeit, one might put an innocent person in jail; by making the opposite error, one might let a criminal go free. What are the costs of these types of errors, and are the above error rates acceptable? You should have some prior notion, settled before seeing the results, of what you would consider reasonable.