10.7  Example: Swiss Banknotes
Example 10-6: Swiss Banknotes
Recall that we have two populations of notes, genuine and counterfeit, and that six measurements were taken on each note:
 Length
 Right-Hand Width
 Left-Hand Width
 Top Margin
 Bottom Margin
 Diagonal
Priors
In this case, it would not be reasonable to use equal priors for the two types of banknotes. Equal priors would assume that half the banknotes in circulation are counterfeit and half are genuine. This is a very high counterfeit rate, and if it were that bad the Swiss government would probably be bankrupt! We need unequal priors in which the vast majority of banknotes are thought to be genuine. For this example, let us assume that 1% of the banknotes in circulation are counterfeit and the remaining 99% are genuine. The prior probabilities can then be expressed as:
\(\hat{p}_1 = 0.99\) and \(\hat{p}_2 = 0.01\)
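To see what these priors imply, here is a brief sketch that assumes the usual Bayes rule with equal misclassification costs. The densities \(f(\mathbf{x} \mid \pi_1)\) (genuine) and \(f(\mathbf{x} \mid \pi_2)\) (counterfeit) are notation introduced here for illustration. A note is assigned to the counterfeit population only when \(\hat{p}_2 f(\mathbf{x} \mid \pi_2) > \hat{p}_1 f(\mathbf{x} \mid \pi_1)\), that is, when

\[
\frac{f(\mathbf{x} \mid \pi_2)}{f(\mathbf{x} \mid \pi_1)} > \frac{\hat{p}_1}{\hat{p}_2} = \frac{0.99}{0.01} = 99, \qquad \ln 99 \approx 4.595
\]

Under this assumption, the measurements must be about 99 times more likely under the counterfeit population than under the genuine one before the note is called counterfeit.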
The first step in the analysis is to carry out Bartlett's test to check for homogeneity of the variance-covariance matrices.
Download the text file with the data here: swiss3.csv
To do this we will use the SAS program shown below:
Download the SAS program here: swiss9.sas
options ls=78;
title "Discriminant - Swiss Bank Notes";

data swiss;
  infile "D:\Statistics\STAT 505\data\swiss3.csv" firstobs=2 delimiter=',';
  input type $ length left right bottom top diag;
run;

/* A new data set called 'test' is created to store any new
 * values to be classified with our discriminant rule.
 * The variables must match the quantitative ones in the training set.
 */
data test;
  input length left right bottom top diag;
  cards;
214.9 130.1 129.9 9 10.6 140.5
;
run;

/* The pool option conducts a test of equal covariance matrices.
 * If the results of the test are insignificant (at the 0.10 level), the
 * sample covariance matrices are pooled, resulting in a linear discriminant
 * function; otherwise, the sample covariance matrices are not pooled,
 * resulting in a quadratic discriminant function.
 * The crossvalidate option calculates the confusion matrix based on
 * the holdout method, where each obs is classified from the other obs only.
 * The testdata= option specifies the data set with obs to be classified.
 * The testout= option specifies the name of the data set where classification
 * results are stored.
 * The class statement specifies the variable with groups for classification.
 * The var statement specifies the quantitative variables used to estimate
 * the mean and covariance matrices of the groups.
 */
proc discrim data=swiss pool=test crossvalidate testdata=test testout=a;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
run;

/* This will print the results of the classifications of the obs
 * from the 'test' data set.
 */
proc print data=a;
run;
SAS Notes
SAS can make this decision for you. Let's look at the proc discrim procedure in the SAS program that we just used.
By including pool=test, SAS will decide what kind of discriminant analysis to carry out based on the results of this test.
If the test fails to reject, then SAS will automatically do a linear discriminant analysis. If the test rejects, then SAS will do a quadratic discriminant analysis.
There are two other options here. If we put pool=yes, then SAS will conduct a linear discriminant analysis whether it is warranted or not. It will pool the variance-covariance matrices and do a linear discriminant analysis without reporting Bartlett's test.
If pool=no, then SAS will not pool the variance-covariance matrices and will perform a quadratic discriminant analysis.
SAS does not actually print out the quadratic discriminant function, but it will use quadratic discriminant analysis to classify sample units into populations.
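For comparison, the settings of the pool= option could be written as follows. This is a sketch rather than part of the original program: it assumes the swiss data set created in the program above, and the slpool= value of 0.05 is only an illustrative choice of significance level (the comment in the program notes that the test is otherwise carried out at the 0.10 level).

```sas
/* pool=yes: always pool the covariance matrices (linear discriminant analysis). */
proc discrim data=swiss pool=yes crossvalidate;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
run;

/* pool=no: never pool the covariance matrices (quadratic discriminant analysis). */
proc discrim data=swiss pool=no crossvalidate;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
run;

/* pool=test with slpool=: let the test decide, here at an
 * illustrative 0.05 level instead of the 0.10 level.
 */
proc discrim data=swiss pool=test slpool=0.05 crossvalidate;
  class type;
  var length left right bottom top diag;
  priors "real"=0.99 "fake"=0.01;
run;
```

Running the first two versions side by side is one way to see how much the cross-validated confusion table changes between the linear and quadratic rules.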
Performing discriminant analysis (Swiss bank notes data)
To perform quadratic discriminant analysis with unequal prior probabilities in Minitab:
 Open the ‘swiss3’ data set in a new worksheet.
 Stat > Multivariate > Discriminant Analysis
 Highlight and select ‘type’ to move it to the Groups window.
 Highlight and select all six quantitative variables (‘length’ through ‘diag’) to move them to the Predictors window.
 Choose Quadratic under Discriminant Function.
 Choose Options, and enter the prior probabilities ‘0.99 0.01’ (without quotes) to apply them to the groups ‘a’ and ‘b’, respectively (alphabetical order).
 Choose 'OK' twice. The results are displayed in the results area.
Bartlett's test finds a significant difference between the variance-covariance matrices of the genuine and counterfeit banknotes \(\left( L' = 121.90;\ \mathrm{d.f.} = 21;\ p < 0.0001 \right)\). The variance-covariance matrix for the genuine notes is not equal to the variance-covariance matrix for the counterfeit notes. Because we reject the null hypothesis of equal variance-covariance matrices, this suggests that a linear discriminant analysis is not appropriate for these data. A quadratic discriminant analysis is necessary.
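The degrees of freedom reported for the test can be checked by hand. Assuming the usual chi-square approximation for Bartlett's test, with \(p = 6\) variables and \(g = 2\) populations:

\[
\mathrm{d.f.} = \frac{p(p+1)(g-1)}{2} = \frac{6 \times 7 \times 1}{2} = 21
\]

This matches the value of 21 in the output and equals the number of distinct variances and covariances in a single \(6 \times 6\) variance-covariance matrix.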
Example 10-7: Swiss Banknotes
Let us consider a banknote with the following measurements:
| Variable | Measurement |
|---|---|
| Length | 214.9 |
| Left Width | 130.1 |
| Right Width | 129.9 |
| Bottom Margin | 9.0 |
| Top Margin | 10.6 |
| Diagonal | 140.5 |

Any number of rows of measurements may be considered; here we are interested in just one. We want to classify this banknote as genuine or counterfeit. The posterior probability that it is counterfeit is only 0.000002526, so the posterior probability that it is genuine is very close to one (1 - 0.000002526 = 0.999997474). We are nearly 100% confident that this is a genuine note and not a counterfeit.
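As a rough sketch of where such a posterior probability comes from, assume multivariate normal populations and the standard quadratic score function. The symbols \(s_i^Q\), \(\boldsymbol{\mu}_i\), and \(\Sigma_i\) are notation assumed here, with sample estimates substituted in practice:

\[
s_i^Q(\mathbf{x}) = -\tfrac{1}{2} \ln |\Sigma_i| - \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln p_i,
\qquad
p(\pi_i \mid \mathbf{x}) = \frac{\exp\{ s_i^Q(\mathbf{x}) \}}{\sum_{j=1}^{2} \exp\{ s_j^Q(\mathbf{x}) \}}
\]

For this note, the reported posterior probabilities correspond to odds of roughly

\[
\frac{0.999997474}{0.000002526} \approx 3.96 \times 10^{5}
\]

to one in favor of the note being genuine.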
Next, consider the results of cross-validation.
The resulting confusion table is as follows:
| Truth | Classified as Counterfeit | Classified as Genuine | Total |
|---|---|---|---|
| Counterfeit | 98 | 2 | 100 |
| Genuine | 1 | 99 | 100 |
| Total | 99 | 101 | 200 |

Here, we can see that 98 out of 100 counterfeit notes are expected to be correctly classified, while 99 out of 100 genuine notes are expected to be correctly classified. Thus, the misclassification probabilities are estimated to be:
\(\hat{p}(\text{real} \mid \text{fake}) = 0.02\) and \(\hat{p}(\text{fake} \mid \text{real}) = 0.01\)
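These estimates come directly from the rows of the confusion table. The prior-weighted overall error rate in the second line below is an additional summary, not a figure quoted in the text; it assumes the priors of 0.99 and 0.01 used in this example:

\[
\hat{p}(\text{real} \mid \text{fake}) = \frac{2}{100} = 0.02, \qquad \hat{p}(\text{fake} \mid \text{real}) = \frac{1}{100} = 0.01
\]

\[
0.99 \times 0.01 + 0.01 \times 0.02 = 0.0099 + 0.0002 = 0.0101
\]

Under these priors, about 1% of the notes in circulation would be expected to be misclassified, and almost all of those would be genuine notes flagged as counterfeit.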
The question remains: Are these acceptable misclassification rates?
A decision about the acceptable levels of error should be made in advance. Here again, you need to think about the consequences of making a mistake. If a genuine note is classified as counterfeit, one might put an innocent person in jail. If you make the opposite error, you might let a criminal go free. What are the costs of these types of errors? And are the above error rates acceptable? You should have some prior notion of what you would consider reasonable.
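One way to make this trade-off concrete is the expected cost of misclassification (ECM). The following is only a sketch: the costs \(c(\text{fake} \mid \text{real})\) and \(c(\text{real} \mid \text{fake})\) are hypothetical quantities that the analyst would have to choose, and the priors and error rates are the estimates from this example:

\[
\mathrm{ECM} = c(\text{fake} \mid \text{real})\, \hat{p}(\text{fake} \mid \text{real})\, \hat{p}_1 + c(\text{real} \mid \text{fake})\, \hat{p}(\text{real} \mid \text{fake})\, \hat{p}_2
= 0.0099\, c(\text{fake} \mid \text{real}) + 0.0002\, c(\text{real} \mid \text{fake})
\]

Under these assumptions, the two kinds of error contribute equally to the expected cost only when accepting a counterfeit note is judged to be about \(0.0099 / 0.0002 = 49.5\) times as costly as rejecting a genuine one.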