10.4 - Example: Insect Data

Example 10-4: Insect Data Section

Data were collected on two species of insects in the genus Chaetocnema, (species a) Ch. concinna and (species b) Ch. heikertlingeri. Three variables were measured on each insect:

\(X_1\) = Width of the 1^st joint of the tarsus (legs)
\(X_2\) = Width of the 2^nd joint of the tarsus
\(X_3\) = Width of the aedeagus (reproductive organ)

We have ten individuals of each species to make up training data. Data on these ten individuals of each species are used to estimate the model parameters which we will use in the linear score function.

Our objective is to obtain a classification rule for identifying the insect species from these three variables.

Let's begin...

Step 1

Collect the training data. (described above)

Step 2

Specify the prior probabilities. In this case, we do not have any information regarding the relative abundances of the two species. Without any information in order to help specify prior probabilities, equal priors are selected:

\[\hat{p}_1 = \hat{p}_2 = \dfrac{1}{2}\]

Step 3

Test for homogeneity of the variance-covariance matrices using Bartlett's test.

Download the text file containing the data here: insect.csv

Example
Example

Here we will use the SAS program as shown below:

Download the SAS program here: insect.sas

Note: In the upper right-hand corner of the code block you will have the option of copying () the code to your clipboard or downloading () the file to your computer.

options ls=78;
title "Discriminant Analysis - Insect Data";

data insect;
  infile "D:\Statistics\STAT 505\data\insect.csv" firstobs=2 delimiter=',';
  input species $ joint1 joint2 aedeagus;
  run;

 /* A new data set called 'test' is created to store any new
  * values to be classified with our discriminant rule.
  * The variables must match the quantitative ones in the training set.
  */

data test;
  input joint1 joint2 aedeagus;
  cards;
  194 124 49
  ; run;

 /* The pool option conducts a test of equal covariance matrices.
  * If the results of the test are insignificant (at the 0.10 level), the
  * sample covariance matrices are pooled, resulting in a linear discriminant
  * function; otherwise, the sample covariance matrices are not pooled, 
  * resulting in a quadratic discriminant function.
  * The crossvalidate option calculates the confusion matrix based on 
  * the holdout method, where each obs is classified from the other obs only. 
  * The testdata= option specifies the data set with obs to be classified.
  * The testout= option specifies the name of the data set where classification
  * results are stored.
  * The class statement specifies the variable with groups for classification.
  * The var statement specifies the quantitative variables used to estimate
  * the mean and covariance matrices of the groups.
  */

proc discrim data=insect pool=test crossvalidate testdata=test testout=a;
  class species;
  var joint1 joint2 aedeagus;
  run;

 /* This will print the results of the classifications of the obs
  * from the 'test' data set.
  */

proc print data=a;
  run;

Performing a discriminant analysis (insect data)

To perform linear discriminant analysis with equal prior probabilities:

Open the ‘insect’ data set in a new worksheet.
Stat > Multivariate > Discriminant Analysis
Highlight and select ‘species’ to move it to the Groups window.
Highlight and select ‘joint1’, ‘joint2’, and ‘aedeagus’ to move them to the Predictors window
Select 'OK'. The results are displayed in the results area.

No significant difference between the variance-covariance matrices for the two species (L' = 9.83; d.f. = 6; p = 0.132) is found. Thus the linear discriminant analysis is appropriate for the data.

Step 4

Estimate the parameters of the conditional probability density functions, i.e., the population mean vectors and the population variance-covariance matrices involved. It turns out that all of this is done automatically in the discriminant analysis procedure.

Step 5

The linear discriminant functions for the two species can be obtained directly from the SAS or Minitab output.

Species	Function
Ch. concinna	\(\widehat { d } _ { a } ^ { L } ( \mathbf { x } ) = - 247.276 - 1.417 x _ { 1 } + 1.520 x _ { 2 } + 10.954 x _ { 3 }\)
Ch. heikertlingeri	\(\widehat { d } _ { b } ^ { L } ( \mathbf { x } ) = - 193.178 - 0.738 x _ { 1 } + 1.113 x _ { 2 } + 8.250 x _ { 3 }\)

Step 6

We will discuss this step in Lesson 10.5.

Step 7

Now, consider an insect with the following measurements. Which species does this belong to?

Variable	Measurement
Joint 1	194
Joint 2	124
Aedeagus	49

These are responses for the first three variables. The linear discriminant function for species a is obtained by plugging in the values for these three measurements into the equation for species (a):

\(\hat{d}^{L}_a(\textbf{x}) = -247.276 - 1.417 \times 194 + 1.520 \times 124 + 10.954 \times 49 = 203.052\)

and then for species (b):

\(\hat{d}^{L}_b(\textbf{x}) = -193.178 - 0.738 \times 194 + 1.113 \times 124 + 8.250 \times 49 = 205.912\)

Add in a log of .5 to obtain the linear score function for species (a):

\(\hat{s}^L_a(\mathbf{x}) = \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a} = 203.052 + \log{0.5} = 202.359\)

and then for species (b):

\(\hat{s}^L_b(\mathbf{x}) = \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} = 205.912 + \log{0.5} = 205.219\)

Conclusion

According to the classification rule the insect is classified into the species with the highest linear discriminant function. Because \(\hat{s}^L_b(\mathbf{x}) > \hat{s}^L_a(\mathbf{x})\), we conclude that the insect belongs to species (b) Ch. heikertlingeri.

Of course, the addition of the log of .5 does not make any difference. Whether we classify on the basis of \(\hat{d}^L_b(\mathbf{x})\) or on the basis of the score function, the decision will remain the same. In case the priors are not equal, this would not hold.

You can think of the priors as a 'penalty' in some sense. If you have a higher prior probability of a given species you will give it very little 'penalty' because you will be taking the log of a number close to one which is not going to subtract much. On the other hand, if there is a low prior probability, then the log of a very small number results in a larger reduction.

Note!: SAS by default assumes equal priors. Later on, we will look at an example where we do not assume equal priors - the Swiss Banknotes example.

Posterior Probabilities Section

You can also calculate the posterior probabilities. These are used to measure uncertainty regarding the classification of a unit from an unknown group. They will give us some indication of our confidence in our classification of individual subjects.

In this case, the estimated posterior probability that the insect belongs to species (a) Ch. concinna given the observed measurements is:

\begin{align} p(\pi_a|\mathbf{x}) &= \frac{\exp\{\hat{s}^L_a(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{202.359\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.05\end{align}

This is a function of the linear score functions for the two species. Here we are looking at the exponential function of the linear score function for species (a) divided by the sum of the exponential functions of the score functions for species (a) and species (b). Using the numbers obtained earlier, this equals 0.05.

Similarly, for species (b), the estimated posterior probability that the insect belongs to Ch. heikertlingeri is:

\begin{align} p(\pi_b|\mathbf{x}) &= \frac{\exp\{\hat{s}^L_b(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{205.219\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.95\end{align}

In this case, we are 95% confident that the insect belongs to species (b). This is a pretty high level of confidence with a 5% chance that we might be in error in this classification. You would have to decide what is an acceptable error rate here. For the classification of insects, this might be perfectly acceptable, however, in some situations, it might not be acceptable. For example, looking at the cancer case that we talked about earlier where we were trying to classify someone as having cancer or not having cancer, it may not be acceptable to have a 5% error rate. This is an ethical decision. It is a decision that has nothing to do with statistics and must be tailored to the situation at hand.