Example 10-4: Insect Data
Data were collected on two species of insects in the genus Chaetocnema, (species a) Ch. concinna and (species b) Ch. heikertlingeri. Three variables were measured on each insect:
- \(X_1\) = Width of the 1st joint of the tarsus (legs)
- \(X_2\) = Width of the 2nd joint of the tarsus
- \(X_3\) = Width of the aedeagus (reproductive organ)
We have ten individuals of each species in the training data. These observations are used to estimate the model parameters that enter the linear score function.
Our objective is to obtain a classification rule for identifying the insect species from these three variables.
Let's begin...
Step 1
Collect the training data (described above).
Step 2
Specify the prior probabilities. In this case we have no information about the relative abundances of the two species, so equal priors are selected:
\[\hat{p}_1 = \hat{p}_2 = \dfrac{1}{2}\]
Step 3
Test for homogeneity of the variance-covariance matrices using Bartlett's test.
Download the text file containing the data here: insect.txt
Using SAS
The SAS program for this analysis can be downloaded here: insect.sas
Using Minitab
Discriminant analysis can also be performed using the Minitab statistical software application.
Bartlett's test finds no significant difference between the variance-covariance matrices of the two species (L' = 9.83; d.f. = 6; p = 0.132). Thus linear discriminant analysis is appropriate for these data.
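As a quick check on the reported p-value, the sketch below recomputes the upper-tail probability, assuming the test statistic is referred to a chi-square distribution with 6 degrees of freedom. The helper name `chi2_sf_even_df` is hypothetical, and the closed form it uses applies only to even degrees of freedom; it is not the SAS or Minitab implementation.

```python
import math

def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2.0
    k = df // 2
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

# Statistic and degrees of freedom as reported in the text.
p_value = chi2_sf_even_df(9.83, 6)
print(round(p_value, 3))
```

Because the p-value exceeds the usual 0.05 level, the hypothesis of equal variance-covariance matrices is not rejected.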
Step 4
Estimate the parameters of the conditional probability density functions, i.e., the population mean vectors and the population variance-covariance matrices involved. It turns out that all of this is done automatically in the discriminant analysis procedure.
Step 5
The linear discriminant functions for the two species can be obtained directly from the SAS or Minitab output.
Species | Function |
---|---|
Ch. concinna | \(\hat{d}^L_a(\mathbf{x}) = -247.276 - 1.417 x_1 + 1.520 x_2 + 10.954 x_3\) |
Ch. heikertlingeri | \(\hat{d}^L_b(\mathbf{x}) = -193.178 - 0.738 x_1 + 1.113 x_2 + 8.250 x_3\) |
Step 6
We will discuss this step in Lesson 10.5.
Step 7
Now, consider an insect with the following measurements. Which species does this belong to?
Variable | Measurement |
---|---|
Joint 1 | 194 |
Joint 2 | 124 |
Aedeagus | 49 |
These are the observed values of the three variables \(X_1\), \(X_2\), and \(X_3\). The linear discriminant function for species (a) is evaluated by plugging these three measurements into the equation for species (a):
\(\hat{d}^{L}_a(\textbf{x}) = -247.276 - 1.417 \times 194 + 1.520 \times 124 + 10.954 \times 49 = 203.052\)
and then for species (b):
\(\hat{d}^{L}_b(\textbf{x}) = -193.178 - 0.738 \times 194 + 1.113 \times 124 + 8.250 \times 49 = 205.912\)
Add the natural log of the prior probability, 0.5, to obtain the linear score function for species (a):
\(\hat{s}^L_a(\mathbf{x}) = \hat{d}^L_a(\mathbf{x}) + \log{\hat{p}_a} = 203.052 + \log{0.5} = 202.359\)
and then for species (b):
\(\hat{s}^L_b(\mathbf{x}) = \hat{d}^L_b(\mathbf{x}) + \log{\hat{p}_b} = 205.912 + \log{0.5} = 205.219\)
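The arithmetic above can be reproduced with a short script. This is a minimal sketch: the coefficients are copied from the table of estimated linear discriminant functions, while the names `COEFFS`, `PRIORS`, `discriminant`, and `linear_score` are illustrative and not part of the SAS or Minitab output.

```python
import math

# Intercept and slopes of the estimated linear discriminant functions,
# copied from the table in Step 5.
COEFFS = {
    "a": (-247.276, (-1.417, 1.520, 10.954)),  # Ch. concinna
    "b": (-193.178, (-0.738, 1.113, 8.250)),   # Ch. heikertlingeri
}
PRIORS = {"a": 0.5, "b": 0.5}  # equal priors, as chosen in Step 2

def discriminant(species, x):
    """Linear discriminant function: intercept plus weighted measurements."""
    intercept, slopes = COEFFS[species]
    return intercept + sum(c * v for c, v in zip(slopes, x))

def linear_score(species, x, priors=PRIORS):
    """Linear score: discriminant function plus the natural log of the prior."""
    return discriminant(species, x) + math.log(priors[species])

x_new = (194, 124, 49)  # joint 1, joint 2, aedeagus
d = {sp: discriminant(sp, x_new) for sp in COEFFS}
s = {sp: linear_score(sp, x_new) for sp in COEFFS}
predicted = max(s, key=s.get)  # species with the largest linear score

for sp in COEFFS:
    print(sp, round(d[sp], 3), round(s[sp], 3))
print("classified as species", predicted)
```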
Conclusion
According to the classification rule, the insect is assigned to the species with the highest linear score function. Because \(\hat{s}^L_b(\mathbf{x}) > \hat{s}^L_a(\mathbf{x})\), we conclude that the insect belongs to species (b) Ch. heikertlingeri.
Of course, adding the log of 0.5 to both functions makes no difference here. Whether we classify on the basis of the discriminant functions \(\hat{d}^L_a(\mathbf{x})\) and \(\hat{d}^L_b(\mathbf{x})\) or on the basis of the score functions, the decision is the same. If the priors were not equal, the two rules could lead to different decisions.
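To see why, subtract the two score functions; the priors enter only through the log of their ratio:

\[\hat{s}^L_b(\mathbf{x}) - \hat{s}^L_a(\mathbf{x}) = \hat{d}^L_b(\mathbf{x}) - \hat{d}^L_a(\mathbf{x}) + \log\dfrac{\hat{p}_b}{\hat{p}_a} = 2.860 + \log\dfrac{\hat{p}_b}{\hat{p}_a}\]

With equal priors the last term is \(\log 1 = 0\), so the sign of the difference is determined by the discriminant functions alone.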
You can think of the priors as a 'penalty' in some sense. A species with a high prior probability receives very little penalty, because the log of a number close to one is close to zero and subtracts little. A species with a low prior probability receives a large penalty, because the log of a very small number is a large negative number.
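The effect of the penalty can be illustrated by re-scoring the same insect under different priors. In the sketch below the discriminant values are taken from the worked example, but the unequal priors (0.9 and 0.95 for species a) are purely hypothetical values chosen for illustration; they are not estimates from the insect data, and the function name `classify` is likewise illustrative.

```python
import math

# Discriminant function values for the new insect, from the worked example.
D_A, D_B = 203.052, 205.912

def classify(prior_a):
    """Return (label, s_a, s_b) for a given prior on species a (hypothetical)."""
    prior_b = 1.0 - prior_a
    s_a = D_A + math.log(prior_a)  # small penalty when prior_a is near 1
    s_b = D_B + math.log(prior_b)  # large penalty when prior_b is near 0
    label = "a" if s_a > s_b else "b"
    return label, s_a, s_b

# 0.5 is the prior used in the text; 0.9 and 0.95 are illustrative only.
for prior_a in (0.5, 0.9, 0.95):
    label, s_a, s_b = classify(prior_a)
    print(f"prior_a={prior_a}: s_a={s_a:.3f}, s_b={s_b:.3f} -> species {label}")
```

In this sketch the decision only switches to species (a) once its prior is large enough that the penalty on species (b) exceeds the gap of 2.860 between the two discriminant functions.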
Posterior Probabilities
You can also calculate the posterior probabilities. These are used to measure uncertainty regarding the classification of a unit from an unknown group. They will give us some indication of our confidence in our classification of individual subjects.
In this case, the estimated posterior probability that the insect belongs to species (a) Ch. concinna given the observed measurements is:
\begin{align} p(\pi_a|\mathbf{x}) &= \frac{\exp\{\hat{s}^L_a(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{202.359\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.05\end{align}
This is a function of the linear score functions for the two species. Here we are looking at the exponential function of the linear score function for species (a) divided by the sum of the exponential functions of the score functions for species (a) and species (b). Using the numbers obtained earlier, this equals 0.05.
Similarly for species (b), the estimated posterior probability that the insect belongs to Ch. heikertlingeri is:
\begin{align} p(\pi_b|\mathbf{x}) &= \frac{\exp\{\hat{s}^L_b(\mathbf{x})\}}{\exp\{\hat{s}^L_a(\mathbf{x})\}+\exp\{\hat{s}^L_b(\mathbf{x})\}} \\[10pt] &= \frac{\exp\{205.219\}}{\exp\{202.359\}+\exp\{205.219\}} \\[10pt] &= 0.95\end{align}
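The two posterior probabilities can be computed together from the linear scores. The sketch below subtracts the largest score before exponentiating, a common numerical safeguard that leaves the ratios unchanged; the function name `posteriors` is illustrative and this is not the routine used by SAS or Minitab.

```python
import math

def posteriors(scores):
    """Posterior probabilities from linear scores.

    Each posterior is exp(score) divided by the sum of exp(score) over
    all groups. Subtracting the maximum score first avoids very large
    exponentials without changing the result.
    """
    m = max(scores.values())
    weights = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Linear scores for the new insect, from the worked example.
post = posteriors({"a": 202.359, "b": 205.219})
for species, p in post.items():
    print(species, round(p, 2))
```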
In this case we are 95% confident that the insect belongs to species (b). This is a fairly high level of confidence, with a 5% chance that the classification is in error. You would have to decide what is an acceptable error rate here. For classification of insects this might be perfectly acceptable; in other situations it might not be. For example, in the cancer case that we talked about earlier, where we were trying to classify someone as having cancer or not having cancer, a 5% error rate may not be acceptable. This is an ethical decision. It is a decision that has nothing to do with statistics and must be tailored to the situation at hand.