9.2 - Discriminant Analysis

Introduction
Let the feature vector be X and the class labels be Y.
The Bayes rule says that if we know the joint distribution of X and Y, and X is given, then under 0-1 loss the optimal decision on Y is to choose the class with the maximum posterior probability given X.
Discriminant analysis belongs to the branch of classification methods called generative modeling, where we try to estimate the within-class density of X given the class label. Combined with the prior probability (unconditional probability) of the classes, the posterior probability of Y can be obtained by the Bayes formula.
Notation
Assume the prior probability, or the marginal pmf, for class k is denoted by \(\pi_k\), where \(\sum^{K}_{k=1} \pi_k =1 \).
\(\pi_k\) is usually estimated simply by the empirical frequencies of the training set:
\[\hat{\pi}_k=\frac{\text{# of Samples in class } k}{\text{Total # of samples}}\]
You have the training data set and you count what percentage of the data comes from each class.
Then we need the class-conditional density of X. Remember this is the density of X conditioned on the class k, that is, on G = k; it is denoted by \(f_k(x)\).
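As a concrete illustration, here is a minimal Python sketch of this empirical estimate; the label vector y_train and its values are made up for the example:

import numpy as np

# Hypothetical training labels for K = 3 classes, labeled 1, 2, 3.
y_train = np.array([1, 1, 2, 3, 3, 3, 2, 1, 3, 2])

# pi_hat_k = (# of samples in class k) / (total # of samples)
classes, counts = np.unique(y_train, return_counts=True)
pi_hat = counts / counts.sum()

for k, p in zip(classes, pi_hat):
    print(f"class {k}: pi_hat = {p:.2f}")   # 0.30, 0.30, 0.40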
According to the Bayes rule, what we need to compute is the posterior probability:
\[Pr(G=k|X=x)=\frac{f_k(x)\pi_k}{\sum^{K}_{l=1}f_l(x)\pi_l}\]
This is a conditional probability of class G given X.
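To make the formula concrete, the following sketch computes this posterior for a single observation x. The class-conditional densities are assumed, purely for illustration, to be univariate Gaussians with made-up means and standard deviations; any densities \(f_k(x)\) would be used the same way:

import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.3, 0.4])            # prior probabilities, sum to 1
means = np.array([0.0, 2.0, 5.0])         # assumed class-conditional parameters
sds = np.array([1.0, 1.0, 1.5])

x = 1.2                                   # a single observed feature value
f = norm.pdf(x, loc=means, scale=sds)     # f_k(x) for k = 1, ..., K

# Bayes formula: Pr(G=k | X=x) = f_k(x) pi_k / sum_l f_l(x) pi_l
posterior = f * pi / np.sum(f * pi)
print(posterior, posterior.sum())         # the posterior sums to 1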
By MAP (maximum a posteriori, i.e., the Bayes rule for 0-1 loss):
\( \begin{align} \hat{G}(x) &=\text{arg}\,\underset{k}{\max}\; Pr(G=k|X=x)\\
& = \text{arg}\,\underset{k}{\max}\; f_k(x)\pi_k\\
\end{align} \)
Notice that the denominator is the same no matter which class k you are considering; therefore it does not affect the maximization over k. The MAP rule is essentially trying to maximize \(\pi_k\) times \(f_k(x)\).
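Continuing with the same hypothetical Gaussian densities and priors as above, the sketch below checks that maximizing \(\pi_k f_k(x)\) picks the same class as maximizing the full posterior, since the common denominator drops out:

import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.3, 0.4])
means, sds = np.array([0.0, 2.0, 5.0]), np.array([1.0, 1.0, 1.5])

x = 1.2
numerator = norm.pdf(x, loc=means, scale=sds) * pi    # f_k(x) * pi_k

# The denominator sum_l f_l(x) pi_l is the same for every k,
# so it does not change which class attains the maximum.
assert np.argmax(numerator) == np.argmax(numerator / numerator.sum())
print("predicted class:", np.argmax(numerator) + 1)   # labels start at 1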