Let the feature vector be X and the class labels be Y.
The Bayes rule says that if we know the joint distribution of X and Y, then, given X and under 0-1 loss, the optimal decision on Y is to choose the class with maximum posterior probability given X.
Discriminant analysis belongs to the branch of classification methods called generative modeling, where we try to estimate the within-class density of X given the class label. Combined with the prior (unconditional) probabilities of the classes, the posterior probability of Y can be obtained by the Bayes formula.
Assume the prior probability, or the marginal probability mass function, for class k is denoted \(\pi_k\), with \(\sum^{K}_{k=1} \pi_k =1 \).
\(\pi_k\) is usually estimated simply by the empirical frequencies in the training set:
\[\hat{\pi}_k=\frac{\text{# of Samples in class } k}{\text{Total # of samples}}\]
You have the training data set and you count what percentage of data come from a certain class.
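For example, the empirical priors can be computed in R directly from the vector of training labels; the name trainLabels below is hypothetical:

# trainLabels: vector (or factor) of class labels for the training samples (hypothetical name)
trainLabels <- c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1)      # toy labels for illustration
pi.hat <- table(trainLabels) / length(trainLabels)  # proportion of samples in each class
pi.hat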
Then we need the class-conditional density of X, that is, the density of X conditioned on the class G = k, denoted \(f_k(x)\).
According to the Bayes rule, what we need is to compute the posterior probability:
\[Pr(G=k|X=x)=\frac{f_k(x)\pi_k}{\sum^{K}_{l=1}f_l(x)\pi_l}\]
This is a conditional probability of class G given X.
By MAP (maximum a posteriori, i.e., the Bayes rule for 0-1 loss):
\( \begin{align*} \hat{G}(x) &=\text{arg }\underset{k}{\text{max }} Pr(G=k|X=x)\\
& = \text{arg }\underset{k}{\text{max }} f_k(x)\pi_k
\end{align*} \)
Notice that the denominator is identical no matter which class k we consider, so it makes no difference to the maximization over k. The MAP rule therefore amounts to maximizing \(\pi_k\) times \(f_k(x)\).
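As a quick sketch, independent of any particular density model, the MAP rule is a one-line computation; f.x and prior below are hypothetical inputs holding the within-class densities evaluated at a point x and the class priors:

# f.x:   vector of within-class densities f_k(x) evaluated at a point x
# prior: vector of prior probabilities pi_k (same length as f.x)
map.class <- function(f.x, prior) which.max(f.x * prior)
map.class(f.x = c(0.10, 0.25), prior = c(0.5, 0.5))  # toy example: returns class 2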
Depending on which algorithms you use, you end up with different ways of density estimation within every class.
In Linear Discriminant Analysis (LDA) we assume that every density within each class is a Gaussian distribution.
In LDA we assume those Gaussian distributions for different classes share the same covariance structure. In Quadratic Discriminant Analysis (QDA) we don't have such a constraint. You will see the difference later.
You can also use general nonparametric density estimates, for instance kernel estimates and histograms.
There is a well-known algorithm called the Naive Bayes algorithm. Here the basic assumption is that all the variables are independent given the class label. Therefore, to estimate the class density, you can separately estimate the density for every dimension and then multiply them to get the joint density. This makes the computation much simpler.
X may be discrete, not continuous. Instead of talking about density, we will use the probability mass function. In this case, we would compute a probability mass function for every dimension and then multiply them to get the joint probability mass function.
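Here is a minimal sketch of the naive Bayes idea for continuous features, assuming a univariate Gaussian in each dimension within each class; x.train, y.train, and x.new are hypothetical names, and for discrete features dnorm would be replaced by an estimated probability mass function in each dimension:

# Estimate the class-k density at x.new as a product of per-dimension Gaussian densities
naive.bayes.density <- function(x.new, x.train, y.train, k) {
  x.k <- x.train[y.train == k, , drop = FALSE]   # samples from class k
  mu  <- colMeans(x.k)                           # per-dimension sample means
  s   <- apply(x.k, 2, sd)                       # per-dimension sample standard deviations
  prod(dnorm(x.new, mean = mu, sd = s))          # product of the marginal densities
}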
You can see that we have swept through several prominent methods for classification. You should also see that they all fall into the Generative Modeling idea. The only essential difference is in how you actually estimate the density for every class.
Under LDA we assume that the density for X, given every class k is following a Gaussian distribution. Here is the density formula for a multivariate Gaussian distribution:
\[f_k(x)=\dfrac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)}\]
Here p is the dimension and \(\Sigma_k\) is the covariance matrix for class k. The normalizing constant involves the square root of the determinant of this matrix, and the exponent is a quadratic form computed by matrix multiplication; the vector x and the mean vector \(\mu_k\) are both column vectors.
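The density above can be evaluated directly in R; the function below is a minimal sketch (the dmvnorm function in the mvtnorm package performs the same computation):

# Multivariate Gaussian density at a point x, with mean mu and covariance Sigma
dmvnorm.manual <- function(x, mu, Sigma) {
  p <- length(mu)
  q <- as.numeric(t(x - mu) %*% solve(Sigma) %*% (x - mu))  # quadratic form in the exponent
  exp(-q / 2) / ((2 * pi)^(p / 2) * sqrt(det(Sigma)))
}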
For Linear discriminant analysis (LDA): \(\Sigma_k=\Sigma\), ∀k.
In LDA, as we mentioned, we simply assume that the covariance matrix is identical for the different classes. By making this assumption, the classifier becomes linear. The only difference in quadratic discriminant analysis is that we do not assume the covariance matrices are identical across classes; for QDA, the decision boundary is determined by a quadratic function.
Since the covariance matrix determines the shape of the Gaussian density, in LDA, the Gaussian densities for different classes have the same shape, but are shifted versions of each other (different mean vectors). Example densities for the LDA model are shown below.
For the moment, we will assume that we already have the covariance matrix for every class; we will talk about how to estimate it shortly. Let's look at what the optimal classification would be based on the Bayes rule.
The Bayes rule says that we should pick the class that has the maximum posterior probability given the feature vector X. Under the generative modeling approach, this is equivalent to maximizing the product of the prior and the within-class density.
Since the log function is increasing, the maximization is unchanged by taking logs: whatever maximizes the product also maximizes its logarithm. Next, we plug in the Gaussian density with common covariance, multiplied by the prior probability.
\[\begin{align*} \hat{G}(x)
& = \text{arg } \underset{k}{\text{max}} Pr(G=k|X=x) \\
& = \text{arg } \underset{k}{\text{max}}f_k(x)\pi_k \\
& = \text{arg } \underset{k}{\text{max }} \text{ log}(f_k(x)\pi_k) \\
& = \text{arg } \underset{k}{\text{max}}\left[-\text{log}((2\pi)^{p/2}|\Sigma|^{1/2})-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\text{log}(\pi_k) \right] \\
& = \text{arg } \underset{k}{\text{max}}\left[-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\text{log}(\pi_k) \right]
\end{align*}\]
Note:
\[-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_k-\frac{1}{2}x^T\Sigma^{-1}x\]
To sum up, after simplification we obtain this formula:
\[\hat{G}(x)= \text{arg }\underset{k}{\text{max }}\left[x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k} + \text{log}(\pi_k) \right]\]
This is the final classifier. Given any x, you simply plug it into this formula and see which k maximizes it. Usually the number of classes is pretty small, very often only two, so an exhaustive search over the classes is effective.
LDA gives a linear boundary because the quadratic term \(-\frac{1}{2}x^T\Sigma^{-1}x\) is the same for every class and therefore drops out of the comparison.
Writing the quantity being maximized as the linear discriminant function
\(\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k} + \text{log}(\pi_k)\)
the classification rule becomes
\(\hat{G}(x)= \text{arg }\underset{k}{\text{max }}\delta_k(x)\)
The decision boundary between any two classes k and l is the set of points where their discriminant functions are equal,
\(\left\{ x : \delta_k(x) = \delta_l(x)\right\}\)
which, after simplification, is the linear equation
\(\text{log}\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+x^T\Sigma^{-1}(\mu_k-\mu_l)=0\)
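A direct translation of these discriminant functions into R, assuming the pooled covariance Sigma, a matrix mu whose rows are the class means, and a vector prior of class priors are already available (all hypothetical names):

# Compute delta_k(x) for each class and return the maximizing class index
lda.classify <- function(x, mu, Sigma, prior) {
  Sinv  <- solve(Sigma)
  delta <- sapply(1:nrow(mu), function(k)
    as.numeric(t(x) %*% Sinv %*% mu[k, ]) -
      0.5 * as.numeric(t(mu[k, ]) %*% Sinv %*% mu[k, ]) +
      log(prior[k]))
  which.max(delta)   # class with the largest linear discriminant score
}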
In binary classification in particular, say with classes k = 1 and l = 2, the boundary is a single line determined by the prior probabilities \(\pi_1\) and \(\pi_2\), the mean vectors \(\mu_1\) and \(\mu_2\), and the common covariance matrix \(\Sigma\). For example, consider the following parameters:
\(\ast \pi_1=\pi_2=0.5\)
\(\ast \mu_1=(0,0)^T, \mu_2=(2,-2)^T\)
\(\ast \Sigma = \begin{pmatrix}
1.0&0.0 \\
0.0 & 0.5625
\end{pmatrix}\)
\(\ast \text{Decision boundary: } 5.56-2.00x_1+3.56x_2=0.0\)
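These boundary coefficients can be checked numerically with a few lines of R using the parameters above:

pi1 <- 0.5; pi2 <- 0.5
mu1 <- c(0, 0); mu2 <- c(2, -2)
Sigma <- matrix(c(1.0, 0.0, 0.0, 0.5625), nrow = 2)
Sinv <- solve(Sigma)
a0 <- as.numeric(log(pi1 / pi2) - 0.5 * t(mu1 + mu2) %*% Sinv %*% (mu1 - mu2))  # constant term: 5.56
a  <- Sinv %*% (mu1 - mu2)                                                      # x coefficients: -2.00, 3.56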
Here is a contour plot of this result:
We have two classes and we know the within-class densities. The marginal density is simply the weighted sum of the within-class densities, where the weights are the prior probabilities. Because the weights are equal and the covariance matrices of the two classes are identical, we get symmetric contour lines in the plot. The black diagonal line is the decision boundary for the two classes: if a given x lies above the line, we classify it into the first class; if it lies below, we classify it into the second class.
There is a missing piece here, right?
For all of the discussion above we assume that we have the prior probabilities for the classes and we also had the within class densities given to us. Of course, in practice you don't have this. In practice, what we have is only a set of training data.
The question is how do we find the \(\pi_k\)'s and the \(f_k(x)\)?
We need to estimate the Gaussian distributions. Here are the formulas for estimating the \(\pi_k\)'s and the parameters of the Gaussian distributions; apart from the normalization of the covariance noted below, they are the maximum likelihood estimators:
\[\hat{\pi}_k=N_k/N\]
where \(N_k\) is the number of class-k samples and N is the total number of points in the training data. As we mentioned, to get the prior probabilities for class k, you simply count the frequency of data points in class k.
Then, the mean vector for every class is also simple. You take all of the data points in a given class and compute the average, the sample mean:
\[\hat{\mu}_k=\sum_{g_i=k}x^{(i)}/N_k\]
Next, the covariance matrix formula looks slightly complicated. The reason is that we have to get a common covariance matrix for all of the classes. First we split the data points into the given classes according to their labels. For class k, we subtract the corresponding class mean from every point and multiply the resulting column vector by its transpose. Remember x is a column vector, so a column vector times a row vector gives a square matrix, which is what we need.
\[\hat{\Sigma}=\sum_{k=1}^{K}\sum_{g_i=k}\left(x^{(i)}-\hat{\mu}_k \right)\left(x^{(i)}-\hat{\mu}_k \right)^T/(N-K)\]
First, we do the summation within every class k, then we have the sum over all of the classes. Next, we normalize by the scalar quantity, N - K. When we fit a maximum likelihood estimator it should be divided by N, but if it is divided by N – K, we get an unbiased estimator. Remember, K is the number of classes. So, when N is large, the difference between N and N - K is pretty small.
Note that \(x^{(i)}\) denotes the ith sample vector.
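Here is a minimal sketch of these estimators in R, for a feature matrix X and a label vector y (hypothetical names); the pooled covariance is normalized by N - K as above:

classes <- sort(unique(y))
K <- length(classes)
N <- nrow(X)
pi.hat <- table(y) / N                                                        # estimated priors
mu.hat <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE]))) # class means (one row per class)

# Pooled covariance: sum the within-class scatter matrices, then divide by N - K
Sigma.hat <- matrix(0, ncol(X), ncol(X))
for (k in classes) {
  Xc <- scale(X[y == k, , drop = FALSE], center = TRUE, scale = FALSE)  # subtract the class mean
  Sigma.hat <- Sigma.hat + t(Xc) %*% Xc
}
Sigma.hat <- Sigma.hat / (N - K)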
In summary, if you want to use LDA to obtain a classification rule, the first step would involve estimating the parameters using the formulas above. Once you have these, then go back and find the linear discriminant function and choose a class according to the discriminant functions.
Let's take a look at a specific data set. This is the diabetes data set from the UC Irvine Machine Learning Repository [2]. It is a fairly small data set by today's standards. The original data have eight dimensions. To simplify the example, we take the two leading principal components of these eight variables and use them, instead of the original eight dimensions, throughout this example.
The diabetes data set has two types of samples in it. One sample type is healthy individuals and the other is individuals with a higher risk of diabetes. Here are the prior probabilities estimated for the two sample types, first for the healthy individuals and second for those at risk:
\[\hat{\pi}_0 =0.651, \hat{\pi}_1 =0.349 \]
The first type has a prior probability estimated at 0.651. This means that for this data set about 65% of these belong to class 0 and the other 35% belong to class 1. Next, we computed the mean vector for the two classes separately:
\[\hat{\mu}_0 =(-0.4038, -0.1937)^T, \hat{\mu}_1 =(0.7533, 0.3613)^T \]
Then we computed \(\hat{\Sigma}\) using the formulas discussed earlier.
\[\hat{\Sigma}=
\begin{pmatrix}
1.7949 & -0.1463\\
-0.1463 & 1.6656
\end{pmatrix} \]
Once we have done all of this, we compute the linear discriminant function and find the classification rule.
Classification rule:
\[ \begin{align*}\hat{G}(x)
&=\begin{cases}
0 & 0.7748-0.6767x_1-0.3926x_2 \ge 0 \\
1 & otherwise
\end{cases} \\
& = \begin{cases}
0 & x_2 \le (0.7748/0.3926) - (0.6767/0.3926)x_1 \\
1 & otherwise
\end{cases} \end{align*}\]
In the first specification of the classification rule, plug a given x into the above linear function. If the result is greater than or equal to zero, then claim that it is in class 0, otherwise claim that it is in class 1.
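The coefficients in this rule come from plugging the estimates above into the two discriminant functions; the linear function in the rule is \(\delta_0(x)-\delta_1(x)\), and a short R check reproduces it:

pi0 <- 0.651; pi1 <- 0.349
mu0 <- c(-0.4038, -0.1937); mu1 <- c(0.7533, 0.3613)
Sigma <- matrix(c(1.7949, -0.1463, -0.1463, 1.6656), nrow = 2)
Sinv <- solve(Sigma)

# delta_0(x) - delta_1(x) = a0 + a[1]*x1 + a[2]*x2; class 0 is chosen when this is >= 0
a0 <- as.numeric(log(pi0 / pi1) - 0.5 * (t(mu0) %*% Sinv %*% mu0 - t(mu1) %*% Sinv %*% mu1))  # approx. 0.7748
a  <- Sinv %*% (mu0 - mu1)                                                                    # approx. -0.6767, -0.3926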
Below is a scatter plot of the two principal components. The two classes are shown: the first, without diabetes, as red stars (class 0), and the second, with diabetes, as blue circles (class 1). The solid line is the classification boundary obtained by LDA. The two classes do not appear to be particularly well separated. The dashed (dotted) line is the boundary obtained by linear regression of an indicator matrix. In this case, the two linear boundaries are very close.
It is always a good practice to plot things so that if something went terribly wrong it would show up in the plots.
Here is the contour plot for the density for class 0. [Actually, the figure looks a little off - it should be centered slightly to the left and below the origin.] The contour plot for the density for class 1 would be similar except centered above and to the right.
LDA makes some strong assumptions. It assumes that the covariance matrix is identical for different classes. It also assumes that the density is Gaussian. What if these are not true? LDA may not necessarily be bad when the assumptions about the density functions are violated. Here are some examples that might illustrate this.
In certain cases, LDA may yield poor results.
In the first example (a), the data follow exactly the model assumptions of LDA: the two classes, red and blue, have the same covariance matrix and each is generated by a single Gaussian distribution. In the plots below, the black line represents the decision boundary. The second example (b) violates all of the assumptions made by LDA. The within-class density is not a single Gaussian distribution but a mixture of two Gaussians, so the overall density is a mixture of four Gaussians, and the two classes have different covariance matrices as well.
Then, if we apply LDA we get this decision boundary (above, black line), which is actually very close to the ideal boundary between the two classes. By ideal boundary, we mean the boundary given by the Bayes rule using the true distribution (since we know it in this simulated example).
In another example, (c) below, we again generated two classes. The red class still consists of two Gaussian components. The blue class spreads itself over the red class, with one mass of data in the upper right and another in the lower left. If we force LDA onto these data, we get the decision boundary displayed, and the result is very bad (far below the ideal classification accuracy). In the upper right, the red and blue points are thoroughly mixed; in the lower left, the mixing is less severe. You can imagine that the error rate for classification with this decision boundary would be very high.
In plot (d), the density of each class is estimated by a mixture of two Gaussians. The Bayes rule is applied. The resulting boundaries are two curves. The separation of the red and the blue is much improved.
This example illustrates when LDA gets into trouble. LDA separates the two classes with a hyperplane, which means the two classes have to be roughly two separated masses, each occupying half of the space. In the example above, the blue class breaks into two pieces, left and right, so a more sophisticated density estimate for the two classes is needed to get a good result. This is an example where LDA has seriously broken down, and it is why it is always a good idea to look at the scatter plot before you choose a method. The scatter plot will often show whether a certain method is appropriate. If you see a scatter plot like this one, with the blue class broken into pieces, you can imagine that no matter how you position a linear boundary, you are not going to get a good separation between the red and the blue class. In that case LDA is not appropriate and you want to take another approach.
QDA is not really that much different from LDA, except that we allow the covariance matrix to differ across classes, and so we estimate the covariance matrix \(\Sigma_k\) separately for each class k, k = 1, 2, ... , K.
Quadratic discriminant function:
\[\delta_k(x)= -\frac{1}{2}\text{log}|\Sigma_k|-\frac{1}{2}(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k})+\text{log}\pi_k\]
This quadratic discriminant function is very much like the linear discriminant function except that, because \(\Sigma_k\), the covariance matrix, is not identical across classes, you cannot throw away the quadratic term. The discriminant function is a quadratic function of x and contains second-order terms.
Classification rule:
\[\hat{G}(x)=\text{arg }\underset{k}{\text{max }}\delta_k(x)\]
The classification rule is similar as well. You just find the class k which maximizes the quadratic discriminant function.
The decision boundaries are quadratic equations in x.
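Here is a minimal sketch of the QDA rule in R, assuming a list Sigma of per-class covariance matrices, a matrix mu of class means (one row per class), and a vector prior of class priors (all hypothetical names):

# Compute the quadratic discriminant score for each class and return the maximizing class index
qda.classify <- function(x, mu, Sigma, prior) {
  delta <- sapply(1:nrow(mu), function(k) {
    d <- x - mu[k, ]
    -0.5 * log(det(Sigma[[k]])) -
      0.5 * as.numeric(t(d) %*% solve(Sigma[[k]]) %*% d) +
      log(prior[k])
  })
  which.max(delta)
}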
QDA allows more flexibility for the covariance matrix and so tends to fit the data better than LDA, but it has more parameters to estimate. The number of parameters increases significantly because QDA requires a separate covariance matrix for every class. If you have many classes and not so many sample points, this can be a problem.
As we talked about at the beginning of this course, there are trade-offs between fitting the training data well and having a simple model to work with. A simple model sometimes fits the data just as well as a complicated model. Even if the simple model doesn't fit the training data as well as a complex model, it still might be better on the test data because it is more robust.
In this example we do the same thing as we did previously with LDA for the prior probabilities and the mean vectors, except now we estimate the covariance matrices separately for each class.
How do we estimate the covariance matrices separately?
Remember that in LDA, after summing over the data points within each class, we pooled all the classes together. In QDA we do not pool; each class gets its own covariance estimate.
Prior probabilities: \(\hat{\pi}_0=0.651, \hat{\pi}_1=0.349 \).
Mean vectors: \(\hat{\mu}_0=(-0.4038, -0.1937)^T, \hat{\mu}_1=(0.7533, 0.3613)^T \)
Covariance matrices:
\[\hat{\Sigma}_0= \begin{pmatrix}
1.6790 & -0.0461 \\
-0.0461 & 1.5985
\end{pmatrix}, \qquad
\hat{\Sigma}_1= \begin{pmatrix}
2.0114 & -0.3334 \\
-0.3334 & 1.7910
\end{pmatrix} \]
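In R, these per-class estimates can be obtained by applying cov() to the samples of each class separately; note that cov() divides by \(N_k - 1\) rather than \(N_k\). The names X (the matrix of the two principal components) and y (the class labels) are hypothetical:

Sigma0.hat <- cov(X[y == 0, ])   # covariance estimated from class-0 samples only
Sigma1.hat <- cov(X[y == 1, ])   # covariance estimated from class-1 samples only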
The dashed line in the plot below is the decision boundary given by LDA. The curved line is the decision boundary resulting from QDA.
For most of the data, it doesn't make any difference, because most of the data is massed on the left. The percentage of the data in the area where the two decision boundaries differ a lot is small. Therefore, you can imagine that the difference in the error rate is very small.
Sensitivity for QDA is the same as that obtained by LDA, but specificity is slightly lower.
Under the model of LDA, we can compute the log-odds:
\[ \begin{align*} & \text{log }\frac{Pr(G=k|X=x)}{Pr(G=K|X=x)}\\
& = \text{log }\frac{\pi_k}{\pi_K}-\frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K)+x^T\Sigma^{-1}(\mu_k-\mu_K) \\
& = a_{k0}+a_{k}^{T}x
\end{align*} \]
The model of LDA satisfies the assumption of the linear logistic model.
The difference between linear logistic regression and LDA is that the linear logistic model only specifies the conditional distribution \(Pr(G = k | X = x)\) and makes no assumption about \(Pr(X)\), while the LDA model specifies the joint distribution of X and G. Under LDA, \(Pr(X)\) is a mixture of Gaussians:
\[Pr(X)=\sum_{k=1}^{K}\pi_k \phi (X; \mu_k, \Sigma) \]
where \(\phi\) is the Gaussian density function.
Moreover, linear logistic regression is solved by maximizing the conditional likelihood of G given X: \(Pr(G = k | X = x)\); while LDA maximizes the joint likelihood of G and X: \(Pr(X = x, G = k)\).
If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data.
Another advantage of LDA is that samples without class labels can also be used to help estimate the model. On the other hand, LDA is not robust to gross outliers. Because logistic regression relies on fewer assumptions, it tends to be more robust to non-Gaussian types of data.
In practice, logistic regression and LDA often give similar results.
Consider input X being 1-D.
Two classes have equal priors, and the class-conditional densities of X are shifted versions of each other. Each within-class density of X is a mixture of two normals; the class-conditional densities are shown in the plot below.
Training data set: 2000 samples for each class.
Test data set: 1000 samples for each class.
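Below is a hedged R sketch of this kind of simulation. The mixture component means and standard deviations are made up for illustration only, since the lesson does not list them, so the fitted numbers will not match the estimates reported next; the point is the structure of the comparison (LDA versus linear logistic regression on 1-D mixture data).

set.seed(1)
# Hypothetical mixture parameters, chosen only for illustration
rmix <- function(n, m1, m2, s) ifelse(runif(n) < 0.5, rnorm(n, m1, s), rnorm(n, m2, s))
x1 <- rmix(2000, -2.0, -0.5, 0.6)   # class 1 training samples
x2 <- rmix(2000,  0.0,  1.5, 0.6)   # class 2 training samples
train <- data.frame(x = c(x1, x2), g = factor(rep(1:2, each = 2000)))

library(MASS)
fit.lda <- lda(g ~ x, data = train)                      # LDA on the 1-D feature
fit.log <- glm(g ~ x, data = train, family = binomial)   # linear logistic regression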
The class means and the common (pooled) variance estimated by LDA are:
\(\hat{\mu}_1\) = -1.1948,
\(\hat{\mu}_2\) = 0.8224,
\(\hat{\sigma}^2\) = 1.5268.
Boundary value between the two classes is \((\hat{\mu}_1 + \hat{\mu}_2) / 2 = -0.1862\).
The classification error rate on the test data is 0.2315.
Based on the true distribution, the Bayes (optimal) boundary value between the two classes is -0.7750 and the error rate is 0.1765.
The estimated within-class densities by LDA are shown in the plot below. Both densities are Gaussian and are shifted versions of each other, as assumed by LDA.
Linear logistic regression yields the coefficient vector (intercept and slope):
\(\beta = (-0.3288, -1.3275)^T\).
The boundary value satisfies \(-0.3288 - 1.3275X = 0\), hence equals -0.2477.
The error rate on the test data set is 0.2205.
The estimated posterior probability is:
\[ Pr(G=1|X=x) =\frac{e^{- 0.3288-1.3275x}}{1+e^{- 0.3288-1.3275x}} \]
The estimated posterior probability, \(Pr(G =1 | X = x)\), and its true value based on the true distribution are compared in the graph below. We can see that although the Bayes classifier (theoretically optimal) is indeed a linear classifier (in 1-D, this means thresholding by a single value), the posterior probability of the class being 1 bears a form more complicated than the one implied by the logistic regression model. Under the logistic regression model, the posterior probability is a monotonic function of a specific shape, while the true posterior probability is not a monotonic function of x. The assumption made by the logistic regression model is more restrictive than a general linear boundary classifier.
Diabetes data
The diabetes data set is taken from the UCI machine learning database repository at: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes [3].
Load data into R as follows:
# set the working directory
setwd("C:/STAT 897D data mining")
# comma delimited data and no header for each variable
RawData <- read.table("diabetes.data",sep = ",",header=FALSE)
In RawData, the response variable is the last column, and the remaining columns are the predictor variables.
responseY <- as.matrix(RawData[,dim(RawData)[2]])
predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)])
For the convenience of visualization, we take the first two principal components as the new feature variables and conduct the discriminant analysis on this two-dimensional data only.
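The principal component scores pc.comp1 and pc.comp2 used in the model formulas below can be computed, for example, with princomp(); using the correlation matrix and flipping the signs of the scores are assumptions here, made only so that the later code runs:

pca <- princomp(predictorX, cor = TRUE)   # PCA on the standardized predictors
pc.comp  <- pca$scores
pc.comp1 <- -1 * pc.comp[, 1]   # first principal component score (sign flip is an arbitrary convention)
pc.comp2 <- -1 * pc.comp[, 2]   # second principal component score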
The MASS package contains functions for performing linear and quadratic discriminant analysis. Both estimate the class prior probabilities by the proportion of data in each class, unless prior probabilities are specified. The code below performs LDA.
library(MASS)
model.lda <- lda(responseY ~ pc.comp1+pc.comp2)
predict()$class returns the predicted class labels from a fitted model, here the LDA model; with no new data supplied, the predictions are for the training samples.
predict(model.lda)$class
The classification result by LDA is shown in Figure 1. The red circles correspond to Class 1 (with diabetes), the blue circles to Class 0 (non-diabetes).
Figure 1: Classification result by LDA
Quadratic discriminant analysis does not assume homogeneity of the covariance matrices across classes. The following code performs QDA.
model.qda <- qda(responseY ~ pc.comp1+pc.comp2)
Again, we can use predict()$class to obtain classification result based on the estimated QDA model.
predict(model.qda)$class
The classification result by QDA is shown in Figure 2.
Figure 2: Classification result by QDA
Links:
[2] https://archive.ics.uci.edu/ml/datasets/Diabetes
[3] https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes