10.3  Linear Discriminant Analysis
We assume that in population \(\pi_{i}\) the probability density function of \(\boldsymbol{x}\) is multivariate normal with mean vector \(\boldsymbol{\mu}_{i}\) and variance-covariance matrix \(\Sigma\) (the same for all populations). As a formula, this is:
\(f(\mathbf{x}|\pi_i) = \dfrac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{(x-\mu_i)'\Sigma^{-1}(x-\mu_i)}\right)\)
We classify to the population for which \(p_{i} f(\mathbf{x}|\pi_{i})\) is largest.
Because a log transform is monotonic, this is equivalent to classifying an observation to the population for which \(\log\left(p_{i} f(\mathbf{x}|\pi_{i})\right)\) is largest.
Linear discriminant analysis is used when the variance-covariance matrix does not depend on the population. In this case, our decision rule is based on the Linear Score Function, a function of the population means for each of our \(g\) populations, \(\boldsymbol{\mu}_{i}\), as well as the pooled variance-covariance matrix.
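To see why a common \(\Sigma\) leads to a linear rule, it may help to expand the log of \(p_i f(\mathbf{x}|\pi_i)\) under the normal density given above. The following is a sketch of that algebra, not a formal proof:
\(\log\left(p_i f(\mathbf{x}|\pi_i)\right) = -\dfrac{p}{2}\log(2\pi) - \dfrac{1}{2}\log|\Sigma| - \dfrac{1}{2}\mathbf{(x-\mu_i)'\Sigma^{-1}(x-\mu_i)} + \log p_i\)
Expanding the quadratic form gives
\(\mathbf{(x-\mu_i)'\Sigma^{-1}(x-\mu_i)} = \mathbf{x'\Sigma^{-1}x} - 2\mathbf{\mu'_i\Sigma^{-1}x} + \mathbf{\mu'_i\Sigma^{-1}\mu_i}\)
The terms \(-\frac{p}{2}\log(2\pi)\), \(-\frac{1}{2}\log|\Sigma|\), and \(-\frac{1}{2}\mathbf{x'\Sigma^{-1}x}\) are the same for every population, so they do not affect which population gives the largest value. Dropping them leaves
\(-\dfrac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i} + \mathbf{\mu'_i\Sigma^{-1}x} + \log p_i\)
which is linear in \(\mathbf{x}\). If \(\Sigma\) differed between populations, the \(\log|\Sigma_i|\) and \(\mathbf{x'\Sigma_i^{-1}x}\) terms would not cancel and the rule would be quadratic in \(\mathbf{x}\).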

Linear Score Function
 The Linear Score Function is:
 \(s^L_i(\mathbf{x}) = -\dfrac{1}{2}\mathbf{\mu'_i \Sigma^{-1}\mu_i + \mu'_i \Sigma^{-1}x}+ \log p_i = d_{i0}+\sum_{j=1}^{p}d_{ij}x_j + \log p_i\)

where
\(d_{i0} = -\dfrac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}\)
\(d_{ij} = j\text{th element of } \mathbf{\mu'_i\Sigma^{-1}}\)
The far right-hand expression resembles a linear regression with intercept term \(d_{i0}\) and regression coefficients \(d_{ij}\).

Linear Discriminant Function

\(d^L_i(\mathbf{x}) = -\dfrac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i + \mu'_i\Sigma^{-1}x} = d_{i0} + \sum_{j=1}^{p}d_{ij}x_j\)
\(d_{i0} = -\dfrac{1}{2}\mathbf{\mu'_i\Sigma^{-1}\mu_i}\)
Given a sample unit with measurements \(x_{1}, x_{2}, \dots, x_{p}\), we classify the sample unit into the population that has the largest Linear Score Function. This is equivalent to classifying the unit to the population for which the posterior probability of membership is the largest. In practice, we compute the linear score function for each population, plug in the observed values, and assign the unit to the population with the largest score.
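The classification rule can be sketched in a few lines of Python with NumPy. This is a minimal illustration, assuming the parameters are known; the function names `linear_scores` and `posteriors` and all numeric values are made up for the example. It also illustrates the equivalence with posterior probabilities: since the terms dropped from the log density are common to all populations, the posterior probability of population \(i\) is \(\exp(s^L_i) / \sum_k \exp(s^L_k)\).

```python
import numpy as np

def linear_scores(x, means, sigma, priors):
    """Linear score s_i^L(x) for each population i.

    means:  (g, p) array, row i is the mean vector mu_i
    sigma:  (p, p) common variance-covariance matrix
    priors: (g,) prior probabilities p_i
    """
    # Row i of coef is (Sigma^{-1} mu_i)', i.e. the coefficients d_{i1..ip}
    coef = np.linalg.solve(sigma, means.T).T
    # Intercepts d_{i0} = -1/2 mu_i' Sigma^{-1} mu_i
    d0 = -0.5 * np.sum(means * coef, axis=1)
    return d0 + coef @ x + np.log(priors)

def posteriors(scores):
    """Posterior membership probabilities from the linear scores."""
    w = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return w / w.sum()

# Hypothetical parameters for g = 3 populations and p = 2 variables
means = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
priors = np.array([0.5, 0.3, 0.2])
x = np.array([1.5, 0.5])  # measurements on one sample unit

scores = linear_scores(x, means, sigma, priors)
post = posteriors(scores)
label = int(np.argmax(scores))  # index of the population with the largest score
print(scores, post, label)
```

Because the exponential is monotonic, the population with the largest score is always the one with the largest posterior probability.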
However, this is a function of unknown parameters, \(\boldsymbol{\mu}_{i}\) and \(\Sigma\). So, these must be estimated from the data.
Discriminant analysis requires estimates of:
\(p_i = \text{Pr}(\pi_i)\); \(i = 1, 2, \dots, g\)
\(\mathbf{\mu_i} = E(\mathbf{X}|\pi_i)\); \(i = 1, 2, \dots, g\)
\(\Sigma = \text{var}(\mathbf{X}|\pi_i)\); \(i = 1, 2, \dots, g\)
 Prior probabilities:
 The population means are estimated by the sample mean vectors \(\bar{\mathbf{x}}_i\).
 The variance-covariance matrix is estimated by the pooled variance-covariance matrix \(\mathbf{S}_p\).
Typically, these parameters are estimated from training data, in which the population membership is known.
Conditional Density Function Parameters
Population Means: \(\boldsymbol{\mu}_{i}\) is estimated by substituting in the sample means \(\bar{\mathbf{x}}_i\).
Variance-Covariance matrix: Let \(\mathbf{S}_i\) denote the sample variance-covariance matrix for population \(i\). Then the variance-covariance matrix \(\Sigma\) is estimated by substituting the pooled variance-covariance matrix into the Linear Score Function as shown below:
\(\mathbf{S}_p = \dfrac{\sum_{i=1}^{g}(n_i-1)\mathbf{S}_i}{\sum_{i=1}^{g}(n_i-1)}\)
to obtain the estimated linear score function:
\(\hat{s}^L_i(\mathbf{x}) = -\frac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i +\bar{x}'_i S^{-1}_p x } + \log{\hat{p}_i} = \hat{d}_{i0} + \sum_{j=1}^{p}\hat{d}_{ij}x_j + \log{\hat{p}_i}\)
where
\(\hat{d}_{i0} = -\dfrac{1}{2}\mathbf{\bar{x}'_i S^{-1}_p \bar{x}_i}\)
and
\(\hat{d}_{ij} = j\text{th element of } \mathbf{\bar{x}'_i S^{-1}_p}\)
This is a function of the sample mean vectors, the pooled variance-covariance matrix, and the prior probabilities for \(g\) different populations. It is written in a form that looks like a linear regression formula: an intercept term, plus a linear combination of response variables, plus the natural log of the prior probabilities.
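The estimation steps above can be sketched in Python with NumPy. This is an illustrative sketch rather than a reference implementation: the function names `fit_lda` and `estimated_scores` are hypothetical, the training data are simulated, and the priors are estimated by the sample proportions \(n_i/N\), which is one common choice and an assumption here.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, mean vectors, and pooled covariance from training data.

    X: (N, p) array of measurements; y: (N,) array of known population labels.
    """
    labels = np.unique(y)
    n = np.array([np.sum(y == k) for k in labels])
    priors = n / n.sum()  # assumption: priors estimated by sample proportions
    means = np.array([X[y == k].mean(axis=0) for k in labels])
    # Pooled covariance S_p = sum_i (n_i - 1) S_i / sum_i (n_i - 1);
    # np.cov divides by n_i - 1 by default, so it returns S_i directly.
    S_p = sum((n_k - 1) * np.cov(X[y == k], rowvar=False)
              for k, n_k in zip(labels, n)) / (n - 1).sum()
    return labels, priors, means, S_p

def estimated_scores(x, priors, means, S_p):
    """Estimated linear score for each population at the point x."""
    coef = np.linalg.solve(S_p, means.T).T          # row i holds d_hat_{i1..ip}
    intercept = -0.5 * np.sum(means * coef, axis=1)  # d_hat_{i0}
    return intercept + coef @ x + np.log(priors)

# Simulated training data: 3 populations, common covariance, 50 units each
rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
cov = np.array([[1.0, 0.2], [0.2, 1.0]])
X = np.vstack([rng.multivariate_normal(m, cov, size=50) for m in true_means])
y = np.repeat([0, 1, 2], 50)

labels, priors, means, S_p = fit_lda(X, y)
new_unit = np.array([5.0, 1.0])
scores = estimated_scores(new_unit, priors, means, S_p)
print(labels[np.argmax(scores)])  # assigned population label
```

Solving the linear system with `np.linalg.solve` avoids forming \(\mathbf{S}_p^{-1}\) explicitly, which is generally the more numerically stable choice.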