10 Classification
Overview
Textbook reading: Chapter 4: Classification and Section 2.2.3.
Classification is supervised learning for which the true class labels for the data points are given in the training data.
Setup for supervised learning
- Training data: \({(x_1, g_1), (x_2, g_2), \cdots , (x_N, g_N )}\)
- The feature vector \(X = (X_1, X_2, \cdots , X_p)\), where each variable \(X_j\) is quantitative.
- The response variable G is categorical. \(G \in G = {1, 2, \cdots , K}\)
- Form a predictor \(G(x)\) to predict G based on X.
\(G(x)\) divides the input space (feature vector space) into a collection of regions, each labeled by one class. See the example partition of the feature vector space by \(G(x)\) in the following plots. For each plot, the space is divided into three pieces, each assigned with a particular class.
Classification Error Rate
The classification error rate is the number of observations that are misclassified over the sample size.
\[\begin{equation*}
\frac{1}{n} \sum_{i=1}^n I(\hat{Y_i} \neq y_i),
\end{equation*}\]
where \(I(\hat{Y_i} \neq y_i) = 1\) if \(\hat{Y_i} \neq y_i\) and 0 otherwise.
For binary classification, let ‘Y.hat’ be a 0-1 vector of the predicted class labels, and ‘y’ be a 0-1 vector of the observed class labels. We can calculate the classification error rate by “mean(abs(Y.hat-y))” in R.
Bayes Classification Rule
[Recall material from Section 2.2.3 in the textbook, specifically from the bottom half of page 37 through the top of page 39.]
Suppose the marginal distribution of G is specified by the probability mass function (pmf) \(p_G(g), g = 1, 2, \cdots , K\).
The conditional distribution of X given \(G = g\) is \(f_{X|G}(x | G = g)\).
The training data \((x_i,g_i),i=1, 2, \cdots ,N\), are independent samples from the joint distribution of X and G,
\[f_{X,G}(x, g) = p_G(g)f_{X|G}(x | G = g)\]
Assume the loss of predicting G at \(G(X)\) = \(\hat{G}\) is \(L(\hat{G}, G)\).
Goal of classification: to minimize the expected loss
\[E_{X,G}L(G(X), G) = E_X (E_{G|X}L(G(X), G))\]
To minimize the left hand side, it suffices to minimize \(E_{G|X}L(G(X), G)\) for each X. Hence the optimal predictor:
\[\hat{G}(x)= \text{arg} \underset{g}{min}E_{G|X=x}L(g, G)\]
The above decision rule is called the Bayes classification rule.
For 0-1 loss, i.e.,
\[L(g, g')= \begin{cases}
1 & g \ne g' \\
0 & g = g'
\end{cases}\]
\[E_{G|X=x}L(g, G)=1-Pr(G=g|X=x)\].
The Bayes rule becomes the rule of maximum posterior probability:
\[ \begin {align} \hat{G}(x) &=\text{arg }\underset{g}{min} E_{G|X=x}L(g, G)\\ & = \text{arg }\underset{g}{max} Pr(G=g|X=x)\\ \end {align} \]
Many classification algorithms attempt to estimate \(Pr(G = g | X = x)\), then apply the Bayes rule.
Linear Methods for Classification
Decision boundaries are linear.
Two class problem:
- The decision boundary between the two classes is a hyperplane in the feature vector space.
- A hyperplane in the p dimensional input space is the set:
\[x:\alpha_o + \sum^{p}_{j=1}\alpha_{j}x_{j}=0\]
The two regions separated by a hyperplane:
\[x:\alpha_o + \sum^{p}_{j=1}\alpha_{j}x_{j} \> 0\] and \[{x:\alpha_o + \sum^{p}_{j=1}\alpha_{j}x_{j}<0}\]
More than two classes:
- The decision boundary between any pair of classes k and l is a hyperplane (shown in previous figure).
How do you choose the hyperplane?
Example methods for deciding the hyperplane:
- Linear discriminant analysis.
- Logistic regression.
Linear decision boundaries are not necessarily constrained. We can expand the feature space by adding in extra variables formed by functions of the original variables. Linear decision boundaries in the expanded feature space may be nonlinear in the original feature space.
Objectives
Upon successful completion of this lesson, you should be able to:
- The basic setup of a classification problem.
- Understand the Bayes classification rule.
- Understand the statistical model of logistic regression.
- Know the binary logistic regression algorithm and how to program it.
- Know the multi-class logistic regression algorithm.
- Understand the statistical model used by Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).
- Understand how to estimate the Gaussian distributions within each class.
- Understand the LDA and QDA classification rule.
- Understand some similarities and differences between LDA and logistic regression.
- Understand the K-nearest-neighbor classification algorithm.
- Understand how the number of neighbors, K, adjusts the model complexity.
10.1 Logistic Regression
Introduction
There are two big branches of methods for classification. One is called generative modeling, the other is called discriminative modeling. Logistic regression for classification is a discriminative modeling approach, where we estimate the posterior probabilities of classes given X directly without assuming the marginal distribution on X.
It preserves linear classification boundaries.
A review of the Bayes rule shows that when we use 0-1 loss, we pick the class k that has the maximum posterior probability:
\[\hat{G}(x)= \text{arg } \underset{k}{max}Pr(G=k|X=x) \]
The decision boundary between classes k and l is determined by the equation:
\[Pr(G=k|X=x) =Pr(G=l|X=x) \]
that is, the x’s at which the two posterior probabilities of k and l are equal.
If we divide both sides by \(Pr(G = l | X = x)\) and take the log of this ratio, the above equation is equivalent to:
\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=0 \]
Since we want to enforce a linear classification boundary, we assume the function above is linear (below):
\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=a_{0}^{(k,l)}+\sum_{j=1}^{p}a_{j}^{(k,l)}x_j\]
This is the basic assumption of logistic regression (simple indeed).
We use the superscript (k, l) on the coefficients of the linear function because, for every pair of k and l, the decision boundary would be different, determined by the different coefficients.
For logistic regression, there are restrictive relations between \(a ^ { ( k , l ) }\) for different pairs of (k, l). We don’t really need to specify this equation for every pair of k and l. Instead, we only need to specify it for K - 1 such pairs.
Assumptions
Let’s look at our assumptions. If we take class K as the base class, the assumed equations are:
\[\text{log } \dfrac{Pr(G=1|X=x)}{Pr(G=K|X=x)}=\beta_{10} +
\beta_{1}^{T}x\] \[\text{log } \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)}=\beta_{20} +
\beta_{2}^{T}x\] \[\vdots\]
\[\text{log } \dfrac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)}=\beta_{(K-1)0} +
\beta_{K-1}^{T}x\]
This indicates that we don’t have to specify the decision boundary for every pair of classes. We only need to specify the decision boundary between class j and the base class K. (You can choose any class as the base class - it doesn’t really matter.)
Once we have specified the parameters for these K - 1 log ratios, then for any pair of classes (k, l), we can derive the log ratios without introducing new parameters:
\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=\beta_{k0} + \beta_{l0}+(\beta_k - \beta_l)^{T}x \]
Number of parameters: \((K - 1)(p + 1)\).
For convenience, we will denote the entire parameter set by θ and arrange them in this way:
\[\theta =\beta_{10},\beta_{1}^T,\beta_{20},\beta_{2}^T, ... , ,\beta_{(K-1)0},\beta_{K-1}^T\]
The log ratios of posterior probabilities are called log-odds or logit transformations.
Under these assumptions, the posterior probabilities are given by the following two equations:
\[ Pr(G=k|X=x)=\frac{\text{exp }(\beta_{k0} + \beta_{k}^{T}x)}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \text{ for } k = 1, ... , K - 1 \]
\[ Pr(G=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \]
For \(Pr(G = k | X = x)\) given above, obviously
- These must sum up to 1: \(\sum_{k=1}^{K} Pr(G=k|X=x)=1\)
- A simple calculation shows that the assumptions are satisfied.
Comparison with Linear Regression on Indicators
Similarities:
- Both attempt to estimate \(Pr(G = k | X = x)\).
- Both have linear classification boundaries.
- Posterior probabilities sum to 1 across classes.
Difference:
- Linear regression on indicator matrix: approximate \(Pr(G = k | X = x)\) by a linear function of x. \(Pr(G = k | X = x)\) is not guaranteed to fall between 0 and 1.
- Logistic regression: \(Pr(G = k | X = x)\) is a nonlinear function of x. It is guaranteed to range from 0 to 1.
10.1.1 Fitting Logistic Regression Models
How do we estimate the parameters? How do we fit a logistic regression model?
We need a certain optimization criterion for choosing the parameters.
Optimization Criterion
What we want to do is to find parameters that maximize the conditional likelihood of class labels G given X using the training data. We are not interested in the distribution of X, instead, our focus is on the conditional probabilities of the class labels given X.
Given point xi , the posterior probability for the class to be k is denoted by:
\[p_k(x_i ; \theta) = Pr(G = k | X = x_i ; \theta) \]
Given the first input x1, the posterior probability of its class, denoted as g1, is computed by:
\[Pr(G = g_1| X = x_1)\].
Since samples in the training data set are assumed independent, the posterior probability for the N sample points each having class \(g_i , i =1, 2, \cdots , N\), given their inputs \(x_1, x_2, \cdots , x_N\) is:
\[\prod_{i=1}^{N} Pr(G=g_i|X=x_i)\]
In other words, the joint conditional likelihood is the product of the conditional probabilities of the classes given every data point.
The conditional log-likelihood of the class labels in the training data set becomes a summation:
\[ \begin {align} l(\theta) &=\sum_{i=1}^{N}\text{ log }Pr(G=g_i|X=x_i)\\ & = \sum_{i=1}^{N}\text{ log }p_{g_{i}}(x_i; \theta) \\ \end {align} \]
10.1.2 Binary Classification
Let’s look at binary classification first. In fact, when we go to more than two classes the formulas become much more complicated, although they are derived similarly. We will begin by looking at this simpler case.
For binary classification, if \(g_i\) = class 1, denote \(y_i\) = 1; if \(g_i\) = class 2, denote \(y_i\) = 0. All we are doing here is changing the labels to 1’s and 0’s so that the notation will be simpler.
Let \(p_1(x ; \theta) = p(x ; \theta)\).
Then \(p_2(x ; \theta) = 1 - p_1(x ; \theta) = 1- p(x ; \theta)\) because the posterior probabilities of the two classes have to sum up to 1.
Since K = 2 we only have one linear equation and one decision boundary between two classes, the parameters \(\theta = {\beta_{10}, \beta_{1}}\), (remember, \(\beta_{1}\) is a vector). And, we denote \(\beta = (\beta_{10}, \beta_1)^T\) which is a column vector.
Now, let’s try to simplify the equation a little bit.
If \(y_i = 1\), i.e.,\(g_i = 1\), then
\[ \begin {align} \text{log }p_{g_{i}}(x;\beta)& =\text{log }p_1(x;\beta)\\ & = 1 \cdot \text{log }p(x;\beta) \\ & = y_i \text{log }p(x;\beta) \\ \end {align} \]
If \(y_i = 0\), i.e., \(g_i = 2\),
\[ \begin {align} \text{log }p_{g_{i}}(x;\beta)& =\text{log }p_2(x;\beta)\\ & = 1 \cdot \text{log }(1-p(x;\beta)) \\ & = (1-y_i )\text{log }(1-p(x;\beta)) \\ \end {align} \]
Since either \(y_i = 0\) or \(1 - y_i = 0\), we can add the two (at any time, only one of the two is nonzero) and have:
\[\text{log }p_{g_{i}}(x;\beta)=y_i \text{log }p(x;\beta)+(1-y_i )\text{log }(1-p(x;\beta))\]
The reason for doing this is that when we later derive the logistic regression we do not want to work with different cases. It would be tedious in notation if we have to distinguish different cases each time. This unified equation is always correct for whatever yi you use.
Next, we will plug what we just derived into the log-likelihood, which is simply a summation over all the points:
\[ \begin {align} l(\beta)& =\sum_{i=1}^{N}\text{log }p_{g_{i}}(x_i;\beta)\\ & = \sum_{i=1}^{N}(y_i \text{log }p(x_i;\beta)+(1-y_i )\text{log }(1-p(x_i;\beta)))\\ \end {align} \]
There are p + 1 parameters in \(\beta = (\beta_{10}, \beta_1)^T\).
Assume a column vector form for \(\beta\):
\[ \beta=\begin{pmatrix} \beta_{10}\\ \beta_{11}\\ \beta_{12}\\ \vdots\\ \beta_{1,p} \end{pmatrix} \]
Here we add the constant term 1 to x to accommodate the intercept and keep this in matrix form.
\[ x=\begin{pmatrix} 1\\ x,_1\\ x,_2\\ \vdots\\ x,_p \end{pmatrix} \]
Under the assumption of the logistic regression model:
\[ \begin{align} p(x;\beta) & =Pr(G=1|X=x)=\frac{ \text{exp }(\beta^Tx)}{1+ \text{exp }(\beta^Tx)}\\ 1-p(x;\beta) & = Pr(G=2|X=x)=\frac{1}{1+ \text{exp }(\beta^Tx)} \\ \end {align} \]
we substitute the above in \(l(\beta)\):
\[l(\beta) =\sum_{i=1}^{N}\left(y_i\beta^Tx_i-\text{log }(1+e^{\beta^{T}x_{i}}) \right)\]
The idea here is based on basic calculus. If we want to maximize \(l(\beta)\), we set the first order partial derivatives of the function \(l(\beta)\) with respect to \(\beta_{1j}\) to zero.
\[ \begin {align}\frac{\partial l(\beta)}{\beta_{1j}} & =\sum_{i=1}^{N}y_ix_{ij}-\sum_{i=1}^{N}\frac{x_{ij}e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}\\ & = \sum_{i=1}^{N}y_ix_{ij}-\sum_{i=1}^{N}p(x;\beta)x_{ij}\\ & = \sum_{i=1}^{N}x_{ij}(y_i-p(x_i;\beta))\\ \end {align} \]
for all j = 0, 1, … , p.
And, if we go through the math, we will find out that it is equivalent to the matrix form below.
\[\frac{\partial l(\beta)}{\beta_{1j}}= \sum_{i=1}^{N}x_{i}(y_i-p(x_i;\beta)) \]
Here \(x_i\) is a column vector and \(y_i\) is a scalar.
To solve the set of p +1 nonlinear equations \(\frac{\partial l(\beta)}{\partial\beta_{1j}}=0\), j = 0, 1, … , p , we will use the Newton-Raphson algorithm.
The Newton-Raphson algorithm requires the second-order derivatives or the so-called Hessian matrix (a square matrix of the second order derivatives) which is given by:
\[\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} =-\sum_{i=1}^{N}x_{i}x_{i}^{T}p(x_i;\beta)(1-p(x_i;\beta)) \]
Below we derive the Hessian matrix, where the element on the jth row and nth column is (counting from 0):
\[ \begin {align} &\frac{\partial l(\beta)}{\partial\beta_{1j}\partial\beta_{1n}}\\ & = - \sum_{i=1}^{N}\frac{(1+e^{\beta^{T}x_{i}})e^{\beta^{T}x_{i}}x_{ij}x_{in}-(e^{\beta^{T}x_{i}} )^2x_{ij}x_{in} }{(1+e^{\beta^{T}x_{i}})^2}\\ & = - \sum_{i=1}^{N}x_{ij}x_{in}p(x_i;\beta)-x_{ij}x_{in}p(x_i;\beta)^2\\ & = - \sum_{i=1}^{N}x_{ij}x_{in}p(x_i;\beta)(1-p(x_i;\beta))\\ \end {align} \]
Starting with \(\beta^{old}\), a single Newton-Raphson update is given by this matrix forumla:
\[\beta^{new} = \beta^{old}-\left(\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} \right)^{-1}\frac{\partial l(\beta)}{\partial\beta} \]
Basically, if given an old set of parameters, we update the new set of parameters by taking \(\beta^{old}\) minus the inverse of the Hessian matrix times the first order derivative vector. These derivatives are all evaluated at \(\beta^{old}\).
The iteration can be expressed compactly in matrix form.
Let y be the column vector of \(y_i\).
Let \(\mathbf{X}\) be the N × (p + 1) input matrix.
Let \(\mathbf{p}\) be the N-vector of fitted probabilities with ith element \(p(x_i ; \beta^{old})\).
Let \(\mathbf{W}\) be an N × N diagonal matrix of weights with ith element \(p(x_i ; \beta^{old})(1 - p(x_i ;\beta^{old}))\).
If you have set up all the matrices properly then the first order derivative vector is: \[\frac{\partial l(\beta)}{\partial\beta}=\mathbf{X}^T(\mathbf{y}-\mathbf{p}) \]
and the Hessian matrix is:
\[\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T}=-\mathbf{X}^T\mathbf{W}\mathbf{X} \]
The Newton-Raphson step is:
\[ \begin {align}\beta^{new} &=\beta^{old}+(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})\\ & = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}(\mathbf{X}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}))\\ & = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z} \end {align} \]
where \(\mathbf{z} ; \buildrel\triangle\over = ; \mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})\)
If \(\mathbf{z}\) is viewed as a response and \(\mathbf{X}\) is the input matrix, \(\beta^{new}\) is the solution to a weighted least square problem: \[\beta^{new} \leftarrow \text{arg } \underset{\beta}{min} (\mathbf{z}-\mathbf{X}\beta)^T\mathbf{W}(\mathbf{z}-\mathbf{X}\beta) \]
Recall that linear regression by least square solves \[\underset{\beta}{min} (\mathbf{z}-\mathbf{X}\beta)^T\mathbf{W}(\mathbf{z}-\mathbf{X}\beta) \mathbf{z}\] is referred to as the adjusted response.
The algorithm is referred to as iteratively reweighted least squares or IRLS.
The pseudo code for the IRLS algorithm is provided below.
Pseudo Code
\(0 \rightarrow \beta\)
Compute y by setting its elements to:
\[y_i=\left\{\begin{matrix} 1 & \text{if } g_i =1 \\ 0 & \text{if } g_i =2 \end{matrix}\right.\]
- Compute p by setting its elements to:
\[p(x_i;\beta)=\frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}} i = 1, 2, ... , N\]
Compute the diagonal matrix \(\mathbf{W}\). The ith diagonal element is \(p(x_i ; \beta)(1 - p(x_i ; \beta))\), i =1, 2, … , N.
\(\mathbf{z} \leftarrow \mathbf{X}\beta + \mathbf{W}^{ -1}(\mathbf{y} - \mathbf{p})\)
\(\beta \leftarrow (\mathbf{X}^{T} \mathbf{W}\mathbf{X})^{-1} \mathbf{X}^{T} \mathbf{W} \mathbf{z}\)
If the stopping criteria are met, stop; otherwise go back to step 3.
Computational Efficiency
Since W is an N × N diagonal matrix, direct matrix operations with it, as shown in the above pseudo code, is inefficient.
A modified pseudo code with much improved computational efficiency is given below.
\(0 \rightarrow \beta\)
Compute y by setting its elements to:
\[y_i=\left\{\begin{matrix} 1 & \text{if } g_i =1 \\ 0 & \text{if } g_i =2 \end{matrix}\right.\]
- Compute p by setting its elements to:
\[p(x_i;\beta)=\dfrac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}} i = 1, 2, ... , N\]
- Compute the N × (p +1) matrix \(\tilde{\mathbf{X}}\) by multiplying the ith row of matrix \(\mathbf{X}\) by \(p(x_i ; \beta)(1 - p(x_i ; \beta))\), i =1, 2, … , N:
\[\mathbf{X}=\begin{pmatrix} x_{1}^{T}\\ x_{2}^{T}\\ ...\\ x_{N}^{T}\end{pmatrix} \tilde{\mathbf{X}}= \begin{pmatrix} p(x_1;\beta)(1-p(x_1;\beta))x_{1}^{T}\\ p(x_2;\beta)(1-p(x_2;\beta))x_{2}^{T}\\ ...\\ p(x_N;\beta)(1-p(x_N;\beta))x_{N}^{T}\end{pmatrix} \]
\(\beta \leftarrow \beta+(\mathbf{X}^T\tilde{\mathbf{X}})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})\).
If the stopping criteria are met, stop; otherwise go back to step 3.
10.1.3 Example - Diabetes Dataset
This is the Pima Indians diabetes dataset we used before in Lesson 4.5. The input X is two dimensional. The two variables \(X_1\) and \(X_2\) are the first two principal components of the original 8 variables.
There are two classes: without diabetes \((Y=0)\); with diabetes \((Y=1)\). To avoid confusion, we’ll label the class variable, G, the same way (so \(G=0\) corresponds to \(Y=0\) and \(G=1\) corresponds to \(Y=1\)).
Applying logistic regression, we obtain \[\beta = (–0.7682, 0.6816, 0.3663)^T\].
Typically in logistic regression we focus on the probability equation for \(Y=1\), which in this case can be rewritten as:
\[Pr(G=1|X=x) =\frac{e^{\beta_0+\beta_1X_1+\beta_2X_2}}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2}}=\frac{1}{1+e^{-\beta_0-\beta_1X_1-\beta_2X_2}}.\]
The posterior probabilities based on the estimated parameters are therefore: \[ \begin {align} Pr(G=1|X=x) & =\frac{e^{-0.7682+0.6816X_1+0.3663X_2}}{1+e^{-0.7682+0.6816X_1+0.3663X_2}} =\frac{1}{1+e^{0.7682-0.6816X_1-0.3663X_2}} \\ Pr(G=0|X=x) & =1-Pr(G=1|X=x) \end {align} \]
The classification rule based on a 0.5 probability cut-off is: \[\hat{G}(x) = \left\{\begin{matrix} 0 & \text{if } 0.7682-0.6816X_1-0.3663X_2 \ge 0\\ 1 & \text{if } 0.7682-0.6816X_1-0.3663X_2 < 0 \end{matrix}\right. \]
In the following scatter plot, we show the classification boundary obtained by logistic regression and compare it to that by LDA. Solid line: decision boundary obtained by logistic regression. Dash line: decision boundary obtained by LDA. Red cross: without diabetes (class 0). Blue circle: with diabetes (class 1).
The performance of the logistic regression classification is as follows.
We obtain the classifier and apply it to the training data set and see what percentage of data is classified incorrectly because we know the true labels in the training data.
By sensitivity, we mean that if the person has diabetes what percentage of the data points will say that they actually have diabetes. In other words, if it is, in fact, a positive case, a percentage of the times that we will classify it as a positive case.
This refers to the percentage of correctness of negative samples. If the person does not have diabetes, what percentage of the times that we classify them as not having diabetes.
- QDA within training data classification error rate: 28.13%.
- Sensitivity: 45.90%.
- Specificity: 85.80%.
10.1.4 Multiclass Case (K ≥ 3)
When \(K \geq 3\), the parameter \(\beta\) is a \((K - 1)(p + 1)\) - vector:
\[\beta = \begin{pmatrix} \beta_{10}\\ \beta_{1}\\ \beta_{20}\\ \beta_{2}\\ \vdots\\ \beta_{(K-1)0}\\ \beta_{K-1} \end{pmatrix} = \begin{pmatrix} \beta_{10}\\ \beta_{11}\\ \vdots\\ \beta_{1p}\\ \beta_{20}\\ \vdots\\ \beta_{2p}\\ \vdots\\ \beta_{(K-1)0}\\ \vdots\\ \beta_{(K-1)p} \end{pmatrix} \]
Let \(\bar{\beta}_l=\begin{pmatrix} \beta_{10}\\\beta_{l}\end{pmatrix}\)
The likelihood function becomes \[ \begin {align}l(\beta)& = \sum_{i=1}^{N}\text{log }p_{g_{i}}(x_i;\beta)\\ & = \sum_{i=1}^{N}\text{log } \left( \frac{e^{\bar{\beta}_{g_{i}}^{T}x_i}}{1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{g_{i}}^{T}x_i}} \right) \\ & = \sum_{i=1}^{N}\left(\bar{\beta}_{g_{i}}^{T}x_i-\text{log }\left( 1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{g_{i}}^{T}x_i} \right) \right) \\ \end {align} \]
First order derivatives of the log likelihood function are: \[ \begin {align}\frac{\partial l(\beta)}{\partial \beta_{kj}}& = \sum_{i=1}^{N}\left( I(g_i=k)x_{ij}-\frac{e^{\bar{\beta}_{k}^{T}x_i}x_{ij}}{1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i}} \right)\\ & = \sum_{i=1}^{N}x_{ij}(I(g_i=k)-p_k(x_i;\beta)) \\ \end {align} \]
Second order derivatives:
Expressing the solution in matrix form is much more compact. First, we introduce several notations.
- \(\mathbf{y}\) is the concatenated indicator vector, a column vector of dimension \(N(K -1)\).
\[ \mathbf{y}=\begin{pmatrix} \mathbf{y}_{1}\\ \mathbf{y}_{2}\\ \vdots\\ \mathbf{y}_{K-1} \end{pmatrix} \mathbf{y}_{k}=\begin{pmatrix} I(g_1=k)\\ I(g_2=k)\\ \vdots\\ I(g_N=k) \end{pmatrix} 1 \le k \le K-1 \]
- \(\mathbf{p}\) is the concatenated vector of fitted probabilities, a column vector of dimension N(K -1).
\[ \mathbf{p}=\begin{pmatrix} \mathbf{p}_{1}\\ \mathbf{p}_{2}\\ \vdots\\ \mathbf{p}_{K-1} \end{pmatrix} \mathbf{p}_{k}=\begin{pmatrix} p_k(x_1;\beta)\\ p_k(x_2;\beta)\\ \vdots\\ p_k(x_N;\beta) \end{pmatrix} 1 \le k \le K-1 \]
–\(\tilde{X}\) is an \(N(K -1) \times (p + 1)(K -1)\) matrix, where X is the \(N(p + 1)\) input matrix defined before: \[ \tilde{\mathbf{X}}= \begin{pmatrix} \mathbf{X} & 0 & ... & 0\\ 0 & \mathbf{X} & ... & 0\\ ... & ... & ... &... \\ 0 & 0 & ... & \mathbf{X} \end{pmatrix} \]
- Matrix \(\mathbf{W}\) is an \(N(K -1) \times N(K -1)\) square matrix:
\[ \mathbf{W}= \begin{pmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & ... & \mathbf{W}_{1(K-1)}\\ \mathbf{W}_{21} & \mathbf{W}_{22} & ... & \mathbf{W}_{2(K-1)}\\ ... & ... & ... &... \\ \mathbf{W}_{(K-1),1} & \mathbf{W}_{(K-1),2} & ... & \mathbf{W}_{(K-1),(K-1)} \end{pmatrix} \]
– Each submatrix \(\mathbf{W}_{km}\), \(1 \leq k\), \(m \leq K -1\), is an N × N diagonal matrix.
– When \(k = m\), the ith diagonal element in \(\mathbf{W}_{kk}\) is \(p_k(x_i ; \beta^{old})(1 -p_k(x_i ; \beta^{old}))\).
– When \(k \ne m\), the ith diagonal element in \(\mathbf{W}_{km}\) is \(-p_k(x_i ; \beta^{old})(1 -p_m(x_i ; \beta^{old}))\).
Similarly as with binary classification \[ \begin {align}\frac{\partial l(\beta)}{\partial \beta}& = \tilde{\mathbf{X}}^T(\mathbf{y}-\mathbf{p}) \\ \frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T}& = \tilde{\mathbf{X}}^T \mathbf{W}\tilde{\mathbf{X}}\\ \end {align} \]
The formula for updating \(\beta^{new}\) in the binary classification case holds for multiclass, except that the definitions of the matrices are more complicated. The formula itself may look simple because the complexity is wrapped in the definitions of the matrices.
\[ \begin {align}\beta^{new}& = (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{W}\mathbf{z}\\ \text{where }\mathbf{z}& \buildrel\triangle\over = \tilde{\mathbf{X}}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}) \text{ or simply:}\\ \beta^{new}& = \beta^{old}+ (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T(\mathbf{y}-\mathbf{p}) \end {align} \]
Computation Issues
We now elaborate on some details in the computation.
To initialize the parameters before starting the iterations, we often use \(\beta = 0\).
Convergence of the algorithm is not guaranteed but is usually the case.
Usually, the log-likelihood increases after each iteration, but overshooting can occur.
In the rare case that the log-likelihood decreases, cut step size by half. That is, instead of using the calculated new parameters, use the average of the new parameters and the old parameters.
10.2 Discriminant Analysis
Introduction
Let the feature vector be X and the class labels be Y.
The Bayes rule says that if you have the joint distribution of X and Y, and if X is given, under 0-1 loss, the optimal decision on Y is to choose a class with maximum posterior probability given X.
Discriminant analysis belongs to the branch of classification methods called generative modeling, where we try to estimate the within-class density of X given the class label. Combined with the prior probability (unconditioned probability) of classes, the posterior probability of Y can be obtained by the Bayes formula.
Notation
Assume the prior probability or the marginal pmf for class k is denoted as \(\pi_k\), \(\sum^{K}_{k=1} \pi_k =1\).
\(\pi_k\) is usually estimated simply by empirical frequencies of the training set:
\[\hat{\pi}_k=\dfrac{\text{\# of Samples in class} k}{\text{Total \# of samples}}\]
You have the training data set and you count what percentage of data come from a certain class.
Then we need the class-conditional density of X. Remember this is the density of X conditioned on the class k, or class G = k denoted by \(f _ { k } ( x )\).
According to the Bayes rule, what we need is to compute the posterior probability:
\[Pr(G=k|X=x)=\frac{f_k(x)\pi_k}{\sum^{K}_{l=1}f_l(x)\pi_l}\]
This is a conditional probability of class G given X.
By MAP (maximum a posteriori, i.e., the Bayes rule for 0-1 loss):
\[ \begin {align} \hat{G}(x) &=\text{arg }\underset{k}{max} Pr(G=k|X=x)\\ & = \text{arg }\underset{k}{max} f_k(x)\pi_k\\ \end {align} \]
Notice that the denominator is identical no matter what class k you are using. Therefore, for maximization, it does not make a difference in the choice of k. The MAP rule is essentially trying to maximize \(\pi_k\)times \(f_k(x)\).
10.2.1 Class Density Estimation
Depending on which algorithms you use, you end up with different ways of density estimation within every class.
In Linear Discriminant Analysis (LDA) we assume that every density within each class is a Gaussian distribution.
- Linear and Quadratic Discriminant Analysis: Gaussian densities.
In LDA we assume those Gaussian distributions for different classes share the same covariance structure. In Quadratic Discriminant Analysis (QDA) we don’t have such a constraint. You will see the difference later.
- General Nonparametric Density Estimates:
You can also use general nonparametric density estimates, for instance kernel estimates and histograms.
- Naive Bayes: assume each of the class densities are products of marginal densities, that is, all the variables are independent.
There is a well-known algorithm called the Naive Bayes algorithm. Here the basic assumption is that all the variables are independent given the class label. Therefore, to estimate the class density, you can separately estimate the density for every dimension and then multiply them to get the joint density. This makes the computation much simpler.
X may be discrete, not continuous. Instead of talking about density, we will use the probability mass function. In this case, we would compute a probability mass function for every dimension and then multiply them to get the joint probability mass function.
You can see that we have swept through several prominent methods for classification. You should also see that they all fall into the Generative Modeling idea. The only essential difference is in how you actually estimate the density for every class.
10.2.2 Linear Discriminant Analysis
Under LDA we assume that the density for X, given every class k is following a Gaussian distribution. Here is the density formula for a multivariate Gaussian distribution:
\[f_k(x)=\dfrac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_{k}^{-1}(x-\mu_k)}\]
p is the dimension and \(\Sigma_k\) is the covariance matrix. This involves the square root of the determinant of this matrix. In this case, we are doing matrix multiplication. The vector x and the mean vector \(\mu_k\) are both column vectors.
For Linear discriminant analysis (LDA): \(\Sigma_k=\Sigma\), \(\forall k\).
In LDA, as we mentioned, you simply assume for different k that the covariance matrix is identical. By making this assumption, the classifier becomes linear. The only difference from a quadratic discriminant analysis is that we do not assume that the covariance matrix is identical for different classes. For QDA, the decision boundary is determined by a quadratic function.
Since the covariance matrix determines the shape of the Gaussian density, in LDA, the Gaussian densities for different classes have the same shape but are shifted versions of each other (different mean vectors). Example densities for the LDA model are shown below.
10.2.3 Optimal Classification
For the moment, we will assume that we already have the covariance matrix for every class. And we will talk about how to estimate this in a moment. Let’s look at what the optimal classification would be based on the Bayes rule
Bayes rule says that we should pick a class that has the maximum posterior probability given the feature vector X. If we are using the generative modeling approach this is equivalent to maximizing the product of the prior and the within-class density.
Since the log function is an increasing function, the maximization is equivalent because whatever gives you the maximum should also give you a maximum under a log function. Next, we plug in the density of the Gaussian distribution assuming common covariance and then multiplying the prior probabilities.
\[\begin{align*} \hat{G}(x) & = \text{arg } \underset{k}{\text{max}} Pr(G=k|X=x) \\ & = \text{arg } \underset{k}{\text{max}}f_k(x)\pi_k \\ & = \text{arg } \underset{k}{\text{max }} \text{ log}(f_k(x)\pi_k) \\ & = \text{arg } \underset{k}{\text{max}}\left[-\text{log}((2\pi)^{p/2}|\Sigma|^{1/2})-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\text{log}(\pi_k) \right] \\ & = \text{arg } \underset{k}{\text{max}}\left[-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\text{log}(\pi_k) \right] \end{align*}\]
View the Video Explanation
LDA Formula
Note:
To sum up, after simplification we obtain this formula:
\[\hat{G}(x)= \text{ arg }\underset{k}{max}\left[x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k} + log(\pi_k) \right] \]
This is the final classifier. Given any x, you simply plug into this formula and see which k maximizes this. Usually the number of classes is pretty small, and very often only two classes. Hence, an exhaustive search over the classes is effective.
LDA gives you a linear boundary because the quadratic term is dropped.
To sum up
\[\hat{G}(x)= \text{ arg }\underset{k}{max}\left[x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k} + log(\pi_k) \right]\]
Then
- Define the linear discriminant function
\[\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k} + log(\pi_k)\]
\[\hat{G}(x)= \text{ arg }\underset{k}{max}\delta_k(x)\]
- The decision boundary between class k and l is:
\[\left\{ x : \delta_k(x) = \delta_l(x)\right\}\]
- Or equivalently the following holds
\[log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+x^T\Sigma^{-1}(\mu_k-\mu_l)=0\]
10.2.4 Binary Classification
In binary classification in particular, for instance if we let (k =1, l =2), then we would define constant \(a_0\), given below, where \(\pi_1\) and \(\pi_2\) are prior probabilities for the two classes and \(\mu_1\) and \(\mu_2\) are mean vectors.
- Binary classification (k = 1, l = 2):
- Define \(a_0 =\text{log }\dfrac{\pi_1}{\pi_2}-\dfrac{1}{2}(\mu_1+\mu_2)^T\Sigma^{-1}(\mu_1-\mu_2)\)
- Define \((a_1, a_2, ... , a_p)^T = \Sigma^{-1}(\mu_1-\mu_2)\)
- Classify to class 1 if \(a_0 +\sum_{j=1}^{p}a_jx_j >0\) ; to class 2 otherwise.
- An example \(\ast \pi_1=\pi_2=0.5\) \(\ast \mu_1=(0,0)^T, \mu_2=(2,-2)^T\)
\(\ast \Sigma = \begin{pmatrix} 1.0&0.0 \\ 0.0 & 0.5625 \end{pmatrix}\)
\(\ast \text{Decision boundary: } 5.56-2.00x_1+3.56x_2=0.0\)
Here is a contour plot of this result:
We have two classes and we know the within-class density. The marginal density is simply the weighted sum of the within-class densities, where the weights are the prior probabilities. Because we have equal weights and because the covariance matrix two classes are identical, we get these symmetric lines in the contour plot. The black diagonal line is the decision boundary for the two classes. Basically, if you are given an x above the line, then we would classify this x into the first-class. If it is below the line, we would classify it into the second class.
There is a missing piece here, right?
For all of the discussion above we assume that we have the prior probabilities for the classes and we also had the within-class densities given to us. Of course, in practice, you don’t have this. In practice, what we have is only a set of training data.
The question is how do we find the \(\pi_k\)’s and the \(f_k(x)\)?
10.2.5 Estimating the Gaussian Distributions
We need to estimate the Gaussian distribution. Here is the formula for estimating the \(\pi_k\)’s and the parameters in the Gaussian distributions. The formula below is actually the maximum likelihood estimator:
\[\hat{\pi}_k=N_k/N\]
where \(N_k\) is the number of class-k samples and N is the total number of points in the training data. As we mentioned, to get the prior probabilities for class k, you simply count the frequency of data points in class k.
Then, the mean vector for every class is also simple. You take all of the data points in a given class and compute the average, the sample mean:
\[\hat{\mu}_k=\sum_{g_i=k}x^{(i)}/N_k\]
Next, the covariance matrix formula looks slightly complicated. The reason is that we have to get a common covariance matrix for all of the classes. First, you divide the data points into two given classes according to the given labels. If we were looking at class k, for every point we subtract the corresponding mean which we computed earlier. Then multiply its transpose. Remember x is a column vector, therefore if we have a column vector multiplied by a row vector, we get a square matrix, which is what we need. \[\hat{\Sigma}=\sum_{k=1}^{K}\sum_{g_i=k}\left(x^{(i)}-\hat{\mu}_k \right)\left(x^{(i)}-\hat{\mu}_k \right)^T/(N-K)\]
First, we do the summation within every class k, then we have the sum over all of the classes. Next, we normalize by the scalar quantity, N - K. When we fit a maximum likelihood estimator it should be divided by N, but if it is divided by N – K, we get an unbiased estimator. Remember, K is the number of classes. So, when N is large, the difference between N and N - K is pretty small.
Note that \(x^{(i)}\) denotes the ith sample vector.
In summary, if you want to use LDA to obtain a classification rule, the first step would involve estimating the parameters using the formulas above. Once you have these, then go back and find the linear discriminant function and choose a class according to the discriminant functions.
10.2.6 Example - Diabetes Data Set
Let’s take a look at a specific data set. This is the diabetes data set from the UC Irvine Machine Learning Repository. It is a fairly small data set by today’s standards. The original data had eight variable dimensions. To simplify the example, we obtain the two prominent principal components from these eight variables. Instead of using the original eight dimensions we will just use these two principal components for this example.
The Diabetes data set has two types of samples in it. One sample type is healthy individuals and the other are individuals with a higher risk of diabetes. Here are the prior probabilities estimated for both of the sample types, first for the healthy individuals and second for those individuals at risk:
\[\hat{\pi}_0 =0.651, \hat{\pi}_1 =0.349 \]
The first type has a prior probability estimated at 0.651. This means that for this data set about 65% of these belong to class 0 and the other 35% belong to class 1. Next, we computed the mean vector for the two classes separately:
\[\hat{\mu}_0 =(-0.4038, -0.1937)^T, \hat{\mu}_1 =(0.7533, 0.3613)^T \]
Then we computed \(\hat{\Sigma}\) using the formulas discussed earlier.
\[\hat{\Sigma}= \begin{pmatrix} 1.7949 & -0.1463\\ -0.1463 & 1.6656 \end{pmatrix} \]
Once we have done all of this, we compute the linear discriminant function and find the classification rule.
Classification rule:
\[ \begin{align*}\hat{G}(x) &=\begin{cases} 0 & 0.7748-0.6767x_1-0.3926x_2 \ge 0 \\ 1 & otherwise \end{cases} \\ & = \begin{cases} 0 & x_2 \le (0.7748/0.3926) - (0.6767/0.3926)x_1 \\ 1 & otherwise \end{cases} \end{align*}\]
In the first specification of the classification rule, plug a given x into the above linear function. If the result is greater than or equal to zero, then claim that it is in class 0, otherwise claim that it is in class 1.
Below is a scatter plot of the two principle components. The two classes are represented, the first, without diabetes, are the red stars (class 0), and the second class with diabetes are the blue circles (class 1). The solid line represents the classification boundary obtained by LDA. It seems as though the two classes are not that well separated. The dashed or dotted line is the boundary obtained by linear regression of an indicator matrix. In this case, the results of the two different linear boundaries are very close.
It is always a good practice to plot things so that if something went terribly wrong it would show up in the plots.
- Within training data classification error rate: 28.26%.
- Sensitivity: 45.90%.
- Specificity: 85.60%.
Here is the contour plot for the density for class 0. [Actually, the figure looks a little off - it should be centered slightly to the left and below the origin.] The contour plot for the density for class 1 would be similar except centered above and to the right.
10.2.7 Simulated Examples
LDA makes some strong assumptions. It assumes that the covariance matrix is identical for different classes. It also assumes that the density is Gaussian. What if these are not true? LDA may not necessarily be bad when the assumptions about the density functions are violated. Here are some examples that might illustrate this.
In certain cases, LDA may yield poor results.
In the first example (a), we do have similar data sets which follow exactly the model assumptions of LDA. This means that the two classes, red and blue, actually have the same covariance matrix and they are generated by Gaussian distributions. Below, in the plots, the black line represents the decision boundary. The second example (b) violates all of the assumptions made by LDA. First of all the within the class of density is not a single Gaussian distribution, instead, it is a mixture of two Gaussian distributions. The overall density would be a mixture of four Gaussian distributions. Also, they have different covariance matrices as well.
Then, if we apply LDA we get this decision boundary (above, black line), which is actually very close to the ideal boundary between the two classes. By ideal boundary, we mean the boundary given by the Bayes rule using the true distribution (since we know it in this simulated example).
If you look at another example, (c) below, here we also generated two classes. The red class still contains two Gaussian distributions. The blue class, which spreads itself over the red class with one mass of data in the upper right and another data mass in the lower left. If we force LDA we get a decision boundary, as displayed. In this case, the result is very bad (far below ideal classification accuracy). You can see that in the upper right the red and blue are very well mixed, however, in the lower left the mix is not as great. You can imagine that the error rate would be very high for classification using this decision boundary.
In plot (d), the density of each class is estimated by a mixture of two Gaussians. The Bayes rule is applied. The resulting boundaries are two curves. The separation of the red and the blue is much improved.
This example illustrates when LDA gets into trouble. LDA separates the two classes with a hyperplane. This means that the two classes have to pretty much be two separated masses, each occupying half of the space. In the above example, the blue class breaks into two pieces, left and right. Then, you have to use more sophisticated density estimation for the two classes if you want to get a good result. This is an example where LDA has seriously broken down. This is why it’s always a good idea to look at the scatter plot before you choose a method. The scatter plot will often show whether a certain method is appropriate. If you see a scatter plot like this example, you can see that the blue class is broken into pieces, and you can imagine if you used LDA, no matter how you position your linear boundary, you are not going to get a good separation between the red and the blue class. Largely you will find out that LDA is not appropriate and you want to take another approach.
10.2.8 Quadratic Discriminant Analysis (QDA)
QDA is not really that much different from LDA except that you assume that the covariance matrix can be different for each class and so, we will estimate the covariance matrix \(\Sigma_k\) separately for each class k, k =1, 2, … , K.
Quadratic discriminant function:
\[\delta_k(x)= -\frac{1}{2}\text{log}|\Sigma_k|-\frac{1}{2}(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k})+\text{log}\pi_k\]
This quadratic discriminant function is very much like the linear discriminant function except that because Σk, the covariance matrix, is not identical, you cannot throw away the quadratic terms. This discriminant function is a quadratic function and will contain second order terms.
Classification rule:
\[\hat{G}(x)=\text{arg }\underset{k}{\text{max }}\delta_k(x)\]
The classification rule is similar as well. You just find the class k which maximizes the quadratic discriminant function.
The decision boundaries are quadratic equations in x.
QDA, because it allows for more flexibility for the covariance matrix, tends to fit the data better than LDA, but then it has more parameters to estimate. The number of parameters increases significantly with QDA. Because, with QDA, you will have a separate covariance matrix for every class. If you have many classes and not so many sample points, this can be a problem.
As we talked about at the beginning of this course, there are trade-offs between fitting the training data well and having a simple model to work with. A simple model sometimes fits the data just as well as a complicated model. Even if the simple model doesn’t fit the training data as well as a complex model, it still might be better on the test data because it is more robust.
QDA Example - Diabetes Data Set
In this example, we do the same things as we have previously with LDA on the prior probabilities and the mean vectors, except now we estimate the covariance matrices separately for each class.
How do we estimate the covariance matrices separately?
Remember, in LDA once we had the summation over the data points in every class we had to pull all the classes together. In QDA we don’t do this.
Prior probabilities: \(\hat{\pi}_0=0.651, \hat{\pi}_1=0.349\).
\[\hat{\mu}_0=(-0.4038, -0.1937)^T, \hat{\mu}_1=(0.7533, 0.3613)^T \] \[\hat{\Sigma_0}= \begin{pmatrix} 1.6790 & -0.0461 \\ -0.0461 & 1.5985 \end{pmatrix} \] \[\hat{\Sigma_1}= \begin{pmatrix} 2.0114 & -0.3334 \\ -0.3334 & 1.7910 \end{pmatrix} \]
The dashed line in the plot below is a decision boundary given by LDA. The curved line is the decision boundary resulting from the QDA method.
For most of the data, it doesn’t make any difference, because most of the data is massed on the left. The percentage of the data in the area where the two decision boundaries differ a lot is small. Therefore, you can imagine that the difference in the error rate is very small.
- Within training data classification error rate: 29.04%.
- Sensitivity: 45.90%.
- Specificity: 84.40%.
Sensitivity for QDA is the same as that obtained by LDA, but specificity is slightly lower.
10.2.9 Connection between LDA and logistic regression
Under the model of LDA, we can compute the log-odds:
\[\begin{align}&\text{log}\frac{Pr(G=k|X=x)}{Pr(G=K|X=x)}\\ & =\text{log}\frac{\pi_k}{\pi_K}-\frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K)& = a_{k0}+a_{k}^{T}x \\ \end {align} \]
The model of LDA satisfies the assumption of the linear logistic model.
The difference between linear logistic regression and LDA is that the linear logistic model only specifies the conditional distribution \(Pr(G = k | X = x)\). No assumption is made about \(Pr(X)\); while the LDA model specifies the joint distribution of X and G. \(Pr(X)\) is a mixture of Gaussians: \[Pr(X)=\sum_{k=1}^{K}\pi_k \phi (X; \mu_k, \Sigma) \] where \(\phi\) is the Gaussian density function.
Moreover, linear logistic regression is solved by maximizing the conditional likelihood of G given X: \(Pr(G = k | X = x)\); while LDA maximizes the joint likelihood of G and X: \(Pr(X = x, G = k)\).
If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data.
Another advantage of LDA is that samples without class labels can be used under the model of LDA. On the other hand, LDA is not robust to gross outliers. Because logistic regression relies on fewer assumptions, it seems to be more robust to the non-Gaussian type of data.
In practice, logistic regression and LDA often give similar results.
Simulated Example
Consider input X being 1-D.
Two classes have equal priors and the class-conditional densities of X are shifted versions of each other, as shown in the plot below.
Each within-class density of X is a mixture of two normals:
- Class 1 (red): 0.6N(-2, ¼ ) + 0.4N(0, 1).
- Class 2 (blue): 0.6N(0, ¼ ) + 0.4N(2, 1).
The class-conditional densities are shown below.
LDA Result
Training data set: 2000 samples for each class.
Test data set: 1000 samples for each class.
The means and variance of the two classes estimated by LDA are:
\[\hat{\mu}_1 = -1.1948, \\\hat{\mu}_2 = 0.8224, \\\hat{\sigma}^2 = 1.5268.\]
Boundary value between the two classes is \((\hat{\mu}_1 + \hat{\mu}_2) / 2 = -0.1862\).
The classification error rate on the test data is 0.2315.
Based on the true distribution, the Bayes (optimal) boundary value between the two classes is -0.7750 and the error rate is 0.1765.
The estimated within-class densities by LDA are shown in the plot below. Both densities are Gaussian and are shifted version of each other, as assumed by LDA.
Logistic Regression Result
Linear logistic regression obtains:
\[\beta = (-0.3288, -1.3275)^T\].
The boundary value satisfies \(-0.3288 - 1.3275X = 0\), hence equals -0.2477.
The error rate on the test data set is 0.2205.
The estimated posterior probability is: \[Pr(G=1|X=x) =\frac{e^{- 0.3288-1.3275x}}{1+e^{- 0.3288-1.3275x}}\]
The estimated posterior probability, \(Pr(G =1 | X = x)\), and its true value based on the true distribution are compared in the graph below. We can see that although the Bayes classifier (theoretically optimal) is indeed a linear classifier (in 1-D, this means thresholding by a single value), the posterior probability of the class being 1 bears a form more complicated than the one implied by the logistic regression model. Under the logistic regression model, the posterior probability is a monotonic function of a specific shape, while the true posterior probability is not a monotonic function of x. The assumption made by the logistic regression model is more restrictive than a general linear boundary classifier.
10.3 Nearest-Neighbor Methods
The Nearest Neighbor Classifier
Training data: \((g_i, x_i)\), \(i=1,2,\ldots,N\).
- Define distance on input x, e.g. Euclidian distance.
- Classify new instance by looking at label of closest sample in the training set: \(\hat{G}(x^*) = argmin_i d(x_i, x^*)\).
Problem: Over-fitting
It is easy to overfit data. For example, assume we know that the data generating process has a linear boundary, but there is some random noise to our measurements.
Solution: Smoothing
To prevent overfitting, we can smooth the decision boundary by K nearest neighbors instead of 1.
Find the K training samples \(x_r\), \(r = 1, \ldots , K\) closest in distance to \(x^*\), and then classify using majority vote among the k neighbors.
The amount of computation can be intense when the training data is large since the distance between a new data point and every training point has to be computed and sorted.
Feature standardization is often performed in pre-processing. Because standardization affects the distance, if one wants the features to play a similar role in determining the distance, standardization is recommended. However, whether to apply normalization is rather subjective. One has to decide on an individual basis for the problem in consideration.
The only parameter that can adjust the complexity of KNN is the number of neighbors k. The larger k is, the smoother the classification boundary. Or we can think of the complexity of KNN as lower when k increases.
The classification boundaries generated by a given training dataset and 15 Nearest Neighbors are shown below. As a comparison, we also show the classification boundaries generated for the same training data but with 1 Nearest Neighbor. We can see that the classification boundaries induced by 1 NN are much more complicated than 15 NN.
15 Nearest Neighbors (below)
Figure 13.3 k-nearest-neighbor classifiers applied to the simulation data of figure 13.1. The broken purple curve in the background is the Bayes decision boundary.
1 Nearest Neighbor (below)
For another simulated data set, there are two classes. The error rates based on the training data, the test data, and 10 fold cross validation are plotted against K, the number of neighbors. We can see that the training error rate tends to grow when k grows, which is not the case for the error rate based on a separate test dataset or cross-validation. The test error rate or cross-validation results indicate there is a balance between k and the error rate. When k first increases, the error rate decreases, and it increases again when k becomes too big. Hence, there is a preference for k in a certain range.
Figure 13.4 k-nearest-neighbors on the two-class mixture data. The upper panel shows the misclassification errors as a function of neighborhood size. Standard error bars are included for 10-fold cross-validation. The lower panel shows the decision boundary for 7-nearest neighbors, which appears to be optimal for minimizing test error. The broken purple curve in the background is the Bayes decision boundary.
7 Nearest Neighbors (below)