Introduction
There are two major branches of classification methods: generative modeling and discriminative modeling. Logistic regression is a discriminative modeling approach: we estimate the posterior probabilities of the classes given X directly, without assuming anything about the marginal distribution of X.
As we will see, it yields linear classification boundaries.
A review of the Bayes rule shows that when we use 0-1 loss, we pick the class k that has the maximum posterior probability:
\(\hat{G}(x)= \arg \underset{k}{\max} \, Pr(G=k|X=x) \)
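As a minimal sketch of this rule, suppose we already know the posterior probabilities at some fixed point x (the numbers below are made up for illustration); under 0-1 loss we simply pick the class with the largest posterior:

```python
# Hypothetical posterior probabilities Pr(G = k | X = x) for k = 1, 2, 3
posterior = {1: 0.2, 2: 0.5, 3: 0.3}

# Bayes rule under 0-1 loss: predict the class with maximum posterior.
G_hat = max(posterior, key=posterior.get)
print(G_hat)  # → 2
```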
The decision boundary between classes k and l is determined by the equation:
\(Pr(G=k|X=x) =Pr(G=l|X=x) \)
that is, the x's at which the two posterior probabilities of k and l are equal.
If we divide both sides by \(Pr(G = l | X = x)\) and take the log of this ratio, the above equation is equivalent to:
\(\log \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=0 \)
Since we want to enforce a linear classification boundary, we assume this log ratio is a linear function of x:
\(\log \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=a_{0}^{(k,l)}+\sum_{j=1}^{p}a_{j}^{(k,l)}x_j \)
This is the basic assumption of logistic regression (simple indeed).
We use the superscript (k, l) on the coefficients of the linear function because, for every pair of k and l, the decision boundary would be different, determined by the different coefficients.
For logistic regression, there are restrictive relations between \(a ^ { ( k , l ) }\) for different pairs of (k, l). We don't really need to specify this equation for every pair of k and l. Instead, we only need to specify it for K - 1 such pairs.
Assumptions
Let's look at our assumptions. If we take class K as the base class, the assumed equations are:
\(\log \frac{Pr(G=1|X=x)}{Pr(G=K|X=x)}=\beta_{10} + \beta_{1}^{T}x \)
\(\log \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)}=\beta_{20} + \beta_{2}^{T}x \)
\(\vdots\)
\(\log \frac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)}=\beta_{(K-1)0} + \beta_{K-1}^{T}x \)
This indicates that we don't have to specify the decision boundary for every pair of classes. We only need to specify the decision boundary between class j and the base class K. (You can choose any class as the base class - it doesn't really matter.)
Once we have specified the parameters for these K - 1 log ratios, then for any pair of classes (k, l), we can derive the log ratios without introducing new parameters:
\(\log \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=(\beta_{k0} - \beta_{l0})+(\beta_k - \beta_l)^{T}x \)
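To see this, subtract the base-class equation for class l from the one for class k; the base-class probability cancels:

\(\log \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)} = \log \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)} - \log \frac{Pr(G=l|X=x)}{Pr(G=K|X=x)} = (\beta_{k0} + \beta_{k}^{T}x) - (\beta_{l0} + \beta_{l}^{T}x) \)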
Number of parameters: \((K - 1)(p + 1)\), since each of the K - 1 equations has one intercept and p slope coefficients.
For convenience, we will denote the entire parameter set by θ and arrange them in this way:
\(\theta =\{ \beta_{10},\beta_{1}^T,\beta_{20},\beta_{2}^T, \dots ,\beta_{(K-1)0},\beta_{K-1}^T\} \)
The log ratios of posterior probabilities are called log-odds or logit transformations.
Under these assumptions, the posterior probabilities are given by the following two equations:
\( Pr(G=k|X=x)=\frac{\exp (\beta_{k0} + \beta_{k}^{T}x)}{1+\sum_{l=1}^{K-1}\exp (\beta_{l0} + \beta_{l}^{T}x)} \text{ for } k = 1, ... , K - 1 \)
\( Pr(G=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\exp (\beta_{l0} + \beta_{l}^{T}x)} \)
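These formulas are easy to check numerically. Below is a minimal sketch with made-up coefficients for K = 3 classes and p = 2 features (class 3 is the base class, whose coefficients are implicitly zero); the posteriors sum to 1 and the log ratios against the base class recover the assumed linear logits:

```python
import numpy as np

# Hypothetical coefficients for K = 3 classes, p = 2 features.
beta0 = np.array([0.5, -1.0])          # intercepts beta_{10}, beta_{20}
beta = np.array([[1.0, -0.5],          # beta_1
                 [0.2,  0.8]])         # beta_2

def posteriors(x):
    """Pr(G = k | X = x) for k = 1, ..., K via the formulas above."""
    logits = beta0 + beta @ x                    # beta_{k0} + beta_k^T x
    expl = np.exp(logits)
    denom = 1.0 + expl.sum()
    return np.append(expl / denom, 1.0 / denom)  # classes 1, ..., K-1, then K

x = np.array([0.3, -0.2])
p = posteriors(x)
assert np.isclose(p.sum(), 1.0)                             # sum to 1
assert np.allclose(np.log(p[:2] / p[2]), beta0 + beta @ x)  # logits recovered
```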
For \(Pr(G = k | X = x)\) as given above, it is straightforward to verify that:
- The posterior probabilities sum to 1: \(\sum_{k=1}^{K} Pr(G=k|X=x)=1 \)
- Taking the log ratio of class k against the base class K recovers \(\beta_{k0} + \beta_{k}^{T}x\), so the assumed linear logits are satisfied.
Comparison with Linear Regression on Indicators
Similarities:
- Both attempt to estimate \(Pr(G = k | X = x)\).
- Both have linear classification boundaries.
- Posterior probabilities sum to 1 across classes.
Difference:
- Linear regression on indicator matrix: approximate \(Pr(G = k | X = x)\) by a linear function of x. \(Pr(G = k | X = x)\) is not guaranteed to fall between 0 and 1.
- Logistic regression: \(Pr(G = k | X = x)\) is a nonlinear function of x. It is guaranteed to range from 0 to 1.
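A minimal sketch of this difference on made-up 1-D data: an ordinary least-squares fit to a 0/1 class indicator can produce fitted "probabilities" outside [0, 1], while passing the same linear function through \(\exp(\cdot)/(1+\exp(\cdot))\), as logistic regression does, keeps every value strictly between 0 and 1 (this illustrates the squashing, not an actual maximum-likelihood logistic fit):

```python
import numpy as np

# Made-up 1-D data: x values and a 0/1 indicator for one class.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

slope, intercept = np.polyfit(x, y, 1)    # least-squares line
fitted = intercept + slope * x
assert fitted.min() < 0                   # a "probability" below 0

# The logistic transform maps the same linear function into (0, 1).
squashed = np.exp(fitted) / (1.0 + np.exp(fitted))
assert np.all((squashed > 0) & (squashed < 1))
```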