## Introduction

There are two major branches of methods for classification: generative modeling and discriminative modeling. Logistic regression is a discriminative modeling approach: we estimate the posterior probabilities of the classes given *X* directly, without assuming anything about the marginal distribution of *X*.

It yields linear classification boundaries.

A review of the Bayes rule shows that when we use 0-1 loss, we pick the class *k* that has the maximum posterior probability:

\(\hat{G}(x)=\arg \max_{k} Pr(G=k|X=x) \)
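As a concrete sketch of this rule in Python (the posterior values here are hypothetical, just to illustrate the arg max):

```python
import numpy as np

# Hypothetical posterior probabilities Pr(G = k | X = x) for K = 3 classes
posterior = np.array([0.2, 0.5, 0.3])

# Under 0-1 loss, the Bayes rule picks the class with the largest posterior
k_hat = np.argmax(posterior) + 1  # +1 to turn the 0-based index into a class label
print(k_hat)  # -> 2
```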

The decision boundary between classes *k* and *l* is determined by the equation:

\(Pr(G=k|X=x) =Pr(G=l|X=x) \)

that is, the *x*'s at which the two posterior probabilities of *k* and *l* are equal.

If we divide both sides by \(Pr(G = l | X = x)\) and take the log of this ratio, the above equation is equivalent to:

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=0 \)

Since we want to enforce a linear classification boundary, we assume this log ratio is a linear function of *x*:

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=a_{0}^{(k,l)}+\sum_{j=1}^{p}a_{j}^{(k,l)}x_j \)

This is the basic assumption of logistic regression (simple indeed).

We use the superscript (*k*, *l*) on the coefficients of the linear function because, for every pair of *k* and *l*, the decision boundary would be different, determined by the different coefficients.

For logistic regression, there are restrictive relations between the \(a^{(k,l)}\) for different pairs of (*k*, *l*). We don't really need to specify this equation for every pair of *k* and *l*. Instead, we only need to specify it for *K* - 1 such pairs.

### Assumptions

Let's look at our assumptions. If we take class *K* as the base class, the assumed equations are:

\(\text{log } \frac{Pr(G=1|X=x)}{Pr(G=K|X=x)}=\beta_{10} + \beta_{1}^{T}x \)

\(\text{log } \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)}=\beta_{20} + \beta_{2}^{T}x \)

\(\vdots\)

\(\text{log } \frac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)}=\beta_{(K-1)0} + \beta_{K-1}^{T}x \)

This indicates that we don't have to specify the decision boundary for every pair of classes. We only need to specify the decision boundary between class *j* and the base class *K*. (You can choose any class as the base class - it doesn't really matter.)

Once we have specified the parameters for these *K* - 1 log ratios, then for any pair of classes (*k*, *l*), we can derive the log ratio by subtracting the two corresponding base-class equations, without introducing new parameters:

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=\beta_{k0} - \beta_{l0}+(\beta_k - \beta_l)^{T}x \)

Number of parameters: \((K - 1)(p + 1)\), since each of the *K* - 1 equations has one intercept and *p* slope coefficients.
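A minimal Python sketch of this bookkeeping, with hypothetical coefficients; giving the base class *K* zero coefficients lets one formula cover every pair:

```python
import numpy as np

K, p = 4, 3                             # hypothetical: 4 classes, 3 predictors
rng = np.random.default_rng(0)
beta0 = rng.normal(size=K - 1)          # intercepts beta_{k0}, k = 1, ..., K-1
beta = rng.normal(size=(K - 1, p))      # slope vectors beta_k

# The parameter count matches (K - 1)(p + 1)
assert beta0.size + beta.size == (K - 1) * (p + 1)

def log_odds(k, l, x):
    """log Pr(G=k|X=x) / Pr(G=l|X=x); the base class K gets zero coefficients."""
    b0 = np.append(beta0, 0.0)          # beta_{K0} = 0
    b = np.vstack([beta, np.zeros(p)])  # beta_K = 0
    return (b0[k - 1] - b0[l - 1]) + (b[k - 1] - b[l - 1]) @ x

x = rng.normal(size=p)
print(log_odds(2, 3, x))                # derived from the K - 1 equations alone
```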

For convenience, we will denote the entire parameter set by θ and arrange them in this way:

\(\theta =\{ \beta_{10},\beta_{1}^T,\beta_{20},\beta_{2}^T, \ldots ,\beta_{(K-1)0},\beta_{K-1}^T\} \)

The log ratios of posterior probabilities are called *log-odds* or *logit transformations*.

Under these assumptions, the posterior probabilities are given by the following two equations:

\( Pr(G=k|X=x)=\frac{\text{exp }(\beta_{k0} + \beta_{k}^{T}x)}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \text{ for } k = 1, ... , K - 1 \)

\( Pr(G=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \)
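A minimal numerical sketch of these two formulas, assuming hypothetical parameter values for *K* = 3 classes and *p* = 2 predictors:

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Posterior probabilities under the logistic model.

    beta0 has shape (K-1,) and beta has shape (K-1, p); class K is the base class.
    """
    scores = beta0 + beta @ x                      # beta_{k0} + beta_k^T x
    denom = 1.0 + np.exp(scores).sum()
    return np.append(np.exp(scores), 1.0) / denom  # length K

beta0 = np.array([0.5, -1.0])   # hypothetical intercepts
beta = np.array([[1.0, -0.5],
                 [0.2,  0.3]])  # hypothetical slopes
x = np.array([0.4, 1.2])
probs = posteriors(x, beta0, beta)
print(probs, probs.sum())       # each probability in (0, 1); the sum is 1
```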

For the \(Pr(G = k | X = x)\) given above, two properties are immediate:

- They sum to 1: \(\sum_{k=1}^{K} Pr(G=k|X=x)=1 \)
- A simple calculation shows that the assumed log-odds are satisfied (see the derivation below).
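For example, writing \(D = 1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)\) for the common denominator, the log ratio against the base class is

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)} = \text{log } \frac{\text{exp }(\beta_{k0} + \beta_{k}^{T}x)/D}{1/D} = \beta_{k0} + \beta_{k}^{T}x \)

which is exactly the assumed linear form.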

### Comparison with Linear Regression on Indicators

Similarities:

- Both attempt to estimate \(Pr(G = k | X = x)\).
- Both have linear classification boundaries.
- Posterior probabilities sum to 1 across classes.

Differences:

- Linear regression on the indicator matrix: approximates \(Pr(G = k | X = x)\) by a linear function of *x*; the result is not guaranteed to fall between 0 and 1.
- Logistic regression: \(Pr(G = k | X = x)\) is a *nonlinear* function of *x*; it is guaranteed to range from 0 to 1.
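The boundedness difference is easy to see numerically. Below is a minimal sketch, assuming a hypothetical one-dimensional two-class data set: a linear fit to a 0/1 indicator can leave [0, 1] far from the data, while the logistic transform of the same linear score cannot.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D data: class 1 centered at -2, class 2 centered at +2
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])  # indicator of class 2

X = np.column_stack([np.ones_like(x), x])        # add an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear regression on the indicator

score = beta[0] + beta[1] * 6.0                  # linear "probability" far from the data
print(score)                                     # can exceed 1
print(1.0 / (1.0 + np.exp(-score)))              # logistic transform stays in (0, 1)
```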