9.1 - Logistic Regression

Introduction

There are two big branches of methods for classification: generative modeling and discriminative modeling. Logistic regression is a discriminative modeling approach: we estimate the posterior probabilities of the classes given X directly, without making assumptions about the marginal distribution of X.

It preserves linear classification boundaries.

A review of the Bayes rule shows that when we use 0-1 loss, we pick the class k that has the maximum posterior probability:

\(\hat{G}(x)= \text{arg } \underset{k}{\text{max}} \, Pr(G=k|X=x)  \)
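
As a minimal sketch of this rule (the posterior values below are made-up numbers for a single point x, not estimates from any data), the classifier simply returns the class with the largest posterior:

```python
import numpy as np

# Hypothetical posterior probabilities Pr(G = k | X = x) for K = 3 classes
# at one point x; the numbers are made up purely for illustration.
posteriors = np.array([0.2, 0.5, 0.3])

# Bayes rule under 0-1 loss: assign x to the class with the largest posterior.
G_hat = np.argmax(posteriors) + 1   # +1 so the classes are labeled 1, ..., K
print(G_hat)                        # -> 2
```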

The decision boundary between classes k and l is determined by the equation:

\(Pr(G=k|X=x) =Pr(G=l|X=x) \)

that is, the x's at which the two posterior probabilities of k and l are equal.

If we divide both sides by \(Pr(G = l | X = x)\) and take the log of this ratio, the above equation is equivalent to:

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=0 \)

Since we want to enforce a linear classification boundary, we assume the log ratio above is a linear function of x:

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=a_{0}^{(k,l)}+\sum_{j=1}^{p}a_{j}^{(k,l)}x_j \)

This is the basic assumption of logistic regression (simple indeed).

We use the superscript (k, l) on the coefficients of the linear function because, for every pair of k and l, the decision boundary would be different, determined by the different coefficients.

For logistic regression, there are restrictions relating the coefficients \(a ^ { ( k , l ) }\) across different pairs (k, l). We don't really need to specify this equation for every pair of k and l; instead, we only need to specify it for K - 1 such pairs.

Assumptions

Let's look at our assumptions. If we take class K as the base class, the assumed equations are:

\(\text{log } \frac{Pr(G=1|X=x)}{Pr(G=K|X=x)}=\beta_{10} + \beta_{1}^{T}x  \)
\(\text{log } \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)}=\beta_{20} + \beta_{2}^{T}x  \)
\(\vdots\)
\(\text{log } \frac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)}=\beta_{(K-1)0} + \beta_{K-1}^{T}x  \)

This indicates that we don't have to specify the decision boundary for every pair of classes.  We only need to specify the decision boundary between class j and the base class K. (You can choose any class as the base class - it doesn't really matter.)

Once we have specified the parameters for these K - 1 log ratios, then for any pair of classes (k, l), we can derive the log ratios without introducing new parameters:

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=\beta_{k0} - \beta_{l0}+(\beta_k  - \beta_l)^{T}x  \)
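
This follows from the assumed equations because the log ratio for any pair (k, l) is the difference of the two base-class log-odds:

\(\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=\text{log } \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)}-\text{log } \frac{Pr(G=l|X=x)}{Pr(G=K|X=x)}=(\beta_{k0} + \beta_{k}^{T}x)-(\beta_{l0} + \beta_{l}^{T}x) \)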

Number of parameters: \((K - 1)(p + 1)\), since there are K - 1 equations, each with an intercept and p slope coefficients.

For convenience, we will denote the entire parameter set by θ and arrange them in this way:

\(\theta =\{ \beta_{10},\beta_{1}^T,\beta_{20},\beta_{2}^T, ... ,\beta_{(K-1)0},\beta_{K-1}^T\} \)

The log ratios of posterior probabilities are called log-odds or logit transformations.

Under these assumptions, the posterior probabilities are given by the following two equations:

\( Pr(G=k|X=x)=\frac{\text{exp }(\beta_{k0} + \beta_{k}^{T}x)}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \text{ for } k = 1, ... , K - 1 \)

\( Pr(G=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)}  \)

For \(Pr(G = k | X = x)\) given above, it is straightforward to verify that:

  • The posteriors sum to 1: \(\sum_{k=1}^{K} Pr(G=k|X=x)=1 \)
  • A simple calculation (taking the log ratio of each \(Pr(G=k|X=x)\) to \(Pr(G=K|X=x)\)) shows that the assumed log-odds equations are satisfied.
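
As a small numerical sketch (the coefficients and the observation below are arbitrary values chosen only for illustration, not fitted to any data), the posteriors for K = 3 classes and p = 2 predictors can be computed directly from the K - 1 sets of coefficients, and both points above can be checked:

```python
import numpy as np

# Arbitrary illustrative coefficients for K = 3 classes and p = 2 predictors:
# beta[k] = (beta_k0, beta_k1, beta_k2) for k = 1, 2; class K = 3 is the base class.
beta = np.array([[ 0.5, 1.0, -2.0],
                 [-1.0, 0.3,  0.8]])

x = np.array([1.2, -0.7])            # a single observation

# Linear predictors beta_k0 + beta_k^T x for k = 1, ..., K-1
eta = beta[:, 0] + beta[:, 1:] @ x

denom = 1.0 + np.exp(eta).sum()
post = np.append(np.exp(eta) / denom,   # Pr(G = k | X = x), k = 1, ..., K-1
                 1.0 / denom)           # Pr(G = K | X = x)

print(post, post.sum())                   # the posteriors sum to 1
print(np.log(post[0] / post[2]), eta[0])  # log-odds against the base class match eta
```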

Comparison with Linear Regression on Indicators

Similarities:

  • Both attempt to estimate \(Pr(G = k | X = x)\).
  • Both have linear classification boundaries.
  • Posterior probabilities sum to 1 across classes.

Differences:

  • Linear regression on the indicator matrix: it approximates \(Pr(G = k | X = x)\) by a linear function of x, and the estimate is not guaranteed to fall between 0 and 1.
  • Logistic regression: \(Pr(G = k | X = x)\) is a nonlinear function of x, and it is guaranteed to range between 0 and 1.
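
To see the boundedness difference concretely, here is a rough sketch on made-up one-dimensional, two-class data (the data, coefficients, and use of NumPy are illustrative assumptions, not part of the lesson): a least-squares fit of the class indicator drifts outside [0, 1] at extreme values of x, while the logistic form stays strictly between 0 and 1 no matter what coefficients are used.

```python
import numpy as np

# Made-up two-class data in one dimension, purely for illustration.
x = np.linspace(-3, 3, 61)
y = (x > 0).astype(float)                 # indicator of membership in class 2

# Linear regression on the indicator: least-squares fit of y on (1, x).
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
linear_fit = X @ b
print(linear_fit.min(), linear_fit.max())   # dips below 0 and exceeds 1 at the extremes

# Logistic form with arbitrary coefficients: always strictly between 0 and 1.
logistic_fit = 1.0 / (1.0 + np.exp(-(0.0 + 2.0 * x)))
print(logistic_fit.min(), logistic_fit.max())
```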