9.1 - Logistic Regression


Introduction

There are two main branches of classification methods: generative modeling and discriminative modeling. Logistic regression is a discriminative modeling approach: we estimate the posterior probabilities of the classes given X directly, without making any assumption about the marginal distribution of X.

It yields linear classification boundaries.

A review of the Bayes rule shows that when we use 0-1 loss, we pick the class k that has the maximum posterior probability:

\[\hat{G}(x)= \underset{k}{\text{arg max}}\; Pr(G=k|X=x)  \]
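
As a quick illustration, here is a minimal sketch in Python (with made-up posterior values) of this rule: under 0-1 loss, predict the class with the largest posterior probability.

```python
import numpy as np

# Hypothetical posterior probabilities Pr(G = k | X = x) for K = 3 classes
# at a single point x; in practice these come from a fitted model.
posteriors = np.array([0.2, 0.5, 0.3])

# Bayes rule under 0-1 loss: pick the class with the largest posterior.
G_hat = np.argmax(posteriors) + 1   # +1 so classes are labeled 1, ..., K
print(G_hat)                        # -> 2
```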

The decision boundary between classes k and l is determined by the equation:

\[Pr(G=k|X=x) =Pr(G=l|X=x) \]

that is, the x's at which the two posterior probabilities of k and l are equal.

If we divide both sides by \(Pr(G = l | X = x)\) and take the log of this ratio, the above equation is equivalent to:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=0 \]

Since we want to enforce a linear classification boundary, we assume the log-ratio above is a linear function of x:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=a_{0}^{(k,l)}+\sum_{j=1}^{p}a_{j}^{(k,l)}x_j \]

This is the basic assumption of logistic regression (simple indeed).

We use the superscript (k, l) on the coefficients of the linear function because for every pair of k and l, the decision boundary would be different, determined by the different coefficients.

For logistic regression, there are restrictive relations between the \(a^{(k,l)}\) for different pairs (k, l). We don't really need to specify this equation for every pair of k and l.  Instead, we only need to specify it for K - 1 such pairs.

Assumptions

Let's look at our assumptions. If we take class K as the base class, the assumed equations are:

\(\text{log } \frac{Pr(G=1|X=x)}{Pr(G=K|X=x)}=\beta_{10} + \beta_{1}^{T}x  \)
\(\text{log } \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)}=\beta_{20} + \beta_{2}^{T}x  \)
\(\vdots\)
\(\text{log } \frac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)}=\beta_{(K-1)0} + \beta_{K-1}^{T}x  \)

This indicates that we don't have to specify the decision boundary for every pair of classes.  We only need to specify the decision boundary between each class and the base class K. (You can choose any class as the base class - it doesn't really matter.)

Once we have specified the parameters for these K - 1 log ratios, then for any pair of classes (k, l), we can derive the log ratio without introducing new parameters, simply by subtracting the two base-class log ratios:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=\text{log } \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)} - \text{log } \frac{Pr(G=l|X=x)}{Pr(G=K|X=x)} = (\beta_{k0} - \beta_{l0})+(\beta_k  - \beta_l)^{T}x  \]

Number of parameters: \((K - 1)(p + 1)\).

For convenience, we will denote the entire parameter set by θ and arrange them in this way:

\[\theta =\{ \beta_{10},\beta_{1}^T,\beta_{20},\beta_{2}^T, \ldots ,\beta_{(K-1)0},\beta_{K-1}^T\} \]
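
In code, one convenient (purely illustrative) arrangement of θ is a \((K - 1) \times (p + 1)\) array with one row per non-base class, holding the intercept followed by the coefficient vector; for example, with K = 3 and p = 2 there are \((K - 1)(p + 1) = 6\) parameters. The values below are arbitrary, not fitted.

```python
import numpy as np

K, p = 3, 2   # hypothetical: 3 classes, 2 predictors

# Row k holds (beta_{k0}, beta_k^T): the log-odds parameters of class k
# versus the base class K. The numbers are arbitrary placeholders.
theta = np.array([[ 0.5, 1.0, -2.0],   # beta_10, beta_1^T
                  [-1.0, 0.3,  0.7]])  # beta_20, beta_2^T
assert theta.shape == (K - 1, p + 1)   # (K - 1)(p + 1) = 6 parameters
```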

The log ratios of posterior probabilities are called log-odds or logit transformations.

Under these assumptions, the posterior probabilities are given by the following two equations:

\[ Pr(G=k|X=x)=\frac{\text{exp }(\beta_{k0} + \beta_{k}^{T}x)}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \text{ for } k = 1, ... , K - 1 \]

\[ Pr(G=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)}  \]
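
These two formulas translate directly into code. Below is a minimal sketch, assuming the parameters are stored as the \((K - 1) \times (p + 1)\) array `theta` from the earlier snippet (intercepts in the first column):

```python
import numpy as np

def posterior_probs(theta, x):
    """Posterior Pr(G = k | X = x) for k = 1, ..., K from the K - 1 log-odds.

    theta : array of shape (K - 1, p + 1), row k = (beta_{k0}, beta_k^T)
    x     : array of shape (p,)
    """
    # Linear predictors beta_{k0} + beta_k^T x for the K - 1 non-base classes.
    eta = theta[:, 0] + theta[:, 1:] @ x
    denom = 1.0 + np.exp(eta).sum()
    return np.append(np.exp(eta) / denom,  # classes 1, ..., K - 1
                     1.0 / denom)          # base class K

# Hypothetical parameters (same layout as the earlier snippet) and a test point.
theta = np.array([[ 0.5, 1.0, -2.0],
                  [-1.0, 0.3,  0.7]])
x = np.array([1.0, 0.5])
p_hat = posterior_probs(theta, x)
print(p_hat, p_hat.sum())   # the K probabilities sum to 1
```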

For the posterior probabilities \(Pr(G = k | X = x)\) given above, two properties are immediate:

  • They must sum up to 1: \(\sum_{k=1}^{K} Pr(G=k|X=x)=1 \)
  • A simple calculation shows that the assumed linear log-odds are satisfied (see the short derivation after this list).
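
To see the second point, divide the expression for \(Pr(G = k | X = x)\) by the one for \(Pr(G = K | X = x)\): the common denominator cancels, and taking the log recovers the assumed linear log-odds:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)} = \text{log exp }(\beta_{k0} + \beta_{k}^{T}x) = \beta_{k0} + \beta_{k}^{T}x \]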

Comparison with Linear Regression on Indicators

Similarities:

  • Both attempt to estimate \(Pr(G = k | X = x)\).
  • Both have linear classification boundaries.
  • Posterior probabilities sum to 1 across classes.

Differences:

  • Linear regression on the indicator matrix approximates \(Pr(G = k | X = x)\) by a linear function of x, and the result is not guaranteed to fall between 0 and 1.
  • In logistic regression, \(Pr(G = k | X = x)\) is a nonlinear function of x and is guaranteed to lie between 0 and 1 (the sketch after this list illustrates the contrast).
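
Below is a minimal sketch of this contrast, using hypothetical one-dimensional, two-class data; the logistic coefficients are set by hand rather than fitted. Fitted values from linear regression on an indicator can fall outside [0, 1], while a sigmoid of a linear function cannot.

```python
import numpy as np

# Hypothetical 1-D, two-class example: y is the indicator of class 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Linear regression on the indicator: least-squares fit of y on (1, x).
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([-2.0, 2.5, 8.0])
linear_fit = coef[0] + coef[1] * x_new
print(linear_fit)          # values below 0 and above 1 are possible

# Logistic model: the sigmoid of a linear function always lies in (0, 1).
beta0, beta1 = -5.0, 2.0   # hypothetical coefficients, not fitted here
logistic_fit = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x_new)))
print(logistic_fit)        # always between 0 and 1
```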