9.1 - Logistic Regression


Introduction

There are two main branches of classification methods: generative modeling and discriminative modeling. Logistic regression is a discriminative modeling approach: we estimate the posterior probabilities of the classes given X directly, without making any assumption about the marginal distribution of X.

It yields linear classification boundaries.

A review of the Bayes rule shows that when we use 0-1 loss, we pick the class k that has the maximum posterior probability:

\[\hat{G}(x)= \underset{k}{\text{arg max}}\; Pr(G=k|X=x)  \]
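
As a quick illustration, here is a minimal sketch in Python (with made-up posterior values) of this rule: under 0-1 loss, predict the class with the largest posterior probability.

```python
import numpy as np

# Hypothetical posterior probabilities Pr(G = k | X = x) for K = 3 classes
# at a single point x; in practice these come from a fitted model.
posteriors = np.array([0.2, 0.5, 0.3])

# Bayes rule under 0-1 loss: pick the class with the largest posterior.
G_hat = np.argmax(posteriors) + 1   # +1 so classes are labeled 1, ..., K
print(G_hat)                        # -> 2
```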

The decision boundary between classes k and l is determined by the equation:

\[Pr(G=k|X=x) =Pr(G=l|X=x) \]

that is, the x's at which the two posterior probabilities of k and l are equal.

If we divide both sides by \(Pr(G = l | X = x)\) and take the log of this ratio, the above equation is equivalent to:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=0 \]

Since we want to enforce a linear classification boundary, we assume the log-ratio above is a linear function of x:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=a_{0}^{(k,l)}+\sum_{j=1}^{p}a_{j}^{(k,l)}x_j \]

This is the basic assumption of logistic regression (simple indeed).

We use the superscript (k, l) on the coefficients of the linear function because for every pair of k and l, the decision boundary would be different, determined by the different coefficients.

For logistic regression, there are restrictive relations between the \(a^{(k,l)}\) for different pairs (k, l). We don't really need to specify this equation for every pair of k and l.  Instead, we only need to specify it for K - 1 such pairs.

Assumptions

Let's look at our assumptions. If we take class K as the base class, the assumed equations are:

\(\text{log } \frac{Pr(G=1|X=x)}{Pr(G=K|X=x)}=\beta_{10} + \beta_{1}^{T}x  \)
\(\text{log } \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)}=\beta_{20} + \beta_{2}^{T}x  \)
\(\vdots\)
\(\text{log } \frac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)}=\beta_{(K-1)0} + \beta_{K-1}^{T}x  \)

This indicates that we don't have to specify the decision boundary for every pair of classes.  We only need to specify the decision boundary between each class and the base class K. (You can choose any class as the base class - it doesn't really matter.)

Once we have specified the parameters for these K - 1 log ratios, then for any pair of classes (k, l), we can derive the log ratio without introducing new parameters, simply by subtracting the two base-class log ratios:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=\text{log } \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)} - \text{log } \frac{Pr(G=l|X=x)}{Pr(G=K|X=x)} = (\beta_{k0} - \beta_{l0})+(\beta_k  - \beta_l)^{T}x  \]

Number of parameters: \((K - 1)(p + 1)\).

For convenience, we will denote the entire parameter set by θ and arrange them in this way:

\[\theta =\{ \beta_{10},\beta_{1}^T,\beta_{20},\beta_{2}^T, \ldots ,\beta_{(K-1)0},\beta_{K-1}^T\} \]
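
In code, one convenient (purely illustrative) arrangement of θ is a \((K - 1) \times (p + 1)\) array with one row per non-base class, holding the intercept followed by the coefficient vector; for example, with K = 3 and p = 2 there are \((K - 1)(p + 1) = 6\) parameters. The values below are arbitrary, not fitted.

```python
import numpy as np

K, p = 3, 2   # hypothetical: 3 classes, 2 predictors

# Row k holds (beta_{k0}, beta_k^T): the log-odds parameters of class k
# versus the base class K. The numbers are arbitrary placeholders.
theta = np.array([[ 0.5, 1.0, -2.0],   # beta_10, beta_1^T
                  [-1.0, 0.3,  0.7]])  # beta_20, beta_2^T
assert theta.shape == (K - 1, p + 1)   # (K - 1)(p + 1) = 6 parameters
```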

The log ratios of posterior probabilities are called log-odds or logit transformations.

Under these assumptions, the posterior probabilities are given by the following two equations:

\[ Pr(G=k|X=x)=\frac{\text{exp }(\beta_{k0} + \beta_{k}^{T}x)}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \text{ for } k = 1, ... , K - 1 \]

\[ Pr(G=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)}  \]
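
These two formulas translate directly into code. Below is a minimal sketch, assuming the parameters are stored as the \((K - 1) \times (p + 1)\) array `theta` from the earlier snippet (intercepts in the first column):

```python
import numpy as np

def posterior_probs(theta, x):
    """Posterior Pr(G = k | X = x) for k = 1, ..., K from the K - 1 log-odds.

    theta : array of shape (K - 1, p + 1), row k = (beta_{k0}, beta_k^T)
    x     : array of shape (p,)
    """
    # Linear predictors beta_{k0} + beta_k^T x for the K - 1 non-base classes.
    eta = theta[:, 0] + theta[:, 1:] @ x
    denom = 1.0 + np.exp(eta).sum()
    return np.append(np.exp(eta) / denom,  # classes 1, ..., K - 1
                     1.0 / denom)          # base class K

# Hypothetical parameters (same layout as the earlier snippet) and a test point.
theta = np.array([[ 0.5, 1.0, -2.0],
                  [-1.0, 0.3,  0.7]])
x = np.array([1.0, 0.5])
p_hat = posterior_probs(theta, x)
print(p_hat, p_hat.sum())   # the K probabilities sum to 1
```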

For the posterior probabilities \(Pr(G = k | X = x)\) given above, two properties are immediate:

  • They must sum up to 1: \(\sum_{k=1}^{K} Pr(G=k|X=x)=1 \)
  • A simple calculation shows that the assumed linear log-odds are satisfied (see the short derivation after this list).
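
To see the second point, divide the expression for \(Pr(G = k | X = x)\) by the one for \(Pr(G = K | X = x)\): the common denominator cancels, and taking the log recovers the assumed linear log-odds:

\[\text{log } \frac{Pr(G=k|X=x)}{Pr(G=K|X=x)} = \text{log exp }(\beta_{k0} + \beta_{k}^{T}x) = \beta_{k0} + \beta_{k}^{T}x \]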

Comparison with Linear Regression on Indicators

Similarities:

  • Both attempt to estimate \(Pr(G = k | X = x)\).
  • Both have linear classification boundaries.
  • Posterior probabilities sum to 1 across classes.

Differences:

  • Linear regression on the indicator matrix approximates \(Pr(G = k | X = x)\) by a linear function of x, and the result is not guaranteed to fall between 0 and 1.
  • In logistic regression, \(Pr(G = k | X = x)\) is a nonlinear function of x and is guaranteed to lie between 0 and 1 (the sketch after this list illustrates the contrast).
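
Below is a minimal sketch of this contrast, using hypothetical one-dimensional, two-class data; the logistic coefficients are set by hand rather than fitted. Fitted values from linear regression on an indicator can fall outside [0, 1], while a sigmoid of a linear function cannot.

```python
import numpy as np

# Hypothetical 1-D, two-class example: y is the indicator of class 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Linear regression on the indicator: least-squares fit of y on (1, x).
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([-2.0, 2.5, 8.0])
linear_fit = coef[0] + coef[1] * x_new
print(linear_fit)          # values below 0 and above 1 are possible

# Logistic model: the sigmoid of a linear function always lies in (0, 1).
beta0, beta1 = -5.0, 2.0   # hypothetical coefficients, not fitted here
logistic_fit = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x_new)))
print(logistic_fit)        # always between 0 and 1
```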