# 9.1 - Logistic Regression

## Introduction

There are two big branches of methods for classification. One is called generative modeling, the other is called discriminative modeling. Logistic regression for classification is a discriminative modeling approach, where we estimate the posterior probabilities of classes given X directly without assuming the marginal distribution on X.

It preserves linear classification boundaries.

A review of the Bayes rule shows that when we use 0-1 loss, we pick the class k that has the maximum posterior probability:

$$\hat{G}(x)= \arg \max_{k} Pr(G=k|X=x)$$

The decision boundary between classes k and l is determined by the equation:

$$Pr(G=k|X=x) =Pr(G=l|X=x)$$

that is, the x's at which the two posterior probabilities of k and l are equal.

If we divide both sides by $$Pr(G = l | X = x)$$ and take the log of this ratio, the above equation is equivalent to:

$$\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=0$$

Since we want to enforce a linear classification boundary, we assume the function above is linear (below):

$$\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=a_{0}^{(k,l)}+\sum_{j=1}^{p}a_{j}^{(k,l)}x_j$$

This is the basic assumption of logistic regression (simple indeed).

We use the superscript (k, l) on the coefficients of the linear function because, for every pair of k and l, the decision boundary would be different, determined by the different coefficients.

For logistic regression, there are restrictive relations between $$a^{(k,l)}$$ for different pairs of (k, l). We don't really need to specify this equation for every pair of k and l.  Instead, we only need to specify it for K - 1 such pairs.

### Assumptions

Let's look at our assumptions. If we take class K as the base class, the assumed equations are:

$$\text{log } \frac{Pr(G=1|X=x)}{Pr(G=K|X=x)}=\beta_{10} + \beta_{1}^{T}x$$
$$\text{log } \frac{Pr(G=2|X=x)}{Pr(G=K|X=x)}=\beta_{20} + \beta_{2}^{T}x$$
$$\vdots$$
$$\text{log } \frac{Pr(G=K-1|X=x)}{Pr(G=K|X=x)}=\beta_{(K-1)0} + \beta_{K-1}^{T}x$$

This indicates that we don't have to specify the decision boundary for every pair of classes.  We only need to specify the decision boundary between class j and the base class K. (You can choose any class as the base class - it doesn't really matter.)

Once we have specified the parameters for these K - 1 log ratios, then for any pair of classes (k, l), we can derive the log ratios without introducing new parameters:

$$\text{log } \frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}=(\beta_{k0} - \beta_{l0})+(\beta_k - \beta_l)^{T}x$$

Number of parameters: $$(K - 1)(p + 1)$$.

For convenience, we will denote the entire parameter set by θ and arrange them in this way:

$$\theta =\{ \beta_{10},\beta_{1}^T,\beta_{20},\beta_{2}^T, \ldots ,\beta_{(K-1)0},\beta_{K-1}^T\}$$

The log ratios of posterior probabilities are called log-odds or logit transformations.

Under these assumptions, the posterior probabilities are given by the following two equations:

$$Pr(G=k|X=x)=\frac{\text{exp }(\beta_{k0} + \beta_{k}^{T}x)}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)} \text{ for } k = 1, ... , K - 1$$

$$Pr(G=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\text{exp }(\beta_{l0} + \beta_{l}^{T}x)}$$

For $$Pr(G = k | X = x)$$ given above, obviously

• These must sum up to 1: $$\sum_{k=1}^{K} Pr(G=k|X=x)=1$$
• A simple calculation shows that the assumptions are satisfied.
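The two formulas above can be sanity-checked in code. Below is a minimal Python sketch (the coefficient values are made up for illustration) that computes all K posteriors, taking class K as the base class:

```python
import numpy as np

def posteriors(x, betas):
    """Posterior probabilities under the logistic model.

    betas holds K-1 coefficient vectors of length p+1 (intercept
    first); class K is the base class with implicit coefficients 0.
    """
    xb = np.concatenate(([1.0], x))                 # prepend 1 for the intercept
    scores = np.exp([b @ xb for b in betas])        # exp(beta_k0 + beta_k^T x)
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)   # classes 1..K-1, then K

# Made-up coefficients for K = 3 classes and p = 2 inputs
betas = [np.array([0.5, 1.0, -1.0]), np.array([-0.2, 0.3, 0.8])]
p = posteriors(np.array([0.4, -0.7]), betas)
```

Each returned probability is positive and the K values sum to 1, as required.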

### Comparison with Linear Regression on Indicators

Similarities:

• Both attempt to estimate $$Pr(G = k | X = x)$$.
• Both have linear classification boundaries.
• Posterior probabilities sum to 1 across classes.

Difference:

• Linear regression on indicator matrix: approximate $$Pr(G = k | X = x)$$ by a linear function of x. $$Pr(G = k | X = x)$$ is not guaranteed to fall between 0 and 1.
• Logistic regression: $$Pr(G = k | X = x)$$ is a nonlinear function of x. It is guaranteed to range from 0 to 1.

# 9.1.1 - Fitting Logistic Regression Models

How do we estimate the parameters? How do we fit a logistic regression model?

We need a certain optimization criterion for choosing the parameters.

### Optimization Criterion

What we want to do is to find parameters that maximize the conditional likelihood of class labels G given X using the training data.  We are not interested in the distribution of X, instead, our focus is on the conditional probabilities of the class labels given X.

Given point xi , the posterior probability for the class to be k is denoted by:

$$p_k(x_i ; \theta) = Pr(G = k | X = x_i ; \theta)$$

Given the first input x1, the posterior probability of its class, denoted as g1, is computed by:

$$Pr(G = g_1| X = x_1)$$.

Since samples in the training data set are assumed independent, the posterior probability for the N sample points each having class $$g_i , i =1, 2, \cdots , N$$, given their inputs $$x_1, x_2, \cdots , x_N$$ is:

$$\prod_{i=1}^{N} Pr(G=g_i|X=x_i)$$

In other words, the joint conditional likelihood is the product of the conditional probabilities of the classes given every data point.

The conditional log-likelihood of the class labels in the training data set becomes a summation:

\begin {align} l(\theta) &=\sum_{i=1}^{N}\text{ log }Pr(G=g_i|X=x_i)\\ & = \sum_{i=1}^{N}\text{ log }p_{g_{i}}(x_i; \theta) \\ \end {align}
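Concretely, the conditional log-likelihood is just the sum of the log posterior probabilities of the observed classes. A minimal sketch, with a made-up table of fitted posteriors (the function name is ours):

```python
import numpy as np

def conditional_log_likelihood(probs, labels):
    """Sum over i of log Pr(G = g_i | X = x_i): probs is N x K with
    probs[i, k] = p_{k+1}(x_i; theta); labels holds indices 0..K-1."""
    return np.log(probs[np.arange(len(labels)), labels]).sum()

# A made-up table of posteriors for N = 2 points, K = 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
ll = conditional_log_likelihood(probs, labels)
```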

# 9.1.2 - Binary Classification

Let's look at binary classification first. In fact, when we go to more than two classes the formulas become much more complicated, although they are derived similarly.  We will begin by looking at this simpler case.

For binary classification, if $$g_i$$ = class 1, denote $$y_i$$ = 1; if $$g_i$$ = class 2, denote $$y_i$$ = 0. All we are doing here is changing the labels to 1's and 0's so that the notation will be simpler.

Let $$p_1(x ; \theta) = p(x ; \theta)$$.

Then $$p_2(x ; \theta) = 1 - p_1(x ; \theta) = 1- p(x ; \theta)$$ because the posterior probabilities of the two classes have to sum up to 1.

Since K = 2, we have only one linear equation and one decision boundary between the two classes, so the parameters are $$\theta = \{\beta_{10}, \beta_{1}\}$$ (remember, $$\beta_{1}$$ is a vector). We denote $$\beta = (\beta_{10}, \beta_1)^T$$, which is a column vector.

Now, let's try to simplify the equation a little bit.

If $$y_i = 1$$, i.e.,$$g_i = 1$$, then

\begin {align} \text{log }p_{g_{i}}(x;\beta)& =\text{log }p_1(x;\beta)\\ & = 1 \cdot \text{log }p(x;\beta) \\ & = y_i \text{log }p(x;\beta) \\ \end {align}

If $$y_i = 0$$, i.e., $$g_i = 2$$,

\begin {align} \text{log }p_{g_{i}}(x;\beta)& =\text{log }p_2(x;\beta)\\ & = 1 \cdot \text{log }(1-p(x;\beta)) \\ & = (1-y_i )\text{log }(1-p(x;\beta)) \\ \end {align}

Since either $$y_i = 0$$ or $$1 - y_i = 0$$, we can add the two (at any time, only one of the two is nonzero) and have:

$$\text{log }p_{g_{i}}(x;\beta)=y_i \text{log }p(x;\beta)+(1-y_i )\text{log }(1-p(x;\beta))$$

The reason for doing this is that when we later derive the logistic regression updates, we do not want to work with separate cases. It would be tedious in notation to distinguish the two cases each time. The unified equation is correct for either value of $$y_i$$.

Next, we will plug what we just derived into the log-likelihood, which is simply a summation over all the points:

\begin {align} l(\beta)& =\sum_{i=1}^{N}\text{log }p_{g_{i}}(x_i;\beta)\\ & = \sum_{i=1}^{N}(y_i \text{log }p(x_i;\beta)+(1-y_i )\text{log }(1-p(x_i;\beta)))\\ \end {align}

There are p + 1 parameters in $$\beta = (\beta_{10}, \beta_1)^T$$.

Assume a column vector form for $$\beta$$:

$$\beta=\begin{pmatrix} \beta_{10}\\ \beta_{11}\\ \beta_{12}\\ \vdots\\ \beta_{1,p} \end{pmatrix}$$

Here we add the constant term 1 to x to accommodate the intercept and keep this in matrix form.

$$x=\begin{pmatrix} 1\\ x_1\\ x_2\\ \vdots\\ x_p \end{pmatrix}$$

Under the assumption of the logistic regression model:

\begin{align} p(x;\beta) & =Pr(G=1|X=x)=\frac{ \text{exp }(\beta^Tx)}{1+ \text{exp }(\beta^Tx)}\\ 1-p(x;\beta) & = Pr(G=2|X=x)=\frac{1}{1+ \text{exp }(\beta^Tx)} \\ \end {align}

we substitute the above in $$l(\beta)$$:

$$l(\beta) =\sum_{i=1}^{N}\left(y_i\beta^Tx_i-\text{log }(1+e^{\beta^{T}x_{i}}) \right)$$
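This simplification can be checked numerically: for any data, the unified form $$y_i \text{log }p(x_i;\beta)+(1-y_i )\text{log }(1-p(x_i;\beta))$$ and the expression above agree term by term. A short sketch with randomly generated inputs (data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # intercept column
y = np.array([1, 0, 1, 1, 0])
beta = np.array([0.1, -0.4, 0.3])

p = 1.0 / (1.0 + np.exp(-(X @ beta)))
ll_unified = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
ll_simplified = np.sum(y * (X @ beta) - np.log1p(np.exp(X @ beta)))
# The two expressions agree up to floating-point error.
```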

The idea here is based on basic calculus. If we want to maximize $$l(\beta)$$, we set the first order partial derivatives of the function $$l(\beta)$$ with respect to $$\beta_{1j}$$ to zero.

\begin {align}\frac{\partial l(\beta)}{\partial \beta_{1j}} & =\sum_{i=1}^{N}y_ix_{ij}-\sum_{i=1}^{N}\frac{x_{ij}e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}\\ & = \sum_{i=1}^{N}y_ix_{ij}-\sum_{i=1}^{N}p(x_i;\beta)x_{ij}\\ & = \sum_{i=1}^{N}x_{ij}(y_i-p(x_i;\beta))\\ \end {align}

for all j = 0, 1, ... , p.

And, if we collect these partial derivatives into a single vector, we obtain the matrix form below.

$$\frac{\partial l(\beta)}{\partial\beta}= \sum_{i=1}^{N}x_{i}(y_i-p(x_i;\beta))$$

Here $$x_i$$ is a column vector and $$y_i$$ is a scalar.
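This score vector is straightforward to compute, and it can be verified against a finite-difference approximation of $$l(\beta)$$. A sketch on made-up data (the helper names are ours):

```python
import numpy as np

def log_lik(X, y, beta):
    """l(beta) = sum_i (y_i beta^T x_i - log(1 + exp(beta^T x_i)))."""
    xb = X @ beta
    return np.sum(y * xb - np.log1p(np.exp(xb)))

def score(X, y, beta):
    """First-derivative vector sum_i x_i (y_i - p(x_i; beta))."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

# Randomly generated inputs (first column of X is the constant 1)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((8, 1)), rng.normal(size=(8, 2))])
y = rng.integers(0, 2, size=8).astype(float)
beta = np.array([0.2, -0.1, 0.3])

# Central-difference estimate of the derivative in the first coordinate
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
num = (log_lik(X, y, beta + e0) - log_lik(X, y, beta - e0)) / (2 * eps)
```

The numerical estimate matches the analytic derivative to several decimal places.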

To solve the set of p +1 nonlinear equations $$\frac{\partial l(\beta)}{\partial\beta_{1j}}=0$$, j = 0, 1, ... , p , we will use the Newton-Raphson algorithm.

The Newton-Raphson algorithm requires the second-order derivatives or the so-called Hessian matrix (a square matrix of the second order derivatives) which is given by:

$$\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} =-\sum_{i=1}^{N}x_{i}x_{i}^{T}p(x_i;\beta)(1-p(x_i;\beta))$$

Below we derive the Hessian matrix, where the element on the jth row and nth column is (counting from 0):

\begin {align} &\frac{\partial^2 l(\beta)}{\partial\beta_{1j}\partial\beta_{1n}}\\ & = - \sum_{i=1}^{N}\frac{(1+e^{\beta^{T}x_{i}})e^{\beta^{T}x_{i}}x_{ij}x_{in}-(e^{\beta^{T}x_{i}} )^2x_{ij}x_{in} }{(1+e^{\beta^{T}x_{i}})^2}\\ & = - \sum_{i=1}^{N}\left(x_{ij}x_{in}p(x_i;\beta)-x_{ij}x_{in}p(x_i;\beta)^2\right)\\ & = - \sum_{i=1}^{N}x_{ij}x_{in}p(x_i;\beta)(1-p(x_i;\beta))\\ \end {align}
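In code, the sum $$-\sum_i x_i x_i^T p(x_i;\beta)(1-p(x_i;\beta))$$ can be formed by scaling the rows of $$\mathbf{X}$$, never materializing an N × N weight matrix. A sketch on made-up data; the resulting Hessian is symmetric and negative semidefinite, consistent with a concave log-likelihood:

```python
import numpy as np

def logistic_hessian(X, beta):
    """-sum_i x_i x_i^T p(x_i; beta)(1 - p(x_i; beta)), computed by
    scaling each row of X by its weight p_i (1 - p_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)
    return -(X * w[:, None]).T @ X

rng = np.random.default_rng(0)
X = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
H = logistic_hessian(X, np.array([0.1, -0.2, 0.4]))
```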

Starting with $$\beta^{old}$$, a single Newton-Raphson update is given by this matrix formula:

$$\beta^{new} = \beta^{old}-\left(\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T} \right)^{-1}\frac{\partial l(\beta)}{\partial\beta}$$

Basically, if given an old set of parameters, we update the new set of parameters by taking $$\beta^{old}$$ minus the inverse of the Hessian matrix times the first order derivative vector. These derivatives are all evaluated at $$\beta^{old}$$.

The iteration can be expressed compactly in matrix form.

• Let y be the column vector of $$y_i$$.
• Let $$\mathbf{X}$$ be the N × (p + 1) input matrix.
• Let  $$\mathbf{p}$$ be the N-vector of fitted probabilities with ith element $$p(x_i ; \beta^{old})$$.
• Let  $$\mathbf{W}$$ be an N × N diagonal matrix of weights with ith element $$p(x_i ; \beta^{old})(1 - p(x_i ;\beta^{old}))$$.
• If you have set up all the matrices properly then the first order derivative vector is:

$$\frac{\partial l(\beta)}{\partial\beta}=\mathbf{X}^T(\mathbf{y}-\mathbf{p})$$

and the Hessian matrix is:

$$\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^T}=-\mathbf{X}^T\mathbf{W}\mathbf{X}$$

The Newton-Raphson step is:

\begin {align}\beta^{new} &=\beta^{old}+(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})\\ & = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}(\mathbf{X}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}))\\ & = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z} \\ \end {align}

where $$\mathbf{z} \; \buildrel\triangle\over = \; \mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})$$

If  $$\mathbf{z}$$ is viewed as a response and  $$\mathbf{X}$$ is the input matrix, $$\beta^{new}$$ is the solution to a weighted least square problem:

$$\beta^{new} \leftarrow \text{arg } \underset{\beta}{min} (\mathbf{z}-\mathbf{X}\beta)^T\mathbf{W}(\mathbf{z}-\mathbf{X}\beta)$$

Recall that ordinary least squares linear regression solves

$$\underset{\beta}{min} (\mathbf{z}-\mathbf{X}\beta)^T(\mathbf{z}-\mathbf{X}\beta)$$

so $$\beta^{new}$$ is obtained by the same computation with each squared residual weighted by the corresponding diagonal element of $$\mathbf{W}$$.

$$\mathbf{z}$$ is referred to as the adjusted response.

The algorithm is referred to as iteratively reweighted least squares or IRLS.

The pseudo code for the IRLS algorithm is provided below.

### Pseudo Code

1. $$\beta \leftarrow 0$$
2. Compute y by setting its elements to:

$$y_i=\left\{\begin{matrix} 1 & \text{if } g_i =1 \\ 0 & \text{if } g_i =2 \end{matrix}\right.$$

3. Compute p by setting its elements to:

$$p(x_i;\beta)=\frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}, \quad i = 1, 2, \ldots , N$$

4. Compute the diagonal matrix $$\mathbf{W}$$. The ith diagonal element is $$p(x_i ; \beta)(1 - p(x_i ; \beta))$$, i =1, 2, ... , N.
5. $$\mathbf{z} \leftarrow \mathbf{X}\beta + \mathbf{W}^{ -1}(\mathbf{y} - \mathbf{p})$$.
6. $$\beta \leftarrow (\mathbf{X}^{T} \mathbf{W}\mathbf{X})^{-1} \mathbf{X}^{T} \mathbf{W} \mathbf{z}$$.
7. If the stopping criteria are met, stop; otherwise go back to step 3.
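The seven steps above translate almost line-for-line into code. A minimal sketch (the function name and the toy data set are ours, not from the text; y is passed in already coded as 0/1, which covers step 2):

```python
import numpy as np

def irls(X, y, tol=1e-8, max_iter=50):
    """IRLS for binary logistic regression, following the steps above.

    X is N x (p+1) with a leading column of ones; y holds 0/1 labels.
    """
    beta = np.zeros(X.shape[1])                          # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))            # step 3
        W = np.diag(p * (1.0 - p))                       # step 4
        z = X @ beta + np.linalg.solve(W, y - p)         # step 5: adjusted response
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)  # step 6: WLS
        if np.max(np.abs(beta_new - beta)) < tol:        # step 7: stopping rule
            return beta_new
        beta = beta_new
    return beta

# A small made-up, non-separable data set with one input variable
X = np.hstack([np.ones((6, 1)), np.arange(6.0).reshape(-1, 1)])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = irls(X, y)
```

At convergence, the score vector $$\mathbf{X}^T(\mathbf{y}-\mathbf{p})$$ is essentially zero, which is exactly the first-order condition we set out to solve.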

### Computational Efficiency

Since $$\mathbf{W}$$ is an N × N diagonal matrix, direct matrix operations with it, as shown in the above pseudo code, are inefficient.

A modified pseudo code with much improved computational efficiency is given below.

1. $$\beta \leftarrow 0$$
2. Compute y by setting its elements to:

$$y_i=\left\{\begin{matrix} 1 & \text{if } g_i =1 \\ 0 & \text{if } g_i =2 \end{matrix}\right.$$

3. Compute p by setting its elements to:

$$p(x_i;\beta)=\dfrac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}, \quad i = 1, 2, \ldots , N$$

4. Compute the N × (p +1) matrix $$\tilde{\mathbf{X}}$$ by multiplying the ith row of matrix $$\mathbf{X}$$ by $$p(x_i ; \beta)(1 - p(x_i ; \beta))$$, i =1, 2, ... , N:

$$\mathbf{X}=\begin{pmatrix} x_{1}^{T}\\ x_{2}^{T}\\ ...\\ x_{N}^{T}\end{pmatrix} \tilde{\mathbf{X}}= \begin{pmatrix} p(x_1;\beta)(1-p(x_1;\beta))x_{1}^{T}\\ p(x_2;\beta)(1-p(x_2;\beta))x_{2}^{T}\\ ...\\ p(x_N;\beta)(1-p(x_N;\beta))x_{N}^{T}\end{pmatrix}$$

5. $$\beta \leftarrow \beta+(\mathbf{X}^T\tilde{\mathbf{X}})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})$$.
6. If the stopping criteria are met, stop; otherwise go back to step 3.
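The equivalence of this update with step 6 of the earlier W-based pseudo code can be checked directly; the sketch below runs one update both ways on made-up data, starting from $$\beta = 0$$:

```python
import numpy as np

# One update step both ways, starting from beta = 0 on made-up data
rng = np.random.default_rng(1)
X = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
y = rng.integers(0, 2, size=6).astype(float)
beta = np.zeros(3)

p = 1.0 / (1.0 + np.exp(-(X @ beta)))            # step 3
Xt = X * (p * (1.0 - p))[:, None]                # step 4: row i scaled by its weight
beta_eff = beta + np.linalg.solve(X.T @ Xt, X.T @ (y - p))   # step 5

# The original step with the explicit N x N diagonal W, for comparison
W = np.diag(p * (1.0 - p))
z = X @ beta + np.linalg.solve(W, y - p)
beta_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
```

Both routes produce the same $$\beta^{new}$$, but the efficient version touches only N × (p + 1) arrays.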

# 9.1.3 - Example - Diabetes Dataset

This is the Pima Indians diabetes dataset we used before in Lesson 3.5.  The input X is two dimensional. The two variables $$X_1$$ and $$X_2$$ are the first two principal components of the original 8 variables.

There are two classes: without diabetes $$(Y=0)$$; with diabetes $$(Y=1)$$. To avoid confusion, we'll label the class variable, G, the same way (so $$G=0$$ corresponds to $$Y=0$$ and $$G=1$$ corresponds to $$Y=1$$).

Applying logistic regression, we obtain

$$\beta = (-0.7682, 0.6816, 0.3663)^T$$.

Typically in logistic regression we focus on the probability equation for $$Y=1$$, which in this case can be rewritten as:

$$Pr(G=1|X=x) =\frac{e^{\beta_0+\beta_1X_1+\beta_2X_2}}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2}}=\frac{1}{1+e^{-\beta_0-\beta_1X_1-\beta_2X_2}}.$$

The posterior probabilities based on the estimated parameters are therefore:

\begin {align} Pr(G=1|X=x) & =\frac{e^{-0.7682+0.6816X_1+0.3663X_2}}{1+e^{-0.7682+0.6816X_1+0.3663X_2}} =\frac{1}{1+e^{0.7682-0.6816X_1-0.3663X_2}} \\ Pr(G=0|X=x) & =1-Pr(G=1|X=x) \end {align}

The classification rule based on a 0.5 probability cut-off is:

$$\hat{G}(x) = \left\{\begin{matrix} 0 & \text{if } 0.7682-0.6816X_1-0.3663X_2 \ge 0\\ 1 & \text{if } 0.7682-0.6816X_1-0.3663X_2 < 0 \end{matrix}\right.$$
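Plugging in the fitted coefficients, the rule above can be sketched as a small function (the helper name and the cutoff argument are ours; the coefficient values are those reported above):

```python
import numpy as np

# Fitted coefficients (beta_0, beta_1, beta_2) reported for this data set
beta = np.array([-0.7682, 0.6816, 0.3663])

def classify(x1, x2, cutoff=0.5):
    """Assign class 1 (diabetes) when the posterior probability
    Pr(G = 1 | X = x) exceeds the cutoff, and class 0 otherwise."""
    p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x1 + beta[2] * x2)))
    return 1 if p > cutoff else 0
```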

In the following scatter plot, we show the classification boundary obtained by logistic regression and compare it to that by LDA.  Solid line: decision boundary obtained by logistic regression. Dash line: decision boundary obtained by LDA. Red cross: without diabetes (class 0). Blue circle: with diabetes (class 1).

The performance of the logistic regression classification is as follows.

We obtain the classifier and apply it to the training data set and see what percentage of data is classified incorrectly because we know the true labels in the training data.

By sensitivity, we mean the percentage of true positive cases classified as positive: if a person in fact has diabetes, how often do we classify them as having diabetes?

By specificity, we mean the percentage of true negative cases classified as negative: if a person does not have diabetes, how often do we classify them as not having diabetes?

• Within training data classification error rate: 28.13%.
• Sensitivity: 45.90%.
• Specificity: 85.80%.

# 9.1.4 - Multiclass Case (K ≥ 3)

When $$K \geq 3$$, the parameter $$\beta$$ is a $$(K - 1)(p + 1)$$ - vector:

$$\beta = \begin{pmatrix} \beta_{10}\\ \beta_{1}\\ \beta_{20}\\ \beta_{2}\\ \vdots\\ \beta_{(K-1)0}\\ \beta_{K-1} \end{pmatrix} = \begin{pmatrix} \beta_{10}\\ \beta_{11}\\ \vdots\\ \beta_{1p}\\ \beta_{20}\\ \vdots\\ \beta_{2p}\\ \vdots\\ \beta_{(K-1)0}\\ \vdots\\ \beta_{(K-1)p} \end{pmatrix}$$

Let $$\bar{\beta}_l=\begin{pmatrix} \beta_{l0}\\ \beta_{l}\end{pmatrix}$$

The likelihood function becomes

\begin {align}l(\beta)& = \sum_{i=1}^{N}\text{log }p_{g_{i}}(x_i;\beta)\\ & = \sum_{i=1}^{N}\text{log } \left( \frac{e^{\bar{\beta}_{g_{i}}^{T}x_i}}{1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i}} \right) \\ & = \sum_{i=1}^{N}\left(\bar{\beta}_{g_{i}}^{T}x_i-\text{log }\left( 1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i} \right) \right) \\ \end {align}

Note: the indicator function I(•) equals 1 when the argument is true and 0 otherwise.

First order derivatives of the log likelihood function are:

\begin {align}\frac{\partial l(\beta)}{\partial \beta_{kj}}& = \sum_{i=1}^{N}\left( I(g_i=k)x_{ij}-\frac{e^{\bar{\beta}_{k}^{T}x_i}x_{ij}}{1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i}} \right)\\ & = \sum_{i=1}^{N}x_{ij}(I(g_i=k)-p_k(x_i;\beta)) \\ \end {align}

Second order derivatives:

\begin {align}\frac{\partial^2 l(\beta)}{\partial \beta_{kj}\partial \beta_{mn}}& = \sum_{i=1}^{N}x _{ij}\cdot \frac{1}{(1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i} )^2} \cdot \left(-e^{\bar{\beta}_{k}^{T}x_i} I(k=m)x_{in}\left(1+ \sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i}\right)+ e^{\bar{\beta}_{k}^{T}x_i}e^{\bar{\beta}_{m}^{T}x_i}x_{in} \right)\\ & = \sum_{i=1}^{N}x_{ij}x_{in}(-p_k(x_i;\beta) I(k=m)+p_k(x_i;\beta)p_m(x_i;\beta)) \\ & = - \sum_{i=1}^{N}x_{ij}x_{in}p_k(x_i;\beta)(I(k=m)-p_m(x_i;\beta)) \\ \end {align}

Expressing the solution in matrix form is much more compact.  First, we introduce several notations.

- $$\mathbf{y}$$ is the concatenated indicator vector, a column vector of dimension $$N(K -1)$$.

$$\mathbf{y}=\begin{pmatrix} \mathbf{y}_{1}\\ \mathbf{y}_{2}\\ \vdots\\ \mathbf{y}_{K-1} \end{pmatrix} \mathbf{y}_{k}=\begin{pmatrix} I(g_1=k)\\ I(g_2=k)\\ \vdots\\ I(g_N=k) \end{pmatrix} 1 \le k \le K-1$$

- $$\mathbf{p}$$ is the concatenated vector of fitted probabilities, a column vector of dimension N(K -1).

$$\mathbf{p}=\begin{pmatrix} \mathbf{p}_{1}\\ \mathbf{p}_{2}\\ \vdots\\ \mathbf{p}_{K-1} \end{pmatrix} \mathbf{p}_{k}=\begin{pmatrix} p_k(x_1;\beta)\\ p_k(x_2;\beta)\\ \vdots\\ p_k(x_N;\beta) \end{pmatrix} 1 \le k \le K-1$$

- $$\tilde{\mathbf{X}}$$ is an $$N(K -1) \times (p + 1)(K -1)$$ block-diagonal matrix, where $$\mathbf{X}$$ is the $$N \times (p + 1)$$ input matrix defined before:

$$\tilde{\mathbf{X}}= \begin{pmatrix} \mathbf{X} & 0 & ... & 0\\ 0 & \mathbf{X} & ... & 0\\ ... & ... & ... &... \\ 0 & 0 & ... & \mathbf{X} \end{pmatrix}$$

- Matrix $$\mathbf{W}$$ is an $$N(K -1) \times N(K -1)$$ square matrix:

$$\mathbf{W}= \begin{pmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & ... & \mathbf{W}_{1(K-1)}\\ \mathbf{W}_{21} & \mathbf{W}_{22} & ... & \mathbf{W}_{2(K-1)}\\ ... & ... & ... &... \\ \mathbf{W}_{(K-1),1} & \mathbf{W}_{(K-1),2} & ... & \mathbf{W}_{(K-1),(K-1)} \end{pmatrix}$$

- Each submatrix $$\mathbf{W}_{km}$$, $$1 \leq k, m \leq K -1$$, is an N × N diagonal matrix.

- When $$k = m$$, the ith diagonal element in $$\mathbf{W}_{kk}$$ is $$p_k(x_i ; \beta^{old})(1 -p_k(x_i ; \beta^{old}))$$.

- When $$k \ne m$$, the ith diagonal element in $$\mathbf{W}_{km}$$ is $$-p_k(x_i ; \beta^{old})\,p_m(x_i ; \beta^{old})$$.

Similarly to the binary classification case,

\begin {align}\frac{\partial l(\beta)}{\partial \beta}& = \tilde{\mathbf{X}}^T(\mathbf{y}-\mathbf{p}) \\ \frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T}& = -\tilde{\mathbf{X}}^T \mathbf{W}\tilde{\mathbf{X}}\\ \end {align}

The formula for updating $$\beta^{new}$$ in the binary classification case holds for multiclass, except that the definitions of the matrices are more complicated.  The formula itself may look simple because the complexity is wrapped in the definitions of the matrices.

\begin {align}\beta^{new}& = (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{W}\mathbf{z} \\ \text{where }\mathbf{z}& \buildrel\triangle\over = \; \tilde{\mathbf{X}}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}) \text{ or simply:}\\ \beta^{new}& = \beta^{old}+ (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T(\mathbf{y}-\mathbf{p}) \\ \end {align}
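The multiclass posteriors and first-order derivatives can be sketched compactly by stacking the K - 1 coefficient vectors $$\bar{\beta}_k$$ into the rows of a matrix. The names and toy data below are ours; the gradient is checked against a finite difference of the log-likelihood:

```python
import numpy as np

def multiclass_posteriors(X, B):
    """p_k(x_i; beta) for k = 1..K-1 plus the base class K.

    B is (K-1) x (p+1); row k holds (beta_k0, beta_k^T)."""
    scores = np.exp(X @ B.T)                       # N x (K-1)
    denom = 1.0 + scores.sum(axis=1, keepdims=True)
    return np.hstack([scores / denom, 1.0 / denom])

def multiclass_gradient(X, g, B):
    """Entry (j, k) is dl/dbeta_kj = sum_i x_ij (I(g_i = k) - p_k(x_i))."""
    P = multiclass_posteriors(X, B)[:, :-1]        # drop the base class column
    Y = np.stack([(g == k).astype(float) for k in range(B.shape[0])], axis=1)
    return X.T @ (Y - P)

# Made-up data: N = 7 points, p = 2 inputs, K = 3 classes (index 2 = base class)
rng = np.random.default_rng(2)
X = np.hstack([np.ones((7, 1)), rng.normal(size=(7, 2))])
B = 0.3 * rng.normal(size=(2, 3))
g = rng.integers(0, 3, size=7)
P = multiclass_posteriors(X, B)
G = multiclass_gradient(X, g, B)
```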

### Computation Issues

We now elaborate on some details in the computation.

To initialize the parameters before starting the iterations, we often use $$\beta = 0$$.

Convergence of the algorithm is not guaranteed, but the iterations usually do converge.

Usually, the log-likelihood increases after each iteration, but overshooting can occur.

In the rare case that the log-likelihood decreases, cut step size by half. That is, instead of using the calculated new parameters, use the average of the new parameters and the old parameters.
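The step-halving safeguard can be sketched as a small wrapper (the helper name is ours; `log_lik` stands for any function that evaluates $$l(\beta)$$):

```python
def damped_update(beta_old, beta_new, log_lik, max_halvings=30):
    """If the proposed update lowers the log-likelihood, repeatedly
    average the new and old parameters (halving the step) until the
    log-likelihood no longer decreases."""
    for _ in range(max_halvings):
        if log_lik(beta_new) >= log_lik(beta_old):
            break                                  # the step now improves l(beta)
        beta_new = 0.5 * (beta_new + beta_old)     # cut the step size by half
    return beta_new
```

For a one-dimensional sanity check with a concave "log-likelihood" peaked at 1, an overshooting step from 0 to 3 is pulled back to 1.5 after one halving.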
