# 9.1.4 - Multiclass Case (K ≥ 3)


When $$K \geq 3$$, the parameter $$\beta$$ is a $$(K-1)(p+1)$$-vector:

$$\beta = \begin{pmatrix} \beta_{10}\\ \beta_{1}\\ \beta_{20}\\ \beta_{2}\\ \vdots\\ \beta_{(K-1)0}\\ \beta_{K-1} \end{pmatrix} = \begin{pmatrix} \beta_{10}\\ \beta_{11}\\ \vdots\\ \beta_{1p}\\ \beta_{20}\\ \vdots\\ \beta_{2p}\\ \vdots\\ \beta_{(K-1)0}\\ \vdots\\ \beta_{(K-1)p} \end{pmatrix}$$

Let $$\bar{\beta}_l=\begin{pmatrix} \beta_{l0}\\ \beta_{l}\end{pmatrix}$$ for $$1 \le l \le K-1$$, with the convention $$\bar{\beta}_K = 0$$ for the reference class $$K$$.

The log-likelihood function becomes

\begin{align}l(\beta)& = \sum_{i=1}^{N}\text{log }p_{g_{i}}(x_i;\beta)\\ & = \sum_{i=1}^{N}\text{log } \left( \frac{e^{\bar{\beta}_{g_{i}}^{T}x_i}}{1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i}} \right) \\ & = \sum_{i=1}^{N}\left(\bar{\beta}_{g_{i}}^{T}x_i-\text{log }\left( 1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i} \right) \right) \\ \end{align}
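As a concreteness check, the log-likelihood above can be computed directly with NumPy. A minimal sketch: the layout below (rows of `beta_bar` holding $$\bar{\beta}_1,\dots,\bar{\beta}_{K-1}$$, class $$K$$ as the reference with linear score zero) is an assumption of this example, not notation fixed by the text.

```python
import numpy as np

def log_likelihood(beta_bar, X, g, K):
    """l(beta) for the multiclass model.

    beta_bar : (K-1, p+1) array whose k-th row is beta_bar_{k+1}
    X        : (N, p+1) design matrix with a leading column of ones
    g        : (N,) integer class labels in {1, ..., K}
    Class K is the reference class with linear score fixed at zero.
    """
    eta = X @ beta_bar.T                           # eta[i, l-1] = beta_bar_l^T x_i
    log_denom = np.log1p(np.exp(eta).sum(axis=1))  # log(1 + sum_l e^{eta_il})
    num = np.zeros(len(g))
    mask = g < K                                   # reference class contributes 0
    idx = np.arange(len(g))
    num[mask] = eta[idx[mask], g[mask] - 1]        # numerator term beta_bar_{g_i}^T x_i
    return float((num - log_denom).sum())
```

With $$\beta = 0$$ every class has equal fitted probability $$1/K$$, so each observation contributes $$-\log K$$.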

Note: the indicator function I(•) equals 1 when the argument is true and 0 otherwise.

The first-order derivatives of the log-likelihood function are:

\begin {align}\frac{\partial l(\beta)}{\partial \beta_{kj}}& = \sum_{i=1}^{N}\left( I(g_i=k)x_{ij}-\frac{e^{\bar{\beta}_{k}^{T}x_i}x_{ij}}{1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i}} \right)\\ & = \sum_{i=1}^{N}x_{ij}(I(g_i=k)-p_k(x_i;\beta)) \\ \end {align}
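The gradient can be computed for all $$(k, j)$$ at once in vectorized form; a small sketch, again assuming the `(K-1, p+1)` layout for `beta_bar` used above:

```python
import numpy as np

def fitted_probs(beta_bar, X):
    # p_k(x_i; beta) for k = 1..K-1, returned as an (N, K-1) array
    e = np.exp(X @ beta_bar.T)
    return e / (1.0 + e.sum(axis=1, keepdims=True))

def score(beta_bar, X, g):
    # Gradient dl/d beta_kj = sum_i x_ij (I(g_i = k) - p_k(x_i; beta)),
    # arranged with the same (K-1, p+1) layout as beta_bar.
    Km1 = beta_bar.shape[0]
    P = fitted_probs(beta_bar, X)                            # (N, K-1)
    Y = (g[:, None] == np.arange(1, Km1 + 1)).astype(float)  # indicators I(g_i = k)
    return (Y - P).T @ X                                     # (K-1, p+1)
```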

Second order derivatives:

\begin{align}\frac{\partial^2 l(\beta)}{\partial \beta_{kj}\partial \beta_{mn}}& = \sum_{i=1}^{N}x _{ij}\cdot \frac{1}{\left(1+\sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i} \right)^2} \cdot \left(-e^{\bar{\beta}_{k}^{T}x_i} I(k=m)x_{in}\left(1+ \sum_{l=1}^{K-1}e^{\bar{\beta}_{l}^{T}x_i}\right)+ e^{\bar{\beta}_{k}^{T}x_i}e^{\bar{\beta}_{m}^{T}x_i}x_{in} \right)\\ & = \sum_{i=1}^{N}x_{ij}x_{in}(-p_k(x_i;\beta) I(k=m)+p_k(x_i;\beta)p_m(x_i;\beta)) \\ & = - \sum_{i=1}^{N}x_{ij}x_{in}p_k(x_i;\beta)(I(k=m)-p_m(x_i;\beta)) \\ \end{align}

Expressing the solution in matrix form is much more compact.  First, we introduce some notation.

- $$\mathbf{y}$$ is the concatenated indicator vector, a column vector of dimension $$N(K -1)$$.

$$\mathbf{y}=\begin{pmatrix} \mathbf{y}_{1}\\ \mathbf{y}_{2}\\ \vdots\\ \mathbf{y}_{K-1} \end{pmatrix}, \quad \mathbf{y}_{k}=\begin{pmatrix} I(g_1=k)\\ I(g_2=k)\\ \vdots\\ I(g_N=k) \end{pmatrix}, \quad 1 \le k \le K-1$$

- $$\mathbf{p}$$ is the concatenated vector of fitted probabilities, a column vector of dimension $$N(K-1)$$.

$$\mathbf{p}=\begin{pmatrix} \mathbf{p}_{1}\\ \mathbf{p}_{2}\\ \vdots\\ \mathbf{p}_{K-1} \end{pmatrix}, \quad \mathbf{p}_{k}=\begin{pmatrix} p_k(x_1;\beta)\\ p_k(x_2;\beta)\\ \vdots\\ p_k(x_N;\beta) \end{pmatrix}, \quad 1 \le k \le K-1$$

- $$\tilde{\mathbf{X}}$$ is an $$N(K-1) \times (p+1)(K-1)$$ block-diagonal matrix, where $$\mathbf{X}$$ is the $$N \times (p+1)$$ input matrix defined before:

$$\tilde{\mathbf{X}}= \begin{pmatrix} \mathbf{X} & 0 & ... & 0\\ 0 & \mathbf{X} & ... & 0\\ ... & ... & ... &... \\ 0 & 0 & ... & \mathbf{X} \end{pmatrix}$$
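Since $$\tilde{\mathbf{X}}$$ is block-diagonal with $$K-1$$ copies of $$\mathbf{X}$$, it is exactly a Kronecker product, $$\tilde{\mathbf{X}} = \mathbf{I}_{K-1} \otimes \mathbf{X}$$; a one-line sketch:

```python
import numpy as np

def expand_design(X, K):
    # X_tilde = I_{K-1} (kron) X : K-1 copies of X along the block diagonal
    return np.kron(np.eye(K - 1), X)
```

In practice one would exploit the block structure rather than materialize this matrix, but the dense form matches the formulas above directly.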

- Matrix $$\mathbf{W}$$ is an $$N(K -1) \times N(K -1)$$ square matrix:

$$\mathbf{W}= \begin{pmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & ... & \mathbf{W}_{1(K-1)}\\ \mathbf{W}_{21} & \mathbf{W}_{22} & ... & \mathbf{W}_{2(K-1)}\\ ... & ... & ... &... \\ \mathbf{W}_{(K-1),1} & \mathbf{W}_{(K-1),2} & ... & \mathbf{W}_{(K-1),(K-1)} \end{pmatrix}$$

– Each submatrix $$\mathbf{W}_{km}$$, $$1 \leq k, m \leq K-1$$, is an $$N \times N$$ diagonal matrix.

– When $$k = m$$, the ith diagonal element in $$\mathbf{W}_{kk}$$ is $$p_k(x_i ; \beta^{old})(1 -p_k(x_i ; \beta^{old}))$$.

– When $$k \ne m$$, the ith diagonal element in $$\mathbf{W}_{km}$$ is $$-p_k(x_i ; \beta^{old})\,p_m(x_i ; \beta^{old})$$.
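The blocks of $$\mathbf{W}$$ can be assembled directly from the fitted probabilities; a sketch assuming an `(N, K-1)` array `P` with `P[i, k-1]` equal to $$p_k(x_i;\beta^{old})$$:

```python
import numpy as np

def weight_matrix(P):
    # Block W_km is diagonal with i-th entry p_k(x_i) * (I(k = m) - p_m(x_i)):
    # p_k (1 - p_k) on the diagonal blocks, -p_k p_m off the diagonal.
    N, Km1 = P.shape
    W = np.zeros((N * Km1, N * Km1))
    for k in range(Km1):
        for m in range(Km1):
            d = P[:, k] * ((k == m) - P[:, m])
            W[k*N:(k+1)*N, m*N:(m+1)*N] = np.diag(d)
    return W
```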

As in the binary classification case,

\begin{align}\frac{\partial l(\beta)}{\partial \beta}& = \tilde{\mathbf{X}}^T(\mathbf{y}-\mathbf{p}) \\ \frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T}& = -\tilde{\mathbf{X}}^T \mathbf{W}\tilde{\mathbf{X}}\\ \end{align}

The binary-case formula for updating $$\beta^{new}$$ carries over to the multiclass case; it looks just as simple, because the added complexity is hidden in the definitions of the matrices.

\begin{align}\beta^{new}& = (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{W}\mathbf{z}, \quad \text{where } \mathbf{z} \buildrel\triangle\over = \tilde{\mathbf{X}}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}),\\ \text{or simply:} \quad \beta^{new}& = \beta^{old}+ (\tilde{\mathbf{X}}^T\mathbf{W}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T(\mathbf{y}-\mathbf{p}) \\ \end{align}
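Putting the pieces together, one Newton-Raphson step in the second form above can be sketched as follows. The flat `(K-1)(p+1)` parameter layout and the class-1-first stacking order are assumptions of this example:

```python
import numpy as np

def newton_step(beta, X, g, K):
    # One update: beta_new = beta_old + (X~^T W X~)^{-1} X~^T (y - p)
    N, p1 = X.shape
    B = beta.reshape(K - 1, p1)                   # rows are beta_bar_1..beta_bar_{K-1}
    e = np.exp(X @ B.T)
    P = e / (1.0 + e.sum(axis=1, keepdims=True))  # fitted probabilities, (N, K-1)
    Y = (g[:, None] == np.arange(1, K)).astype(float)
    Xt = np.kron(np.eye(K - 1), X)                # block-diagonal X tilde
    y, p = Y.T.ravel(), P.T.ravel()               # stacked indicator / probability vectors
    W = np.zeros((N * (K - 1), N * (K - 1)))
    for k in range(K - 1):
        for m in range(K - 1):
            d = P[:, k] * ((k == m) - P[:, m])
            W[k*N:(k+1)*N, m*N:(m+1)*N] = np.diag(d)
    return beta + np.linalg.solve(Xt.T @ W @ Xt, Xt.T @ (y - p))
```

For an intercept-only model with class frequencies $$(1/2, 1/4, 1/4)$$ and $$K = 3$$, iterating this step converges to $$\beta = (\log 2,\, 0)$$, the log-odds of each class against the reference.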

### Computation Issues

We now elaborate on some details of the computation.

To initialize the parameters before starting the iterations, a common choice is $$\beta = 0$$.

Convergence of the algorithm is not guaranteed, but it typically converges in practice.

The log-likelihood usually increases after each iteration, but overshooting can occur.

In the rare case that the log-likelihood decreases, cut the step size in half: instead of using the newly computed parameters, use the average of the new and the old parameters.
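This step-halving rule can be wrapped around any update; a minimal sketch, where `loglik` is a hypothetical callable evaluating $$l(\beta)$$ and `max_halvings` is an illustrative safety cap, not part of the rule in the text:

```python
def damped_update(beta_old, beta_new, loglik, max_halvings=30):
    # If the proposed step decreases the log-likelihood, repeatedly replace
    # beta_new by the average of beta_new and beta_old (i.e. halve the step)
    # until the log-likelihood no longer decreases.
    ll_old = loglik(beta_old)
    for _ in range(max_halvings):
        if loglik(beta_new) >= ll_old:
            break
        beta_new = 0.5 * (beta_new + beta_old)
    return beta_new
```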
