13.9 - Viterbi Algorithm

In many applications using HMM, we need to predict the state sequence s = {s1, ... , sT}based on the observed data u = {u1, ... , uT}.

Optimization criterion: find s that maximizes the posterior probability P(s | u):

\[  \begin {align} \mathbf{s}^*& =\text{arg }\underset{\mathbf{s}}{\text{ max }}P(\mathbf{s}| \mathbf{u})\\
& =  \text{arg }\underset{\mathbf{s}}{\text{ max }}\frac{P(\mathbf{s}, \mathbf{u})}{P(\mathbf{u})} \\
& =  \text{arg }\underset{\mathbf{s}}{\text{ max }}P(\mathbf{s}, \mathbf{u}) \\
\end {align} \]

This criterion is called the rule of Maximum A Posteriori (MAP).

The optimal sequence {s1, s2, ... , sT}can be found by the Viterbi algorithm.  We will prove the optimality of the Viterbi algorithm below.  Again, the proof can be omitted for lack of interest.

The amount of computation in the Viterbi algorithm is at the order of M2T. Memory required is at the order of MT.

The Viterbi algorithm maximizes an objective function G(s), where s = {s1, ... , sT}, st ∈{1, ... , M}, is a state sequence and G(s) has a special property.

Brute-force optimization of G(s) involves an exhaustive search of all the MT possible sequences.

Property of G(s) for the applicability of the Viterbi algorithm:

\[G(s)=g_1(s_1)+g_2(s_2,s_1)+g_3(s_3,s_2)+ ... + g_T(s_T,s_{T-1})\]

The key is that the objective function can be written as a sum of "merit" functions depending on one state and its preceding one.

Under this property of G(s), the optimal solution for s has a Markovian kind of property:

  • Suppose in the optimal state sequence s*, the tth position s*t = k. To maximize G(s1, s2, ... , sT), we can maximize the following two functions separately:

    \(G_{t,k}(s_1, ... , s_{t-1}) =g_1(s_1)+g_2(s_2,s_1)+ ... + g_t(k,s_{t-1})\)
    \(\bar{G}_{t,k}(s_{t+1}, ... , s_{T}) =g_{t+1}(s_{t+1},k)+ ... + g_T(s_T,s_{T-1})\)

  • The first function involves only states before t; and the second only states after t.

  • Also note the recursion of Gt,k(s1, ... , st-1):

    \(G_{t,l}(s_1, ... , s_{t-2}, k) = G_{t-1, k}(s_1, ... , s_{t-2})+g_t(l,k)\)

plot

As shown in the above figure, every state sequence s corresponds to a path from t = 1 to t = T.

We put weight gt(k,l) on the link from state l at t −1 to state k at t.

At the starting node, we put weight g1(k) for state k.

G(s) is the sum of the weights on the links in path s.  The goal is to find a path with the maximum total weight along the path.

In the figure, suppose the colored path is the optimal one. At t =3, this path passes through state 2. Then the sub-path before t =3 should be the best among all paths from t =1 to t =3 that end at state 2. The sub-path after t =3 should be the best among all paths from t =3 to t =6 that start at state 2.

plot

Pseudocode

At t = 1, for each node (state) k = 1, ..., 4, record G*1,k = g1(k).

At t = 2, for each node k = 1, ... , 4, only need to record which node is the best preceding one. Suppose node k is linked to node l at t = 1, record G*2,k = G*1,l + g2(k, l). The best preceding state l for state k is found by exhaustive search.  That is, for every possible previous state l', compute G*t-1,l' + gt(k, l') whichever l that maximizes the sum is chosen. This l and G*t,k are stored.

The same procedure is applied successively for t = 2, 3, ... , T . At every node, link it to its best preceding one. Set G*t,k = G*t-1,l' + gt(k, l'), assuming l is the best preceding node of k. G*t,k is the sum of weights of the best path up to t and with the end tied at state k.

At the end, only M paths are formed, each ending with a different state at t = T. The objective function for a path ending at node k is G*T,k . Pick k* that maximizes G*T,k . Trace the path backwards from the last state k*.  The tracing back is shown in the above figure.  Because at every state at every time, the algorithm records the corresponding best preceding state, by finding the best preceding state backwards and recursively, we recover the optimal path. 

Proof for the Viterbi Algorithm

Notation:

Let s* (t, k) be the sequence {s1, ... , st-1} that maximizes Gt,k(s1, ... , st-1):

\(\bar{\mathbf{s}}^*(t,k)=\text{arg }\underset{s_{t+1}, ... , s_{T}}{max}G_{t,k}(s_{1}, ... , s_{t-1})\)

Let G*t,k = maxs1, ... , st-1 Gt,k(s1, ... , st-1).

Let \(\mathbf{s}^*(t,k)\) be the sequence {st+1 , ... , sT}that maximizes \(\bar{G}_{t,k}(s_{t+1}, ... , s_T)\):

\(\bar{\mathbf{s}}^*(t,k)=\text{arg }\underset{s_{t+1}, ... , s_{T}}{max}\bar{G}_{t,k}(s_{t+1}, ... , s_{T})\)

Key facts for proving the Viterbi algorithm:

  • If the optimal state sequence s* has the last state s*T = k, then the subsequence of s* from 1 to T−1 should be s* (T,k) and

    \(\underset{\mathbf{s}}{\text{ max }}G(\mathbf{s})=G_{T,k}(\mathbf{s}^*(T,k))\)

  • Since we don't know what should be s*T , we should compare all the possible states k = 1, ... , M :

    \(\underset{\mathbf{s}}{\text{ max }}G(\mathbf{s})=\underset{k}{\text{ max }}G_{T,k}(\mathbf{s}^*(T,k))\) 

  • Gt,k(s*(t, k)) and s*(t, k) can be obtained recursively for t = 1, ... , T .

Proof for the recursion:

  • Suppose Gt-1,k(s*(t-1, k))and s*(t-1, k) for k = 1, ... , M have been obtained. For any l = 1, ... , M:

    \[  \begin {align} & \underset{s_1, ... , s_{t-1}}{\text{ max }}G_{t,l}(s_1, ... , s_{t-1})\\
    & =  \underset{k}{\text{ max }}\underset{s_1, ... , s_{t-2}}{\text{ max }}G_{t,l}(s_1, ... , s_{t-2}) \\
    & =  \underset{k}{\text{ max }}\underset{s_1, ... , s_{t-2}}{\text{ max }}(G_{t-1,k}(s_1, ... , s_{t-2}) +g_t(l,k))\\
    & =  \underset{k}{\text{ max }}(g_t(l,k)+\underset{s_1, ... , s_{t-2}}{\text{ max }}G_{t-1,k}(s_1, ... , s_{t-2}))\\
    & =  \underset{k}{\text{ max }}(g_t(l,k)+G_{t-1,k}(\mathbf{s}^*(t-1,k))\\
    \end {align} \]

  • Suppose k* achieves the maximum, then for s*(t,l), the last state s*t = k* and the subsequence from position 1 to t −1 is s*(t-1, k*).
  • The amount of computation involved in deciding Gt,l(s*(t,l))and s*(t,l) for all l = 1, ... , M is at the order of M2 . For each l, we have to exhaust M possible k's to find k*.
  • To start the recursion, we have

    \[  \begin {align}G_{1,k}( \cdot)& =g_1(k)\\
    \mathbf{s}^*(1,k) & = \{k \} \\
    \end {align} \]

  • Note: at t = 1, there is no preceding state.