Lesson 13: Hidden Markov Model

Introduction

Key Learning Goals for this Lesson:
  • understand the basic model setup in HMM. The comparison with the mixture model is intended to help you understand this.
  • understand the forward-backward algorithm for estimating HMM.
  • understand the estimation of HMM based on multiple sequences.
  • understand a common HMM used for discrete data.
  • understand the Viterbi algorithm, which finds the underlying state sequence with the maximum a posteriori probability given an observed sequence.
  • know some applications of HMM in various research fields.

Hidden Markov models have a close connection with mixture models.

From Mixture Model to Hidden Markov Model (HMM)

When you cluster data using a mixture model, you do not know the cluster labels. The cluster labels are hidden from you.

A mixture model generates data very much like a random switch. Consider the cluster labels as the different states of the switch, as shown in the figure below. Each time, the system randomly picks one of the three states according to some fixed distribution, that is, fixed probabilities on the states. These are the prior probabilities of the states, denoted by πk.

[Figure: a random switch with three states]

Once the state is determined, a random observation is generated according to a distribution specific to that state. For multivariate continuous data, we often assume a Gaussian distribution given the state, with parameters that vary with the state. In the mixture model, each data point is produced independently by the same system. Mathematically, this means the sequence of hidden states is iid (independent and identically distributed).
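The "random switch" can be sketched in a few lines of code. This is a minimal illustration, not from the lesson: the mixture weights (prior probabilities πk), means, and standard deviations below are made-up example values.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.3, 0.2])        # prior probabilities pi_k of the 3 states
means = np.array([-2.0, 0.0, 3.0])    # state-specific Gaussian means (example values)
sds = np.array([0.5, 1.0, 0.8])       # state-specific standard deviations

T = 5
states = rng.choice(3, size=T, p=pi)  # iid hidden states -- the switch has no memory
data = rng.normal(means[states], sds[states])  # observation drawn from the chosen state

print(states)
print(data)
```

Note that each draw of `states` ignores all previous draws; this is exactly the iid assumption that the hidden Markov model will relax.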

For sequential or spatial data, the assumption of independent samples is too restrictive. This motivates the hidden Markov model approach.

Examples where a Hidden Markov model approach is used include:

  • Speech signals
  • Genomic sequences

In these cases, it is not appropriate to assume that the sequential data points are independent. For instance, when we sample speech at a fixed rate and measure the amplitude of the voice, you can imagine that nearby amplitudes of the sound are highly correlated.  The statistical dependence among samples may bear critical information about the characteristics of the speech signal.

Model Setup

Suppose we have sequential data \(\mathbf{u} = \{u_1, u_2, \ldots, u_t, \ldots, u_T\}\), where \(u_t \in \mathbb{R}^d\).

As in the mixture model, every ut , t =1, ... , T , is generated by a hidden state, st .

Let's take a look at how the Hidden Markov model generates data.  Assume that our sequence of data is generated based on a hidden layer of states. (This is just like in the mixture model where you have hidden cluster labels.) These states are assumed to follow a Markov chain (instead of iid!).

The Markov chain can be conveniently represented by a graph. For instance, below we have a Markov chain with three states. Suppose I am currently in state 1. At the next time step, I may stay in state 1, or transition to state 2 or state 3.

[Figure: a three-state Markov chain with transition arrows]

The probabilities for the next state depend on the current state. This is a key difference from the mixture model, where the switch picks each state according to a fixed distribution and the current state does not matter. In a Markov chain, the probability of entering a new state depends on the current state.
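This dependence is easy to see in a short simulation. The sketch below, with assumed example values for the initial distribution and transition matrix, draws each new state from the row of the transition matrix indexed by the current state.

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([1/3, 1/3, 1/3])        # initial distribution over the 3 states
A = np.array([[0.8, 0.1, 0.1],        # A[k, l] = a_{k,l} = P(s_{t+1} = l | s_t = k)
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

T = 10
s = np.empty(T, dtype=int)
s[0] = rng.choice(3, p=pi)
for t in range(1, T):
    s[t] = rng.choice(3, p=A[s[t-1]])  # next state depends only on the current state

print(s)
```

Compare with the mixture-model sampler: there, every state is drawn from the same fixed distribution; here, the sampling distribution changes with `s[t-1]`.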

Very brief review of the Markov chain (MC):

The basic assumption of the first order MC, referred to in short as MC, is that given the present, the future and the past are conditionally independent:

P(st+1 | st , st-1, ... , s0) = P(st+1 | st).

We can define the transition probabilities: ak,l = P(st+1 = l | st = k),
k, l = 1, 2, ... , M, where M is the total number of states.

Initial probabilities of states: πk .

\[\sum_{l=1}^{M} a_{k,l} = 1 \;\;\; \text{ for any } k, \;\;\;\; \sum_{k=1}^{M}\pi_k =1\]

Here is a basic equation for computing the probability of a sequence under a MC. By the chain rule of conditional probabilities and the Markov property, the probability of observing a sequence s1, s2, ... , sT is:

\[P(s_1, s_2, ... , s_T) =P(s_1)P(s_2|s_1)P(s_3|s_2) ... P(s_T|s_{T-1})=\pi_{s_1} a_{s_1, s_2} a_{s_2, s_3} ... a_{s_{T-1}, s_T}\]

The joint probability is expanded as P(s1) multiplied by the probability of s2 conditioned on s1, the probability of s3 conditioned on s2 and so on, in this chain like fashion.
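The chain-rule expansion translates directly into code. Below is a minimal sketch for a toy 2-state chain; the values of π and the transition matrix are chosen purely for illustration.

```python
import numpy as np

pi = np.array([0.6, 0.4])          # initial probabilities (example values)
A = np.array([[0.7, 0.3],          # transition probabilities a_{k,l}
              [0.4, 0.6]])

def chain_prob(seq, pi, A):
    """P(s_1, ..., s_T) = pi_{s_1} * a_{s_1,s_2} * ... * a_{s_{T-1},s_T}."""
    p = pi[seq[0]]
    for prev, nxt in zip(seq[:-1], seq[1:]):
        p *= A[prev, nxt]
    return p

print(chain_prob([0, 0, 1], pi, A))  # 0.6 * 0.7 * 0.3 = 0.126
```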

Other assumptions made in an HMM include:

  • Given the state st , the observation ut is independent of other observations and states. In other words, if I know the state at t, then the data observed at t is statistically independent from everything else.
  • For a fixed state, the observation ut is generated according to a fixed probability law.

Given state k, we have a probability law describing how the observation u is distributed given that state. In the literature this is denoted by bk(u).

Discrete case: suppose u takes finitely many possible values; then bk(u) is specified by the pmf (probability mass function) over those values.

Continuous case: most often a Gaussian distribution is assumed, parameterized by the mean vector μk and covariance matrix Σk:

\[b_k(u)=\frac{1}{\sqrt{(2\pi)^d|\Sigma_k|}}\text{ exp } (-\frac{1}{2}(u-\mu_k)^t\Sigma_{k}^{-1}(u-\mu_k))\]
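The density above can be evaluated with a few lines of numpy. This is a sketch for a single state, with made-up parameters μk and Σk (a standard bivariate Gaussian).

```python
import numpy as np

def b_k(u, mu, Sigma):
    """State-conditional Gaussian density b_k(u), following the formula above."""
    d = len(mu)
    diff = u - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 0.0])   # example mean vector mu_k
Sigma = np.eye(2)           # example covariance matrix Sigma_k

print(b_k(np.array([0.0, 0.0]), mu, Sigma))  # peak value 1/(2*pi)
```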

Some detailed explanation is provided at the website: http://www.stat.psu.edu/~jiali/hmm

In summary:

What we want to do next is to estimate the HMM. Suppose we observe a sequence; if we want to model it with an HMM, how do we choose the parameters? The HMM is a complicated model that poses great computational challenges for estimation. In addition to estimating the model, we also need to efficiently compute the joint probability of a sequence of states together with the observed sequence of data vectors, as well as the probability of the observed sequence alone. The following two equations give these quantities.

\[  \begin {align} P(\mathbf{u}, \mathbf{s})& =P(\mathbf{s})P(\mathbf{u}|\mathbf{s})\\
& =  \pi_{s_1}b_{s_1}(u_1)a_{s_1, s_2}b_{s_2}(u_2) ... a_{s_{T-1},s_{T}}b_{s_T}(u_{T}) \\
P(\mathbf{u}) & = \sum_{\mathbf{s}}P(\mathbf{s})P(\mathbf{u}|\mathbf{s}) \;\;\;\;\;  \text{ total prob. formula }\\
& = \sum_{\mathbf{s}} \pi_{s_1}b_{s_1}(u_1)a_{s_1, s_2}b_{s_2}(u_2) ... a_{s_{T-1},s_{T}}b_{s_T}(u_{T})\\
\end {align} \]
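The two formulas can be checked by brute force on a tiny discrete-emission HMM. In this sketch, M = 2 states emit one of 2 symbols; all parameter values are made-up examples.

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])   # initial state probabilities (example values)
A = np.array([[0.7, 0.3],   # transition probabilities a_{k,l}
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],   # B[k, u] = b_k(u), emission pmf for each state
              [0.2, 0.8]])

def joint_prob(u, s):
    """P(u, s) = pi_{s1} b_{s1}(u1) a_{s1,s2} b_{s2}(u2) ... (first formula)."""
    p = pi[s[0]] * B[s[0], u[0]]
    for t in range(1, len(u)):
        p *= A[s[t-1], s[t]] * B[s[t], u[t]]
    return p

def marginal_prob(u):
    """P(u): sum P(u, s) over all M^T state sequences (second formula)."""
    return sum(joint_prob(u, s)
               for s in itertools.product(range(2), repeat=len(u)))

print(marginal_prob([0, 1]))
```

Summing over all M^T state sequences is exponentially expensive; this is precisely why the forward-backward algorithm, covered later in this lesson, is needed.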