# The Poisson Hidden Markov Model – Part 1 (Concepts and Theory)

###### How a mixture of two powerful random processes can be used to model time series data

A Poisson Hidden Markov Model uses a mixture of two random processes, a Poisson process and a discrete Markov process, to represent counts-based time series data.

Counts-based time series data contain only whole-numbered values such as 0, 1, 2, 3, and so on. Examples of such data are the daily number of hits on an eCommerce website, the number of bars of soap purchased each day at a department store, and so on.

Such data cannot be adequately represented using models such as the Linear model or the ARIMA model, because those models assume that the dependent variable (y) is real-valued, with both positive and negative values allowed. Counts-based time series data require models which assume that the dependent variable is 1) discrete and 2) non-negative; in other words, whole-numbered.

Poisson or Poisson-like models, such as the Generalized Poisson and the Negative Binomial models, often work well for modeling whole-numbered datasets.

Unfortunately, the Poisson family of models does not explicitly account for, and is therefore unable to adequately capture, the auto-correlation in the data, which happens to be a hallmark of time series datasets.

Over the years, researchers have devised several modifications to the basic Poisson model to enable it to account for the auto-correlation in time series data. Notable among them are the Poisson Auto-Regressive model and the Poisson Integer ARIMA model, in which the Poisson conditional mean is expressed as a linear combination of not only the regression variables X, but also lagged copies of the dependent variable y.

The Poisson Hidden Markov Model goes one step further by mixing in a discrete k-state Markov model that is ‘hidden’ in the sense that, at each time step in the time series, one does not know for sure which regime the Markov process is in. Instead, at each time step, one estimates the effect of the possible existence of each regime on the mean value predicted by the Poisson model. This makes the Poisson process’s mean a random variable whose expected value is a function of the probability that the underlying Markov model is in a particular state.

In this chapter, we will precisely formulate this relationship between the mean predicted by the visible Poisson model and the Hidden Markov Model.

If you are new to Markov Processes or Hidden Markov Models, you may want to review the following two chapters:

An Introduction to Discrete Time Markov Processes

Hidden Markov Models

I’d also suggest reviewing the chapter on Markov Switching Dynamic Regression Model to get a detailed understanding of how the MSDR model is built. The Poisson HMM is an MSDR model where the ‘visible’ portion of the model obeys a Poisson process.

## A detailed specification of the Poisson Hidden Markov Model

We will first describe the ‘visible’ Poisson process, and then show how the Markov process ‘mixes’ into it.

Consider the following model equation that incorporates an additive error component:
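Writing μ_cap_t as the fitted mean $\hat{\mu}_t$, the equation is:

```latex
y_t = \hat{\mu}_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2)
```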

In the above model, the observed value y_t is the sum of the predicted value μ_cap_t and the residual error ε_t. We further assume that ε_t is a homoskedastic (constant variance) normally distributed random variable with a mean of zero, represented as N(0, σ²).

y_t is the t-th element of the [n x 1] vector of observed values y:
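That is:

```latex
\mathbf{y} = [y_1, y_2, \ldots, y_n]^T
```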

We assume that y obeys a Poisson process. Therefore, y_t for t in [1…n] are n independent, Poisson-distributed random variables, each with a possibly different mean μ_t, as shown below:
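In symbols:

```latex
y_t \sim \text{Poisson}(\mu_t), \quad t = 1, 2, \ldots, n
```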

The Probability Mass Function (PMF) of y_t, which is just another way of saying ‘the probability of observing y_t’, is given by:
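For a Poisson random variable with mean μ_t:

```latex
P(y = y_t \mid \mu_t) = \frac{\mu_t^{y_t} \, e^{-\mu_t}}{y_t!}, \quad y_t = 0, 1, 2, \ldots
```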

In a trained regression model, we replace μ_t with the ‘fitted’ mean μ_cap_t.

Let X be an [n x (m+1)] sized matrix of regression variables as shown below. The first column of this matrix is a placeholder for the intercept, and x_t is one row of this matrix.

Let β_cap be an [(m+1) x 1] vector of regression coefficients. The ‘cap’ on β indicates that it is the fitted value of the coefficient produced from training the model.
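Laid out explicitly, X and β_cap are:

```latex
\mathbf{X} =
\begin{bmatrix}
1 & x_{11} & \cdots & x_{1m} \\
1 & x_{21} & \cdots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nm}
\end{bmatrix},
\qquad
\hat{\boldsymbol{\beta}} = [\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_m]^T
```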

We will express the mean μ_cap_t as the exponentiated linear combination of x_t and β_cap as follows:
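In symbols:

```latex
\hat{\mu}_t = e^{\mathbf{x}_t \cdot \hat{\boldsymbol{\beta}}}
```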

Here, the dot product between x_t and β_cap can be expressed in expanded form as follows:
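With the first element of x_t set to 1 as the intercept's placeholder:

```latex
\mathbf{x}_t \cdot \hat{\boldsymbol{\beta}} = \hat{\beta}_0 + \hat{\beta}_1 x_{t1} + \hat{\beta}_2 x_{t2} + \cdots + \hat{\beta}_m x_{tm}
```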

The exponentiation of the dot product ensures that the mean and therefore the model’s prediction is never negative. This is a key requirement for modeling a data set of whole numbered counts.

This completes the specification of the Poisson portion of the Poisson HMM. Now, let’s turn our attention to the Markov portion.

## “Mixing-in” the Hidden Markov Model

We will see how to ‘mix’ a discrete Markov model into the Poisson regression model.

Consider a k-state Markov process that is assumed to be in some state j ∈ [1,2,3,…,k]. We do not know which state the Markov process is in at time t. We only assume that it influences the Poisson process model in the following manner:
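One way to write this influence, consistent with the earlier additive-error equation, is a state-conditional version of the model:

```latex
y_t = \hat{\mu}_{t,j} + \epsilon_t
```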

Notice that the fitted mean μ_cap_t_j is now indexed with the Markov state j. μ_cap_t_j is expressed as follows:
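Namely:

```latex
\hat{\mu}_{t,j} = e^{\mathbf{x}_t \cdot \hat{\boldsymbol{\beta}}_j}
```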

Notice that we are now working with a Markov state-specific regression coefficients vector β_cap_j corresponding to the jth Markov state.

If the Markov model operates over ‘k’ states [1,2,…,j,…,k], the regression coefficients take the form of a matrix of size [(m+1) x k], as follows:
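With one column per regime:

```latex
\hat{\boldsymbol{\beta}}_s = [\hat{\boldsymbol{\beta}}_1 \;\; \hat{\boldsymbol{\beta}}_2 \;\; \cdots \;\; \hat{\boldsymbol{\beta}}_k] =
\begin{bmatrix}
\hat{\beta}_{01} & \hat{\beta}_{02} & \cdots & \hat{\beta}_{0k} \\
\hat{\beta}_{11} & \hat{\beta}_{12} & \cdots & \hat{\beta}_{1k} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\beta}_{m1} & \hat{\beta}_{m2} & \cdots & \hat{\beta}_{mk}
\end{bmatrix}
```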

The intuition here is that, depending on which Markov state or ‘regime’ the process is in, the regression model’s coefficients will switch to the appropriate regime-specific vector β_cap_j from β_cap_s.

The k-state Markov process is governed by the following state transition matrix P, where each element p_ij is the probability of transitioning to j at time t given that the process was in state i at time (t-1):
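In matrix form:

```latex
P =
\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1k} \\
p_{21} & p_{22} & \cdots & p_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
p_{k1} & p_{k2} & \cdots & p_{kk}
\end{bmatrix}
```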

The Markov process also has the following state probability distribution π_t at time step t:
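As a row vector whose components sum to 1:

```latex
\boldsymbol{\pi}_t = [\pi_{t1}, \pi_{t2}, \ldots, \pi_{tk}]
```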

Quick tip: Some texts call the state vector δ_t instead of π_t, and they call the Markov state transition probabilities γ_ij instead of p_ij.

Knowing/assuming some value for π_0 at t=0, we calculate π_t as follows:
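Applying the transition matrix repeatedly:

```latex
\boldsymbol{\pi}_t = \boldsymbol{\pi}_{t-1} P = \boldsymbol{\pi}_0 P^t
```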

Let’s circle back to the state-specific formula for the Poisson mean:

μ_cap_t_j is the predicted mean of the Poisson regression model at time t assuming that the underlying Markov process is in state j. Since we don’t actually know which state the Markov process is in at time t, at each time step, for a k-state Markov process, we have to work with k such predicted means as follows:
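Namely, the k means:

```latex
\hat{\mu}_{t,j} = e^{\mathbf{x}_t \cdot \hat{\boldsymbol{\beta}}_j}, \quad j = 1, 2, \ldots, k
```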

It would be absurd to generate k predictions for y_t at each time step. Therefore, we collapse these k predictions into a single prediction μ_cap_t using the formula for expectation. The trick lies in realizing that μ_cap_t is a random variable with k possible values, each associated with a probability of occurrence. This probability is simply π_tj, the unconditional probability that the underlying Markov process is in state j at time t.

Therefore, the predicted mean of the Poisson HMM is the expectation across all possible Markov states, as follows:
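That is:

```latex
\hat{\mu}_t = E[\hat{\mu}_{t,j}] = \sum_{j=1}^{k} \pi_{tj} \, \hat{\mu}_{t,j}
```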

The probabilities π_tj are the component values of the vector π_t. At each time step t, we calculate π_t using the following formula:
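This is the same recursion as before: π_t = π_(t-1)·P. As an illustration, here is a minimal numpy sketch of the propagate-and-collapse computation for a 2-state model. The transition matrix, state distribution, and state-specific means below are made-up values, not outputs of a fitted model:

```python
import numpy as np

# Assumed (illustrative) quantities for a k=2 state Poisson HMM:
P = np.array([[0.9, 0.1],          # state transition matrix
              [0.2, 0.8]])
pi_prev = np.array([0.5, 0.5])     # state distribution at time (t-1)
mu_hat_tj = np.array([3.0, 10.0])  # state-specific fitted Poisson means at time t

# Propagate the state distribution one step: pi_t = pi_(t-1) @ P
pi_t = pi_prev @ P

# Collapse the k state-specific means into one predicted mean
# using the expectation over the Markov states.
mu_hat_t = pi_t @ mu_hat_tj

print(pi_t.round(2))              # [0.55 0.45]
print(round(float(mu_hat_t), 2))  # 6.15
```

Note that the regime with the larger probability weight (state 1 here) pulls the collapsed prediction toward its own mean.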

## Model Training and estimation

Training of the Poisson Hidden Markov model involves estimating the coefficients matrix β_cap_s and the Markov transition probabilities matrix P. The estimation procedure is usually either Maximum Likelihood Estimation (MLE) or Expectation Maximization (EM).

We’ll describe how MLE can be used to find the optimal values of P and β_cap_s that would maximize the joint probability density of observing the entire training data set y. In other words, we would want to maximize the following product:
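That is, the likelihood:

```latex
\mathcal{L} = \prod_{t=1}^{n} P(y = y_t)
```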

In the above product, the probability P(y=y_t) is the Probability Mass Function (PMF) of the Poisson process:
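With the fitted mean plugged in:

```latex
P(y = y_t \mid \hat{\mu}_t) = \frac{\hat{\mu}_t^{y_t} \, e^{-\hat{\mu}_t}}{y_t!}
```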

The probability on the L.H.S. is read as the conditional probability of observing y_t at time t given the fitted mean rate μ_cap_t. μ_cap_t is the expected value of the predicted mean across all possible regimes as calculated using the formula for expectation that we saw earlier.

There is a second way to calculate the Poisson PMF.

Let’s once again look at the formula for the predicted Poisson probability of observing y_t, assuming that the underlying Markov process is in state j:
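Conditional on state j:

```latex
P(y = y_t \mid s_t = j) = \frac{\hat{\mu}_{t,j}^{y_t} \, e^{-\hat{\mu}_{t,j}}}{y_t!}
```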

Where:
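As before:

```latex
\hat{\mu}_{t,j} = e^{\mathbf{x}_t \cdot \hat{\boldsymbol{\beta}}_j}
```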

The Markov state-specific linear combination is expressed as follows:
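In expanded form:

```latex
\mathbf{x}_t \cdot \hat{\boldsymbol{\beta}}_j = \hat{\beta}_{0j} + \hat{\beta}_{1j} x_{t1} + \hat{\beta}_{2j} x_{t2} + \cdots + \hat{\beta}_{mj} x_{tm}
```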

Using the Poisson PMF, and assuming a k-state Markov process, at each time step t, we would get k probabilities of observing y_t, each one conditional upon the Markov process being in state j. The following vector captures these k probabilities:
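The vector is:

```latex
\big[\, P(y = y_t \mid s_t = 1), \;\; P(y = y_t \mid s_t = 2), \;\; \ldots, \;\; P(y = y_t \mid s_t = k) \,\big]
```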

As before, for each time step, we’d want to collapse these k probabilities into a single probability. To do so, we appeal to the Law of Total Probability, which states that if event A can take place jointly with exactly one of the mutually exclusive, exhaustive events A1, A2, A3, and so on, then the unconditional probability of A can be expressed as follows:
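In symbols:

```latex
P(A) = P(A \cap A_1) + P(A \cap A_2) + \cdots = \sum_{i} P(A \mid A_i) \, P(A_i)
```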

Using this law, we will calculate the probability predicted by the Poisson HMM of observing y_t at time t as follows:
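That is:

```latex
P(y = y_t) = \sum_{j=1}^{k} P(y = y_t \mid s_t = j) \, P(s_t = j) = \sum_{j=1}^{k} \frac{\hat{\mu}_{t,j}^{y_t} \, e^{-\hat{\mu}_{t,j}}}{y_t!} \, \pi_{tj}
```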

In the above summation, P(s_t=1), P(s_t=2),…etc. are the Markov process’s state probabilities at time t. We know from our earlier discussion that these are simply π_tj — the different components of the vector π_t.

Effectively, we have figured out two different ways of calculating the Poisson probability P(y=y_t).

Let’s circle back to the Likelihood equation:
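The likelihood to be maximized is:

```latex
\mathcal{L} = \prod_{t=1}^{n} P(y = y_t)
```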

It is often expeditious to maximize the natural logarithm of this product since it has the benefit of converting products into sums and the latter are easier to work with in Calculus (we’ll soon see why). Hence, we will maximize the following log-likelihood denoted by the stylized ℓ:
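Taking logs converts the product into a sum:

```latex
\ell = \ln \mathcal{L} = \sum_{t=1}^{n} \ln P(y = y_t) = \sum_{t=1}^{n} \ln \sum_{j=1}^{k} \pi_{tj} \, \frac{\hat{\mu}_{t,j}^{y_t} \, e^{-\hat{\mu}_{t,j}}}{y_t!}
```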

Maximization of Log-Likelihood is done by using the following procedure:

1. We will take the partial derivative of the log-likelihood w.r.t. each transition probability p_ij in P, and each state-specific coefficient β_cap_q_j in the coefficients matrix β_cap_s, where q lies in [0,1,…,m] and the Markov state j lies in [1,2,…,k].
2. We will set each partial derivative to zero, and,
3. We will solve the resulting system of (k² + (m+1)·k) equations (in practice, far fewer than that number), corresponding to the k² Markov transition probabilities and the (m+1)·k coefficients in β_cap_s, using some optimization algorithm such as Newton-Raphson, Nelder-Mead, or Powell’s method.
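To make the procedure concrete, here is a hedged sketch, in Python with numpy and scipy (anticipating the tooling of the next chapter, though this is not the statsmodels implementation), of numerically minimizing the negative log-likelihood of a 2-state Poisson HMM. All data and starting values are made up for illustration; the likelihood follows this chapter's use of the unconditional state probabilities π_t, and the transition-probability constraints are handled with a free matrix Q, as discussed below:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(0)

# Toy data (made up for illustration): n observations, one regressor
# plus an intercept (m = 1), and a k = 2 state Markov process.
n, m, k = 200, 1, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(lam=4.0, size=n)

def neg_log_likelihood(params):
    # Unpack the free parameters: Q (k x k reals) and the
    # state-specific coefficients matrix beta_s of shape (m+1) x k.
    Q = params[:k * k].reshape(k, k)
    B = params[k * k:].reshape(m + 1, k)

    # A row-wise softmax turns Q into a valid transition matrix P
    # (rows lie in [0, 1] and sum to 1).
    expQ = np.exp(Q - Q.max(axis=1, keepdims=True))
    P = expQ / expQ.sum(axis=1, keepdims=True)

    # State-specific Poisson means: mu[t, j] = exp(x_t . beta_j)
    mu = np.exp(X @ B)                                   # shape (n, k)

    # Log of the Poisson PMF per state: y*ln(mu) - mu - ln(y!)
    log_pmf = y[:, None] * np.log(mu) - mu - gammaln(y + 1)[:, None]
    pmf = np.exp(log_pmf)

    # Propagate pi_t = pi_(t-1) @ P, then collapse with the Law of
    # Total Probability: P(y_t) = sum_j pi_tj * P(y_t | s_t = j)
    pi_t = np.full(k, 1.0 / k)   # assumed pi_0
    ll = 0.0
    for t in range(n):
        pi_t = pi_t @ P
        ll += np.log(pi_t @ pmf[t])
    return -ll

# Optimize all k*k + (m+1)*k free parameters numerically.
x0 = np.zeros(k * k + (m + 1) * k)
res = minimize(neg_log_likelihood, x0, method="Nelder-Mead",
               options={"maxiter": 2000})
print("Negative log-likelihood at optimum:", res.fun)
```

Rather than solving the system of partial-derivative equations symbolically, the optimizer searches the parameter space directly; Newton-Raphson or Powell's method could be swapped in via the `method` argument.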

There is a little wrinkle that we need to smooth out. Throughout the optimization process, the Markov state transition probabilities p_ij need to lie in the [0.0, 1.0] interval, and the probabilities across any row of P need to sum to 1.0. In equation form, these two constraints are expressed as follows:
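In equation form:

```latex
0 \leq p_{ij} \leq 1, \qquad \sum_{j=1}^{k} p_{ij} = 1 \quad \text{for all } i
```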

The summation-to-1.0 constraint becomes obvious when one notes that if the Markov process is in state i at time (t-1), then at the next time step t, it has to be in one of the available states [1,2,3,…,k].

During optimization, we tackle these constraints by defining a matrix Q of size (k x k) as follows:
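Q is simply a matrix of unconstrained real numbers:

```latex
Q =
\begin{bmatrix}
q_{11} & \cdots & q_{1k} \\
\vdots & \ddots & \vdots \\
q_{k1} & \cdots & q_{kk}
\end{bmatrix}, \quad q_{ij} \in (-\infty, +\infty)
```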

Q acts as a proxy for P. Instead of optimizing P, we optimize Q by allowing q_ij to range freely from -∞ to +∞. In each optimization iteration, we obtain p_ij by standardizing the locally optimized q values to the interval [0.0, 1.0], as follows:
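One natural choice for this standardization (the exact transform is an implementation detail; a row-wise softmax is shown here as one option) is:

```latex
p_{ij} = \frac{e^{q_{ij}}}{\sum_{l=1}^{k} e^{q_{il}}}
```

By construction, each p_ij then lies in [0.0, 1.0] and each row of P sums to 1.0, so both constraints are satisfied automatically.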

In the next chapter, we will build and train a Poisson Hidden Markov Model using Python and statsmodels.