###### How a mixture of two powerful random processes can be used to model time series data

A **Poisson Hidden Markov Model** uses a mixture of two random processes, a **Poisson process** and a **discrete Markov process**, to represent **counts-based time series** data.

Counts-based time series data contain only whole-numbered values such as 0, 1, 2, 3, etc. Examples of such data are the daily number of hits on an eCommerce website, the number of bars of soap purchased each day at a department store, and so on.

Such data cannot be adequately represented using models such as the Linear model or the ARIMA model, because those models assume that the dependent variable (**y**) is real-valued, with both positive and negative values allowed. Counts-based time series data require models which assume that the dependent variable is 1) discrete, and 2) non-negative; in other words, whole-numbered.

The Poisson or Poisson-like models, such as the Generalized Poisson and the Negative Binomial models, often work well for modeling whole-numbered datasets.

Unfortunately, the Poisson family of models does not explicitly account for, and is therefore unable to adequately capture, the auto-correlation in the data, which happens to be a hallmark of time series datasets.

Over the years, researchers have devised several modifications to the basic Poisson model to enable it to account for the auto-correlation in time series data. Notable among them are the **Poisson Auto-Regressive model** and the **Poisson Integer ARIMA model**, in which the Poisson conditional mean is expressed as a linear combination of not only the regression variables **X** but also lagged copies of the dependent variable **y**.

The **Poisson Hidden Markov Model** goes one step further by mixing in a discrete *k*-state Markov model that is ‘hidden’ in the sense that, at each time step in the time series, one does not know for sure which regime the Markov process is in. Instead, at each time step, one estimates the effect of the possible existence of each regime on the mean value predicted by the Poisson model. This makes the Poisson process mean a random variable **whose expected value is a function of the probability that the underlying Markov model is in a particular state.**

In this chapter, we will precisely formulate this relationship between the mean predicted by the visible Poisson model and the Hidden Markov Model.

If you are new to Markov processes or Hidden Markov Models, you may want to review the following chapter:

**An Introduction to Discrete Time Markov Processes**

I’d also suggest reviewing the chapter on **Markov Switching Dynamic Regression Model** to get a detailed understanding of how the MSDR model is built. The Poisson HMM is an MSDR model where the ‘visible’ portion of the model obeys a Poisson process.

## A detailed specification of the Poisson Hidden Markov Model

We will first elaborate on the ‘visible’ Poisson process, and then show how the Markov process ‘mixes’ into it.

Consider the following model equation that incorporates an additive error component:
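```latex
y_t = \hat{\mu}_t + \epsilon_t
```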

In the above model, the observed value *y_t* is the sum of the predicted value *μ_cap_t* and the residual error *ε_t*. We further assume that *ε_t* is a homoskedastic (constant variance), normally distributed random variable with a mean of zero, represented as *N(0, σ²)*.

*y_t* is the *t*-th element of the *[n x 1]* vector of observed values **y**:
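```latex
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
```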

We assume that **y** obeys a Poisson process. Therefore, *y_t* for *t* in *[1…n]* are *n* independent, Poisson distributed random variables, each with a possibly different mean *μ_t*, as shown below:
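```latex
y_t \sim \mathrm{Poisson}(\mu_t), \quad t \in [1 \ldots n]
```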

The **P**robability **M**ass **F**unction (PMF) of **y**, which is just another way of saying the probability of observing *y_t*, is given by:
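```latex
P(y = y_t) = \frac{e^{-\mu_t} \, \mu_t^{y_t}}{y_t!}
```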

In a trained regression model, we replace *μ_t* with the ‘fitted’ mean *μ_cap_t.*

Let **X** be an *[n x (m+1)]* sized matrix of regression variables, as shown below. The first column of this matrix is a placeholder for the regression intercept, and **x**_t is one row of this matrix.
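```latex
\mathbf{X} =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1m} \\
1 & x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{bmatrix},
\quad
\mathbf{x}_t = [1, x_{t1}, x_{t2}, \ldots, x_{tm}]
```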

Let **β_cap** be an *[(m+1) x 1]* vector of regression coefficients. The ‘cap’ on *β* indicates that it is the fitted value of the coefficient, produced by training the model:
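```latex
\boldsymbol{\hat{\beta}} = [\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_m]^T
```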

We will express the mean *μ_cap_t* as the exponentiated linear combination of **x**_t and **β_cap**, as follows:
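```latex
\hat{\mu}_t = e^{\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}}
```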

The dot product between **x**_t and **β_cap** can be expressed in expanded form as follows:
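```latex
\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}} = \hat{\beta}_0 + \hat{\beta}_1 x_{t1} + \hat{\beta}_2 x_{t2} + \cdots + \hat{\beta}_m x_{tm}
```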

The exponentiation of the dot product ensures that the mean and therefore the model’s prediction is never negative. This is a key requirement for modeling a data set of whole numbered counts.

This completes the specification of the Poisson portion of the Poisson HMM. Now, let’s turn our attention to the Markov portion.

## “Mixing-in” the Hidden Markov Model

We will see how to ‘mix’ a discrete Markov model into the Poisson regression model.

Consider a *k*-state Markov process that is assumed to be in some state *j ϵ [1,2,3,…k]*. We do not know which state the Markov process is in at time *t*. We only assume that it influences the Poisson process model in the following manner:
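```latex
P(y = y_t \mid s_t = j) = \frac{e^{-\hat{\mu}_{tj}} \, \hat{\mu}_{tj}^{y_t}}{y_t!}
```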

Notice that the fitted mean *μ_cap_t_j* is now indexed with the Markov state *j*. *μ_cap_t_j* is expressed as follows:
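```latex
\hat{\mu}_{tj} = e^{\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}_j}
```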

Notice that we are now working with a Markov state-specific regression coefficients vector **β_cap**_j corresponding to the *j*-th Markov state.

If the Markov model operates over *k* states *[1,2,…,j,…,k]*, the regression coefficients take the form of a **matrix** of size *[(m+1) x k]*, as follows:
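```latex
\boldsymbol{\hat{\beta}}_s =
[\boldsymbol{\hat{\beta}}_1 \;\; \boldsymbol{\hat{\beta}}_2 \;\; \cdots \;\; \boldsymbol{\hat{\beta}}_k] =
\begin{bmatrix}
\hat{\beta}_{01} & \hat{\beta}_{02} & \cdots & \hat{\beta}_{0k} \\
\hat{\beta}_{11} & \hat{\beta}_{12} & \cdots & \hat{\beta}_{1k} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\beta}_{m1} & \hat{\beta}_{m2} & \cdots & \hat{\beta}_{mk}
\end{bmatrix}
```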

The intuition here is that, depending on which Markov state or ‘regime’ the process is in, the regression model coefficients will **switch** to the appropriate regime-specific vector **β_cap**_j from **β_cap_s**.

The *k*-state Markov process is governed by the following state transition matrix **P**, where each element *p_ij* is the probability of the process transitioning to state *j* at time *t*, given that it was in state *i* at time *(t-1)*:
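```latex
\mathbf{P} =
\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1k} \\
p_{21} & p_{22} & \cdots & p_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
p_{k1} & p_{k2} & \cdots & p_{kk}
\end{bmatrix}
```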

The Markov process also has the following state probability distribution *π_t* at time step *t*:
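```latex
\boldsymbol{\pi}_t = [\pi_{t1}, \pi_{t2}, \ldots, \pi_{tk}], \quad \sum_{j=1}^{k} \pi_{tj} = 1
```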

**Quick tip:** Some texts call the state vector *δ_t* instead of *π_t*, and they call the Markov state transition probabilities *γ_ij* instead of *p_ij*.

Knowing, or assuming, some value for **π_0** at *t=0*, we calculate *π_t* as follows:
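```latex
\boldsymbol{\pi}_t = \boldsymbol{\pi}_{t-1} \mathbf{P} = \boldsymbol{\pi}_0 \mathbf{P}^t
```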

Let’s circle back to the state-specific formula for the Poisson mean:
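```latex
\hat{\mu}_{tj} = e^{\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}_j}
```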

*μ_cap_t_j* is the predicted mean of the Poisson regression model at time *t*, assuming that the underlying Markov process is in state *j*. Since we don’t actually know which state the Markov process is in at time *t*, for a *k*-state Markov process we have to work with *k* such predicted means at each time step, as follows:
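```latex
\hat{\mu}_{t1} = e^{\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}_1}, \quad
\hat{\mu}_{t2} = e^{\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}_2}, \quad \ldots, \quad
\hat{\mu}_{tk} = e^{\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}_k}
```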

It would be absurd to generate *k* predictions for *y_t* at each time step. Therefore, we collapse these *k* predictions into a single prediction *μ_cap_t* using the formula for **expectation**. The trick lies in realizing that *μ_cap_t* is a random variable that has *k* possible values, each of which is associated with a probability of occurrence. And this probability is simply *π_tj*, the unconditional probability that the underlying Markov process is in state *j* at time *t*.

Therefore, the predicted mean of the Poisson HMM is the expectation across all possible Markov states, as follows:
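```latex
\hat{\mu}_t = E[\hat{\mu}_{tj}] = \sum_{j=1}^{k} \pi_{tj} \, \hat{\mu}_{tj}
```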

The probabilities *π_tj* are the component values of the vector *π_t*. At each time step *t*, we calculate *π_t* using the following formula:
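```latex
\boldsymbol{\pi}_t = \boldsymbol{\pi}_{t-1} \mathbf{P}
```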

## Model training and estimation

Training the Poisson Hidden Markov Model involves estimating the coefficients matrix **β_cap_s** and the Markov state transition probabilities matrix **P**. The estimation procedure is usually either Maximum Likelihood Estimation (MLE) or Expectation Maximization (EM).

We’ll describe how MLE can be used to find the optimal values of **P** and **β_cap_s** that would *maximize the joint probability density of observing the entire training data set* **y**. In other words, we want to maximize the following product:
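```latex
\mathcal{L} = \prod_{t=1}^{n} P(y = y_t)
```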

In the above product, the probability *P(y=y_t)* is the **P**robability **M**ass **F**unction (PMF) of the Poisson process:
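```latex
P(y = y_t \mid \hat{\mu}_t) = \frac{e^{-\hat{\mu}_t} \, \hat{\mu}_t^{y_t}}{y_t!}
```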

The probability on the L.H.S. is read as the conditional probability of observing *y_t* at time *t*, given the fitted mean rate *μ_cap_t*. *μ_cap_t* is the expected value of the predicted mean across all possible regimes, calculated using the formula for expectation that we saw earlier.

There is a second way to calculate the Poisson PMF.

Let’s once again look at the formula for the predicted Poisson probability of observing *y_t*, assuming that the underlying Markov process is in state *j*:
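```latex
P(y = y_t \mid s_t = j) = \frac{e^{-\hat{\mu}_{tj}} \, \hat{\mu}_{tj}^{y_t}}{y_t!}
```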

Where:
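```latex
\hat{\mu}_{tj} = e^{\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}_j}
```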

The Markov state-specific linear combination is expressed as follows:
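```latex
\mathbf{x}_t \cdot \boldsymbol{\hat{\beta}}_j = \hat{\beta}_{0j} + \hat{\beta}_{1j} x_{t1} + \hat{\beta}_{2j} x_{t2} + \cdots + \hat{\beta}_{mj} x_{tm}
```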

Using the Poisson PMF, and assuming a *k*-state Markov process, at each time step *t* we would get *k* probabilities of observing *y_t*, each one conditional upon the Markov process being in state *j*. The following vector captures these *k* probabilities:
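```latex
\left[ \, P(y = y_t \mid s_t = 1), \; P(y = y_t \mid s_t = 2), \; \ldots, \; P(y = y_t \mid s_t = k) \, \right]
```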

As before, for each time step, we’d want to collapse these *k* probabilities into a single probability. To do so, we appeal to the **Law of Total Probability**, which states that if event A can take place jointly with exactly one of the mutually exclusive, exhaustive events A1, A2, A3, and so on, then the unconditional probability of A can be expressed as follows:
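```latex
P(A) = \sum_{j} P(A \cap A_j) = \sum_{j} P(A \mid A_j) \, P(A_j)
```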

Using this law, we will calculate the probability predicted by the Poisson HMM of observing *y_t* at time *t* as follows:
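```latex
P(y = y_t) = \sum_{j=1}^{k} P(y = y_t \mid s_t = j) \, P(s_t = j) = \sum_{j=1}^{k} \pi_{tj} \, \frac{e^{-\hat{\mu}_{tj}} \, \hat{\mu}_{tj}^{y_t}}{y_t!}
```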

In the above summation, *P(s_t=1), P(s_t=2), …etc.* are the Markov process’s state probabilities at time *t*. We know from our earlier discussion that these are simply *π_tj*, the different components of the vector *π_t*.

Effectively, we have figured out two different ways of calculating the Poisson probability *P(**y**=y_t).*
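As a quick, concrete illustration of the two calculations, here is a minimal sketch in Python using NumPy and SciPy. The state probabilities, state-specific means, and observed count are made-up values for a hypothetical 2-state model:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical values for one time step t of a 2-state Poisson HMM
pi_t = np.array([0.3, 0.7])   # state probabilities pi_t1, pi_t2
mu_t = np.array([2.0, 8.0])   # state-specific fitted means mu_cap_t_1, mu_cap_t_2
y_t = 5                       # observed count at time t

# Way 1: evaluate the Poisson PMF at the expected mean across all states
mu_expected = np.dot(pi_t, mu_t)          # mu_cap_t = sum_j pi_tj * mu_cap_t_j
p_way1 = poisson.pmf(y_t, mu_expected)

# Way 2: Law of Total Probability - a pi_t-weighted mixture of the
# state-specific Poisson PMFs
p_way2 = np.dot(pi_t, poisson.pmf(y_t, mu_t))

# The two estimates weight the regimes differently and will,
# in general, not be numerically identical.
print(p_way1, p_way2)
```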

Let’s circle back to the Likelihood equation:
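```latex
\mathcal{L} = \prod_{t=1}^{n} P(y = y_t)
```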

It is often expeditious to maximize the **natural logarithm of this product**, since taking the logarithm converts products into sums, and sums are easier to work with in calculus (we’ll soon see why). Hence, we will maximize the following **log**-likelihood, denoted by the stylized ℓ:
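```latex
\ell = \ln \mathcal{L} = \sum_{t=1}^{n} \ln P(y = y_t)
```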

Maximization of Log-Likelihood is done by using the following procedure:

- We will take the partial derivative of the log-likelihood w.r.t. each transition probability *p_ij* in **P**, and each state-specific coefficient *β_cap_q_j* in the coefficients matrix **β_cap_s**, where *q* lies in *[0,1,…,m]* and the Markov state *j* lies in *[1,2,…,k]*.
- We will set each partial derivative to zero, and,
- We will solve the resulting system of *(k² + (m+1)·k)* equations (in practice, much fewer than that number), corresponding to the *k²* Markov transition probabilities and the *(m+1)·k* coefficients in **β_cap_s**, using some optimization algorithm such as Newton-Raphson, Nelder-Mead or Powell’s.

There is a little wrinkle that we need to smooth out. Throughout the optimization process, the Markov state transition probabilities *p_ij* need to lie in the *[0.0, 1.0]* interval, and the probabilities across any row of **P** need to sum to 1.0. In equation form, these two constraints are expressed as follows:
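```latex
0 \le p_{ij} \le 1, \qquad \sum_{j=1}^{k} p_{ij} = 1 \quad \text{for all } i \in [1, 2, \ldots, k]
```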

The summation-to-1.0 constraint becomes obvious when one notes that if the Markov process is in state *i* at time *(t-1)*, then at the next time step *t* it has to be in one of the available states *[1,2,3,…,k]*.

During optimization, we tackle these constraints by defining a *[k x k]* matrix **Q** that acts as a proxy for **P**. Instead of optimizing **P**, we optimize **Q**, allowing each *q_ij* to range freely from -∞ to +∞. In each optimization iteration, we obtain *p_ij* by standardizing the locally optimized *q_ij* values to the interval *[0.0, 1.0]*, as follows:
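One standardization that satisfies both constraints is the row-wise exponential normalization shown below. (This is one common choice; any mapping that confines each *p_ij* to *[0.0, 1.0]* and makes every row of **P** sum to 1.0 would serve.)

```latex
p_{ij} = \frac{e^{q_{ij}}}{\sum_{l=1}^{k} e^{q_{il}}}
```

Here is a minimal sketch of this transform in Python with NumPy; the function name `q_to_p` and the example values are purely illustrative:

```python
import numpy as np

def q_to_p(Q):
    """Map an unconstrained [k x k] matrix Q to a valid Markov state
    transition matrix P via row-wise exponential normalization."""
    expQ = np.exp(Q)
    # Each p_ij now lies in [0.0, 1.0] and each row of P sums to 1.0
    return expQ / expQ.sum(axis=1, keepdims=True)

# Illustrative 2-state example with freely ranging q_ij values
Q = np.array([[0.5, -1.2],
              [2.0,  0.3]])
P = q_to_p(Q)
```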

*In the next chapter, we will **build and train a Poisson Hidden Markov Model using Python and statsmodels**.*

## References and Copyrights

### Books

Cameron, A., & Trivedi, P. (2013). *Regression Analysis of Count Data* (2nd ed., Econometric Society Monographs). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139013567

Hamilton, J. D. (2020). *Time Series Analysis*. Princeton: Princeton University Press. ISBN: 0691218633

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **The Poisson Integer ARIMA Regression Model

**NEXT: **The Poisson Hidden Markov Model – Part 2 (Implementation in Python)