And a tutorial on how to implement a Poisson INAR(1) regression model using Python and statsmodels
Integer ARIMA models are used for modeling time series data consisting of whole numbered counts.
Such data sets pose a few unique challenges:
The data is auto-correlated: Time series data is often auto i.e. self correlated. Any model we build for such data needs to account for these serial correlations. ARIMA (Auto-regressive Integrated Moving Average) models are designed to capture auto-correlations in time series data.
The data consists of only whole numbered counts 0,1,2,3,…etc.: ARIMA models are designed for modeling real valued time series data, and not counts based time series data. Counts based data can be modeled using Poisson and Poisson-like models such as Negative Binomial and Generalized Poisson models. Unfortunately, Poisson and Poisson-like models are static models that are not designed to handle correlated time series data.
The Poisson Integer ARIMA model bridges the gap between the ARIMA model for time series data, and the Poisson and Poisson-like models for counts based data sets.
As we’ll see below, structurally, Poisson INARIMA models are built very differently than either the ARIMA or the Poisson model.
In this section, we’ll focus on the Poisson INAR(1) model introduced by Brannas in “Explanatory Variables in the AR(1) Model” (see paper link at the bottom of the page). INAR(1) is an integer ARIMA model with an auto-regressive component of order 1, i.e. it uses only the first time lagged component, y_(t-1). The Poisson AR(1) models the error component using a Poisson process.
Let’s examine the structure of the Poisson Integer AR(1) model. We’ll often use the symbolic conventions followed in the Regression Analysis of Count Data book by Cameron A.C. and P.K. Trivedi, and some of the source papers on the topic. Book and paper links are mentioned at the end of the page.
Structure of the Poisson INAR(1) model
The generic Integer AR(1) model is expressed as follows:
In the INAR(1) model, we express express y_t, which is the value of the time series at time t, as the sum of two quantities:
ρ ○ y_(t-1): This term is a function of the value of the time series at the previous time step (t-1). We’ll get to the meaning of the ○ operator soon.
ε_t: This is a random variable that represents the hidden and therefore unknown variables. We will choose to model the effect of the latent variables by assuming some known probability distribution such as Normal or Poisson for the random variable ε_t.
The binomial thinning operator ○
The first term, ρ ○ y_(t-1) deserves additional explanation. The way to understand this quantity is as follows:
Suppose we are modeling the number of hits per day on an eCommerce site. This is a time series data of counts, and one can reasonably expect the number of hits on day t to be correlated with the number of hits on prior days.
Now suppose on some day t, the number of hits y_t =1000. Suppose also that we look at this number, 1000, as 1000 independent, identically distributed random variables: b_1, b_2, b_3,…b_1000. Each variable b_i follows the following Bernoulli distribution:
So we can look at the value of y_t=1000 as 1000 independent Bernoulli trials, each one with a probability of success of ρ. If a trial ‘succeeds’, a hit is registered on the website. Thus, the expected number of hits one day t is simply ρ times y_t.
Now we can see how the INAR(1) equation models the number of hits on the website at time index t, as the sum of:
- The expected number of website hits at the previous time step (t-1), and
- An adjustment to this expected value, provided by the realized value of the random variable ε_t. ε_t is used to model some latent data generation process.
The Poisson INAR(1) model
Since we wish to model time series data of whole numbered counts, it would be appropriate to assume that the underlying (unknown) data generating process is a Poisson process. And therefore, ε_t is a Poisson distributed random variable.
Thus, the Poisson INAR(1) model can be expressed as follows:
Basically, what we are saying is that the value observed at time t is a combination of the expected value at the previous time step, and a Poisson distributed count assuming a Poisson rate of μ_t.
How to handle regression variables
Suppose X is the matrix of regression variables, and β is a vector of regression coefficients, as follows:
We will express the Poisson process’s mean rate of occurrence μ_t as the following function of X and the regression coefficients β as follows:
Estimation involves estimating ρ and β. Various researches in the field have proposed a number of estimation techniques such as Maximum Likelihood Estimation, Conditional Least Squares, Weighted Least Squares and Generalized Method of Moments. In “Explanatory Variables in the AR(1) Model”, Brannas used Conditional Least Squares and Conditional Generalized Method of Moments techniques for parameter estimation.
We will explore the use of the Maximum Likelihood Estimation (MLE) method to estimate ρ and β. Our choice of MLE is driven largely by the presence of the GenericLikelihoodModel class in the statsmodels library. The GenericLikelihoodModel class lets you specify a custom MLE function of your choice which statsmodels will happily maximize for you.
Maximum Likelihood Estimation for the Poisson INAR(1) model
In the maximum likelihood method, we seek values of ρ and β which would maximize the likelihood of observing the entire training data set. Specifically, we wish to find ρ and β which would maximize the natural logarithm of the joint probability of occurrence of y_t in y for t=1 to n. Notation-wise, we would like to maximize the following log-likelihood function:
In practice, we convert the logarithm of the product into a summation of logarithm of individual probabilities by using the rule:
ln(A*B*C*D*…) = ln(A) + ln(B) + ln(C) + ln(D) + …
To maximize ℓ(⍴;β|y), we need to construct the conditional probability distribution P(y_t|y_(t-1)). Let’s see how to do this.
The Probability Mass Function of the Poisson INAR(1) distributed random variable
Let’s look at the equation of the Poisson INAR(1) model:
We see that y_t is made up of two components: the expected value of y_(t-1) and a Poisson distributed variable ε_t. Hence the probability of observing y_t, given y_(t-1), can be expressed as the product of two probabilities:
- The Binomial probability of observing j number of ‘events’ out of y_(t-1) possible events, followed by,
- The Poisson probability of observing (y_t — j) ‘events’.
Since we don’t know what j is, we allow j to range from 0 through y_t (actually, 0 through min(y_t, y_(t-1), but that’s a technicality so that everything still makes sense). For each value of j, we calculate the product of the above two probabilities in (1) and (2). Finally, we sum up all the individual products, as follows:
What the above equation is saying is that the probability of observing y_t at time step t given y_(t-1) was observed at the previous time step, is equal to the Binomial probability of observing 0 out of y_(t-1) events and the Poisson probability of observing y_t events, OR, the Binomial probability of observing 1 out of y_(t-1) events and the Poisson probability of observing (y_t — 1) events, and so forth.
Let’s recollect that in the above equation, the Poisson process’s mean rate of occurrence μ_t is expressed as the following function of regression variables x_t and the regression coefficients β:
Let’s also recollect that ρ is the Binomial probability and while doing MLE, we don’t want ρ to go out of bounds. Hence, we define another variable γ such that:
The above logistic function ensures that as γ ranges from -∞ to +∞, the Bernoulli probability ρ remains bounded in [0,1].
How to build and train a Poisson INAR(1) using Python and Statsmodels
We’ll illustrate the process of using a Poisson INAR(1) model using the STRIKES data set:
The MANUFACTURING STRIKES data set
To illustrate the model fitting procedure, we will use the following open source data set that is widely used in regression modeling literature:
The data set is a monthly time series showing the relationship between U.S. manufacturing activity measured as a departure from the trend line, and the number of contract strikes in U.S. manufacturing industries beginning each month from 1968 through 1976.
This data set is available in R and it can be fetched using the statsmodels Datasets package.
The dependent variable y is strikes.
We’ll start by importing all the required packages:
import math import numpy as np import statsmodels.api as sm from statsmodels.base.model import GenericLikelihoodModel from scipy.stats import poisson from scipy.stats import binom from patsy import dmatrices import statsmodels.graphics.tsaplots as tsa from matplotlib import pyplot as plt
Let’s load the data set into memory using statsmodels:
strikes_dataset = sm.datasets.get_rdataset(dataname='StrikeNb', package='Ecdat')
Print out the data set:
We see the following output:
We’ll consider the first 92 data points as the training set and the remaining 16 data points as the test data set:
strikes_data = strikes_dataset.data.copy() strikes_data_train = strikes_data.query('time<=92') strikes_data_test = strikes_data.query('time>92').reset_index().drop('index', axis=1)
Here is our regression expression. strikes is the dependent variable and output is our explanatory variable. The intercept of regression is assumed to be present:
expr = 'strikes ~ output'
We’ll use Patsy to carve out the X and y matrices. Patsy will automatically add a regression intercept column to X:
y_train, X_train = dmatrices(expr, strikes_data_train, return_type='dataframe') print(y_train) print(X_train) y_test, X_test = dmatrices(expr, strikes_data_test, return_type='dataframe') print(y_test) print(X_test)
Next, we will extend the GenericLikelihoodModel:
class INAR(GenericLikelihoodModel): def __init__(self, endog, exog, **kwds): super(INAR, self).__init__(endog, exog, **kwds)
In our extension, we will override the nloglikeobs() and the fit() methods. The nloglikeobs() method is called by statsmodels to get the value of the log-likelihood of each observation y_t. Therefore, the likelihood function of the Poisson INAR(1) that we had described earlier goes into this method. This method returns an array of log-likelihoods, and the super-class supplied by statsmodels, sums up all the values in this array to get the total log-likelihood value that is optimized by the statsmodels’ optimizer.
class PoissonINAR(GenericLikelihoodModel): def __init__(self, endog, exog, **kwds): super(INAR, self).__init__(endog, exog, **kwds) def nloglikeobs(self, params): #Fetch the parameters gamma and beta #that we would be optimizing gamma = params[-1] beta = params[:-1] #Set y and X y = self.endog y = np.array(y) X = self.exog #Compute rho as a function of gamma rho = 1.0/(1.0+math.exp(-gamma)) #Compute the Poisson mean mu as a dot #product of X and Beta mu = np.exp(X.dot(beta)) #Init the list of log-likelihhod values, #one value for each y ll =  #Compute all the log-likelihood values for #the Poisson INAR(1) model for t in range(len(y)-1,0,-1): prob_y_t = 0 for j in range(int(min(y[t], y[t-1])+1)): prob_y_t += poisson.pmf((y[t]-j), mu[t]) * binom.pmf(j, y[t-1], rho) ll.append(math.log(prob_y_t)) ll = np.array(ll) #return the negated array of log-likelihoods return -ll
Let’s also implement the model.fit() method:
def fit(self, start_params=None, maxiter=1000, maxfun=5000, **kwds): #Add the gamma parameter to the list of #exogneous variables that the model will optimize self.exog_names.append('gamma') if start_params == None: #Start with some initial values of Beta and gamma start_params = np.append(np.ones(self.exog.shape), 1.0) #Call super.fit() to start the training return super(PoissonINAR, self).fit(start_params=start_params, maxiter=maxiter, maxfun=maxfun, **kwds)
Let’s create an instance of the Poisson INAR(1) model class, and train it on the training data set:
inar_model = PoissonINAR(y_train, X_train) inar_model_results = inar_model.fit()
Print the model training summary:
We see the following output:
Significance of regression coefficients
The coefficients for the output variable and gamma are both significant at a 95% confidence interval as evidenced by their p-values which are less than 0.05. So is the intercept of regression which is significantly different than zero:
Interpretation of regression coefficients
Interpretation of coefficients is not straightforward. Gamma is -0.7039 which corresponds to a ρ of 1/(1+exp(0.7039)) = 0.33095.
The estimated β is 2.6215.
The fitted model’s equation is:
For any given t, 0.33095 (roughly 33%) of y_(t-1) constitutes y_t. The rest comes from the Poisson process’s estimated mean μ_t which is a function of output at time t and the estimated β vector.
Measuring the effect of changes in output on frequency of strikes
From the model equation, it is easy to see that the number of observed strikes increases with increase in manufacturing output. We may want to measure by how much they increase.
In the training data set, the ‘output’ variable has a standard deviation of 0.05654. This can be seen by printing the following:
Now suppose we consider two hypothetical values of y_t, namely y_t1 and y_t2 such that their respective previous values y_(t1–1) and y_(t2–1) happen to be the same. Since β=2.6215, one standard deviation increase in ‘output’, will cause the Poisson process’s estimated mean to fluctuate by e^(2.6215*0.05654) = 1.15977, i.e. approximately 16%. Thus one standard deviation increase in output causes the number of strikes observed per month to increase by 16%.
Goodness of fit of the Poisson INAR(1) model
Determination of goodness of fit is made complicated by the lagged term y_(t-1) in the model’s equation. The non linearity of the Poisson INAR(1) model precludes the use of Mean Squared Error based measures such as R-squared. On the other hand, since MLE was used for model fitting, deviance based measures such as pseudo-r-squared and the Chi-squared distributed Likelihood Ratio (LR) test may be used to judge the goodness of fit. However, both pseudo-r-squared and the LR test require the Log-likelihood (LL) of the Null model (a.k.a. the Intercept only model) to be calculated. With the Poisson INAR(1) model, due to the presence of the lagged term y_(t-1) in the model’s equation, what constitutes a Null model is not immediately obvious and is open for debate.
Given these difficulties, one may want to the judge the goodness of fit via indirect means, particularly, by inspecting the standard errors and the corresponding 95% confidence intervals of the fitted model’s parameters.
Upon taking this approach with the ‘strikes’ model, we observe the following things:
The 95% confidence interval of ρ ranges from 1/(1+exp(1.174))=0.23613 to 1/(1+exp(0.233))=0.44201 which is reasonably tight around the fitted value of ρ of 1/(1+exp(0.7039)) = 0.33095.
Unfortunately, the ‘output’ variable has a pretty large standard error of 1.126 and a correspondingly wide confidence interval of 0.415 to 4.828. As expected, and as we saw in the previous section, it’s effect on ‘strikes’ is pretty weak. That has a compromising effect on the goodness of fit of the model.
The model allows for making one-step ahead forecasts. Prediction consists of running the fitted model on the X matrix and lagged values of y. We will run the fitted model on X_test and lagged values of y. To do so, we will implement the predict() method on model.py in statsmodels as follows:
def predict(self, params, exog=None, *args, **kwargs): #Fetch the optimized values of parameters gamma and beta fitted_gamma = params[-1] fitted_beta = params[:-1] #Compute rho as a function of gamma rho = 1.0/(1.0+math.exp(-fitted_gamma)) #Get the Intercept and the regression variables, #Don't get the last column which contains the lagged y values X = exog[:,:-1] #Fetch the lagged y values y_lag_1 = exog[:,-1] #Compute the predicted y using the fitted Poisson INAR(1) #model's equation y_pred = rho * y_lag_1 + np.exp(X.dot(fitted_beta)) return y_pred
Let’s prepare the X matrix for prediction:
X_test['y_lag_1'] = y_test.shift(1) X_test = X_test.fillna(0)
Generate the predictions on the test data set. We will round up the predictions as we are interested in the counts.
inar_predictions = np.round(inar_model_results.predict(exog=X_test)) print(inar_predictions)
We get the following output:
Let’s plot the one-step-ahead forecasts for y_test:
predicted_counts=inar_predictions actual_counts = y_test['strikes'] fig = plt.figure() fig.suptitle('Predicted versus actual strike counts') predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts') actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts') plt.legend(handles=[predicted, actual]) plt.show()
Here’s what we see:
The quality of the one-step ahead forecasts bear out the problems we highlighted earlier with the goodness-of-fit of the Poisson INAR(1) model.
Here is the link to the complete source code:
The Poisson Integer ARIMA model using Python and statsmodels
Citations and Copyrights
Cameron A. Colin, Trivedi Pravin K., Regression Analysis of Count Data, Econometric Society Monograph №30, Cambridge University Press, 1998. ISBN: 0521635675
Brännäs, Kurt. (1995). Explanatory variables in the AR(1) count data model PDF download link
Kennan J., The duration of contract strikes in U.S. manufacturing, Journal of Econometrics, Volume 28, Issue 1, 1985, Pages 5–28, ISSN 0304–4076, https://doi.org/10.1016/0304-4076(85)90064-8. PDF download link
The Manufacturing strikes data set is one of several data sets available for public use and experimentation in statistical software, most notably, over here as an R package. The data set has been made accessible for use in Python by Vincent Arel-Bundock via vincentarelbundock.github.io/rdatasets under a GPL v3 license.