# The Auto-Regressive Poisson Model

###### A Poisson regression model for auto-correlated time series data

Poisson and Poisson-like regression models are often used for counts based data sets, namely data that contain whole numbered counts. For example, the number of people walking into the emergency room of a hospital every hour is one such data set.

Ordinary Least Squares Linear Regression models or non-linear models such as those based on Neural Nets based regression techniques don’t work well for such data sets because they can predict negative values.

If the data set is a time series of counts, additional modeling complications arise because time series data are often auto-correlated. Previous counts influence the values of future counts. If the regression model is unable to adequately capture the ‘information’ contained within these correlations, the ‘unexplained’ information will leak into the model’s residual errors in the form of auto-correlated errors. The Goodness-of-fit of the model will be poor in this case.

Common remedies to address this issue are as follows:

1. Before fitting a regression model, check if the time series exhibits seasonality and if it does, perform seasonality adjustment. Doing so, explains away the seasonal auto- correlations if any.
2. Perform a first difference of the time series i.e. y_t — y_(t-1) for all t and perform a white-noise test on the differenced time series. If the differenced time series can be shown to be white noise, then the original time series is a Random Walk. In that case, no further modeling is needed.
3. Fit a Poisson (or a related) counts based regression model on the seasonally adjusted time series but include lagged copies of the dependent y variable as regression variables.

We’ll explain how to fit a Poisson or Poisson-like model on a time series of counts using approach (3).

## The MANUFACTURING STRIKES data set

To illustrate the model fitting procedure, we will use the following open source data set that is widely used in regression modeling literature:

The data set is a monthly time series showing the relationship between U.S. manufacturing activity measured as a departure from the trend line, and the number of contract strikes in U.S. manufacturing industries beginning each month from 1968 through 1976. The STRIKES Data set (Source: R data sets) (Image by Author)

This data set is available in R and it can be fetched using the statsmodels Datasets package.

The dependent variable y is strikes.

We’ll start by importing all the required packages:

```import statsmodels.api as sm
import statsmodels.discrete.discrete_model as dm
import numpy as np
from patsy import dmatrices
import statsmodels.graphics.tsaplots as tsa
from matplotlib import pyplot as plt
```

Let’s load the data set into memory using statsmodels:

```strikes_dataset = sm.datasets.get_rdataset(dataname='StrikeNb', package='Ecdat')
```

Print out the data set:

```print(strikes_dataset.data)
```

We see the following output:

We’ll consider the first 92 data points as the training set and the remaining 16 data points as the test data set.

```strikes_data = strikes_dataset.data.copy()
strikes_data_train = strikes_data.query('time<=92')

strikes_data_test = strikes_data.query('time>92').reset_index().drop('index', axis=1)
```

Print out the stats for the dependent variable:

```print('Mean='+str(np.mean(strikes_data_train['strikes'])) + ' Variance='+str(np.var(strikes_data_train['strikes'])))
```

We get the following output:

`Mean=5.5 Variance=14.728260869565217`

We see that y is over-dispersed so it violates the mean=variance assumption of the Poisson model. To account for the over-dispersion, we’ll fit a Negative Binomial regression model having the following NB2 Variance function:

Here is our regression expression. strikes is the dependent variable and output is our explanatory variable. The intercept of regression is assumed to be present:

```expr = 'strikes ~ output'
```

We’ll use Patsy to carve out the X and y matrices. Patsy will automatically add a regression intercept column to X:

```y_train, X_train = dmatrices(expr, strikes_data_train, return_type='dataframe')
print(y_train)
print(X_train)

y_test, X_test = dmatrices(expr, strikes_data_test, return_type='dataframe')
print(y_test)
print(X_test)
```

Build and train the Negative Binomial regression model with the NB2 variance function:

```nb2_model = dm.NegativeBinomial(endog=y_train, exog=X_train, loglike_method='nb2')

nb2_model_results = nb2_model.fit(maxiter=100)

print(nb2_model_results.summary())
```

We get the following output of the fitted model’s summary:

Both the output and the dispersion parameter alpha are statistically significant at 95% confidence level as indicated by their p-values (0.034 and 0.000) of the coefficients.

### Goodness of fit of the NB model

The Pseudo R-squared is only 0.9% indicating a very poor fit quality on the training data set.

The p-value of the Log-Likelihood Ratio test is 0.03589 indicating that the model is doing better than the Intercept Only Model (a.k.a. Null Model) at a 95% confidence level, but not at a 99% or higher confidence level.

Let’s look at the Auto-correlation plot of the fitted model’s residual errors: Autocorrelation plot of residual errors of the NB2 model (Image by Author)

We can see that the residual errors are auto-correlated at time lags 1, 2 and 3 indicating that there is auto-correlation in the dependent variable strikes that the NB2 model wasn’t able to fully explain causing it to leak into the model’s residuals.

Overall, the goodness of fit of this model is very poor.

## Building an Auto-regressive Poisson model

To fix the situation with auto-correlated residuals, we will introduce lagged copies of y, specifically y_(t-1), y_(t-2) and y_(t-3) as regression variables along with the output variable.

But instead of directly introducing y_(t-k) as a regression variables, we’ll use ln[y_(t-k)] so as to solve the ‘model explosion’ problem when the coefficient of y_(t-k) is positive.

But using the ln() transform gives rise to the question about how to deal with zero values of y_t for which the logarithm is undefined.

We solve this problem using the following trick outlined by Cameron and Trivedi in their book Regression Analysis of Count Data (See Section 7.5: Auto-regressive models):

We will define a new indicator variable d_t for each time lag of interest as follows:

When y_t = 0: d_t=1.

When y_t > 0: d_t=0.

Let’s make these changes to the data frame.

Define a function that will set the value of the indicator variable d_t as we have defined above:

```def indicator_func(x):
if x == 0:
return 1
else:
return 0
```

And use this function to create a new indicator variable column:

```strikes_data['d'] = strikes_data['strikes'].apply(indicator_func)
```

Let’s also create a new column strikes_adj that is set to 1 if strikes is 0, else to the value of strikes:

```strikes_data['strikes_adj'] = np.maximum(1, strikes_data['strikes'])
```

We will now create time lagged copies of strikes_adj and d at lags 1, 2 and 3:

```strikes_data['ln_strikes_adj_lag1'] = strikes_data['strikes_adj'].shift(1)

strikes_data['d_lag1'] = strikes_data['d'].shift(1)
strikes_data['d_lag2'] = strikes_data['d'].shift(2)
strikes_data['d_lag3'] = strikes_data['d'].shift(3)
```

Remove all rows with empty cells that were created by the shift operator:

```strikes_data = strikes_data.dropna()
```

Finally, take the natural log of ln_strikes_adj_lag1. Recollect that we want to add the natural log of lagged variables y_(t_1), y_(t_2) and y_(t_3).

Let’s see how our data frame looks like now:

```print(strikes_data)
```

Let’s once again split the data frame into training and test data sets:

```strikes_data_train=strikes_data.query('time<=92')
strikes_data_test=strikes_data.query('time>92').reset_index().drop('index', axis=1)
```

Our regression expression also needs to be updated to include the lagged variables:

```expr = 'strikes ~ output + ln_strikes_adj_lag1 + ln_strikes_adj_lag2 + ln_strikes_adj_lag3 + d_lag1 + d_lag2 + d_lag3'
```

Use Patsy to carve out the y and X matrices:

```y_train, X_train = dmatrices(expr, strikes_data_train, return_type='dataframe')
print(y_train)
print(X_train)
```

Finally, we’ll build and fit the regression model on (y_train, X_train). This time, we’ll use a straight-up Poisson regression model:

```poisson_model = dm.Poisson(endog=y_train, exog=X_train)

poisson_model_results = poisson_model.fit(maxiter=100)

print(poisson_model_results.summary())
```

We see the following results: The training summary of the Poisson regression model with lagged output variables

### Goodness of fit of the Auto-regressive Poisson model

The very first thing to note is that the goodness of fit as measured by Pseudo-R-squared has improved over the earlier NB2 model from 0.9% to 15.69%. That’s a big improvement. This time, the p-value of the LLR test is also vanishingly small at 1.295e-15. That means that we can say with near 100% confidence that the lagged variable Poisson model is better than the Intercept Only model. Recollect that earlier we could say that with at only a 95% confidence level.

Let’s look at the auto-correlation plot of residuals of this lagged variable Poisson model:

```tsa.plot_acf(poisson_model_results.resid, alpha=0.05)
plt.show()
```

We see the following plot: The auto-correlation plot of the lagged variable Poisson model’s residual errors (Image by Author)

Except for a very mildly significant correlation at LAG 13, the correlation of residuals with all other lags are well within the stated alpha bounds.

Our strategy of adding lagged copies of strikes to the regression variables of the Poisson model seems to have explained away much of auto-correlation in the strikes variable.

### Significance of regression variables

Finally, let us note from the training summary of the lagged variable Poisson model that while the coefficients of output, ln_strikes_adj_lag1 and ln_strikes_adj_lag2 are significant at 95% confidence level, the coefficient of the third lag ln_strikes_adj_lag3 is significant at only around 75% confidence level as indicated by its p-value of 0.237. Moreover, all three lagged indicator variables d_lag1, d_lag2 and d_lag3 are found to be not statistically significant at a 95% confidence level.

### Prediction

Let’s use the fitted lagged variable Poisson model to predict the count of strikes on the test data set that we had set aside earlier. We shouldn’t get out hopes up too high on the quality of the predictions. Remember that although this model has fitted a lot better than the previous NB2 model, pseudo-R-squared is still only 16%.

We’ll use Patsy to carve out (y_test, X_test):

```y_test, X_test = dmatrices(expr, strikes_data_test, return_type='dataframe')
print(y_test)
print(X_test)
```

Make the predictions on X_test:

```poisson_predictions = poisson_model_results.predict(X_test)
```

Plot the predicted and actual values:

```predicted_counts=poisson_predictions

actual_counts = y_test['strikes']

fig = plt.figure()
fig.suptitle('Predicted versus actual strike counts')

predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')

plt.legend(handles=[predicted, actual])
plt.show()
```

We get the following plot: Predicted versus actual count of strikes using the lagged variable Poisson model (Image by Author)

## Next steps

We could try improving the goodness of fit of our lagged variable model by experimenting with the following modifications:

• Include the first three time lags of the output variable as regression variables in addition to output.
• Include both time-lagged values of the output variable and the strikes variable as regression variables.
• Instead of using a Poisson model, use a Negative Binomial model (using either the NB1 or NB2 variance function) and with the above kinds of lagged variables as regression variables.

As an aside, it would also be interesting to use the Generalized Linear Model framework provided by statsmodels to build and train the Poisson or the Negative Binomial models. See links below on how to build and train GLM models.

### Data Set

The STRIKES Data set. Source: R data sets

### Papers

Kennan J., The duration of contract strikes in U.S. manufacturing, Journal of Econometrics, Volume 28, Issue 1, 1985, Pages 5–28, ISSN 0304–4076, https://doi.org/10.1016/0304-4076(85)90064-8. PDF download link

Cameron C. A., Trivedi P. K., Regression-based tests for overdispersion in the Poisson model, Journal of Econometrics, Volume 46, Issue 3, 1990, Pages 347–364, ISSN 0304–4076, https://doi.org/10.1016/0304-4076(90)90014-K.

### Books

Cameron A. Colin, Trivedi Pravin K., Regression Analysis of Count Data, Econometric Society Monograph №30, Cambridge University Press, 1998. ISBN: 0521635675

McCullagh P., Nelder John A., Generalized Linear Models, 2nd Ed., CRC Press, 1989, ISBN 0412317605, 9780412317606

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.