###### A Poisson regression model for auto-correlated time series data

Poisson and Poisson-like regression models are often used for counts based data sets, namely data that contain whole numbered counts. For example, the number of people walking into the emergency room of a hospital every hour is one such data set.

Ordinary Least Squares Linear Regression models or non-linear models such as those based on Neural Nets based regression techniques don’t work well for such data sets because they can predict negative values.

If the data set is a **time series of counts**, additional modeling complications arise because time series data are often **auto-correlated**. Previous counts influence the values of future counts. If the regression model is unable to adequately capture the ‘information’ contained within these correlations, the ‘unexplained’ information will leak into the model’s residual errors in the form of auto-correlated errors. The **Goodness-of-fit** of the model will be poor in this case.

Common remedies to address this issue are as follows:

- Before fitting a regression model, check if the time series exhibits seasonality and if it does,
**perform seasonality adjustment**. Doing so, explains away the seasonal auto- correlations if any. **Perform a first difference of the time series**i.e.*y_t — y_(t-1)*for all*t*and**perform a white-noise test**on the differenced time series. If the differenced time series can be shown to be white noise, then the original time series is a Random Walk. In that case, no further modeling is needed.- Fit a Poisson (or a related) counts based regression model on the seasonally adjusted time series but
*include lagged copies of the dependent**y**variable as regression variables*.

We’ll explain how to fit a Poisson or Poisson-like model on a time series of counts using **approach (3)**.

## The MANUFACTURING STRIKES data set

To illustrate the model fitting procedure, we will use the following open source data set that is widely used in regression modeling literature:

The data set is a monthly time series showing the relationship between U.S. manufacturing activity measured as a departure from the trend line, and the number of contract strikes in U.S. manufacturing industries beginning each month from 1968 through 1976.

This data set is available in R and it can be fetched using the statsmodels Datasets package.

The dependent variable ** y** is

*strikes*.

We’ll start by importing all the required packages:

```
import statsmodels.api as sm
import statsmodels.discrete.discrete_model as dm
import numpy as np
from patsy import dmatrices
import statsmodels.graphics.tsaplots as tsa
from matplotlib import pyplot as plt
```

Let’s load the data set into memory using statsmodels:

```
strikes_dataset = sm.datasets.get_rdataset(dataname='StrikeNb', package='Ecdat')
```

Print out the data set:

```
print(strikes_dataset.data)
```

We see the following output:

We’ll consider the first 92 data points as the training set and the remaining 16 data points as the test data set.

```
strikes_data = strikes_dataset.data.copy()
strikes_data_train = strikes_data.query('time<=92')
strikes_data_test = strikes_data.query('time>92').reset_index().drop('index', axis=1)
```

Print out the stats for the dependent variable:

```
print('Mean='+str(np.mean(strikes_data_train['strikes'])) + ' Variance='+str(np.var(strikes_data_train['strikes'])))
```

We get the following output:

Mean=5.5 Variance=14.728260869565217

We see that ** y** is over-dispersed so it violates the

*mean=variance*assumption of the Poisson model. To account for the over-dispersion, we’ll fit a

**Negative Binomial regression model**having the following NB2 Variance function:

Here is our regression expression. *strikes *is the dependent variable and *output* is our explanatory variable. The intercept of regression is assumed to be present:

```
expr = 'strikes ~ output'
```

We’ll use Patsy to carve out the ** X **and

**matrices. Patsy will automatically add a regression intercept column to**

*y***:**

*X*```
y_train, X_train = dmatrices(expr, strikes_data_train, return_type='dataframe')
print(y_train)
print(X_train)
y_test, X_test = dmatrices(expr, strikes_data_test, return_type='dataframe')
print(y_test)
print(X_test)
```

Build and train the Negative Binomial regression model with the NB2 variance function:

```
nb2_model = dm.NegativeBinomial(endog=y_train, exog=X_train, loglike_method='nb2')
nb2_model_results = nb2_model.fit(maxiter=100)
print(nb2_model_results.summary())
```

We get the following output of the fitted model’s summary:

Both the output and the dispersion parameter alpha are statistically significant at 95% confidence level as indicated by their p-values (0.034 and 0.000) of the coefficients.

### Goodness of fit of the NB model

The **Pseudo R-squared** is only 0.9% indicating a **very poor fit quality** on the training data set.

The *p-value* of the Log-Likelihood Ratio test is *0.03589 *indicating that the model is doing better than the Intercept Only Model (a.k.a. Null Model) at a *95% *confidence level, but not at a *99% *or higher confidence level.

Let’s look at the Auto-correlation plot of the fitted model’s residual errors:

We can see that the residual errors are auto-correlated at time lags 1, 2 and 3 indicating that there is auto-correlation in the dependent variable *strikes* that the NB2 model wasn’t able to fully explain causing it to leak into the model’s residuals.

Overall, the goodness of fit of this model is very poor.

## Building an Auto-regressive Poisson model

To fix the situation with auto-correlated residuals, we will introduce lagged copies of ** y**, specifically

*y_(t-1), y_(t-2)*and

*y_(t-3)*as regression variables along with the

*output*variable.

But instead of directly introducing *y_(t-k)* as a regression variables, we’ll use *ln[y_(t-k)]* so as to solve the ‘model explosion’ problem when the coefficient of *y_(t-k)* is positive.

But using the *ln()* transform gives rise to the question about how to deal with zero values of *y_t* for which the logarithm is undefined.

We solve this problem using the following trick outlined by Cameron and Trivedi in their book *Regression Analysis of Count Data* *(See Section 7.5: Auto-regressive models)*:

We will define a new indicator variable *d_t* for each time lag of interest as follows:

When *y_t = 0: d_t=1.*

When *y_t > 0:* *d_t=0*.

Let’s make these changes to the data frame.

Define a function that will set the value of the indicator variable *d_t *as we have defined above:

```
def indicator_func(x):
if x == 0:
return 1
else:
return 0
```

And use this function to create a new indicator variable column:

```
strikes_data['d'] = strikes_data['strikes'].apply(indicator_func)
```

Let’s also create a new column *strikes_adj* that is set to 1 if *strikes* is 0, else to the value of *strikes*:

```
strikes_data['strikes_adj'] = np.maximum(1, strikes_data['strikes'])
```

We will now create time lagged copies of *strikes_adj* and *d* at lags 1, 2 and 3:

```
strikes_data['ln_strikes_adj_lag1'] = strikes_data['strikes_adj'].shift(1)
strikes_data['ln_strikes_adj_lag2'] = strikes_data['strikes_adj'].shift(2)
strikes_data['ln_strikes_adj_lag3'] = strikes_data['strikes_adj'].shift(3)
strikes_data['d_lag1'] = strikes_data['d'].shift(1)
strikes_data['d_lag2'] = strikes_data['d'].shift(2)
strikes_data['d_lag3'] = strikes_data['d'].shift(3)
```

Remove all rows with empty cells that were created by the shift operator:

```
strikes_data = strikes_data.dropna()
```

Finally, take the natural log of *ln_strikes_adj_lag1*. Recollect that we want to add the natural log of lagged variables *y_(t_1)*, *y_(t_2) *and *y_(t_3)*.

Let’s see how our data frame looks like now:

```
print(strikes_data)
```

Let’s once again split the data frame into training and test data sets:

```
strikes_data_train=strikes_data.query('time<=92')
strikes_data_test=strikes_data.query('time>92').reset_index().drop('index', axis=1)
```

Our regression expression also needs to be updated to include the lagged variables:

```
expr = 'strikes ~ output + ln_strikes_adj_lag1 + ln_strikes_adj_lag2 + ln_strikes_adj_lag3 + d_lag1 + d_lag2 + d_lag3'
```

Use Patsy to carve out the *y** *and *X** *matrices:

```
y_train, X_train = dmatrices(expr, strikes_data_train, return_type='dataframe')
print(y_train)
print(X_train)
```

Finally, we’ll build and fit the regression model on *(**y_train**, **X_train**)*. This time, we’ll use a straight-up Poisson regression model:

```
poisson_model = dm.Poisson(endog=y_train, exog=X_train)
poisson_model_results = poisson_model.fit(maxiter=100)
print(poisson_model_results.summary())
```

We see the following results:

### Goodness of fit of the Auto-regressive Poisson model

The very first thing to note is that the goodness of fit as measured by Pseudo-R-squared has improved over the earlier NB2 model from *0.9% *to *15.69%*. That’s a big improvement. This time, the* p-value* of the LLR test is also vanishingly small at *1.295e-15*. That means that we can say with near *100%* confidence that the lagged variable Poisson model is better than the Intercept Only model. Recollect that earlier we could say that with at only a *95% *confidence level.

Let’s look at the auto-correlation plot of residuals of this lagged variable Poisson model:

```
tsa.plot_acf(poisson_model_results.resid, alpha=0.05)
plt.show()
```

We see the following plot:

Except for a very mildly significant correlation at LAG 13, the correlation of residuals with all other lags are well within the stated alpha bounds.

Our strategy of adding lagged copies of *strikes *to the regression variables of the Poisson model seems to have explained away much of auto-correlation in the *strikes* variable.

### Significance of regression variables

Finally, let us note from the training summary of the lagged variable Poisson model that while the coefficients of *output*, *ln_strikes_adj_lag1* and *ln_strikes_adj_lag2* are significant at *95% c*onfidence level, the coefficient of the third lag *ln_strikes_adj_lag3* is significant at only around *75% *confidence level as indicated by its *p-value* of *0.237*. Moreover, all three lagged indicator variables *d_lag1*, *d_lag2* and *d_lag3* are found to be not statistically significant at a *95% *confidence level.

### Prediction

Let’s use the fitted lagged variable Poisson model to predict the count of strikes on the test data set that we had set aside earlier. We shouldn’t get out hopes up too high on the quality of the predictions. Remember that although this model has fitted a lot better than the previous NB2 model, pseudo-R-squared is still only 16%.

We’ll use Patsy to carve out (** y_test**,

**):**

*X_test*```
y_test, X_test = dmatrices(expr, strikes_data_test, return_type='dataframe')
print(y_test)
print(X_test)
```

Make the predictions on ** X_test**:

```
poisson_predictions = poisson_model_results.predict(X_test)
```

Plot the predicted and actual values:

```
predicted_counts=poisson_predictions
actual_counts = y_test['strikes']
fig = plt.figure()
fig.suptitle('Predicted versus actual strike counts')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')
plt.legend(handles=[predicted, actual])
plt.show()
```

We get the following plot:

## Next steps

We could try improving the goodness of fit of our lagged variable model by experimenting with the following modifications:

- Include the first three time lags of the
*output*variable as regression variables in addition to*output*. - Include both time-lagged values of the
*output*variable and the*strikes*variable as regression variables. - Instead of using a Poisson model, use a Negative Binomial model (using either the NB1 or NB2 variance function) and with the above kinds of lagged variables as regression variables.

As an aside, it would also be interesting to use the Generalized Linear Model framework provided by statsmodels to build and train the Poisson or the Negative Binomial models. See links below on how to build and train GLM models.

## Citations and Copyrights

### Data Set

The STRIKES Data set. Source: R data sets

### Papers

Kennan J., The duration of contract strikes in U.S. manufacturing*, Journal of Econometrics*, Volume 28, Issue 1, 1985, Pages 5–28, ISSN 0304–4076, https://doi.org/10.1016/0304-4076(85)90064-8. **PDF download link**

Cameron C. A., Trivedi P. K., Regression-based tests for overdispersion in the Poisson model, *Journal of Econometrics*, Volume 46, Issue 3, 1990, Pages 347–364, ISSN 0304–4076, https://doi.org/10.1016/0304-4076(90)90014-K.

### Books

Cameron A. Colin, Trivedi Pravin K., Regression Analysis of Count Data, *Econometric Society Monograph №30*, Cambridge University Press, 1998. ISBN: 0521635675

McCullagh P., Nelder John A., Generalized Linear Models, 2nd Ed., *CRC Press*, 1989, ISBN 0412317605, 9780412317606

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **The Markov Switching Dynamic Regression Model

**NEXT: **The Poisson INAR(1) Regression Model