And a step by step tutorial for doing Generalized Poisson regression in Python
The regular Poisson Regression model is often a first-choice model for counts based datasets. The primary assumption of the Poisson Regression model is that the variance in the counts is the same as their mean value, namely, the data is equi-dispersed. Unfortunately, real world data is seldom equi-dispersed, which drives statisticians to other models for counts such as:
- The Negative Binomial (NB) Regression Model and,
- The Generalized Poisson Regression model which is the family of models that we will cover in this section.
Both these models do not make the variance = mean assumption about the data set. This makes them a better practical choice for modeling counts based data. In a previous section, I covered the NB model in detail. In this section we’ll cover Generalized Poisson Regression models.
We’ll cover the following topics:
- What are counts based data sets?
- What is equi-dispersion, under-dispersion, over-dispersion? We’ll cover the limitation of the Poisson model for under-dispersed and over-dispersed data sets.
- Introduction to the Consul’s Generalized Poisson Regression (GP-1) model and Famoye’s Restricted Generalized Poisson Regression (GP-2) model.
- A Python based tutorial for building and training GP-1 and GP-2 models, and comparison of their performance with the standard Poisson Regression Model.
What is counts based data?
Counts based data sets are ones in which the dependent variable y represents the number of occurrences of some event. Here are some examples:
- Number of exoplanets discovered per month.
- Number of sun-spots over the years.
- Number of people walking into an Emergency Room per hour.
In each one of the above datasets, the dependent variable y, represents the observed counts, while the choice of X, the matrix of explanatory variables, i.e. the set of variables that are thought to explain the variance in y is (sadly) left mostly to the judgment of the statistical modeller.
Equi-dispersion, under-dispersion, over-dispersion and the limitations of the Poisson model
One can start by trying to model the dependent counts variable as a Poisson process. Unfortunately, in many real-world datasets, the Poisson process is unable to satisfactorily explain the variability in the observed counts. This is largely due to the variance = mean assumption that the Poisson regression model makes about data. In other words, the Poisson model incorrectly assumes that the counts are equi-dispersed.
The Poisson model cannot get away from making this assumption because the model is based on the assumption that the dependent variable y is a random variable that follows the Poisson probability distribution, and it can be shown that the variance of a Poisson distributed random variable equals its mean value.
The following formula represents the probability distribution function (also know the Probability Mass Function) of a Poisson distributed random variable. It can be shown that:
Variance(X) = mean(X) = λ, the number of events occurring per unit time.
Here are a few interesting looking probability distribution plots generated by the above formula for different values of λ:
Incidentally, in regression modelling, it is uncommon to talk about unconditional variance and unconditional mean. Instead, one prefers talking about the variance as conditional upon the explanatory variables X taking on some specific value x_i for the ith observation.
The conditional variance can be denoted as Variance(y|X=x_i). Ditto for the conditional mean.
As mentioned earlier, many real world datasets are not equi-dispersed, so the conditional variance of y is not equal to the conditional mean of y. With some of them, the variance is greater than the mean, a phenomenon known as over-dispersion, while in others, variance is less than the mean a.k.a. under-dispersion.
When the variance in your dependent variable y is greater than what your model assumes it to be, then from the perspective of your model, your data set is over-dispersed.
When the variance in your dependent variable y is less than what your model assumes it to be, then from the perspective of your model, your data set is under-dispersed.
The effect of over (or under) dispersion is that your model will not be able to adequately explain the variability in the observed counts. Correspondingly, its predictions will be of a poor quality.
A common fix for this problem is to assume that the variance is some general function of the mean, i.e:
Variance = f(mean)
One of the commonly used forms for f(.) is as follows:
α is known as the dispersion parameter which represents the additional variability in y introduced by some unknown set of variables that are causing the overall variance in y to be different than what your regression model was expecting it to be.
Note that when α=0, variance = mean and you get the standard Poisson regression model.
When α>0, we consider two interesting cases, namely when p=1 and p=2. In both cases, we get what is known as the Negative Binomial (NB) regression model.
In the NB regression model, we assume that the observed counts y are a Poisson distributed random variable with event rate λ and λ itself is a Gamma distributed random variable.
The Negative Binomial regression model (often referred to as a Poisson-Gamma mixture) turns out to be an excellent model for many real-world counts datasets.
There are many other approaches for generalizing the Poisson Regression Model so as to increase its reach to over and under-dispersed data. In this section, we’ll cover two such models.
Consul’s Generalized Poisson Regression Model
In 1989, Prem C. Consul, in his book, Generalized Poisson Distributions: Properties and Applications, proposed a way to modify the probability distribution of the Poisson distribution so that it could handle both over dispersed and under dispersed data. This model came to be known as the GP-1 (Generalized Poisson-1) model. The GP-1 model assumes that the dependent variable y is a random variable with the following probability distribution:
If you set the dispersion parameter α to 0 in the above equations, the PMF, mean and variance of GP-1 reduce to essentially those of the standard Poisson distribution.
Famoye’s Restricted Generalized Poisson Distribution
In 1993, Felix Famoye introduced what he referred to as the Restricted Generalized Poisson Regression Model, as a way to extend the reach of the standard Poisson model to handling over-dispersed and under-dispersed data sets. This model has come to be known as the GP-2 (Generalized Poisson-2) model.
The GP-2 model assumes that the dependent variable y is a random variable with the following probability distribution:
As before, when the dispersion parameter α is set to 0, the PMF, mean and variance functions of GP-2 reduce to essentially those of the standard Poisson distribution.
The dispersion parameter α is estimated using the following formula:
In the statsmodels library, α is estimated for you as part of the model fitting process of GP-1 and GP-2 models.
Implementation of GP-1 and GP-2 in Python and Statsmodels
The statsmodels library contains an implementation of both GP-1 and GP-2 models via the
In this section, we’ll show how to use GP-1 and GP-2 for modeling the following real world data set of counts.
A real world data set of counts
The following table contains counts of bicyclists traveling across various NYC bridges. The counts were measured daily from 01 April 2017 to 31 October 2017.
Here is the link to the raw data set.
We will focus our analysis on the number of bicyclists crossing the Brooklyn bridge every day. Here is a time sequenced plot of the bicyclist counts seen on the Brooklyn bridge. Here is the link to the bicyclist counts data for the Brooklyn bridge that we will use.
Our regression goal
Our regression goal is to predict the number of bicyclists crossing the Brooklyn bridge on any given day.
Our regression strategy
We will use a set of regression variables from the data set, namely, Day, Day of the Week(Derived from Date), Month(Derived from Date), High Temp, Low Temp and Precipitation to ‘explain’ the variance in the observed counts on the Brooklyn Bridge.
The training algorithm of our regression model will fit the observed counts y to the regression matrix X.
Once the model is trained, we’ll test its performance on a hold out test data set that the model has not seen at all during training.
Regression Strategy — Step by Step
- We’ll first train the standard Poisson regression model on this data set. The standard Poisson model will also serve as the ‘control’ model for testing the efficacy of the generalized Poisson models.
- If the variance in the data set is approximately same as the mean value, then there is nothing more to be done except inspect the goodness-of-fit of the Poisson model.
- If the variance turns out to greater or smaller than the Poisson mean, then we’ll sequentially train the GP-1 and GP-2 models on the data set and compare their performance with the Poisson model.
We’ll start by importing all the required packages.
import pandas as pd from patsy import dmatrices import numpy as np import statsmodels.api as sm import matplotlib.pyplot as plt
Create a pandas DataFrame for the counts data set. Here is the link to the data set.
df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=, index_col=)
We’ll add a few derived regression variables to the X matrix.
ds = df.index.to_series() df['MONTH'] = ds.dt.month df['DAY_OF_WEEK'] = ds.dt.dayofweek df['DAY'] = ds.dt.day
Let’s print out the first few rows of our data set to see how it looks like:
Let’s create the training and testing data sets.
mask = np.random.rand(len(df)) < 0.8 df_train = df[mask] df_test = df[~mask] print('Training data set length='+str(len(df_train))) print('Testing data set length='+str(len(df_test)))
Setup the regression expression in Patsy notation. We are telling patsy that BB_COUNT is our dependent variable y and it depends on the regression variables X: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP.
expr = 'BB_COUNT ~ DAY + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP'
Let’s use Patsy to carve out the X and y matrices for the training and testing data sets:
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe') y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
Using the statsmodels GLM class, train the Poisson regression model on the training data set.
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
Print the training summary.
This prints out the following:
I have highlighted the significant sections in the output.
We can see that all regression coefficients (know as the β vector in regression parlance) are statistically significant at the 95% confidence level since their p-value is < 0.05.
The Maximum Log-likelihood value (-11872) will be used later on while comparing the model’s performance with that of GP-1 and GP-2 models.
Let’s print out the variance and mean of the data set:
This prints out the following:
The variance is clearly much greater than the mean. The data is grossly over-dispersed and the primary assumption of the Poisson model does not hold.
So we’ll build and train GP-1 and GP-2 models next and see if they perform any better.
Statsmodels lets you do this in 3 lines of code!
Build Consul’s Generalized Poison regression model, know as GP-1:
gen_poisson_gp1 = sm.GeneralizedPoisson(y_train, X_train, p=1)
Fit (train) the model:
gen_poisson_gp1_results = gen_poisson_gp1.fit()
Print the results:
This prints out the following (I have highlighted a few interesting areas in the output):
Notice that except for the DAY variable’s coefficient, all other regression coefficients are statistically significant at the 95% confidence level, i.e. their p-value > 0.05. Recollect that in the Poisson model all regression coefficients were found to be statistically significant at the 95% confidence level.
Goodness-of-fit of the GP-1 model
The GP-1 model’s Maximum Likelihood Estimate is -1350.6 which is greater than that of the null-model’s MLE of -1475.9. The null model is a simple intercept-only model, i.e. a horizontal line passing through the y-intercept. But is the difference statistically significant? The Likelihood Ratio (LR) test’s p-value is shown to be 3.12e-51, an extremely tiny number. So yes, the GP-1 model does actually do a better job of modeling the data than a simple intercept-only model.
But does GP-1 do a better job than the regular Poisson model?
Recollect that the regular Poisson model’s Maximum Likelihood Estimate was -11872. Compare it with the MLE of GP-1 which is -1350.6. Clearly, GP-1 produces a superior goodness-of-fit (it has a much larger maximum likelihood estimate). If we wanted to, we could do an LR test using the two likelihood estimates and assess the significance using the Chi-squared test. In this case, that is not necessary since the MLE of GP-1 is much greater than that of the regular Poisson model.
Let’s predict the cyclist counts using GP-1 using the test data set which the model has not seen during training:
gen_poisson_gp1_predictions = gen_poisson_gp1_results.predict(X_test)
gen_poisson_gp1_predictions is a pandas Series object that contains the predicted bicyclist count for each row in the X_test matrix. Remember that y_test contains the actual observed counts.
Let’s plot the predicted and the actual counts to visually assess the quality of the predictions:
predicted_counts=gen_poisson_gp1_predictions actual_counts = y_test['BB_COUNT'] fig = plt.figure() fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge') predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts') actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts') plt.legend(handles=[predicted, actual]) plt.show()
We get the following plot of predicted versus actual bicyclist counts:
As you can see, except for a few counts, the GP-1 model has done a reasonably good job of predicting the bicyclist counts.
If needed, we can compare the prediction quality of GP-1 with regular Poisson, by comparing the Root Mean Square Error (RMSE) of GP-1’s predictions with that of the regular Poisson model for the same test data set. I’ll leave that as an exercise.
Finally, let’s also try out the Famoye’s Restricted Generalized Poisson regression model, known as GP-2:
We’ll use the same 3-step approach for building and training the model. Note the parameter p=2:
#Build Famoye's Restricted Generalized Poison regression model, know as GP-2 gen_poisson_gp2 = sm.GeneralizedPoisson(y_train, X_train, p=2) #Fit the model gen_poisson_gp2_results = gen_poisson_gp2.fit() #print the results print(gen_poisson_gp2_results.summary())
We see the following training output:
The way to analyze and compare GP-2’s goodness-of-fit and prediction quality is the same as with GP-1. However note that in this case, the model’s training algorithm was not able to converge. We could try to fix that problem by setting a higher iteration count in the iter parameter of the fit() method.
Here is the complete source code:
And here is the link to the data set which we have used.
- The standard Poisson model assumes that the variance of the counts based data is same as the mean value. This assumption is often violated by real world data sets which are either over-dispersed or under-dispersed.
- Hence we need to use other models for counts based data such as the Negative Binomial model or the Generalized Poisson Regression model which do not assume that the data is equi-dispersed. Such models assume that the variance is some function of the mean.
- The Consul’s Generalized Poisson Regression model (called GP-1) and the Famoye’s Restricted Generalized Poisson Regression model (GP-2) are two such GP models that can be used to model real-world counts based data sets.
- The Python library Statsmodels happens to have excellent support for building and training GP-1 and GP-2 models.
Citations and Copyrights
P.C. Consul & Felix Famoye (1992) Generalized poisson regression model, Communications in Statistics – Theory and Methods, 21:1, 89-109, DOI: 10.1080/03610929208830766
Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.
Consul, P.C. (1989), Generalized Poisson Distributions: Properties and Applications, New York, Marcel Dekker.
PREVIOUS: The Negative Binomial Regression Model