And a tutorial on Poisson regression using Python
In this section, we’ll cover the following topics:
- Features of count based data: Count data sets are some of the most commonly occurring data in the world. We’ll look at what makes count based data different.
- Regression models for forecasting counts: We’ll look at Poisson regression model in detail. The Negative Binomial (NB) regression model is another commonly used model for count based data. I’ll cover that in a later section.
- Python tutorial on Poisson regression: I will take you through a step-by-step tutorial on how to create a Poisson regression model in Python using the GLM class of statsmodels, and how to train it on a real world data set.
What is count based data?
Count based data contains events that occur at a certain rate. The rate of occurrence may change over time or from one observation to next. Here are some examples of count based data:
- Number of vehicles crossing an intersection per hour,
- Number of people visiting a doctor’s office per month,
- Number of earth-like planets spotted per month.
A data set of counts has the following characteristics:
- Whole number data: The data consists of non-negative integers: [0… ∞] Regression techniques such as Ordinary Least Squares Regression may not be appropriate for modeling such data as OLSR works best on real numbers such as -656.0, -0.00000345, 13786.1 etc.
- Skewed Distribution: The data may contain a large number of data points for just a few values, thereby making the frequency distribution quite skewed. See for example above histogram.
- Sparsity: The data may reflect the occurrence of a rare event such as a gamma ray burst, thereby making the data sparse.
- Rate of occurrence: For the sake of creating a model, it can be assumed that there is a certain rate of occurrence of events λ that drives the generation of such data. The event rate may drift over time.
A real world data set of counts
The following table contains counts of bicyclists traveling over various NYC bridges. The counts were measured daily from 01 April 2017 to 31 October 2017.
Here is a time sequenced plot of the bicyclist counts on the Brooklyn bridge:
Regression models for counts
The Poisson regression model and the Negative Binomial regression model are two popular techniques for developing regression models for counts. Other possibilities are Ordered Logit, Ordered Probit and Nonlinear Least Squares models.
It’s good practice to start with the Poisson regression model and use it as the “control” for either more complex, or less constrained models. In their book Regression Analysis of Count Data, Cameron and Trivedi say the following:
“A sound practice is to estimate both Poisson and negative binomial models.”
In this section, we’ll use the Poisson regression model for regressing the bicyclist counts observed on the Brooklyn bridge, and in a following section, we’ll train the Negative Binomial model on the same data set.
Introducing the Poisson model
The Poisson distribution has the following Probability Mass Function.
The expected value (mean) for a Poisson distribution is λ. Thus in the absence of other information, one should expect to see λ events in any unit time interval such as 1 hour, 1 day, etc. For any interval t, one would expect to see λt events.
A Poisson regression model for constant λ
If the event rate λ is constant, one can simply use a modified Mean Model for predicting future counts of events. In this case, one would set all predicted values of counts to this constant value λ.
The following figure illustrates the constant λ scenario:
The following Python code was used to generate the blue dots (actual counts in the past time steps) using a Poisson process with λ=5. The orange dots (predictions) are all set to the same value 5:
A Poisson regression model for a non-constant λ
Now we get to the fun part. Let us examine a more common situation, one where λ can change from one observation to the next. In this case, we assume that the value of λ is influenced by a vector of explanatory variables, also known as predictors, regression variables, or regressors. We’ll call this matrix of regression variables, X.
The job of the regression model is to fit the observed counts y to the matrix of regression values X .
In the NYC bicyclist counts data set, the regression variables are Date, Day of Week, High Temp, Low Temp and Precipitation. We can also introduce additional regressors such as Month and Day of Month that are derived from Date, and we have the liberty to drop existing regressors such as Date.
The fitting of y to X happens by fixing the values of a vector of regression coefficients β.
In a Poisson Regression model, the event counts y are assumed to be Poisson distributed, which means the probability of observing y is a function of the event rate vector λ.
The job of the Poisson Regression model is to fit the observed counts y to the regression matrix X via a link-function that expresses the rate vector λ as a function of, 1) the regression coefficients β and 2) the regression matrix X.
The following figure illustrates the structure of the Poisson regression model.
What might be a good link function f(.) connecting λ with X? It turns out the following exponential link-function works great:
This link function keeps λ non-negative even when the regressors X or the regression coefficients β have negative values. This is a requirement for count based data.
In general, we have:
A Formal Specification of the Poisson Regression Model
The complete specification of the Poisson regression model for count based data is given as follows:
For the ith observation in the data set denoted by y_i corresponding to the row of regression variables x_i, the probability of observing the count y_i is Poisson distributed as per the following PMF:
Where the mean rate λ_i for the ith sample is given by the exponential link function shown earlier. We reproduce it here:
Once the model is fully trained on the data set, the regression coefficients β are known, and the model is ready to make predictions. To predict the event count y_p corresponding to an input row of regressors x_p that one has observed, one uses this formula:
All of this hinges on our ability to train the model successfully so that the regression coefficients vector β is known.
Let’s look at how this training takes place.
Training the Poisson regression model
Training a Poisson regression model involves finding the values of the regression coefficients β which would make the vector of observed counts y most likely.
The technique for identifying the coefficients β is called Maximum Likelihood Estimation (MLE).
Let’s get acquainted with the technique of MLE.
Understanding Maximum Likelihood Estimation (MLE)
I’ll illustrate the MLE technique using the bicyclist counts data set. Take a look at the first few rows of this data set:
Our assumption is that the bicyclist counts shown in the red box arise from a Poisson process. Hence we can say that their probabilities of occurrence is given by the Poisson PMF. Here are the probabilities for the first 4 occurrences:
We can similarly calculate the probabilities for all n counts observed in the training set.
Note that in the above formulae, λ_1, λ_2, λ_3,…,λ_n are calculated using the link function as follows:
Where x_1, x_2, x_3, x_4 are the first 4 rows of the regression matrix.
The probability of occurrence of the entire set of n counts y_1, y_2,…,y_n in the training set is the joint probability of occurrence of the individual counts.
The counts y are Poisson distributed, y_1, y_2,…,y_n are independent random variables, given correspondingly x_1, x_2,…,x_n. Hence the joint probability of occurrence of y_1, y_2,…,y_n can be expressed as a simple multiplication of the individual probabilities. Here is how the joint probability looks like for the entire training set:
Let’s recollect that λ_1, λ_2, λ_3,…,λ_n are linked to the regression vectors x_1, x_2,x_3,…,x_n via the regression coefficients β.
What value of β will make the given set of observed counts y most likely? It is the value of β for which the joint probability shown in the above equation achieves the maximum value. In other words, it is the value of β for which the rate of change of the joint probability function w.r.t. β is 0. Put another way, it is the solution of the equation obtained from differentiating the joint probability equation w.r.t. β and setting this differential equation to 0.
It is easier to differentiate the logarithm of the joint probability equation than the original equation. The solution to the logged equation yields the same optimal value of β.
This logarithmic equation is called the log-likelihood function. For the Poisson regression, the log-likelihood function is given by the following equation:
The above equation is obtained by taking the natural logarithm of both sides of the joint probability function shown earlier, after substituting the λ_i with exp(x_i*β).
As mentioned earlier, we differentiate this log-likelihood equation w.r.t. β, and set it to zero. This operation gives us the following equation:
Solving this equation for the regression coefficients β will yield the Maximum Likelihood Estimate (MLE) for β.
To solve the above equation one uses an iterative method such as Iteratively Reweighted Least Squares (IRLS). In practice, one does not solve this equation by hand. Instead, you use statistical software such as the Python statsmodels package which will do all the calculations for you while training the Poisson regression model on your data set.
Summary of steps for performing Poisson Regression
In summary, here are the steps for performing a Poisson Regression on a count based data set:
- First, make sure that your data set contains counts. One way to tell is that it contains only non-negative integer values that represent the number of occurrences of some event during some interval. In the bicyclist counts data set, it is the count of bicyclists crossing the Brooklyn bridge per day.
- Find out (or guess) the regression variables that will influence the observed counts. In the bicyclist counts data set the regression variables are Day of Week, Min Temp, Max Temp, Precipitation etc.
- Carve out a training data set that your regression model will train on, and a test data set that should keep aside. Do not train the model on the test data.
- Use a suitable statistical software such as the Python statsmodels package to configure and fit the Poisson Regression model on the training data set.
- Test the performance of the model by running it on the test data set so as to generate predicted counts. Compare them with the actual counts in the test data set.
- Use a goodness-of-fit measure to determine how well your model has trained on the training data set.
How to train a Poisson Regression Model in Python
Let’s put into practice what we have learnt. The Python statsmodels package has excellent support for doing Poisson regression.
Let’s use the Brooklyn bridge bicyclist counts data set. You can pick up the data set from here.
Our goal is to build a Poisson regression model for the observed bicyclist counts y. We will use the trained model to predict daily counts of bicyclists on the Brooklyn bridge that the model has not seen during training.
We’ll start by importing all the required packages.
import pandas as pd from patsy import dmatrices import numpy as np import statsmodels.api as sm import matplotlib.pyplot as plt
Create a pandas DataFrame for the counts data set.
df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=, index_col=)
We’ll add a few derived regression variables to the X matrix.
ds = df.index.to_series() df['MONTH'] = ds.dt.month df['DAY_OF_WEEK'] = ds.dt.dayofweek df['DAY'] = ds.dt.day
We will not use the Date variable as a regressor since it contains an absolute date value but we don’t need to do anything special to drop Date as it is already consumed as the index of the pandas DataFrame. So it will not be available to us in the X matrix.
Let’s create the training and testing data sets.
mask = np.random.rand(len(df)) < 0.8 df_train = df[mask] df_test = df[~mask] print('Training data set length='+str(len(df_train))) print('Testing data set length='+str(len(df_test)))
Setup the regression expression in patsy notation. We are telling patsy that BB_COUNT is our dependent variable and it depends on the regression variables: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP.
expr = """BB_COUNT ~ DAY + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP"""
Set up the X and y matrices for the training and testing data sets. patsy makes this really simple.
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe') y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
Using the statsmodels GLM class, train the Poisson regression model on the training data set.
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
Print the training summary.
This prints out the following:
So how well did our model do? Let’s make some predictions on the test data set.
poisson_predictions = poisson_training_results.get_prediction(X_test) #summary_frame() returns a pandas DataFrame predictions_summary_frame = poisson_predictions.summary_frame() print(predictions_summary_frame)
Here are the first few rows of the output:
Let’s plot the predicted counts versus the actual counts for the test data.
predicted_counts=predictions_summary_frame['mean'] actual_counts = y_test['BB_COUNT'] fig = plt.figure() fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge') predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts') actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts') plt.legend(handles=[predicted, actual]) plt.show()
Here’s the output:
The model seems to be more or less tracking the trend in the actual counts although in many cases its predictions are way off the actual value.
Let’s also plot Actual versus Predicted counts.
plt.clf() fig = plt.figure() fig.suptitle('Scatter plot of Actual versus Predicted counts') plt.scatter(x=predicted_counts, y=actual_counts, marker='.') plt.xlabel('Predicted counts') plt.ylabel('Actual counts') plt.show()
Here is the plot:
Here is the complete source code for doing Poisson regression using Python:
Goodness-of-fit of the Poisson regression model
Recollect that both the expected value (i.e. mean) and the variance, of the Poisson distribution is λ. This rather strict condition is violated by most real-world data.
A common source of failure of the Poisson regression model is that the data does not satisfy the mean = variance criterion imposed by the Poisson distribution.
The summary() method on the statsmodels GLMResults class shows a couple of useful goodness-of-fit statistics to help you evaluate whether your Poisson regression model was able to successfully fit the training data. Let’s look at their values:
The reported values of Deviance and Pearson chi-squared are very large. A good fit is virtually impossible given these values. To make a quantitative determination of the goodness-of-fit at some confidence level, say 95% (p=0.05), we look up the value in the χ2 table for p=0.05 and Degrees of freedom of residuals=163. (DF Residuals = No. Observations minus DF model]). We compare this Chi-Squared value with the observed statistic, in this case, the Deviance or the Pearson’s chi-squared value reported in GLMResults. We find that at p=0.05 and DF Residuals = 163, the chi-squared value from a standard Chi-Squared table is 193.791 which is much smaller than the reported statistic of 23030 and 23300. Hence as per this test, the Poisson regression model, in spite of demonstrating an ‘okay’ visual fit for the test data set, has fit the training data rather poorly.
Conclusion and next steps
For counts based data, a useful starting point is the Poisson regression model. One can then compare its performance with other popular counts based models, such as:
- The The Negative Binomial regression model which does not make the mean = variance assumption about the data.
- Generalized Poisson regression models which also work well with over-dispersed or under-dispersed data sets.
- A Zero Inflated Poisson model if you suspect that your data contains excess zeros i.e. many more zeroes than what the regular Poisson model can explain
Citations and Copyrights
Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.