###### And a tutorial on Poisson regression using Python

In this section, we’ll cover the following topics:

**Features of count based data:**Count data sets are some of the most commonly occurring data in the world. We’ll look at what makes count based data different.**Regression models for forecasting counts:**We’ll look at**Poisson regression model**in detail. The**Negative Binomial (NB) regression model**is another commonly used model for count based data. I’ll cover that in a later section.**Python tutorial on Poisson regression:**I will take you through a step-by-step tutorial on how to create a Poisson regression model in Python using the**GLM class**of**statsmodels**, and how to train it on a real world data set.

## What is count based data?

Count based data contains events that occur at a certain rate. The rate of occurrence may change over time or from one observation to next. Here are some examples of count based data:

- Number of vehicles crossing an intersection per hour,
- Number of people visiting a doctor’s office per month,
- Number of earth-like planets spotted per month.

**A data set of counts has the following characteristics:**

**Whole number data:**The data consists of non-negative integers: [0… ∞] Regression techniques such as Ordinary Least Squares Regression may not be appropriate for modeling such data as OLSR works best on real numbers such as -656.0, -0.00000345, 13786.1 etc.**Skewed Distribution:**The data may contain a large number of data points for just a few values, thereby making the frequency distribution quite skewed. See for example above histogram.**Sparsity:**The data may reflect the occurrence of a rare event such as a gamma ray burst, thereby making the data sparse.**Rate of occurrence:**For the sake of creating a model, it can be assumed that there is a certain rate of occurrence of events**λ**that drives the generation of such data. The event rate may drift over time.

## A real world data set of counts

The following table contains counts of bicyclists traveling over various NYC bridges. The counts were measured daily from 01 April 2017 to 31 October 2017.

Here is a time sequenced plot of the bicyclist counts on the Brooklyn bridge:

## Regression models for counts

The **Poisson regression model** and the **Negative Binomial regression model** are two popular techniques for developing regression models for counts. Other possibilities are Ordered Logit, Ordered Probit and Nonlinear Least Squares models.

### Regression strategy

It’s good practice to start with the Poisson regression model and use it as the “control” for either more complex, or less constrained models. In their book Regression Analysis of Count Data, Cameron and Trivedi say the following:

“A sound practice is to estimate both Poisson and negative binomial models.”

In this section, we’ll use the Poisson regression model for regressing the bicyclist counts observed on the Brooklyn bridge, and in a following section, we’ll train the Negative Binomial model on the same data set.

## Introducing the Poisson model

The Poisson distribution has the following **P**robability **M**ass **F**unction.

The expected value (mean) for a Poisson distribution is ** λ. **Thus in the absence of other information, one should expect to see

**events in any unit time interval such as 1 hour, 1 day, etc. For any interval**

*λ***, one would expect to see**

*t***events**

*λt*

*.*## A Poisson regression mode*l for **constant λ*

*constant λ*

If the event rate *λ** *is constant, one can simply use a modified Mean Model for predicting future counts of events.** **In this case, one would set all predicted values of counts to this constant value ** λ**.

The following figure illustrates the constant ** λ** scenario:

The following Python code was used to generate the blue dots (actual counts in the past time steps) using a Poisson process with ** λ=5**. The orange dots (predictions) are all set to the same value

**5**:

## A Poisson regression model for a **non-constant λ**

Now we get to the fun part. Let us examine a more common situation, one where **λ **can change from one observation to the next. In this case, we assume that the value of **λ **is influenced by a vector of **explanatory variables**, **also known as predictors, regression variables**, or **regressors**. We’ll call this matrix of regression variables, **X**.

The job of the regression model is to fit the observed counts

yto the matrix of regression valuesX.

In the NYC bicyclist counts data set, the regression variables are *Date*, *Day of Week*, *High Temp*, *Low Temp* and *Precipitation*. We can also introduce additional regressors such as *Month* and *Day of Month* that are derived from *Date*, and we have the liberty to drop existing regressors such as *Date*.

The fitting of ** y** to

**X**happens by fixing the values of a vector of regression coefficients

**β.**

In a Poisson Regression model, the event counts ** y** are assumed to be Poisson distributed, which means the probability of observing

**is a function of the event rate vector**

*y***λ.**

The job of the Poisson Regression model is to fit the observed counts

yto the regression matrixXvia a link-function that expresses the rate vectorλas a function of, 1) the regression coefficientsβand 2) the regression matrixX.

The following figure illustrates the structure of the Poisson regression model.

What might be a good link function ** f**(.) connecting

**λ**with

**X**? It turns out the following exponential link-function works great:

This link function keeps **λ** non-negative even when the regressors **X** or the regression coefficients **β **have negative values. This is a requirement for count based data.

In general, we have:

## A Formal Specification of the Poisson Regression Model

The complete specification of the Poisson regression model for count based data is given as follows:

For the *ith* observation in the data set denoted by *y_i* corresponding to the row of regression variables ** x_i**, the probability of observing the count

*y_i*is Poisson distributed as per the following PMF:

Where the mean rate *λ_i* for the ith sample is given by the exponential link function shown earlier. We reproduce it here:

Once the model is fully trained on the data set, the regression coefficients ** β **are known, and the model is ready to make predictions. To predict the event count

*y_p*corresponding to an input row of regressors

**that one has observed, one uses this formula:**

*x_p*All of this hinges on our ability to train the model successfully so that the regression coefficients vector ** β** is known

*.*Let’s look at how this training takes place.

## Training the Poisson regression model

Training a Poisson regression model involves finding the values of the regression coefficients **β** which would make the vector of observed counts **y **most likely.

The technique for identifying the coefficients **β **is called **M**aximum **L**ikelihood **E**stimation (MLE).

Let’s get acquainted with the technique of **MLE.**

### Understanding Maximum Likelihood Estimation (MLE)

I’ll illustrate the **MLE** technique using the bicyclist counts data set. Take a look at the first few rows of this data set:

Our assumption is that the bicyclist counts shown in the red box arise from a Poisson process. Hence we can say that their probabilities of occurrence is given by the Poisson PMF. Here are the probabilities for the first 4 occurrences:

We can similarly calculate the probabilities for all ** n** counts observed in the training set.

Note that in the above formulae*, λ_1, λ_2, λ_3,…,λ_n *are calculated using the link function as follows:

Where ** x_1, x_2, x_3, x_4** are the first 4 rows of the regression matrix.

The probability of occurrence of the entire set of *n* counts *y_1, y_2,…,y_n *in the training set is the joint probability of occurrence of the individual counts.

The counts ** y** are Poisson distributed,

*y_1, y_2,…,y_n*are independent random variables, given correspondingly

**. Hence the joint probability of occurrence of**

*x_1, x_2,…,x_n**y_1, y_2,…,y_n*can be expressed as a simple multiplication of the individual probabilities. Here is how the joint probability looks like for the entire training set:

Let’s recollect that *λ_1, λ_2, λ_3,…,λ_n *are linked to the regression vectors ** x_1, x_2,x_3,…,x_n** via the regression coefficients

*β.*What value of ** β** will make the given set of observed counts

**most likely? It is the value of**

*y***for which the joint probability shown in the above equation achieves the maximum value. In other words, it is the value of**

*β***for which the rate of change of the joint probability function w.r.t.**

*β***is**

*β***0.**Put another way, it is the solution of the equation obtained from differentiating the joint probability equation w.r.t.

**and setting this differential equation to**

*β***0.**

It is easier to differentiate the **logarithm **of the joint probability equation than the original equation. The solution to the logged equation yields the same optimal value of ** β**.

This logarithmic equation is called the **log-likelihood function**. For the Poisson regression, the log-likelihood function is given by the following equation:

The above equation is obtained by taking the natural logarithm of both sides of the joint probability function shown earlier, after substituting the *λ_i *with *exp*(**x_i***** β)**.

As mentioned earlier, we differentiate this log-likelihood equation w.r.t. ** β**, and set it to zero. This operation gives us the following equation:

Solving this equation for the regression coefficients ** β **will yield the

**Maximum Likelihood Estimate**

**(MLE)**for

*β.*To solve the above equation one uses an iterative method such as Iteratively Reweighted Least Squares (IRLS). In practice, one does not solve this equation by hand. Instead, you use statistical software such as the Python **statsmodels **package which will do all the calculations for you while training the Poisson regression model on your data set.

### Summary of steps for performing Poisson Regression

In summary, here are the steps for performing a Poisson Regression on a count based data set:

- First, make sure that your data set contains counts. One way to tell is that it contains only non-negative integer values that represent the number of occurrences of some event during some interval. In the bicyclist counts data set, it is the count of bicyclists crossing the Brooklyn bridge per day.
- Find out (or guess) the regression variables that will influence the observed counts. In the bicyclist counts data set the regression variables are
*Day of Week, Min Temp, Max Temp, Precipitation*etc. - Carve out a training data set that your regression model will train on, and a test data set that should keep aside. Do
*not*train the model on the test data. - Use a suitable statistical software such as the
**Python statsmodels package**to configure and fit the Poisson Regression model on the training data set. - Test the performance of the model by running it on the test data set so as to generate predicted counts. Compare them with the actual counts in the test data set.
- Use a goodness-of-fit measure to determine how well your model has trained on the training data set.

## How to train a Poisson Regression Model in Python

Let’s put into practice what we have learnt. The Python **statsmodels **package has excellent support for doing Poisson regression.

Let’s use the Brooklyn bridge bicyclist counts data set. You can pick up the data set from **here****.**

Our goal is to build a Poisson regression model for the observed bicyclist counts ** y.** We will use the trained model to predict daily counts of bicyclists on the Brooklyn bridge that the model has not seen during training.

We’ll start by importing all the required packages.

```
import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
```

Create a pandas DataFrame for the counts data set.

```
df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])
```

We’ll add a few derived regression variables to the **X** matrix.

```
ds = df.index.to_series()
df['MONTH'] = ds.dt.month
df['DAY_OF_WEEK'] = ds.dt.dayofweek
df['DAY'] = ds.dt.day
```

We will not use the *Date* variable as a regressor since it contains an absolute date value but we don’t need to do anything special to drop *Date* as it is already consumed as the index of the pandas DataFrame. So it will not be available to us in the **X** matrix.

Let’s create the training and testing data sets.

```
mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))
```

Setup the regression expression in patsy notation. We are telling patsy that BB_COUNT is our dependent variable and it depends on the regression variables: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP.

```
expr = """BB_COUNT ~ DAY + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP"""
```

Set up the **X** and **y** matrices for the training and testing data sets. patsy makes this really simple.

```
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
```

Using the statsmodels GLM class, train the Poisson regression model on the training data set.

```
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
```

Print the training summary.

```
print(poisson_training_results.summary())
```

This prints out the following:

So how well did our model do? Let’s make some **predictions **on the test data set.

```
poisson_predictions = poisson_training_results.get_prediction(X_test)
#summary_frame() returns a pandas DataFrame
predictions_summary_frame = poisson_predictions.summary_frame()
print(predictions_summary_frame)
```

Here are the first few rows of the output:

Let’s plot the predicted counts versus the actual counts for the test data.

```
predicted_counts=predictions_summary_frame['mean']
actual_counts = y_test['BB_COUNT']
fig = plt.figure()
fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')
plt.legend(handles=[predicted, actual])
plt.show()
```

Here’s the output:

The model seems to be more or less tracking the trend in the actual counts although **in many cases its predictions are way off the actual value**.

Let’s also plot Actual versus Predicted counts.

```
plt.clf()
fig = plt.figure()
fig.suptitle('Scatter plot of Actual versus Predicted counts')
plt.scatter(x=predicted_counts, y=actual_counts, marker='.')
plt.xlabel('Predicted counts')
plt.ylabel('Actual counts')
plt.show()
```

Here is the plot:

**Here is the complete source code for doing Poisson regression using Python:**

## Goodness-of-fit of the Poisson regression model

Recollect that both the expected value (i.e. mean) and the variance, of the Poisson distribution is *λ. *This rather strict condition is violated by most real-world data.

A common source of failure of the Poisson regression model is that the data does not satisfy the

mean = variancecriterion imposed by the Poisson distribution.

The *summary()* method on the statsmodels *GLMResult*s class shows a couple of useful goodness-of-fit statistics to help you evaluate whether your Poisson regression model was able to successfully fit the training data. Let’s look at their values:

The reported values of Deviance and Pearson chi-squared are very large. A good fit is virtually impossible given these values. To make a quantitative determination of the goodness-of-fit at some confidence level, say 95% (p=0.05), we look up the value in the *χ2* table for p=0.05 and Degrees of freedom of residuals=163. (DF Residuals = No. Observations *minus *DF model]). We compare this Chi-Squared value with the observed statistic, in this case, the Deviance or the Pearson’s chi-squared value reported in GLMResults. We find that at p=0.05 and DF Residuals = 163, the chi-squared value from a standard Chi-Squared table is 193.791 which is much smaller than the reported statistic of 23030 and 23300. Hence as per this test, the Poisson regression model, in spite of demonstrating an ‘okay’ *visual *fit for the test data set, has fit the training data rather poorly.

## Conclusion and next steps

For counts based data, a useful starting point is the Poisson regression model. One can then compare its performance with other popular counts based models, such as:

- The
**The Negative Binomial regression model**which does not make the*mean = variance*assumption about the data. **Generalized Poisson regression models**which also work well with over-dispersed or under-dispersed data sets.- A
**Zero Inflated Poisson model**if you suspect that your data contains excess zeros i.e. many more zeroes than what the regular Poisson model can explain

Happy modeling!

#### Related

Getting to Know The Poisson Process And The Poisson Probability Distribution

## Citations and Copyrights

### Data set

Bicycle Counts for East River Bridges. Daily total of bike counts conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. From NYC Open Data under Terms of Use. **Curated data set for download.**

### Book

Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **Building Robust Linear Models For Nonlinear, Heteroscedastic Data

**NEXT: **The Negative Binomial Regression Model