###### Plus a Python tutorial on Negative Binomial regression

In this section, we’ll cover the following topics:

- We’ll get introduced to the
**Negative Binomial (NB) regression model.**An NB model can be incredibly useful for predicting count based data. - We’ll go through a step-by-step tutorial on how to create, train and test a Negative Binomial regression model in
**Python**using the**GLM class**of**statsmodels**.

## Motivation for using the Negative Binomial regression model

In the previous section, we got introduced to the Poisson regression model and we saw how to apply it to count based data, such as the data set of bicyclist counts on the Brooklyn bridge:

We also saw that the Poisson regression model proved to be inadequate for modeling our bicyclist data set.

Although the Poisson regression model made predictions that were visually satisfying…:

…its results were statistically unsatisfactory:

The low performance of the model was because the data did not obey the ** variance = mean** criterion required of it by the Poisson regression model.

This rather strict criterion is often not satisfied by real world data. Often, the variance is greater than the mean, a property called **over-dispersion, **and sometimes the variance is less than the mean, called **under-dispersion**. In such cases, one needs to use a regression model that will not make the ** equi-dispersion assumption **i.e.not

**assume that variance=mean.**

The **Negative Binomial (NB) **regression model is one such model that does not make the ** variance = mean **assumption about the data.

In the rest of the section, we’ll learn about the NB model and see how to use it on the bicyclist counts data set.

## Layout of the section

The section is laid out as follows:

- We’ll get introduced to a real world data set of counts which we’ll use in the rest of this section.
- We’ll define our regression goal on this data set.
- We’ll formulate the regression strategy using the NB model as our regression model.
- We’ll configure the NB model, train it on the data set, and make some predictions on the test data set. We’ll do all of this using the
library.*Python statsmodels* - Lastly, we’ll examine if the NB model’s performance is really superior to the Poisson model’s performance.

#### A real world data set of counts

The following table contains counts of bicyclists traveling across various NYC bridges. The counts were measured daily from 01 April 2017 to 31 October 2017.

We will focus our analysis on the number of bicyclists crossing the Brooklyn bridge everyday. Here is a time sequenced plot of the bicyclist counts seen on the Brooklyn bridge.

Here is the **link to the curated data set of bicyclist count**s over the Brooklyn Bridge.

## Our regression goal

Our regression goal is to predict the number of bicyclists crossing the Brooklyn bridge on any given day.

## Our regression strategy

Given the values of a set of regression variables for a given day, we will use the NB model to predict the bicyclist count on the Brooklyn bridge on that day.

We need to detail out this strategy, so let’s dig deeper. Let us start with defining some variables:

*y** = *the vector of **bicyclist counts** seen on days *1* through *n*.

Thus *y = **[y_1, y_2, y_3,…,y_n]. y_i* is the number of bicyclists on day

*i.*

** X** = the matrix of

**predictors**a.k.a.

**regressors**a.k.a

**explanatory variables**a.k.a.

**regression variables**. The size of matrix

**is a**

*X**(n x m)*since there are

*n*independent observations (rows) in the data set and each row contains values of

*m*explanatory variables.

**λ** = the vector of event rates. The vector **λ **is a primary characteristic of count based data sets. **λ **is a vector of size (n x 1). It contains *n* rates *[λ_0, λ_1, λ_2,…,λ_n]***, **corresponding to the *n* observed counts in the counts vector *y**. *The rate *λ_i* for observation *‘i’* is assumed to drive the actual observed count *y_i* in the counts vector ** y**. The

**column is not present in the input data. Instead,**

*λ***vector is a deduced variable that is calculated by the regression model during the training phase.**

*λ*For the bicyclist counts data, each one of the *λ_i *values is defined as the number of bicyclists crossing the bridge in ‘unit’ time on day *i*. Unit time can be 1 second, 1 hour, 1 day, 1 week — whatever unit time interval we want to measure the rate over. This rate *λ_i *is assumed to drive the observed count of bicyclists *y_i *on day *i*.

The following figure illustrates these definitions on a subset of our bicyclist counts data set:

The training algorithm of the Negative Binomial regression model will fit the observed counts y

to the regression matrix X.

Once the model is trained, we’ll test its performance on a hold out test data set that the model has not seen at all during training.

Recollect that the Negative Binomial regression model does not make the ** variance = mean** assumption that the Poisson regression model does.

Instead, the NB model requires us to define a new parameter *α* which it uses to express the variance in terms of the mean as follows*:*

In practice, this equation takes one of two commonly occurring forms:

**When p = 1:**

In the regression literature, the p=1 case has been referred to as the **NB1 model**. See *Cameron, A.C., and P.K. Trivedi (1986), “Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators*

**When p=2:**

The p=2 case is referred to as the **NB2 **model.

We will use the NB2 model.

The ** Python statsmodels** library also supports the NB2 model as part of the Generalized Linear Model class that it offers.

In fact, the `statsmodels.genmod.families.family`

** **package has a whole class devoted to the NB2 model:

```
class statsmodels.genmod.families.family.NegativeBinomial(link=None, alpha=1.0)
```

Note that the default value of *alpha=1* which this class assumes, is not always the right value for all data sets. So how can we determine the correct value of *α *for our bicyclist counts data set?

## Finding the correct value of *α*

Once again, Messrs Cameron and Trivedi come to our rescue. In their book, Regression Analysis of Count Data, Cameron and Trivedi suggest a clever means to calculate *α *using a technique they call *auxiliary OLS regression without a constant*. The regression equation that they recommend* *is as follows:

You can at once see the relationship of the aux OLS equation with the straight line regression equation: *Y** = **B_1*****X** + *** B_0**.

In case you are curious, the equation to estimate *α *for the NB1 model is as follows:

In the rest of this section, we’ll use the NB2 model.

We can find the value of *α*, once we have fitted the auxiliary regression equation using the Ordinary Least Squares Regression technique on our data set of counts*.* We’ll see how to do this soon.

But how to find *λ_i* that is contained within the aux OLS regression equation*?*

To find *λ_i***, **we fit the Poisson regression model to the our data set! In fact, doing so gives us the complete rate vector *λ** = [λ_1, λ_2, λ_3, …, λ_n]* corresponding to all *n* observations in the data set.

We now have all the ingredients in place for the NB2 regression strategy. Let’s summarize it.

## Summary of NB2 regression strategy

**STEP 1:**Fit the Poisson regression model on the data set. This will give us the vector of fitted rates*λ.***STEP 2:**Fit the aux OLS regression model on the data set. This will give us the value of*α.***STEP 3:**Use the*α*from STEP 2 to fit the NB2 regression model to the data set.**STEP 4:**Use the fitted NB2 model to make predictions about expected counts on the test data set.**STEP 5:**Test the goodness-of-fit of the NB2 model.

Now that our regression strategy is sketched out, let’s implement it using Python, Pandas and statsmodels.

## How to do Negative Binomial Regression in Python

We’ll start by importing all the required packages.

```
import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
```

Next, create a pandas DataFrame for the counts data set.

```
df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])
```

We’ll add a few derived regression variables to the **X** matrix.

```
ds = df.index.to_series()
df['MONTH'] = ds.dt.month
df['DAY_OF_WEEK'] = ds.dt.dayofweek
df['DAY'] = ds.dt.day
```

We will not use the *Date* variable as a regressor since it contains an absolute date value but we don’t need to do anything special to drop *Date* as it is already consumed as the index of the pandas DataFrame. So it will not be available to us in the **X** matrix.

Let’s create the training and testing data sets.

```
mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))
```

### STEP 1: We will now configure and fit the Poisson regression model on the training data set.

Setup the regression expression in patsy notation. We are telling patsy that BB_COUNT is our dependent variable and it depends on the regression variables: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP.

```
expr = """BB_COUNT ~ DAY + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP"""
```

Set up the **X** and **y** matrices for the training and testing data sets. patsy makes this really simple.

```
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
```

Using the ** statsmodels GLM **class, train the Poisson regression model on the training data set.

```
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
```

This finishes the training of the Poisson regression model. To see outcome of the training, you can print out the training summary.

```
print(poisson_training_results.summary())
```

This prints out the following:

Our real interest lies in the vector of fitted rates ** λ** produced by the training. This rate vector is contained in the parameter poisson_training_results.mu.

```
print(poisson_training_results.mu)
print(len(poisson_training_results.mu))
```

The following output shows the first few and last few values of fitted λ vector:

[1920.39434028 2544.81549207 2651.79330653 743.45309242 1072.77132837 1892.9428398 2320.70412868 3269.73598361 3313.21764921 2915.25363322 2614.39482509 2594.44594144 2415.29471195 3181.91998369 2154.15471026 2268.25625592 1793.29625903 2535.31903414 2566.70835529 970.82159668

2510.70775659 3016.19901465 2260.265944 3365.04650316 2695.6143122...

...2340.12964253 2568.40001641 2232.26752534 2604.97128321 145.92037793 2060.10442187 2518.70470296]

This completes **STEP1: **fitting the Poisson regression model.

### STEP 2: We will now fit the auxiliary OLS regression model on the data set and use the fitted model to get the value of *α.*

Import the *api* package.

```
import statsmodels.formula.api as smf
```

Add the ** λ **vector as a new column called ‘BB_LAMBDA’ to the Data Frame of the training data set. Recollect that

**s dimensions are (n x 1). In our example it will be (161 x 1). Also recollect that the**

*λ’***vector is available in**

*λ*`poisson_training_results.mu`

:```
df_train['BB_LAMBDA'] = poisson_training_results.mu
```

Next, let’s add a derived column called ‘AUX_OLS_DEP’ to the pandas Data Frame. This new column will store the values of **the dependent variable of the OLS regression**. It is the left hand side of the OLS regression equation below:

```
df_train['AUX_OLS_DEP'] = df_train.apply(lambda x: ((x['BB_COUNT'] - x['BB_LAMBDA'])**2 - x['BB_LAMBDA']) / x['BB_LAMBDA'], axis=1)
```

In the above code snippet, the part in bold is the left hand side of the above aux OLSR equation.

Let’s use patsy to form the model specification for the OLSR. We want to tell patsy that AUX_OLS_DEP is the dependent variable and it is explained by BB_LAMBDA (which is the rate vector ** λ**). The ‘-1’ at the end of the expression is patsy syntax for saying: do not to use an intercept of regression; i.e. just fit a straight line passing through the origin, as suggested by Messrs Cameron and Trivedi.

```
ols_expr = """AUX_OLS_DEP ~ BB_LAMBDA - 1"""
```

We are now ready to fit an OLSR model.

Configure and fit the OLSR model:

```
aux_olsr_results = smf.ols(ols_expr, df_train).fit()
```

Print the regression params:

```
print(aux_olsr_results.params)
```

You will see the following single coefficient being printed out corresponding to the single regression variable BB_LAMBDA. This coefficient is the *α* that we were seeking:

BB_LAMBDA 0.037343

#### Is *α *statistically significant?

We now need to answer a very important question. Is this value of *α (*0.037343*)* statistically significant? Or can it be considered to be zero for all practical purposes?

Why is it so important to find this out? Recollect that if *α *is zero, then the following equation:

…reduces to *Variance = mean**.* This is the variance function of the Poisson regression model.

If the value of α is statistically not significant, then the Negative Binomial regression model cannot do a better job of fitting the training data set than a Poisson regression model.

The *OLSResults* object contains the t-score of the regression coefficient *α. *Let’s print it out:

```
aux_olsr_results.tvalues
```

This prints out:

BB_LAMBDA 4.814096

From a t-value calculator, we can see that the critical t-value at a 99% confidence level (right-tailed), and degrees of freedom=(161 observations) — (1 dispersion parameter α)=160 is **2.34988**. This is comfortably less than the t-statistic of *α *which was 4.814096**. **We conclude that,

α=0.037343is statistically significantly.

This completes **STEP 2: **The determination of *α.*

**STEP 3: We supply the value of alpha found in STEP 2 into the** `statsmodels.genmod.families.family.`

**NegativeBinomial**** class, and train the NB2 model on the training data set.**

This is a one-step operation in statsmodels:

```
nb2_training_results = sm.GLM(y_train, X_train,family=sm.families.NegativeBinomial(alpha=aux_olsr_results.params[0])).fit()
```

As before, we’ll print the training summary:

```
print(nb2_training_results.summary())
```

Which prints the following summary:

### STEP 4: Let’s make some predictions using our trained NB2 model.

Prediction is again a single-step procedure in statsmodels:

```
nb2_predictions = nb2_training_results.get_prediction(X_test)
```

Let’s print out the predictions:

```
predictions_summary_frame = nb2_predictions.summary_frame()
print(predictions_summary_frame)
```

Following are the first few lines of the output:

Let’s also plot the predicted counts versus the actual counts for the test data.

```
predicted_counts=predictions_summary_frame['mean']
actual_counts = y_test['BB_COUNT']
fig = plt.figure()
fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')
plt.legend(handles=[predicted, actual])
plt.show()
```

Here’s the output:

Not too bad! The NB2 model seems to be more or less tracking the trend in the bicycle counts. And just as with the Poisson regression model’s performance, in some cases its predictions are way off the actual values.

Here is the complete Python source code for training a Negative Binomial Regression model and testing its predictions:

The final question before us is:

**Statistically, has our NB2 regression model done a better job than the Poisson regression model?**

Let’s find out.

### STEP 5: Measuring the goodness-of-fit of the NB2 model

From a goodness-of-fit perspective, there are three things of interest in the NB2 model’s training summary. They are red-boxed in the figure below. We’ll look at each one of them starting with the Log-Likelihood.

Let’s first compare the NB2 model’s training summary with that of the Poisson regression model on the same data set:

The first statistic to look at is the **Log-Likelihood** value. The Maximum Log-likelihood has been generated by the **Maximum Likelihood Estimation (MLE**) technique that was executed by statsmodels during the training of the Poisson and the NB2 models. The MLE technique is used to fix the values of all the model coefficients to some optimal values which will maximize the likelihood of seeing the vector of counts ** y** in the training data set. To know more about MLE and how it is used in model training, please refer to the section on the Poisson regression model.

#### The Likelihood-ratio (LR) test

The Likelihood-ratio test is used to compare how well two models fit the data.

The LR test statistic is simply negative two times the difference in the fitted log-likelihoods of the two models.

In our case, the Log-likelihood for NB2 is -1383.2, while for the Poisson regression model it is -12616. So the LR test statistic is 2 * (12616–1383.2) = 22465.6. This value is vastly greater than the critical value of χ2(1) at the 1% significance level which is 5.412.

As per the LR test, the trained NB2 regression model has demonstrated a much better goodness-of-fit on the bicyclists data set as compared the Poisson regression model.

Now let’s compare the goodness-of-fit of the NB2 regression model in absolute terms.

#### The Deviance and Pearson chi-squared statistics

The reported values of Deviance and Pearson chi-squared for the NB2 model are 330.99 and 310 respectively. To make a quantitative determination of the goodness-of-fit at some confidence level, say 95% (p=0.05), we look up the value in the *χ2* table for p=0.05 and Degrees of freedom of residuals=165. We compare this Chi-Squared value with the observed statistic — in this case it is the Deviance or the Pearson’s chi-squared value reported in GLMResults. We find that at p=0.05 and DF Residuals = 165, the chi-squared value from a standard Chi-Squared table is 195.973 which is smaller than the reported statistic of 330.99 and 310. Hence as per this test, the NB2 regression model, in spite of demonstrating a much better fit than the Poisson regression model, is still sub-optimal. We *might* be able to do better.

## Conclusion and next steps

The Poisson and the Negative Binomial regression models are used for modeling counts based data sets. Both models produce results that are:

- Explainable
- Comparable
- Defensible
- Usable

Both models are backed by statistical theory that is strong and very well understood.

For doing regression on counts based data sets, a good strategy to follow is to start with the Poisson regression model, then see if you can get better results by using the Negative Binomial regression model.

If neither Poisson nor NB2 are appropriate for your data set, consider using more advanced techniques such as:

- Complex variants of the Poisson regression model such as the Zero-inflated model.
- The hurdle model
- A Random Forest based regression model
- A Long-Short Term Memory (LSTM) neural network based regression model

## References, Citations and Copyrights

### Data set

Bicycle Counts for East River Bridges. Daily total of bike counts conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. From NYC Open Data under Terms of Use. **Curated data set for download.**

### Book and Paper Links

Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.

Cameron, A. Colin, and Pravin K. Trivedi. “Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators and Tests.” *Journal of Applied Econometrics*, vol. 1, no. 1, 1986, pp. 29–53. *JSTOR*, www.jstor.org/stable/2096536.

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **The Poisson Regression Model

**NEXT: **The Generalized Poisson Regression Model