Plus a Python tutorial on Negative Binomial regression
In this section, we’ll cover the following topics:
- We’ll get introduced to the Negative Binomial (NB) regression model. An NB model can be incredibly useful for predicting count based data.
- We’ll go through a step-by-step tutorial on how to create, train and test a Negative Binomial regression model in Python using the GLM class of statsmodels.
Motivation for using the Negative Binomial regression model
In the previous section, we got introduced to the Poisson regression model and we saw how to apply it to count based data, such as the data set of bicyclist counts on the Brooklyn bridge:
We also saw that the Poisson regression model proved to be inadequate for modeling our bicyclist data set.
Although the Poisson regression model made predictions that were visually satisfying…:
…its results were statistically unsatisfactory:
The low performance of the model was because the data did not obey the variance = mean criterion required of it by the Poisson regression model.
This rather strict criterion is often not satisfied by real world data. Often, the variance is greater than the mean, a property called over-dispersion, and sometimes the variance is less than the mean, called under-dispersion. In such cases, one needs to use a regression model that will not make the equi-dispersion assumption i.e.not assume that variance=mean.
The Negative Binomial (NB) regression model is one such model that does not make the variance = mean assumption about the data.
In the rest of the section, we’ll learn about the NB model and see how to use it on the bicyclist counts data set.
Layout of the section
The section is laid out as follows:
- We’ll get introduced to a real world data set of counts which we’ll use in the rest of this section.
- We’ll define our regression goal on this data set.
- We’ll formulate the regression strategy using the NB model as our regression model.
- We’ll configure the NB model, train it on the data set, and make some predictions on the test data set. We’ll do all of this using the Python statsmodels library.
- Lastly, we’ll examine if the NB model’s performance is really superior to the Poisson model’s performance.
A real world data set of counts
The following table contains counts of bicyclists traveling across various NYC bridges. The counts were measured daily from 01 April 2017 to 31 October 2017.
We will focus our analysis on the number of bicyclists crossing the Brooklyn bridge everyday. Here is a time sequenced plot of the bicyclist counts seen on the Brooklyn bridge.
Here is the link to the curated data set of bicyclist counts over the Brooklyn Bridge.
Our regression goal
Our regression goal is to predict the number of bicyclists crossing the Brooklyn bridge on any given day.
Our regression strategy
Given the values of a set of regression variables for a given day, we will use the NB model to predict the bicyclist count on the Brooklyn bridge on that day.
We need to detail out this strategy, so let’s dig deeper. Let us start with defining some variables:
y = the vector of bicyclist counts seen on days 1 through n.
Thus y = [y_1, y_2, y_3,…,y_n].
y_i is the number of bicyclists on day i.
X = the matrix of predictors a.k.a. regressors a.k.a explanatory variables a.k.a. regression variables. The size of matrix X is a (n x m) since there are n independent observations (rows) in the data set and each row contains values of m explanatory variables.
λ = the vector of event rates. The vector λ is a primary characteristic of count based data sets. λ is a vector of size (n x 1). It contains n rates [λ_0, λ_1, λ_2,…,λ_n], corresponding to the n observed counts in the counts vector y. The rate λ_i for observation ‘i’ is assumed to drive the actual observed count y_i in the counts vector y. The λ column is not present in the input data. Instead, λ vector is a deduced variable that is calculated by the regression model during the training phase.
For the bicyclist counts data, each one of the λ_i values is defined as the number of bicyclists crossing the bridge in ‘unit’ time on day i. Unit time can be 1 second, 1 hour, 1 day, 1 week — whatever unit time interval we want to measure the rate over. This rate λ_i is assumed to drive the observed count of bicyclists y_i on day i.
The following figure illustrates these definitions on a subset of our bicyclist counts data set:
The training algorithm of the Negative Binomial regression model will fit the observed counts y to the regression matrix X.
Once the model is trained, we’ll test its performance on a hold out test data set that the model has not seen at all during training.
Recollect that the Negative Binomial regression model does not make the variance = mean assumption that the Poisson regression model does.
Instead, the NB model requires us to define a new parameter α which it uses to express the variance in terms of the mean as follows:
In practice, this equation takes one of two commonly occurring forms:
When p = 1:
In the regression literature, the p=1 case has been referred to as the NB1 model. See Cameron, A.C., and P.K. Trivedi (1986), “Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators
The p=2 case is referred to as the NB2 model.
We will use the NB2 model.
The Python statsmodels library also supports the NB2 model as part of the Generalized Linear Model class that it offers.
In fact, the
statsmodels.genmod.families.family package has a whole class devoted to the NB2 model:
class statsmodels.genmod.families.family.NegativeBinomial(link=None, alpha=1.0)
Note that the default value of alpha=1 which this class assumes, is not always the right value for all data sets. So how can we determine the correct value of α for our bicyclist counts data set?
Finding the correct value of α
Once again, Messrs Cameron and Trivedi come to our rescue. In their book, Regression Analysis of Count Data, Cameron and Trivedi suggest a clever means to calculate α using a technique they call auxiliary OLS regression without a constant. The regression equation that they recommend is as follows:
You can at once see the relationship of the aux OLS equation with the straight line regression equation: Y = B_1*X + B_0.
In case you are curious, the equation to estimate α for the NB1 model is as follows:
In the rest of this section, we’ll use the NB2 model.
We can find the value of α, once we have fitted the auxiliary regression equation using the Ordinary Least Squares Regression technique on our data set of counts. We’ll see how to do this soon.
But how to find λ_i that is contained within the aux OLS regression equation?
To find λ_i, we fit the Poisson regression model to the our data set! In fact, doing so gives us the complete rate vector λ = [λ_1, λ_2, λ_3, …, λ_n] corresponding to all n observations in the data set.
We now have all the ingredients in place for the NB2 regression strategy. Let’s summarize it.
Summary of NB2 regression strategy
- STEP 1: Fit the Poisson regression model on the data set. This will give us the vector of fitted rates λ.
- STEP 2: Fit the aux OLS regression model on the data set. This will give us the value of α.
- STEP 3: Use the α from STEP 2 to fit the NB2 regression model to the data set.
- STEP 4: Use the fitted NB2 model to make predictions about expected counts on the test data set.
- STEP 5: Test the goodness-of-fit of the NB2 model.
Now that our regression strategy is sketched out, let’s implement it using Python, Pandas and statsmodels.
How to do Negative Binomial Regression in Python
We’ll start by importing all the required packages.
import pandas as pd from patsy import dmatrices import numpy as np import statsmodels.api as sm import matplotlib.pyplot as plt
Next, create a pandas DataFrame for the counts data set.
df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=, index_col=)
We’ll add a few derived regression variables to the X matrix.
ds = df.index.to_series() df['MONTH'] = ds.dt.month df['DAY_OF_WEEK'] = ds.dt.dayofweek df['DAY'] = ds.dt.day
We will not use the Date variable as a regressor since it contains an absolute date value but we don’t need to do anything special to drop Date as it is already consumed as the index of the pandas DataFrame. So it will not be available to us in the X matrix.
Let’s create the training and testing data sets.
mask = np.random.rand(len(df)) < 0.8 df_train = df[mask] df_test = df[~mask] print('Training data set length='+str(len(df_train))) print('Testing data set length='+str(len(df_test)))
STEP 1: We will now configure and fit the Poisson regression model on the training data set.
Setup the regression expression in patsy notation. We are telling patsy that BB_COUNT is our dependent variable and it depends on the regression variables: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP.
expr = """BB_COUNT ~ DAY + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP"""
Set up the X and y matrices for the training and testing data sets. patsy makes this really simple.
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe') y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
Using the statsmodels GLM class, train the Poisson regression model on the training data set.
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
This finishes the training of the Poisson regression model. To see outcome of the training, you can print out the training summary.
This prints out the following:
Our real interest lies in the vector of fitted rates λ produced by the training. This rate vector is contained in the parameter poisson_training_results.mu.
The following output shows the first few and last few values of fitted λ vector:
[1920.39434028 2544.81549207 2651.79330653 743.45309242 1072.77132837 1892.9428398 2320.70412868 3269.73598361 3313.21764921 2915.25363322 2614.39482509 2594.44594144 2415.29471195 3181.91998369 2154.15471026 2268.25625592 1793.29625903 2535.31903414 2566.70835529 970.82159668
2510.70775659 3016.19901465 2260.265944 3365.04650316 2695.6143122...
...2340.12964253 2568.40001641 2232.26752534 2604.97128321 145.92037793 2060.10442187 2518.70470296]
This completes STEP1: fitting the Poisson regression model.
STEP 2: We will now fit the auxiliary OLS regression model on the data set and use the fitted model to get the value of α.
Import the api package.
import statsmodels.formula.api as smf
Add the λ vector as a new column called ‘BB_LAMBDA’ to the Data Frame of the training data set. Recollect that λ’s dimensions are (n x 1). In our example it will be (161 x 1). Also recollect that the λ vector is available in
df_train['BB_LAMBDA'] = poisson_training_results.mu
Next, let’s add a derived column called ‘AUX_OLS_DEP’ to the pandas Data Frame. This new column will store the values of the dependent variable of the OLS regression. It is the left hand side of the OLS regression equation below:
df_train['AUX_OLS_DEP'] = df_train.apply(lambda x: ((x['BB_COUNT'] - x['BB_LAMBDA'])**2 - x['BB_LAMBDA']) / x['BB_LAMBDA'], axis=1)
In the above code snippet, the part in bold is the left hand side of the above aux OLSR equation.
Let’s use patsy to form the model specification for the OLSR. We want to tell patsy that AUX_OLS_DEP is the dependent variable and it is explained by BB_LAMBDA (which is the rate vector λ). The ‘-1’ at the end of the expression is patsy syntax for saying: do not to use an intercept of regression; i.e. just fit a straight line passing through the origin, as suggested by Messrs Cameron and Trivedi.
ols_expr = """AUX_OLS_DEP ~ BB_LAMBDA - 1"""
We are now ready to fit an OLSR model.
Configure and fit the OLSR model:
aux_olsr_results = smf.ols(ols_expr, df_train).fit()
Print the regression params:
You will see the following single coefficient being printed out corresponding to the single regression variable BB_LAMBDA. This coefficient is the α that we were seeking:
Is α statistically significant?
We now need to answer a very important question. Is this value of α (0.037343) statistically significant? Or can it be considered to be zero for all practical purposes?
Why is it so important to find this out? Recollect that if α is zero, then the following equation:
…reduces to Variance = mean. This is the variance function of the Poisson regression model.
If the value of α is statistically not significant, then the Negative Binomial regression model cannot do a better job of fitting the training data set than a Poisson regression model.
The OLSResults object contains the t-score of the regression coefficient α. Let’s print it out:
This prints out:
From a t-value calculator, we can see that the critical t-value at a 99% confidence level (right-tailed), and degrees of freedom=(161 observations) — (1 dispersion parameter α)=160 is 2.34988. This is comfortably less than the t-statistic of α which was 4.814096. We conclude that,
α=0.037343 is statistically significantly.
This completes STEP 2: The determination of α.
STEP 3: We supply the value of alpha found in STEP 2 into the
statsmodels.genmod.families.family.NegativeBinomial class, and train the NB2 model on the training data set.
This is a one-step operation in statsmodels:
nb2_training_results = sm.GLM(y_train, X_train,family=sm.families.NegativeBinomial(alpha=aux_olsr_results.params)).fit()
As before, we’ll print the training summary:
Which prints the following summary:
STEP 4: Let’s make some predictions using our trained NB2 model.
Prediction is again a single-step procedure in statsmodels:
nb2_predictions = nb2_training_results.get_prediction(X_test)
Let’s print out the predictions:
predictions_summary_frame = nb2_predictions.summary_frame() print(predictions_summary_frame)
Following are the first few lines of the output:
Let’s also plot the predicted counts versus the actual counts for the test data.
predicted_counts=predictions_summary_frame['mean'] actual_counts = y_test['BB_COUNT'] fig = plt.figure() fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge') predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts') actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts') plt.legend(handles=[predicted, actual]) plt.show()
Here’s the output:
Not too bad! The NB2 model seems to be more or less tracking the trend in the bicycle counts. And just as with the Poisson regression model’s performance, in some cases its predictions are way off the actual values.
Here is the complete Python source code for training a Negative Binomial Regression model and testing its predictions:
The final question before us is:
Statistically, has our NB2 regression model done a better job than the Poisson regression model?
Let’s find out.
STEP 5: Measuring the goodness-of-fit of the NB2 model
From a goodness-of-fit perspective, there are three things of interest in the NB2 model’s training summary. They are red-boxed in the figure below. We’ll look at each one of them starting with the Log-Likelihood.
Let’s first compare the NB2 model’s training summary with that of the Poisson regression model on the same data set:
The first statistic to look at is the Log-Likelihood value. The Maximum Log-likelihood has been generated by the Maximum Likelihood Estimation (MLE) technique that was executed by statsmodels during the training of the Poisson and the NB2 models. The MLE technique is used to fix the values of all the model coefficients to some optimal values which will maximize the likelihood of seeing the vector of counts y in the training data set. To know more about MLE and how it is used in model training, please refer to the section on the Poisson regression model.
The Likelihood-ratio (LR) test
The Likelihood-ratio test is used to compare how well two models fit the data.
The LR test statistic is simply negative two times the difference in the fitted log-likelihoods of the two models.
In our case, the Log-likelihood for NB2 is -1383.2, while for the Poisson regression model it is -12616. So the LR test statistic is 2 * (12616–1383.2) = 22465.6. This value is vastly greater than the critical value of χ2(1) at the 1% significance level which is 6.635.
As per the LR test, the trained NB2 regression model has demonstrated a much better goodness-of-fit on the bicyclists data set as compared the Poisson regression model.
Now let’s compare the goodness-of-fit of the NB2 regression model in absolute terms.
The Deviance and Pearson chi-squared statistics
The reported values of Deviance and Pearson chi-squared for the NB2 model are 330.99 and 310 respectively. To make a quantitative determination of the goodness-of-fit at some confidence level, say 95% (p=0.05), we look up the value in the χ2 table for p=0.05 and Degrees of freedom of residuals=165. We compare this Chi-Squared value with the observed statistic — in this case it is the Deviance or the Pearson’s chi-squared value reported in GLMResults. We find that at p=0.05 and DF Residuals = 165, the chi-squared value from a standard Chi-Squared table is 195.973 which is smaller than the reported statistic of 330.99 and 310. Hence as per this test, the NB2 regression model, in spite of demonstrating a much better fit than the Poisson regression model, is still sub-optimal. We might be able to do better.
Conclusion and next steps
The Poisson and the Negative Binomial regression models are used for modeling counts based data sets. Both models produce results that are:
Both models are backed by statistical theory that is strong and very well understood.
For doing regression on counts based data sets, a good strategy to follow is to start with the Poisson regression model, then see if you can get better results by using the Negative Binomial regression model.
If neither Poisson nor NB2 are appropriate for your data set, consider using more advanced techniques such as:
- Complex variants of the Poisson regression model such as the Zero-inflated model.
- The hurdle model
- A Random Forest based regression model
- A Long-Short Term Memory (LSTM) neural network based regression model
References, Citations and Copyrights
Book and Paper Links
Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.
Cameron, A. Colin, and Pravin K. Trivedi. “Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators and Tests.” Journal of Applied Econometrics, vol. 1, no. 1, 1986, pp. 29–53. JSTOR, www.jstor.org/stable/2096536.
All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
PREVIOUS: The Poisson Regression Model
NEXT: The Generalized Poisson Regression Model