Fitting Linear Regression Models on Count Based Data Sets - Statistical Modeling and Forecasting

Build and train an OLSR model on a data set of counts and analyze its performance

We’ll address the following topics in this section:

Is it wise to fit a linear regression model, particularly an Ordinary Least Squares linear regression model on counts-based data?
How to configure and fit the OLS model on a real-world counts-based data set.
How does the performance of OLS compare with mainstream regression models for counts such as Poisson and Negative Binomial?

What is counts based data?

Counts based data sets contain the occurrences of some event such as all the times at which you spot a meteor during a Perseid event, or something more down to earth like the times at which cars pull up at a gas station.

Counts based data sets are encountered in a variety of fields such as:

Human behavior and psychology: For example, a time series of data points (captured hourly) containing the number of times I have checked my email during each hour.
Manufacturing and retail: An hourly time series containing the number of widgets manufactured or sold in that hour.
Food service: A time series of days in which each data point contains the number of hot dogs served by Papaya King on that day.
Cloud infrastructure: All the times when your e-commerce site went down.
Science: All the dates since 1988 on which an exoplanet was discovered.

There are three things we see right away about counts data sets:

The data is whole numbered i.e. no negative counts.
It is quite often skewed, and thus not normally distributed, and
It can be extraordinarily sparse.

Such data sets normally require special handling, and specialized regression techniques.

Meteors radiating from the constellation Perseus (US National Park Service)

A quick refresher on OLS

Ordinary Least Squares (OLS) linear regression models work on the principle of fitting an n-dimensional linear function to n-dimensional data, in such a way that the sum of squares of differences between the fitted values and the actual values is minimized.

Straight-up OLS based linear regression models can fail miserably on counts based data due to the skewness and sparsity of the data, and the heteroskedasticity of regression errors, i.e. variance in errors not being constant, and instead being a function of the dependent count variable.

Moreover, the propensity of OLS to generate negative and fractional predictions can lead to embarrassing looking predictions for data counts.

So why bother fiddling with linear regression models for such data? It turns out there are some strong reasons for doing so as we’ll soon see.

Building up the case for using OLS regression for counts data

Earlier, we saw how to train the following two commonly used regression models for counts based data sets:

In those sections, we trained and tested these models on the following real-world data of Bicyclist counts on the Brooklyn bridge:

Source: Bicycle Counts for East River Bridges (NYC OpenData) (Image by Author)

And while the Poisson and the NB models performed ‘as advertised’:

Negative Binomial regression model’s predictions (Image by Author)

…we would also want to know the following:

Is it possible to model counts based data successfully using Ordinary Least Squares regression techniques?
How to judge the appropriateness of using OLS regression for a given counts based data set? And how to measure the performanceof OLS regression on such data sets?
And in terms of goodness-of-fit and predictive power, how does the humble OLS model measure up to more sophisticated models such as Poisson and NB?

We’ll answer these questions in this section.

Overall, what we’ll see is two things:

An OLS model can act as a good and fast baseline model while building more sophisticated regression models for counts data sets.
We’ll also see that the most statistically significant regression parameters of an OLS model are also the most statistically significant regression parameters in the Poisson and NB models for such data sets. The math behind this finding has been beautifully explained by Messrs. Cameron and Trivedi in their highly-cited book Regression Analysis of Count Data (Chapter 3, Section 3.7.1).

Throughout this section, we’ll use the NYC bicyclist data set so that we can do an apples-to-apples comparison of OLS with Poisson and NB models.

Layout of the section

We’ll perform Exploratory Data Analysis (EDA) on the bicyclist counts data set so as to judge the suitability of OLS and see if any data transformations are needed.
Using Python, Pandas and statsmodels, we’ll build, train and test an OLS model on this data set.
We’ll compare the OLS regression model’s performance with that of the Poisson and NB models.

Exploratory Data Analysis of the bicyclists counts data set

The following table contains counts of bicyclists traveling over various NYC bridges. The counts were measured daily from 01 April 2017 to 31 October 2017. We’ll focus on the counts on the Brooklyn bridge.

I have cut out the Brooklyn Bridge counts into a separate data set. This data set is available for download over here.

Here are the first few rows of this data set:

Let’s load the data into a pandas data frame and plot the BB_COUNT variable:

import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])
fig = plt.figure()
fig.suptitle('Bicyclist counts on the Brooklyn bridge')
plt.xlabel('Date')
plt.ylabel('Count')
actual, = plt.plot(df.index, df['BB_COUNT'], 'go-', label='Count of bicyclists')
plt.legend(handles=[actual])
plt.show()

The time series plot looks like this:

Time series plot of the number of bicyclists on the Brooklyn bridge (Image by Author)

Let’s also create a histogram plot to see how normally distributed this data set is:

plt.hist(df['BB_COUNT'], bins=20)
plt.show()

As expected, the bicyclist counts data is skewed. Let’s print out a few descriptive stats: the mean, median, skewness, and kurtosis.

For skewness and kurtosis, we’ll run the Jarque-Bera normality test on the BB_COUNT variable.

The JB test will not only give us the skewness and kurtosis, but it will also test whether the deviation from normality is statistically significant.

from statsmodels.stats.stattools import jarque_bera as jb
from statsmodels.stats.stattools import omni_normtest as omb
from statsmodels.compat import lzip
print('Mean='+str(round(df['BB_COUNT'].mean(), 2)))
print('Median='+str(round(df['BB_COUNT'].median(), 2)))
name = ['Jarque-Bera', 'Chi^2 two-tail probability', 'Skewness', 'Kurtosis']
test_results = jb(df['BB_COUNT'])
lzip(name, test_results)

This produces the following output:

Mean=2680.04

Median=2857.0

[('Jarque-Bera', 30.572383357990116), ('Chi^2 two-tail probability', 2.2976893142533207e-07), ('Skewness', -0.8746427375174667), ('Kurtosis', 3.6071892903122973)]

The mean and median are not very dissimilar which is a sign that the distribution is not very imbalanced.

As expected, the skewness reported by the test is negative, and it’s quite small. In comparison, a normal distribution’s skewness is zero. And the excess Kurtosis (=Kurtosis — 3.0) reported by the test is also quite small. In comparison, a normal distribution’s Kurtosis is 3.0 and excess Kurtosis is zero.

On the other hand, the Jarque-Bera test’s t-statistic of 30.57 places it in the normality rejection zone (see figure below) of the Chi-squared(2)’s PDF.

Acceptance and rejection zones for the Null hypothesis in the Chi-squared(2) PDF for two-tailed α=0.05 (Image by Author)

Hence we reject the null hypothesis H_0 that the data is normally distributed, and accept H_1 that the data is not normally distributed.

At this point, we can either accept the small amount of skewness and fit an OLS model on it, or we can try to fix the skewness. The later can be attempted by taking the log transform or the square-root transform of the dependent variable BB_COUNT.

We’ll try both transformation approaches and see if it produces the desired result.

We’ll add two new columns to the data frame: a LOG(BB_COUNT) column and a SQRT(BB_COUNT):

import numpy as np
#Add a column to the Data Frame that contains log(BB_COUNT):
df['LOG_BB_COUNT'] = np.log(df['BB_COUNT'])
#All another column containing sqrt(BB_COUNT)
df['SQRT_BB_COUNT'] = np.sqrt(df['BB_COUNT'])

Let’s run the Jarque-Bera again on the LOG and SQRT columns:

name = ['Jarque-Bera', 'Chi^2 two-tail probability', 'Skewness', 'Kurtosis']
test_results = omb(df['LOG_BB_COUNT'])
lzip(name, test_results)
test_results = omb(df['SQRT_BB_COUNT'])
lzip(name, test_results)

This prints out the following:

[('Jarque-Bera', 888.0352308852536), ('Chi^2 two-tail probability', 1.4641977846577634e-193), ('Skewness', -2.5832340081063463), ('Kurtosis', 11.538169851408346)]

[('Jarque-Bera', 135.86010800784334), ('Chi^2 two-tail probability', 3.15030346376236e-30), ('Skewness', -1.5270456576564704), ('Kurtosis', 5.430879236979477)]

Let’s compare these findings with normality measures of the un-transformed BB_COUNT:

[('Jarque-Bera', 30.572383357990116), ('Chi^2 two-tail probability', 2.2976893142533207e-07), ('Skewness', -0.8746427375174667), ('Kurtosis', 3.6071892903122973)]

That didn’t seem to go too well, did it?! Not only did transforming BB_COUNT not fix skewness, it actually made the data even more skewed! There is something to be learned from this transformation misadventure:

One should resist the temptation to blindly accept the result of any data transformation without carefully inspecting what it has done to one’s data.

So are we back to square one?

In fact, quite the opposite. We gained a lot from doing the EDA on the data. The biggest thing we gained was the realization that we should not spend any more time trying to fix the small amount of skewness and Kurtosis present in the data set.

Moreover, the normality of the dependent variable is not a pre-requisite for performing OLS regression. Excessive departure from normality of the dependent variable only makes it likely (but not certain) that the OLS estimator may produce a biased fit.

So let’s take the calculated risk of accepting the skewness, and fit the OLS regression model on the raw, un-transformed bicyclists counts data.

Our regression goal and regression strategy

Our regression goal is to predict the number of bicyclists crossing the Brooklyn bridge on any given day.

Our regression strategy will be to build and train an OLS regression model in which:

The regression variables will be the variables in the X matrix (see figure below), plus a few additional derived variables which we’ll add to X (we’ll see how to do that in a few seconds).
The dependent variable y of the regression will be the bicyclist counts (the BB_COUNT column in our data set).

The slice of the bicyclist counts data set showing the regression variables matrix X and the vector of counts y (Image by Author)

Once the model is trained, we’ll test its performance on a holdout test data-set which is data that the model is not shown during training.

A step-by-step guide to fitting an OLS regression model to the counts data set using Python, Pandas and Statsmodels

We’ll start with importing all the required packages.

import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

Let’s read the data set into a pandas DataFrame:

df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])

We’ll add a few derived regression variables to the X matrix.

ds = df.index.to_series()
df['MONTH'] = ds.dt.month
df['DAY_OF_WEEK'] = ds.dt.dayofweek
df['DAY'] = ds.dt.day

Here are the first few rows of the pandas DataFrame showing the regression variables to the left and the BB_COUNT dependent variable:

Let’s create the training and testing data sets.

mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))

Setup the regression expression in patsy notation. We are telling patsy that BB_COUNT is our dependent variable and it depends on the regression variables: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP:

expr = 'BB_COUNT ~ DAY  + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP'

Set up the X and y matrices for the training and testing data sets. patsy makes this really simple:

y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

Configure and fit the OLSR model. In statsmodels, this is one line of code:

olsr_results = smf.ols(expr, df_train).fit()

Print the regression output:

print(olsr_results.summary())

Which prints out the following:

Summary of the fitted OLSR model (Image by Author)

We’ll go over the individual elements of the regression output in a bit. Meanwhile, let’s generate the model’s predictions on the test data:

olsr_predictions = olsr_results.get_prediction(X_test)

And let’s print them out:

predictions_summary_frame = olsr_predictions.summary_frame()
print(predictions_summary_frame)

Following are the first few predicted counts:

Notice that forecasted counts are not integers, which is a classic drawback of applying a linear regression model to counts based data. In this case, we can work around this drawback by rounding up or rounding down the predictions.

Let’s also plot the predicted and the actual values.

predicted_counts=predictions_summary_frame['mean']
actual_counts = y_test['BB_COUNT']
fig = plt.figure()
fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')
plt.legend(handles=[predicted, actual])
plt.show()

The plot of actual versus predicted counts (Image by Author)

Notice that one of the predicted bicyclist counts is negative. This is yet another classic shortcoming of a linear regression model when it’s applied to counts based data. In some cases, we can work around this drawback by simply rounding up the negative values to zero.

Interpreting the results of the OLSR model

Keeping aside its propensity for generating fractional counts, and negative counts, let’s objectively evaluate how our OLS regression model performed on the bicyclist counts data set.

Testing the significance of regression parameters:

*Parameter significance using the two-tailed t-test* (Image by Author)

The t-test shows that all regression parameters are individually statistically significant. The absolute value of the t-statistic for each parameter is greater than the specified threshold t-value at the 95% significance level in the two-tailed t-test .

Next, let’s inspect the F-statistic:

The F-test for regression analysis (Image by Author)

The F-stat’s p-value of 3.26E-26 is much smaller than the 0.025 error threshold of a Chi-squared distribution with 6 degrees of freedom, indicating that all regression parameters are jointly significant.

The all-important R-squared value:

The Adjusted-R² is a small correction made to R² to account for the loss of 7 degrees of freedom while doing the estimation i.e. 6 regression variables + intercept).

The Adjusted-R² of 0.530 is telling us that the OLSR model is able to explain more than 50% of the variance in the bicyclist counts dependent variable.

By the way, the value of R² is meaningless if our choice of model itself is wrong! But so far we have not seen any signs of this being the case with OLS.

Let’s checkup two more things about the OLS regression model:

The normality of the residual errors: If we find that the residual errors of the model aren’t normally distributed, we won’t be able to trust the F-statistic or the confidence intervals of the model’s forecasts.
Auto-correlation among the residual errors: Auto-correlation in a time series means that values in the time series are correlated with previous values in the same time series. If we find significant auto-correlation among the residual errors of the model, it’s a sign that we have left out important regression variables, causing some amount of the ‘signal’ in the dependent count variable to ‘leak’ into the residual errors. This is a sign of a badly specified OLSR model.

To check if the residual errors of regression are normally distributed, and they are not auto-correlated, we need to look at the footer section of the OLS model’s summary. The footer section shows the results of normality tests and the auto-correlation test on the residual errors of regression.

Normality of residual errors:

The Jarque-Bera’s test statistic for normality of residuals is 11.917, placing it in the normality rejection region of the Chi-squared(2) PDF. The Omnibus K-squared test’s statistic of 7.347 just about straddles the boundary of the normality rejection zone. See figure below:

What the JB and the Omnibus tests of normality are telling us is that the residuals of OLSR are not normally distributed. At the same time, the very low skewness of residuals (0.147), and the Omnibus statistic being at the Normal/Not-Normal boundary of the Chi-squared(2) PDF implies that the errors are close to being normally distributed. A histogram plot of the residual errors visually confirms this judgement:

plt.hist(olsr_results.resid, bins=20)
plt.show()

Histogram of residual errors (Image by Author)

Notice how the errors are more or less normally distributed around a zero mean.

And here’s the Q-Q plot of the residual errors which also looks, for the most part, linear:

fig = sm.qqplot(olsr_results.resid)
plt.show()

Q-Q plot of the residual errors of OLSR (Image by Author)

All in all, we conclude that the residual errors of the OLS regression model are by and large normally distributed around a mean of zero.

Auto-correlation of residual errors:

The Durbin-Watson test gives us an idea about whether the residual errors are auto-correlated. A DW test value that is substantially less than 2 indicates significant auto-correlation. In this case, the DW test’s value is 1.772 implying no evidence of strong auto-correlation among the residual errors of regression.

We can also verify the lack of auto-correlation by plotting the ACF plot of residual errors:

from statsmodels.graphics.tsaplots import plot_acf
plot_acf(olsr_results.resid, title='ACF of residual errors')
plt.show()

Auto-correlation plot of residual errors of the OLSR model (Image by Author)

As expected, the ACF plot shows no significant auto-correlation between residuals

All in all, the OLS model appears to have fitted the data optimally with no systematic leakage of information into the model’s errors.

Finally, let’s compare the OLSR model’s performance with the Poisson and the NB regression models covered in my earlier two sections on regression models for counts based data.

Comparison of OLSR with Poisson and Negative Binomial models

If we do a side-by-side comparison of the regression parameters of all three models — OLS, Poisson, Negative Binomial — on the same bicyclist counts data set, we see that the most significant parameters found by OLSR (as seen from the absolute value of the parameter’s t-statistic) are also the most significant parameters identified by the other two models.

See figure below. I have annotated the top three significant parameters by their t-score: PRECIP, HIGH_T and LOW_T. They are the same in all three models:

A comparison of parameter significance among OLS, Poisson and NB models (Image by Author)

Comparison of goodness-of-fit of OLSR, Poisson and NB models:

Let’s also look at how well the OLSR model is fitting the bicyclists counts data in comparison with the Poisson and NB models.

Once again, let’s place the OLS results side by side with Poisson and NB regression results.

Using the Maximum Log-Likelihoods as the tape measure of goodness-of-fit, we see that:

The NB model ranks first with the highest Maximum LL of -1309.6,
The OLSR model makes a surprise entry at the 2nd place with a Maximum LL of -1339.4.
The Poisson regression model finishes a distant third at -12616.0.

Goodness-of-fit rankings as per Maximum Log-Likelihood (Image by Author)

This completes our analysis of the OLSR model fitted to the bicyclist counts data set, and the comparison of its performance with the Poisson and Negative Binomial regression models.

The following is the complete Python source code used in this section:

	import pandas as pd
	from matplotlib import pyplot as plt

	#load the data into a pandas data frame and plot the BB_COUNT variable
	df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])
	fig = plt.figure()
	fig.suptitle('Bicyclist counts on the Brooklyn bridge')
	plt.xlabel('Date')
	plt.ylabel('Count')
	actual, = plt.plot(df.index, df['BB_COUNT'], 'go-', label='Count of bicyclists')
	plt.legend(handles=[actual])
	plt.show()

	#create a histogram plot to see how normally distributed is this data set
	plt.hist(df['BB_COUNT'], bins=20)
	plt.show()

	#print out a few descriptive stats: the mean, median, skewness, and kurtosis.
	#For skewness and kurtosis, we'll run the Jarque-Bera normality test on the BB_COUNT variable
	from statsmodels.stats.stattools import jarque_bera as jb
	from statsmodels.stats.stattools import omni_normtest as omb
	from statsmodels.compat import lzip

	print('Mean='+str(round(df['BB_COUNT'].mean(), 2)))
	print('Median='+str(round(df['BB_COUNT'].median(), 2)))
	name = ['Jarque-Bera', 'Chi^2 two-tail probability', 'Skewness', 'Kurtosis']
	test_results = jb(df['BB_COUNT'])
	lzip(name, test_results)

	#Try to fix the skewness by transforming the dependent variable
	#Add two new columns to the data frame: a LOG(BB_COUNT) column and a SQRT(BB_COUNT)
	import numpy as np

	#Add a column to the Data Frame that contains log(BB_COUNT):
	df['LOG_BB_COUNT'] = np.log(df['BB_COUNT'])
	#All another column containing sqrt(BB_COUNT)
	df['SQRT_BB_COUNT'] = np.sqrt(df['BB_COUNT'])

	#run the Jarque-Bera again on the LOG and SQRT columns to see if there is any improvement in normality
	name = ['Jarque-Bera', 'Chi^2 two-tail probability', 'Skewness', 'Kurtosis']
	test_results = omb(df['LOG_BB_COUNT'])
	lzip(name, test_results)
	test_results = omb(df['SQRT_BB_COUNT'])
	lzip(name, test_results)

	#revert to using the original BB_COUNT. Build, trainand test an OLSR model on the counts data

	#start with importing all the required packages
	import pandas as pd
	from patsy import dmatrices
	import numpy as np
	import statsmodels.api as sm
	import statsmodels.formula.api as smf
	import matplotlib.pyplot as plt

	#add a few derived regression variables to the X matrix
	ds = df.index.to_series()
	df['MONTH'] = ds.dt.month
	df['DAY_OF_WEEK'] = ds.dt.dayofweek
	df['DAY'] = ds.dt.day

	#create the training and testing data sets
	mask = np.random.rand(len(df)) < 0.8
	df_train = df[mask]
	df_test = df[~mask]
	print('Training data set length='+str(len(df_train)))
	print('Testing data set length='+str(len(df_test)))

	#Setup the regression expression in patsy notation. We are telling patsy that BB_COUNT is our dependent variable and it depends on the regression variables: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP
	expr = 'BB_COUNT ~ DAY + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP'

	#Set up the X and y matrices for the training and testing data sets
	y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
	y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

	#Configure and fit the OLSR model
	olsr_results = smf.ols(expr, df_train).fit()

	#Print the regression output
	print(olsr_results.summary())

	#generate the model's predictions on the test data
	olsr_predictions = olsr_results.get_prediction(X_test)

	#let's print them out
	predictions_summary_frame = olsr_predictions.summary_frame()
	print(predictions_summary_frame)

	#plot the predicted and the actual values
	predicted_counts=predictions_summary_frame['mean']
	actual_counts = y_test['BB_COUNT']
	fig = plt.figure()
	fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge')
	predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
	actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')
	plt.legend(handles=[predicted, actual])
	plt.show()

	#plot histogram of the residual errors
	plt.hist(olsr_results.resid, bins=20)
	plt.show()

	#plot the Q-Q plot of the residual errors
	fig = sm.qqplot(olsr_results.resid)
	plt.show()

	#plot the ACF plot of residual errors
	from statsmodels.graphics.tsaplots import plot_acf

	plot_acf(olsr_results.resid, title='ACF of residual errors')
	plt.show()

view raw olsr_for_counts_based_data.py hosted with ❤ by GitHub

OLSR Models for Counts Based Data Sets

Four key takeaways

In spite of the skewness in the counts data and the OLSR model’s propensity for generating negative and fractional counts, the OLSR model can be a viable regression model for counts based data sets.
Just because the OLSR model’s performance has been good on the bicyclist counts data set, an OLSR model will not always work out quite as well on all counts data sets. The inherent weakness of OLS regression on counts data is bound to show through for more ‘challenging’ data sets.
Nevertheless, the OLSR model can serve as a sound baseline model for counts based data, and it can give us a means to quickly test which model parameters are the most statistically significant for modeling counts based data.
The OLSR model also gives us a way to objectively compare just how much better (or worse!) is a sophisticated counts based model such as the Poisson model, the Negative Binomial model, a Gradient Boosted Decision Tree model, or a recurrent neural network based model on a counts based data set, sometimes with rather surprising results.

Citations and Copyrights

Data set

Bicycle Counts for East River Bridges. Daily total of bike counts conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. From NYC Open Data under Terms of Use.

Books

Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

PREVIOUS: The Zero Inflated Poisson Regression Model

NEXT: An Overview of Generalized Linear Regression Models

UP: Table of Contents