The Zero Inflated Poisson Regression Model

Introduction to the ZIP model and a Python tutorial on training a ZIP model on a dataset having excess zeroes

In this section, we’ll learn how to build a regression model for counts based datasets in which the dependent variable contains an excess of zero-valued data.

Counts datasets are ones where the dependent variable is an event such as:

Number of vehicles crossing an intersection per hour.
Number of ER visits happening each month
Number of motor vehicle insurance claims filed per year
Number of defects found in a mass produced printed circuit board.

Data set containing many zero counts (Image by Author)

Many real world phenomena produce counts that are almost always zero. For example:

Number of times a machine fails each month
Number of exoplanets discovered each year
The number of billionaires living in every single city in the world.

Such data are hard to deal with using traditional models for counts data such as the Poisson, the Binomial or the Negative Binomial regression models.

This is because such data sets contain more number of zero valued counts than what one would expect to observe using the traditional model’s probability distribution.

For example, if you assume that a phenomenon obeys the following Poisson(5) process, you would expect to see zero counts no more than 0.67% of the time:

A Poisson(5) process will generate zeros in about 0.67% of observations (Image by Author)

If you observe zero counts far more often than that, the data set contains an excess of zeroes.

If you use a standard Poisson or Binomial or NB regression model on such data sets, it can fit badly and will generate poor quality predictions, no matter how much you tweak its parameters.

So what is a modeler to do when faced with such data with excess zeros?

Fortunately, there is a way to modify a standard counts model such as Poisson or Negative Binomial to account for the presence of the extra zeroes. In fact, there happen to be at least two ways to do this. One technique is known as the Hurdle model and the second technique is known the Zero-Inflated model.

In this section, we’ll look at the zero-inflated regression model in some detail. Specifically, we’ll focus on the Zero Inflated Poisson regression model, often referred to as the ZIP model.

The structure of a ZIP model

Let’s briefly look at the structure of a regular Poisson model before we see how its structure is modified to handle excess zero counts.

Imagine a data set containing n samples and p regression variables per sample. Therefore, the regression variables X can be represented by a matrix of size (n x p) and each row x_i in the X matrix is a vector of size (1 x p) corresponding the dependent variable value y_i:

A data set (y, X) in matrix notation — A data set (***y, X***) in matrix notation (Image by Author)

If we assume that y is a Poisson distributed random variable, we can build a Poisson regression model for this data set. The Poisson model is made up of two parts:

A Poisson Probability Mass Function (PMF) denoted as P(y_i=k) used to calculate the probability of observing k events in any unit interval given a mean event rate of λ events / unit time.
A link function that is used to express the mean rate λ as a function of the regression variables X.

This is illustrated in the figure below:

Probability Mass Function of the standard Poisson regression model (Image by Author)

Normally, we assume that there is some underlying process that is producing the observed counts as per the Poisson PMF: P(y_i=k).

The intuition behind the Zero Inflated Poisson model is that there is a second underlying process that is determining whether a count is zero or non-zero. Once a count is determined to be non-zero, the regular Poisson process takes over to determine its actual non-zero value based on the Poisson process’s PMF.

Thus, a ZIP regression model consists three parts:

A PMF P(y_i=0) which is used to calculate the probability of observing a zero count.
A second PMF P(y_i=k) which is used to calculate the probability of observing k events, given that k > 0.
A link function that is used to express the mean rate λ as a function of the regression variables X.

This is illustrated in the following figure:

Probability Mass Function of the ZIP model (Image by Author)

As before, y_i is the random variable that denotes the observed count corresponding to the regression variables row x_i=[x_i1, x_i2, x_i3,…,x_ip].

ϕ_i is a measure of the proportion of excess zeroes corresponding to the ith row (y_i, x_i) in the data set; .

Getting to know ϕ_i

A simple way to understand ϕ_i is as follows:

Imagine that you take 1000 observations of y_i, each one with the same combination of regression variable values x_i=[x_i1, x_i2, x_i3,…,x_ip]. Since y_i is a random variable that follows the Poisson distribution, you may see a different value of y_i in each one of the 1000 observations.

Suppose that out of the 1000 y_i values you observe, you observe 874 zero values. You determine that out of these 874 zero values, the regular Poisson distribution that you have assumed for y_i, will be able to explain only up to 7 zero values. So the remaining 867 zero values are excess zero observations. So for the ith row in your data set, ϕ_i =867/1000 = 0.867.

When the data set does not have any excess zeroes in the dependent variable, the value of ϕ works out to be zero and the PMF of the ZIP model reduces to the PMF of the standard Poisson model (you can easily verify this by setting ϕ to 0 in the ZIP model’s PMF).

How to estimate ϕ?

So how can we estimate the value of ϕ_i?

A simple and crude way of estimating ϕ_i is by setting each ϕ_i to the following ratio:

A simple but inaccurate way to estimate ϕ_i in the ZIP model (Image by Author)

Perhaps a more realistic way of calculating ϕ_i is by estimating it as a function of regression variables X. This is usually done by transforming the y variable to a binary 0/1 random variable y’ (y_prime) which takes the value 0 if the underlying y is 0, and 1 in all other cases. Then we fit a Logistic regression model on the transformed y’. We then train the Logistic regression model on the data set [X, y’] and it yields a vector of fitted probabilities µ_fitted=[µ_1, µ_2, µ_3,…,µ_n], (because that’s what a Logistic regression model does).

Once we get the µ_fitted vector, we simply set it to the vector ϕ.

Thus [ϕ_1=µ_1, ϕ_2=µ_2, ϕ_3=µ_3,…,ϕ_n=µ_n].

The above process of estimating ϕ is illustrated below:

The training sequence for estimating excess zeros parameter ϕ in a ZIP model — The training sequence for estimating *excess zeros parameter* ϕ in a ZIP model (Image by Author)

Once the ϕ vector is estimated, we plug it into the probability functions of the ZIP model and use what is known as the Maximum Likelihood Estimation (MLE) technique to train the ZIP model on the data set with excess counts.

The following figure illustrates the training sequence of the ZIP model:

Thankfully, there are many statistics packages that automate this entire procedure of estimating ϕ and using the estimated ϕ to train the ZIP model using the MLE technique on your data set.

In the rest of this section, we’ll use the Python statsmodels library to build and train a ZIP model in a single line of code.

How to train the ZIP model using Python

In our Python tutorial on the ZIP model, we’ll use a data set of camping trips taken by 250 groups of people:

The camping trips data set (Image by Author)

The data set is available here. Here are a couple of salient features of this data set:

The campers may or may not have done some fishing during their trip.
If a group did some fishing, they would have caught zero or more fish.
We want to estimate not only how many fish were caught (if there was fishing done by a camping group), but also the probability that the camping group caught any fish at all.

Thus, there are two distinct data generation processes involved:

A process that determines whether or not a camping group indulged in a successful fishing activity: The ZIP model will internally use a Logistic Regression model that was explained earlier to model this binary process.
A second process that determines how many fish were caught by a camping group, given that there was at least one fish caught by the group: The ZIP model will use a regular Poisson model for modeling this second process.

Variables in the data set

The camping trips data set contains the following variables:

FISH_COUNT: The number of fish that were caught. This will be our dependent variable y.

LIVE_BAIT: A binary variable indicating whether live bait was used.

CAMPER: Whether the fishing group used a camper van.

PERSONS: Total number of people in the fishing group. Note that in some groups, none of them may have fished.

CHILDREN: The number of children in the camping group.

Here is a frequency distribution of the dependent FISH_COUNT variable:

Frequency distribution of FISH_COUNT (Image by Author)

As we can see, there may be excess zeroes in this data set. We’ll train a ZIP model on this data set to test this theory and hopefully achieve a better fit than the regular Poisson model.

Regression Goal

Our regression goals on this data set are as follows:

Predict the number of fish caught (FISH_COUNT) by a camping group based on the values of LIVE_BAIT, CAMPER, PERSONS and CHILDREN variables.

Regression Strategy

Our regression strategy will be as follows:

FISH_COUNT will be the dependent variable y, and [LIVE_BAIT, CAMPER, PERSONS and CHILDREN] will be the explanatory variables X.
We’ll use the Python statsmodels library to train the ZIP regression model on the (y, X) data set.
We’ll make some predictions using the ZIP model on a test data set that the model has not seen during its training.

Let’s begin by import all the required packages:

import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

Next, we’ll load the fish data set into memory. Here is the link to the data set:

df = pd.read_csv('fish.csv', header=0)

Let’s print the top few rows of the data set:

print(df.head(10))

Top 10 rows of the data set (Image by Author)

Let’s also print out the frequency distribution of FISH_COUNT values:

df.groupby('FISH_COUNT').count()

Frequency distribution of fish counts (Image by Author)

Create the training and test data sets. Note that for now, we are not doing a stratified random split:

mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))

>> Training data set length=196
>> Testing data set length=54

Setup the regression expression in Patsy notation. We are telling Patsy that FISH_COUNT is our dependent variable y and it depends on the regression variables LIVE_BAIT, CAMPER, PERSONS and CHILDREN:

expr = 'FISH_COUNT ~ LIVE_BAIT  + CAMPER + CHILDREN + PERSONS'

Let’s use Patsy to carve out the X and y matrices for the training and testing data sets.

y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

Using statsmodels’s ZeroInflatedPoisson class, let’s build and train a ZIP regression model on the training data set.

But before we do so, let me explain how to use two parameters that the class constructor takes:

inflation: The ZeroInflatedPoisson model class will internally use a LogisticRegression model to estimate the parameter ϕ. Hence we set the model parameter inflation to ’logit’. We can also experiment with setting it to other Binomial link functions such as ‘probit’.
exog_infl: We also want to ask the ZIP model to estimate ϕ as a function of the same set of regression variables as the parent model, namely: LIVE_BAIT, CAMPER, PERSONS and CHILDREN. Hence we set the parameter exog_infl to X_train. If you want to use only a subset of X_train, you can do so, or you can set exog_infl to an entirely different set of regression variables.

The below line builds and trains the ZIP model on our training data set in a single line of code.

zip_training_results = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, inflation='logit').fit()

Print the training summary:

print(zip_training_results.summary())

Here is the training summary (I have highlighted the important elements in the output):

Training summary of the ZIP model (Image by Author)

Interpreting the training output

The blue box contains information about variables that the nested Logistic Regression model has used to estimate the probability ϕ of whether or not any fish were caught by a camping group.

Regression coefficients, their standard errors and z-scores from the fitted ZIP model (Image by Author)

Notice that the Logistic regression model did not find Intercept, LIVE_BAIT and CAMPER variables useful for estimating ϕ. Their regression coefficients were found to be NOT statistically significant at the 95% confidence level, as indicated by the respective p values:
inflate_Intercept=0.425,
inflate_LIVE_BAIT=0.680 and
inflate_CAMPER=0.240,
which are all greater than 0.05 (i.e. 5% error threshold).

Observation 1

The only two variables that the Logistic Regression model determined as useful for estimating the probability of whether or not any fish were caught were CHILDREN and PERSONS.

Observation 2

The regression coefficient of PERSONS is negative (inflate_PERSONS -1.2193) which means that as the number of people in the camping group increases, probability of no fish being caught by that group decreases. This is in line with our intuition.

The red box contains information about variables that the parent Poisson model used to estimate FISH_COUNT on the condition that FISH_COUNT > 0.

Observation 3

We can see that the coefficients for all 5 regression variables are statistically significant at a 99% confidence level, as evidenced by their p value which is less than 0.01. In fact, the p value is less than 0.001 for all 5 variables, hence it is showing up as 0.000.

Observation 4

The coefficient for CHILDREN is negative (CHILDREN -1.0810), meaning that as the number of children in the camping group goes up, the number of fish caught by that group goes down!

Observation 5

The Maximized Log-Likelihood of this model is -566.43. This value is useful for comparing the goodness-of-fit of the model with that of other models.

Observation 6

Finally, note that the training algorithm of the ZIP model was not able to converge on the training data set as indicated by the following:

If it had converged, perhaps it would have resulted in a better fit. We could try to fix that by passing in a maxiter=100 parameter into the fit() method.

Prediction

We’ll get the ZIP model’s predictions on the test data set and calculate the root mean square error w.r.t. the actual values:

zip_predictions = zip_training_results.predict(X_test,exog_infl=X_test)
predicted_counts=np.round(zip_predictions)
actual_counts = y_test[dep_var]
print('ZIP RMSE='+str(np.sqrt(np.sum(np.power(np.subtract(predicted_counts,actual_counts),2)))))

>> ZIP RMSE=55.65069631190611

Let’s plot the predicted versus actual fish counts:

fig = plt.figure()
fig.suptitle('Predicted versus actual counts using the ZIP model')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual')
plt.legend(handles=[predicted, actual])
plt.show()

We see the following plot:

Predicted versus actual fish caught (Image by Author)

This completes our look at the Zero-Inflated Poisson regression model.

Citations and Copyrights

Papers

Lambert, Diane. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics, vol. 34, no. 1, 1992, pp. 1–14. JSTOR, www.jstor.org/stable/1269547.

Books

Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

PREVIOUS: The Generalized Poisson Regression Model

NEXT: Fitting Linear Regression Models on Count Based Data Sets

UP: Table of Contents