The Binomial Regression Model - Statistical Modeling and Forecasting

A powerful regression model for a dataset in which the dependent variable y is binomially distributed

This section is divided into two sections:

SECTION 1: Introduction to the Binomial Regression model: We’ll get introduced to the Binomial Regression model, see how it fits into the family of Generalized Linear Models, and why it can be used to predict the odds of seeing a random event.
SECTION 2: Using the Binomial regression model: We’ll train a Binomial Regression model on the real world Titanic data set using Python and the statsmodels library. We’ll see why the Binomial regression model is the right model for predicting the odds of survival on the Titanic. We’ll also learn how to interpret the fitted model’s regression coefficients, a necessary skill to learn, which in case of the Titanic data set produces astonishing results.

Introduction to the Binomial Regression model

The Binomial Regression model can be used for predicting the odds of seeing an event, given a vector of regression variables. For e.g. one could use the Binomial Regression model to predict the odds of its starting to rain in the next 2 hours, given the current temperature, humidity, barometric pressure, time of year, geo-location, altitude etc.

In a Binomial Regression model, the dependent variable y is a discrete random variable that takes on values such as 0, 1, 5, 67 etc. Each value represents the number of ‘successes’ observed in m trials. Thus y follows the binomial distribution.

The following equation gives the probability of observing k successes in m independent Bernoulli trials. Each Bernoulli trial has a probability of success=π and probability of failure=(1-π). A coin toss is the simplest example of a Bernoulli trial in which π = (1-π) = 0.5.

Probability Mass Function of a binomially distributed random variable y (Image by Author)

The vertically bracketed term (m k) is the notation for a ‘Combination’ and is read as ‘m choose k’. It gives you the number of different ways to choose k outcomes from a set of m possible outcomes.

In a regression model, we will assume that the dependent variable y depends on an (n X p) size matrix of regression variables X. The ith row in X can be denoted as x_i which is a vector of size (1 X p ). It corresponds to the ith outcome y_i.

One usually expresses the probability of y_i taking a certain value k as conditional upon the regression variables X taking the value x_i. In notation form, it can be written as Pr(y_i=k|X=x_i) and can be read as probability of y_i being k given that X is x_i.

We can now state the probability distribution of the Binomially distributed y in the context of a regression of y over X as follows:

On the L.H.S. of the above mentioned PMF equation, we will replace the unconditional probability Pr(y=k) with the conditional probability Pr(y_i=k|X=x_i).
In the R.H.S, we will replace the unconditional probability π of observing a success with the conditional probability π_i, where π_i is the probability of observing a success for the ith row of the data set, i.e. when the regression vector X=x_i.

With these two substitutions, the PMF of the binomially distributed y becomes as follows:

Conditional probability distribution of a binomially distributed y that is regressed on X (Image by Author)

In the above equation, the probability of observing a success π_i for some X=x_i, is usually expressed as some function g(.) of x_i. In symbolic terms:

The probability of success expressed as some function of the ith row of the regression matrix X (Image by Author)

The above set of concepts can be neatly visualized using a single illustration as follows:

A binomially distributed y versus X (Image by Author)

In the above graph, y_1, y_2, y_3,…y_10 are ten binomially distributed random variables. They also happen to the the constituents of the dependent variable y which is a (10 X 1) matrix as follows:
y=
[[y_1],
[y_2],
[y_3],
…,
[y_10]].
The corresponding regression variables matrix X in this case also happens to be a 10 X 1 matrix since there is only one regression variable involved:
X=
[[1],
[2],
[3],
[4],
…,
[10]].

Because, y is a random variable with spread m, the plot shows how for each value of X=x_i, y can take any binomially distributed value around its expected value µ_i where µ_i = m*π_i and π_i as we saw earlier is some function g(.) of x_i. And thus, the expected value of y_i which is µ_i, can be expressed as some function of x_i.

The Binomial Regression model is part of the family of Generalized Linear Models. GLMs are used to model the relationship between the expected value of a response variable y and a linear combination of the explanatory variables vector X.

The relationship between E(y|X) and X is expressed by means of a suitable link function, as follows:

The general form of the Generalized Linear Model (Image by Author)

In the above equation, g(.) is the link function that connects the conditional expectation of y on X with a linear combination of the regression variables x_i. X being the matrix of regression variables of size (n X p) where n=rows and p=regression variables in each row, and X=x_i being the ith row in this matrix of size (1 X p) and β being a (p X 1) vector of regression coefficients.

When y is binomially distributed, we are interested in fixing the relation between the conditional expectation of the probability π of a single Bernoulli trial on a particular value of X=x_i, i.e. E(π=π_i|X=x_i), or concisely, π_i|x_i. So the GLM equation for the Binomial regression model can be written as follows:

The general form of the Generalized Linear Model in concise format (Image by Author)

In case of the Binomial Regression model, the link function g(.) takes one of the following four forms (we’ll stop mentioning the conditional notation |X=x_i in each for simplicity, but just assume that it is there):

The Logistic (logit) link function, also known as the log-odds function

The logistic link function (Image by Author)

The logistic is known as the log-odds function because it is expressed as the ratio of the probability of success to probability of failure, i.e. the log of the odds of success. We will be using this link function later on.

The probit link function

The probit (short for probability unit) link function is used to model the occurrence of an event that has a binary Yes/No outcome. This link function is expressed as the inverse of the Cumulative Distribution Function Φ(.) of the standard normal distribution N(0,1).

The book Regression analysis of count data by Colin Cameron and Pravin K. Trivedi provides an excellent introduction to the Probit link function in section 3.6: Ordered and Other Discrete-Choice Models. In there, you will also find a very lucid derivation of why the Probit model’s link function happens to be the Inverse of the CDF Φ(.) of the normal distribution.

The log-log function

The log-log link function (Image by Author)

The log-log function is useful for modeling ‘Poisson-like counting processes’ in which the parameter of the probability distribution (which often contains the mean) lies in the exponent of the probability distribution’s formula, and the parameter is also expressed as an exponent of a linear combination of the regression variables. Thus, it has the double exponent format: exp(exp( — β.x_i) and therefore two consecutive logarithm operations are needed to bring the β.x_i term down to ‘ground level’.

The complementary log-log link function

The complementary log-log is called so because it operates on (1-π_i) i.e. the probability of failure, instead of π_i.

We’ll use the logistic a.k.a. the logit a.k.a. the log-odds link function to build our Binomial Regression model. Here it is once again, this time expressed in a slightly different way. On the R.H.S, I have replaced summation with the bolded vector notation:

SECTION 2: Using the Binomial Regression model to predict the odds of survival on the Titanic

We’ll use the Titanic data set as an example to understand the kinds of use-cases that are appropriate for the Binomial regression model.

Here is a link to the original data set. The curated version of this data set used in the tutorial below can be downloaded from here.

The Titanic data set contains information about 887 of the 2229 souls aboard the ill-fated ocean liner Titanic. Each passenger’s record contains the following attributes:

Name
Passenger class (1/2/3)
Sex
Age
Whether the passenger was accompanied by siblings, parents or children
The Fare they paid, and most importantly,
Whether they survived (1=Survived, 0=Died)

Using Python and the Pandas data analysis library, let’s load the data set into a Pandas data frame, and print out the first few rows:

import pandas as pd
df = pd.read_csv('titanic_dataset.csv', header=0)
df.head(10)

This prints out the following output:

Top 10 rows of the Titanic data set (Image by Author)

We’ll focus attention on four key attributes:

Pclass (the passenger’s class)
Age
Sex
Survived (Whether they survived)

Let’s drop the rest of the columns from the Data Frame:

#Drop the columns that our model will not use
df = df.drop(['Name','Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'], axis=1)
#print the top 10 rows
df.head(10)

This prints out the following:

Top 10 rows of the reduced Titanic data set (Image by Author)

Building the case for the Binomial Regression Model

We’ll postulate that while the Titanic was going down, the combination of [Pclass, Age, Sex] have greatly influenced the odds of a passenger’s survival.

Note that the ‘Survived’ column contains a [0, 1] Bernoulli random variable.

So can we say the following?

Regression variables X = [Pclass, Age, Sex], and,

Dependent variable is the BOOLEAN VARIABLE y = [Survived]

No, we cannot. Let’s see why…

Why not use the Probit Regression model?

Since y is a boolean variable, it may seem like a straight-forward case for using a Probit Regression model. But notice that if one is unfortunate enough to be on a ship such as the Titanic, what one wants to know is not the answer to the binary question: will I survive with 100% certainty or will I die with 100% certainty? Instead, what is more useful to know are the odds of survival.

For example, if you are a 22 years old woman in the second class cabin of the ship, you’ll want to know if your odds of survival are 1 out of 10, 1 out of 4, 1 out of 50 etc.

But the manner in which the Titanic data set is organized, the response variable survived has a yes/no i.e. 1/0 format. In other words, survived has a Bernoulli distribution, i.e. :

Pr(survived=0) = π,
Pr(survived=1) = (1-π)
Where π is some probability between 0 and 1.

What we want is for y to express the odds, i.e. the ratio of successes (survivals) to failures (deaths), in m independent, identical trials.

In other words, we want is for y to have a Log-Odds distribution.

And therefore, instead of using a True or False, 1 or 0 type Probit regression model, what we want to do here is build a Binomial regression model where the response variable is Binomially distributed, and the link function is the Logit i.e. log-odds function.

The link function will allow us to link the odds of survival to a linear combination of the regression variables X=[Pclass, Age & Sex] plus the intercept, as follows:

How to transform y from a Bernoulli variable to a Binomial variable:

To transform the response variable y from Bernoulli to Binomial, we must group together the data set rows that share the same combination of X values [Pclass, Age and Sex]. Before we go about doing that, there is one little thing we need to take care of, and that is bucketing of the Age attribute. You see, Age, the way it is expressed in the data set, is a continuous variable that ranges from 0.42 to 80.

It hardly seems plausible that babies that were 0.42 years and 0.67 years old respectively would have had different odds of survival. Ditto logic holds true for youths with ages 26, 27, 28, 29 etc. years old, and so on for other cases.

We need to make the age data more granular so as to limit the number of groups. Let’s do this by bucketing the overall age range into bins of size 5 years and label each bin like so:
(0, 5] → 5
(5, 10] → 10
(10, 15] → 15 and so on.
The pandas.cut() method does the bucketing very neatly:

#define the bins
age_range_bins=[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80]
#define the label for each bin. Num labels = Num bins - 1
age_range_labels=[5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80]
#Cut up the age range into multiple bins and stuff them into a new Age_Range column
df['Age_Range']=pd.cut(df['Age'],age_range_bins,labels=age_range_labels)
#Print the output
df.head(10)

We get the following output. Note the new Age_Range column we have added:

Let’s drop the Age column as we’ll use Age_Range instead:

#Drop the age column
df = df.drop(['Age'],axis=1)

Now that we have our explanatory variables set up the way we want, let’s get to work on grouping together the samples by the combination [Pclass, Sex, Age_Range]. We’ll use the Pandas groupby() method.

#Group by ['Pclass', 'Sex', 'Age_Range'] 
groups = df.groupby(['Pclass', 'Sex', 'Age_Range'])
#Get the counts for each group. This is the number of passengers in each group who have survived
df_grouped_counts = groups.count()
#Get the size (number of passengers) in each group
df_grouped_survived = groups.sum()

Merge the number of survivors and number of passengers for each group into each grouped data frame. (We’ll see how this helps us in a minute):

df_grouped_counts.to_csv('df_grouped_counts.csv')
df_grouped_counts_1 = pd.read_csv('df_grouped_counts.csv', header=0)
df_grouped_survived.to_csv('df_grouped_survived.csv')
df_grouped_survived_1 = pd.read_csv('df_grouped_survived.csv', header=0)

Finally, let’s construct a new Data Frame which contains:

The grouped columns Pclass, Sex, Age_Range,
The corresponding number of survivors in each group,
The total number of passengers in each group i.e. the group size, and,
The number of passengers in each group who died.

#Create a new Data Frame
df_grouped = pd.DataFrame()
#Copy over the Pclass, Sex and Age Range columns
df_grouped['Pclass'] = df_grouped_counts_1['Pclass']
df_grouped['Sex'] = df_grouped_counts_1['Sex']
df_grouped['Age_Range'] = df_grouped_counts_1['Age_Range']
#Copy over the num passengers from the counts grouped Data Frame
df_grouped['Total'] = df_grouped_counts_1['Survived']
#Copy over the num survivors from the summation grouped Data Frame
df_grouped['Survived'] = df_grouped_survived_1['Survived']
#Add a column containing the number who died
df_grouped[’Died’] = df_grouped[’Total’] - df_grouped[’Survived’]

Let’s print out the first 20 rows of the grouped data set:

df_grouped.head(20)

The first 20 rows of the Titanic data set grouped by [Class, Sex, Age_Range] (Image by Author)

Let’s see what the grouped data set is telling us. I have highlighted row numbers 9, 14 and 19 for illustration:

In row #9, we find that there were 10 women in the age range (45, 50] with a first class ticket of which 9 survived. Thus the odds of survival for a woman in this group were pretty good (9 to 1), especially if she occupied a first class cabin.

In row #19, we see there were 4 male passengers aged (15- 20] of which only one survived.

In row #14, we see that there weren’t any women passengers aged (70–75] who occupied a first class cabin. This is why we are seeing NaNs in the aggregate columns for the group: [1, female, 75].

Let’s remove all such NaN rows from the Data Frame:

df_grouped = df_grouped.dropna()

Before we build the Binomial model, let’s take care of one final data preparation task, namely, let’s replace the ‘female’ and ‘male’ strings with integers 1 and 2:

df_grouped=df_grouped.replace(to_replace={'female': 1, 'male': 2})

Building the Binomial Regression Model using Python and statsmodels

We’ll use the excellent support offered by the statsmodels library for building and training the Binomial Regression model.

Let’s carve out the training and testing data sets:

import numpy as np
#Separate out the training and test sets
mask = np.random.rand(len(df_grouped)) < 0.85
df_train = df_grouped[mask]
df_test = df_grouped[~mask]

Let’s set up the regression model’s formula using the Patsy syntax. What we are saying in below mentioned formula is that the dependent variable is a matrix composed of the Survived and Died columns of the dataframe, while the regression variables are Pclass, Age_Range and Sex.

#Construct the Binomial model's regression formula in Patsy syntax.
formula = 'Survived + Died ~ Pclass + Age_Range + Sex'

Using this formula, let’s carve out the X and y design matrices from the training and testing data frames which we created a minute ago:

from patsy import dmatrices
#Carve out the training matrices from the training data frame using the regression formula
y_train, X_train = dmatrices(formula, df_train, return_type='dataframe')
#Carve out the testing matrices from the testing data frame using the regression formula
y_test, X_test = dmatrices(formula, df_test, return_type='dataframe')

Next, we feed X_train and y_train into an instance of the Binomial Regression model class and train the model:

import statsmodels.api as sm
binom_model = sm.GLM(y_train, X_train, family=sm.families.Binomial())
binom_model_results = binom_model.fit()

Let’s print out the fitted model summary:

print(binom_model_results.summary())

This prints out the following:

Training results of the Binomial Regression model (Image by Author)

How to Interpret the regression results:

In the above output, statsmodels is telling us that it has trained a Generalized Linear Model of type Binomial because, well, we asked it to, that it used the log-odds link function and it has used the Iterative Re-weighted Least Squares (IRLS) algorithm for training our model.

Link function = log-odds (Image by Author)

Statsmodels is reporting that our model has 3 degrees of freedom: Sex, Pclass and Age_Range, which seems about right:

For Binomial models, statsmodels calculates three goodness-of-fit measures for you: Maximum Log-likelihood, Deviance and Pearson Chi-squared. We won’t inspect them any further as all three measures are useful only when you are comparing the goodness-of-fit of two or more Binomial regression models which in this case, we aren’t:

Let’s look at the fitted model’s coefficients:

All regression coefficients are statistically significant at the 0.1% margin of error as indicated by the p-values which are all < 0.001:

Let’s see what each coefficient is telling us.

How to Interpret the model coefficients:

For the logit link function, the fitted coefficients can be interpreted as follows:

Age_Range: It’s coefficient is -0.0446. Note the negative value. The way to interpret this coefficient’s value is that, keeping all other variables constant, for each unit increase in the passenger’s age, the odds of their survival decreased by a factor = exp(-0.0446) = 0.9564. i.e. for each unit increase in the passenger’s age, one needs to multiple their survival odds by 0.9564, thereby reducing the odds of survival by a certain amount each time. For example, if a 12 year old male occupant of a 2nd class cabin had a known survival odds of 8:9 during the disaster, then a 22 years old male occupant of a 2nd class cabin had an odds of survival of (8/9) * 0.9564¹⁰ = approximately 6:10.

Pclass: The coefficient for Pclass is -1.2008. Again note the negative sign. A downgrade of the cabin class of a passenger had an even more dramatic effect on the passenger’s odds of survival aboard the Titanic. The Pclass variable is coded as First class cabin=1, Second class cabin=2 and Third class cabin=3. So for every unit increase in the cabin class i.e. as one goes down from 1st class to 2nd class to 3rd class, the odds of survival, keeping age and sex constant, reduce by a factor of exp(-1.2008) = 0.30! i.e. for each unit downgrade, your odds of survival get multiplied by 0.30. For e.g., if a 30 years old male occupant of a 1st class cabin had a 7 : 9 odds of survival on the Titanic, just dropping him down one class to class 2, reduced his odds of survival to (7/9)*0.3 = approximately 1:4. Bumping down the class further to the 3rd class reduced the odds to (7/9)*0.3*0.3 = 7 : 100.

Sex: Finally, notice the very heavy negative coefficient of -2.6526 for the Sex variable. Aboard the sinking Titanic, male passengers had quite miserable chances of survival as compared to female passengers. Keeping Pclass and Age constant, the odds of survival of a male passenger was only exp(- 2.6526) = 7% of those of a female passenger.

Prediction

Recollect that we had put aside the test data set in the Data Frame df_test. It’s time to test our model’s performance on this data set. To do that, we’ll first add a Percentage Survived column to the test data frame whose value we’ll ask our model to predict:

df_test['Pcnt_Survived'] = df_test['Survived']/df_test['Total']

We’ll use the .predict() method on the results object and pass the test data set get the predicted survival rate:

predicted_survival_rate = binom_model_results.predict(X_test)

Let’s plot the actual versus predicted survival rate:

import matplotlib.pyplot as plt
plt.xlabel('Actual Survival Rate')
plt.ylabel('Predicted Survival Rate')
plt.scatter(df_test['Pcnt_Survived'], predicted_survival_rate, color = 'blue')
plt.show()

Actual versus Predicted survival rates (Image by Author)

As you can see, the fit becomes unacceptable when the survival rates are toward the top of the range i.e. 1.0. To a large extent, the accuracy of the prediction is determined by the sample size i.e. the size of each group of passengers, grouped by the tuple [Pclass, Sex, Age Range]. For some groups in the training set, the group size is too small for the model to train in a meaningful way. For such combinations in the test data set, the accuracy will be understandably low.

Here is the link to the complete source code:

	import pandas as pd

	#load the data set into a Pandas data frame, and print out the first few rows
	df = pd.read_csv('titanic_dataset.csv', header=0)
	df.head(10)

	#Drop the columns that our model will not use
	df = df.drop(['Name','Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'], axis=1)

	#print the top 10 rows
	df.head(10)

	#Bucket the Age column into bins

	#define the bins
	age_range_bins=[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80]

	#define the label for each bin. Num labels = Num bins – 1
	age_range_labels=[5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80]

	#Cut up the age range into multiple bins and stuff them into a new Age_Range column
	df['Age_Range']=pd.cut(df['Age'],age_range_bins,labels=age_range_labels)

	#Print the output
	df.head(10)

	#Drop the age column
	df = df.drop(['Age'],axis=1)

	#Group by ['Pclass', 'Sex', 'Age_Range']
	groups = df.groupby(['Pclass', 'Sex', 'Age_Range'])

	#Get the counts for each group. This is the number of passengers in each group who have survived
	df_grouped_counts = groups.count()

	#Get the size (number of passengers) in each group
	df_grouped_survived = groups.sum()

	#Merge the number of survivors and number of passengers for each group into each grouped data frame
	df_grouped_counts.to_csv('df_grouped_counts.csv')
	df_grouped_counts_1 = pd.read_csv('df_grouped_counts.csv', header=0)

	df_grouped_survived.to_csv('df_grouped_survived.csv')
	df_grouped_survived_1 = pd.read_csv('df_grouped_survived.csv', header=0)

	#Create a new Data Frame
	df_grouped = pd.DataFrame()

	#Copy over the Pclass, Sex and Age Range columns
	df_grouped['Pclass'] = df_grouped_counts_1['Pclass']
	df_grouped['Sex'] = df_grouped_counts_1['Sex']
	df_grouped['Age_Range'] = df_grouped_counts_1['Age_Range']

	#Copy over the num passengers from the counts grouped Data Frame
	df_grouped['Total'] = df_grouped_counts_1['Survived']

	#Copy over the num survivors from the summation grouped Data Frame
	df_grouped['Survived'] = df_grouped_survived_1['Survived']

	#Add a column containing the number who died
	df_grouped['Died'] = df_grouped['Total'] – df_grouped['Survived']

	df_grouped.head(20)

	df_grouped = df_grouped.dropna()

	#replace the 'female' and 'male' strings with integers 1 and 2
	df_grouped=df_grouped.replace(to_replace={'female': 1, 'male': 2})

	#Separate out the training and test sets
	import numpy as np

	mask = np.random.rand(len(df_grouped)) < 0.85
	df_train = df_grouped[mask]
	df_test = df_grouped[~mask]

	#Construct the Binomial model's regression formula in Patsy syntax.
	formula = 'Survived + Died ~ Pclass + Age_Range + Sex'

	#carve out the X and y design matrices from the training and testing data frames
	from patsy import dmatrices

	#Carve out the training matrices from the training data frame using the regression formula
	y_train, X_train = dmatrices(formula, df_train, return_type='dataframe')

	#Carve out the testing matrices from the testing data frame using the regression formula
	y_test, X_test = dmatrices(formula, df_test, return_type='dataframe')

	#feed X_train and y_train into an instance of the Binomial Regression model class and train the model
	import statsmodels.api as sm

	binom_model = sm.GLM(y_train, X_train, family=sm.families.Binomial())
	binom_model_results = binom_model.fit()

	print(binom_model_results.summary())

	#add a Percentage Survived column to the test data frame whose value we'll ask our model to predict
	df_test['Pcnt_Survived'] = df_test['Survived']/df_test['Total']

	#use the .predict() method on the results object and pass the test data set get the predicted survival rate
	predicted_survival_rate = binom_model_results.predict(X_test)

	#plot the actual versus predicted survival rate
	import matplotlib.pyplot as plt

	plt.xlabel('Actual Survival Rate')
	plt.ylabel('Predicted Survival Rate')
	plt.scatter(df_test['Pcnt_Survived'], predicted_survival_rate, color = 'blue')
	plt.show()

view raw binomial_regression.py hosted with ❤ by GitHub

The Binomial Regression Model

Here is the link to the Titanic data set used in the tutorial.

Summary

A Binomial Regression model can be used to predict the odds of an event.
The Binomial Regression model is a member of the family of Generalized Linear Models which use a suitable link function to establish a relationship between the conditional expectation of the response variable y with a linear combination of explanatory variables X.
In the Binomial Regression model, we usually use the log-odds function as the link function.
The Logistic Regression model is a special case of the Binomial Regression model in the situation where the size of each group of explanatory variables in the data set is one.

Citations and Copyrights

Data Set

The Titanic data set has been downloaded from Stanford’s CS109 class website. Curated data set download link.

Books

M McCullagh P., Nelder J. A., Generalized Linear Models”, Chapman and Hall/CRC; 2nd edition (August 1, 1989), ISBN-13 : 978-0412317606

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

PREVIOUS: An Overview Of The Generalized Linear Regression Model

NEXT: The Holt Winters Exponential Smoothing Model For Time Series Data

UP: Table of Contents