###### A Generalized Linear Regression Model for data sets in which the dependent variable *y* follows the Binomial probability distribution

*y*

This section is divided into two sections:

**SECTION 1: Introduction to the Binomial Regression model:**We’ll get introduced to the Binomial Regression model, see how it fits into the family of Generalized Linear Models, and why it can be used to predict the odds of seeing a random event.**SECTION 2: Using the Binomial regression model:**We’ll train a Binomial Regression model on the real world Titanic data set using*Python*and the*statsmodels*library. We’ll see why the Binomial regression model is the right model for predicting the odds of survival on the Titanic. We’ll also learn how to interpret the fitted model’s regression coefficients, a necessary skill to learn, which in case of the Titanic data set produces astonishing results.

## Introduction to the Binomial Regression model

The Binomial Regression model can be used for predicting the odds of seeing an event, given a vector of regression variables. For e.g. one could use the Binomial Regression model to predict the odds of its starting to rain in the next 2 hours, given the current temperature, humidity, barometric pressure, time of year, geo-location, altitude etc.

In a Binomial Regression model, the dependent variable ** y** is a discrete random variable that takes on values such as 0, 1, 5, 67 etc. Each value represents the number of ‘successes’ observed in

*m*trials. Thus

**follows the binomial distribution.**

*y*The following equation gives the probability of observing *k *successes in *m* independent Bernoulli trials. Each Bernoulli trial has a probability of success=π and probability of failure=*(1-*π*)*. A coin toss is the simplest example of a Bernoulli trial in which π = (1-π) = 0.5.

The vertically bracketed term (m k) is the notation for a ‘Combination’ and is read as *‘m choose k’*. It gives you the number of different ways to choose *k* outcomes from a set of *m* possible outcomes.

In a regression model, we will assume that the dependent variable ** y** depends on an

*(n X p)*size matrix of regression variables

**. The**

*X**ith*row in

**can be denoted as**

*X***which is a vector of size**

*x_i**(1 X p ).*It corresponds to the ith outcome

*y_i*.

One usually expresses the probability of *y_i* taking a certain value *k* as *conditional upon the regression variables **X** taking the value *** x_i. **In notation form, it can be written as

*Pr(y_i=k|*

*X**=*

*x_i**)*and can be read as probability of

*y_i*being

*k*given that

**is**

*X***.**

*x_i*We can now state the probability distribution of the Binomially distributed ** y** in the context of a regression of

**over**

*y***as**

*X***follows:**

- On the L.H.S. of the above mentioned PMF equation, we will replace the unconditional probability
*Pr(y=k)*with the conditional probability*Pr(y_i=k|**X**=**x_i**).* - In the R.H.S, we will replace the unconditional probability
*π*of observing a success with the conditional probability*π_i*, where*π_i*is the probability of observing a success for the*ith*row of the data set, i.e. when the regression vector*X**=**x_i**.*

With these two substitutions, the PMF of the binomially distributed ** y **becomes as follows:

In the above equation, the probability of observing a success *π_i *for some X=x_i, is usually expressed as some function g(.) of ** x_i**. In symbolic terms:

The above set of concepts can be neatly visualized using a single illustration as follows:

In the above graph, *y_1, y_2, y_3,…y_10 *are ten binomially distributed random variables. They also happen to the the constituents of the dependent variable ** y** which is a (10 X 1) matrix as follows:

**=**

*y*[[y_1],

[y_2],

[y_3],

…,

[y_10]].

The corresponding regression variables matrix

**in this case also happens to be a 10 X 1 matrix since there is only one regression variable involved:**

*X*

*X**=*

[[1],

[2],

[3],

[4],

…,

[10]].

[[1],

[2],

[3],

[4],

…,

[10]].

Because, ** y** is a random variable with spread

*m*, the plot shows how for each value of

*X**=*

**,**

*x_i***can take any binomially distributed value around its**

*y**expected value*

*µ_i*where

*µ_i = m*π_i*and

*π_i*as we saw earlier is some function g(.) of

**. And thus, the expected value of**

*x_i**y_i*which is

*µ_i*, can be expressed as some function of

*x_i.*The Binomial Regression model is part of the family of **G**eneralized **L**inear **M**odels. GLMs are used to model the relationship between the *expected value *of a response variable ** y** and a

*linear combination*of the explanatory variables vector

**.**

*X*The relationship between *E(**y**|**X**) *and ** X** is expressed by means of a suitable

*link function, as*follows:

In the above equation, *g(.)* is the link function that connects the conditional expectation of ** y** on

*X**with a linear combination of the regression variables*

**.**

*x_i***being the matrix of regression variables of size**

*X**(n X p)*where

*n=*rows and

*p=*regression variables in each row, and

*X**=*

**being the**

*x_i**ith*row in this matrix of size (1 X p)

**and**

**being a**

*β**(p X 1)*vector of regression coefficients.

When ** y** is binomially distributed, we are interested in fixing the relation between the conditional expectation of

*the probability π of a single Bernoulli trial**on a particular value of*

*X**=*

*x_i**, i.e. E(*

*π**=π_i|*

*X**=*

*x_i**),*or concisely,

*π_i*

*|x_i**.*So the GLM equation for the Binomial regression model can be written as follows:

In case of the Binomial Regression model, the **link function g(.) **takes one of the following four forms (we’ll stop mentioning the conditional notation |X=x_i in each for simplicity, but just assume that it is there):

### The Logistic (logit) link function, also known as the log-odds function

The logistic is known as the log-odds function because it is expressed as the ratio of the probability of success to probability of failure, i.e. the log of the odds of success. We will be using this link function later on.

### The probit link function

The probit (short for probability unit) link function is used to model the occurrence of an event that has a binary Yes/No outcome. This link function is expressed as the inverse of the Cumulative Distribution Function Φ(.) of the standard normal distribution N(0,1).

The book **Regression analysis of count data** by Colin Cameron and Pravin K. Trivedi provides an excellent introduction to the Probit link function in *section 3.6: Ordered and Other Discrete-Choice Models*. In there, you will also find a very lucid derivation of why the Probit model’s link function happens to be the Inverse of the CDF Φ(.) of the normal distribution.

### The log-log function

The log-log function is useful for modeling ‘Poisson-like counting processes’ in which the parameter of the probability distribution (which often contains the mean) lies in the exponent of the probability distribution’s formula, *and* the parameter is also expressed as an exponent of a linear combination of the regression variables. Thus, it has the double exponent format: exp(exp( — *β.x_i**) *and therefore two consecutive logarithm operations are needed to bring the ** β.x_i** term down to ‘ground level’.

### The complementary log-log link function

The complementary log-log is called so because it operates on *(1-π_i) *i.e. the probability of failure, instead of *π_i.*

We’ll use the **logistic** a.k.a. the **logit** a.k.a. the **log-odds **link function to build our Binomial Regression model. Here it is once again, this time expressed in a slightly different way. On the R.H.S, I have replaced summation with the bolded vector notation:

## SECTION 2: Using the Binomial Regression model to predict the odds of survival on the Titanic

We’ll use the **Titanic data set** as an example to understand the kinds of use-cases that are appropriate for the Binomial regression model.

Here is a link to the original data set. The curated version of this data set used in the tutorial below can be downloaded from here.

The Titanic data set contains information about 887 of the 2229 souls aboard the ill-fated ocean liner Titanic. Each passenger’s record contains the following attributes:

- Name
- Passenger class (1/2/3)
- Sex
- Age
- Whether the passenger was accompanied by siblings, parents or children
- The Fare they paid, and most importantly,
**Whether they survived (1=Survived, 0=Died)**

Using Python and the *Pandas* data analysis library, let’s load the data set into a Pandas data frame, and print out the first few rows:

```
import pandas as pd
df = pd.read_csv('titanic_dataset.csv', header=0)
df.head(10)
```

This prints out the following output:

We’ll focus attention on four key attributes:

- Pclass (the passenger’s class)
- Age
- Sex
- Survived (Whether they survived)

Let’s drop the rest of the columns from the Data Frame:

```
#Drop the columns that our model will not use
df = df.drop(['Name','Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'], axis=1)
#print the top 10 rows
df.head(10)
```

This prints out the following:

#### Building the case for the Binomial Regression Model

We’ll postulate that while the Titanic was going down, the combination of

[Pclass, Age, Sex]have greatly influenced the odds of a passenger’s survival.

Note that the *‘Survived’ *column contains a [0, 1] Bernoulli random variable.

So can we say the following?

*Regression variables X** = [Pclass, Age, Sex], *and,

*Dependent variable is the BOOLEAN VARIABLE y** = [Survived]*

No, we cannot. Let’s see why…

## Why not use the Probit Regression model?

Since ** y** is a boolean variable, it may seem like a straight-forward case for using a

**Probit Regression model**. But notice that if one is unfortunate enough to be on a ship such as the Titanic, what one wants to know is

*not*the answer to the binary question: will I survive with 100% certainty or will I die with 100% certainty? Instead, what is more useful to know are the

*odds of survival*.

For example, if you are a 22 years old woman in the second class cabin of the ship, you’ll want to know if your odds of survival are 1 out of 10, 1 out of 4, 1 out of 50 etc.

But the manner in which the Titanic data set is organized, the response variable ** survived** has a yes/no i.e.

**1/0 format.**In other words,

*survived*has a Bernoulli distribution, i.e. :

*Pr(**survived**=0) = *π*,Pr(*

*survived**=1) = (1-*π

*)*

Where π is some probability between 0 and 1.

What we want is for ** y** to express the odds, i.e. the ratio of successes (survivals) to failures (deaths), in

*m*independent, identical trials.

*In other words, we want is for **y** to have a **Log-Odds distribution**.*

And therefore, instead of using a True or False, 1 or 0 type Probit regression model, what we want to do here is build a Binomial regression model where the response variable is Binomially distributed, and the link function is the **Logit **i.e. log-odds function.

The link function will allow us to link the odds of survival to a linear combination of the regression variables *X**=[Pclass, Age & Sex]* plus the intercept, as follows:

## How to transform *y* from a Bernoulli variable to a Binomial variable:

To transform the response variable ** y** from Bernoulli to Binomial, we must group together the data set rows that share the same combination of

**values [Pclass, Age and Sex]. Before we go about doing that, there is one little thing we need to take care of, and that is bucketing of the Age attribute. You see, Age, the way it is expressed in the data set, is a continuous variable that ranges from 0.42 to 80.**

*X*It hardly seems plausible that babies that were 0.42 years and 0.67 years old respectively would have had different odds of survival. Ditto logic holds true for youths with ages 26, 27, 28, 29 etc. years old, and so on for other cases.

We need to make the age data more granular so as to limit the number of groups. Let’s do this by bucketing the overall age range into bins of size 5 years and label each bin like so:

(0, 5] → 5

(5, 10] → 10

(10, 15] → 15 and so on.

The *pandas.cut() *method does the bucketing very neatly:

```
#define the bins
age_range_bins=[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80]
#define the label for each bin. Num labels = Num bins - 1
age_range_labels=[5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80]
#Cut up the age range into multiple bins and stuff them into a new Age_Range column
df['Age_Range']=pd.cut(df['Age'],age_range_bins,labels=age_range_labels)
#Print the output
df.head(10)
```

We get the following output. Note the new **Age_Range **column we have added:

Let’s drop the Age column as we’ll use Age_Range instead:

```
#Drop the age column
df = df.drop(['Age'],axis=1)
```

Now that we have our explanatory variables set up the way we want, let’s get to work on grouping together the samples by the combination *[Pclass, Sex, Age_Range]. *We’ll use the Pandas *groupby()** *method.

```
#Group by ['Pclass', 'Sex', 'Age_Range']
groups = df.groupby(['Pclass', 'Sex', 'Age_Range'])
#Get the counts for each group. This is the number of passengers in each group who have survived
df_grouped_counts = groups.count()
#Get the size (number of passengers) in each group
df_grouped_survived = groups.sum()
```

Merge the number of survivors and number of passengers for each group into each grouped data frame. (We’ll see how this helps us in a minute):

```
df_grouped_counts.to_csv('df_grouped_counts.csv')
df_grouped_counts_1 = pd.read_csv('df_grouped_counts.csv', header=0)
df_grouped_survived.to_csv('df_grouped_survived.csv')
df_grouped_survived_1 = pd.read_csv('df_grouped_survived.csv', header=0)
```

Finally, let’s construct a new Data Frame which contains:

- The grouped columns Pclass, Sex, Age_Range,
- The corresponding number of survivors in each group,
- The total number of passengers in each group i.e. the group size, and,
- The number of passengers in each group who died.

```
#Create a new Data Frame
df_grouped = pd.DataFrame()
#Copy over the Pclass, Sex and Age Range columns
df_grouped['Pclass'] = df_grouped_counts_1['Pclass']
df_grouped['Sex'] = df_grouped_counts_1['Sex']
df_grouped['Age_Range'] = df_grouped_counts_1['Age_Range']
#Copy over the num passengers from the counts grouped Data Frame
df_grouped['Total'] = df_grouped_counts_1['Survived']
#Copy over the num survivors from the summation grouped Data Frame
df_grouped['Survived'] = df_grouped_survived_1['Survived']
#Add a column containing the number who died
df_grouped[’Died’] = df_grouped[’Total’] - df_grouped[’Survived’]
```

Let’s print out the first 20 rows of the grouped data set:

```
df_grouped.head(20)
```

Let’s see what the grouped data set is telling us. I have highlighted row numbers 9, 14 and 19 for illustration:

In row #9, we find that there were 10 women in the age range (45, 50] with a first class ticket of which 9 survived. Thus the odds of survival for a woman in this group were pretty good (9 to 1), especially if she occupied a first class cabin.

In row #19, we see there were 4 male passengers aged (15- 20] of which only one survived.

In row #14, we see that there weren’t any women passengers aged (70–75] who occupied a first class cabin. This is why we are seeing NaNs in the aggregate columns for the group: [1, female, 75].

Let’s remove all such NaN rows from the Data Frame:

```
df_grouped = df_grouped.dropna()
```

Before we build the Binomial model, let’s take care of one final data preparation task, namely, let’s replace the ‘female’ and ‘male’ strings with integers 1 and 2:

```
df_grouped=df_grouped.replace(to_replace={'female': 1, 'male': 2})
```

## Building the Binomial Regression Model using Python and statsmodels

We’ll use the excellent support offered by the statsmodels library for building and training the Binomial Regression model.

Let’s carve out the training and testing data sets:

```
import numpy as np
#Separate out the training and test sets
mask = np.random.rand(len(df_grouped)) < 0.85
df_train = df_grouped[mask]
df_test = df_grouped[~mask]
```

Let’s set up the regression model’s formula using the **Patsy**** **syntax. What we are saying in below mentioned formula is that the dependent variable is a matrix composed of the ** Survived **and

**columns of the dataframe, while the regression variables are**

*Died***,**

*Pclass***and**

*Age_Range***.**

*Sex*```
#Construct the Binomial model's regression formula in Patsy syntax.
formula = 'Survived + Died ~ Pclass + Age_Range + Sex'
```

Using this formula, let’s carve out the ** X** and

**design matrices from the training and testing data frames which we created a minute ago:**

*y*```
from patsy import dmatrices
#Carve out the training matrices from the training data frame using the regression formula
y_train, X_train = dmatrices(formula, df_train, return_type='dataframe')
#Carve out the testing matrices from the testing data frame using the regression formula
y_test, X_test = dmatrices(formula, df_test, return_type='dataframe')
```

Next, we feed *X_train* and *y_train* into an instance of the Binomial Regression model class and train the model:

```
import statsmodels.api as sm
binom_model = sm.GLM(y_train, X_train, family=sm.families.Binomial())
binom_model_results = binom_model.fit()
```

Let’s print out the fitted model summary:

```
print(binom_model_results.summary())
```

This prints out the following:

#### How to Interpret the regression results:

In the above output, statsmodels is telling us that it has trained a **Generalized Linear Model** of type **Binomial **because, well, we asked it to, that it used the **log-odds link function** and it has used the **Iterative Re-weighted Least Squares (IRLS)** algorithm for training our model**.**

Statsmodels is reporting that our model has 3 degrees of freedom: ** Sex, Pclass and Age_Range, **which seems about right:

For Binomial models, statsmodels calculates three goodness-of-fit measures for you: **Maximum Log-likelihood, Deviance **and **Pearson Chi-squared**. We won’t inspect them any further as all three measures are useful only when you are comparing the goodness-of-fit of two or more Binomial regression models which in this case, we aren’t:

Let’s look at the ** fitted **model’s coefficients:

All regression coefficients are statistically significant at the 0.1% margin of error as indicated by the p-values which are all < 0.001:

Let’s see what each coefficient is telling us.

### How to Interpret the model coefficients:

For the logit link function, the fitted coefficients can be interpreted as follows:

**Age_Range: **It’s coefficient is -0.0446. Note the negative value. The way to interpret this coefficient’s value is that, keeping all other variables constant, **for each unit increase in the passenger’s age, the odds of their survival decreased by a factor = exp(-0.0446) = 0.9564. **i.e. for each unit increase in the passenger’s age, one needs to multiple their survival odds by 0.9564, thereby reducing the odds of survival by a certain amount each time. For example, if a 12 year old male occupant of a 2nd class cabin had a known survival odds of 8:9 during the disaster, then a 22 years old male occupant of a 2nd class cabin had an odds of survival of (8/9) * 0.9564¹⁰ = approximately 6:10.

**Pclass:** The coefficient for Pclass is -1.2008. Again note the negative sign. A downgrade of the cabin class of a passenger had an even more dramatic effect on the passenger’s odds of survival aboard the Titanic. The Pclass variable is coded as First class cabin=1, Second class cabin=2 and Third class cabin=3. **So for every unit increase in the cabin class i.e. as one goes down from 1st class to 2nd class to 3rd class, the odds of survival, keeping age and sex constant, reduce by a factor of exp(-1.2008) = 0.30! **i.e. for each unit downgrade, your odds of survival get multiplied by 0.30. For e.g., if a 30 years old male occupant of a 1st class cabin had a 7 : 9 odds of survival on the Titanic, just dropping him down one class to class 2, reduced his odds of survival to (7/9)*0.3 = approximately 1:4. Bumping down the class further to the 3rd class reduced the odds to (7/9)*0.3*0.3 = 7 : 100.

**Sex: **Finally, notice the very heavy negative coefficient of -2.6526 for the Sex variable. Aboard the sinking Titanic, male passengers had quite miserable chances of survival as compared to female passengers. Keeping Pclass and Age constant, **the odds of survival of a male passenger was only exp(- 2.6526) = 7% of those of a female passenger**.

## Prediction

Recollect that we had put aside the test data set in the Data Frame *df_test*. It’s time to test our model’s performance on this data set. To do that, we’ll first add a *Percentage Survived* column to the test data frame whose value we’ll ask our model to predict:

```
df_test['Pcnt_Survived'] = df_test['Survived']/df_test['Total']
```

We’ll use the ** .predict()** method on the results object and pass the test data set get the predicted survival rate:

```
predicted_survival_rate = binom_model_results.predict(X_test)
```

Let’s plot the actual versus predicted survival rate:

```
import matplotlib.pyplot as plt
plt.xlabel('Actual Survival Rate')
plt.ylabel('Predicted Survival Rate')
plt.scatter(df_test['Pcnt_Survived'], predicted_survival_rate, color = 'blue')
plt.show()
```

As you can see, the fit becomes unacceptable when the survival rates are toward the top of the range i.e. 1.0. To a large extent, the accuracy of the prediction is determined by the sample size i.e. the size of each group of passengers, grouped by the tuple [Pclass, Sex, Age Range]. For some groups in the training set, the group size is too small for the model to train in a meaningful way. For such combinations in the test data set, the accuracy will be understandably low.

Here is the link to the complete source code:

Here is the link to the Titanic data set used in the tutorial.

## Summary

- A Binomial Regression model can be used to predict the odds of an event.
- The Binomial Regression model is a member of the family of Generalized Linear Models which use a suitable link function to establish a relationship between the conditional expectation of the response variable
with a linear combination of explanatory variables*y*.*X* - In the Binomial Regression model, we usually use the log-odds function as the link function.
- The Logistic Regression model is a special case of the Binomial Regression model in the situation where the size of each group of explanatory variables in the data set is one.

## Citations and Copyrights

### Data Set

The Titanic data set has been downloaded from Stanford’s CS109 class website. **Curated data set download link**.

### Books

M McCullagh P., Nelder J. A., Generalized Linear Models”, *Chapman and Hall/CRC; 2nd edition (August 1, 1989)*, ISBN-13 : 978-0412317606

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **An Overview Of The Generalized Linear Regression Model

**NEXT: **The Holt Winters Exponential Smoothing Model For Time Series Data