###### We will learn what are Survival and Hazard Functions, the Kaplan-Meier Estimator, and how to build a proportional hazards regression model using Python and the Lifelines library

## A two-sentence description of Survival Analysis

Survival Analysis lets you calculate the probability of failure by death, disease, breakdown or some other event of interest *at*, *by*, or *after *a certain time. While analyzing survival (or failure), one uses specialized regression models to calculate the contributions of various factors that influence the length of time before a failure occurs.

## What is it used for?

In **medicine**, survival analysis is used to measure the efficacy of drug and vaccine candidates in randomized controlled trials. It is also used to measure the effects of lifestyle factors such as smoking and obesity on lifespan (more accurately, the disease-free lifespan) and to calculate the effect of interventions such as chemo, radiation and surgery on the post-intervention lifespan.

In **insurance**, it is used to compute mortality rates and lifespans.

In **reliability engineering**, survival analysis is used to estimate the probability of failure and the failure rate of hardware and machines and to calculate the mean time between failures and the mean time to first failure.

In **finance**, survival analysis can be used to predict when a stock will split and when (or if) a company will default on its debt and many other things.

In **sociology and law**, it has been used to measure the probability of a repeat offense being committed after an inmate is released from prison.

The mathematical structure of survival analysis is general enough that it has found uses in areas that are seemingly unrelated to survival, failure, disease and death.

## Pairs of events

The intuition for the field is obtained by considering a pair of events, any sorts of events, separated by some duration. Here are a couple of examples:

In case of machines, the two events will represent the times of consecutive failures.

In case of stocks, the two events could be the times of consecutive stock splits.

In each case, one wants to know primarily two things:

- What is the
**relative risk of occurrence**of event 2 under different combinations of event 1. For e.g. what is the relative risk of occurrence of disease in a 6 month period after a vaccine short versus a placebo shot. - What are the
**factors influencing the duration**between events 1 and 2? Another way of asking the same question is what is the probability that event 2 will occur before, or at, or after a certain time*conditioned upon the influencing factors such as age, operation environment, genetic makeup etc.*?

Survival Analysis provides you the tools, the techniques, and the regression models to answer all these questions.

## What is a good way to *quickly* learn Survival Analysis?

Learning about the entire field of Survival Analysis is no picnic. But you don’t need to lose sleep over learning about the whole field.

We’ll quickly get you to the point where you can start using Survival Analysis in practical settings. We’ll do that by first learning about the **three protagonists **of the field namely:

**The random variable T,****The Survival Function and,****The Hazard Function,**

We’ll follow that up by learning about the **Kaplan-Meier Estimator**. The Kaplan-Meier Estimator is used to estimate the Survival Function, and to compare the chances of survival within two populations that differ in one or more variables.

Finally, we’ll gain an understanding of the **proportional hazards regression models**, and particularly the **Cox Proportional Hazards model**. The proportional hazards models are the work horses of the survival field. They are used to estimate the contribution of different factors on the duration of survival.

All along, we’ll use the Stanford Heart Transplant data set to illustrate the concepts and we’ll use Python and the Lifelines library to build and train the models and to make the predictions.

Let’s begin!

## Getting to know the protagonists

Let’s get acquainted with the three protagonists of the field:

**The random variable T****The Survival Function****The Hazard Function**

### The random variable T

The random variable at the center of Survival Analysis is *the time of occurrence of the event of interest*. It’s called **T** or tau (**τ**).

**T** represents an event of interest such as:

- The time of death
- The time of failure or breakdown of a machine
- The time at which a volunteer catches the disease in a vaccine trial
- The time at which a company declares bankruptcy or defaults on a loan.

The **domain **of **T **is **time** itself. Thus the domain is the set of non-negative real numbers in the set [0, ∞) and therefore **T** is a **continuous random variable**.

The **range** of **T** is the **probability of occurrence** of the event such as death, failure, loan default etc at time *t*, before time *t* or after time *t*.

Thus, **T** maps time of event *t* to the probability that the event of interest will occur at or before or after *t*.

Notation-wise, it is one of the following probabilities:

*P(**T** < t) P(*

*T**= t)*

P(

P(

*T**> t)*

…where *t ∈ [0, ∞)*.

Often, the three kinds of probabilities mentioned above are conditional probabilities. Here are two examples of how they can be conditional:

- One would want to know the probability that the time of death of a patient is greater than some time
*t**given that the patient is alive at time t.* - One would want to know the probability that your car will break down in the tiny interval
*(t, t+δt]*,*given that it has not broken down up through time t*.

There is one other very important way in which this probability is conditional, namely, it’s being conditional on regression variables such as the individual’s age, the vehicle’s operation conditions, the company’s cash-flow history etc. But we’ll get to this form of conditionality a little later when we look at regression models for survival.

Meanwhile, the above two examples bring us straight to the other two protagonists, the survival function and the hazard function.

### The Survival Function

The Survival Function *S(t)* takes time *t* as a parameter and outputs the probability that the survival time is *greater than t*.

*S(t) = P(**T** > t)*

One way to understand the survival function is to chop up the time axis into discrete time steps as show in the figure below. *S(t_8) *is the probability of making it past *t=t_8*:

Let’s illustrate how to construct the Survival Function using the Stanford heart transplant data set.

I have uploaded the CSV version of this data set at this location.

#### The Stanford heart transplant data set

The data set contains a set of 103 heart patients whose survival time in years is tracked from the point of induction into the study.

Using Python and Pandas, we’ll load the data into memory:

```
import pandas as pd
df = pd.read_csv('stanford_heart_transplant_dataset_full.csv')
```

Print out the columns in the data set:

```
df.columns
```

We see the following list:

Index(['PATIENT_ID', 'YR_OF_ACCEPTANCE', 'AGE', 'SURVIVAL_STATUS',

'SURVIVAL_TIME', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS',

'WAITING_TIME_FOR_TRANSPLANT', 'MISMATCH_ON_ALLELES',

'MISMATCH_ON_ANTIGEN', 'MISMATCH_SCORE'],

dtype='object')

The two columns of immediate interest to us are the following ones:

**SURVIVAL_TIME: **The number of days the patient survived after induction into the study.**SURVIVAL_STATUS: **1=dead, 0=alive at **SURVIVAL_TIME** days after induction.

Let’s sort the data by **SURVIVAL_TIME **and print out the first15 rows for these two columns:

```
df = df.sort_values(by='SURVIVAL_TIME')
df[['SURVIVAL_TIME','SURVIVAL_STATUS']].head(15)
```

Here’s the output we get:

SURVIVAL_TIME SURVIVAL_STATUS

0 1 1

1 2 1

2 2 1

3 2 1

4 3 1

5 3 1

6 3 1

7 5 1

8 5 1

9 6 1

10 6 1

11 8 1

12 9 1

13 11 0

14 12 1

Let’s color code the output by the number of days of survival:

One sees from this picture that one patient has died at one days after induction into the study, three patients have died at the two day mark, three more died at the three day mark and so one.

Notice row # 13. Eleven days after induction, this patient exited the study while still alive. This is an example of a **right-censored data point**. We’ll soon see how to deal with such data as it’s quite common in Survival settings.

Let’s try to visualize the above data set in the following way: we’ll assume that only 15 patients are inducted into the study and all of them are inducted at the same time *t_0*. With that assumption, the following graphic illustrates the situation shown in the color-coded table we saw earlier:

#### What is the probability of survival beyond t=t_6?

*S(t_6)* which denotes the probability of survival for a patient beyond *t_6 *is simply the probability that they did not die in each one of the preceding six intervals *(t_0 ,t_1], (t_1, t_2], (t_2, t_3]…(t_5, t_6]*.

The probability that the patient did not die in *(t_i, t_(i+1)] *is simply the complement of the **conditional probability **that they did die in the interval *(t_i, t_(i+1)]* *given that they were still alive at t_i.* Notation-wise, it’s expressed as follows:

The probability on the R.H.S. can be calculated as follows:

Combining the above two results yields the following:

Recollect that *S(t_6) *denotes the probability of survival for a patient beyond *t_6.* It is the probability that the patient did not die in each one of the preceding six intervals *(t_0 ,t_1], (t_1, t_2], (t_2, t_3]…(t_5, t_6]*. We can use the above equation to calculate S(6) as follows:

Let’s use the above formula to calculate *S(t_6) *for the Stanford heart patient data set. We’ll reproduce the first 15 rows for reference:

Here’s the table of *d_i* and *n_i* calculated from the above table using MS Excel. We make the simplifying assumption that all 103 patients are inducted at *t=t_0*:

With the table of *d_i *and *n_i *in place, we calculate *S(t_6)* as follows:

Here is how we do it in Excel:

#### How to deal with right-censored data

Recollect that row 13 in the data set contains a patient who left the study after 11 days while still alive. This patient will contribute to the count *n_11 *which is the number of patients who made it to 11 days. But this patient cannot be counted while counting *n_12, n_13, *etc. since he/she was simply not present beyond 11 days. Note that such patients also don’t count toward any of the death counts either, as they did not die while being in the study. Hence, the impact of such a right-censored data point at *t=t_11* is on the denominators *n_12, n_13 etc. *while calculating the Survival Function’s value at *t=t_12, t_13*, etc. i.e. *the value S(t) *for all *t > t_11*.

Right censored observations are very common in survival data.

### The Kaplan-Meier Estimator

The above mentioned technique for constructing a Survival Function can be easily generalized as follows:

The above formula is known as the **Kaplan-Meier estimator** after its co-creators.

The

Kaplan-Meier Estimatoris used to estimate the Survival Function, and to compare the chances of survival within two populations that differ in one or more variables

It’s important to note that Kaplan-Meier estimator produces *an estimate of the true (unknown) survival function*.

The

Kaplan-Meier estimatoris a random variable that has a mean, a variance and a probability distribution, and each estimate of the true S(t) that it generates comes bracketed with confidence intervals.

If you plot the value of *S(k)* for different values of *k* using the above formula, you will get what is known as the **Kaplan-Meier curve **or **Kaplan-Meier plot**.

Let’s construct this plot for the heart transplant data set using Python and the Lifelines library:

```
from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
kmf = KaplanMeierFitter(label='Stanford heart transplant dataset')
kmf = kmf.fit(durations=df['SURVIVAL_TIME'], event_observed=df['SURVIVAL_STATUS'])
kmf.plot()
plt.show()
```

We see the following plot of *S(t): probability of survival for T > t *against t:

Each point (x, y) on the above Kaplan-Meier plot represents the probability *y* that patient survived beyond time t=x. i.e. S(t=x) = y.

Let’s mark off the point *(6, 0.9126)* corresponding to *S(t=6)* which we hand-cranked out earlier. The corresponding probability is 91.26% that the patient survived beyond 6 days after induction into the study.

```
plt.plot(6, 0.9126, color='red', marker='x', markeredgewidth=1, markersize=10)
kmf.plot()
plt.show()
```

#### When T is a continuous random variable

When T needs to be considered as a continuous variable, one has to work with probability densities, specifically the Probability Density Function (PDF) and the Commutative Density Function (CDF) which is the integration of the PDF over some interval of time.

Let *f(T=u) *be the probability *density *of failure, disease, death, or in general some event of interest, *by* *T=u*. Then the probability of that event happening by time *T ≤ t* is given by the **C**umulative **P**robability **F**unction *F(t) *as follows:

This CDF is essentially the area under the PDF curve in the interval *(0, t], *assuming time is always zero or positive:

Since the probability of failure or death at or before time *T ≤t* is *F(t)*, the probability of death or failure *not* happening by *T ≤t* is *1 — F(t). That is same as the probability of time of death or time of failure being at some time after T = t. *Which by definition, is the survival function *S(T > t).*

Thus, we arrive at this important result that is applicable to both discrete and continuous time scales:

### The Hazard Function

The Hazard Function *h(t)* gives you the **density of instantaneous risk**, or in other words, the probability of occurrence of the event at *T=t*, assuming that the event has not occurred up through time* t***.**

The Hazard Function can be used to answer questions like what is the chance that your 15 year old Toyota which has not broken down even once, will break down on the very morning of your big trip to the mountains!

Since *h(t)*, the hazard function, is the probability **density at **time *t*, ** h(t)δt **is the probability of your Toyota’s breaking down during a vanishingly small time interval

**i.e. just when you were about to leave on your trip at time**

*(t, t+δt*]*t*assuming that it had not broken down up through time

*t*.

Therefore, *1 — h(t)δt* is the probability of your vehicle’s **not **breaking down during the interval *(t, t+δt*] given that it has also not broken down up through time *t*. In other words, *1 — h(t)δt *is the probability that the car did not break down all the way up through *t+δt.*

Recall that *F(t) *is the probability of failure by time* t*. So *1-F(t+δt) *is the probability that your vehicle has not failed up through time *t+δt.*

Therefore, we have the following relationship between* h(t)* and *F(t)*:

Plugging in the two probabilities on the R.H.S.:

After rearranging the terms, we get:

But what we are interested in is the instantaneous risk, more accurately the density of the instantaneous risk, at time* t*. To get the instantaneous risk density at t, we make *δt *arbitrarily small. We do this by finding the limit of the above equation as *δt → 0*:

The limit converts the CDF *F(t) *to its derivative *F’(t)*:

We saw earlier that *1-F(t)* is the Survival Function *S(t)*, which is the probability of your car breaking after time *t. *Furthermore, *F’(t)* = *f(t) *which is simply the probability density of break down by time *t*.

Substituting, *f(t) *and *S(t)*, we get one of the fundamental equations of the field of Survival Analysis — an equation that expresses the instantaneous probability density of failure *at *time *t* in terms of the probability density of failure *by *time *t* and the probability of failure *after *time *t*:

#### Relationship between the Hazard Function and the Poisson Process

The Hazard Function *h(t) *is the instantaneous density of risk at some time *t*. The density aspect of it makes it a natural candidate for *h(t) *to be considered as a rate. In fact, another name for the Hazard Function is the Failure Rate with units of number of failures per unit time *at a certain time t*.

A Poisson process is characterized by the Poisson event rate* λ *which has the units of events per unit time.

Therefore, in situations where the hazard function *h(t)* has a constant value at all times *t*, this constant value is often represented by the symbol* λ,* and the underlying random process that displays the characteristic of a constant hazard rate is in fact a Poisson process with a constant failure rate* λ.*

Thus we have:

*h(t) = λ*

The hazard rate of systems such as factory machinery, commercial HVAC systems and your 15 year old Toyota, all of which (presumably) undergo periodic maintenance and repair can be modeled using a constant hazard rate. For such systems, the **M**ean **T**ime **B**etween **F**ailures (**MTBF**) is simply 1/*λ.*

When *h(t)* = a constant rate *λ*, we also have the following interesting results:

The Survival Function *S(t) *which gives you the probability of failure occurring after time *t*, is *1 — F(t). *Hence in the constant hazard rate situation:

#### How to compare the survival functions of two samples based on a boolean explanatory variable?

Consider two samples *S_1 and S_2* from the Stanford heart transplant data set which are differentiated by the value of the boolean explanatory variable PRIOR_SURGERY as follows:

S_1 = [Set of all patients who have had at least one open-heart surgery prior to entry into the study. PRIOR_SURGERY=1]

S_2 = [Set of all patients who have **not **had any open-heart surgery prior to entry into the study. PRIOR_SURGERY = 0]

Let’s construct and plot the survival function for each sample:

```
#load the dataset into memory if you haven't done so already
df = pd.read_csv('stanford_heart_transplant_dataset_full.csv')
#create a differentiated index based on PRIOR_SURGERY=1
idx = df['PRIOR_SURGERY'] == 1
#Create the respective time series
survival_time_with_prior_surgery, survival_status_with_prior_surgery = df.loc[idx, 'SURVIVAL_TIME'], df.loc[idx, 'SURVIVAL_STATUS']
survival_time_without_prior_surgery, survival_status_without_prior_surgery = df.loc[~idx, 'SURVIVAL_TIME'], df.loc[~idx, 'SURVIVAL_STATUS']
#Fit a Kaplan-Meier plot to the respective sample, with and without prior surgery
kmf_with_prior_surgery = KaplanMeierFitter(label='Group with prior heart surgery').fit(durations=survival_time_with_prior_surgery, event_observed=survival_status_with_prior_surgery)
kmf_without_prior_surgery = KaplanMeierFitter(label='Group without prior heart surgery').fit(durations=survival_time_without_prior_surgery, event_observed=survival_status_without_prior_surgery)
#plot the two curves
kmf_with_prior_surgery.plot()
kmf_without_prior_surgery.plot()
plt.show()
```

We see the following interesting result in the plot below, namely for any time *t*:

**S(t|with prior open heart surgery) > S(t|without prior open heart surgery):**

#### How to test the differences in survival probability beyond a point in time

Sometimes, one wants to know if two groups have a statistically meaningful difference in survival probability beyond a certain time. For example, one may want to know if the probability of survival beyond 180 days from induction into the study is different (in a statistically significant way) between patients who have and have not undergone prior open-heart surgery.

To do so, one can use a Chi-square(1) distributed test statistic described by Klein, J. P., Logan, B. , Harhoff, M. and Andersen, P. K. in their 2007 paper: *Analyzing survival curves at a fixed point in time.*

This statistic is implemented in the lifelines library and can be used as follows:

```
#import the test
from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
#Perform the test at t=365 days
results = survival_difference_at_fixed_point_in_time_test(point_in_time=365, fitterA=kmf_with_prior_surgery, fitterB=kmf_without_prior_surgery)
#Print the test statistic and it's p-value for 1 degree of freedom on the Chi-square table
print('Chi-squared(1) Test statistic='+str(results.test_statistic) + ' p-value='+str(results.p_value))
```

We get the following result:

Chi-squared(1) Test statistic=5.652097138214293p-value=0.01743450982288937

The p-value of 0.017 is < 0.05. Therefore, we can say with greater than 98% confidence that the probability of survival beyond one year from induction into the study **is different **between patients who have and have not undergone prior open-heart surgery.

We have seen how the value of the variate PRIOR_SURGERY influences *S(t)* and therefore the hazard *h(t)* experienced by the corresponding patient.

We would like to know is *by how much* does the hazard* h(t) *at time *t *reduce (or increase) when a patient has had prior open heart surgery. We would also want to measure the effect of other non-boolean variables such as age on the increase or decrease in the hazard *h(t) *experienced by the patient. For example, one would may reach a conclusion that given two study patients who are otherwise similar except for their ages, the patient who is older will experience a hazard that is 10% higher than the younger patient for each year that they are older than the younger patient. In other words, *h_older(t) = h_younger(t)*(1.1)^(k)* where *k* is the difference in ages in number of years.

To reach such measurable conclusions, we need to use a class of regression models known as the **proportional hazards model.**

## The Proportional Hazards Model

Earlier, we looked at a special case of the hazard function *h(t)* where the hazard rate was a constant *λ *events per unit time *at all times. *In general, it is useful to express *λ* as a function of time i.e. as *λ(t) *because the hazard often varies with time.

It is also useful to think of *λ(t) *as the ‘**baseline**’ hazard rate experienced by all members in the study. We call *λ(t) *the baseline hazard rate because at any time *t* during the study, the actual hazard rate *h_i(t)* experienced by the *ith* member of the sample is likely to be a function of 1) the baseline hazard *λ(t) *and 2) certain characteristics of the *ith* study participant. In case of human members, such characteristics could be their age, gender, genetic makeup etc. In case of machines such as your Toyota, they could be the age of the vehicle, the operating conditions, model of the vehicle, how regularly it has been serviced etc. In the regression model, these characteristics form the regression variables or the ‘covariates’ vector ** x_i** for the study individual

*i*. Thus, the actual hazard rate

*h_i(t)*experienced by the

*ith*individual can be expressed as a function of:

- The baseline hazard rate
*λ(t)*experienced by everyone in the study, - The regression variables vector
for the ith individual, and*x_i* - The regression coefficients vector
*β*

Since the hazard rate is always non-negative, we can use the following technique drawn from the discipline of **Generalized Linear Models** to bring together *λ(t), *** x_i** and

**:**

*β*The exponentiation *‘e^(**x_i*β**)’* ensures that the *h_i(t)* stays non-negative no matter what values are observed for the regression variables vector ** x_i**.

The above equation gives you the hazard rate experienced by the *ith *individual. If the study contains *n *individuals, we replace *h_i(t)* with a vector *h**(t)* of size *(n x 1)* and we replace the regression vector ** x_i** with the matrix of regression variables

**of size**

*X**(n x m)*where

*m*is the number of regression variables per individual. The dimensions of the vector of regression coefficients

**stay the same namely**

*β**(m x 1)*. The following figure illustrates this situation for a data set of size n individuals:

If we take the ratio *h_i(t)/h_j(t)* of the hazard experienced by any two individuals *i* and *j* at a given time *t*, we get the following interesting result:

The ratio of hazards depends only on the difference in the observed values of regression variables.

## The Cox Proportional Hazards Model

The concept of proportional hazards described above was used by Cox, D. R. in his influential 1972 paper “*Regression Models and Life-Tables*”.

To understand how the Cox model works, let’s refer back to our Heart Transplant data set. This time, we’ll look at a subset of rows from the Heart Transplant data set (I have highlighted row 6 below which we’ll get to in a bit):

We’ll consider three regression variables which will form the ** X** matrix:

**AGE**: The patient’s age when they were inducted into the study.

**PRIOR_SURGERY**: Whether the patient had at least one open-heart surgery prior to entry into the study.1=Yes, 0=No

**TRANSPLANT_STATUS**: Whether the patient received a heart transplant while in the study. 1=Yes, 0=No

The response variable ** y** is the variable

**SURVIVAL_TIME**: the number of days a patient survived after induction into the study.

### Regression Goal

The goal of our regression model is to estimate the ** expected survival duration y** given the

**, and in doing so, calculate the influence of each regression variable**

*regression variables X***AGE**,

**PRIOR_SURGERY**and

**TRANSPLANT_STATUS**on

**SURVIVAL_TIME**.

### Regression Strategy

To achieve the regression goal, we will do the following:

- We will construct the equation for the
**P**robability**D**istribution**F**unction ofgiven*y*denoted as*X**P(**y**=y_i|**X=x_i**)*. This PDF needs to express probability of survival*y**=y_i*for the*ith*individual as a function ofThe baseline hazard*three factors:**λ(t)*at time*t*The regression variables*,**X**=*for individual*x_i**ith*individual (we’ll assume time-invariant regression variables), and,

The regression coefficients*β* - Once we have this conditional PDF in place, we will use
**M**aximum**L**ikelihood**E**stimation to solve for the regression coefficientsThis MLE step gives us the trained regression model.*β.* - We can use this trained model to estimate the influence of the regression variables on the response variable
, and to get the expected survival duration*y**y_k*of the*kth*individual given=*X*and coefficients*x_k**β.*

To construct the PDF, we’ll use the concept of Partial Likelihood that Cox made use of in his model. Don’t worry if you don’t know what Partial Likelihood is. We’ll illustrate the concept below:

Let’s zoom into row 6 in the subset of rows from the heart data set shown above. Row 6 tagged by the red arrow shows that one patient died at t=8 days from the start of the study. This patient was 53 years old, had no evidence of prior open heart surgery and the patient died without receiving a transplant during the study.

The above subset of data also shows that 13 patients tagged by the orange arrow survived past t=8 days. One can say that just before t=8 days, there were 14 patients alive. Out of these 14, any one of them may have been the patient who passed away at t=8. The unconditional probability of any given patient dying at t=8 is simply 1/14. But we can do better than the unconditional probability.

We will postulate that the probability that a particular patient out of this set of 14 will die at t=14 is conditioned upon certain characteristics of that patient such as their age, prior open heart surgery and whether they received a transplanted heart.

Cox has provided the following formula for estimating this conditional probability:

The probability is expressed as a ratio of two hazards. The numerator is the hazard rate experienced by the *ith *patient as a function of the baseline hazard *λ(t) *at time* t*, the regression variables ** x_i** for the

*ith*patient and the regression coefficients

**. The denominator is the sum of hazard rates of all individuals (including the**

*β**ith*individual) who have made it through to

*t=8*days.

We see that the baseline hazard rate *λ(t) *gets cancelled out so that the probability does not depend on the time t but only on ** X** and

*β.**This is mighty helpful as the baseline hazard function h(t) is often not known and its exact shape is hard to guess.*

The above equation is called Partial Likelihood because it discounts the presence of the baseline hazard *λ(t).*

The general form of the above equation is as follows:

Where the denominator is the set of all individuals (including the *ith* individual) who have made it to the *ith *time point.

The **M**aximum **L**ikelihood **E**stimation procedure will optimize the coefficients ** β **in such a way that it will maximize the probability of observing the training data set (

**,**

*X***) where the probability of observing each**

*y**y_i*given

**is given by the above probability function. Running the MLE procedure on the training data set results in a trained Cox model which can be used for estimating the expectation of**

*x_i**y_i*for any given

*x_i.*All of this is taken care of by libraries such as Lifelines which implement the Cox Proportional Hazards model. We’ll use it below on the train a Cox model on the heart transplant data set.

We have already loaded the data set into memory. Let’s create a subset that contains the columns of our interest.

```
df2 = df[['AGE', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS', 'SURVIVAL_TIME', 'SURVIVAL_STATUS']]
```

Let’s carve out the training and test data sets.

```
import numpy as np
mask = np.random.rand(len(df2)) < 0.8
df2_train = df2[mask]
df2_test = df2[~mask]
print('Training data set length='+str(len(df2_train)))
print('Testing data set length='+str(len(df2_test)))
```

Import the Cox model:

```
from lifelines import CoxPHFitter
```

Create and train the Cox model on the training set:

```
cph_model = CoxPHFitter()
cph_model.fit(df=df2_train, duration_col='SURVIVAL_TIME', event_col='SURVIVAL_STATUS')
```

Display the model training summary:

```
cph_model.print_summary()
```

We get the following output:

Let’s look at the fitted regression coefficients for each regression variable, their exponents (e^*β)* and the corresponding confidence intervals:

AGE has an exponentiated coefficient of e^(0.07) = 1.07. This tells us that each unit increase in age of the patient at the time of induction into the study increases the hazard by 0.07 = 7%. The 95% confidence intervals range from 3% to 11%.

PRIOR_SURGERY has an exponentiated coefficient of e^(-0.75) = 0.47. Each unit increase in value of this variable *reduces *the hazard rate by (1–0.47)*100=53%. But PRIOR_SURGERY is a boolean variable. So it tells us the following:

Having an open heart surgery done prior to induction into the study seems to reduce the hazard rate by (1–0.47) *100 = 53%. Notice that the confidence intervals for this variable are very wide ranging from a 82% reduction to a 23% *increase *in the hazard rate. This is so because the standard error of this coefficient is very large (0.49).

Similarly, getting a transplant (TRANSPLANT_STATUS=1) during the study reduces the hazard rate by (1–0.19)*100 = 81% with the confidence intervals ranging from a 90% reduction in hazard rate to a 66% reduction in hazard rate.

There is other useful information in the training summary such as significance data for each regression variable. We see that AGE and TRANSPLANT_STATUS are significant at a 99.5% confidence level (p < 0.005) while PRIOR_SURGERY is significant only at a 88% confidence (p = 0.12).

### Prediction

For such a small data set, the quality (as measured by the mean squared error) of the estimated expected survival duration is not going to be good. A much greater benefit is obtained from measuring the contribution of regression variables to the survival duration like what we did in the earlier section. Just for reference, here is the way to get the estimated expectation of the survival times on the test data set:

```
expected_survival_times = cph_model.predict_expectation(df2_test)
```

Here is the complete source code:

## Citations and Copyrights

### Data set

The Stanford heart transplant data set is taken from *https://statistics.stanford.edu/research/covariance-analysis-heart-transplant-survival-data** *and available for personal/research purposes only. **Curated data set download link**.

### Software

Cameron Davidson-Pilon, Jonas Kalderstam, Noah Jacobson, Sean Reed, Ben Kuhn, Paul Zivich, … Skipper Seabold. (2021, May 27). CamDavidsonPilon/lifelines: 0.26.0 (Version 0.26.0). Zenodo. http://doi.org/10.5281/zenodo.4816284

### Papers

Cox, D. R. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society. Series B (Methodological) 34, no. 2 (1972): 187–220. http://www.jstor.org/stable/2985181.

Klein, J. P., Logan, B., Harhoff, M., & Andersen, P. K. (2007). Analyzing survival curves at a fixed point in time. *Statistics in medicine*, *26*(24), 4505–4519. https://doi.org/10.1002/sim.2864

### Book

McCullagh P., Nelder John A., Generalized Linear Models, 2nd Ed., CRC Press, 1989, ISBN 0412317605, 9780412317606

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **The Poisson Hidden Markov Model – Part 2 (Implementation in Python)

**NEXT: **The Stratified Cox Proportional Hazards Regression Model