Introduction to Survival Analysis - Statistical Modeling and Forecasting

We will learn what are Survival and Hazard Functions, the Kaplan-Meier Estimator, and how to build a proportional hazards regression model using Python and the Lifelines library

A two-sentence description of Survival Analysis

Survival Analysis lets you calculate the probability of failure by death, disease, breakdown or some other event of interest at, by, or after a certain time. While analyzing survival (or failure), one uses specialized regression models to calculate the contributions of various factors that influence the length of time before a failure occurs.

What is it used for?

In medicine, survival analysis is used to measure the efficacy of drug and vaccine candidates in randomized controlled trials. It is also used to measure the effects of lifestyle factors such as smoking and obesity on lifespan (more accurately, the disease-free lifespan) and to calculate the effect of interventions such as chemo, radiation and surgery on the post-intervention lifespan.

In insurance, it is used to compute mortality rates and lifespans.

In reliability engineering, survival analysis is used to estimate the probability of failure and the failure rate of hardware and machines and to calculate the mean time between failures and the mean time to first failure.

In finance, survival analysis can be used to predict when a stock will split and when (or if) a company will default on its debt and many other things.

In sociology and law, it has been used to measure the probability of a repeat offense being committed after an inmate is released from prison.

The mathematical structure of survival analysis is general enough that it has found uses in areas that are seemingly unrelated to survival, failure, disease and death.

Pairs of events

The intuition for the field is obtained by considering a pair of events, any sorts of events, separated by some duration. Here are a couple of examples:

In case of machines, the two events will represent the times of consecutive failures.

In case of stocks, the two events could be the times of consecutive stock splits.

In each case, one wants to know primarily two things:

What is the relative risk of occurrence of event 2 under different combinations of event 1. For e.g. what is the relative risk of occurrence of disease in a 6 month period after a vaccine short versus a placebo shot.
What are the factors influencing the duration between events 1 and 2? Another way of asking the same question is what is the probability that event 2 will occur before, or at, or after a certain time conditioned upon the influencing factors such as age, operation environment, genetic makeup etc.?

Survival Analysis provides you the tools, the techniques, and the regression models to answer all these questions.

What is a good way to quickly learn Survival Analysis?

Learning about the entire field of Survival Analysis is no picnic. But you don’t need to lose sleep over learning about the whole field.

We’ll quickly get you to the point where you can start using Survival Analysis in practical settings. We’ll do that by first learning about the three protagonists of the field namely:

The random variable T,
The Survival Function and,
The Hazard Function,

We’ll follow that up by learning about the Kaplan-Meier Estimator. The Kaplan-Meier Estimator is used to estimate the Survival Function, and to compare the chances of survival within two populations that differ in one or more variables.

Finally, we’ll gain an understanding of the proportional hazards regression models, and particularly the Cox Proportional Hazards model. The proportional hazards models are the work horses of the survival field. They are used to estimate the contribution of different factors on the duration of survival.

All along, we’ll use the Stanford Heart Transplant data set to illustrate the concepts and we’ll use Python and the Lifelines library to build and train the models and to make the predictions.

Let’s begin!

Getting to know the protagonists

Let’s get acquainted with the three protagonists of the field:

The random variable T
The Survival Function
The Hazard Function

The random variable T

The random variable at the center of Survival Analysis is the time of occurrence of the event of interest. It’s called T or tau (τ).

T represents an event of interest such as:

The time of death
The time of failure or breakdown of a machine
The time at which a volunteer catches the disease in a vaccine trial
The time at which a company declares bankruptcy or defaults on a loan.

The domain of T is time itself. Thus the domain is the set of non-negative real numbers in the set [0, ∞) and therefore T is a continuous random variable.

The range of T is the probability of occurrence of the event such as death, failure, loan default etc at time t, before time t or after time t.

Thus, T maps time of event t to the probability that the event of interest will occur at or before or after t.

T maps Time of Event to the Probability of the Event (Image by Author)

Notation-wise, it is one of the following probabilities:

P(T < t)
P(T= t)
P(T > t)

…where t ∈ [0, ∞).

Often, the three kinds of probabilities mentioned above are conditional probabilities. Here are two examples of how they can be conditional:

One would want to know the probability that the time of death of a patient is greater than some time t given that the patient is alive at time t.
One would want to know the probability that your car will break down in the tiny interval (t, t+δt], given that it has not broken down up through time t.

There is one other very important way in which this probability is conditional, namely, it’s being conditional on regression variables such as the individual’s age, the vehicle’s operation conditions, the company’s cash-flow history etc. But we’ll get to this form of conditionality a little later when we look at regression models for survival.

Meanwhile, the above two examples bring us straight to the other two protagonists, the survival function and the hazard function.

The Survival Function

The Survival Function S(t) takes time t as a parameter and outputs the probability that the survival time is greater than t.

S(t) = P(T > t)

One way to understand the survival function is to chop up the time axis into discrete time steps as show in the figure below. S(t_8) is the probability of making it past t=t_8:

Discrete version of the time axis (Image by Author)

Let’s illustrate how to construct the Survival Function using the Stanford heart transplant data set.

I have uploaded the CSV version of this data set at this location.

The Stanford heart transplant data set

The data set contains a set of 103 heart patients whose survival time in years is tracked from the point of induction into the study.

Using Python and Pandas, we’ll load the data into memory:

import pandas as pd
df = pd.read_csv('stanford_heart_transplant_dataset_full.csv')

Print out the columns in the data set:

df.columns

We see the following list:

Index(['PATIENT_ID', 'YR_OF_ACCEPTANCE', 'AGE', 'SURVIVAL_STATUS',
       'SURVIVAL_TIME', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS',
       'WAITING_TIME_FOR_TRANSPLANT', 'MISMATCH_ON_ALLELES',
       'MISMATCH_ON_ANTIGEN', 'MISMATCH_SCORE'],
      dtype='object')

The two columns of immediate interest to us are the following ones:

SURVIVAL_TIME: The number of days the patient survived after induction into the study.
SURVIVAL_STATUS: 1=dead, 0=alive at SURVIVAL_TIME days after induction.

Let’s sort the data by SURVIVAL_TIME and print out the first15 rows for these two columns:

df = df.sort_values(by='SURVIVAL_TIME')
df[['SURVIVAL_TIME','SURVIVAL_STATUS']].head(15)

Here’s the output we get:

    SURVIVAL_TIME  SURVIVAL_STATUS
0               1                1
1               2                1
2               2                1
3               2                1
4               3                1
5               3                1
6               3                1
7               5                1
8               5                1
9               6                1
10              6                1
11              8                1
12              9                1
13             11                0
14             12                1

Let’s color code the output by the number of days of survival:

Output color-coded by the number of years of survival (Image by Author)

One sees from this picture that one patient has died at one days after induction into the study, three patients have died at the two day mark, three more died at the three day mark and so one.

Notice row # 13. Eleven days after induction, this patient exited the study while still alive. This is an example of a right-censored data point. We’ll soon see how to deal with such data as it’s quite common in Survival settings.

Let’s try to visualize the above data set in the following way: we’ll assume that only 15 patients are inducted into the study and all of them are inducted at the same time t_0. With that assumption, the following graphic illustrates the situation shown in the color-coded table we saw earlier:

What is the probability of survival at T > t_6? (Image by Author)

What is the probability of survival beyond t=t_6?

S(t_6) which denotes the probability of survival for a patient beyond t_6 is simply the probability that they did not die in each one of the preceding six intervals (t_0 ,t_1], (t_1, t_2], (t_2, t_3]…(t_5, t_6].

The probability that the patient did not die in (t_i, t_(i+1)] is simply the complement of the conditional probability that they did die in the interval (t_i, t_(i+1)] given that they were still alive at t_i. Notation-wise, it’s expressed as follows:

Probability of survival in any given discrete interval ((Image by Author))

The probability on the R.H.S. can be calculated as follows:

The conditional probability of dying in any given discrete interval ((Image by Author)

Combining the above two results yields the following:

How to calculate the conditional probability of survival in any given discrete interval (Image by Author)

Recollect that S(t_6) denotes the probability of survival for a patient beyond t_6. It is the probability that the patient did not die in each one of the preceding six intervals (t_0 ,t_1], (t_1, t_2], (t_2, t_3]…(t_5, t_6]. We can use the above equation to calculate S(6) as follows:

How to calculate S(t_6)? — How to calculate *S(t_6)?* (Image by Author)

Let’s use the above formula to calculate S(t_6) for the Stanford heart patient data set. We’ll reproduce the first 15 rows for reference:

Output color-coded by the number of days of survival (Image by Author)

Here’s the table of d_i and n_i calculated from the above table using MS Excel. We make the simplifying assumption that all 103 patients are inducted at t=t_0:

Calculating d_i and n_i using Excel (Image by Author)

With the table of d_i and n_i in place, we calculate S(t_6) as follows:

Here is how we do it in Excel:

Calculation of the Survival Function using Excel (Image by Author)

How to deal with right-censored data

Recollect that row 13 in the data set contains a patient who left the study after 11 days while still alive. This patient will contribute to the count n_11 which is the number of patients who made it to 11 days. But this patient cannot be counted while counting n_12, n_13, etc. since he/she was simply not present beyond 11 days. Note that such patients also don’t count toward any of the death counts either, as they did not die while being in the study. Hence, the impact of such a right-censored data point at t=t_11 is on the denominators n_12, n_13 etc. while calculating the Survival Function’s value at t=t_12, t_13, etc. i.e. the value S(t) for all t > t_11.

Right censored observations are very common in survival data.

The Kaplan-Meier Estimator

The above mentioned technique for constructing a Survival Function can be easily generalized as follows:

The above formula is known as the Kaplan-Meier estimator after its co-creators.

The Kaplan-Meier Estimator is used to estimate the Survival Function, and to compare the chances of survival within two populations that differ in one or more variables

It’s important to note that Kaplan-Meier estimator produces an estimate of the true (unknown) survival function.

The Kaplan-Meier estimator is a random variable that has a mean, a variance and a probability distribution, and each estimate of the true S(t) that it generates comes bracketed with confidence intervals.

If you plot the value of S(k) for different values of k using the above formula, you will get what is known as the Kaplan-Meier curve or Kaplan-Meier plot.

Let’s construct this plot for the heart transplant data set using Python and the Lifelines library:

from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
kmf = KaplanMeierFitter(label='Stanford heart transplant dataset')
kmf = kmf.fit(durations=df['SURVIVAL_TIME'], event_observed=df['SURVIVAL_STATUS'])
kmf.plot()
plt.show()

We see the following plot of S(t): probability of survival for T > t against t:

The Kaplan-Meier plot for the heart transplant data set along with 95% confidence intervals — The **Kaplan-Meier plot** for the heart transplant data set along with 95% confidence intervals (Image by Author)

Each point (x, y) on the above Kaplan-Meier plot represents the probability y that patient survived beyond time t=x. i.e. S(t=x) = y.

Let’s mark off the point (6, 0.9126) corresponding to S(t=6) which we hand-cranked out earlier. The corresponding probability is 91.26% that the patient survived beyond 6 days after induction into the study.

plt.plot(6, 0.9126, color='red', marker='x', markeredgewidth=1, markersize=10)
kmf.plot()
plt.show()

Point (6, 0.9126) corresponding to S(T > 6) marked off on the S(t) plot — Point *(6, 0.9126)* corresponding to *S(T > 6) marked off on the S(t) plot* (Image by Author)

When T is a continuous random variable

When T needs to be considered as a continuous variable, one has to work with probability densities, specifically the Probability Density Function (PDF) and the Commutative Density Function (CDF) which is the integration of the PDF over some interval of time.

Let f(T=u) be the probability density of failure, disease, death, or in general some event of interest, by T=u. Then the probability of that event happening by time T ≤ t is given by the Cumulative Probability Function F(t) as follows:

Cumulative Density Function (CDF) — Cumulative Density Function (**CDF**) (Image by Author)

This CDF is essentially the area under the PDF curve in the interval (0, t], assuming time is always zero or positive:

F(t) = Area under f(t) (source: desmos.com. Non-commercial use) — F(t) = Area under f(t) *(source: desmos.com.* *Non-commercial use*)

Since the probability of failure or death at or before time T ≤t is F(t), the probability of death or failure not happening by T ≤t is 1 — F(t). That is same as the probability of time of death or time of failure being at some time after T = t. Which by definition, is the survival function S(T > t).

Thus, we arrive at this important result that is applicable to both discrete and continuous time scales:

The Hazard Function

The Hazard Function h(t) gives you the density of instantaneous risk, or in other words, the probability of occurrence of the event at T=t, assuming that the event has not occurred up through time t.

The Hazard Function can be used to answer questions like what is the chance that your 15 year old Toyota which has not broken down even once, will break down on the very morning of your big trip to the mountains!

Since h(t), the hazard function, is the probability density at time t, h(t)δt is the probability of your Toyota’s breaking down during a vanishingly small time interval (t, t+δt] i.e. just when you were about to leave on your trip at time t assuming that it had not broken down up through time t.

Therefore, 1 — h(t)δt is the probability of your vehicle’s not breaking down during the interval (t, t+δt] given that it has also not broken down up through time t. In other words, 1 — h(t)δt is the probability that the car did not break down all the way up through t+δt.

Recall that F(t) is the probability of failure by time t. So 1-F(t+δt) is the probability that your vehicle has not failed up through time t+δt.

Therefore, we have the following relationship between h(t) and F(t):

Probability of no breakdown up through t+δt — Probability of no breakdown up through *t+δt* (Image by Author)

Plugging in the two probabilities on the R.H.S.:

After rearranging the terms, we get:

Hazard expressed as a function of the CDF (Image by Author)

But what we are interested in is the instantaneous risk, more accurately the density of the instantaneous risk, at time t. To get the instantaneous risk density at t, we make δt arbitrarily small. We do this by finding the limit of the above equation as δt → 0:

Instantaneous risk h(t) at time t (Image by Author)

The limit converts the CDF F(t) to its derivative F’(t):

h(t) as a function of CDF and its derivative (Image by Author)

We saw earlier that 1-F(t) is the Survival Function S(t), which is the probability of your car breaking after time t. Furthermore, F’(t) = f(t) which is simply the probability density of break down by time t.

Substituting, f(t) and S(t), we get one of the fundamental equations of the field of Survival Analysis — an equation that expresses the instantaneous probability density of failure at time t in terms of the probability density of failure by time t and the probability of failure after time t:

Hazard as a function of f(t) and S(t) (Image by Author)

Relationship between the Hazard Function and the Poisson Process

The Hazard Function h(t) is the instantaneous density of risk at some time t. The density aspect of it makes it a natural candidate for h(t) to be considered as a rate. In fact, another name for the Hazard Function is the Failure Rate with units of number of failures per unit time at a certain time t.

A Poisson process is characterized by the Poisson event rate λ which has the units of events per unit time.

Therefore, in situations where the hazard function h(t) has a constant value at all times t, this constant value is often represented by the symbol λ, and the underlying random process that displays the characteristic of a constant hazard rate is in fact a Poisson process with a constant failure rate λ.

Thus we have:

h(t) = λ

The hazard rate of systems such as factory machinery, commercial HVAC systems and your 15 year old Toyota, all of which (presumably) undergo periodic maintenance and repair can be modeled using a constant hazard rate. For such systems, the Mean Time Between Failures (MTBF) is simply 1/λ.

Constant MTBF under a constant hazard rate (Image by Author)

When h(t) = a constant rate λ, we also have the following interesting results:

Probability of failure happening by a certain time t, assuming a constant hazard rate λ (Image by Author)

The Survival Function S(t) which gives you the probability of failure occurring after time t, is 1 — F(t). Hence in the constant hazard rate situation:

Survival function under constant hazard rate (Image by Author)

Survival Function under a constant hazard rate λ (Image by Author)

How to compare the survival functions of two samples based on a boolean explanatory variable?

Consider two samples S_1 and S_2 from the Stanford heart transplant data set which are differentiated by the value of the boolean explanatory variable PRIOR_SURGERY as follows:

S_1 = [Set of all patients who have had at least one open-heart surgery prior to entry into the study. PRIOR_SURGERY=1]

S_2 = [Set of all patients who have not had any open-heart surgery prior to entry into the study. PRIOR_SURGERY = 0]

Let’s construct and plot the survival function for each sample:

#load the dataset into memory if you haven't done so already
df = pd.read_csv('stanford_heart_transplant_dataset_full.csv')
#create a differentiated index based on PRIOR_SURGERY=1
idx = df['PRIOR_SURGERY'] == 1
#Create the respective time series
survival_time_with_prior_surgery, survival_status_with_prior_surgery = df.loc[idx, 'SURVIVAL_TIME'], df.loc[idx, 'SURVIVAL_STATUS']
survival_time_without_prior_surgery, survival_status_without_prior_surgery = df.loc[~idx, 'SURVIVAL_TIME'], df.loc[~idx, 'SURVIVAL_STATUS']
#Fit a Kaplan-Meier plot to the respective sample, with and without prior surgery
kmf_with_prior_surgery = KaplanMeierFitter(label='Group with prior heart surgery').fit(durations=survival_time_with_prior_surgery, event_observed=survival_status_with_prior_surgery)
kmf_without_prior_surgery = KaplanMeierFitter(label='Group without prior heart surgery').fit(durations=survival_time_without_prior_surgery, event_observed=survival_status_without_prior_surgery)
#plot the two curves
kmf_with_prior_surgery.plot()
kmf_without_prior_surgery.plot()
plt.show()

We see the following interesting result in the plot below, namely for any time t:

S(t|with prior open heart surgery) > S(t|without prior open heart surgery):

Comparing survival of patients who have and have not undergone open heart surgery prior to induction into the study, along with corresponding 95% confidence intervals (Image by Author)

How to test the differences in survival probability beyond a point in time

Sometimes, one wants to know if two groups have a statistically meaningful difference in survival probability beyond a certain time. For example, one may want to know if the probability of survival beyond 180 days from induction into the study is different (in a statistically significant way) between patients who have and have not undergone prior open-heart surgery.

To do so, one can use a Chi-square(1) distributed test statistic described by Klein, J. P., Logan, B. , Harhoff, M. and Andersen, P. K. in their 2007 paper: Analyzing survival curves at a fixed point in time.

This statistic is implemented in the lifelines library and can be used as follows:

#import the test
from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
#Perform the test at t=365 days
results = survival_difference_at_fixed_point_in_time_test(point_in_time=365, fitterA=kmf_with_prior_surgery, fitterB=kmf_without_prior_surgery)
#Print the test statistic and it's p-value for 1 degree of freedom on the Chi-square table
print('Chi-squared(1) Test statistic='+str(results.test_statistic) + ' p-value='+str(results.p_value))

We get the following result:

Chi-squared(1) Test statistic=5.652097138214293 p-value=0.01743450982288937

The p-value of 0.017 is < 0.05. Therefore, we can say with greater than 98% confidence that the probability of survival beyond one year from induction into the study is different between patients who have and have not undergone prior open-heart surgery.

We have seen how the value of the variate PRIOR_SURGERY influences S(t) and therefore the hazard h(t) experienced by the corresponding patient.

We would like to know is by how much does the hazard h(t) at time t reduce (or increase) when a patient has had prior open heart surgery. We would also want to measure the effect of other non-boolean variables such as age on the increase or decrease in the hazard h(t) experienced by the patient. For example, one would may reach a conclusion that given two study patients who are otherwise similar except for their ages, the patient who is older will experience a hazard that is 10% higher than the younger patient for each year that they are older than the younger patient. In other words, h_older(t) = h_younger(t)*(1.1)^(k) where k is the difference in ages in number of years.

To reach such measurable conclusions, we need to use a class of regression models known as the proportional hazards model.

The Proportional Hazards Model

Earlier, we looked at a special case of the hazard function h(t) where the hazard rate was a constant λ events per unit time at all times. In general, it is useful to express λ as a function of time i.e. as λ(t) because the hazard often varies with time.

It is also useful to think of λ(t) as the ‘baseline’ hazard rate experienced by all members in the study. We call λ(t) the baseline hazard rate because at any time t during the study, the actual hazard rate h_i(t) experienced by the ith member of the sample is likely to be a function of 1) the baseline hazard λ(t) and 2) certain characteristics of the ith study participant. In case of human members, such characteristics could be their age, gender, genetic makeup etc. In case of machines such as your Toyota, they could be the age of the vehicle, the operating conditions, model of the vehicle, how regularly it has been serviced etc. In the regression model, these characteristics form the regression variables or the ‘covariates’ vector x_i for the study individual i. Thus, the actual hazard rate h_i(t) experienced by the ith individual can be expressed as a function of:

The baseline hazard rate λ(t) experienced by everyone in the study,
The regression variables vector x_i for the ith individual, and
The regression coefficients vector β

Since the hazard rate is always non-negative, we can use the following technique drawn from the discipline of Generalized Linear Models to bring together λ(t), x_i and β:

Hazard rate for ith individual as a function of baseline hazard and covariates (Image by Author)

The exponentiation ‘e^(x_i*β)’ ensures that the h_i(t) stays non-negative no matter what values are observed for the regression variables vector x_i.

The above equation gives you the hazard rate experienced by the ith individual. If the study contains n individuals, we replace h_i(t) with a vector h(t) of size (n x 1) and we replace the regression vector x_i with the matrix of regression variables X of size (n x m) where m is the number of regression variables per individual. The dimensions of the vector of regression coefficients β stay the same namely (m x 1). The following figure illustrates this situation for a data set of size n individuals:

The hazard rate vector for the n study individuals (Image by Author)

If we take the ratio h_i(t)/h_j(t) of the hazard experienced by any two individuals i and j at a given time t, we get the following interesting result:

The ratio of hazards depends only on the difference in the observed values of regression variables.

The Cox Proportional Hazards Model

The concept of proportional hazards described above was used by Cox, D. R. in his influential 1972 paper “Regression Models and Life-Tables”.

To understand how the Cox model works, let’s refer back to our Heart Transplant data set. This time, we’ll look at a subset of rows from the Heart Transplant data set (I have highlighted row 6 below which we’ll get to in a bit):

Carving out the X and y matrices from the heart transplant data set (Image by Author)

We’ll consider three regression variables which will form the X matrix:
AGE: The patient’s age when they were inducted into the study.
PRIOR_SURGERY: Whether the patient had at least one open-heart surgery prior to entry into the study.1=Yes, 0=No
TRANSPLANT_STATUS: Whether the patient received a heart transplant while in the study. 1=Yes, 0=No

The response variable y is the variable SURVIVAL_TIME: the number of days a patient survived after induction into the study.

Regression Goal

The goal of our regression model is to estimate the expected survival duration y given the regression variables X, and in doing so, calculate the influence of each regression variable AGE, PRIOR_SURGERY and TRANSPLANT_STATUS on SURVIVAL_TIME.

Regression Strategy

To achieve the regression goal, we will do the following:

We will construct the equation for the Probability Distribution Function of y given X denoted as P(y=y_i|X=x_i). This PDF needs to express probability of survival y=y_i for the ith individual as a function of three factors:
The baseline hazard λ(t) at time t,
The regression variables X=x_i for individual ith individual (we’ll assume time-invariant regression variables), and,
The regression coefficients β
Once we have this conditional PDF in place, we will use Maximum Likelihood Estimation to solve for the regression coefficients β. This MLE step gives us the trained regression model.
We can use this trained model to estimate the influence of the regression variables on the response variable y, and to get the expected survival duration y_k of the kth individual given X=x_k and coefficients β.

To construct the PDF, we’ll use the concept of Partial Likelihood that Cox made use of in his model. Don’t worry if you don’t know what Partial Likelihood is. We’ll illustrate the concept below:

Let’s zoom into row 6 in the subset of rows from the heart data set shown above. Row 6 tagged by the red arrow shows that one patient died at t=8 days from the start of the study. This patient was 53 years old, had no evidence of prior open heart surgery and the patient died without receiving a transplant during the study.

The above subset of data also shows that 13 patients tagged by the orange arrow survived past t=8 days. One can say that just before t=8 days, there were 14 patients alive. Out of these 14, any one of them may have been the patient who passed away at t=8. The unconditional probability of any given patient dying at t=8 is simply 1/14. But we can do better than the unconditional probability.

We will postulate that the probability that a particular patient out of this set of 14 will die at t=14 is conditioned upon certain characteristics of that patient such as their age, prior open heart surgery and whether they received a transplanted heart.

Cox has provided the following formula for estimating this conditional probability:

The probability is expressed as a ratio of two hazards. The numerator is the hazard rate experienced by the ith patient as a function of the baseline hazard λ(t) at time t, the regression variables x_i for the ith patient and the regression coefficients β. The denominator is the sum of hazard rates of all individuals (including the ith individual) who have made it through to t=8 days.

We see that the baseline hazard rate λ(t) gets cancelled out so that the probability does not depend on the time t but only on X and β. This is mighty helpful as the baseline hazard function h(t) is often not known and its exact shape is hard to guess.

The above equation is called Partial Likelihood because it discounts the presence of the baseline hazard λ(t).

The general form of the above equation is as follows:

Partial Likelihood used in Cox model (Image by Author)

Where the denominator is the set of all individuals (including the ith individual) who have made it to the ith time point.

The Maximum Likelihood Estimation procedure will optimize the coefficients β in such a way that it will maximize the probability of observing the training data set (X, y) where the probability of observing each y_i given x_i is given by the above probability function. Running the MLE procedure on the training data set results in a trained Cox model which can be used for estimating the expectation of y_i for any given x_i.

All of this is taken care of by libraries such as Lifelines which implement the Cox Proportional Hazards model. We’ll use it below on the train a Cox model on the heart transplant data set.

We have already loaded the data set into memory. Let’s create a subset that contains the columns of our interest.

df2 = df[['AGE', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS', 'SURVIVAL_TIME', 'SURVIVAL_STATUS']]

Let’s carve out the training and test data sets.

import numpy as np
mask = np.random.rand(len(df2)) < 0.8
df2_train = df2[mask]
df2_test = df2[~mask]
print('Training data set length='+str(len(df2_train)))
print('Testing data set length='+str(len(df2_test)))

Import the Cox model:

from lifelines import CoxPHFitter

Create and train the Cox model on the training set:

cph_model = CoxPHFitter()
cph_model.fit(df=df2_train, duration_col='SURVIVAL_TIME', event_col='SURVIVAL_STATUS')

Display the model training summary:

cph_model.print_summary()

We get the following output:

Let’s look at the fitted regression coefficients for each regression variable, their exponents (e^β) and the corresponding confidence intervals:

Coefficients, their exponents and 95% confidence intervals

AGE has an exponentiated coefficient of e^(0.07) = 1.07. This tells us that each unit increase in age of the patient at the time of induction into the study increases the hazard by 0.07 = 7%. The 95% confidence intervals range from 3% to 11%.

PRIOR_SURGERY has an exponentiated coefficient of e^(-0.75) = 0.47. Each unit increase in value of this variable reduces the hazard rate by (1–0.47)*100=53%. But PRIOR_SURGERY is a boolean variable. So it tells us the following:

Ratio of hazard rates of patients with and without prior open heart surgery

Having an open heart surgery done prior to induction into the study seems to reduce the hazard rate by (1–0.47) *100 = 53%. Notice that the confidence intervals for this variable are very wide ranging from a 82% reduction to a 23% increase in the hazard rate. This is so because the standard error of this coefficient is very large (0.49).

Similarly, getting a transplant (TRANSPLANT_STATUS=1) during the study reduces the hazard rate by (1–0.19)*100 = 81% with the confidence intervals ranging from a 90% reduction in hazard rate to a 66% reduction in hazard rate.

There is other useful information in the training summary such as significance data for each regression variable. We see that AGE and TRANSPLANT_STATUS are significant at a 99.5% confidence level (p < 0.005) while PRIOR_SURGERY is significant only at a 88% confidence (p = 0.12).

Prediction

For such a small data set, the quality (as measured by the mean squared error) of the estimated expected survival duration is not going to be good. A much greater benefit is obtained from measuring the contribution of regression variables to the survival duration like what we did in the earlier section. Just for reference, here is the way to get the estimated expectation of the survival times on the test data set:

expected_survival_times = cph_model.predict_expectation(df2_test)

Here is the complete source code:

	import numpy as np
	import pandas as pd
	from lifelines import KaplanMeierFitter
	from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
	from lifelines import CoxPHFitter
	from matplotlib import pyplot as plt

	#dataset link
	#https://statistics.stanford.edu/research/covariance-analysis-heart-transplant-survival-data
	#http://www.stat.rice.edu/~sneeley/STAT553/Datasets/survivaldata.txt


	#Load the data set into memory
	df = pd.read_csv('stanford_heart_transplant_dataset_full.csv')

	#sort the data by survival time
	df = df.sort_values(by='SURVIVAL_TIME')

	#print out the columns
	df.columns

	#print the top 15 rows of the two columns we are interested in
	df[['SURVIVAL_TIME', 'SURVIVAL_STATUS']].head(15)

	#Construct the Kaplan-Meier Estimator
	kmf = KaplanMeierFitter(label='Stanford heart transplant dataset')

	#fit it to the data set
	kmf = kmf.fit(durations=df['SURVIVAL_TIME'], event_observed=df['SURVIVAL_STATUS'])

	#plot the Kaplan-Meier plot showing S(T>t) against t
	kmf.plot()

	#Mark the point (6, 0.9126) corresponding to S(T > 6)
	plt.plot(6, 0.9126, color='red', marker='+', markeredgewidth=3, markersize=16)

	plt.show()

	#create a differentiated index based on PRIOR_SURGERY=1
	idx = df['PRIOR_SURGERY'] == 1

	#Create the respective time series
	survival_time_with_prior_surgery, survival_status_with_prior_surgery = df.loc[idx, 'SURVIVAL_TIME'], df.loc[idx, 'SURVIVAL_STATUS']

	survival_time_without_prior_surgery, survival_status_without_prior_surgery = df.loc[~idx, 'SURVIVAL_TIME'], df.loc[~idx, 'SURVIVAL_STATUS']

	#Fit a Kaplan-Meier plot to the respective sample, with and without prior surgery
	kmf_with_prior_surgery = KaplanMeierFitter(label='Group with prior heart surgery').fit(durations=survival_time_with_prior_surgery, event_observed=survival_status_with_prior_surgery)

	kmf_without_prior_surgery = KaplanMeierFitter(label='Group without prior heart surgery').fit(durations=survival_time_without_prior_surgery, event_observed=survival_status_without_prior_surgery)

	#plot the two curves
	kmf_with_prior_surgery.plot()
	kmf_without_prior_surgery.plot()
	plt.show()

	#Perform the point in time at t=365 days
	results = survival_difference_at_fixed_point_in_time_test(point_in_time=365, fitterA=kmf_with_prior_surgery, fitterB=kmf_without_prior_surgery)

	#Print the test statistic and it's p-value for 1 degree of freedom on the Chi-square table
	print('Chi-squared(1) Test statistic='+str(results.test_statistic) + ' p-value='+str(results.p_value))

	#Let's create a subset that contains the columns of our interest
	df2 = df[['AGE', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS', 'SURVIVAL_TIME', 'SURVIVAL_STATUS']]

	#Carve out the training and test data sets
	mask = np.random.rand(len(df2)) < 0.8
	df2_train = df2[mask]
	df2_test = df2[~mask]
	print('Training data set length='+str(len(df2_train)))
	print('Testing data set length='+str(len(df2_test)))

	#Create and train the Cox model on the training set
	cph_model = CoxPHFitter()
	cph_model.fit(df=df2_train, duration_col='SURVIVAL_TIME', event_col='SURVIVAL_STATUS')

	#Display the model training summary
	cph_model.print_summary()

	#here is the way to get the estimated expectation of the survival times on the test data set
	expected_survival_times = cph_model.predict_expectation(df2_test)

view raw survival_analysis.py hosted with ❤ by GitHub

Survival Analysis using Python and Lifelines

Citations and Copyrights

Data set

The Stanford heart transplant data set is taken from https://statistics.stanford.edu/research/covariance-analysis-heart-transplant-survival-data and available for personal/research purposes only. Curated data set download link.

Software

Cameron Davidson-Pilon, Jonas Kalderstam, Noah Jacobson, Sean Reed, Ben Kuhn, Paul Zivich, … Skipper Seabold. (2021, May 27). CamDavidsonPilon/lifelines: 0.26.0 (Version 0.26.0). Zenodo. http://doi.org/10.5281/zenodo.4816284

Papers

Cox, D. R. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society. Series B (Methodological) 34, no. 2 (1972): 187–220. http://www.jstor.org/stable/2985181.

Klein, J. P., Logan, B., Harhoff, M., & Andersen, P. K. (2007). Analyzing survival curves at a fixed point in time. Statistics in medicine, 26(24), 4505–4519. https://doi.org/10.1002/sim.2864

Book

McCullagh P., Nelder John A., Generalized Linear Models, 2nd Ed., CRC Press, 1989, ISBN 0412317605, 9780412317606

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

PREVIOUS: The Poisson Hidden Markov Model – Part 2 (Implementation in Python)

NEXT: The Stratified Cox Proportional Hazards Regression Model

UP: Table of Contents