###### And a tutorial on how to build a stratified Cox model using Python and Lifelines

The Cox proportional hazards model is used to study the effect of various parameters on the instantaneous hazard experienced by individuals or ‘things’.

See Introduction to Survival Analysis for an overview of the Cox Proportional Hazards Model

## The Assumptions of the Cox Proportional Hazards Model

The Cox model makes the following assumptions about your data set:

- All individuals or things in the data set experience the same baseline hazard rate.
- The regression variables
do not change with time.*X* - The regression coefficients
do not change with time.*β*

After training the model on the data set, you must test and verify these assumptions *using the trained model* before accepting the model’s result.** If these assumptions are violated, you can still use the Cox model after modifying it in one or more of the following ways:**

### Stratification

The baseline hazard rate may be constant only within certain ranges or for certain values of regression variables. For e.g. if it is hypothesized that the baseline hazard rate for getting a disease is the same for 15–25 year olds, for 26–55 year olds and for those older than 55 years, then we breakup the age variable into different strata as follows: ‘15–25’, ‘26–55’ and ‘>55’. Similarly, categorical variables such as country form natural candidates for stratification. We may assume that the baseline hazard of someone dying in a traffic accident in Germany is different than for people in the United States. Once we stratify the data, we fit the Cox proportional hazards model within each strata. One can also dice up the data set into combinations of strata such as [Age-Range, Country]. The drawback of this approach is that unless your original data set is very large and well-balanced across the chosen strata, the number of data points available to the model within each strata greatly reduces with the inclusion of each variable into the stratification leading. If there aren’t enough number of data points available for the model to train on within each combination of strata, the statistical power of the stratified model will be less.

### Changing the functional form of the regression variables

The concept here is simple. Take for example Age as the regression variable. Your Cox model assumes that the log of the hazard ratio between two individuals is proportional to Age. But in reality the log(hazard ratio) might be proportional to Age², Age³ etc. in addition to Age. In which case, adding an Age² term might “fix” your model. Note that your model is still linear in the coefficient for Age. As long as the Cox model is linear in regression coefficients, we are not breaking the linearity assumption of the Cox model by changing the functional form of variables.

### Adding time interaction terms

Here, the concept is not so simple! We won’t go into this remedy any further.

We’ll see how to fix non-proportionality using **stratification**.

## The data set

The data set we’ll use to illustrate the procedure of building a stratified Cox proportional hazards model is the **US Veterans Administration Lung Cancer Trial data**. It contains data about 137 patients with advanced, inoperable lung cancer who were treated with a standard and an experimental chemotherapy regimen. Their progress was tracked during the study until the patient died or exited the trial while still alive, or until the trial ended. In the later two situations, the data is considered to be right censored.

This data set appears in the book: The Statistical Analysis of Failure Time Data, Second Edition, by John D. Kalbfleisch and Ross L. Prentice.

Using Python and Pandas, let’s load the data set into a DataFrame:

```
import pandas as pd
data_types = {'TREATMENT_TYPE':'int', 'CELL_TYPE':'category', 'SURVIVAL_IN_DAYS':'int', 'STATUS':'int', 'KARNOFSKY_SCORE':'int', 'MONTHS_FROM_DIAGNOSIS':'int', 'AGE':'int', 'PRIOR_THERAPY':'int'}
df = pd.read_csv(filepath_or_buffer='va_lung_cancer_dataset.csv', dtype=data_types)
df.head()
```

Here is the output we see:

Our regression variables, namely the ** X** matrix, are going to be the following:

**TREATMENT_TYPE**: 1=Standard. 2=Experimental**CELL_TYPE**: 1=Squamous, 2=Small cell, 3=Adeno, 4=large**KARNOFSKY_SCORE**: A measure of general performance of the patient. 100=Best**MONTHS_FROM_DIAGNOSIS**: The number of months after diagnosis of lung cancer that the patient entered the trial.**AGE**: The age in years of the patient when they were inducted into the trial.**PRIOR_THERAPY**: Whether the patient had received any kind of prior therapy for lung cancer before induction into the trial.

Our dependent variable ** y** is going to be:

**SURVIVAL_IN_DAYS:**Indicating how many days the patient lived after being inducted into the trail.

The event variable is:**STATUS: **1=Dead. 0=Alive

Using Patsy, let’s break out the categorical variable **CELL_TYPE **into different category wise column variables. Don’t worry about the fact that **SURVIVAL_IN_DAYS **is on both sides of the model expression even though it’s the dependent variable. It’s just to make Patsy happy.

```
from patsy import dmatrices
#Build the model expression in Patsy syntax.
model_expr = 'SURVIVAL_IN_DAYS ~ TREATMENT_TYPE + CELL_TYPE + KARNOFSKY_SCORE + MONTHS_FROM_DIAGNOSIS + AGE + PRIOR_THERAPY + SURVIVAL_IN_DAYS + STATUS'
#Use the model expression to break out the CELL_TYPE categorical variable into 1-0 type columns
y, X = dmatrices(model_expr, df, return_type='dataframe')
#Print out the first few rows
X.head()
```

## Training the Cox Proportional Hazard Model

Next, let’s build and train the regular (non-stratified) Cox Proportional Hazards model on this data using the **Lifelines**** **Survival Analysis library:

```
from lifelines import CoxPHFitter
#Create the Cox model
cph_model = CoxPHFitter()
#Train the model on the data set
cph_model.fit(df=X, duration_col='SURVIVAL_IN_DAYS', event_col='STATUS')
#Print the model summary
cph_model.print_summary()
```

We see the following model summary:

## Performing the Proportional Hazard Test

To test the proportional hazards assumptions on the trained model, we will use the

method supplied by Lifelines on the **proportional_hazard_test**

class:**CPHFitter**

```
CPHFitter.proportional_hazard_test(fitted_cox_model, training_df, time_transform, precomputed_residuals)
```

Let’s look at each parameter of this method:

: This parameter references the fitted Cox model. In our example, **fitted_cox_model****fitted_cox_model**=cph_model

: This is a reference to the training data set. In our example, **training_df****training_df**=X

: This variable takes a list of strings: {‘all’, ‘km’, ‘rank’, ‘identity’, ‘log’}. Each string indicates the function to apply to the **time_transform**** y** (duration) variable of the Cox model so as to lessen the sensitivity of the test to outliers in the data i.e. extreme duration values. Recollect that in the VA data set the

**variable is**

*y***SURVIVAL_IN_DAYS**. ‘km’ applies the transformation:

*(1-KaplanMeirFitter.fit(durations, event_observed).*The ‘rank’ transform will map the sorted list of durations to the set of ordered natural numbers [1, 2, 3, …]. ‘Identity’ will keep the durations intact and ‘log’ will log-transform the duration values.

: You get to supply the type of residual errors of your choice from the following types: Schoenfeld, score, delta_beta, deviance, martingale, and variance scaled Schoenfeld.**precomputed_residuals**

Let’s compute the variance scaled Schoenfeld residuals of the Cox model which we trained earlier. We’ll learn about Shoenfeld residuals in detail in the later section on Model Evaluation and Good of Fit but if you want you jump to that section now and learn all about them. For now, let’s compute the Schoenfeld residual errors of the regression model:

```
scaled_schoenfeld = cph_model.compute_residuals(training_dataframe=X, kind='scaled_schoenfeld')
```

Now let’s perform the proportional hazards test:

```
from lifelines.statistics import proportional_hazard_test
proportional_hazard_test(fitted_cox_model=cph_model, training_df=X, time_transform='log', precomputed_residuals=scaled_schoenfeld)
```

We get the following output:

The test statistic obeys a Chi-square(1) distribution under the Null hypothesis that the variable follows the proportional hazards test. Under the Null hypothesis, the expected value of the test statistic is zero. Any deviations from zero can be judged to be statistically significant at some significance level of interest such as 0.01, 0.05 etc.

How this test statistic is created is itself a fascinating topic to study. For the interested reader, the following paper provides a good starting point:*Park, Sunhee and Hendry, David J. (2015) Reassessing Schoenfeld residual tests of proportional hazards in political…*eprints.lse.ac.uk

Getting back to our little problem, I have highlighted in red the variables which have failed the Chi-square(1) test at a significance level of 0.05 (95% confidence level).

We will try to solve these issues by stratifying AGE, CELL_TYPE[T.4] and KARNOFSKY_SCORE.

## Stratifying AGE, CELL_TYPE[T.4] and KARNOFSKY_SCORE

We’ll stratify AGE and KARNOFSKY_SCORE by dividing them into 4 strata based on 25%, 50%, 75% and 99% quartiles. CELL_TYPE[T.4] is a categorical indicator (1/0) variable, so it’s already stratified into two strata: 1 and 0.

To stratify AGE and KARNOFSKY_SCORE, we will use the Pandas method

. We’ll set x to the Pandas Series object df[‘AGE’] and df[‘KARNOFSKY_SCORE’] respectively. q is a list of quantile points as follows:**qcut**(x, q)

```
age_strata = pd.qcut(x=df['AGE'], q=[0, .25, .5, .75, 1.])
karnofsky_strata = pd.qcut(x=df['KARNOFSKY_SCORE'], q=[0, .25, .5, .75, 1.])
```

The output of

is also a Pandas Series object. We’ll add age_strata and karnofsky_strata columns back into our **qcut**(x, q)** X** matrix. Recollect that we had carved out

**using Patsy:**

*X*```
X['AGE_STRATA'] = age_strata
X['KARNOFSKY_SCORE_STRATA'] = karnofsky_strata
```

Let’s look at how the stratified AGE and KARNOFSKY_SCORE look like when displayed alongside AGE and KARNOFSKY_SCORE respectively:

This is AGE and AGE_STRATA:

```
print(X[['AGE', 'AGE_STRATA']])
```

And here is stratified KARNOFSKY_SCORE:

```
print(X[['KARNOFSKY_SCORE', 'KARNOFSKY_SCORE_STRATA']])
```

Next, let’s add the AGE_STRATA series and the KARNOFSKY_SCORE_STRATA series to our ** X** matrix:

```
X['AGE_STRATA'] = age_strata
X['KARNOFSKY_SCORE_STRATA'] = karnofsky_strata
```

We’ll drop AGE and KARNOFSKY_SCORE since our stratified Cox model will not be using the unstratified AGE and KARNOFSKY_SCORE variables:

```
X = X.drop(['AGE', 'KARNOFSKY_SCORE'], axis=1)
```

Let’s review the columns in the updated ** X** matrix:

```
print(X.columns)
```

Index(['Intercept', 'CELL_TYPE[T.2]', 'CELL_TYPE[T.3]', 'CELL_TYPE[T.4]', 'TREATMENT_TYPE', 'MONTHS_FROM_DIAGNOSIS', 'PRIOR_THERAPY', 'SURVIVAL_IN_DAYS', 'STATUS', 'AGE_STRATA', 'KARNOFSKY_SCORE_STRATA'],dtype='object')

Now let’s create an instance of the **stratified** Cox proportional hazard model by passing it AGE_STRATA, KARNOFSKY_SCORE_STRATA and CELL_TYPE[T.4]:

```
cph_model = CoxPHFitter(strata=['CELL_TYPE[T.4]', 'KARNOFSKY_SCORE_STRATA', 'AGE_STRATA'])
```

Let’s fit the model on ** X**. This time, the model will be fitted within each strata in the list: [‘CELL_TYPE[T.4]’, ‘KARNOFSKY_SCORE_STRATA’, ‘AGE_STRATA’].

```
cph_model.fit(df=X, duration_col='SURVIVAL_IN_DAYS', event_col='STATUS')
```

Let’s test the proportional hazards assumption once again on the stratified Cox proportional hazards model:

```
scaled_schoenfeld = cph_model.compute_residuals(training_dataframe=X, kind='scaled_schoenfeld')
proportional_hazard_test(fitted_cox_model=cph_model, training_df=X, time_transform='log', precomputed_residuals=scaled_schoenfeld)
```

We get the following output:

Let’s note two things about this output:

**The test-statistic and p-values:**As noted earlier, the test-statistic is Chi-square(1) distributed under the Null hypothesis*H0*that the variable respects the proportional hazards assumption. Under*H0*, the expected value of the test statistic is zero. Any deviations from zero can be judged to be statistically significant at some acceptable p-value. We can see that all p-values are comfortably above 0.2. So none of the deviations from zero are statistically significant at a confidence level of ≥ 80%. So we strongly reject the alternate hypothesis and accept*H0*that all variables obey the proportional hazards assumption.**Absence of CELL_TYPE[T.4], AGE, KARNOFSKY_SCORE:**Since we are stratifying on these three variables, they are no longer part of the regression variables of the model.

We have succeeded in building a Cox proportional hazards model on the VA lung cancer data in a way that the regression variables of the model (and therefore the model as a whole) satisfy the proportional hazards assumptions.

## How to interpret the output of the Cox proportional hazards model

Let’s print out the model training summary:

```
cph_model.print_summary()
```

We see that the model has considered the following variables for stratification:

The partial log-likelihood of the model is -137.76. This number will be useful if we want to compare the model’s goodness-of-fit with another version of the same model, stratified in the same manner, but with fewer or greater number of variables. The model with the larger Partial Log-LL will have a better goodness-of-fit.

Next, let’s look at the coefficients:

Both the coefficient and it’s exponent are shown in the output.

CELL_TYPE[T.2] is an indicator variable (1 or 0 ) and it represents whether the patient’s tumor cells were of type “small cell”. The coefficient 0.92 is interpreted as follows:

If the tumor is of type “small cell”, the instantaneous hazard of death at any time t, increases by (2.51–1)*100=151%.

TREATMENT_TYPE is another indicator variable with values 1=STANDARD TREATMENT and 2=EXPERIMENTAL TREATMENT. We interpret the coefficient for TREATMENT_TYPE as follows:

Patients who received the experimental treatment experienced a (1.34–1)*100=34% increase in the instantaneous hazard of dying as compared to ones on the standard treatment.

We can interpret the effect of the other coefficients in a similar manner.

Now let’s take a look at the p-values and the confidence intervals for the various regression variables.

Here are the p-values:

The p-values tell us that CELL_TYPE[T.2] and CELL_TYPE[T.3] are highly significant. Their p-value is less than 0.005, implying a statistical significance at a (100–0.005) = 99.995% or higher confidence level. Similarly, PRIOR_THERAPY is statistically significant at a > 95% confidence level. The p-values of TREATMENT_TYPE and MONTH_FROM_DIAGNOSIS are > 0.25. So we cannot say that the coefficients are statistically different than zero even at a (1–0.25)*100 = 75% confidence level. Therefore, we should not read too much into the effect of TREATMENT_TYPE and MONTHS_FROM_DIAGNOSIS on the proportional hazard rate. This conclusion is also borne out when you look at how large their standard errors are as a proportion of the value of the coefficient, and the correspondingly wide confidence intervals of TREATMENT_TYPE and MONTH_FROM_DIAGNOSIS.

## Summary

- The Cox proportional hazards model is used to study the effect of various parameters on the instantaneous hazard experienced by individuals or ‘things’.
- The Cox model assumes that all study participants experience the same baseline hazard rate, and the regression variables and their coefficients are time invariant.
- If your model fails these assumptions, you can “fix” the situation by using one or more of the following techniques on the regression variables that have failed the proportional hazards test: 1) Stratification of regression variables, 2) Changing the functional form of the regression variables and 3) Adding time interaction terms to the regression variables.

## Citations and Copyrights

### Data set

The VA lung cancer data set is taken from the following source:

http://www.stat.rice.edu/~sneeley/STAT553/Datasets/survivaldata.txt

### Paper and book links

Cox, D. R. “Regression Models and Life-Tables.” *Journal of the Royal Statistical Society. Series B (Methodological)* 34, no. 2 (1972): 187–220. Accessed November 20, 2020. http://www.jstor.org/stable/2985181.

Park, Sunhee and Hendry, David J. (2015) “Reassessing Schoenfeld residual tests of proportional hazards in political science event history analyses”*.* *American Journal of Political Science*, 59 (4). 1072–1087. ISSN 0092–5853. http://eprints.lse.ac.uk/84988/

Grambsch, Patricia M., and Terry M. Therneau. “Proportional Hazards Tests and Diagnostics Based on Weighted Residuals.” *Biometrika*, vol. 81, no. 3, 1994, pp. 515–526. *JSTOR*, www.jstor.org/stable/2337123. Accessed 5 Dec. 2020.

Therneau, Terry M., and Patricia M. Grambsch. “Modeling Survival Data: Extending the Cox Model”. 2000. New York: *Springer*

The Statistical Analysis of Failure Time Data, Second Edition, by John D. Kalbfleisch and Ross L. Prentice.

McCullagh P., Nelder John A., Generalized Linear Models, 2nd Ed., *CRC Press*, 1989, ISBN 0412317605, 9780412317606

### Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

**PREVIOUS: **Introduction to Survival Analysis

**NEXT: **The Nonlinear Least Squares (NLS) Regression Model