What are Schoenfeld residuals and how to use them to test the proportional hazards assumption of the Cox model.
One thinks of regression modeling as a process by which you estimate the effect of regression variables X on the dependent variable y. Your model is also capable of giving you an estimate for y given X. You subtract that estimate from the observed y to get the residual error of regression.
But what if you turn that concept on its head by estimating X for a given y and subtracting that estimate from the observed X?
That’s right —you estimate the regression matrix X for a given response vector y!
When you do such a thing, what you get are the Schoenfeld Residuals named after their inventor David Schoenfeld who in 1982 showed (to great success) how to use them to test the assumptions of the Cox Proportional Hazards model.
The Schoenfeld residuals have since become an indispensable tool in the field of Survival Analysis and they have found in a place in all major statistical analysis software such as STATA, SAS, SPSS, Statsmodels, Lifelines and many others.
Schoenfeld residuals are so wacky and so brilliant at the same time that their inner workings deserve to be explained in detail with an example to really understand what’s going on. That is what we’ll do in this section.
Before we dive in, let’s get our head around a few essential concepts from Survival Analysis.
A Cheat-sheet of Relevant Concepts
Before we dive into what are Schoenfeld residuals and how to use them, let’s build a quick cheat-sheet of the main concepts from Survival Analysis.
The Random Variable T
The random variable T denotes the time of occurrence of some event of interest such as onset of disease, death or failure. T maps time t to a probability of occurrence of the event before/by/at or after t.
The Hazard Function
The Hazard Function h(t) gives you the density of instantaneous risk experienced by an individual or a thing at T=t assuming that the event has not occurred up through time t. h(t) can also be thought of as the instantaneous failure rate at t i.e. the number of failures per unit time at time t.
The Concept of Proportional Hazards
The hazard h_i(t) experienced by the ith individual or thing at time t can be expressed as a function of 1) a baseline hazard λ_i(t) and 2) a linear combination of variables such as age, sex, income level, operating conditions etc. that are unique to that individual or thing. We express hazard h_i(t) as follows:
At any time T=t, if the baseline hazard (also known as the background hazard) experienced by all individuals is the same i.e. if λ_i(t) = λ(t) for all i, then the ratio of hazards experienced by two individuals i and j can be expressed as follows:
Notice that under the common baseline hazard assumption, the ratio of hazard for i and j is a function of only the difference in the respective regression variables. It is independent of the baseline hazard.
The Cox Proportional Hazards Model
The Cox model extends the concept of proportional hazards in a way that is best illustrated with the following example:
Imagine a vaccine trial in which volunteers catch the disease on days t_0, t_1, t_2, t_3,…,t_i…,t_n after induction into the study. Just before T=t_i, let R_i be the set of indexes of all volunteers who have not yet caught the disease. Thus, R_i is the ‘at-risk’ set just before T=t_i. Assume that at T=t_i exactly one individual from R_i will catch the disease. Suppose this individual has index j in R_i. The Cox model gives us the probability that the individual who falls sick at T=t_i is the observed individual j as follows:
In the above equation, the numerator is the hazard experienced by the individual j who fell sick at t_i. The denominator is the sum of the hazards experienced by all individuals who were at risk of falling sick at time T=t_i.
The Cox model is used for calculating the effect of various regression variables on the instantaneous hazard experienced by an individual or thing at time t. It is also used for estimating the probability of survival beyond any given time T=t.
The Assumptions of the Cox Proportional Hazards Model
The Cox model makes three assumptions:
- Common baseline hazard rate λ(t): At any time t, all individuals are assumed to experience the same baseline hazard λ(t). For example, if a study consists of males and females belonging to different races and age groups, then at any time t during the study, white males who entered the study when they were in the 18–34 years age range are assumed to experience the same baseline hazard λ(t) as Asian females who entered the study when they 40–64 year old. This is clearly a strong assumption and one must validate it before accepting results of the Cox model for your data set.
- Time invariant co-variates X: The effect of the regression variables X (a.k.a. co-variates) on the instantaneous hazard experienced by an individual is assumed to remain constant over time. For example, if a volunteer enters a study of cancer risk to smokers, and if the participant’s genetic makeup is a co-variate, then the effect of their genes on the hazard of contracting cancer is assumed to remain constant throughout the study. Notation-wise, X(t_i)=X for all t_i. This is again a rather strong assumption that should be validated. For example, gene expression can vary with time in response to a number of factors.
- Time invariant regression coefficients β: The regression coefficients β do not vary with time. In other words, β(t_i)=β for all t_i. One can see that assumptions (2) and (3) are closely related. For example, the effect of gene expression increasing with time can be expressed as the coefficient of the gene regression variable increasing with time.
Schoenfeld Residuals are used to validate the above assumptions made by the Cox model.
How to Calculate the Schoenfeld Residuals?
The calculation of Schoenfeld residuals is best described by fitting the Cox Proportional Hazards model on a sample data set.
We’ll use the Stanford heart transplant data set which is a data set of 103 heart patients who have been voluntarily admitted into a study after it was determined that a transplant was the only option left for them. The study collected various variables related to each individual such as their age, evidence of prior open heart surgery, their genetic makeup etc. If they received a transplant during the study, this event was noted down. It was also noted down how many days elapsed before an individual died irrespective of whether they received a transplant. Some individuals left the study for various reasons or they were still alive when the study ended. These ‘lost-to-observation’ cases constituted what are known as ‘right-censored’ observations.
I have uploaded the CSV version of this data set at this location.
Using Python and Pandas, let’s start by loading the data into memory:
import pandas as pd df = pd.read_csv('stanford_heart_transplant_dataset_full.csv')
Let’s print out the columns in the data set:
We see the following list:
Index(['PATIENT_ID', 'YR_OF_ACCEPTANCE', 'AGE', 'SURVIVAL_STATUS',
'SURVIVAL_TIME', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS',
The columns of immediate interest to us are the following ones:
SURVIVAL_TIME: The number of days the patient survived after induction into the study. This is our response variable y.
SURVIVAL_STATUS: 1=dead, 0=alive at SURVIVAL_TIME days after induction.
We’ll consider the following three regression variables which will form our regression variables matrix X:
AGE: The patient’s age when they were inducted into the study.
PRIOR_SURGERY: Whether the patient had at least one open-heart surgery prior to entry into the study.1=Yes, 0=No
TRANSPLANT_STATUS: Whether the patient received a heart transplant while in the study. 1=Yes, 0=No
Let’s carve out a vertical slice of the data set containing only columns of our interest:
df2 = df[['AGE', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS', 'SURVIVAL_TIME', 'SURVIVAL_STATUS']].copy() df2.head()
We see the following output:
Let’s fit the Cox PH model from the Lifelines library on this data set.
Let’s import the Cox model:
from lifelines import CoxPHFitter
Create and train the Cox model on the training set:
cph_model = CoxPHFitter() cph_model.fit(df=df2, duration_col='SURVIVAL_TIME', event_col='SURVIVAL_STATUS')
Display the model training summary:
We get the following output:
Here are the fitted coefficients and their exponents of the three regression variables:
These three coefficients form our β vector:
β=[0.06, -0.74, -1.66]
The Schoenfeld residuals are calculated for each regression variable to see if each variable independently satisfies the assumptions of the Cox model. In our case those would be AGE, PRIOR_SURGERY and TRANSPLANT_STATUS.
We’ll show how the Schoenfeld residuals can be calculated for the AGE variable.
To illustrate the calculation for AGE, let’s focus our attention on what happens at row number # 23 in the data set. We see that one death has occurred at T=30 days. The set of patients who were at ‘at-risk’ of dying just before T=30 are shown in the red box below:
The set of indices [23, 24, 25,…,102] form our ‘at-risk’ set R_30 corresponding to the event occurring at T=30 days. Out of this at-risk set, the patient with ID=23 is the one who died at T=30 days.
Let’s carve out the X matrix consisting of only the patients in R_30:
import numpy as np X30 = np.array(df2[['AGE', 'PRIOR_SURGERY', 'TRANSPLANT_STATUS']])[23:,]
We get the following X matrix that was shown inside the red box in the earlier figure:
Let’s focus on the first column (column index 0) of X30. This is the AGE column and it contains the ages of the volunteers at risk at T=30. We’ll denote it as X30[…] where the three dots denote all rows in X30.
What we want to do next is estimate the expected value of the AGE column.
In other words, we want to estimate the expected age of the study volunteers who are at risk of dying at T=30 days.
Notice that this strategy effectively fixes the value of response variable y to a known value (30 days) and it makes X30[…] i.e. the age of the volunteer as the random variable having an expected value and a variance!
The expected age of at-risk volunteers in R_30 can be calculated by the usual formula for expectation namely the value times the probability summed over all values:
In the above equation, the summation is over all indices in the ‘at-risk’ set R30.
From the earlier discussion about the Cox model, we know that the probability of the jth individual in R30 dying at T=30 is given by:
We plug this probability into the earlier equation for E(X30[…]) to get the following formula for the expected age of individuals who were at risk of dying at T=30 days:
Similarly, we can get the expected values for PRIOR_SURGERY and TRANSPLANT_STATUS regression variables by replacing the index 0 in the above equation with 1 and 2 respectively.
The above equation for E(X30[…]) can be generalized for the ith time instant at which a significant event (such as death) occurs. For T=t_i, the ‘at-risk’ set is R_i and expected value of the mth regression variable i.e. E(Xi[…][m]) can be estimated as follows:
Let’s put these equations to work by calculating the expected age of patients in R30 for our sample data set.
We’ll use a little bit of very simple matrix algebra to make the computation more efficient.
#The regression coefficients vector β of shape (3 x 1) Beta = np.array(cph_model.params_).reshape(-1,1) #exp(X30.Beta). A vector of size (80 x 1). Note that X30 has a shape (80 x 1) exp_X30_Beta = np.exp(np.matmul(X30, Beta)) #The summation in the denominator (a scaler quantity) sum_exp_X30_Beta = np.sum(exp_X30_Beta) #The Cox probability of the kth individual in R30 dying0at T=30. A vector of shape (80 x 1) p_k = exp_X30_Beta/sum_exp_X30_Beta #Column 0 (Age) in X30, transposed to shape (1 x 80) X30_0 = X30[:,0].reshape(1,-1) #Expected Age of volunteers in R30 expected_age_at_R30 = np.matmul(X30_0, p_k) print('Expected age of volunteers in R30='+str(expected_age_at_R30))
We get the following output:
Expected age of volunteers in R30=[[47.83197965]]
Next, we subtract the observed age from the expected value of age to get the vector of Schoenfeld residuals r_i_0 corresponding to T=t_i and risk set R_i.
r_i_0 is a vector of shape (1 x 80).
r_i_0 = X30_0-expected_age_at_R30
The value of the Schoenfeld residual for Age at T=30 days is the mean value (actually a weighted mean) of r_i_0:
mean_r_i_0 = np.mean(r_i_0) print(mean_r_i_0)
Which prints the following:
In practice, one would repeat the above procedure for each regression variable and at each time instant T=t_i at which the event of interest such as death occurs. That results in a time series of Schoenfeld residuals for each regression variable.
It is also common practice to scale the Schoenfeld residuals using their variance.
Thankfully, you don’t have to hand crank out the residuals like we did! All major statistical regression libraries will do all the hard work for you. We’ll soon see how to generate the residuals using the Lifelines Python library.
Properties of Schoenfeld Residuals
- Let’s look at the formula for the expectation again:
You can see that the Cox hazard probability shaded in blue assumes that the baseline hazard λ(t) is the same for all study participants. Thus, the Schoenfeld residuals in turn assume a common baseline hazard.
- David Schoenfeld, the inventor of the residuals has shown that if the survival data obeys the assumptions of the Cox model, then asymptotically, that is for very large data sets, all the Schoenfeld residuals are uncorrelated with each other. Thus, r_i and r_j corresponding to the set of at-risk individuals R_i and R_j at any two time points of interest T=t_i and T=T_j are uncorrelated. This fact can be tested very easily using an auto-correlation plot of the Schoenfeld residuals or using a white noise test such as the Ljung-Box test.
- Notice that the formula for the expectation is completely independent of time. If the covariates X happen to be functions of time i.e. X=X(t), then plotting the Schoenfeld residuals r_i against time will yield a discernible pattern thereby violating the time independence assumption of the Cox model.
- Grambsch, P. M., and Therneau, T. M. (paper links at the bottom of the page) have shown that if the regression coefficients β of the Cox proportional hazards model do not change with time, then the mean of the scaled Schoenfeld Residuals is zero. This happens to be is a very important result. It is used in many statistical packages to test the Cox Proportional Hazards assumptions. This result also yields the conclusion that a plot of the scaled Schoenfeld residuals w.r.t. time (or w.r.t. a scaled time axis) will be a Random Walk around a zero value mean line. Grambsch and Therneau also supply a Chi-square(1) distributed statistic to allow us to easily test this Random Walk hypothesis and thereby the time-invariance assumption of the Cox model.
How to use the Schoenfeld Residuals to Test the Assumptions of the Cox Model
The Lifelines library provides an implementation of Schoenfeld residuals via the
compute_residuals method on the
CoxPHFitter class which you can use as follows:
scaled_schoenfeld_residuals = cph_model.compute_residuals(training_dataframe=df2, kind='scaled_schoenfeld') print(scaled_schoenfeld_residuals)
CPHFitter.compute_residuals will compute the residuals for all regression variables in the X matrix that you had supplied to your Cox model for training and it will output the residuals as a Pandas DataFrame as follows:
Let’s plot the residuals for AGE against time:
from matplotlib import pyplot as plt plt.plot(scaled_schoenfeld_residuals.index, scaled_schoenfeld_residuals['AGE']) plt.show()
We see the following plot:
It’s hard to tell objectively if there are no time based patterns caused by auto-correlations in the above plot. So we’ll run the Ljung-Box test and also the Box-Pierce tests from the statsmodels library on this time series to see if it’s anything more than white noise. The Null hypothesis of the two tests is that the time series is white noise. A p-value of less than 0.05 (95% confidence level) should convince us that it is not white noise and there is in fact a valid trend in the residuals.
import statsmodels.stats.diagnostic as diag diag.acorr_ljungbox(x=scaled_schoenfeld_residuals['AGE'], lags=, boxpierce=True, model_df=0, period=None, return_df=None)
We get the following output:
(array([39.18080837]), array([0.50696947]), array([26.42444176]), array([0.95127985]))
The p-value of the Ljung-Box test is 0.50696947 while that of the Box-Pierce test is 0.95127985. Both values are much greater than 0.05 thereby strongly supporting the Null hypothesis that the Schoenfeld residuals for AGE are not auto-correlated.
Let’s run the same two tests on the residuals for PRIOR_SURGERY:
diag.acorr_ljungbox(x=scaled_schoenfeld_residuals['PRIOR_SURGERY'], lags=, boxpierce=True, model_df=0, period=None, return_df=None)
(array([54.23449071]), array([0.06592408]), array([45.71670762]), array([0.24673926]))
And for TRANSPLANT_STATUS:
diag.acorr_ljungbox(x=scaled_schoenfeld_residuals['TRANSPLANT_STATUS'], lags=, boxpierce=True, model_df=0, period=None, return_df=None)
(array([24.68047727]), array([0.97266311]), array([15.81377991]), array([0.99978178]))
We see that in each case all p-values are greater than 0.05 indicating no auto-correlation among the residuals at a 95% confidence level.
We have shown that the Schoenfeld residuals of all three regression variables of our Cox model are not auto-correlated.
There is one more test on residuals that we will look at. In Lifelines, it is called
proportional_hazards_test. It runs the Chi-square(1) test on the statistic described by Grambsch and Therneau to detect whether the regression coefficients vary with time. The Null hypothesis of the test is that the residuals are a pattern-less random-walk in time around a zero mean line. We will test the null hypothesis at a > 95% confidence level (p-value< 0.05).
from lifelines.statistics import proportional_hazard_test proportional_hazard_test(fitted_cox_model=cph_model, training_df=df2, time_transform='log', precomputed_residuals=scaled_schoenfeld_residuals)
Notice that we have log-transformed the time axis to reduce the influence of outliers. And we have passed the scaled Schoenfeld residuals which had computed earlier using the
We get the following output from the
We see that the p-value of the Chi-square(1) test is <0.05 for all three regression variables indicating that the test is passed at a ≥ 95% confidence level.
Here is the complete source code:
And here is the link to the data set.
Citations and Copyrights
The Stanford heart transplant data set is taken from https://statistics.stanford.edu/research/covariance-analysis-heart-transplant-survival-data and available for personal/research purposes only. Download curated data set.
Paper and book links
Cox, D. R. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society. Series B (Methodological) 34, no. 2 (1972): 187–220. Accessed November 20, 2020. http://www.jstor.org/stable/2985181.
Schoenfeld, David. “Partial Residuals for The Proportional Hazards Regression Model.” Biometrika, vol. 69, no. 1, 1982, pp. 239–241. JSTOR, www.jstor.org/stable/2335876. Accessed 29 Nov. 2020.
Here is another link to Schoenfeld’s paper.
Park, Sunhee and Hendry, David J. (2015) Reassessing Schoenfeld residual tests of proportional hazards in political science event history analyses. American Journal of Political Science, 59 (4). 1072–1087. ISSN 0092–5853. Download link
Grambsch, Patricia M., and Terry M. Therneau. “Proportional Hazards Tests and Diagnostics Based on Weighted Residuals.” Biometrika, vol. 81, no. 3, 1994, pp. 515–526. JSTOR, www.jstor.org/stable/2337123. Accessed 5 Dec. 2020.
Therneau, Terry M., and Patricia M. Grambsch. “Modeling Survival Data: Extending the Cox Model”. 2000. New York: Springer
PREVIOUS: The Chi Squared Test