In this notebook we provide a python implementation of the lab from the Survival Analysis - Chapter 11 of the second edition of the book An Introduction to Statistical
Learning (Second Edition). You can find a free pdf version of the book here. We will use the lifelines
python package, which you can find in this repository. There is a nice introduction into survival analysis on the documentation. There are also many concrete examples and guidelines to use the package.
Please refer to the book for more details and explanations on the concepts. We focus on the lab implementation.
Prepare Notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(
style='darkgrid',
rc={'axes.facecolor': '.9', 'grid.color': '.8'}
)
sns.set_palette(palette='deep')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100
Read Data
In this lab we want to analyze the BrainCancer data set. The data for the lab is provided as .rda
here. To read it as a pandas data frame, I read it in R and exported it in feather format.
bc_df = pd.read_feather('../Data/brain_cancer.feather')
bc_df.head()
sex | diagnosis | loc | ki | gtv | stereo | status | time | |
---|---|---|---|---|---|---|---|---|
0 | Female | Meningioma | Infratentorial | 90 | 6.11 | SRS | 0 | 57.64 |
1 | Male | HG glioma | Supratentorial | 90 | 19.35 | SRT | 1 | 8.98 |
2 | Female | Meningioma | Infratentorial | 70 | 7.95 | SRS | 0 | 26.46 |
3 | Female | LG glioma | Supratentorial | 80 | 7.61 | SRT | 1 | 47.80 |
4 | Male | HG glioma | Supratentorial | 90 | 5.06 | SRT | 1 | 6.30 |
The column of interest is time
(survival time). The variable status
indicates whether the observation is censored. The other variables are additional covariates.
bc_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88 entries, 0 to 87
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 88 non-null category
1 diagnosis 87 non-null category
2 loc 88 non-null category
3 ki 88 non-null int32
4 gtv 88 non-null float64
5 stereo 88 non-null category
6 status 88 non-null int32
7 time 88 non-null float64
dtypes: category(4), float64(2), int32(2)
memory usage: 3.1 KB
Note that we have column diagnosis
is missing one value.
Let us plot the time
distribution split by status
.
fig, ax = plt.subplots()
sns.histplot(x='time', data=bc_df, hue='status', stat='density', ax=ax)
sns.kdeplot(x='time', data=bc_df, hue='status', fill=False, ax=ax)
ax.set(title='Time Distribution');
The Kaplan-Meier Survival Curve
Recall from Chapter 11.5 that the Kaplan-Meier estimator of the survival curve \(S(T)\) is
\[ \hat{S}(t) = \prod_{i : \: t_{i} \leq t} \left(1 - \frac{d_{i}}{n_{i}} \right) \]
where with \(t_{i}\) is a time when at least one event happened, \(d_{i}\) is the number of events that happened at time \(t_{i}\), and \(n_{i}\) is the number of individuals known to have survived up to time \(t_{i}\). We can use the KaplanMeierFitter
to compute this estimator.
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(durations=bc_df['time'], event_observed=bc_df['status'])
fig, ax = plt.subplots()
kmf.plot_survival_function(color='C0',ax=ax)
ax.set(
title='Kaplan-Meier survival curve',
xlabel='Months',
ylabel='Estimated Probability of Survival'
);
Compare with FIGURE 11.2 in the book.
Let us now compute this estimator for each of the classes of the sex
column.
fig, ax = plt.subplots()
for i, sex in enumerate(bc_df['sex'].unique()):
# Mask category
mask = f'sex == "{sex}"'
# Fit model
kmf = KaplanMeierFitter()
kmf.fit(
durations=bc_df.query(mask)['time'],
event_observed=bc_df.query(mask)['status']
)
# Plot survival curves
kmf.plot_survival_function(
color=f'C{i + 1}',
label=sex,
ax=ax
)
ax.set(
title='Kaplan-Meier survival curves formales and females',
xlabel='Months',
ylabel='Estimated Probability of Survival'
);
The Log-Rank Test
Let us now compute the log-rank test to compare the survival of males to females.
from lifelines.statistics import logrank_test
mask_female = f'sex == "Female"'
mask_male = f'sex == "Male"'
logrank_test_result = logrank_test(
durations_A=bc_df.query(mask_female)['time'],
durations_B=bc_df.query(mask_male)['time'],
event_observed_A=bc_df.query(mask_female)['status'],
event_observed_B=bc_df.query(mask_male)['status']
)
logrank_test_result.print_summary()
t_0 | -1 |
---|---|
null_distribution | chi squared |
degrees_of_freedom | 1 |
test_name | logrank_test |
test_statistic | p | -log2(p) | |
---|---|---|---|
0 | 1.44 | 0.23 | 2.12 |
As mentioned in the book: The resulting p-value is 0.23, indicating no evidence of a difference in survival between the two sexes.
Regression Models With a Survival Response
Next, we move into regression analysis. We will focus in the The Proportional Hazards model, specifically the Cox’s Proportional Hazards Model. The main idea of this model is to assume the following form for the Hazard function
\[ h(t|x_i) = h_0(t) \exp \left( \sum_{j=1}^{p} \beta_jx_{ij}\right) \]
where \(h_0(t) \geq 0\) is the baseline hazard and \(x_{i}\) are the covariates. Recall that the relation between the Hazard function and the survival curve is
\[ S(t) = \exp \left( - \int_{0}^{t} h(\tau) d\tau\right) \]
We can use the CoxPHFitter
to estimate these type of models. The model features can be specified by R-like formulas, see patsy.
Model 1
Let us start by using just the sex
feature.
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df=bc_df, duration_col='time', event_col='status', formula='sex')
cph.print_summary()
model | lifelines.CoxPHFitter |
---|---|
duration col | ‘time’ |
event col | ‘status’ |
baseline estimation | breslow |
number of observations | 88 |
number of events observed | 35 |
partial log-likelihood | -137.43 |
time fit was run | 2021-08-31 09:52:08 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|
sex[T.Male] | 0.41 | 1.50 | 0.34 | -0.26 | 1.08 | 0.77 | 2.94 | 1.19 | 0.23 | 2.10 |
Concordance | 0.56 |
---|---|
Partial AIC | 276.86 |
log-likelihood ratio test | 1.44 on 1 df |
-log2(p) of ll-ratio test | 2.12 |
Recall that the score test from the Cox model is exactly equal to the log rank test statistic. We indeed see that there is no clear evidence for a difference in survival between males and females.
Model 2
For the final model we add all the features available (note that we need to remove the row with missing values).
cph.fit(
df=bc_df.dropna(axis=0),
duration_col='time',
event_col='status',
formula='sex + diagnosis + loc + ki + gtv + stereo'
)
cph.print_summary()
model | lifelines.CoxPHFitter |
---|---|
duration col | ‘time’ |
event col | ‘status’ |
baseline estimation | breslow |
number of observations | 87 |
number of events observed | 35 |
partial log-likelihood | -116.75 |
time fit was run | 2021-08-31 09:52:09 UTC |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|
diagnosis[T.LG glioma] | 0.92 | 2.50 | 0.64 | -0.34 | 2.17 | 0.71 | 8.72 | 1.43 | 0.15 | 2.72 |
diagnosis[T.HG glioma] | 2.15 | 8.62 | 0.45 | 1.27 | 3.04 | 3.57 | 20.85 | 4.78 | <0.005 | 19.14 |
diagnosis[T.Other] | 0.89 | 2.42 | 0.66 | -0.40 | 2.18 | 0.67 | 8.80 | 1.35 | 0.18 | 2.49 |
gtv | 0.03 | 1.03 | 0.02 | -0.01 | 0.08 | 0.99 | 1.08 | 1.54 | 0.12 | 3.00 |
ki | -0.05 | 0.95 | 0.02 | -0.09 | -0.02 | 0.91 | 0.98 | -3.00 | <0.005 | 8.54 |
loc[T.Supratentorial] | 0.44 | 1.55 | 0.70 | -0.94 | 1.82 | 0.39 | 6.17 | 0.63 | 0.53 | 0.91 |
sex[T.Male] | 0.18 | 1.20 | 0.36 | -0.52 | 0.89 | 0.59 | 2.44 | 0.51 | 0.61 | 0.71 |
stereo[T.SRT] | 0.18 | 1.19 | 0.60 | -1.00 | 1.36 | 0.37 | 3.88 | 0.30 | 0.77 | 0.38 |
Concordance | 0.79 |
---|---|
Partial AIC | 249.50 |
log-likelihood ratio test | 41.37 on 8 df |
-log2(p) of ll-ratio test | 19.10 |
Finally we plot the survival curves for the diagnosis
variable. Here the baseline corresponds to meningioma.
cph.plot_partial_effects_on_outcome(
covariates='diagnosis',
values=bc_df.dropna(axis=0)['diagnosis'].unique()[1:]
);
All of the results and plots coincide with the R implementation. The remaining labs can be done in a very similar way.