ISLR2 - Survival Analysis Lab (lifelines)

In this notebook we provide a python implementation of the lab from the Survival Analysis - Chapter 11 of the second edition of the book An Introduction to Statistical Learning (Second Edition). You can find a free pdf version of the book here. We will use the lifelines python package, which you can find in this repository. There is a nice introduction into survival analysis on the documentation. There are also many concrete examples and guidelines to use the package.

Please refer to the book for more details and explanations on the concepts. We focus on the lab implementation.

Prepare Notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(
    style='darkgrid', 
    rc={'axes.facecolor': '.9', 'grid.color': '.8'}
)
sns.set_palette(palette='deep')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100

Read Data

In this lab we want to analyze the BrainCancer data set. The data for the lab is provided as .rda here. To read it as a pandas data frame, I read it in R and exported it in feather format.

bc_df = pd.read_feather('../Data/brain_cancer.feather')

bc_df.head()

	sex	diagnosis	loc	ki	gtv	stereo	status	time
0	Female	Meningioma	Infratentorial	90	6.11	SRS	0	57.64
1	Male	HG glioma	Supratentorial	90	19.35	SRT	1	8.98
2	Female	Meningioma	Infratentorial	70	7.95	SRS	0	26.46
3	Female	LG glioma	Supratentorial	80	7.61	SRT	1	47.80
4	Male	HG glioma	Supratentorial	90	5.06	SRT	1	6.30

The column of interest is time (survival time). The variable status indicates whether the observation is censored. The other variables are additional covariates.

bc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88 entries, 0 to 87
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   sex        88 non-null     category
 1   diagnosis  87 non-null     category
 2   loc        88 non-null     category
 3   ki         88 non-null     int32   
 4   gtv        88 non-null     float64 
 5   stereo     88 non-null     category
 6   status     88 non-null     int32   
 7   time       88 non-null     float64 
dtypes: category(4), float64(2), int32(2)
memory usage: 3.1 KB

Note that we have column diagnosis is missing one value.

Let us plot the time distribution split by status.

fig, ax = plt.subplots()
sns.histplot(x='time', data=bc_df, hue='status', stat='density', ax=ax)
sns.kdeplot(x='time', data=bc_df, hue='status', fill=False, ax=ax)
ax.set(title='Time Distribution');

The Kaplan-Meier Survival Curve

Recall from Chapter 11.5 that the Kaplan-Meier estimator of the survival curve \(S(T)\) is

\[ \hat{S}(t) = \prod_{i : \: t_{i} \leq t} \left(1 - \frac{d_{i}}{n_{i}} \right) \]

where with \(t_{i}\) is a time when at least one event happened, \(d_{i}\) is the number of events that happened at time \(t_{i}\), and \(n_{i}\) is the number of individuals known to have survived up to time \(t_{i}\). We can use the KaplanMeierFitter to compute this estimator.

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()

kmf.fit(durations=bc_df['time'], event_observed=bc_df['status'])

fig, ax =  plt.subplots()
kmf.plot_survival_function(color='C0',ax=ax)
ax.set(
    title='Kaplan-Meier survival curve',
    xlabel='Months',
    ylabel='Estimated Probability of Survival'
);

Compare with FIGURE 11.2 in the book.

Let us now compute this estimator for each of the classes of the sex column.

fig, ax =  plt.subplots()

for i, sex in enumerate(bc_df['sex'].unique()):
    # Mask category
    mask = f'sex == "{sex}"'
    # Fit model
    kmf = KaplanMeierFitter()
    kmf.fit(
        durations=bc_df.query(mask)['time'],
        event_observed=bc_df.query(mask)['status']
    )
    # Plot survival curves
    kmf.plot_survival_function(
        color=f'C{i + 1}',
        label=sex,
        ax=ax
    )

ax.set(
    title='Kaplan-Meier survival curves formales and females',
    xlabel='Months',
    ylabel='Estimated Probability of Survival'
);

The Log-Rank Test

Let us now compute the log-rank test to compare the survival of males to females.

from lifelines.statistics import logrank_test

mask_female = f'sex == "Female"'
mask_male = f'sex == "Male"'

logrank_test_result = logrank_test(
    durations_A=bc_df.query(mask_female)['time'],
    durations_B=bc_df.query(mask_male)['time'],
    event_observed_A=bc_df.query(mask_female)['status'],
    event_observed_B=bc_df.query(mask_male)['status']
)

logrank_test_result.print_summary()

t_0	-1
null_distribution	chi squared
degrees_of_freedom	1
test_name	logrank_test

	test_statistic	p	-log2(p)
0	1.44	0.23	2.12

As mentioned in the book: The resulting p-value is 0.23, indicating no evidence of a difference in survival between the two sexes.

Regression Models With a Survival Response

Next, we move into regression analysis. We will focus in the The Proportional Hazards model, specifically the Cox’s Proportional Hazards Model. The main idea of this model is to assume the following form for the Hazard function

\[ h(t|x_i) = h_0(t) \exp \left( \sum_{j=1}^{p} \beta_jx_{ij}\right) \]

where \(h_0(t) \geq 0\) is the baseline hazard and \(x_{i}\) are the covariates. Recall that the relation between the Hazard function and the survival curve is

\[ S(t) = \exp \left( - \int_{0}^{t} h(\tau) d\tau\right) \]

We can use the CoxPHFitter to estimate these type of models. The model features can be specified by R-like formulas, see patsy.

Model 1

Let us start by using just the sex feature.

from lifelines import CoxPHFitter

cph = CoxPHFitter()

cph.fit(df=bc_df, duration_col='time', event_col='status', formula='sex')

cph.print_summary()

model	lifelines.CoxPHFitter
duration col	‘time’
event col	‘status’
baseline estimation	breslow
number of observations	88
number of events observed	35
partial log-likelihood	-137.43
time fit was run	2021-08-31 09:52:08 UTC

	coef	exp(coef)	se(coef)	coef lower 95%	coef upper 95%	exp(coef) lower 95%	exp(coef) upper 95%	z	p	-log2(p)
sex[T.Male]	0.41	1.50	0.34	-0.26	1.08	0.77	2.94	1.19	0.23	2.10

Concordance	0.56
Partial AIC	276.86
log-likelihood ratio test	1.44 on 1 df
-log2(p) of ll-ratio test	2.12

Recall that the score test from the Cox model is exactly equal to the log rank test statistic. We indeed see that there is no clear evidence for a difference in survival between males and females.

Model 2

For the final model we add all the features available (note that we need to remove the row with missing values).

cph.fit(
    df=bc_df.dropna(axis=0),
    duration_col='time',
    event_col='status',
    formula='sex + diagnosis + loc + ki + gtv + stereo'
)

cph.print_summary()

model	lifelines.CoxPHFitter
duration col	‘time’
event col	‘status’
baseline estimation	breslow
number of observations	87
number of events observed	35
partial log-likelihood	-116.75
time fit was run	2021-08-31 09:52:09 UTC

	coef	exp(coef)	se(coef)	coef lower 95%	coef upper 95%	exp(coef) lower 95%	exp(coef) upper 95%	z	p	-log2(p)
diagnosis[T.LG glioma]	0.92	2.50	0.64	-0.34	2.17	0.71	8.72	1.43	0.15	2.72
diagnosis[T.HG glioma]	2.15	8.62	0.45	1.27	3.04	3.57	20.85	4.78	<0.005	19.14
diagnosis[T.Other]	0.89	2.42	0.66	-0.40	2.18	0.67	8.80	1.35	0.18	2.49
gtv	0.03	1.03	0.02	-0.01	0.08	0.99	1.08	1.54	0.12	3.00
ki	-0.05	0.95	0.02	-0.09	-0.02	0.91	0.98	-3.00	<0.005	8.54
loc[T.Supratentorial]	0.44	1.55	0.70	-0.94	1.82	0.39	6.17	0.63	0.53	0.91
sex[T.Male]	0.18	1.20	0.36	-0.52	0.89	0.59	2.44	0.51	0.61	0.71
stereo[T.SRT]	0.18	1.19	0.60	-1.00	1.36	0.37	3.88	0.30	0.77	0.38

Concordance	0.79
Partial AIC	249.50
log-likelihood ratio test	41.37 on 8 df
-log2(p) of ll-ratio test	19.10

Finally we plot the survival curves for the diagnosis variable. Here the baseline corresponds to meningioma.

cph.plot_partial_effects_on_outcome(
    covariates='diagnosis',
    values=bc_df.dropna(axis=0)['diagnosis'].unique()[1:]
);

All of the results and plots coincide with the R implementation. The remaining labs can be done in a very similar way.