In this notebook we present a simple example of uplift modeling estimation via meta-models using causalml
and scikit-uplift
. For a more detailed introduction to uplift modeling, see:
This material was presented at PyCon DE & PyData Berlin 2022: Introduction to Uplift Modeling.
Here you can find the recording of the talk:
Here you can find the slides.
Prepare Notebook
from causalml.inference.meta.base import BaseLearner
from causalml.inference.meta import (
BaseSClassifier,
BaseTClassifier,
BaseXClassifier,
)
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import (
HistGradientBoostingClassifier,
HistGradientBoostingRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklift.metrics import uplift_by_percentile, uplift_curve
from sklift.viz import (
plot_qini_curve,
plot_uplift_by_percentile,
plot_uplift_curve,
)
plt.style.use("bmh")
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.dpi"] = 100
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = "svg"
Read Data
We are going to follow the example The overview of the basic approaches to solving the Uplift Modeling problem presented in scikit-uplift
’s Tutorials Section. We extend that example by adding an EDA section on the data and using causalml
for the uplift estimation. We use sklift.viz
to generate diagnostic plots.
Data set: RetailHero.ai contest data
from pathlib import Path
data_path = Path("~/data")
# clients data
clients_df = pd.read_csv(
data_path / "clients.csv", parse_dates=["first_issue_date", "first_redeem_date"]
)
# treatment and target data
uplift_train_df = pd.read_csv(data_path / "uplift_train.csv")
clients_df data:
clients_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400162 entries, 0 to 400161
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 client_id 400162 non-null object
1 first_issue_date 400162 non-null datetime64[ns]
2 first_redeem_date 364693 non-null datetime64[ns]
3 age 400162 non-null int64
4 gender 400162 non-null object
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 15.3+ MB
uplift_train_df data:
uplift_train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200039 entries, 0 to 200038
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 client_id 200039 non-null object
1 treatment_flg 200039 non-null int64
2 target 200039 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.6+ MB
purchases_df data:
purchases_df = pd.read_csv(data_path / "purchases.csv")
purchases_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45786568 entries, 0 to 45786567
Data columns (total 13 columns):
# Column Dtype
--- ------ -----
0 client_id object
1 transaction_id object
2 transaction_datetime object
3 regular_points_received float64
4 express_points_received float64
5 regular_points_spent float64
6 express_points_spent float64
7 purchase_sum float64
8 store_id object
9 product_id object
10 product_quantity float64
11 trn_sum_from_iss float64
12 trn_sum_from_red float64
dtypes: float64(8), object(5)
memory usage: 4.4+ GB
EDA
Let us start by exploring the data. First, let us take a look into the target and treatment variables:
fig, ax = plt.subplots()
uplift_train_df \
.groupby(["treatment_flg", "target"], as_index=False) \
.size() \
.assign(
share=lambda x: x["size"] / x["size"].sum()
) \
.pipe((sns.barplot, "data"), x="target", y="share", hue="treatment_flg", ax=ax)
ax.yaxis.set_major_formatter(mtick.FuncFormatter(lambda y, _: f"{y :.0%}"))
ax.set(title="Targer by treatment distribution");
We have an approximately 50/50 split of the treatment_flg
but the target
is not balanced (approx. 40/60).
Now we examine the clients data. First, we count the number of unique client_id
.
assert clients_df.shape[0] == clients_df["client_id"].nunique()
assert uplift_train_df.shape[0] == uplift_train_df["client_id"].nunique()
print(f"""
clients_id
----------
clients_df: {clients_df["client_id"].nunique()}
uplift_train_df: {uplift_train_df["client_id"].nunique()}
""")
clients_id
----------
clients_df: 400162
uplift_train_df: 200039
We have more client_id
values in the clients_df
. Next, we merge the two data frames on client_id
.
raw_data_df = pd.merge(
left=clients_df, right=uplift_train_df, on="client_id", how="outer"
)
assert raw_data_df.shape[0] == clients_df.shape[0]
assert raw_data_df.shape[0] == raw_data_df["client_id"].nunique()
We continue by taking a look at the gender
feature.
Warning: Including gender
-like variables in ML models can induce undesirable biases. We keep this feature only so that we can compare the techniques with the original example.
g = (
raw_data_df.query("target.notnull()")
.groupby(["treatment_flg", "target", "gender"], as_index=False)
.agg(count=("client_id", "count"))
.pipe(
(sns.catplot, "data"),
x="target",
y="count",
hue="gender",
col="treatment_flg",
kind="bar",
)
)
Now we plot the age distribution. Note, however, that we first need to remove some outliers:
# reasonable age range
good_age_mask = "10 < age < 100"
print( f"""
Rows with age outliers:
{1 - clients_df.query(good_age_mask).shape[0] / clients_df.shape[0]: 0.2%}
""")
Rows with age outliers:
0.37%
raw_data_df.query("(target.notnull()) and (10 < age < 100)").pipe(
(sns.displot, "data"), x="age", hue="gender", col="treatment_flg", kind="kde"
);
We continue by studying the time variables. Note that the variable first_redeem_date
has missing values. Let us see target and treatment distribution over these missing values.
g = (
raw_data_df.assign(
first_redeem_date_is_null=lambda x: x["first_redeem_date"].isna()
)
.groupby(
["treatment_flg", "target", "first_redeem_date_is_null"], as_index=False
)
.agg(count=("client_id", "count"))
.pipe(
(sns.catplot, "data"),
x="target",
y="count",
hue="first_redeem_date_is_null",
col="treatment_flg",
kind="bar",
)
)
We do not see any pattern at first glance. Let us see how the client counts develop over first_issue_date
.
fig, ax = plt.subplots()
raw_data_df \
.assign(first_issue_date=lambda x: x["first_issue_date"].dt.date) \
.groupby(
["first_issue_date"], as_index=False
) \
.agg(count=("client_id", "count")) \
.pipe(
(sns.lineplot, "data"),
x="first_issue_date",
y="count",
label="first_issue_date",
ax=ax,
)
raw_data_df \
.query("first_redeem_date.isnull()") \
.assign(
first_issue_date=lambda x: x["first_issue_date"].dt.date
) \
.groupby(["first_issue_date"], as_index=False) \
.agg(count=("client_id", "count")) \
.pipe(
(sns.lineplot, "data"),
x="first_issue_date",
y="count",
label="first_issue_date (first_redeem_date null)",
ax=ax,
);
There seem to be missing values across the whole time period.
print(f"""
rows share with missing values:
{raw_data_df.query("first_redeem_date.isnull()").shape[0] / raw_data_df.shape[0]: 0.2%}
""")
rows share with missing values:
8.86%
From this initial EDA there is no hint about the source of these missing values, i.e. they appear to be missing at random (or maybe we are missing some information or context about the data?).
We now plot the client counts over first_issue_date
and first_redeem_date
:
fig, ax = plt.subplots()
raw_data_df \
.assign(first_issue_date=lambda x: x["first_issue_date"].dt.date) \
.groupby(["first_issue_date"], as_index=False) \
.agg(count=("client_id", "count")) \
.pipe((sns.lineplot, "data"),
x="first_issue_date",
y="count",
label="first_issue_date",
ax=ax,
)
raw_data_df \
.assign(first_redeem_date=lambda x: x["first_redeem_date"].dt.date) \
.groupby(["first_redeem_date"], as_index=False) \
.agg(count=("client_id", "count")) \
.pipe((sns.lineplot, "data"),
x="first_redeem_date",
y="count",
label="first_redeem_date",
ax=ax,
)
ax.set(xlabel="date", ylabel="count")
Finally, in order to enrich our models, we calculate some simple summary metrics from the purchase data:
client_purchases_summary_df = (
purchases_df.groupby(["client_id"], as_index=False)
.agg(
n_transactions=("transaction_id", "count"),
n_products=("product_id", "nunique"),
n_stores=("store_id", "nunique"),
last_transaction_date=("transaction_datetime", "max"),
express_points_received=("express_points_received", np.sum),
express_points_spent=("express_points_spent", np.sum),
regular_points_spent=("regular_points_spent", np.sum),
mean_product_quantity=("product_quantity", np.mean),
)
.assign(
last_transaction_date=lambda x: pd.to_datetime(x["last_transaction_date"])
)
)
Warning: We are using time-dependent features like last_transaction_date = ("transaction_datetime", "max")
, which have to be treated carefully when doing out-of-sample validation. Below we will do a train-test split by randomly selecting a fraction of the data based on the client_id
. Nevertheless, to obtain faithful out-of-sample evaluation metrics we would want to compute these features separately for each split; otherwise we would be leaking information. For the sake of this toy example we will not do this, but a minimal sketch of the idea is shown below.
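For reference, here is a minimal sketch (not executed in this notebook) of how one could recompute the purchase features per split using only transactions observed before a cutoff date. The cutoff split_date and the name client_purchases_summary_split_df are hypothetical and used for illustration only.
# hypothetical cutoff date separating the feature period from the evaluation period
split_date = pd.Timestamp("2019-02-01")
# keep only transactions observed before the cutoff
purchases_before_cutoff_df = purchases_df.assign(
    transaction_datetime=lambda x: pd.to_datetime(x["transaction_datetime"])
).query("transaction_datetime < @split_date")
# recompute (a subset of) the summary features on this restricted period only
client_purchases_summary_split_df = purchases_before_cutoff_df.groupby(
    ["client_id"], as_index=False
).agg(
    n_transactions=("transaction_id", "count"),
    last_transaction_date=("transaction_datetime", "max"),
)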
Prepare Data
In this section we prepare the data for uplift modeling. We will use the features from the original example (with minor modifications) plus some purchase features.
# add purchase features
raw_data_ext_df = raw_data_df \
.copy() \
.merge(
right=client_purchases_summary_df,
on="client_id",
how="left"
)
transformation_map = {
"first_issue_time": lambda x: (x["first_issue_date"] - pd.Timestamp("2017-01-01")).dt.days,
"first_issue_time_weekday": lambda x: x["first_issue_date"].dt.weekday,
"first_issue_time_month": lambda x: x["first_issue_date"].dt.month,
"first_redeem_time": lambda x: (x["first_redeem_date"] - pd.Timestamp("2017-01-01")).dt.days,
"issue_redeem_delay": lambda x: (x["first_redeem_time"] - x["first_issue_time"]),
"last_transaction_time": lambda x: (x["last_transaction_date"] - pd.Timestamp("2017-01-01")).dt.days,
}
data_df = (
raw_data_ext_df.copy()
.query("target.notnull()")
.query(good_age_mask)
.set_index("client_id")
.assign(**transformation_map)
.sort_values("first_issue_time")
.drop(
columns=[
"first_issue_date",
"first_redeem_date",
"last_transaction_date",
]
)
)
data_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 199305 entries, b5e94fd9dd to 903a531bb7
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 199305 non-null int64
1 gender 199305 non-null object
2 treatment_flg 199305 non-null float64
3 target 199305 non-null float64
4 n_transactions 199305 non-null int64
5 n_products 199305 non-null int64
6 n_stores 199305 non-null int64
7 express_points_received 199305 non-null float64
8 express_points_spent 199305 non-null float64
9 regular_points_spent 199305 non-null float64
10 mean_product_quantity 199305 non-null float64
11 first_issue_time 199305 non-null int64
12 first_issue_time_weekday 199305 non-null int64
13 first_issue_time_month 199305 non-null int64
14 first_redeem_time 181873 non-null float64
15 issue_redeem_delay 181873 non-null float64
16 last_transaction_time 199305 non-null int64
dtypes: float64(8), int64(8), object(1)
memory usage: 27.4+ MB
We can now show a pair-plot of the main features of the original example:
sns.pairplot(
data=data_df[
[
"first_issue_time",
"first_redeem_time",
"issue_redeem_delay",
]
],
kind="hist",
height=2.5,
aspect=1.5,
)
Now we do a simple train-validation split of the data.
target_col = "target"
treatment_col = "treatment_flg"
y = data_df[target_col]
w = data_df[treatment_col]
x = data_df.drop(columns=[treatment_col, target_col])
idx_train, idx_val = train_test_split(
data_df.index,
test_size=0.3,
random_state=42,
stratify=(y.astype(str) + "_" + w.astype(str)),
)
x_train = x.loc[idx_train]
x_val = x.loc[idx_val]
w_train = w.loc[idx_train]
w_val = w.loc[idx_val]
y_train = y.loc[idx_train]
y_val = y.loc[idx_val]
Let us encode the gender
as an ordinal categorical variable.
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(x_train[["gender"]])
x_train_transformed = x_train.assign(
gender=lambda x: ordinal_encoder.transform(x[["gender"]])
)
x_val_transformed = x_val.assign(
gender=lambda x: ordinal_encoder.transform(x[["gender"]])
)
Propensity Score Model
The propensity score is defined as \(p(X_{i}) = P(W_{i}=1 | X_{i})\), that is, the probability of receiving the treatment given the covariates. If the treatment assignment is random, these scores should be concentrated around 0.5. For a nice introduction to the subject you can see “Propensity Score Matching: A Non-experimental Approach to Causal Inference” by Michael Johns, PyData New York 2019. We use scikit-learn
’s HistGradientBoostingClassifier
. For this model we need to explicitly indicate the categorical variables:
categorical_features = ["gender"]
hgc_params = {
"categorical_features": np.argwhere(
[col in categorical_features for col in x_train_transformed.columns]
).flatten()
}
We now fit the model:
propensity_model = HistGradientBoostingClassifier(**hgc_params)
propensity_model.fit(X=x_train_transformed, y=w_train)
p_train = propensity_model.predict_proba(X=x_train_transformed)
p_val = propensity_model.predict_proba(X=x_val_transformed)
# keep the probability of receiving the treatment, i.e. P(W=1|X)
p_train = pd.Series(p_train[:, 1], index=idx_train)
p_val = pd.Series(p_val[:, 1], index=idx_val)
Let's see the results:
fig, ax = plt.subplots()
sns.kdeplot(x=p_train, label="train", ax=ax)
sns.kdeplot(x=p_val, label="val", ax=ax)
ax.legend()
ax.set(
title="Propensity Score Predictions Distribution",
xlabel="propensity score",
ylabel="density",
);
print(f"""
Share of predictions with |p - 0.5| > 0.2 (train) {p_train[abs(p_train - 0.5) > 0.2].size / p_train.size : 0.2%}
Share of predictions with |p - 0.5| > 0.2 (val) {p_val[abs(p_val - 0.5) > 0.2].size / p_val.size : 0.2%}
""")
Share of predictions with |p - 0.5| > 0.2 (train) 1.46%
Share of predictions with |p - 0.5| > 0.2 (val) 1.47%
from sklearn.inspection import permutation_importance
pi = permutation_importance(
estimator=propensity_model, X=x_train_transformed, y=w_train
)
fig, ax = plt.subplots(figsize=(8, 8))
idx = pi["importances_mean"].argsort()[::-1]
sns.barplot(
x=pi["importances_mean"][idx],
y=x_train_transformed.columns[idx],
color="C4",
ax=ax
)
ax.set(title="Permutation importance propensity score model");
Data Container
We now define a convenient data structure for the uplift models' input data.
from dataclasses import dataclass
@dataclass
class DataIn:
x: pd.DataFrame
x_transformed: pd.DataFrame
y: pd.Series
treatment: pd.Series
p: pd.Series
data_train = DataIn(
x=x_train,
x_transformed=x_train_transformed,
y=y_train,
treatment=w_train,
p=p_train,
)
data_val = DataIn(
x=x_val,
x_transformed=x_val_transformed,
y=y_val,
treatment=w_val,
p=p_val
)
Models
Now that we have a better understanding of the data we can start modeling. We use some of the meta-learners from causalml
. For more details please see the causalml
documentation.
S-Learner
In this meta-model we simply train a single ML model to predict the target
variable using the covariates \(X\) plus the treatment
indicator \(W\) as regressors.
- Step 1: Training
\[ \underbrace{ \left( \begin{array}{cccc} x_{11} & \cdots & x_{1k} & w_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{11} & \cdots & x_{nk} & w_{n} \\ \end{array} \right)}_{X\bigoplus W} \xrightarrow{\mu} \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n} \end{array} \right) \]
- Step 2: Uplift Prediction
\[ \widehat{\text{uplift}} = \hat{\mu}\left( \begin{array}{cccc} x_{11} & \cdots & x_{1k} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{m1} & \cdots & x_{mk} & 1 \\ \end{array} \right) - \hat{\mu} \left( \begin{array}{cccc} x_{11} & \cdots & x_{1k} & 0 \\ \vdots & \ddots & \vdots & \vdots \\ x_{m1} & \cdots & x_{mk} & 0 \\ \end{array} \right) \]
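To make these two steps concrete, here is a minimal from-scratch sketch of the S-Learner logic using scikit-learn only; causalml's BaseSClassifier, used below, wraps this pattern. The name uplift_s_sketch is illustrative, and we reuse the training objects defined above.
# step 1: fit a single model on the covariates plus the treatment indicator
mu = HistGradientBoostingClassifier(**hgc_params)
mu.fit(X=x_train_transformed.assign(treatment=w_train), y=y_train)
# step 2: predict with the treatment indicator switched on and off
uplift_s_sketch = (
    mu.predict_proba(X=x_val_transformed.assign(treatment=1))[:, 1]
    - mu.predict_proba(X=x_val_transformed.assign(treatment=0))[:, 1]
)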
s_learner = BaseSClassifier(learner=HistGradientBoostingClassifier(**hgc_params))
s_ate, s_ate_lwr, s_ate_upr = s_learner.estimate_ate(
X=data_train.x_transformed, treatment=data_train.treatment, y=data_train.y
)
One can access the trained model as:
s_learner.models[1]
HistGradientBoostingClassifier(categorical_features=array([1]))
T-Learner
For this meta-learner we instead train two ML models: one on the control group and one on the treatment group.
- Step 1: Training
\[ \underbrace{ \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{C}1} & \cdots & x_{n_{C}k} \\ \end{array} \right)}_{X|_{\text{control}}} \xrightarrow{\mu_{C}} \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n_{C}} \end{array} \right) \]
\[ \underbrace{ \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{T}1} & \cdots & x_{n_{T}k} \\ \end{array} \right)}_{X |_{\text{treatment}}} \xrightarrow{\mu_{T}} \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n_{T}} \end{array} \right) \]
- Step 2: Uplift Prediction
\[ \widehat{\text{uplift}} = \hat{\mu}_{T}\left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mk} \\ \end{array} \right) - \hat{\mu}_{C} \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mk} \\ \end{array} \right) \]
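Again, a minimal from-scratch sketch of the same logic (causalml's BaseTClassifier, used below, handles this internally); uplift_t_sketch is an illustrative name.
# step 1: fit one model per treatment group
mu_c = HistGradientBoostingClassifier(**hgc_params)
mu_t = HistGradientBoostingClassifier(**hgc_params)
mu_c.fit(X=x_train_transformed[w_train == 0], y=y_train[w_train == 0])
mu_t.fit(X=x_train_transformed[w_train == 1], y=y_train[w_train == 1])
# step 2: the estimated uplift is the difference of the predicted probabilities
uplift_t_sketch = (
    mu_t.predict_proba(X=x_val_transformed)[:, 1]
    - mu_c.predict_proba(X=x_val_transformed)[:, 1]
)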
t_learner = BaseTClassifier(learner=HistGradientBoostingClassifier(**hgc_params))
t_ate, t_ate_lwr, t_ate_upr = t_learner.estimate_ate(
X=data_train.x_transformed, treatment=data_train.treatment, y=data_train.y
)
One can access the trained models as:
t_learner.models_c[1] # control group
t_learner.models_t[1] # treatment group
HistGradientBoostingClassifier(categorical_features=array([1]))
X-Learner
The X-Learner is similar to the T-Learner, but it adds an additional step where we transfer information from one model to the other; see Künzel, S. R., et al. (2019), “Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning”, for details on the motivation.
- Step 1: Training
\[ \underbrace{ \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{C}1} & \cdots & x_{n_{C}k} \\ \end{array} \right)}_{X|_{\text{control}}} \xrightarrow{\mu_{C}} \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n_{C}} \end{array} \right) \]
\[ \underbrace{ \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{T}1} & \cdots & x_{n_{T}k} \\ \end{array} \right)}_{X |_{\text{treatment}}} \xrightarrow{\mu_{T}} \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n_{T}} \end{array} \right) \]
- Step 2: Compute imputed treatment effects
\[ \tilde{D}^{T} := \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n_{T}} \end{array} \right) - \hat{\mu}_{C} \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{T}1} & \cdots & x_{n_{T}k} \\ \end{array} \right) \]
\[ \tilde{D}^{C} := \hat{\mu}_{T} \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{C}1} & \cdots & x_{n_{C}k} \\ \end{array} \right) - \left( \begin{array}{c} y_{1} \\ \vdots \\ y_{n_{C}} \end{array} \right) \]
- Step 3: Train effect models on the imputed treatment effects
\[ \underbrace{ \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{C}1} & \cdots & x_{n_{C}k} \\ \end{array} \right)}_{X|_{\text{control}}} \xrightarrow{\tau_{C}} \left( \begin{array}{c} \tilde{D}^{C}_{1} \\ \vdots \\ \tilde{D}^{C}_{n_{C}} \end{array} \right) \]
\[ \underbrace{ \left( \begin{array}{ccc} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n_{T}1} & \cdots & x_{n_{T}k} \\ \end{array} \right)}_{X|_{\text{treatment}}} \xrightarrow{\tau_{T}} \left( \begin{array}{c} \tilde{D}^{T}_{1} \\ \vdots \\ \tilde{D}^{T}_{n_{T}} \end{array} \right) \]
- Step 4: Uplift Prediction
\[ \widehat{\text{uplift}} = g(x)\hat{\tau}_{C}(x) + (1 - g(x))\hat{\tau}_{T}(x) \]
where \(g(x) \in [0, 1]\) is a weight function.
Remark: From Künzel, S. R., et al. (2019), “Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning”:
\(\hat{\tau}_{C}\) and \(\hat{\tau}_{T}\) are both estimators for \(\tau\) (the uplift) while \(g\) is chosen to combine these estimators to one improved estimator \(\hat{\tau}\). Based on our experience, we observe that it is good to use an estimate of the propensity score for \(g\), but it also makes sense to choose \(g = 1\) or \(0\), if the number of treated units is very large or small compared to the number of control units.
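As before, here is a minimal from-scratch sketch of the four steps, using the propensity score estimates from above as the weight function \(g\); causalml's BaseXClassifier, used below, handles this internally. The names (d_t, tau_c, uplift_x_sketch, etc.) are illustrative.
x_c, y_c = x_train_transformed[w_train == 0], y_train[w_train == 0]
x_t, y_t = x_train_transformed[w_train == 1], y_train[w_train == 1]
# step 1: outcome models per group
mu_c = HistGradientBoostingClassifier(**hgc_params).fit(X=x_c, y=y_c)
mu_t = HistGradientBoostingClassifier(**hgc_params).fit(X=x_t, y=y_t)
# step 2: imputed treatment effects
d_t = y_t - mu_c.predict_proba(X=x_t)[:, 1]
d_c = mu_t.predict_proba(X=x_c)[:, 1] - y_c
# step 3: effect models per group
tau_c = HistGradientBoostingRegressor(**hgc_params).fit(X=x_c, y=d_c)
tau_t = HistGradientBoostingRegressor(**hgc_params).fit(X=x_t, y=d_t)
# step 4: combine with the propensity score as weight g(x)
g = p_val.to_numpy()
uplift_x_sketch = (
    g * tau_c.predict(X=x_val_transformed)
    + (1 - g) * tau_t.predict(X=x_val_transformed)
)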
x_learner = BaseXClassifier(
outcome_learner=HistGradientBoostingClassifier(**hgc_params),
effect_learner=HistGradientBoostingRegressor(**hgc_params),
)
x_ate, x_ate_lwr, x_ate_upr = x_learner.estimate_ate(
X=data_train.x_transformed,
treatment=data_train.treatment,
y=data_train.y,
p=data_train.p,
)
One can access the trained models as:
# step 1
x_learner.models_mu_c[1] # control group
x_learner.models_mu_t[1] # treatment group
# step 3
x_learner.models_tau_c[1] # control group
x_learner.models_tau_t[1] # treatment group
HistGradientBoostingRegressor(categorical_features=array([1]))
fig, ax = plt.subplots(figsize=(6, 4))
pd.DataFrame(
data={
"model": ["s_learner", "t_learner", "x_learner"],
"ate": np.array([s_ate, t_ate, x_ate]).flatten(),
},
).pipe((sns.barplot, "data"), x="model", y="ate", ax=ax)
ax.set(title="ATE Estimation (Train)");
Predictions & Diagnostics
Now that we have fitted the meta-learner models, we generate predictions on the training and validation sets.
@dataclass
class DataOut:
meta_learner_name: str
meta_learner: BaseLearner
y_pred: np.ndarray
# in-sample predictions
data_out_train_s = DataOut(
meta_learner_name="S-Learner",
meta_learner=s_learner,
y_pred=s_learner.predict(
X=data_train.x_transformed, treatment=data_train.treatment
),
)
data_out_train_t = DataOut(
meta_learner_name="T-Learner",
meta_learner=t_learner,
y_pred=t_learner.predict(
X=data_train.x_transformed, treatment=data_train.treatment
),
)
data_out_train_x = DataOut(
meta_learner_name="X-Learner",
meta_learner=x_learner,
y_pred=x_learner.predict(
X=data_train.x_transformed, treatment=data_train.treatment, p=data_train.p
),
)
# out-of-sample predictions
data_out_val_s = DataOut(
meta_learner_name="S-Learner",
meta_learner=s_learner,
y_pred=s_learner.predict(
X=data_val.x_transformed, treatment=data_val.treatment
),
)
data_out_val_t = DataOut(
meta_learner_name="T-Learner",
meta_learner=t_learner,
y_pred=t_learner.predict(
X=data_val.x_transformed, treatment=data_val.treatment
),
)
data_out_val_x = DataOut(
meta_learner_name="X-Learner",
meta_learner=x_learner,
y_pred=x_learner.predict(
X=data_val.x_transformed, treatment=data_val.treatment, p=data_val.p
),
)
A natural question is: how well do we predict the uplift? Note that (Gutierrez, P., & Gérardy, J. Y. (2017). “Causal Inference and Uplift Modelling: A Review of the Literature”):
In machine learning, the standard is to use cross-validation: separate the data into a training and a testing datasets; learn on the training data, predict the target on the test data and compare to the ground truth. In uplift modeling, cross validation is still a valid idea but there is no more ground truth because we can never observe the effect of being treated and not treated on a person at the same time.
To answer this question we take a deep dive into some common model diagnostic tools.
Perfect uplift prediction
How would a perfect uplift model predict? Following Diemert, Eustache, et al. (2020), “A Large Scale Benchmark for Uplift Modeling”: > A perfect model assigns higher scores to all treated individuals with positive outcomes than any individuals with negative outcomes.
def perfect_uplift_model(data: DataIn):
# control Responders
cr_num = np.sum((data.y == 1) & (data.treatment == 0))
# treated Non-Responders
tn_num = np.sum((data.y == 0) & (data.treatment == 1))
# compute perfect uplift curve
summand = data.y if cr_num > tn_num else data.treatment
return 2 * (data.y == data.treatment) + summand
perfect_uplift_train = perfect_uplift_model(data=data_train)
perfect_uplift_val = perfect_uplift_model(data=data_val)
We can compare the sorted predictions of the models against the perfect one.
fig, ax = plt.subplots(
nrows=2, ncols=1, figsize=(8, 8), sharex=True, layout="constrained"
)
sns.lineplot(
x=range(data_train.y.size),
y=np.sort(a=perfect_uplift_train)[::-1],
color="C3",
label="perfect model",
ax=ax[0],
)
sns.lineplot(
x=range(data_train.y.size),
y=np.sort(a=data_out_train_s.y_pred.flatten())[::-1],
color="C0",
label="S Learner",
ax=ax[1],
)
sns.lineplot(
x=range(data_train.y.size),
y=np.sort(a=data_out_train_t.y_pred.flatten())[::-1],
color="C1",
label="T Learner",
ax=ax[1],
)
sns.lineplot(
x=range(data_train.y.size),
y=np.sort(a=data_out_train_x.y_pred.flatten())[::-1],
color="C2",
label="X Learner",
ax=ax[1],
)
ax[1].set(xlabel="Number treated")
fig.suptitle("np.sort(a=uplift_prediction)[::-1] (train)");
Uplift by percentile
- Sort observations by predicted uplift in decreasing order.
- Split them into percentile bins.
- Compute the average observed outcome (response rate) per percentile in the treated and control groups.
- Take the difference between those response rates for each percentile.
This difference gives an idea of the uplift gain per percentile. One can compute this using the uplift_by_percentile
function (from sklift.metrics
). Let us see what the table looks like for the S-Learner.
uplift_by_percentile_df = uplift_by_percentile(
y_true=data_train.y,
uplift=data_out_train_s.y_pred.flatten(),
treatment=data_train.treatment,
strategy="overall",
total=True,
)
uplift_by_percentile_df
| percentile | n_treatment | n_control | response_rate_treatment | response_rate_control | uplift |
|---|---|---|---|---|---|
| 0-10 | 7090 | 6862 | 0.612412 | 0.398863 | 0.213549 |
| 10-20 | 7024 | 6928 | 0.575456 | 0.478637 | 0.096818 |
| 20-30 | 7037 | 6915 | 0.587040 | 0.523644 | 0.063396 |
| 30-40 | 7013 | 6938 | 0.630543 | 0.570193 | 0.060350 |
| 40-50 | 7089 | 6862 | 0.665961 | 0.634946 | 0.031015 |
| 50-60 | 7019 | 6932 | 0.708790 | 0.688113 | 0.020677 |
| 60-70 | 7087 | 6864 | 0.749683 | 0.732226 | 0.017456 |
| 70-80 | 6910 | 7041 | 0.748625 | 0.774180 | -0.025555 |
| 80-90 | 6870 | 7081 | 0.623144 | 0.656687 | -0.033543 |
| 90-100 | 6583 | 7368 | 0.454960 | 0.575054 | -0.120095 |
| total | 69722 | 69791 | 0.636743 | 0.603531 | 0.033213 |
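For intuition, here is a minimal sketch of how these response rates and uplifts per percentile can be reproduced by hand under the same strategy="overall"; manual_uplift_by_percentile_df is an illustrative name.
manual_uplift_by_percentile_df = (
    pd.DataFrame(
        data={
            "y": data_train.y.to_numpy(),
            "w": data_train.treatment.to_numpy().astype(int),
            "uplift": data_out_train_s.y_pred.flatten(),
        }
    )
    # sort by predicted uplift in decreasing order
    .sort_values("uplift", ascending=False)
    # assign each observation to a percentile bin
    .assign(percentile=lambda x: pd.qcut(np.arange(x.shape[0]), q=10, labels=False))
    # response rate per percentile and treatment group
    .groupby(["percentile", "w"])["y"]
    .mean()
    .unstack("w")
    # uplift is the difference of the response rates
    .assign(uplift=lambda x: x[1] - x[0])
)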
A well performing model would have large values in the first percentiles and decreasing values for larger ones. Now we can generate the plots:
train_pred = [data_out_train_s, data_out_train_t, data_out_train_x]
for data_out_train in train_pred:
ax = plot_uplift_by_percentile(
y_true=data_train.y,
uplift=data_out_train.y_pred.flatten(),
treatment=data_train.treatment,
strategy="overall",
kind="line",
)
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
fig = ax.get_figure()
fig.suptitle(
f"In-sample predictions ({data_out_train.meta_learner_name})", y=1.1
)
val_pred = [data_out_val_s, data_out_val_t, data_out_val_x]
for data_out_val in val_pred:
ax = plot_uplift_by_percentile(
y_true=data_val.y,
uplift=data_out_val.y_pred.flatten(),
treatment=data_val.treatment,
strategy="overall",
kind="line",
)
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
fig = ax.get_figure()
fig.suptitle(
f"Out-of-sample predictions ({data_out_val.meta_learner_name})", y=1.1
)
Remark: Here is the uplift by percentile table for the perfect model (train):
uplift_by_percentile_df = uplift_by_percentile(
y_true=data_train.y,
uplift=perfect_uplift_train,
treatment=data_train.treatment,
strategy="overall",
total=False,
)
uplift_by_percentile_df
| percentile | n_treatment | n_control | response_rate_treatment | response_rate_control | uplift |
|---|---|---|---|---|---|
| 0-10 | 13952 | 0 | 1.0 | NaN | NaN |
| 10-20 | 13952 | 0 | 1.0 | NaN | NaN |
| 20-30 | 13952 | 0 | 1.0 | NaN | NaN |
| 30-40 | 2539 | 11412 | 1.0 | 0.000000 | 1.0 |
| 40-50 | 0 | 13951 | NaN | 0.000000 | NaN |
| 50-60 | 0 | 13951 | NaN | 0.834636 | NaN |
| 60-70 | 0 | 13951 | NaN | 1.000000 | NaN |
| 70-80 | 0 | 13951 | NaN | 1.000000 | NaN |
| 80-90 | 11376 | 2575 | 0.0 | 1.000000 | -1.0 |
| 90-100 | 13951 | 0 | 0.0 | NaN | NaN |
Cumulative gain chart
Predict uplift for both treated and control observations and compute the average prediction per decile (bins) in both groups. Then, the difference between those averages is taken for each decile. (Gutierrez, P., & Gérardy, J. Y. (2017). “Causal Inference and Uplift Modelling: A Review of the Literature”)
\[ \left( \frac{Y^{T}}{N^{T}} - \frac{Y^{C}}{N^{C}} \right) (N^{T} + N^{C}) \]
- \(Y^{T} / Y^{C}\): sum of the treated / control individual outcomes in the bin.
- \(N^{T} / N^{C}\): number of treated / control observations in the bin.
We can compute this from the tables above. For example, for the S-Learner:
uplift_by_percentile_df = uplift_by_percentile(
y_true=data_train.y,
uplift=data_out_train_s.y_pred.flatten(),
treatment=data_train.treatment,
strategy="overall",
total=False,
)
def compute_response_absolutes(df: pd.DataFrame) -> pd.DataFrame:
df["responses_treatment"] = df["n_treatment"] * df["response_rate_treatment"]
df["responses_control"] = df["n_control"] * df["response_rate_control"]
return df
def compute_cumulative_response_rates(df: pd.DataFrame) -> pd.DataFrame:
df["n_treatment_cumsum"] = df["n_treatment"].cumsum()
df["n_control_cumsum"] = df["n_control"].cumsum()
df["responses_treatment_cumsum"] = df["responses_treatment"].cumsum()
df["responses_control_cumsum"] = df["responses_control"].cumsum()
df["response_rate_treatment_cumsum"] = (
df["responses_treatment_cumsum"] / df["n_treatment_cumsum"]
)
df["response_rate_control_cumsum"] = (
df["responses_control_cumsum"] / df["n_control_cumsum"]
)
return df
def compute_cumulative_gain(df: pd.DataFrame) -> pd.DataFrame:
df["uplift_cumsum"] = (
df["response_rate_treatment_cumsum"] - df["response_rate_control_cumsum"]
)
df["cum_gain"] = df["uplift_cumsum"] * (
df["n_treatment_cumsum"] + df["n_control_cumsum"]
)
return df
fig, ax = plt.subplots()
uplift_by_percentile_df \
.pipe(compute_response_absolutes) \
.pipe(compute_cumulative_response_rates) \
.pipe(compute_cumulative_gain) \
.plot(y="cum_gain", kind="line", marker="o", ax=ax)
ax.legend().remove()
ax.set(
title="Cumulative gain by percentile - S Learned (train)",
ylabel="cumulative gain"
);
Remark: From Gutierrez, P., & Gérardy, J. Y. (2017). “Causal Inference and Uplift Modelling: A Review of the Literature”, > This is useful to marketers because they can easily see if the treatment has a global positive or negative effect and if they can expect a better gain by targeting part of the population. We can thus choose the decile that maximizes the gain as the limit of the population to be targeted.
Uplift Curve
We can generalize the cumulative gain chart by computing it at each observation of the test set:
\[ f(t) = \left( \frac{Y^{T}_{t}}{N^{T}_{t}} - \frac{Y^{C}_{t}}{N^{C}_{t}} \right) (N^{T}_{t} + N^{C}_{t}) \]
where the \(t\) subscript indicates that the quantity is calculated for the first \(t\) observations, sorted by inferred uplift value. (Gutierrez, P., & Gérardy, J. Y. (2017). “Causal Inference and Uplift Modelling: A Review of the Literature”)
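Before using the library plots, here is a minimal sketch of this formula evaluated at every observation for the in-sample S-Learner predictions; sklift's uplift_curve additionally handles ties and adds the origin point. The variable names are illustrative.
# sort observations by predicted uplift in decreasing order
order = np.argsort(data_out_train_s.y_pred.flatten())[::-1]
y_sorted = data_train.y.to_numpy()[order]
w_sorted = data_train.treatment.to_numpy()[order]
# cumulative counts and outcome sums over the first t observations
n_t_cum = np.cumsum(w_sorted)  # N^T_t
n_c_cum = np.cumsum(1 - w_sorted)  # N^C_t
y_t_cum = np.cumsum(y_sorted * w_sorted)  # Y^T_t
y_c_cum = np.cumsum(y_sorted * (1 - w_sorted))  # Y^C_t
# uplift curve values f(t)
with np.errstate(divide="ignore", invalid="ignore"):
    f_t = (y_t_cum / n_t_cum - y_c_cum / n_c_cum) * (n_t_cum + n_c_cum)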
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15), layout="constrained")
# in-sample
for i, data_out_train in enumerate(train_pred):
ax = axes[i, 0]
plot_uplift_curve(
y_true=data_train.y,
uplift=data_out_train.y_pred.flatten(),
treatment=data_train.treatment,
perfect=True,
ax=ax,
)
ax.set(title=f"In-sample predictions ({data_out_train.meta_learner_name})")
# out-of-sample
for j, data_out_val in enumerate(val_pred):
ax = axes[j, 1]
plot_uplift_curve(
y_true=data_val.y,
uplift=data_out_val.y_pred.flatten(),
treatment=data_val.treatment,
perfect=True,
ax=ax,
)
ax.set(title=f"Out-sample predictions ({data_out_val.meta_learner_name})")
fig.suptitle("Uplift Curves", fontsize=24);
A remark on the perfect uplift curve (Diemert, Eustache, et al. (2020), “A Large Scale Benchmark for Uplift Modeling”): > A perfect model assigns higher scores to all treated individuals with positive outcomes than any individuals with negative outcomes.
from sklift.metrics import uplift_curve
num_all, curve_values = uplift_curve(
y_true=data_train.y, uplift=perfect_uplift_train, treatment=data_train.treatment
)
fig, ax1 = plt.subplots(figsize=(8, 6))
ax2 = ax1.twinx()
sns.lineplot(
x=num_all,
y=curve_values,
color="C2",
marker="o",
markersize=10,
label="perfect uplift curve",
ax=ax1,
)
sns.lineplot(
x=range(data_train.y.size),
y=np.sort(a=perfect_uplift_train)[::-1],
color="C3",
label="np.sort(a=perfect_uplift_train)[::-1]",
ax=ax2,
)
ax1.legend(loc="upper center", bbox_to_anchor=(0.5, -0.12), ncol=1)
ax1.set(
xlabel="Number targeted",
ylabel="Number of incremental outcome",
title="Perfect Uplift Curve",
)
ax2.grid(None)
ax2.legend(loc="upper center", bbox_to_anchor=(0.5, -0.20), ncol=1);
We can compare the perfect uplift curve against a random one:
# number of random uplift curves to generate
n_random_samples = 100
# sample random uplift curves from a uniform distribution
uplift_random_samples = np.random.uniform(
low=-1,
high=1,
size=(data_train.y.size, n_random_samples),
)
# compute uplift curve for each random sample
random_uplift_curves = [
uplift_curve(
y_true=data_train.y,
uplift=uplift_random_samples[:, i],
treatment=data_train.treatment,
)
for i in range(n_random_samples)
]
# plot
fig, ax = plt.subplots(
nrows=2, ncols=1, figsize=(8, 10), sharex=True, layout="constrained"
)
# perfect uplift curve
sns.lineplot(
x=num_all,
y=curve_values,
color="C2",
marker="o",
markersize=10,
label="perfect uplift curve",
ax=ax[1],
)
# random uplift curves
for x, y in random_uplift_curves:
ax[0].plot(x, y, color="C1", alpha=0.05)
ax[1].plot(x, y, color="C1", alpha=0.05)
ax[0].set(title="Random Uplift Curves", ylabel="Number of incremental outcome")
ax[1].set(xlabel="Number targeted", ylabel="Number of incremental outcome");
Qini Curve
There is another variant for measuring the uplift, the Qini curve. It is defined as follows:
\[ g(t) = Y^{T}_{t} - Y^{C}_{t} \left( \frac{N^{T}_{t}}{N^{C}_{t}} \right) \]
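As with the uplift curve, here is a minimal sketch of this formula evaluated at every observation for the in-sample S-Learner predictions; sklift's plot_qini_curve, used below, implements the same idea. The variable names are illustrative.
# sort observations by predicted uplift in decreasing order
order = np.argsort(data_out_train_s.y_pred.flatten())[::-1]
y_sorted = data_train.y.to_numpy()[order]
w_sorted = data_train.treatment.to_numpy()[order]
# cumulative counts and outcome sums over the first t observations
n_t_cum = np.cumsum(w_sorted)
n_c_cum = np.cumsum(1 - w_sorted)
y_t_cum = np.cumsum(y_sorted * w_sorted)
y_c_cum = np.cumsum(y_sorted * (1 - w_sorted))
# Qini curve values g(t)
with np.errstate(divide="ignore", invalid="ignore"):
    g_t = y_t_cum - y_c_cum * (n_t_cum / n_c_cum)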
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15), layout="constrained")
# in-sample
for i, data_out_train in enumerate(train_pred):
ax = axes[i, 0]
plot_qini_curve(
y_true=data_train.y,
uplift=data_out_train.y_pred.flatten(),
treatment=data_train.treatment,
perfect=True,
ax=ax,
)
ax.set(title=f"In-sample predictions ({data_out_train.meta_learner_name})")
# out-of-sample
for j, data_out_val in enumerate(val_pred):
ax = axes[j, 1]
plot_qini_curve(
y_true=data_val.y,
uplift=data_out_val.y_pred.flatten(),
treatment=data_val.treatment,
perfect=True,
ax=ax,
)
ax.set(title=f"Out-sample predictions ({data_out_val.meta_learner_name})")
fig.suptitle("Qini Curves", fontsize=24);
For more details on the Qini curve and related metrics, see Section 4 in Diemert, Eustache, et al. (2020), “A Large Scale Benchmark for Uplift Modeling”.