Introduction to Uplift Modeling

Dr. Juan Orduz

Mathematician & Data Scientist

PyConDE & PyData Berlin 2022


How can we optimally select customers to be treated by marketing incentives?


We cannot both send and withhold an incentive for the same customer at the same time


What is Uplift Modeling?

From Gutierrez, P., & Gérardy, J. Y. (2017). "Causal Inference and Uplift Modelling: A Review of the Literature"

  • Uplift modeling refers to the set of techniques used to model the incremental impact of an action or treatment on a customer outcome.

  • Uplift modeling is therefore both a Causal Inference problem and a Machine Learning one.

Conditional Average Treatment Effect

  • Let $Y^{1}_{i}$ denote person $i$'s outcome when they receive the treatment and $Y^{0}_{i}$ their outcome when they do not.

  • We are interested in:

    • The causal effect $\tau_{i} \coloneqq Y^{1}_{i} - Y^{0}_{i}$
    • Given a feature vector $X_{i}$ of the $i$-th person, we would like to estimate the conditional average treatment effect

      $$\text{CATE}: \quad \tau(X_{i}) \coloneqq E[Y^{1}_{i} \mid X_{i}] - E[Y^{0}_{i} \mid X_{i}]$$

  • However, we cannot observe both potential outcomes for the same person! 🙁

CATE Estimation

Let $W_{i}$ be a binary variable indicating whether person $i$ received the treatment, so that

$$Y_{i}^{obs} = Y^{1}_{i} W_{i} + (1 - W_{i}) Y^{0}_{i}$$

Unconfoundedness Assumption

If we assume that the treatment assignment $W_{i}$ is independent of $Y^{1}_{i}$ and $Y^{0}_{i}$ conditional on $X_{i}$, then we can estimate the CATE from observational data by computing the empirical counterpart:

$$\textbf{uplift} = \widehat{\tau}(X_{i}) = E[Y^{obs}_{i} \mid X_{i}, W_{i}=1] - E[Y^{obs}_{i} \mid X_{i}, W_{i}=0]$$
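The estimator above can be checked on simulated data, where both potential outcomes are known. The following is a minimal sketch, assuming a single binary feature, a fully randomized treatment (so unconfoundedness holds by construction), and made-up effect sizes of 1 and 3; none of these values come from the talk.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# Binary feature and randomized treatment (independent of the potential outcomes).
x = rng.integers(0, 2, size=n)
w = rng.integers(0, 2, size=n)

# Potential outcomes: the true effect is 1.0 when x == 0 and 3.0 when x == 1.
y0 = rng.normal(loc=0.0, scale=1.0, size=n)
y1 = y0 + np.where(x == 1, 3.0, 1.0)

# Only one potential outcome is observed per person.
y_obs = y1 * w + (1 - w) * y0


def cate_estimate(x_value: int) -> float:
    """Difference of observed means between treated and control units with X = x_value."""
    treated = (x == x_value) & (w == 1)
    control = (x == x_value) & (w == 0)
    return float(y_obs[treated].mean() - y_obs[control].mean())


cate_x0 = cate_estimate(0)
cate_x1 = cate_estimate(1)
print(f"CATE(x=0) ~ {cate_x0:.2f}, CATE(x=1) ~ {cate_x1:.2f}")
```

With a randomized treatment the two estimates should land close to the simulated effects, even though `y0` and `y1` are never observed together for any single row.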

🤔 Data Collection


Estimating Uplift

  • Meta Algorithms $\longleftarrow$ Today

  • The Class Transformation Method

  • Direct measurements (e.g. trees)


S-Learner

Step 1: Training

$$
\underbrace{
\left(
\begin{array}{cccc}
x_{11} & \cdots & x_{1k} & w_{1} \\
\vdots & \ddots & \vdots & \vdots \\
x_{n1} & \cdots & x_{nk} & w_{n}
\end{array}
\right)
}_{X \bigoplus W}
\xrightarrow{\mu}
\left(
\begin{array}{c}
y_{1} \\
\vdots \\
y_{n}
\end{array}
\right)
$$

Step 2: Uplift Prediction

$$
\widehat{\tau}(X') =
\hat{\mu}\left(
\begin{array}{cccc}
x_{11}' & \cdots & x_{1k}' & 1 \\
\vdots & \ddots & \vdots & \vdots \\
x_{m1}' & \cdots & x_{mk}' & 1
\end{array}
\right)
-
\hat{\mu}\left(
\begin{array}{cccc}
x_{11}' & \cdots & x_{1k}' & 0 \\
\vdots & \ddots & \vdots & \vdots \\
x_{m1}' & \cdots & x_{mk}' & 0
\end{array}
\right)
$$
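The two steps above (a single model with the treatment flag as an extra feature, often called the S-Learner) can be sketched in a few lines. This is only an illustration: the base learner here is an ordinary least-squares fit via NumPy, and the simulated data with a constant effect of 2 are assumptions made for the example, not the talk's setup.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, k = 5_000, 3

# Simulated data (assumption): linear outcome with a constant treatment effect of 2.
X = rng.normal(size=(n, k))
w = rng.integers(0, 2, size=n)
true_tau = 2.0
y = X @ np.array([1.0, -0.5, 0.25]) + true_tau * w + rng.normal(scale=0.1, size=n)


def design(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stack an intercept, the features and the treatment flag: [1, X, W]."""
    return np.column_stack([np.ones(len(X)), X, w])


# Step 1: fit a single model mu on the augmented matrix (X with W appended).
coef, *_ = np.linalg.lstsq(design(X, w), y, rcond=None)


def mu_hat(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Prediction of the fitted model for features X and treatment flags w."""
    return design(X, w) @ coef


# Step 2: predict with W = 1 and with W = 0 for every new row, then subtract.
X_new = rng.normal(size=(10, k))
tau_hat = mu_hat(X_new, np.ones(len(X_new))) - mu_hat(X_new, np.zeros(len(X_new)))
```

Because the linear model has no interaction between features and treatment, `tau_hat` is the same for every row here; a more flexible base learner would be needed to pick up heterogeneous effects.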


T-Learner

Step 1: Training

$$
\underbrace{
\left(
\begin{array}{ccc}
x_{11} & \cdots & x_{1k} \\
\vdots & \ddots & \vdots \\
x_{n_{C}1} & \cdots & x_{n_{C}k}
\end{array}
\right)
}_{X^{C} \coloneqq X|_{\text{control}}}
\xrightarrow{\mu_{C}}
\left(
\begin{array}{c}
y_{1} \\
\vdots \\
y_{n_{C}}
\end{array}
\right)
$$

$$
\underbrace{
\left(
\begin{array}{ccc}
x_{11} & \cdots & x_{1k} \\
\vdots & \ddots & \vdots \\
x_{n_{T}1} & \cdots & x_{n_{T}k}
\end{array}
\right)
}_{X^{T} \coloneqq X|_{\text{treatment}}}
\xrightarrow{\mu_{T}}
\left(
\begin{array}{c}
y_{1} \\
\vdots \\
y_{n_{T}}
\end{array}
\right)
$$


Step 2: Uplift Prediction

$$
\widehat{\tau}(X') =
\hat{\mu}_{T}\left(
\begin{array}{ccc}
x_{11}' & \cdots & x_{1k}' \\
\vdots & \ddots & \vdots \\
x_{m1}' & \cdots & x_{mk}'
\end{array}
\right)
-
\hat{\mu}_{C}\left(
\begin{array}{ccc}
x_{11}' & \cdots & x_{1k}' \\
\vdots & \ddots & \vdots \\
x_{m1}' & \cdots & x_{mk}'
\end{array}
\right)
$$
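The two-model recipe above can be illustrated as follows. This is a hedged sketch: the helpers `fit_linear` and `predict_linear` are hypothetical names introduced for the example, the base learner is plain least squares, and the simulated effect $1 + x_{1}$ is an assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n, k = 5_000, 2

# Simulated data (assumption): the treatment effect depends on the first feature.
X = rng.normal(size=(n, k))
w = rng.integers(0, 2, size=n)
true_tau = 1.0 + X[:, 0]
y = X[:, 1] + true_tau * w + rng.normal(scale=0.1, size=n)


def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares coefficients for an intercept plus the columns of X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Predictions of a model fitted with fit_linear."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


# Step 1: one model per group.
mu_c = fit_linear(X[w == 0], y[w == 0])
mu_t = fit_linear(X[w == 1], y[w == 1])

# Step 2: uplift is the difference between the two predictions on new data.
X_new = rng.normal(size=(5, k))
tau_hat = predict_linear(mu_t, X_new) - predict_linear(mu_c, X_new)
```

Fitting the groups separately lets the estimated effect vary with the features, at the cost of each model seeing only part of the data.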


X-Learner

Step 1: Training (same as T-Learner)

Step 2: Compute imputed treatment effects

$$
\tilde{D}^{T} \coloneqq
\left(
\begin{array}{c}
y_{1} \\
\vdots \\
y_{n_{T}}
\end{array}
\right)
-
\hat{\mu}_{C}\left(
\begin{array}{ccc}
x_{11} & \cdots & x_{1k} \\
\vdots & \ddots & \vdots \\
x_{n_{T}1} & \cdots & x_{n_{T}k}
\end{array}
\right)
$$

$$
\tilde{D}^{C} \coloneqq
\hat{\mu}_{T}\left(
\begin{array}{ccc}
x_{11} & \cdots & x_{1k} \\
\vdots & \ddots & \vdots \\
x_{n_{C}1} & \cdots & x_{n_{C}k}
\end{array}
\right)
-
\left(
\begin{array}{c}
y_{1} \\
\vdots \\
y_{n_{C}}
\end{array}
\right)
$$


Step 3: Train with different targets

$$
\underbrace{
\left(
\begin{array}{ccc}
x_{11} & \cdots & x_{1k} \\
\vdots & \ddots & \vdots \\
x_{n_{C}1} & \cdots & x_{n_{C}k}
\end{array}
\right)
}_{X|_{\text{control}}}
\xrightarrow{\tau_{C}}
\left(
\begin{array}{c}
\tilde{D}^{C}_{1} \\
\vdots \\
\tilde{D}^{C}_{n_{C}}
\end{array}
\right)
$$

$$
\underbrace{
\left(
\begin{array}{ccc}
x_{11} & \cdots & x_{1k} \\
\vdots & \ddots & \vdots \\
x_{n_{T}1} & \cdots & x_{n_{T}k}
\end{array}
\right)
}_{X|_{\text{treatment}}}
\xrightarrow{\tau_{T}}
\left(
\begin{array}{c}
\tilde{D}^{T}_{1} \\
\vdots \\
\tilde{D}^{T}_{n_{T}}
\end{array}
\right)
$$


Step 4: Uplift Prediction

$$\widehat{\tau}(X') = g(X')\hat{\tau}_{C}(X') + (1 - g(X'))\hat{\tau}_{T}(X')$$

where $g(X') \in [0, 1]$ is a weight function.

Remark: A common choice for $g(X)$ is an estimator of the propensity score, which is defined as the probability of treatment given the covariates $X$, i.e. $p(W_{i}=1 \mid X_{i})$.
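The four steps can be put together in a short end-to-end sketch. Everything specific here is an assumption made for illustration: least squares as the base learner, the hypothetical helpers `fit_linear` and `predict_linear`, a simulated effect of $1 + x_{1}$, and a constant propensity estimate (the observed share of treated units) used as $g$.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n, k = 10_000, 2

# Simulated data (assumption): only about 10% of the units are treated.
X = rng.normal(size=(n, k))
w = (rng.random(n) < 0.1).astype(int)
true_tau = 1.0 + X[:, 0]
y = X[:, 1] + true_tau * w + rng.normal(scale=0.1, size=n)


def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares coefficients for an intercept plus the columns of X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Predictions of a model fitted with fit_linear."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


X_c, y_c = X[w == 0], y[w == 0]
X_t, y_t = X[w == 1], y[w == 1]

# Step 1: outcome models per group (same as the T-Learner).
mu_c = fit_linear(X_c, y_c)
mu_t = fit_linear(X_t, y_t)

# Step 2: imputed treatment effects.
d_t = y_t - predict_linear(mu_c, X_t)
d_c = predict_linear(mu_t, X_c) - y_c

# Step 3: regress the imputed effects on the features within each group.
tau_c = fit_linear(X_c, d_c)
tau_t = fit_linear(X_t, d_t)

# Step 4: weighted combination; g is a constant propensity estimate here.
g = float(w.mean())
X_new = rng.normal(size=(5, k))
tau_hat = g * predict_linear(tau_c, X_new) + (1.0 - g) * predict_linear(tau_t, X_new)
```

Since `g` is small in this setup, the combined estimate leans mostly on the model trained on the treated group's imputed effects, which in turn relies on the control-group outcome model fitted on the larger sample.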

Intuition behind the X-Learner

We study a simulated example where we know the uplift is exactly $\tau=1$.


X-Learner Step 1 (same as T-Learner):

Model fit for control (red) and treatment (blue) groups.


T-Learner Estimation:

  • The solid line represents the difference between the model fits for the treatment and control groups.
  • The estimate is poor because the treatment group is very small.


Imputed Treatment Effects:

$$
\begin{aligned}
\tilde{D}^{T} &= Y^{T} - \hat{\mu}_{C}(X^{T}) \\
\tilde{D}^{C} &= \hat{\mu}_{T}(X^{C}) - Y^{C}
\end{aligned}
$$


X-Learner Estimation:

  • The dashed line represents the X-Learner estimation.
  • It combines the fits from the imputed effects by using an estimator of the propensity score, i.e. $g(x)=\hat{e}(x)$. In this example $\hat{e}(x)$ will be small as we have many more observations in the control group. Hence the estimated uplift will be close to $\hat{\tau}_{T}$.


Some Python Implementations




Python Code: Example 🐍

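from causalml.inference.meta import BaseTClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

# x: feature matrix, w: binary treatment indicator, y: binary outcome
# (all three are assumed to be defined beforehand)

# define ml model
learner = HistGradientBoostingClassifier()

# set meta-model
t_learner = BaseTClassifier(learner=learner)

# estimate the average treatment effect (ATE);
# causalml returns the point estimate first, then the lower and upper bounds
t_ate, t_ate_lwr, t_ate_upr = t_learner.estimate_ate(X=x, treatment=w, y=y)

# predict treatment effects
t_ite = t_learner.predict(X=x)

# access ml models (dictionaries keyed by treatment group)
t_learner.models_c
t_learner.models_t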

Uplift Model Evaluation? 🤨



Uplift Evaluation: Uplift by Percentile

  1. Sort the uplift predictions in decreasing order.
  2. Split the sorted observations into percentile bins.
  3. Compute the average outcome (response rate) of the treated and of the control observations in each bin.
  4. Take the difference between those two averages for each bin.
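The steps above can be sketched with NumPy alone. This is a minimal illustration, not the implementation used in the talk: the function name `uplift_by_bin`, the choice of ten equal-sized bins, and the simulated data (where the score is proportional to the true uplift) are all assumptions.

```python
import numpy as np


def uplift_by_bin(y, w, uplift_score, n_bins=10):
    """Treated-minus-control response rate per bin, bins ordered by decreasing score.

    A bin that lacks treated or control observations yields NaN.
    """
    order = np.argsort(-uplift_score, kind="stable")      # step 1: sort
    y_sorted, w_sorted = y[order], w[order]
    uplift_per_bin = []
    for idx in np.array_split(np.arange(len(y)), n_bins):  # step 2: bins
        y_bin, w_bin = y_sorted[idx], w_sorted[idx]
        rate_t = y_bin[w_bin == 1].mean()                  # step 3: averages
        rate_c = y_bin[w_bin == 0].mean()
        uplift_per_bin.append(rate_t - rate_c)             # step 4: difference
    return np.array(uplift_per_bin)


# Simulated data (assumption): the true uplift is 0.5 * score.
rng = np.random.default_rng(seed=4)
n = 20_000
score = rng.random(n)
w = rng.integers(0, 2, size=n)
y = (rng.random(n) < 0.2 + 0.5 * score * w).astype(int)

uplift_bins = uplift_by_bin(y, w, score)
```

Because the score ranks observations by their true uplift in this simulation, the first entries of `uplift_bins` should be clearly larger than the last ones.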


Uplift Evaluation: Uplift by Percentile

A well-performing model has large values in the first percentiles and decreasing values in the later ones.

Uplift Evaluation: Cumulative Gain Chart

Predict uplift for both treated and control observations and compute the average prediction per decile (bins) in both groups. Then, the difference between those averages is taken for each decile.

$$\left( \frac{Y^{T}}{N^{T}} - \frac{Y^{C}}{N^{C}} \right) (N^{T} + N^{C})$$

  • $Y^{T}$ / $Y^{C}$: sum of the treated / control individual outcomes in the bin.
  • $N^{T}$ / $N^{C}$: number of treated / control observations in the bin.

How to compute it? 🫠

import pandas as pd

# df is expected to have one row per bin, sorted by decreasing predicted uplift


def compute_response_absolutes(df: pd.DataFrame) -> pd.DataFrame:
    # absolute number of responders per bin in each group
    df["responses_treatment"] = df["n_treatment"] * df["response_rate_treatment"]
    df["responses_control"] = df["n_control"] * df["response_rate_control"]
    return df


def compute_cumulative_response_rates(df: pd.DataFrame) -> pd.DataFrame:
    # cumulative group sizes and responders, accumulated from the top bin down
    df["n_treatment_cumsum"] = df["n_treatment"].cumsum()
    df["n_control_cumsum"] = df["n_control"].cumsum()
    df["responses_treatment_cumsum"] = df["responses_treatment"].cumsum()
    df["responses_control_cumsum"] = df["responses_control"].cumsum()
    # cumulative response rate per group
    df["response_rate_treatment_cumsum"] = df["responses_treatment_cumsum"] / df["n_treatment_cumsum"]
    df["response_rate_control_cumsum"] = df["responses_control_cumsum"] / df["n_control_cumsum"]
    return df


def compute_cumulative_gain(df: pd.DataFrame) -> pd.DataFrame:
    # cumulative uplift scaled by the cumulative number of observations
    df["uplift_cumsum"] = df["response_rate_treatment_cumsum"] - df["response_rate_control_cumsum"]
    df["cum_gain"] = df["uplift_cumsum"] * (df["n_treatment_cumsum"] + df["n_control_cumsum"])
    return df
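To make the arithmetic concrete, here is a small worked example of the same cumulative-gain computation on three bins, written with plain NumPy arrays. The bin sizes and response rates are made up for illustration.

```python
import numpy as np

# Hypothetical per-bin summary, bins sorted by decreasing predicted uplift.
n_treatment = np.array([100, 100, 100])
n_control = np.array([100, 100, 100])
response_rate_treatment = np.array([0.5, 0.3, 0.2])
response_rate_control = np.array([0.2, 0.2, 0.2])

# Absolute responders per bin, accumulated from the top bin down.
responses_treatment_cumsum = np.cumsum(n_treatment * response_rate_treatment)
responses_control_cumsum = np.cumsum(n_control * response_rate_control)
n_treatment_cumsum = np.cumsum(n_treatment)
n_control_cumsum = np.cumsum(n_control)

# Cumulative uplift, then scaled by the cumulative population size.
uplift_cumsum = (
    responses_treatment_cumsum / n_treatment_cumsum
    - responses_control_cumsum / n_control_cumsum
)
cum_gain = uplift_cumsum * (n_treatment_cumsum + n_control_cumsum)
print(np.round(cum_gain, 2))
```

The cumulative gain comes out as 60, 80 and 80: it stops growing after the second bin, because the third bin has the same response rate in both groups and therefore adds no incremental responders.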


Uplift Evaluation: Cumulative Gain Chart


  • We can assess whether the treatment has a global positive or negative effect and if one can expect a better gain by targeting part of the population.
  • We can thus choose the decile that maximizes the gain as the limit of the population to be targeted.

Uplift Metrics: Uplift Curve

We can generalize the cumulative gain chart for each observation of the test set:

$$f(t) = \left( \frac{Y^{T}_{t}}{N^{T}_{t}} - \frac{Y^{C}_{t}}{N^{C}_{t}} \right) (N^{T}_{t} + N^{C}_{t})$$

where the $t$ subscript indicates that the quantity is calculated for the first $t$ observations, sorted by inferred uplift value.
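A direct NumPy transcription of $f(t)$ might look as follows. It is a sketch under one stated convention: while a group is still empty among the first $t$ observations, its response rate is taken to be 0 (an assumption made here to avoid dividing by zero). The function name `uplift_curve_values` is hypothetical.

```python
import numpy as np


def uplift_curve_values(y, w, uplift_score):
    """Values f(1), ..., f(n) of the uplift curve for binary treatment flags w."""
    order = np.argsort(-np.asarray(uplift_score), kind="stable")
    y_sorted = np.asarray(y, dtype=float)[order]
    w_sorted = np.asarray(w, dtype=float)[order]

    # Cumulative counts and outcome sums over the first t observations.
    n_t = np.cumsum(w_sorted)
    n_c = np.cumsum(1.0 - w_sorted)
    y_t = np.cumsum(y_sorted * w_sorted)
    y_c = np.cumsum(y_sorted * (1.0 - w_sorted))

    # Response rates; set to 0 while the group is still empty (convention).
    rate_t = np.divide(y_t, n_t, out=np.zeros_like(y_t), where=n_t > 0)
    rate_c = np.divide(y_c, n_c, out=np.zeros_like(y_c), where=n_c > 0)
    return (rate_t - rate_c) * (n_t + n_c)


# Tiny example: treated units respond, control units do not.
y = np.array([1, 0, 1, 0])
w = np.array([1, 0, 1, 0])
score = np.array([4.0, 3.0, 2.0, 1.0])
curve = uplift_curve_values(y, w, score)
```

In this toy case the treated response rate is 1 and the control rate is 0 at every step, so the curve is simply the number of observations seen so far: 1, 2, 3, 4.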

Uplift Metrics: Uplift Curve & AUC


Best uplift model? 🤓

A perfect model assigns higher scores to all treated individuals with positive outcomes than any individuals with negative outcomes.

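import numpy as np

# y_true and treatment are assumed to be binary NumPy arrays of equal length

# Control Responders
cr_num = np.sum((y_true == 1) & (treatment == 0))
# Treated Non-Responders
tn_num = np.sum((y_true == 0) & (treatment == 1))

# the summand orders observations within the (y_true == treatment) groups
summand = y_true if cr_num > tn_num else treatment

perfect_uplift = 2 * (y_true == treatment) + summand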

Perfect Uplift Curve

from sklift.metrics import uplift_curve

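# x-values (number of observations) and uplift curve values for the perfect ranking
x_perfect, y_perfect = uplift_curve(y_true=y_true, uplift=perfect_uplift, treatment=treatment)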


Random Uplift Curves

# For example:

    size=(n, n_samples),

Model Comparison

Compute AUC on a test set.
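As a rough illustration of the comparison, the area under an uplift curve can be approximated with the trapezoidal rule. The two curves below are made-up values for two hypothetical models evaluated on the same test set, and `area_under_curve` is a helper name introduced for this sketch.

```python
import numpy as np


def area_under_curve(x, f):
    """Trapezoidal-rule area under the points (x, f), with x increasing."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    return float(np.sum(np.diff(x) * (f[1:] + f[:-1]) / 2.0))


# Number of targeted observations and hypothetical uplift curve values.
n_obs = np.array([0, 1, 2, 3, 4])
curve_model_a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
curve_model_b = np.array([0.0, 0.5, 1.0, 2.0, 4.0])

auc_a = area_under_curve(n_obs, curve_model_a)
auc_b = area_under_curve(n_obs, curve_model_b)
```

Both curves end at the same point, as they would when computed on the same test set, yet model A accumulates uplift earlier and gets the larger area (8.0 versus 5.5).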

Other metrics: Qini Curve

$$g(t) = Y^{T}_{t} - Y^{C}_{t} \left( \frac{N^{T}_{t}}{N^{C}_{t}} \right)$$

Corrects the uplift of the selected individuals for the different sizes of the treatment and control groups using the factor $N^{T}_{t} / N^{C}_{t}$.
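The Qini formula can be transcribed in the same style. This is a sketch under one assumed convention: the correction term is set to 0 while no control observation has been seen among the first $t$ observations. The function name `qini_curve_values` is hypothetical.

```python
import numpy as np


def qini_curve_values(y, w, uplift_score):
    """Values g(1), ..., g(n) of the Qini curve for binary treatment flags w."""
    order = np.argsort(-np.asarray(uplift_score), kind="stable")
    y_sorted = np.asarray(y, dtype=float)[order]
    w_sorted = np.asarray(w, dtype=float)[order]

    # Cumulative counts and outcome sums over the first t observations.
    n_t = np.cumsum(w_sorted)
    n_c = np.cumsum(1.0 - w_sorted)
    y_t = np.cumsum(y_sorted * w_sorted)
    y_c = np.cumsum(y_sorted * (1.0 - w_sorted))

    # Size ratio N_t^T / N_t^C; 0 while the control group is empty (convention).
    ratio = np.divide(n_t, n_c, out=np.zeros_like(n_t), where=n_c > 0)
    return y_t - y_c * ratio


# Tiny example with responders in both groups.
y = np.array([1, 1, 0, 1])
w = np.array([1, 0, 1, 0])
score = np.array([4.0, 3.0, 2.0, 1.0])
qini = qini_curve_values(y, w, score)
```

For this toy input the curve is 1, 0, -1, -1: the single treated responder is offset once control responders enter the selection and are rescaled by the group-size ratio.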

Demo 💻

Notebook Link


References 📚

Thank you!

More Info: