
Experimentation, Non-Compliance and Instrumental Variables with PyMC

In this notebook we present an example of how to use PyMC to estimate the effect of a treatment in an experiment where there is non-compliance through the use of instrumental variables.

By non-compliance we mean that the treatment assignment does not guarantee that the treatment is actually received by the treated. The main challenge is that we cannot simply estimate the treatment effect as a difference in means, since the non-compliance mechanism is most of the time not at random and may introduce confounders. For a more detailed discussion of the problem see Chapter 9 in the (amazing!) book Causal Inference for The Brave and True by Matheus Facure Alves. One very common tool to tackle such a problem is instrumental variables (see Chapter 8 for a nice introduction). In essence, we can use the variant assignment as an instrument to control for confounders introduced by the non-compliance mechanism.

To illustrate the concepts we are going to use the example worked out in Matheus’ book (for a simulated example see here): an experiment to test a push notification aimed at increasing user engagement, as measured by in-app purchases.

You select 10000 random customers and, for each of them, you assign the push with 50% probability. However, when you execute the test, you notice that some customers who were assigned to receive the push are not receiving it. When you talk to the engineers, they say that it is because they probably have an older phone that doesn’t support the kind of push the marketing team designed.

Ok! So why bother to present the same example? Well, we are going to do it the Bayesian way 😄 ! The main motivation is to include information from prior experiments to get a more precise and informed estimate, as described in the great blog post Bayesian Instrumental variables (with a hierarchical prior) in Stan by Jim Savage. For a great introduction to Bayesian instrumental variables see the article Bayesian instrumental variables: priors and likelihoods by Hedibert F. Lopes and Nicholas G. Polson. In the first part of this notebook we replicate the example using PyMC, and in the second part we extend it to include a hierarchical prior that encodes data from previous experiments (if you are a Stan user I strongly recommend reading Jim’s blog in parallel).

Prepare Notebook

import os
import arviz as az
import graphviz as gr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc.sampling_jax
import pytensor.tensor as pt
import seaborn as sns

from linearmodels.iv import IV2SLS

plt.style.use("bmh")
plt.rcParams["figure.figsize"] = [12, 7]
plt.rcParams["figure.dpi"] = 100
plt.rcParams["figure.facecolor"] = "white"

%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = "retina"
seed: int = sum(map(ord, "instrumental_variables"))
rng: np.random.Generator = np.random.default_rng(seed=seed)

Read Data

We read the app_engagement_push.csv data directly from the repository.

root_path = "https://raw.githubusercontent.com/matheusfacure/python-causality-handbook/master/causal-inference-for-the-brave-and-true/data/"
data_path = os.path.join(root_path, "app_engagement_push.csv")

df = pd.read_csv(data_path)

n = df.shape[0]

df.head()
in_app_purchase push_assigned push_delivered
0 47 1 1
1 43 1 0
2 51 1 1
3 49 0 0
4 79 0 0

Here we have data regarding in-app purchases from an experiment. The columns push_assigned and push_delivered indicate whether the user was assigned to the treatment group and whether the user actually received the push notification.

EDA

Let’s do an initial EDA to get a sense of the data. First, we compare the in_app_purchase distribution between the two groups.

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(9, 7), sharex=True, sharey=True, layout="constrained"
)
sns.histplot(x="in_app_purchase", hue="push_assigned", kde=True, data=df, ax=ax[0])
sns.histplot(x="in_app_purchase", hue="push_delivered", kde=True, data=df, ax=ax[1])
fig.suptitle("Histogram of in-app purchases", fontsize=16);

We clearly see a difference! When we split by push_delivered, users that received the push notification show higher values of in_app_purchase than what we see when splitting by push_assigned. This would look like a quick win for the experiment! However, we should not overlook the non-compliance mechanism: it could hide confounders that are not controlled for in the experiment. Let’s look into how big the non-compliance is.

fig, ax = plt.subplots(figsize=(10, 6))
df.query("push_assigned == 1")["push_delivered"].value_counts().plot(kind="bar", ax=ax)
ax.set(title="Push Delivered Count Among Push Assigned", xlabel="Push delivered", ylabel="Count");

We see that roughly \(30\%\) of users who were assigned to the push did not get it. What could be the reason? Well, Matheus suggests this could be related to the fact that users with old phones are less likely to receive the push notification. This naturally induces a bias if we analyze the effect of the push notification on purchases by filtering for only the users that got it, as these users are more likely to have higher income (newer phones) than the control group. Think about it in the context of the potential outcomes 😉.
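
To put a number on the non-compliance, here is a quick check (a sketch, reusing the df loaded above) of the delivery shares among the assigned users:

# share of assigned users that did / did not receive the push
delivery_share = (
    df.query("push_assigned == 1")["push_delivered"]
    .value_counts(normalize=True)
    .sort_index()
)
print(delivery_share)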

Here is the DAG we are going to use to model the problem.

g = gr.Digraph()

g.edge("push assigned", "push delivered")
g.edge("push delivered", "in app purchase")
g.edge("income", "in app purchase")
g.edge("income", "push delivered")
g.node("income", color="blue")
g

The unobserved income confounder opens a backdoor path (it is a fork!) which we cannot close directly.

OLS Approach

Let’s take the last plot as a starting point for a more quantitative analysis. We are going to run an OLS regression to estimate the effect of the push notification on purchases.

Push Assigned Group

We start with the assignment group. First, we simply compute some statistics.

df.groupby("push_assigned").agg({"in_app_purchase": ["mean", "std"]})
in_app_purchase
mean std
push_assigned
0 69.291675 25.771624
1 71.655270 26.310248

The average difference seems relatively small. We can quantify it through an OLS regression.

ols_formula = "in_app_purchase ~ 1 + push_assigned"
ols = IV2SLS.from_formula(formula=ols_formula, data=df).fit()
ols.summary.tables[1]
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
Intercept 69.292 0.3624 191.22 0.0000 68.581 70.002
push_assigned 2.3636 0.5209 4.5376 0.0000 1.3427 3.3845

The estimated effect is \(2.36\) with a \(95\%\) confidence interval of \([1.34, 3.38]\).
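
Since push_assigned is binary and the regression only has an intercept, this coefficient is just the difference of the group means computed above; a one-line check (a sketch, reusing df):

# difference in means reproduces the OLS coefficient on a binary regressor
group_means = df.groupby("push_assigned")["in_app_purchase"].mean()
print(group_means.loc[1] - group_means.loc[0])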

Push Received Group

We run a similar analysis for users that received the push notification.

df.groupby("push_delivered").agg({"in_app_purchase": ["mean", "std"]})
in_app_purchase
mean std
push_delivered
0 65.518519 25.111324
1 79.449958 25.358723
ols_formula = "in_app_purchase ~ 1 + push_delivered"
ols = IV2SLS.from_formula(formula=ols_formula, data=df).fit()
ols.summary.tables[1]
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
Intercept 65.519 0.3126 209.61 0.0000 64.906 66.131
push_delivered 13.931 0.5282 26.377 0.0000 12.896 14.967

The estimated effect is much higher (as expected)!

We could even try to control for the assignment group by including it as a covariate in the regression.

ols_formula = "in_app_purchase ~ 1 + push_assigned + push_delivered"
ols = IV2SLS.from_formula(formula=ols_formula, data=df).fit()
ols.summary.tables[1]
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
Intercept 69.292 0.3624 191.22 0.0000 68.581 70.002
push_assigned -17.441 0.5702 -30.590 0.0000 -18.559 -16.324
push_delivered 27.600 0.6124 45.069 0.0000 26.399 28.800

In this case we get a much higher effect 😒. From the book:

We have reasons to believe this is a biased estimate. We know that older phones are having trouble in receiving the push, so, probably, richer customers, with newer phones, are the compliers.

In summary, the problem is the unobserved confounders that are correlated with the received treatment and the outcome. We can use instrumental variables to get better estimates!

Instrumental Variables Estimation

As described in Chapter 9, we can use the push assignment as an instrument to control for confounders. Why is it an instrument? Well, note that the push assignment has an effect both on getting a push notification and on in-app purchases. The key point is that its effect on in-app purchases flows only through the push-delivered path, and therefore the effect of the assignment is easy to estimate as there are no confounders (e.g. income does not have an impact on the push assignment since it was done at random). A common way to use instruments to estimate the effect is via a 2-stage (least squares) regression. The linearmodels package has a convenient implementation of this approach.

iv_formula = "in_app_purchase ~ 1 + [push_delivered ~ push_assigned]"
iv = IV2SLS.from_formula(formula=iv_formula, data=df).fit()
iv.summary.tables[1]
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
Intercept 69.292 0.3624 191.22 0.0000 68.581 70.002
push_delivered 3.2938 0.7165 4.5974 0.0000 1.8896 4.6981

The estimated effect is \(3.29\). How should we interpret this effect? (again from the book):

We also need to remember about Local Average Treatment Effect (LATE): \(3.29\) is the average causal effect on compliers. Unfortunately, we can’t say anything about those never takers. This means that we are estimating the effect on the richer segment of the population that have newer phones.
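
With a single binary instrument and a binary treatment (and no extra covariates), the 2SLS estimate coincides with the Wald estimator: the effect of assignment on purchases divided by the effect of assignment on delivery. A minimal sketch, reusing the df loaded above, that reproduces the \(3.29\) by hand:

# Wald / 2SLS by hand: intention-to-treat effect scaled by the first stage
itt = (
    df.query("push_assigned == 1")["in_app_purchase"].mean()
    - df.query("push_assigned == 0")["in_app_purchase"].mean()
)
first_stage = (
    df.query("push_assigned == 1")["push_delivered"].mean()
    - df.query("push_assigned == 0")["push_delivered"].mean()
)
print(itt / first_stage)  # should be close to the IV2SLS estimate above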

Bayesian Instrumental Variables Estimation

In this section we describe how to replicate the analysis using PyMC. Again, the motivation is to extend it in a later section to add prior knowledge from previous experiments. The main references for the sections below are:

  • Bayesian Instrumental variables (with a hierarchical prior) in Stan by Jim Savage.
  • Bayesian instrumental variables: priors and likelihoods by Hedibert F. Lopes and Nicholas G. Polson.

Recall that the main challenge of the problem at hand is that the errors of the OLS approach are not independent of the treatment. What if we model this correlation explicitly while fitting the model? This is the idea behind the Bayesian instrumental variables approach. Concretely, let’s consider the structural formulation (instead of the 2-stage form). This is nothing other than a regression model with a \(2\)-dimensional outcome:

  • We model the in_app_purchase (\(y\)) as a function of push_delivered (treatment) \(t\) and the error term \(\varepsilon_y\).
  • Simultaneously, we model the treatment as a function of push_assigned (instrument) \(z\) and an error term \(\varepsilon_t\).

We assume that the joint distribution of the errors can be parametrized by a multivariate normal distribution with mean zero and covariance matrix \(\Sigma\).

\[\begin{align*} \left( \begin{array}{c} y \\ t \end{array} \right) & \sim \text{MultiNormal}(\mu, \Sigma) \\ \mu & = \left( \begin{array}{c} \mu_{y} \\ \mu_{t} \end{array} \right) = \left( \begin{array}{c} \beta_{t}t + a_{y} \\ \beta_{z}z + a_{t} \end{array} \right) \end{align*}\]

We choose fairly uninformative priors (while still constraining them to plausible values) for the parameters. We use the LKJCholeskyCov distribution to model the covariance matrix.

# Prepare data
y = df["in_app_purchase"].to_numpy()
t = df["push_delivered"].to_numpy()
z = df["push_assigned"].to_numpy()
with pm.Model() as model:
    # --- Priors ---
    intercept_y = pm.Normal(name="intercept_y", mu=50, sigma=10)
    intercept_t = pm.Normal(name="intercept_t", mu=0, sigma=1)
    beta_t = pm.Normal(name="beta_t", mu=0, sigma=10)
    beta_z = pm.Normal(name="beta_z", mu=0, sigma=10)
    sd_dist = pm.HalfCauchy.dist(beta=2, shape=2)
    chol, corr, sigmas = pm.LKJCholeskyCov(name="chol_cov", eta=2, n=2, sd_dist=sd_dist)
    # compute and store the covariance matrix
    pm.Deterministic(name="cov", var=pt.dot(l=chol, r=chol.T))

    # --- Parameterization ---
    mu_y = pm.Deterministic(name="mu_y", var=beta_t * t + intercept_y)
    mu_t = pm.Deterministic(name="mu_t", var=beta_z * z + intercept_t)
    mu = pm.Deterministic(name="mu", var=pt.stack(tensors=(mu_y, mu_t), axis=1))

    # --- Likelihood ---
    likelihood = pm.MvNormal(
        name="likelihood",
        mu=mu,
        chol=chol,
        observed=np.stack(arrays=(y, t), axis=1),
        shape=(n, 2),
    )

pm.model_to_graphviz(model=model)

Now we fit the model.

with model:
    idata = pm.sampling_jax.sample_numpyro_nuts(draws=4_000, chains=4, random_seed=rng)

Let’s inspect the posterior distributions of the parameters.

var_names = ["beta_t", "beta_z", "intercept_y", "intercept_t"]

az.summary(data=idata, var_names=var_names)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta_t 3.340 0.722 1.975 4.682 0.009 0.006 6493.0 9732.0 1.0
beta_z 0.718 0.006 0.705 0.729 0.000 0.000 10286.0 11549.0 1.0
intercept_y 69.262 0.363 68.590 69.939 0.004 0.003 7635.0 11097.0 1.0
intercept_t -0.000 0.004 -0.009 0.008 0.000 0.000 11250.0 12108.0 1.0

Note that we have recovered a result very similar to the 2-stage approach above 🚀 ! Next we look into the traces.

axes = az.plot_trace(
    data=idata,
    var_names=var_names,
    compact=True,
    kind="rank_bars",
    backend_kwargs={"figsize": (10, 7), "layout": "constrained"},
)
plt.gcf().suptitle("IV Model - Trace", fontsize=16);

Everything looks good! Let’s look deeper into the posterior distribution of the effect estimate \(\beta_{t}\). In particular, we can compare the \(94\%\) HDI of the posterior distribution with the \(94\%\) confidence interval of the 2-stage approach.

fig, ax = plt.subplots(figsize=(8, 6))
hdi_prob = 0.94
az.plot_posterior(
    data=idata,
    var_names=["beta_t"],
    ref_val=iv.params["push_delivered"],
    hdi_prob=hdi_prob,
    ax=ax,
)
ax.axvline(
    x=iv.conf_int(level=hdi_prob).loc["push_delivered", "lower"],
    color="C1",
    ls="--",
    lw=1,
    label=f"{hdi_prob: .0%} CI (lower)",
)
ax.axvline(
    x=iv.conf_int(level=hdi_prob).loc["push_delivered", "upper"],
    color="C1",
    ls="--",
    lw=1,
    label=f"{hdi_prob: .0%} CI (upper)",
)
ax.legend()
ax.set(title="IV Model - Posterior Distribution Effect");

The results are very similar!

In addition, we can also look at the posterior correlation between the two error terms, which captures the correlation between the treatment and the outcome errors that biases the OLS approach.

fig, ax = plt.subplots(figsize=(8, 6))
sns.histplot(
    x=az.extract(data=idata, var_names=["chol_cov_corr"])[0, 1, :],
    kde=True,
    color="C2",
    ax=ax,
)
ax.set(title="IV Model - Posterior Distribution Correlation");

The correlation is around \(0.3\), which is definitely not negligible. Finally, we can look into the posterior mean of the covariance matrix.

cov_mean = idata.posterior["cov"].mean(dim=("chain", "draw"))

cov_samples = rng.multivariate_normal(
    mean=np.zeros(shape=(2)), cov=cov_mean, size=5_000
)

g = sns.jointplot(
    x=cov_samples[:, 0],
    y=cov_samples[:, 1],
    kind="kde",
    color="C2",
    fill=True,
    height=6,
)
g.fig.suptitle("IV Model - Covariance Samples", fontsize=16, y=1.05);

All of these results are consistent with the discussion regarding the positive bias when filtering for users who received the push notification, as a result of unobserved confounders like income.

Hierarchical Model: Adding past experiments as priors

Now, we look into a very interesting extension. Imagine we have access to results from previous experiments (not uncommon in the tech industry), in the form of estimated effects \(b_j\) and the corresponding standard errors \(\sigma_{j}\). We would like to pass this information to the model through the prior distributions. One approach, suggested in the great post Bayesian Instrumental variables (with a hierarchical prior) in Stan by Jim Savage, is to use a hierarchical model to include these previous experiments. The idea is to have a “hyper-prior” distribution \(\text{Normal}(\bar{\beta}_{t}, \bar{\sigma}_{t})\) so that

\[\begin{align*} \beta_j & \sim \text{Normal}(\bar{\beta}_{t}, \bar{\sigma}_{t}) \\ b_j & \sim \text{Normal}(\beta_{j}, \sigma_{j}) \end{align*}\]

and the current treatment effect gets the prior \(\beta_t \sim \text{Normal}(\bar{\beta}_{t}, \bar{\sigma}_{t})\). The PyMC implementation is fairly similar to the base case. One thing to consider in the hierarchical model is that we need to use a non-centered parametrization to make the sampling more efficient (see the example A Primer on Bayesian Methods for Multilevel Modeling for implementation details).

# previous experiment data
b_j = np.array([3.4, 3.6, 4.25, 4.0, 3.9])
se_j = np.array([0.5, 0.55, 0.6, 0.45, 0.44])
with pm.Model() as hierarchical_model:
    # --- Priors ---
    beta_t_bar = pm.Normal(name="beta_t_bar", mu=4, sigma=2)
    sigma_t_bar = pm.HalfCauchy(name="sigma_t_bar", beta=2)

    intercept_y = pm.Normal(name="intercept_y", mu=50, sigma=10)
    intercept_t = pm.Normal(name="intercept_t", mu=0, sigma=1)

    beta_z = pm.Normal(name="beta_z", mu=0, sigma=10)
    sd_dist = pm.HalfCauchy.dist(beta=2, shape=2)
    chol, corr, sigmas = pm.LKJCholeskyCov(name="chol_cov", eta=2, n=2, sd_dist=sd_dist)
    # compute and store the covariance matrix
    pm.Deterministic(name="cov", var=pt.dot(l=chol, r=chol.T))

    # --- Parameterization ---
    # previous experiment model
    # set standard normals to use in the non-centered parameterization
    w_j = pm.Normal(name="w_j", mu=0, sigma=1, shape=b_j.size)
    beta_j = pm.Deterministic(name="beta_j", var=beta_t_bar + sigma_t_bar * w_j)
    pm.Normal(name="beta_j_observed", mu=beta_j, sigma=se_j, observed=b_j)
    
    # current experiment model
    # set standard normals to use in the non-centered parameterization
    w = pm.Normal(name="w", mu=0, sigma=1)
    beta_t = pm.Deterministic(name="beta_t", var=beta_t_bar + sigma_t_bar * w)
    
    mu_y = pm.Deterministic(name="mu_y", var=beta_t * t + intercept_y)
    mu_t = pm.Deterministic(name="mu_t", var=beta_z * z + intercept_t)
    mu = pm.Deterministic(name="mu", var=pt.stack(tensors=(mu_y, mu_t), axis=1))
    
    # --- Likelihood ---
    likelihood = pm.MvNormal(
        name="likelihood",
        mu=mu,
        chol=chol,
        observed=np.stack(arrays=(y, t), axis=1),
        shape=(n, 2),
    )

pm.model_to_graphviz(model=hierarchical_model)

We sample from the model (it takes a bit longer than the previous one).

with hierarchical_model:
    hierarchical_idata = pm.sampling_jax.sample_numpyro_nuts(
        target_accept=0.95, draws=4_000, chains=4, random_seed=rng
    )
# verify we do not have divergences
print(f"Divergences = {hierarchical_idata.sample_stats.diverging.sum().item()}")
Divergences = 0

We now look into the posterior distributions of the parameters.

var_names = [
    "beta_t_bar",
    "sigma_t_bar",
    "beta_t",
    "beta_z",
    "intercept_y",
    "intercept_t",
]

az.summary(data=hierarchical_idata, var_names=var_names)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta_t_bar 3.776 0.262 3.278 4.275 0.003 0.002 8943.0 7629.0 1.0
sigma_t_bar 0.302 0.271 0.000 0.771 0.003 0.002 6148.0 6112.0 1.0
beta_t 3.709 0.363 2.990 4.398 0.004 0.003 10025.0 9744.0 1.0
beta_z 0.719 0.006 0.707 0.730 0.000 0.000 14523.0 12656.0 1.0
intercept_y 69.132 0.287 68.581 69.671 0.003 0.002 12511.0 10777.0 1.0
intercept_t -0.001 0.004 -0.009 0.007 0.000 0.000 14341.0 12822.0 1.0
axes = az.plot_trace(
    data=hierarchical_idata,
    var_names=var_names,
    compact=True,
    kind="rank_bars",
    backend_kwargs={"figsize": (10, 9), "layout": "constrained"},
)
plt.gcf().suptitle("IV Model - Trace", fontsize=16);

Note that the estimated effect \(\beta_t\) is now around \(3.7\), which is higher than in the previous case, as the model is using the prior information from the previous experiments. Let’s zoom into the posterior distribution of the effect.

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(9, 7), sharex=True, sharey=False, layout="constrained"
)
hdi_prob = 0.94

az.plot_posterior(
    data=idata,
    var_names=["beta_t"],
    ref_val=iv.params["push_delivered"],
    hdi_prob=hdi_prob,
    ax=ax[0],
)
ax[0].axvline(
    x=iv.conf_int(level=hdi_prob).loc["push_delivered", "lower"],
    color="C1",
    ls="--",
    lw=1,
    label=f"{hdi_prob: .0%} CI (lower)",
)
ax[0].axvline(
    x=iv.conf_int(level=hdi_prob).loc["push_delivered", "upper"],
    color="C1",
    ls="--",
    lw=1,
    label=f"{hdi_prob: .0%} CI (upper)",
)
ax[0].legend()
ax[0].set(title="IV Model - Posterior Distribution Effect")

az.plot_posterior(
    data=hierarchical_idata,
    var_names=["beta_t"],
    hdi_prob=hdi_prob,
    ax=ax[1],

)

for j, (b, se) in enumerate(zip(b_j, se_j)):
    ax[1].axvline(x=b, color=f"C{j + 1}", ls="--", lw=1, label=f"b_{j}")

ax[1].legend(title="previous experiments", loc="upper left")
ax[1].set(title="IV Hierarchical Model - Posterior Distribution Effect");

Note that the variance of the estimate is lower than in the base case. This is one of the advantages of using the hierarchical model to pass prior information from previous experiments, since we are also providing the model with information about the expected uncertainty.
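
To make this concrete, here is a small sketch comparing the posterior standard deviation of \(\beta_t\) under both models (using the idata and hierarchical_idata objects from above):

# posterior spread of the treatment effect: base IV model vs hierarchical model
print(f"base IV model:      {idata.posterior['beta_t'].std().item():.3f}")
print(f"hierarchical model: {hierarchical_idata.posterior['beta_t'].std().item():.3f}")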

Finally, we can compare the posterior distribution of the effect with the hyper-prior distribution.

ax, *_ = az.plot_forest(
    data=hierarchical_idata,
    var_names=["beta_t_bar", "beta_t"],
    combined=True,
)

ax.axvline(x=iv.params["push_delivered"], color="C0", ls="--", label="IV estimate")

for j, (b, se) in enumerate(zip(b_j, se_j)):
    ax.axvline(x=b, color=f"C{j + 1}", ls="--", lw=1, label=f"b_{j}")

ax.legend(loc="center left",  bbox_to_anchor=(1, 0.5));

As expected, the posterior distribution of the effect is closer to the base case 🙂.