PyCon DE & PyData 2026
\[ \text{ATE} = \mathbb{E}[Y(1) - Y(0)] \]
The Fundamental Problem of Causal Inference
We can never observe both \(Y_i(1)\) and \(Y_i(0)\) for the same unit. One potential outcome is always missing (counterfactual).
\[ \text{Observed: } Y_i = T_i \cdot Y_i(1) + (1 - T_i) \cdot Y_i(0) \]
This is why naive comparisons (treated vs. untreated) can be biased: the groups may differ systematically due to confounders.
Bayes’ theorem: Update beliefs about parameters \(\theta\) after observing data \(y\):
\[ \underbrace{p(\theta \mid y)}_{\text{posterior}} \propto \underbrace{p(y \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}} \]
The PPL bridge
In a generative graphical model, every distribution starts as a prior over the data-generating process. Once we condition on observed data, those priors become likelihoods. A causal model and a Bayesian model are the same object, viewed before and after observing data.
Model = DAG: Writing the generative process forces you to explicitly state your causal assumptions: this alone is valuable!
Uncertainty for free: Posterior distributions over causal effects.
Prior Modeling: Setting priors to reflect our domain knowledge.
Flexible modeling: Swap likelihoods (Normal \(\to\) Gamma), add hierarchical structure (Mundlak), or latent variables for unobserved confounders (CEVAE).
do-operator: Intervene on the generative model for counterfactual reasoning.
Fast inference: Modern MCMC (NUTS) and SVI make posterior computation scalable and reliable.
This does not imply that Bayesian methods are better than frequentist methods! For every framework, you need to know the assumptions and the limitations.
PPLs do not solve identification! They make estimation under causal assumptions.
A/B testing is the gold standard of causal inference: random assignment makes treatment independent of all confounders, observed and unobserved. The hard part shifts from identification to estimation under uncertainty.
with pm.Model() as non_informative_model:
conversion_rate_control = pm.Uniform(
"conversion_rate_control", lower=0, upper=1
)
conversion_rate_treatment = pm.Uniform(
"conversion_rate_treatment", lower=0, upper=1
)
relative_lift = pm.Deterministic(
"relative_lift",
conversion_rate_treatment / conversion_rate_control - 1,
)\[\begin{align*} \text{cr}_\text{c} & \sim \text{Beta}(1, 1) \\ \text{cr}_\text{t} & \sim \text{Beta}(1, 1) \\ \text{lift} & = \text{cr}_\text{t} / \text{cr}_\text{c} - 1 \\ N_{\text{c}} & \sim \text{Binomial}(N, \text{cr}_\text{c}) \\ N_{\text{t}} & \sim \text{Binomial}(N, \text{cr}_\text{t}) \\ \end{align*}\]
Uninformative Priors yield implausible values for the relative lift.
Setting uninformative priors for the conversion rates yields implausible values for the relative lifts as this is a ratio of two random variables, so it can blow up to infinity or zero.




Informative Priors
Even if we set informative priors for the conversion rates, the relative lift range is still unreasonably wide.

with pm.Model() as correlated_model:
conversion_rate_control = pm.Beta(
"conversion_rate_control", alpha=15, beta=600
)
relative_lift = pm.Normal(
"relative_lift", mu=0, sigma=0.1
)
conversion_rate_treatment = pm.Deterministic(
"conversion_rate_treatment",
conversion_rate_control * (1 + relative_lift)
)


HDI + ROPE Framework
Declare significance when the \(94\%\) HDI of the posterior lift excludes the ROPE (Region of Practical Equivalence) around zero.
Five-step process:


CUPED: Use pre-treatment outcome as a covariate to reduce posterior variance of the causal effect estimate.
Regress \(y_{\text{post}}\) on \(y_{\text{pre}}\) and estimate the \(\theta\) coefficient.
Compute \(y_{\text{cuped}, i} = y_{\text{post}, i} - \theta \, (y_{\text{pre}, i} - \bar{y}_{\text{pre}})\) for each unit \(i\).
Compute the difference in means of \(y_{\text{cuped}}\) between treatment and control group.
\[ y_{\text{cuped}} = \alpha + \beta_{\text{cuped}} \times \text{treatment} + \varepsilon \]

The Lalonde Dataset: Job training program (National Supported Work, 1970s)
Question: Does job training increase earnings?
re78)re75), race/ethnicity, marital status, degreeBackdoor criterion: block all backdoor paths from treatment to outcome to identify the causal effect.
Two ways to estimate the ATE:
(1) Read the estimate from a regression coefficient:
Fit a linear model of the form
\[ \text{earnings} = \alpha + \beta_{\text{treat}} \text{treat} + \beta_{\text{covariates}} \text{covariates} + \varepsilon \]
and identify the estimate of \(\beta_{\text{treat}}\) as the ATE.
(2) Use the do operator for counterfactual computation:
\[ \text{ATE} = \mathbb{E}[\text{earnings} \mid \text{do}(\text{treat} = 1)] - \mathbb{E}[\text{earnings} \mid \text{do}(\text{treat} = 0)] \]
Pearl’s do operator and counterfactuals
The do operator implements Pearl’s intervention:
\[P(Y \mid \text{do}(T = t))\]
It cuts incoming edges to the treatment node, simulating a randomized experiment.
with pm.Model(coords=coords) as earnings_model:
# --- Data ---
covariates_data = pm.Data(
"covariates_data", covariates_obs, dims=("obs_idx", "covariate")
)
# --- Priors ---
intercept_treat = pm.Normal("intercept_treat", mu=0, sigma=10)
beta_covariate_treat = pm.Normal(
"beta_covariate_treat", mu=0, sigma=1, dims=("covariate",)
)
# --- Parametrization ---
logit_p_treat = intercept_treat + pm.math.dot(
covariates_data, beta_covariate_treat
)
p_treat = pm.math.sigmoid(logit_p_treat)
# --- Likelihood ---
treat = pm.Bernoulli("treat", p=p_treat, dims=("obs_idx",)) # --- Priors ---
intercept_earnings = pm.Normal("intercept_earnings", mu=0, sigma=10)
beta_treat_earnings = pm.Normal("beta_treat_earnings", mu=0, sigma=1)
beta_covariate_earnings = pm.Normal(
"beta_covariate_earnings", mu=0, sigma=1, dims=("covariate",)
)
sigma_earnings = pm.HalfNormal("sigma_earnings", sigma=10.0)
# --- Parametrization ---
mu_earnings = pm.Deterministic(
"mu_earnings",
intercept_earnings + beta_treat_earnings * treat
+ pm.math.dot(covariates_data, beta_covariate_earnings),
dims=("obs_idx",),
)
# --- Likelihood ---
pm.Normal(
"earnings", mu=mu_earnings, sigma=sigma_earnings, dims=("obs_idx",)
)
do OperatorThe do operator implements Pearl’s intervention:
\[P(Y \mid \text{do}(T = t))\]
It cuts incoming edges to the treatment node, simulating a randomized experiment.
We can use the do operator to compute the ATE:
\[ ATE = \mathbb{E}[Y \mid \text{do}(T=1)] - \mathbb{E}[Y \mid \text{do}(T=0)] \]

Computing the ATE using the do operator yields the same result as the regression coefficient approach.


pm.Normal(...) \(\to\) pm.Gamma(...) with a log link.do-operator.Why the do-operator now matters: with a non-linear link, the regression coefficient is not the ATE. The do-operator handles non-linearity, individual counterfactuals, and joint interventions on multiple variables.
Problem: Synthetic dataset where the true causal effect of \(X\) on \(Y\) is zero. We have group-level confounding through group-level effect \(U_G\).

Taken from Richard McElreath’s Statistical Rethinking (2026 Lectures).
Model Structure
| Model | Approach | Recovers true effect? |
|---|---|---|
| Naive | Ignores groups entirely | No (biased) |
| Fixed Effects | Group-specific intercepts | Yes (high variance) |
| Multilevel | Hierarchical priors on groups | No (confounded) |
| Mundlak | Adds group means of \(X\) | Yes (efficient) |
| Mundlak Latent | Latent group variable | Yes (best by LOO) |
\[ y = \beta_{xy}x + \beta_{zy}z + \text{?} + \varepsilon \]
Key insight (Mundlak trick): Including group-level means as predictors absorbs the between-group confounding while preserving partial pooling benefits.
Latent Model
y ~ 0 + x + z + C(g)
y ~ x + z + (1 | g)
y ~ x + z + (1 | g) + x_bar
\[\begin{align*} \color{red}{u_{[g]}} & \sim \text{Normal}(\mu_u, \sigma_u) \\ x &= \alpha_x + \beta_{ux}{\color{red}{u_{[g]}}} + \varepsilon_x \\ y &= \alpha_{y, [g]} + \beta_{xy}x + \beta_{zy}z + \beta_{yg}{\color{red}{u_{[g]}}} + \varepsilon_y \end{align*}\]
:::
Does side-quest engagement increase in-game purchases?Taken from Robert O. Ness’s book: Causal AI.
Model = Decoder (generative process): Follows the causal ordering:
Guide = Encoder (recognition network): Flax (NNX) neural network that maps observed data \((\text{guild membership}, \text{engagement}, \text{purchases})\) to the approximate posterior \(q_\phi(Z \mid \cdot) = \text{Normal}(\mu_\phi, \sigma_\phi)\)
Inference: SVI maximizes the ELBO, jointly training the decoder parameters \(\theta\) and encoder parameters \(\phi\) (NumPyro)
Simulation Study
Here is a simpler parameter recovery study for the CEVAE: CATE Estimation with Causal Effect Variational Autoencoders
Question: How does family intervention reduce substance use disorder? Decompose total effect into direct and indirect pathways.
gender (binary), conflict (ordinal family conflict level).fam_int (family intervention program, binary).dev_peer (deviant peer engagement), sub_exp (drug experimentation).sub_disorder (substance use disorder diagnosis, binary).Effect decomposition
Define \(E_{t, t', t''}\) as the expected outcome when fam_int \(= t\) in the outcome equation, dev_peer follows its \(\text{do}(T\!=\!t')\) distribution, and sub_exp follows its \(\text{do}(T\!=\!t'')\) distribution:
\[ E_{t,t',t''} = \sum_{m_1, m_2} \color{blue}{P(M_1\!=\!m_1 \mid \text{do}(T\!=\!t'))} \cdot \color{red}{P(M_2\!=\!m_2 \mid \text{do}(T\!=\!t''))} \cdot \color{green}{P(Y\!=\!1 \mid \text{do}(T\!=\!t, M_1\!=\!m_1, M_2\!=\!m_2))} \]
The mediators factor because there is no edge between dev_peer and sub_exp in the DAG.
We can compute the effects using the do-operator:
| Effect | Formula | Interpretation |
|---|---|---|
| TE (Total) | \(E_{1,1,1} - E_{0,0,0}\) | Full treatment effect |
| DE (Direct) | \(E_{1,0,0} - E_{0,0,0}\) | fam_int \(\to\) sub_disorder, mediators at control |
IIE\(_1\) (via dev_peer) |
\(E_{0,1,0} - E_{0,0,0}\) | Shift dev_peer to treated, hold rest at control |
IIE\(_2\) (via sub_exp) |
\(E_{0,0,1} - E_{0,0,0}\) | Shift sub_exp to treated, hold rest at control |
| INT (Interaction) | \(E_{0,1,1} - E_{0,1,0} - E_{0,0,1} + E_{0,0,0}\) | Joint mediator contribution beyond individual effects |
Takeaway: With PPLs, you decompose a single treatment effect into mechanism-specific pathways — each one a posterior distribution, each one a counterfactual query against the same model.

Marketing Mix Modeling (MMM) is by nature a causal inference problem.
Question: How to optimally allocate the budget across channels to maximize the revenue in the next period?
We need to estimate marketing efficiency per channel (ROAS) to then optimize the budget allocation.




What PPLs unlock for causal inference
do-operator built in. Counterfactuals are a model transformation, not a separate library. \(\to\) ATE, Mediation
![]()