Media Mix Model and Experimental Calibration: A Simulation Study

In this notebook, we present a complete simulation study of the media mix model (MMM) and experimental calibration method presented in the paper “Media Mix Model Calibration With Bayesian Priors”, by Zhang, et al., where the authors propose a convenient parametrization the regression model in terms of the ROAs (return on advertising spend) instead of the classical regression (beta) coefficients. The benefit of this parametrization is that it allows for using Bayesian priors on the ROAS, which typically come from previous experiments or domain knowledge. Providing this information to the model is essential in real-life MMM applications, as biases and noise can easily fool us. Similar to the author’s paper, we show that the proposed method can provide better ROAS estimation when we have a bias due to missing covariates or omitted variables. We work out an example of the classical media mix model presented in the paper “Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects”, by Jin, et al.. I strongly recommend taking a look in too these two papers before reading this notebook. Reading them in parallel with this notebook is also a good alternative.

For an introduction to the topic I recommend my two previous posts:

and the PyMC-Marketing MMM example.

Outline

Here is the outline of the study.

Data Generating Process
ROAS Computation (Analytically)
Data Preparation
Models:
- Model 1 (Causal Model): Model with all variables needed to specify causal effects of media channels. We test parameter recovery from the data generating process.
- Model 2 (Unobserved Confounder): Model with out a confounding variable. We show how the ROAS estimation in this case is quite off due omitted variable bias (confounding).
- Model 3 (ROAS Calibration): We keep out the confounding variable as in Model 2, but we set up priors in the ROAS instead of the regression (beta) coefficients of the media variables. When setting informative priors on the ROAS, we show the biad due the missing confounding variable decreases significantly. This implies this technique can be very effective to calibrate media mix models.

All the results are fully reproducible!

Prepare Notebook

import arviz as az
import graphviz as gr
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np
import pandas as pd
import preliz as pz
import pymc as pm
import pytensor.tensor as pt
import seaborn as sns
import xarray as xr
from dowhy import CausalModel
from sklearn import set_config
from sklearn.preprocessing import MaxAbsScaler
from tqdm.notebook import tqdm

set_config(transform_output="pandas")


az.style.use("arviz-darkgrid")
plt.rcParams["figure.figsize"] = [12, 7]
plt.rcParams["figure.dpi"] = 100
plt.rcParams["figure.facecolor"] = "white"


%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = "retina"

seed: int = sum(map(ord, "mmm_roas"))
rng: np.random.Generator = np.random.default_rng(seed=seed)

Data Generating Process

We begin by generating a synthetic data set to be able to compare the estimates of various models against the ground truth. We follow a similar strategy as in the PyMC-Marketing MMM example. However, the main difference is that we add a confounder variable which a causal effect on a specific channel and the target. The following DAG represents the data generating process and the causal dependencies between the variables.

g = gr.Digraph()
g.node(name="seasonality", label="seasonality", color="lightgray", style="filled")
g.node(name="trend", label="trend")
g.node(name="z", label="z", color="lightgray", style="filled")
g.node(name="x1", label="x1", color="#2a2eec80", style="filled")
g.node(name="x2", label="x2", color="#fa7c1780", style="filled")
g.node(name="y", label="y", color="#328c0680", style="filled")
g.edge(tail_name="seasonality", head_name="x1")
g.edge(tail_name="z", head_name="x1")
g.edge(tail_name="x1", head_name="y")
g.edge(tail_name="seasonality", head_name="y")
g.edge(tail_name="trend", head_name="y")
g.edge(tail_name="z", head_name="y")
g.edge(tail_name="x2", head_name="y")
g

Here \(x_1\) and \(x_2\) represent two media channels and \(y\) our target variable (e.g. sales). The confounder \(z\) has a causal effect on \(x_1\) and \(y\).

Remark: If you are a color geek like me, you can get the color palette color codes of the arviz theme here. Note that you can add them with transparency as described here. Note that the transparency is specified by the las two characters of the hex code.

We not proceed to generate weekly data for \(2.5\) years. In the following snippet of code we generate the date range, seasonality and trend components and, the covariate \(z\), which is always positive.

# date range
min_date = pd.to_datetime("2021-10-02")
max_date = pd.to_datetime("2024-03-30")

data_df = pd.DataFrame(
    data={"date": pd.date_range(start=min_date, end=max_date, freq="W-SAT")}
)

n = data_df.shape[0]

data_df["dayofyear"] = data_df["date"].dt.dayofyear
data_df["quarter"] = data_df["date"].dt.to_period("Q").dt.strftime("%YQ%q")
data_df["trend"] = (np.linspace(start=0.0, stop=50, num=n) + 10) ** (1 / 3) - 1
data_df["cs"] = -np.sin(2 * 2 * np.pi * data_df["dayofyear"] / 365.25)
data_df["cc"] = np.cos(1 * 2 * np.pi * data_df["dayofyear"] / 365.25)
data_df["seasonality"] = 0.3 * (data_df["cs"] + data_df["cc"])
data_df["z"] = 0.3 * rng.gamma(shape=1, scale=2 / 3, size=n)

The following plot shows the time series development of these features (and the aggregation).

fig, ax = plt.subplots(
    nrows=2,
    ncols=1,
    figsize=(12, 9),
    sharex=True,
    sharey=False,
    layout="constrained",
)
sns.lineplot(
    data=data_df,
    x="date",
    y="trend",
    color="C0",
    label="trend",
    ax=ax[0],
)
sns.lineplot(
    data=data_df,
    x="date",
    y="seasonality",
    color="C1",
    label="seasonality",
    ax=ax[0],
)
sns.lineplot(data=data_df, x="date", y="z", color="C2", label="z", ax=ax[0])
ax[0].legend(loc="upper left")
ax[0].set(xlabel="date", ylabel=None)

sns.lineplot(
    data=data_df.eval("sum = trend + seasonality + z"),
    x="date",
    y="sum",
    color="black",
    label="trend + seasonality + z",
    ax=ax[1],
)
ax[1].legend(loc="upper left")

fig.suptitle(t="Data Components", fontsize=18, fontweight="bold")

Next, we generate the media channels \(x_1\) and \(x_2\). Note that the channel spend \(x_1\) depends both the seasonality component and the \(z\) variable. We also generate Gaussian noise to be added to the target variable \(y\).

data_df["x1"] = 1.2 * (
    0.5
    + 0.4 * data_df["seasonality"]
    + 0.6 * data_df["z"]
    + 0.1 * rng.poisson(lam=1 / 2, size=n)
)

data_df["x2"] = 0.3 * rng.gamma(shape=1, scale=1, size=n)

data_df["epsilon"] = rng.normal(loc=0, scale=0.1, size=n)

We now can visualize all the components:

fig, ax = plt.subplots(
    nrows=2,
    ncols=1,
    figsize=(15, 10),
    sharex=True,
    sharey=False,
    layout="constrained",
)
sns.lineplot(
    data=data_df,
    x="date",
    y="trend",
    color="C0",
    alpha=0.5,
    label="trend",
    ax=ax[0],
)
sns.lineplot(
    data=data_df,
    x="date",
    y="seasonality",
    color="C1",
    alpha=0.5,
    label="seasonality",
    ax=ax[0],
)
sns.lineplot(data=data_df, x="date", y="z", color="C2", alpha=0.5, label="z", ax=ax[0])
sns.lineplot(
    data=data_df, x="date", y="x1", color="C3", linewidth=2.5, label="x1", ax=ax[0]
)
sns.lineplot(
    data=data_df, x="date", y="x2", color="C4", linewidth=2.5, label="x2", ax=ax[0]
)
sns.lineplot(
    data=data_df,
    x="date",
    y="epsilon",
    color="C5",
    linewidth=2.5,
    label="epsilon",
    ax=ax[0],
)
ax[0].legend(loc="upper left")
ax[0].set(xlabel="date", ylabel=None)

sns.lineplot(
    data=data_df.eval("sum = trend + seasonality + z + x1 + x2 + epsilon"),
    x="date",
    y="sum",
    color="black",
    linewidth=2.5,
    label="trend + seasonality + z + x1 + x2 + epsilon",
    ax=ax[1],
)
ax[1].legend(loc="upper left")

fig.suptitle(t="Data Components", fontsize=18, fontweight="bold")

Note that we have not yet added the non-linear effects of the media channels. This is a key component in media mix models as we expect that the media signal to have carryover (adstock) and saturation effects (for a detailed description of these transformation and alternative parametrizations see the paper “Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects”, by Jin, et al.). We add these two transformations using some helper functions (the code is not that important):

def geometric_adstock(x, alpha, l_max, normalize):
    """Vectorized geometric adstock transformation."""
    cycles = [
        pt.concatenate(tensor_list=[pt.zeros(shape=x.shape)[:i], x[: x.shape[0] - i]])
        for i in range(l_max)
    ]
    x_cycle = pt.stack(cycles)
    x_cycle = pt.transpose(x=x_cycle, axes=[1, 2, 0])
    w = pt.as_tensor_variable([pt.power(alpha, i) for i in range(l_max)])
    w = pt.transpose(w)[None, ...]
    w = w / pt.sum(w, axis=2, keepdims=True) if normalize else w
    return pt.sum(pt.mul(x_cycle, w), axis=2)


def logistic_saturation(x, lam):
    """Logistic saturation transformation."""
    return (1 - pt.exp(-lam * x)) / (1 + pt.exp(-lam * x))

First, we apply the adstock transformation. The strength of the carryover effect is controlled by the decay parameter \(0\leq \alpha \leq 1\).

# adstock maximum lag
l_max = 4
# apply geometric adstock transformation
alpha1 = 0.3
alpha2 = 0.5

alpha = np.array([alpha1, alpha2])

data_df[["x1_adstock", "x2_adstock"]] = geometric_adstock(
    x=data_df[["x1", "x2"]], alpha=alpha, l_max=l_max, normalize=True
).eval()

Let’s visualize the raw and transformed data:

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)
sns.lineplot(x="date", y="x1", data=data_df, color="C0", label="x1", ax=ax[0])
sns.lineplot(
    x="date",
    y="x1_adstock",
    data=data_df,
    color="C1",
    label="x1_adstock",
    ax=ax[0],
)
sns.lineplot(x="date", y="x2", data=data_df, color="C0", label="x2", ax=ax[1])
sns.lineplot(
    x="date",
    y="x2_adstock",
    data=data_df,
    color="C1",
    label="x2_adstock",
    ax=ax[1],
)

fig.suptitle("Adstock Transformation", fontsize=18, fontweight="bold")

Observe that the adstock transformation has the effect of smoothing the media signal.

We continue by applying the saturation transformation:

# apply saturation transformation
lam1 = 1.0
lam2 = 2.5

lam = np.array([lam1, lam2])

data_df[["x1_adstock_saturated", "x2_adstock_saturated"]] = logistic_saturation(
    x=data_df[["x1_adstock", "x2_adstock"]], lam=lam
).eval()

This transformation simply compresses peaks. Let;s first visualize the saturation function (which depends on a parameter \(\lambda > 0\)).

x_range = np.linspace(start=0, stop=1.4, num=100)

fig, ax = plt.subplots(figsize=(8, 7))
sns.lineplot(
    data=data_df,
    x="x1_adstock",
    y="x1_adstock_saturated",
    linewidth=3,
    label="x1",
    ax=ax,
)
sns.lineplot(
    x=x_range,
    y=logistic_saturation(x=x_range, lam=lam1).eval(),
    color="C0",
    linestyle="--",
    ax=ax,
)
sns.lineplot(
    data=data_df,
    x="x2_adstock",
    y="x2_adstock_saturated",
    linewidth=3,
    label="x2",
    ax=ax,
)
sns.lineplot(
    x=x_range,
    y=logistic_saturation(x=x_range, lam=lam2).eval(),
    color="C1",
    linestyle="--",
    ax=ax,
)
ax.axline(
    (0, 0), slope=1, color="black", linestyle="--", linewidth=2.5, label="diagonal"
)
ax.legend(loc="upper left")
ax.set_title("Saturation Transformation", fontsize=18, fontweight="bold")

We can now compare the initial signal and how is transformed by these two on-linear transformations:

fig, ax = plt.subplots(
    nrows=3, ncols=2, figsize=(15, 10), sharex=True, sharey=False, layout="constrained"
)
sns.lineplot(x="date", y="x1", data=data_df, color="C0", ax=ax[0, 0])
sns.lineplot(x="date", y="x2", data=data_df, color="C1", ax=ax[0, 1])
sns.lineplot(x="date", y="x1_adstock", data=data_df, color="C0", ax=ax[1, 0])
sns.lineplot(x="date", y="x2_adstock", data=data_df, color="C1", ax=ax[1, 1])
sns.lineplot(x="date", y="x1_adstock_saturated", data=data_df, color="C0", ax=ax[2, 0])
sns.lineplot(x="date", y="x2_adstock_saturated", data=data_df, color="C1", ax=ax[2, 1])
fig.autofmt_xdate()
fig.suptitle("Media Costs Data - Transformed", fontsize=18, fontweight="bold")

Now we are ready to put everything together and generate the target variable \(y\). We just need to apply some weights \(\beta\) to the transformed media data (i.e. specifying the regression coefficients).

beta1 = 2.0
beta2 = 1.5
amplitude = 100

data_df = data_df.eval(
    f"""
    x1_effect = {beta1} * x1_adstock_saturated
    x2_effect = {beta2} * x2_adstock_saturated
    y = {amplitude} * (trend + seasonality + z + x1_effect + x2_effect + epsilon)
    y01 = {amplitude} * (trend + seasonality + z + x2_effect + epsilon)
    y02 = {amplitude} * (trend + seasonality + z + x1_effect + epsilon)
    """
)

Finally, we can see the components and the final target variable:

fig, ax = plt.subplots(
    nrows=2,
    ncols=1,
    figsize=(15, 10),
    sharex=True,
    sharey=False,
    layout="constrained",
)
sns.lineplot(
    data=data_df,
    x="date",
    y="trend",
    color="C0",
    alpha=0.5,
    label="trend",
    ax=ax[0],
)
sns.lineplot(
    data=data_df,
    x="date",
    y="seasonality",
    color="C1",
    alpha=0.5,
    label="seasonality",
    ax=ax[0],
)
sns.lineplot(data=data_df, x="date", y="z", color="C2", alpha=0.5, label="z", ax=ax[0])
sns.lineplot(
    data=data_df,
    x="date",
    y="x1_effect",
    color="C3",
    linewidth=2.5,
    label="x1_effect",
    ax=ax[0],
)
sns.lineplot(
    data=data_df,
    x="date",
    y="x2_effect",
    color="C4",
    linewidth=2.5,
    label="x2_effect",
    ax=ax[0],
)
sns.lineplot(
    data=data_df,
    x="date",
    y="epsilon",
    color="C5",
    linewidth=2.5,
    label="epsilon",
    ax=ax[0],
)
ax[0].legend(loc="upper left")
ax[0].set(title="Components", xlabel="date", ylabel=None)

sns.lineplot(
    data=data_df,
    x="date",
    y="y",
    color="black",
    linewidth=2.5,
    label="100 x (trend + seasonality + z + x1_effect + x2_effect + epsilon)",
    ax=ax[1],
)
ax[1].legend(loc="lower right")
ax[1].set(title="Target")

fig.suptitle(t="Data Generating Process", fontsize=18, fontweight="bold")

Remark: We are computing quantities y01 and y01 as the potential outcome sales when channels \(x_1\) and \(x_2\) are set to \(0\). This is useful to compute the global ROAS (see below).

data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131 entries, 0 to 130
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   date                  131 non-null    datetime64[ns]
 1   dayofyear             131 non-null    int32         
 2   quarter               131 non-null    object        
 3   trend                 131 non-null    float64       
 4   cs                    131 non-null    float64       
 5   cc                    131 non-null    float64       
 6   seasonality           131 non-null    float64       
 7   z                     131 non-null    float64       
 8   x1                    131 non-null    float64       
 9   x2                    131 non-null    float64       
 10  epsilon               131 non-null    float64       
 11  x1_adstock            131 non-null    float64       
 12  x2_adstock            131 non-null    float64       
 13  x1_adstock_saturated  131 non-null    float64       
 14  x2_adstock_saturated  131 non-null    float64       
 15  x1_effect             131 non-null    float64       
 16  x2_effect             131 non-null    float64       
 17  y                     131 non-null    float64       
 18  y01                   131 non-null    float64       
 19  y02                   131 non-null    float64       
dtypes: datetime64[ns](1), float64(17), int32(1), object(1)
memory usage: 20.1+ KB

ROAS

One of the key outputs of media mix models is the ROAS (return on advertising spend) which is defined as the ratio between the incremental sales and the media spend. In the following snippet of code we compute the ROAS analytically for each channel.

Recall from “Media Mix Model Calibration With Bayesian Priors”, by Zhang, et al., the definition of ROAS:

We compute a global ROAS as well as a quarterly ROAS. One has to be careful with the computation as we want to make sure we consider the carryover effects of the media channels. For example, if we compute the ROAS for the first week of the year, we need to consider the carryover effects of the media spend from the previous year.

The following function generate the sales when a given channel is turned off during a given period (note it mimics the data generating process described above). In the language of causal inference, this function computes the potential outcome when a given channel is turned off.

def generate_potential_y(
    dataset, alpha, l_max, lambda_, beta1, beta2, amplitude, channel, t0, t1
):
    dataset = dataset.copy()
    dataset["mask"] = ~dataset["date"].between(left=t0, right=t1, inclusive="both")
    dataset[channel] = dataset[channel].where(cond=dataset["mask"], other=0)
    dataset[["x1_adstock", "x2_adstock"]] = geometric_adstock(
        x=dataset[["x1", "x2"]], alpha=alpha, l_max=l_max, normalize=True
    ).eval()
    dataset[["x1_adstock_saturated", "x2_adstock_saturated"]] = logistic_saturation(
        x=dataset[["x1_adstock", "x2_adstock"]], lam=lambda_
    ).eval()
    return dataset.eval(
        f"""
        x1_effect = {beta1} * x1_adstock_saturated
        x2_effect = {beta2} * x2_adstock_saturated
        y = {amplitude} * (trend + seasonality + z + x1_effect + x2_effect + epsilon)
        """
    )

Let’s test this function does generates our dataset above when we do not turn off any channel. We achieve this by setting the t0 and t1 arguments (which define the turn off period) outside of the data range.

for channel in ["x1", "x2"]:
    temp_df = generate_potential_y(
        dataset=data_df,
        alpha=alpha,
        l_max=l_max,
        lambda_=lam,
        beta1=beta1,
        beta2=beta2,
        amplitude=amplitude,
        channel=channel,
        t0=min_date - pd.Timedelta(days=7),
        t1=min_date - pd.Timedelta(days=7),
    )

    pd.testing.assert_frame_equal(left=temp_df[data_df.columns], right=data_df)

We proceed to generate the potential outcome sales for each channel and each quarter.

quarter_ranges = data_df.groupby("quarter", as_index=False).agg(
    {"date": ["min", "max"]}
)

quarter_ranges.head()

	quarter	date
		min	max
0	2021Q4	2021-10-02	2021-12-25
1	2022Q1	2022-01-01	2022-03-26
2	2022Q2	2022-04-02	2022-06-25
3	2022Q3	2022-07-02	2022-09-24
4	2022Q4	2022-10-01	2022-12-31

roas_data = {
    channel: {
        row["quarter"].item(): generate_potential_y(
            dataset=data_df,
            alpha=alpha,
            l_max=l_max,
            lambda_=lam,
            beta1=beta1,
            beta2=beta2,
            amplitude=amplitude,
            channel=channel,
            t0=row["date"]["min"],
            t1=row["date"]["max"],
        )
        for _, row in tqdm(quarter_ranges.iterrows())
    }
    for channel in ["x1", "x2"]
}

We can visualize the potential outcome against the true sales for each quarter for the first channel \(x_1\).

fig, axes = plt.subplots(
    nrows=quarter_ranges.quarter.nunique(),
    ncols=1,
    figsize=(15, 17),
    sharex=True,
    sharey=True,
    layout="constrained",
)

axes = axes.flatten()

for i, (q, q_df) in enumerate(roas_data["x1"].items()):
    ax = axes[i]
    ax_twin = ax.twinx()

    sns.lineplot(
        data=data_df, x="date", y="y", color="black", label="$y (actual)$", ax=ax
    )
    sns.lineplot(data=q_df, x="date", y="y", color="C0", label="$y (potential)$", ax=ax)
    sns.lineplot(data=q_df, x="date", y="mask", color="C1", ax=ax_twin)
    ax.set_title(f"Quarter {q}")
    ax_twin.grid(None)

fig.suptitle(t="Actual vs Potential - x1", fontsize=18, fontweight="bold")

Observe that the potential sales and the true sales differ immediately after the mask period as a result of the carryover effect of the media spend. Next, we compute the ROAS for each each quarter by simply dividing the difference between the potential sales and the true sales by the media spend.

roas_x1 = pd.DataFrame.from_dict(
    data={
        q: (
            (data_df["y"].sum() - q_df["y"].sum())
            / data_df.query("quarter == @q")["x1"].sum()
        )
        for q, q_df in roas_data["x1"].items()
    },
    orient="index",
    columns=["roas"],
).assign(channel="x1")

We can do the exact same procedure for channel \(x_2\).

fig, axes = plt.subplots(
    nrows=quarter_ranges.quarter.nunique(),
    ncols=1,
    figsize=(15, 17),
    sharex=True,
    sharey=True,
    layout="constrained",
)

axes = axes.flatten()

for i, (q, q_df) in enumerate(roas_data["x2"].items()):
    ax = axes[i]
    ax_twin = ax.twinx()

    sns.lineplot(
        data=data_df, x="date", y="y", color="black", label="$y (actual)$", ax=ax
    )
    sns.lineplot(data=q_df, x="date", y="y", color="C0", label="$y (potential)$", ax=ax)
    sns.lineplot(data=q_df, x="date", y="mask", color="C1", ax=ax_twin)
    ax.set_title(f"Quarter {q}")
    ax_twin.grid(None)

fig.suptitle(t="Actual vs Potential - x2", fontsize=18, fontweight="bold")

roas_x2 = pd.DataFrame.from_dict(
    data={
        q: (
            (data_df["y"].sum() - q_df["y"].sum())
            / data_df.query("quarter == @q")["x2"].sum()
        )
        for q, q_df in roas_data["x2"].items()
    },
    orient="index",
    columns=["roas"],
).assign(channel="x2")

Moreover, we can easily compute the global ROAS:

roas_all_1 = (data_df["y"] - data_df["y01"]).sum() / data_df["x1"].sum()

roas_all_2 = (data_df["y"] - data_df["y02"]).sum() / data_df["x2"].sum()

Let’s visualize the quarterly and global ROAS for each channel:

roas = (
    pd.concat(objs=[roas_x1, roas_x2], axis=0)
    .reset_index(drop=False)
    .rename(columns={"index": "quarter"})
)

g = sns.catplot(
    data=roas.assign(quarter_num=lambda df: df.quarter.str[4:]),
    x="quarter",
    y="roas",
    row="channel",
    hue="quarter_num",
    kind="bar",
    height=4,
    aspect=3,
    alpha=0.7,
    sharey=False,
)

ax = g.axes.flatten()
ax[0].axhline(
    y=roas_all_1, color="black", linestyle="--", linewidth=2.5, label="roas (all)"
)
ax[1].axhline(
    y=roas_all_2, color="black", linestyle="--", linewidth=2.5, label="roas (all)"
)

g.fig.suptitle(t="True ROAS", fontsize=18, fontweight="bold", y=1.03)

We see that channel \(x_2\) has approximately \(75\%\) higher ROAS as compared to channel \(x_1\).

Data Preparation

Now that we have generated the data, we can proceed to prepare it for the media mix model. We start by selecting the features we would have in a real scenario. Note that in such a case we would not have x1_effect or x2_effect as this is precisely what we want to estimate.

columns_to_keep = [
    "date",
    "dayofyear",
    "z",
    "x1",
    "x2",
    "y",
]

model_df = data_df[columns_to_keep]

model_df.head()

	date	dayofyear	z	x1	x2	y
0	2021-10-02	275	0.053693	0.646554	0.336188	199.329637
1	2021-10-09	282	1.045243	1.411917	0.203931	371.237041
2	2021-10-16	289	0.179703	0.837610	0.024026	272.215933
3	2021-10-23	296	0.140496	0.973612	0.120257	291.104040
4	2021-10-30	303	0.869155	1.415985	0.084630	386.243000

Next, we proceed to scale the data with a MaxAbsScaler. For more details about the scaling please see the previous post Media Effect Estimation with PyMC: Adstock, Saturation & Diminishing Returns.

date = model_df["date"]

index_scaler = MaxAbsScaler()
index_scaled = index_scaler.fit_transform(model_df.reset_index(drop=False)[["index"]])

target = "y"
target_scaler = MaxAbsScaler()
target_scaled = target_scaler.fit_transform(model_df[[target]])

channels = ["x1", "x2"]
channels_scaler = MaxAbsScaler()
channels_scaled = channels_scaler.fit_transform(model_df[channels])

controls = ["z"]
controls_scaler = MaxAbsScaler()
controls_scaled = controls_scaler.fit_transform(model_df[controls])

To model the yearly seasonality we generate some Fourier modes to include them into the regression model.

n_order = 3
periods = model_df["dayofyear"] / 365.25
fourier_features = pd.DataFrame(
    {
        f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods * order)
        for order in range(1, n_order + 1)
        for func in ("sin", "cos")
    }
)

Now we are ready to start with the media mix model.

Model 1: Causal Model

In this first model we simply add the features from our causal graph. We should be able to recover the parameters and the ROAS from the data generating process. Before jumping into the model lets look closer the the causal structure.

Causal Model

It is very important to view media mix models from a causal perspective. We can not simply add a bunch of variables and hope for the best! See for example Using Data Science for Bad Decision-Making: A Case Study. There is a systematic way to select the variables we should / should not add into the model. To see this how this works in practice, let’s use DoWhy to get the set of variables for our model. First we need to define the DAG (this is where domain knowledge from the marketing mechanism comes into place!).

gml_graph = """
graph [
    directed 1

    node [
        id x1
        label "x1"
    ]
    node [
        id x2
        label "x2"
    ]
    node [
        id z
        label "z"
    ]
    node [
        id seasonality
        label "seasonality"
    ]
    node [
        id trend
        label "trend"
    ]
    node [
        id y
        label "y"
    ]
    edge [
        source seasonality
        target x1
    ]
    edge [
        source z
        target x1
    ]
    edge [
        source seasonality
        target y
    ]
    edge [
        source trend
        target y
    ]
    edge [
        source z
        target y
    ]
    edge [
        source x1
        target y
    ]
    edge [
        source x2
        target y
    ]
]
"""

We want to see if we can find a set of variables which we can put in a regression model to estimate the effect of \(x_1\) and \(x_2\) on \(y\). Hence, we define to causal models:

causal_model_1 = CausalModel(
    data=model_df,
    graph=gml_graph,
    treatment="x1",
    outcome="y",
)

causal_model_2 = CausalModel(
    data=model_df,
    graph=gml_graph,
    treatment="x2",
    outcome="y",
)

causal_model_1.view_model()

We can now query for backdoor paths between the treatment and the outcome:

causal_model_1._graph.get_backdoor_paths(nodes1=["x1"], nodes2=["y"])

[['x1', 'z', 'y'], ['x1', 'seasonality', 'y']]

causal_model_2._graph.get_backdoor_paths(nodes1=["x2"], nodes2=["y"])

[]

Note that for \(x_1\) we need to add \(z\) and seasonality, otherwise we would have open paths:

causal_model_1._graph.check_valid_backdoor_set(nodes1=["x1"], nodes2=["y"], nodes3=[])

{'is_dseparated': False}

Therefore the set of variables we should add to the model are \(x_1\), \(x_2\), \(z\), seasonality and trend (the trend component will allow us to reduce the variance on the estimate).

causal_model_1._graph.check_valid_backdoor_set(
    nodes1=["x1"], nodes2=["y"], nodes3=["z", "seasonality", "trend"]
)

{'is_dseparated': True}

causal_model_1._graph.check_valid_backdoor_set(
    nodes1=["x2"], nodes2=["y"], nodes3=["z", "seasonality", "trend"]
)

{'is_dseparated': True}

Prior Specification

Bayesian models provide great flexibility to add domain knowledge into the model through the prior distributions. PreliZ is a great library to study and set priors.

For example, what should be the prior for the adstock parameter \(\alpha\)? We know it should be between zero and one. We also do not expect the carry over to be very strong. Hence, we can use a Beta distribution with the mass a bit towards zero.

fig, ax = plt.subplots(figsize=(10, 6))
pz.Beta(alpha=2, beta=3).plot_pdf(ax=ax)
ax.set(xlabel=r"$\alpha$")
ax.set_title(label=r"Prior Distribution $\alpha$", fontsize=18, fontweight="bold")

For \(\lambda\) we could also argue similarly: it is a positive number and we expect a mild saturation. Hence, we can use a Gamma distribution with mean close to one. Still, we make sure it is not concentrated to much around one in order to allow for some flexibility.

fig, ax = plt.subplots(figsize=(10, 6))
pz.Gamma(alpha=2, beta=2).plot_pdf(ax=ax)
ax.set(xlabel=r"$\lambda$")
ax.set_title(label=r"Prior Distribution $\lambda$", fontsize=18, fontweight="bold")

A much harder choice for prior are the regression coefficients of the media variables. Without any prior information on channel performance a common choice is to use a weakly informative prior driven by the investment share. The idea motivation behind it is that we expect (without seeing the data!) that channels where we spend the most to contribute more to sales. This can actually not be the case once we see the data. Nevertheless, is a good guidance if you do not have any other information. One could put the same prior for all channels, but in practice I have seen in practice that, given the small sample sizes, smaller channels can get very high contributions just because of the noise (I have also tested via simulations). It is important to emphasize that there is no right answer here. It is a matter of what you believe and what you want to test.

For this example, let’s use such a prior. In addition, note we expect the effect to be always positive. Hence, we can use a half-normal distribution to enforce this.

channel_share = data_df[channels].sum() / data_df[channels].sum().sum()

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(
    x=channel_share.index, y=channel_share.values, hue=channel_share.index, ax=ax
)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1))
ax.set(xlabel="channel", ylabel="share")
ax.set_title(label="Channel Spend Share", fontsize=18, fontweight="bold")

fig, ax = plt.subplots(figsize=(10, 6))
pz.HalfNormal(sigma=channel_share.loc["x1"]).plot_pdf(ax=ax)
pz.HalfNormal(sigma=channel_share.loc["x2"]).plot_pdf(ax=ax)
ax.set(xlabel=r"$\beta$")
ax.set_title(label=r"Prior Distribution $\beta$", fontsize=18, fontweight="bold")

Model Specification

We are now ready to specify the model. We will not go into details as we are implementing the classical model from “Media Mix Model Calibration With Bayesian Priors”, by Zhang, et al., which is described in detail in the the previous post Media Effect Estimation with PyMC: Adstock, Saturation & Diminishing Returns.

The only difference it we do not add an explicit trend variable but we use a Gaussian process to model this latent variable (therefore we also remove an intercept) as described in the post Estimating the Long-Term Base in Marketing-Mix Models. We are using the same technique as described in the example Time Series Modeling with HSGP: Baby Births Example.

coords = {"date": date, "channel": channels, "fourier_mode": np.arange(2 * n_order)}

with pm.Model(coords=coords) as model_1:
    # --- Data Containers ---

    index_scaled_data = pm.Data(
        name="index_scaled",
        value=index_scaled.to_numpy().flatten(),
        mutable=True,
        dims="date",
    )

    channels_scaled_data = pm.Data(
        name="channels_scaled",
        value=channels_scaled,
        mutable=True,
        dims=("date", "channel"),
    )

    z_scaled_data = pm.Data(
        name="z_scaled",
        value=controls_scaled.to_numpy().flatten(),
        mutable=True,
        dims="date",
    )

    fourier_features_data = pm.Data(
        name="fourier_features",
        value=fourier_features,
        mutable=True,
        dims=("date", "fourier_mode"),
    )

    y_scaled_data = pm.Data(
        name="y_scaled",
        value=target_scaled.to_numpy().flatten(),
        mutable=True,
        dims="date",
    )

    # --- Priors ---

    amplitude_trend = pm.HalfNormal(name="amplitude_trend", sigma=1)
    ls_trend_params = pm.find_constrained_prior(
        distribution=pm.InverseGamma,
        lower=0.1,
        upper=0.9,
        init_guess={"alpha": 3, "beta": 1},
        mass=0.95,
    )
    ls_trend = pm.InverseGamma(name="ls_trend", **ls_trend_params)

    alpha = pm.Beta(name="alpha", alpha=2, beta=3, dims="channel")
    lam = pm.Gamma(name="lam", alpha=2, beta=2, dims="channel")
    beta_channel = pm.HalfNormal(
        name="beta_channel",
        sigma=channel_share.to_numpy(),
        dims="channel",
    )
    beta_z = pm.Normal(name="beta_z", mu=0, sigma=1)
    beta_fourier = pm.Normal(name="beta_fourier", mu=0, sigma=1, dims="fourier_mode")

    sigma = pm.HalfNormal(name="sigma", sigma=1)
    nu = pm.Gamma(name="nu", alpha=25, beta=2)

    # --- Parametrization ---

    cov_trend = amplitude_trend * pm.gp.cov.ExpQuad(input_dim=1, ls=ls_trend)
    gp_trend = pm.gp.HSGP(m=[20], c=1.5, cov_func=cov_trend)
    f_trend = gp_trend.prior(name="f_trend", X=index_scaled_data[:, None], dims="date")

    channel_adstock = pm.Deterministic(
        name="channel_adstock",
        var=geometric_adstock(
            x=channels_scaled_data,
            alpha=alpha,
            l_max=l_max,
            normalize=True,
        ),
        dims=("date", "channel"),
    )
    channel_adstock_saturated = pm.Deterministic(
        name="channel_adstock_saturated",
        var=logistic_saturation(x=channel_adstock, lam=lam),
        dims=("date", "channel"),
    )
    channel_contributions = pm.Deterministic(
        name="channel_contributions",
        var=channel_adstock_saturated * beta_channel,
        dims=("date", "channel"),
    )

    fourier_contribution = pm.Deterministic(
        name="fourier_contribution",
        var=pt.dot(fourier_features_data, beta_fourier),
        dims="date",
    )

    z_contribution = pm.Deterministic(
        name="z_contribution",
        var=z_scaled_data * beta_z,
        dims="date",
    )

    mu = pm.Deterministic(
        name="mu",
        var=f_trend
        + channel_contributions.sum(axis=-1)
        + z_contribution
        + fourier_contribution,
        dims="date",
    )

    # --- Likelihood ---
    y = pm.StudentT(
        name="y",
        nu=nu,
        mu=mu,
        sigma=sigma,
        observed=y_scaled_data,
        dims="date",
    )

pm.model_to_graphviz(model=model_1)

Prior Predictive Checks

Before fitting the model, it is always a good idea to check the prior predictive distribution. This is a way to see if the model is able to generate data that is consistent with our prior beliefs. We can do this by sampling from the prior and plotting the results.

with model_1:
    prior_predictive_1 = pm.sample_prior_predictive(samples=2_000, random_seed=rng)

fig, ax = plt.subplots()
az.plot_ppc(data=prior_predictive_1, group="prior", kind="kde", ax=ax)
ax.set_title(label="Prior Predictive - Model 1", fontsize=18, fontweight="bold")

It looks quite reasonable.

Model Fitting

We now fit the model using the No-U-Turn Sampler (NUTS) algorithm. We will run four chains in parallel and sample \(2000\) draws from each chain. We use NumPyro as the backend for the model fitting (it is super fast).

with model_1:
    idata_1 = pm.sample(
        target_accept=0.9,
        draws=2_000,
        chains=4,
        nuts_sampler="numpyro",
        random_seed=rng,
    )
    posterior_predictive_1 = pm.sample_posterior_predictive(
        trace=idata_1, random_seed=rng
    )

Model Diagnostics

First we verify we do not have divergences:

idata_1["sample_stats"]["diverging"].sum().item()

\(\displaystyle 0\)

Next, we look into the posterior distribution of the parameters.

var_names = [
    "amplitude_trend",
    "ls_trend",
    "alpha",
    "lam",
    "beta_channel",
    "beta_z",
    "beta_fourier",
    "sigma",
    "nu",
]

az.summary(data=idata_1, var_names=var_names, round_to=3)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
amplitude_trend	0.380	0.339	0.031	1.024	0.006	0.004	2216.743	3547.131	1.001
ls_trend	0.307	0.042	0.224	0.380	0.001	0.001	2068.505	2892.961	1.003
alpha[x1]	0.318	0.072	0.181	0.449	0.001	0.001	7797.458	6545.114	1.000
alpha[x2]	0.493	0.028	0.440	0.544	0.000	0.000	10193.826	5924.861	1.001
lam[x1]	0.973	0.452	0.276	1.806	0.006	0.004	5677.231	6177.363	1.000
lam[x2]	2.812	0.687	1.548	4.070	0.009	0.006	6103.450	5004.784	1.001
beta_channel[x1]	0.709	0.312	0.262	1.288	0.004	0.003	5211.349	6102.915	1.000
beta_channel[x2]	0.404	0.094	0.268	0.581	0.001	0.001	5946.573	4792.955	1.001
beta_z	0.190	0.017	0.158	0.222	0.000	0.000	7142.269	5758.734	1.000
beta_fourier[0]	0.000	0.008	-0.015	0.012	0.000	0.000	3680.755	3103.355	1.000
beta_fourier[1]	0.047	0.007	0.034	0.060	0.000	0.000	5009.843	3406.016	1.001
beta_fourier[2]	-0.055	0.004	-0.063	-0.047	0.000	0.000	7868.972	6689.172	1.001
beta_fourier[3]	-0.004	0.002	-0.008	0.000	0.000	0.000	10720.589	7004.293	1.001
beta_fourier[4]	-0.004	0.002	-0.008	-0.000	0.000	0.000	10558.562	6143.012	1.000
beta_fourier[5]	-0.000	0.002	-0.004	0.004	0.000	0.000	10858.069	5358.714	1.000
sigma	0.015	0.001	0.013	0.018	0.000	0.000	9595.037	5908.872	1.000
nu	12.606	2.482	8.049	17.155	0.023	0.017	12003.132	6203.788	1.001

axes = az.plot_trace(
    data=idata_1,
    var_names=var_names,
    compact=True,
    kind="rank_bars",
    backend_kwargs={"figsize": (15, 17), "layout": "constrained"},
)
plt.gcf().suptitle("Trace - Model 1", fontsize=16)

The trace looks fine!

Posterior Predictive Checks

We now want to see the model fit. We want to do it in the original scale. Therefore we need to scale our posterior samples back.

pp_vars_original_scale_1 = {
    var_name: xr.apply_ufunc(
        target_scaler.inverse_transform,
        idata_1["posterior"][var_name].expand_dims(dim={"_": 1}, axis=-1),
        input_core_dims=[["date", "_"]],
        output_core_dims=[["date", "_"]],
        vectorize=True,
    ).squeeze(dim="_")
    for var_name in ["mu", "f_trend", "fourier_contribution", "z_contribution"]
}

pp_vars_original_scale_1["channel_contributions"] = xr.apply_ufunc(
    target_scaler.inverse_transform,
    idata_1["posterior"]["channel_contributions"],
    input_core_dims=[["date", "channel"]],
    output_core_dims=[["date", "channel"]],
    vectorize=True,
)

# We define the baseline as f_trand + fourier_contribution + z_contribution
pp_vars_original_scale_1["baseline"] = xr.apply_ufunc(
    target_scaler.inverse_transform,
    (
        idata_1["posterior"]["f_trend"]
        + idata_1["posterior"]["fourier_contribution"]
        + idata_1["posterior"]["z_contribution"]
    ).expand_dims(dim={"_": 1}, axis=-1),
    input_core_dims=[["date", "_"]],
    output_core_dims=[["date", "_"]],
    vectorize=True,
).squeeze(dim="_")

We also scale back the likelihood.

pp_likelihood_original_scale_1 = xr.apply_ufunc(
    target_scaler.inverse_transform,
    posterior_predictive_1["posterior_predictive"]["y"].expand_dims(
        dim={"_": 1}, axis=-1
    ),
    input_core_dims=[["date", "_"]],
    output_core_dims=[["date", "_"]],
    vectorize=True,
).squeeze(dim="_")

We can now visualize the results:

fig, ax = plt.subplots()
sns.lineplot(data=model_df, x="date", y="y", color="black", label=target, ax=ax)
az.plot_hdi(
    x=date,
    y=pp_likelihood_original_scale_1,
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.1, "label": r"likelihood $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["mu"],
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.3, "label": r"$\mu$ $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper left")
ax.set_title(label="Posterior Predictive - Model 1", fontsize=18, fontweight="bold")

The model does explain the data variance quite well. Let’s look into the distribution:

fig, ax = plt.subplots()
az.plot_ppc(
    data=posterior_predictive_1,
    num_pp_samples=1_000,
    observed_rug=True,
    random_seed=seed,
    ax=ax,
)
ax.set_title(label="Posterior Predictive - Model 1", fontsize=18, fontweight="bold")

Looks good!

Model Components

We now visualize the model components.

fig, ax = plt.subplots()
sns.lineplot(data=model_df, x="date", y="y", color="black", label=target, ax=ax)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["channel_contributions"].sel(channel="x1"),
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.5, "label": r"$x1$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["channel_contributions"].sel(channel="x2"),
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.5, "label": r"$x2$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["f_trend"],
    hdi_prob=0.94,
    color="C2",
    fill_kwargs={"alpha": 0.3, "label": r"$f_\text{trend}$ $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["fourier_contribution"],
    hdi_prob=0.94,
    color="C3",
    fill_kwargs={"alpha": 0.3, "label": r"seasonality $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["z_contribution"],
    hdi_prob=0.94,
    color="C4",
    fill_kwargs={"alpha": 0.3, "label": r"$z$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.1), ncol=3)
ax.set_title(label="Components Contributions - Model 1", fontsize=18, fontweight="bold")

Remarks:

The Gaussian process is able to capture the trend component quite well.
Note that the variance of the \(x_1\) estimate is much wider than the one for \(x_2\). This is due to the fact \(x_1\) has a high correlation with the seasonality component. Multicollinearity is a common issue in media mix models.

We now plot the baseline against the media contribution:

fig, ax = plt.subplots()
sns.lineplot(data=model_df, x="date", y="y", color="black", label=target, ax=ax)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["channel_contributions"].sel(channel="x1"),
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.5, "label": r"$x1$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["channel_contributions"].sel(channel="x2"),
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.5, "label": r"$x2$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["baseline"],
    hdi_prob=0.94,
    color="C7",
    fill_kwargs={"alpha": 0.5, "label": r"baseline $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper left")
ax.set_title(label="Baseline vs Channels - Model 1", fontsize=18, fontweight="bold")

Let’s now deep dive into the media contributions:

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)
sns.lineplot(
    x="date",
    y="x1_effect",
    data=data_df.assign(x1_effect=lambda x: amplitude * x["x1_effect"]),
    color="C0",
    label=r"$x1$ effect",
    ax=ax[0],
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["channel_contributions"].sel(channel="x1"),
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.3, "label": r"$x1$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax[0],
)
ax[0].legend(loc="upper right")

sns.lineplot(
    x="date",
    y="x2_effect",
    data=data_df.assign(x2_effect=lambda x: amplitude * x["x2_effect"]),
    color="C1",
    label=r"$x2$ effect",
    ax=ax[1],
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_1["channel_contributions"].sel(channel="x2"),
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.3, "label": r"$x2$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax[1],
)
ax[1].legend(loc="upper right")

fig.suptitle("Channel Contributions - Model 1", fontsize=18, fontweight="bold")

Overall, the model recovered the true effects. Still, the estimate of \(x_2\) is slightly underestimated by the model (the true value is not in the middle of the HDI).

Media Parameters

We now compare the estimated against the true values for the parameters \(\alpha\), \(\lambda\) and the regression coefficients \(\beta\). Recall that since we are working on the scaled space we need to scale \(\lambda\) and \(\beta\) back:

alpha_posterior_1 = idata_1["posterior"]["alpha"]
lam_posterior_1 = idata_1["posterior"]["lam"] * channels_scaler.scale_
beta_channel_posterior_1 = (
    idata_1["posterior"]["beta_channel"]
    * (1 / channels_scaler.scale_)
    * (target_scaler.scale_)
    * (1 / amplitude)
)

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=alpha_posterior_1, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(
    x=alpha1, color="C0", linestyle="--", linewidth=2, label=r"$\alpha_1$ (true)"
)
ax.axvline(
    x=alpha2, color="C1", linestyle="--", linewidth=2, label=r"$\alpha_2$ (true)"
)
ax.legend(loc="upper left")
ax.set_title(
    label=r"Adstock Parameter $\alpha$ ($94\%$ HDI) - Model 1",
    fontsize=18,
    fontweight="bold",
)

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=lam_posterior_1, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(x=lam1, color="C0", linestyle="--", linewidth=2, label=r"$\lambda_1$ (true)")
ax.axvline(x=lam2, color="C1", linestyle="--", linewidth=2, label=r"$\lambda_2$ (true)")
ax.legend(loc="upper right")
ax.set_title(
    label=r"Saturation Parameter $\lambda$ ($94\%$ HDI) - Model 1",
    fontsize=18,
    fontweight="bold",
)

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=beta_channel_posterior_1, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(x=beta1, color="C0", linestyle="--", linewidth=2, label=r"$\beta_1$ (true)")
ax.axvline(x=beta2, color="C1", linestyle="--", linewidth=2, label=r"$\beta_2$ (true)")
ax.legend(loc="upper right")
ax.set_title(
    label=r"Channel $\beta$ ($94\%$ HDI) - Model 1", fontsize=18, fontweight="bold"
)

Remarks:

We wer able to recover the true values of the carryover parameter \(\alpha\).
For \(x_1\), the true values of \(\lambda\) and \(\beta\) are in the \(94\%\) HDI.
The \(94\%\) HDI of the regression coefficient \(\beta\) for \(x_1\) is much wider than the one for \(x_2\). As mentioned before, this is probably related with multicollinearity with the seasonal component.
For \(x_2\) the model overestimated the saturation parameter \(\lambda\) and underestimated the regression coefficient \(\beta\). Maybe we should have had stricter priors? This is a true challenge in practice as we do not know the true values. This is a motivation for this notebook! Being able to easily pass information through ROAS than beta coefficients in the scaled space is a much more convenient way to pass information to the model.

ROAS Estimation

Once the model is fitted, we can use the posterior samples to compute the ROAS as in above. We simply need to generate predictions when setting the media data to zero. Let’s start with the global ROAS:

predictions_roas_data_all_1 = {"x1": {}, "x2": {}}

for channel in tqdm(["x1", "x2"]):
    with model_1:
        pm.set_data(
            new_data={
                "channels_scaled": channels_scaler.transform(
                    data_df[channels].assign(**{channel: 0})
                )
            }
        )
        predictions_roas_data_all_1[channel] = pm.sample_posterior_predictive(
            trace=idata_1, var_names=["y", "mu"], progressbar=False, random_seed=rng
        )

We now scale the predictions back to the original scale:

predictions_roas_data_scaled_diff_all_1 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_data_scaled_diff_all_1[channel] = pp_vars_original_scale_1[
        "mu"
    ] - xr.apply_ufunc(
        target_scaler.inverse_transform,
        predictions_roas_data_all_1[channel]["posterior_predictive"]["mu"].expand_dims(
            dim={"_": 1}, axis=-1
        ),
        input_core_dims=[["date", "_"]],
        output_core_dims=[["date", "_"]],
        vectorize=True,
    ).squeeze(
        dim="_"
    )

We can now visualize the difference iin the potential outcomes:

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)
az.plot_hdi(
    x=date,
    y=predictions_roas_data_scaled_diff_all_1["x1"],
    color="C0",
    smooth=False,
    fill_kwargs={"alpha": 0.3, "label": r"$E(y - y_{01})$ $94\%$ HDI"},
    ax=ax[0],
)
sns.lineplot(
    data=data_df.assign(x1_effect=lambda x: amplitude * x["x1_effect"]),
    x="date",
    y="x1_effect",
    color="C0",
    label=r"$x1$ effect",
    ax=ax[0],
)
az.plot_hdi(
    x=date,
    y=predictions_roas_data_scaled_diff_all_1["x2"],
    color="C1",
    smooth=False,
    fill_kwargs={"alpha": 0.3, "label": r"$E(y - y_{02})$ $94\%$ HDI"},
    ax=ax[1],
)
sns.lineplot(
    data=data_df.assign(x2_effect=lambda x: amplitude * x["x2_effect"]),
    x="date",
    y="x2_effect",
    color="C1",
    label=r"$x2$ effect",
    ax=ax[1],
)
ax[0].legend(loc="upper left")
ax[1].legend(loc="upper left")
ax[0].set(ylabel=None)
ax[1].set(ylabel=None)
fig.suptitle("Potential Outcomes Difference - Model 1", fontsize=18, fontweight="bold")

It is not surprise that this is precisely the same plot as the one we saw before. Namely, the media contributions from the model 🙃.

We now compute the ROAS:

predictions_roas_all_1 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_all_1[channel] = (
        predictions_roas_data_scaled_diff_all_1[channel].sum(dim="date")
        / data_df[channel].sum()
    )

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)

az.plot_posterior(predictions_roas_all_1["x1"], ref_val=roas_all_1, ax=ax[0])
ax[0].set(title="x1")
az.plot_posterior(predictions_roas_all_1["x2"], ref_val=roas_all_2, ax=ax[1])
ax[1].set(title="x2", xlabel="ROAS")
fig.suptitle("Global ROAS - Model 1", fontsize=18, fontweight="bold")

We can proceed in an analogous way for the quarterly ROAS. This time we need to loop ever time series on which we set up the quarterly spend to zero for a given channel.

predictions_roas_data_1 = {"x1": {}, "x2": {}}

for channel in tqdm(["x1", "x2"]):
    for q in tqdm(roas_data[channel].keys()):
        with model_1:
            pm.set_data(
                new_data={
                    "channels_scaled": channels_scaler.transform(
                        roas_data[channel][q][channels]
                    )
                }
            )
            predictions_roas_data_1[channel][q] = pm.sample_posterior_predictive(
                trace=idata_1, var_names=["y", "mu"], progressbar=False, random_seed=rng
            )

predictions_roas_data_scaled_diff_1 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_data_scaled_diff_1[channel] = {
        q: pp_vars_original_scale_1["mu"]
        - xr.apply_ufunc(
            target_scaler.inverse_transform,
            idata_q["posterior_predictive"]["mu"].expand_dims(dim={"_": 1}, axis=-1),
            input_core_dims=[["date", "_"]],
            output_core_dims=[["date", "_"]],
            vectorize=True,
        ).squeeze(dim="_")
        for q, idata_q in tqdm(predictions_roas_data_1[channel].items())
    }

In order to have a better understanding on this computation, we can plot one of such time series differences:

fig, ax = plt.subplots()
ax.axvspan(
    xmin=quarter_ranges.loc[quarter_ranges.quarter == "2022Q3"]["date"]["min"].item(),
    xmax=quarter_ranges.loc[quarter_ranges.quarter == "2022Q3"]["date"]["max"].item(),
    color="C1",
    label="2022-Q3",
    alpha=0.2,
)
az.plot_hdi(
    x=date,
    y=predictions_roas_data_scaled_diff_1["x2"]["2022Q3"],
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.7, "label": r"$x2$ difference against actuals $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper right")
ax.set_title(
    label=r"Potential Outcomes Difference - Model 1 (Channel $x2$ (2023-Q3))",
    fontsize=18,
    fontweight="bold",
)

Observe the difference is non-zero just after the selected quarter. This is due to the carryover effect of the media spend.

We now divide the differences by the spend to get the ROAS:

predictions_roas_1 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_1[channel] = {
        q: idata_q.sum(dim="date") / data_df.query("quarter == @q")[channel].sum()
        for q, idata_q in predictions_roas_data_scaled_diff_1[channel].items()
    }

Here are the results for \(x_1\) and compare them with the true values:

x1_ref_vals = (
    roas.query("channel == 'x1'")
    .drop("channel", axis=1)
    .rename(columns={"roas": "ref_val"})
    .to_dict(orient="records")
)

axes = az.plot_posterior(
    data=xr.concat(
        predictions_roas_1["x1"].values(),
        dim=pd.Index(data=predictions_roas_1["x1"].keys(), name="quarter"),
    ),
    figsize=(18, 15),
    grid=(5, 2),
    ref_val={"mu": x1_ref_vals},
    backend_kwargs={"sharex": True, "layout": "constrained"},
)
fig = axes.ravel()[0].get_figure()
fig.suptitle(
    t="Posterior Predictive - ROAS x1 - Model 1", fontsize=18, fontweight="bold"
)

Here are the results for \(x_2\):

x2_ref_vals = (
    roas.query("channel == 'x2'")
    .drop("channel", axis=1)
    .rename(columns={"roas": "ref_val"})
    .to_dict(orient="records")
)

axes = az.plot_posterior(
    data=xr.concat(
        predictions_roas_1["x2"].values(),
        dim=pd.Index(data=predictions_roas_1["x2"].keys(), name="quarter"),
    ),
    figsize=(18, 15),
    grid=(5, 2),
    ref_val={"mu": x2_ref_vals},
    backend_kwargs={"sharex": True, "layout": "constrained"},
)
fig = axes.ravel()[0].get_figure()
fig.suptitle(
    t="Posterior Predictive - ROAS x2 - Model 1", fontsize=18, fontweight="bold"
)

These results are consistent with the ones from the global ROAS. The model underestimates for \(x_2\) (for \(x_1\) the difference lies within the reasonable uncertainty estimates).

Model 2: Unobserved Confounder

In this second model we keep the same model structure as above but we remove the confounder \(z\) as a covariate. We want to see if the model is able to recover the true effects of the media channels. We should expect the model to overestimate the effect of \(x_1\) as a result of omitted variable bias.

with pm.Model(coords=coords) as model_2:
    # --- Data Containers ---

    index_scaled_data = pm.Data(
        name="index_scaled",
        value=index_scaled.to_numpy().flatten(),
        mutable=True,
        dims="date",
    )

    channels_scaled_data = pm.Data(
        name="channels_scaled",
        value=channels_scaled,
        mutable=True,
        dims=("date", "channel"),
    )

    fourier_features_data = pm.Data(
        name="fourier_features",
        value=fourier_features,
        mutable=True,
        dims=("date", "fourier_mode"),
    )

    y_scaled_data = pm.Data(
        name="y_scaled",
        value=target_scaled.to_numpy().flatten(),
        mutable=True,
        dims="date",
    )

    # --- Priors ---

    amplitude_trend = pm.HalfNormal(name="amplitude_trend", sigma=1)
    ls_trend_params = pm.find_constrained_prior(
        distribution=pm.InverseGamma,
        lower=0.1,
        upper=0.9,
        init_guess={"alpha": 3, "beta": 1},
        mass=0.95,
    )
    ls_trend = pm.InverseGamma(name="ls_trend", **ls_trend_params)

    alpha = pm.Beta(name="alpha", alpha=2, beta=3, dims="channel")
    lam = pm.Gamma(name="lam", alpha=2, beta=2, dims="channel")
    beta_channel = pm.HalfNormal(
        name="beta_channel",
        sigma=channel_share.to_numpy(),
        dims="channel",
    )
    beta_fourier = pm.Normal(name="beta_fourier", mu=0, sigma=1, dims="fourier_mode")

    sigma = pm.HalfNormal(name="sigma", sigma=1)
    nu = pm.Gamma(name="nu", alpha=25, beta=2)

    # --- Parametrization ---

    cov_trend = amplitude_trend * pm.gp.cov.ExpQuad(input_dim=1, ls=ls_trend)
    gp_trend = pm.gp.HSGP(m=[20], c=1.5, cov_func=cov_trend)
    f_trend = gp_trend.prior(name="f_trend", X=index_scaled_data[:, None], dims="date")

    channel_adstock = pm.Deterministic(
        name="channel_adstock",
        var=geometric_adstock(
            x=channels_scaled_data,
            alpha=alpha,
            l_max=l_max,
            normalize=True,
        ),
        dims=("date", "channel"),
    )
    channel_adstock_saturated = pm.Deterministic(
        name="channel_adstock_saturated",
        var=logistic_saturation(x=channel_adstock, lam=lam),
        dims=("date", "channel"),
    )
    channel_contributions = pm.Deterministic(
        name="channel_contributions",
        var=channel_adstock_saturated * beta_channel,
        dims=("date", "channel"),
    )

    fourier_contribution = pm.Deterministic(
        name="fourier_contribution",
        var=pt.dot(fourier_features_data, beta_fourier),
        dims="date",
    )

    mu = pm.Deterministic(
        name="mu",
        var=f_trend + channel_contributions.sum(axis=-1) + fourier_contribution,
        dims="date",
    )

    # --- Likelihood ---
    y = pm.StudentT(
        name="y",
        nu=nu,
        mu=mu,
        sigma=sigma,
        observed=y_scaled_data,
        dims="date",
    )

pm.model_to_graphviz(model=model_2)

The prior predictive check should be similar to the one above. We will skip it. We fit the model directly.

with model_2:
    idata_2 = pm.sample(
        target_accept=0.9,
        draws=2_000,
        chains=4,
        nuts_sampler="numpyro",
        random_seed=rng,
    )
    posterior_predictive_2 = pm.sample_posterior_predictive(
        trace=idata_2, random_seed=rng
    )

Model Diagnostics

We inspect for divergences the trace of the model:

idata_2["sample_stats"]["diverging"].sum().item()

\(\displaystyle 0\)

var_names = [
    "amplitude_trend",
    "ls_trend",
    "alpha",
    "lam",
    "beta_channel",
    "beta_fourier",
    "sigma",
    "nu",
]

az.summary(data=idata_2, var_names=var_names, round_to=3)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
amplitude_trend	0.296	0.298	0.017	0.867	0.006	0.005	1669.534	3017.828	1.001
ls_trend	0.337	0.049	0.243	0.426	0.001	0.001	1712.131	1506.935	1.001
alpha[x1]	0.140	0.037	0.068	0.206	0.000	0.000	5437.078	3453.508	1.001
alpha[x2]	0.481	0.043	0.396	0.560	0.000	0.000	8357.205	5956.070	1.001
lam[x1]	0.799	0.202	0.467	1.201	0.003	0.002	4217.968	5296.788	1.001
lam[x2]	2.090	0.584	1.041	3.153	0.009	0.006	4337.649	3596.350	1.001
beta_channel[x1]	1.485	0.334	0.915	2.099	0.005	0.004	4258.979	5367.265	1.001
beta_channel[x2]	0.488	0.127	0.289	0.730	0.002	0.001	4267.566	5275.431	1.001
beta_fourier[0]	0.006	0.007	-0.004	0.017	0.000	0.000	3971.823	2878.822	1.001
beta_fourier[1]	0.028	0.007	0.016	0.038	0.000	0.000	4614.429	2660.869	1.001
beta_fourier[2]	-0.027	0.005	-0.036	-0.019	0.000	0.000	6689.125	5829.273	1.001
beta_fourier[3]	-0.001	0.003	-0.007	0.005	0.000	0.000	10266.507	4704.094	1.000
beta_fourier[4]	-0.004	0.003	-0.009	0.002	0.000	0.000	13133.854	5397.527	1.001
beta_fourier[5]	-0.004	0.003	-0.010	0.002	0.000	0.000	11668.364	5770.083	1.002
sigma	0.022	0.002	0.019	0.026	0.000	0.000	6113.067	5754.287	1.000
nu	12.444	2.408	7.858	16.776	0.027	0.019	7514.359	5171.298	1.000

axes = az.plot_trace(
    data=idata_2,
    var_names=var_names,
    compact=True,
    kind="rank_bars",
    backend_kwargs={"figsize": (15, 17), "layout": "constrained"},
)
plt.gcf().suptitle("Trace - Model 2", fontsize=16)

Overall, the trace looks fine. One big difference we immediately see is that the regression coefficient \(\beta_1\) of \(x_1\) increased significantly as compared to the previous model

Posterior Predictive Checks

We run the same analysis to visualize the model fit. As berore, we need to rescale the predictions back to the original scale.

pp_vars_original_scale_2 = {
    var_name: xr.apply_ufunc(
        target_scaler.inverse_transform,
        idata_2["posterior"][var_name].expand_dims(dim={"_": 1}, axis=-1),
        input_core_dims=[["date", "_"]],
        output_core_dims=[["date", "_"]],
        vectorize=True,
    ).squeeze(dim="_")
    for var_name in ["mu", "f_trend", "fourier_contribution"]
}

pp_vars_original_scale_2["channel_contributions"] = xr.apply_ufunc(
    target_scaler.inverse_transform,
    idata_2["posterior"]["channel_contributions"],
    input_core_dims=[["date", "channel"]],
    output_core_dims=[["date", "channel"]],
    vectorize=True,
)

pp_likelihood_original_scale_2 = xr.apply_ufunc(
    target_scaler.inverse_transform,
    posterior_predictive_2["posterior_predictive"]["y"].expand_dims(
        dim={"_": 1}, axis=-1
    ),
    input_core_dims=[["date", "_"]],
    output_core_dims=[["date", "_"]],
    vectorize=True,
).squeeze(dim="_")

fig, ax = plt.subplots()
sns.lineplot(data=model_df, x="date", y="y", color="black", label=target, ax=ax)
az.plot_hdi(
    x=date,
    y=pp_likelihood_original_scale_2,
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.1, "label": r"likelihood $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_2["mu"],
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.3, "label": r"$\mu$ $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper left")
ax.set_title(label="Posterior Predictive - Model 2", fontsize=18, fontweight="bold")

fig, ax = plt.subplots()
az.plot_ppc(
    data=posterior_predictive_2,
    num_pp_samples=1_000,
    observed_rug=True,
    random_seed=seed,
    ax=ax,
)
ax.set_title(label="Posterior Predictive - Model 2", fontsize=18, fontweight="bold")

The models seem to have a comparable in-sample fit.

Model Components

Now we inspect the model components.

fig, ax = plt.subplots()
sns.lineplot(data=model_df, x="date", y="y", color="black", label=target, ax=ax)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_2["channel_contributions"].sel(channel="x1"),
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.5, "label": r"$x1$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_2["channel_contributions"].sel(channel="x2"),
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.5, "label": r"$x2$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_2["f_trend"],
    hdi_prob=0.94,
    color="C2",
    fill_kwargs={"alpha": 0.3, "label": r"$f_\text{trend}$ $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_2["fourier_contribution"],
    hdi_prob=0.94,
    color="C3",
    fill_kwargs={"alpha": 0.3, "label": r"seasonality $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.1), ncol=3)
ax.set_title(label="Components Contributions - Model 2", fontsize=18, fontweight="bold")

Here we clearly see that the model overestimates the effect of \(x_1\) by a lot! 🫠

Let’s compare against the true values for the media contributions:

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)
sns.lineplot(
    x="date",
    y="x1_effect",
    data=data_df.assign(x1_effect=lambda x: amplitude * x["x1_effect"]),
    color="C0",
    label=r"$x1$ effect",
    ax=ax[0],
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_2["channel_contributions"].sel(channel="x1"),
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.3, "label": r"$x1$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax[0],
)
ax[0].legend(loc="upper right")

sns.lineplot(
    x="date",
    y="x2_effect",
    data=data_df.assign(x2_effect=lambda x: amplitude * x["x2_effect"]),
    color="C1",
    label=r"$x2$ effect",
    ax=ax[1],
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_2["channel_contributions"].sel(channel="x2"),
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.3, "label": r"$x2$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax[1],
)
ax[1].legend(loc="upper right")

fig.suptitle("Channel Contributions - Model 2", fontsize=18, fontweight="bold")

The overestimation of \(x_1\) is close to \(100\%\)! Note that the estimate of \(x_2\) is not that bad, but still underestimated.

Media Parameters

We now

alpha_posterior_2 = idata_2["posterior"]["alpha"]
lam_posterior_2 = idata_2["posterior"]["lam"] * channels_scaler.scale_
beta_channel_posterior_2 = idata_2["posterior"]["beta_channel"] * channels_scaler.scale_

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=[alpha_posterior_2], combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(
    x=alpha1, color="C0", linestyle="--", linewidth=2, label=r"$\alpha_1$ (true)"
)
ax.axvline(
    x=alpha2, color="C1", linestyle="--", linewidth=2, label=r"$\alpha_2$ (true)"
)
ax.legend(loc="upper left")
ax.set_title(
    label=r"Adstock Parameter $\alpha$ ($94\%$ HDI) - Model 2",
    fontsize=18,
    fontweight="bold",
)

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=lam_posterior_2, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(x=lam1, color="C0", linestyle="--", linewidth=2, label=r"$\lambda_1$ (true)")
ax.axvline(x=lam2, color="C1", linestyle="--", linewidth=2, label=r"$\lambda_2$ (true)")
ax.legend(loc="upper right")
ax.set_title(
    label=r"Saturation Parameter $\lambda$ ($94\%$ HDI) - Model 2",
    fontsize=18,
    fontweight="bold",
)

beta_channel_posterior_2 = (
    idata_2["posterior"]["beta_channel"]
    * (1 / channels_scaler.scale_)
    * (target_scaler.scale_)
    * (1 / amplitude)
)

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=beta_channel_posterior_2, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(x=beta1, color="C0", linestyle="--", linewidth=2, label=r"$\beta_1$ (true)")
ax.axvline(x=beta2, color="C1", linestyle="--", linewidth=2, label=r"$\beta_2$ (true)")
ax.legend(loc="upper right")
ax.set_title(
    label=r"Channel $\beta$ ($94\%$ HDI) - Model 1", fontsize=18, fontweight="bold"
)

The carryover parameter \(\alpha_1\) and the regression coefficient \(\beta_1\) for \(x_1\) are the ones that are most affected by the omitted variable bias. For \(x_2\) the model is able to recover the true values within the \(94\%\) HDI.

ROAS Estimation

We now compute the ROAS as before. We start with the global ROAS:

predictions_roas_data_all_2 = {"x1": {}, "x2": {}}

for channel in tqdm(["x1", "x2"]):
    with model_2:
        pm.set_data(
            new_data={
                "channels_scaled": channels_scaler.transform(
                    data_df[channels].assign(**{channel: 0})
                )
            }
        )
        predictions_roas_data_all_2[channel] = pm.sample_posterior_predictive(
            trace=idata_2, var_names=["y", "mu"], progressbar=False, random_seed=rng
        )

predictions_roas_data_scaled_diff_all_2 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_data_scaled_diff_all_2[channel] = pp_vars_original_scale_2[
        "mu"
    ] - xr.apply_ufunc(
        target_scaler.inverse_transform,
        predictions_roas_data_all_2[channel]["posterior_predictive"]["mu"].expand_dims(
            dim={"_": 1}, axis=-1
        ),
        input_core_dims=[["date", "_"]],
        output_core_dims=[["date", "_"]],
        vectorize=True,
    ).squeeze(
        dim="_"
    )

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)
az.plot_hdi(
    x=date,
    y=predictions_roas_data_scaled_diff_all_2["x1"],
    color="C0",
    smooth=False,
    fill_kwargs={"alpha": 0.3, "label": r"$E(y - y_{01})$ $94\%$ HDI"},
    ax=ax[0],
)
sns.lineplot(
    data=data_df.assign(x1_effect=lambda x: amplitude * x["x1_effect"]),
    x="date",
    y="x1_effect",
    color="C0",
    label=r"$x1$ effect",
    ax=ax[0],
)
az.plot_hdi(
    x=date,
    y=predictions_roas_data_scaled_diff_all_2["x2"],
    color="C1",
    smooth=False,
    fill_kwargs={"alpha": 0.3, "label": r"$E(y - y_{02})$ $94\%$ HDI"},
    ax=ax[1],
)
sns.lineplot(
    data=data_df.assign(x2_effect=lambda x: amplitude * x["x2_effect"]),
    x="date",
    y="x2_effect",
    color="C1",
    label=r"$x2$ effect",
    ax=ax[1],
)
ax[0].legend(loc="upper left")
ax[1].legend(loc="upper left")
ax[0].set(ylabel=None)
ax[1].set(ylabel=None)
fig.suptitle("Potential Outcomes Difference - Model 2", fontsize=18, fontweight="bold")

predictions_roas_all_2 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_all_2[channel] = (
        predictions_roas_data_scaled_diff_all_2[channel].sum(dim="date")
        / data_df[channel].sum()
    )

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)

az.plot_posterior(predictions_roas_all_2["x1"], ref_val=roas_all_1, ax=ax[0])
ax[0].set(title="x1")
az.plot_posterior(predictions_roas_all_2["x2"], ref_val=roas_all_2, ax=ax[1])
ax[1].set(title="x2", xlabel="ROAS")
fig.suptitle("Global ROAS - Model 2", fontsize=18, fontweight="bold")

The ROAS overestimation of \(x_1\) is massive! It is of the order of \(100\%\). For \(x_2\) the estimated global is less than the true value (but less dramatic than \(x_1\)).

We now compute the quarterly ROAS as before.

predictions_roas_data_2 = {"x1": {}, "x2": {}}

for channel in tqdm(["x1", "x2"]):
    for q in tqdm(roas_data[channel].keys()):
        with model_2:
            pm.set_data(
                new_data={
                    "channels_scaled": channels_scaler.transform(
                        roas_data[channel][q][channels]
                    )
                }
            )
            predictions_roas_data_2[channel][q] = pm.sample_posterior_predictive(
                trace=idata_2, var_names=["y", "mu"], progressbar=False, random_seed=rng
            )

predictions_roas_data_scaled_diff_2 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_data_scaled_diff_2[channel] = {
        q: pp_vars_original_scale_2["mu"]
        - xr.apply_ufunc(
            target_scaler.inverse_transform,
            idata_q["posterior_predictive"]["mu"].expand_dims(dim={"_": 1}, axis=-1),
            input_core_dims=[["date", "_"]],
            output_core_dims=[["date", "_"]],
            vectorize=True,
        ).squeeze(dim="_")
        for q, idata_q in tqdm(predictions_roas_data_2[channel].items())
    }

predictions_roas_2 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_2[channel] = {
        q: idata_q.sum(dim="date") / data_df.query("quarter == @q")[channel].sum()
        for q, idata_q in predictions_roas_data_scaled_diff_2[channel].items()
    }

axes = az.plot_posterior(
    data=xr.concat(
        predictions_roas_2["x1"].values(),
        dim=pd.Index(data=predictions_roas_2["x1"].keys(), name="quarter"),
    ),
    figsize=(18, 15),
    grid=(5, 2),
    ref_val={"mu": x1_ref_vals},
    backend_kwargs={"sharex": True, "layout": "constrained"},
)
fig = axes.ravel()[0].get_figure()
fig.suptitle(
    t="Posterior Predictive - ROAS x1 - Model 2", fontsize=18, fontweight="bold"
)

axes = az.plot_posterior(
    data=xr.concat(
        predictions_roas_2["x2"].values(),
        dim=pd.Index(data=predictions_roas_2["x2"].keys(), name="quarter"),
    ),
    figsize=(18, 15),
    grid=(5, 2),
    ref_val={"mu": x2_ref_vals},
    backend_kwargs={"sharex": True, "layout": "constrained"},
)
fig = axes.ravel()[0].get_figure()
fig.suptitle(
    t="Posterior Predictive - ROAS x2 - Model 2", fontsize=18, fontweight="bold"
)

The quarterly ROAS present the same pattern as the global ROAS. The model overestimates the ROAS for \(x_1\) heavily!

Model 3: ROAS Calibration

In this third model we use the parametrization presented in the paper “Media Mix Model Calibration With Bayesian Priors”, by Zhang, et al.. Here is the description from the paper:

Remark: In the paper they use a Hill saturation function instead of a logistic one. This is not important for the method. It works in both cases.

We show that by adding ROAS estimations (e.g. coming from previous experiments or domain knowledge) as priors we are able to mitigate a significant amount of the omitted variable bias.

We start with the estimation of the ROAS in the original scale. We set them so that they are close to the true values (i.e. assuming the experiment worked). In order to include them into the model we need to scale them (as always, the devil is in the details).

roas_x1_bar = 95
roas_x2_bar = 170

roas_x1_bar_scaled = roas_x1_bar / target_scaler.scale_.item()
roas_x2_bar_scaled = roas_x2_bar / target_scaler.scale_.item()

roas_bar_scaled = np.array([roas_x1_bar_scaled, roas_x2_bar_scaled])

For the prior distribution we choose a distribution which is centered at the (scaled) ROAS and has a standard deviation approximately equal to the uncertainty error of the estimation. We can use a log-normal distribution to ensure the ROAS values are positive.

error = 30
error_scaled = error / target_scaler.scale_.item()

fig, ax = plt.subplots(figsize=(10, 6))
pz.LogNormal(mu=np.log(roas_x1_bar_scaled), sigma=error_scaled).plot_pdf(
    color="C0", ax=ax
)
ax.axvline(x=roas_x1_bar_scaled, color="C0", linestyle="--", linewidth=2)
pz.LogNormal(mu=np.log(roas_x2_bar_scaled), sigma=error_scaled).plot_pdf(
    color="C1", ax=ax
)
ax.axvline(x=roas_x2_bar_scaled, color="C1", linestyle="--", linewidth=2)
ax.set(xlabel="ROAS (scaled)")
ax.set_title(label=r"Prior Distribution ROAS", fontsize=18, fontweight="bold")

Model Specification

In the following model we set priors on the ROAS instead of the regression coefficients.

eps = np.finfo(float).eps


with pm.Model(coords=coords) as model_3:
    # --- Data Containers ---

    index_scaled_data = pm.Data(
        name="index_scaled",
        value=index_scaled.to_numpy().flatten(),
        mutable=True,
        dims="date",
    )

    channels_scaled_data = pm.Data(
        name="channels_scaled",
        value=channels_scaled,
        mutable=True,
        dims=("date", "channel"),
    )

    channels_cost_data = pm.Data(
        name="channels_cost",
        value=data_df[channels].to_numpy(),
        mutable=True,
        dims=("date", "channel"),
    )

    fourier_features_data = pm.Data(
        name="fourier_features",
        value=fourier_features,
        mutable=True,
        dims=("date", "fourier_mode"),
    )

    y_scaled_data = pm.Data(
        name="y_scaled",
        value=target_scaled.to_numpy().flatten(),
        mutable=True,
        dims="date",
    )

    # --- Priors ---

    amplitude_trend = pm.HalfNormal(name="amplitude_trend", sigma=1)
    ls_trend_params = pm.find_constrained_prior(
        distribution=pm.InverseGamma,
        lower=0.1,
        upper=0.9,
        init_guess={"alpha": 3, "beta": 1},
        mass=0.95,
    )
    ls_trend = pm.InverseGamma(name="ls_trend", **ls_trend_params)

    alpha = pm.Beta(name="alpha", alpha=2, beta=3, dims="channel")
    lam = pm.Gamma(name="lam", alpha=2, beta=2, dims="channel")
    roas = pm.LogNormal(
        name="roas",
        mu=np.log(roas_bar_scaled),
        sigma=error_scaled,
        dims="channel",
    )
    beta_fourier = pm.Normal(name="beta_fourier", mu=0, sigma=1, dims="fourier_mode")

    sigma = pm.HalfNormal(name="sigma", sigma=1)
    nu = pm.Gamma(name="nu", alpha=25, beta=2)

    # --- Parametrization ---

    cov_trend = amplitude_trend * pm.gp.cov.ExpQuad(input_dim=1, ls=ls_trend)
    gp_trend = pm.gp.HSGP(m=[20], c=1.5, cov_func=cov_trend)
    f_trend = gp_trend.prior(name="f_trend", X=index_scaled_data[:, None], dims="date")

    channel_adstock = pm.Deterministic(
        name="channel_adstock",
        var=geometric_adstock(
            x=channels_scaled_data,
            alpha=alpha,
            l_max=l_max,
            normalize=True,
        ),
        dims=("date", "channel"),
    )
    channel_adstock_saturated = pm.Deterministic(
        name="channel_adstock_saturated",
        var=logistic_saturation(x=channel_adstock, lam=lam),
        dims=("date", "channel"),
    )
    beta_channel = pm.Deterministic(
        name="beta_channel",
        var=(channels_cost_data.sum(axis=0) * roas)
        / (channel_adstock_saturated.sum(axis=0) + eps),
        dims="channel",
    )
    channel_contributions = pm.Deterministic(
        name="channel_contributions",
        var=channel_adstock_saturated * beta_channel,
        dims=("date", "channel"),
    )

    fourier_contribution = pm.Deterministic(
        name="fourier_contribution",
        var=pt.dot(fourier_features_data, beta_fourier),
        dims="date",
    )

    mu = pm.Deterministic(
        name="mu",
        var=f_trend + channel_contributions.sum(axis=-1) + fourier_contribution,
        dims="date",
    )

    # --- Likelihood ---
    y = pm.StudentT(
        name="y",
        nu=nu,
        mu=mu,
        sigma=sigma,
        observed=y_scaled_data,
        dims="date",
    )

pm.model_to_graphviz(model=model_3)

Remark: The channels_cost_data has the cost of each channel without any scaling. Note that you could use impressions to model the contribution of each channel (channels_scaled_data). This method is agnostic to the type of data you use.

Prior Predictive Checks

Let’s see how this parametrization changes the prior predictive distribution:

with model_3:
    prior_predictive_3 = pm.sample_prior_predictive(samples=2_000, random_seed=rng)

fig, ax = plt.subplots()
az.plot_ppc(data=prior_predictive_3, group="prior", kind="kde", ax=ax)
ax.set_title(label="Prior Predictive - Model 3", fontsize=18, fontweight="bold")

It actually looks very similar to the first model.

Model Fitting

We now fit the model.

with model_3:
    idata_3 = pm.sample(
        target_accept=0.9,
        draws=2_000,
        chains=4,
        nuts_sampler="numpyro",
        random_seed=rng,
    )
    posterior_predictive_3 = pm.sample_posterior_predictive(
        trace=idata_3, random_seed=rng
    )

The sampling time is comparable with the models above.

Model Diagnostics

We now take a look into the results.

idata_3["sample_stats"]["diverging"].sum().item()

\(\displaystyle 0\)

var_names = [
    "amplitude_trend",
    "ls_trend",
    "alpha",
    "lam",
    "roas",
    "beta_channel",
    "beta_fourier",
    "sigma",
    "nu",
]

az.summary(data=idata_3, var_names=var_names, round_to=3)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
amplitude_trend	0.390	0.354	0.026	1.066	0.008	0.005	1577.531	2360.063	1.002
ls_trend	0.328	0.047	0.235	0.410	0.001	0.001	1510.604	1477.460	1.003
alpha[x1]	0.035	0.022	0.002	0.074	0.000	0.000	8148.597	4067.383	1.000
alpha[x2]	0.527	0.041	0.452	0.603	0.000	0.000	8590.509	6258.338	1.000
lam[x1]	0.288	0.160	0.020	0.569	0.002	0.001	7728.718	3689.815	1.001
lam[x2]	1.362	0.880	0.073	2.986	0.010	0.007	6400.545	4857.139	1.000
roas[x1]	0.217	0.012	0.194	0.239	0.000	0.000	7185.918	6334.511	1.000
roas[x2]	0.294	0.013	0.269	0.319	0.000	0.000	9086.750	5634.668	1.000
beta_channel[x1]	3.640	4.206	0.713	8.634	0.071	0.050	7509.496	3694.108	1.001
beta_channel[x2]	1.600	7.306	0.256	3.892	0.131	0.093	6502.806	4805.879	1.000
beta_fourier[0]	0.008	0.008	-0.009	0.020	0.000	0.000	2930.040	1823.914	1.002
beta_fourier[1]	0.042	0.008	0.029	0.055	0.000	0.000	3880.937	2236.211	1.000
beta_fourier[2]	-0.051	0.004	-0.059	-0.043	0.000	0.000	9493.599	6236.470	1.001
beta_fourier[3]	-0.002	0.004	-0.009	0.005	0.000	0.000	17322.683	5873.648	1.000
beta_fourier[4]	-0.003	0.004	-0.010	0.004	0.000	0.000	16458.519	5724.276	1.000
beta_fourier[5]	0.002	0.004	-0.005	0.009	0.000	0.000	14683.029	5765.216	1.001
sigma	0.027	0.002	0.023	0.031	0.000	0.000	7344.703	5838.703	1.000
nu	12.098	2.441	7.685	16.658	0.025	0.018	9119.262	4900.607	1.001

axes = az.plot_trace(
    data=idata_3,
    var_names=var_names,
    compact=True,
    backend_kwargs={"figsize": (15, 17), "layout": "constrained"},
)
plt.gcf().suptitle("Trace - Model 3", fontsize=16)

Everything looks in order. Note that the ROAS posterior distribution is very close to the prior.

Posterior Predictive Checks

We now evaluate the model fit.

pp_vars_original_scale_3 = {
    var_name: xr.apply_ufunc(
        target_scaler.inverse_transform,
        idata_3["posterior"][var_name].expand_dims(dim={"_": 1}, axis=-1),
        input_core_dims=[["date", "_"]],
        output_core_dims=[["date", "_"]],
        vectorize=True,
    ).squeeze(dim="_")
    for var_name in ["mu", "f_trend", "fourier_contribution"]
}

pp_vars_original_scale_3["channel_contributions"] = xr.apply_ufunc(
    target_scaler.inverse_transform,
    idata_3["posterior"]["channel_contributions"],
    input_core_dims=[["date", "channel"]],
    output_core_dims=[["date", "channel"]],
    vectorize=True,
)

pp_vars_original_scale_3["baseline"] = xr.apply_ufunc(
    target_scaler.inverse_transform,
    (
        idata_3["posterior"]["f_trend"] + idata_3["posterior"]["fourier_contribution"]
    ).expand_dims(dim={"_": 1}, axis=-1),
    input_core_dims=[["date", "_"]],
    output_core_dims=[["date", "_"]],
    vectorize=True,
).squeeze(dim="_")

pp_likelihood_original_scale_3 = xr.apply_ufunc(
    target_scaler.inverse_transform,
    posterior_predictive_3["posterior_predictive"]["y"].expand_dims(
        dim={"_": 1}, axis=-1
    ),
    input_core_dims=[["date", "_"]],
    output_core_dims=[["date", "_"]],
    vectorize=True,
).squeeze(dim="_")

fig, ax = plt.subplots()
sns.lineplot(data=model_df, x="date", y="y", color="black", label=target, ax=ax)
az.plot_hdi(
    x=date,
    y=pp_likelihood_original_scale_3,
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.1, "label": r"likelihood $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_3["mu"],
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.3, "label": r"$\mu$ $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper left")
ax.set_title(label="Posterior Predictive - Model 3", fontsize=18, fontweight="bold")

fig, ax = plt.subplots()
az.plot_ppc(
    data=posterior_predictive_3,
    num_pp_samples=1_000,
    observed_rug=True,
    random_seed=seed,
    ax=ax,
)
ax.set_title(label="Posterior Predictive - Model 3", fontsize=18, fontweight="bold")

The posterior predictive distribution looks worse than the previous models.

Model Components

We now plot the model components.

fig, ax = plt.subplots()
sns.lineplot(data=model_df, x="date", y="y", color="black", label=target, ax=ax)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_3["channel_contributions"].sel(channel="x1"),
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.5, "label": r"$x1$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_3["channel_contributions"].sel(channel="x2"),
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.5, "label": r"$x2$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_3["f_trend"],
    hdi_prob=0.94,
    color="C2",
    fill_kwargs={"alpha": 0.3, "label": r"$f_\text{trend}$ $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_3["fourier_contribution"],
    hdi_prob=0.94,
    color="C3",
    fill_kwargs={"alpha": 0.3, "label": r"seasonality $94\%$ HDI"},
    smooth=False,
    ax=ax,
)
ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.1), ncol=3)
ax.set_title(label="Components Contributions - Model 3", fontsize=18, fontweight="bold")

Observe that the \(94\%\) HDI for both channels is much narrower than the previous models. This is because we are constraining the values through the ROAS tight priors.

Let’s compare now the media contributions against the true values:

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)
sns.lineplot(
    x="date",
    y="x1_effect",
    data=data_df.assign(x1_effect=lambda x: amplitude * x["x1_effect"]),
    color="C0",
    label=r"$x1$ effect",
    ax=ax[0],
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_3["channel_contributions"].sel(channel="x1"),
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.3, "label": r"$x1$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax[0],
)
ax[0].legend(loc="upper right")

sns.lineplot(
    x="date",
    y="x2_effect",
    data=data_df.assign(x2_effect=lambda x: amplitude * x["x2_effect"]),
    color="C1",
    label=r"$x2$ effect",
    ax=ax[1],
)
az.plot_hdi(
    x=date,
    y=pp_vars_original_scale_3["channel_contributions"].sel(channel="x2"),
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.3, "label": r"$x2$ contribution $94\%$ HDI"},
    smooth=False,
    ax=ax[1],
)
ax[1].legend(loc="upper right")

fig.suptitle("Channel Contributions - Model 3", fontsize=18, fontweight="bold")

The result for both channels is much better than the previous models *remember we do not have \(z\) in the model).

Media Parameters

alpha_posterior_3 = idata_3["posterior"]["alpha"]
lam_posterior_3 = idata_3["posterior"]["lam"] * channels_scaler.scale_
beta_channel_posterior_3 = idata_3["posterior"]["beta_channel"] * channels_scaler.scale_

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=alpha_posterior_3, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(
    x=alpha1, color="C0", linestyle="--", linewidth=2, label=r"$\alpha_1$ (true)"
)
ax.axvline(
    x=alpha2, color="C1", linestyle="--", linewidth=2, label=r"$\alpha_2$ (true)"
)
ax.legend(loc="upper left")
ax.set_title(
    label=r"Adstock Parameter $\alpha$ ($94\%$ HDI) - Model 2",
    fontsize=18,
    fontweight="bold",
)

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=lam_posterior_3, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(x=lam1, color="C0", linestyle="--", linewidth=2, label=r"$\lambda_1$ (true)")
ax.axvline(x=lam2, color="C1", linestyle="--", linewidth=2, label=r"$\lambda_2$ (true)")
ax.legend(loc="upper right")
ax.set_title(
    label=r"Saturation Parameter $\lambda$ ($94\%$ HDI) - Model 2",
    fontsize=18,
    fontweight="bold",
)

beta_channel_posterior_3 = (
    idata_3["posterior"]["beta_channel"]
    * (1 / channels_scaler.scale_)
    * (target_scaler.scale_)
    * (1 / amplitude)
)

fig, ax = plt.subplots(figsize=(10, 6))
az.plot_forest(
    data=beta_channel_posterior_3, combined=True, hdi_prob=0.94, colors="black", ax=ax
)
ax.axvline(x=beta1, color="C0", linestyle="--", linewidth=2, label=r"$\beta_1$ (true)")
ax.axvline(x=beta2, color="C1", linestyle="--", linewidth=2, label=r"$\beta_2$ (true)")
ax.legend(loc="upper right")
ax.set_title(
    label=r"Channel $\beta$ ($94\%$ HDI) - Model 1", fontsize=18, fontweight="bold"
)

For \(x_2\) we recovered all the media parameters within the \(94\%\) HDI.
For \(x_1\) the model heavily underestimated the carryover parameter \(\alpha_1\) and on the other hand overestimated the regression coefficient \(\beta_1\). The saturation parameter \(\lambda_1\) was closer to the true value, but still outside of the \(94\%\) HDI.

ROAS Estimation

Finally, we compute the ROAS. We should expect them to be much closer to the true values than model 2 since we have explicit tight priors over them. Let’s start with the global ROAS:

predictions_roas_data_all_3 = {"x1": {}, "x2": {}}

for channel in tqdm(["x1", "x2"]):
    with model_3:
        pm.set_data(
            new_data={
                "channels_scaled": channels_scaler.transform(
                    data_df[channels].assign(**{channel: 0}),
                ),
                "channels_cost": data_df[channels].assign(**{channel: 0}),
            }
        )
        predictions_roas_data_all_3[channel] = pm.sample_posterior_predictive(
            trace=idata_3, var_names=["y", "mu"], progressbar=False, random_seed=rng
        )

Note we hat to replace the media model data (channels_scaled) and the media cost data (channels_cost). Here you can see that we could use impressions instead of cost to model the contribution of each channel.

predictions_roas_data_scaled_diff_all_3 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_data_scaled_diff_all_3[channel] = pp_vars_original_scale_3[
        "mu"
    ] - xr.apply_ufunc(
        target_scaler.inverse_transform,
        predictions_roas_data_all_3[channel]["posterior_predictive"]["mu"].expand_dims(
            dim={"_": 1}, axis=-1
        ),
        input_core_dims=[["date", "_"]],
        output_core_dims=[["date", "_"]],
        vectorize=True,
    ).squeeze(dim="_")

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)
az.plot_hdi(
    x=date,
    y=predictions_roas_data_scaled_diff_all_3["x1"],
    color="C0",
    smooth=False,
    fill_kwargs={"alpha": 0.3, "label": r"$E(y - y_{01})$ $94\%$ HDI"},
    ax=ax[0],
)
sns.lineplot(
    data=data_df.assign(x1_effect=lambda x: amplitude * x["x1_effect"]),
    x="date",
    y="x1_effect",
    color="C0",
    label=r"$x1$ effect",
    ax=ax[0],
)
az.plot_hdi(
    x=date,
    y=predictions_roas_data_scaled_diff_all_3["x2"],
    color="C1",
    smooth=False,
    fill_kwargs={"alpha": 0.3, "label": r"$E(y - y_{02})$ $94\%$ HDI"},
    ax=ax[1],
)
sns.lineplot(
    data=data_df.assign(x2_effect=lambda x: amplitude * x["x2_effect"]),
    x="date",
    y="x2_effect",
    color="C1",
    label=r"$x2$ effect",
    ax=ax[1],
)
ax[0].legend(loc="upper left")
ax[1].legend(loc="upper left")
ax[0].set(ylabel=None)
ax[1].set(ylabel=None)
fig.suptitle("Potential Outcomes Difference - Model 3", fontsize=18, fontweight="bold")

predictions_roas_all_3 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_all_3[channel] = (
        predictions_roas_data_scaled_diff_all_3[channel].sum(dim="date")
        / data_df[channel].sum()
    )

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 7), sharex=True, sharey=False, layout="constrained"
)

az.plot_posterior(predictions_roas_all_3["x1"], ref_val=roas_all_1, ax=ax[0])
ax[0].set(title="x1")
az.plot_posterior(predictions_roas_all_3["x2"], ref_val=roas_all_2, ax=ax[1])
ax[1].set(title="x2", xlabel="ROAS")
fig.suptitle("Global ROAS - Model 3", fontsize=18, fontweight="bold")

The results look as expected 🚀! The ROAS estimate for \(x_1\) is much closer to the true value than in the second model.

We wrap up the analysis by computing the quarterly ROAS.

predictions_roas_data_3 = {"x1": {}, "x2": {}}

for channel in tqdm(["x1", "x2"]):
    for q in tqdm(roas_data[channel].keys()):
        with model_3:
            pm.set_data(
                new_data={
                    "channels_scaled": channels_scaler.transform(
                        roas_data[channel][q][channels]
                    ),
                    "channels_cost": roas_data[channel][q][channels],
                }
            )
            predictions_roas_data_3[channel][q] = pm.sample_posterior_predictive(
                trace=idata_3, var_names=["y", "mu"], progressbar=False, random_seed=rng
            )

predictions_roas_data_scaled_diff_3 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_data_scaled_diff_3[channel] = {
        q: pp_vars_original_scale_3["mu"]
        - xr.apply_ufunc(
            target_scaler.inverse_transform,
            idata_q["posterior_predictive"]["mu"].expand_dims(dim={"_": 1}, axis=-1),
            input_core_dims=[["date", "_"]],
            output_core_dims=[["date", "_"]],
            vectorize=True,
        ).squeeze(dim="_")
        for q, idata_q in tqdm(predictions_roas_data_3[channel].items())
    }

predictions_roas_3 = {"x1": {}, "x2": {}}

for channel in ["x1", "x2"]:
    predictions_roas_3[channel] = {
        q: idata_q.sum(dim="date") / data_df.query("quarter == @q")[channel].sum()
        for q, idata_q in predictions_roas_data_scaled_diff_3[channel].items()
    }

axes = az.plot_posterior(
    data=xr.concat(
        predictions_roas_3["x1"].values(),
        dim=pd.Index(data=predictions_roas_3["x1"].keys(), name="quarter"),
    ),
    figsize=(18, 15),
    grid=(5, 2),
    ref_val={"mu": x1_ref_vals},
    backend_kwargs={"sharex": True, "layout": "constrained"},
)
fig = axes.ravel()[0].get_figure()
fig.suptitle(
    t="Posterior Predictive - ROAS x1 - Model 3", fontsize=18, fontweight="bold"
)

axes = az.plot_posterior(
    data=xr.concat(
        predictions_roas_3["x2"].values(),
        dim=pd.Index(data=predictions_roas_3["x2"].keys(), name="quarter"),
    ),
    figsize=(18, 15),
    grid=(5, 2),
    ref_val={"mu": x2_ref_vals},
    backend_kwargs={"sharex": True, "layout": "constrained"},
)
fig = axes.ravel()[0].get_figure()
fig.suptitle(
    t="Posterior Predictive - ROAS x2 - Model 3", fontsize=18, fontweight="bold"
)

The results at quarterly level are consistent with the global ROAS.

Model Comparison

In this final section we compare the posterior distribution of the channel ROAS for the three models.

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(12, 9), sharex=True, sharey=False, layout="constrained"
)

az.plot_forest(
    data=[
        predictions_roas_all_1["x1"].rename("roas"),
        predictions_roas_all_2["x1"].rename("roas"),
        predictions_roas_all_3["x1"].rename("roas"),
    ],
    model_names=[
        "Model 1 (causal)",
        "Model 2 (unobserved confounder)",
        "Model 3 (roas prior)",
    ],
    combined=True,
    ax=ax[0],
)
ax[0].get_legend().update({"loc": "upper right"})
ax[0].axvline(x=roas_all_1, color="black", linestyle="--", linewidth=2)
ax[0].set(title="x1")

az.plot_forest(
    data=[
        predictions_roas_all_1["x2"].rename("roas"),
        predictions_roas_all_2["x2"].rename("roas"),
        predictions_roas_all_3["x2"].rename("roas"),
    ],
    model_names=[
        "Model 1 (causal)",
        "Model 2 (unobserved confounder)",
        "Model 3 (roas prior)",
    ],
    combined=True,
    ax=ax[1],
)
ax[1].get_legend().update({"loc": "upper left"})
ax[1].axvline(x=roas_all_2, color="black", linestyle="--", linewidth=2)
ax[1].set(title="x2")

fig.suptitle("Global ROAS - Model Comparison", fontsize=18, fontweight="bold")

Lastly, we compare the estimated quarterly ROAS for \(x_1\):

fig, ax = plt.subplots()
az.plot_forest(
    data=[
        xr.concat(
            predictions_roas_1["x1"].values(),
            dim=pd.Index(data=predictions_roas_1["x2"].keys(), name="quarter"),
        ).rename("roas"),
        xr.concat(
            predictions_roas_2["x1"].values(),
            dim=pd.Index(data=predictions_roas_2["x2"].keys(), name="quarter"),
        ).rename("roas"),
        xr.concat(
            predictions_roas_3["x1"].values(),
            dim=pd.Index(data=predictions_roas_2["x2"].keys(), name="quarter"),
        ).rename("roas"),
    ],
    model_names=[
        "Model 1 (causal)",
        "Model 2 (unobserved confounder)",
        "Model 3 (roas prior)",
    ],
    combined=True,
    ax=ax,
)
ax.axvline(x=roas_all_1, color="black", linestyle="--", linewidth=2)
ax.get_legend().update({"loc": "center left", "bbox_to_anchor": (1, 0.5)})
ax.set_title(
    label=r"Posterior Predictive - $x_1$ ROAS - Model Comparison",
    fontsize=18,
    fontweight="bold",
)

Conclusion

With this simulated case study we have seen the benefit of using ROAS priors instead of regression coefficients. We have seen that the model with the ROAS priors was able to partially recover the true values of the ROAS in the presence of unobserved confounders. We also see that the ROAS are not the same as the specified prior since it balances out with the evidence from the media mix regression model. Hence, this approach serves as a more holistic ROAS estimation for real world scenarios.