GLM in PyMC3: Out-Of-Sample Predictions

In this notebook I explore the glm module of PyMC3. I am particularly interested in the model definition using patsy formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread “Out of sample” predictions with the GLM sub-module on the (great!) forum discourse.pymc.io/, thank you all for your input!

📢 WARNING:

Since pymc v4 the GLM module has been depreciated in favor of Bambi, a more complete implementation of the GLM sub-module which also allows for mixed-effects models. If you are coming from pymc you can take a look into a tutorial I wrote (together with Tomás Capretto) about how to port a hierarchical bayesian model from pymc to bambi here. That being said, besides the syntax in the model specification, the rest of the concepts and code of this post remain valid! 😉. Moreover, you can find an updated version of this notebook (thanks Benjamin T. Vincent!) here, as part of the PyMC example gallery.

Resources

PyMC3 Docs: Example Notebooks
- In particular check GLM: Logistic Regression
Bayesian Analysis with Python (Second edition) - Chapter 4
Statistical Rethinking

Acknowledgement: I would like to thank the pymc-devs team for their support and valuable input refining the initial version of this post.

Prepare Notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style(
    style="darkgrid", 
    rc={"axes.facecolor": ".9", "grid.color": ".8"}
)
sns.set_palette(palette="deep")
sns_c = sns.color_palette(palette="deep")

import arviz as az
import patsy
import pymc3 as pm
from pymc3 import glm

plt.rcParams["figure.figsize"] = [7, 6]
plt.rcParams["figure.dpi"] = 100

Generate Sample Data

We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.

SEED = 42
np.random.seed(SEED)

# Number of data points.
n = 250
# Create features.
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
# Define target variable.
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = 1 / (1 + np.exp(-z))
y = np.random.binomial(n=1, p=p, size=n)

df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))

df.head()

	x1	x2	y
0	0.993428	-2.521768	0
1	-0.276529	1.835724	0
2	1.295377	4.244312	1
3	3.046060	2.064931	1
4	-0.468307	-3.038740	1

Let us do some exploration of the data:

sns.pairplot(
    data=df,
    kind="scatter",
    height=2,
    plot_kws={"color": sns_c[1]},
    diag_kws={"color": sns_c[2]}
);

svg

\(x_1\) and \(x_2\) are not correlated.
\(x_1\) and \(x_2\) do not seem to separate the \(y\)-classes independently.
The distribution of \(y\) is not highly unbalanced.

fig, ax = plt.subplots()
sns_c_div = sns.diverging_palette(240, 10, n=2)
sns.scatterplot(
  x="x1",
  y="x2",
  data=df,
  hue="y",
  palette=[sns_c_div[0], sns_c_div[-1]]
)
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(
  title="Sample Data",
  xlim=(-9, 9),
  ylim=(-9, 9),
  xlabel="x1",
  ylabel="x2"
);

svg

Prepare Data for Modeling

I wanted to use the classmethod from_formula (see documentation), but I was not able to generate out-of-sample predictions with this approach (if you find a way please let me know!). As a workaround, I created the features from a formula using patsy directly and then use class pymc3.glm.linear.GLM (this was motivated by going into the source code).

# Define model formula.
formula = "y ~ x1 * x2"
# Create features.
y, x = patsy.dmatrices(formula_like=formula, data=df)
y = np.asarray(y).flatten()
labels = x.design_info.column_names
x = np.asarray(x)

As pointed out on the thread (thank you @Nicky!), we need to keep the lables of the features in the design matrix.

print(f"labels = {labels}")

labels = ['Intercept', 'x1', 'x2', 'x1:x2']

Now we do a train-test split.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
  x, y, train_size=0.7, random_state=SEED
)

Define the Model

We now specify the model in PyMC3.

with pm.Model() as model:
    # Set data container.
    data = pm.Data("data", x_train)
    # Define GLM family.
    family = pm.glm.families.Binomial()
    # Set priors.
    priors = {
        "Intercept": pm.Normal.dist(mu=0, sd=10),
        "x1": pm.Normal.dist(mu=0, sd=10),
        "x2": pm.Normal.dist(mu=0, sd=10),
        "x1:x2": pm.Normal.dist(mu=0, sd=10),
    }
    # Specify model.
    glm.GLM(
      y=y_train,
      x=data,
      family=family,
      intercept=False,
      labels=labels,
      priors=priors
    )

Prior Checks

Before fitting the model we run some prior predictive checks on the training data.

# Sample from prior distribution.
with model:
    prior_checks = pm.sample_prior_predictive(samples=100, random_seed=SEED)

We get samples for all our variables and the output:

prior_checks.keys()

dict_keys(['y', 'Intercept', 'x2', 'x1:x2', 'x1'])

Let us take a look into the samples. For the target variable y we have:

prior_checks["y"]

array([[0, 1, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 1, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])

So we can get summarize the model predictions (before seeing the data) by taking the mean over the samples. Let us store this in a dataframe to in order to generate some plots.

train_prior_df = pd.DataFrame(
    data={
        "x1_train": x_train[:, 1],
        "x2_train": x_train[:, 2],
        "y_train": y_train,
        "y_train_prior_mean": prior_checks["y"].mean(axis=0),
    },
)

train_prior_df.sort_values("y_train", inplace=True)

First we plot the distribution of the means:

# Plot means distribution.
fig, ax = plt.subplots()
sns.kdeplot(
    x="y_train_prior_mean",
    data=train_prior_df,
    hue="y_train",
    palette=[sns_c[0], sns_c[3]],
    fill=True,
    alpha=0.1,
    ax=ax,
)
ax.axvline(x=0.5, color="gray", linestyle="--")
ax.set(title="Prior Means Distribution (train)", xlim=(0, 1));

svg

We clearly see that the model can not distinguish between the two classes yet. This makes sense as we have non-informative priors for this synthetic data set. We can also confirm this if we plot each point separately:

fig, ax = plt.subplots()
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)
sns.scatterplot(
    x="x1_train",
    y="x2_train",
    data=train_prior_df,
    hue="y_train_prior_mean",
    hue_norm=(0, 1),
    palette=cmap,
    edgecolor="black",
    style="y_train",
    ax=ax,
)
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(
  title="Prior Means (train)",
  xlim=(-9, 9),
  ylim=(-9, 9),
  xlabel="x1",
  ylabel="x2"
);

svg

Fit Model

with model:
    # Configure sampler.
    trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED)

# Plot chains.
az.plot_trace(data=trace);

az.summary(trace)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
Intercept	-0.536	0.262	-1.043	-0.060	0.002	0.002	14479.0	16189.0	1.0
x1	0.616	0.211	0.223	1.013	0.002	0.001	14573.0	14635.0	1.0
x2	-0.865	0.221	-1.285	-0.466	0.002	0.001	12761.0	14415.0	1.0
x1:x2	1.790	0.335	1.162	2.403	0.003	0.002	12686.0	14676.0	1.0

The chains look good.

Generate Out-Of-Sample Predictions

Now we generate predictions on the test set.

# Update data reference.
pm.set_data({"data": x_test}, model=model)
# Generate posterior samples.
ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000)

# Compute the point prediction by taking the mean
# and defining the category via a threshold.
p_test_pred = ppc_test["y"].mean(axis=0)
y_test_pred = (p_test_pred >= 0.5).astype("int")

Evaluate Model

First let us compute the accuracy on the test set.

from sklearn.metrics import accuracy_score

print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")

accuracy =  0.787

Next, we plot the roc curve and compute the auc.

from sklearn.metrics import RocCurveDisplay, auc, roc_curve

fpr, tpr, thresholds = roc_curve(
    y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
ax.set(title="ROC");

svg

The model is performing as expected (we of course know the data generating process, which is almost never the case in practical applications).

Model Decision Boundary

Finally we will describe and plot the model decision boundary, which is the space defined as

\[\mathcal{B} = \{(x_1, x_2) \in \mathbb{R}^2 \: | \: p(x_1, x_2) = 0.5\}\]

where \(p\) denotes the probability of belonging to the class \(y=1\) output by the model. To make this set explicit, we simply write the condition in terms of the model parametrization:

\[0.5 = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2))}\]

which implies

\[0 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2\]

Solving for \(x_2\) we get the formula

\[x_2 = - \frac{\beta_0 + \beta_1 x_1}{\beta_2 + \beta_{12}x_1}\]

Observe that this curve is a hyperbola centered at the singularity point \(x_1 = - \beta_2 / \beta_{12}\).

Let us now plot the model decision boundary using a grid:

# Construct grid.
x1_grid = np.linspace(start=-9, stop=9, num=300)
x2_grid = x1_grid

x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)

x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)

# Create features on the grid.
x_grid_ext = patsy.dmatrix(
  formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1])
)

x_grid_ext = np.asarray(x_grid_ext)

# Generate model predictions on the grid.
pm.set_data({"data": x_grid_ext}, model=model)
ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000)

Now we compute the model decision boundary on the grid for visualization purposes.

numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid)
denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid
bd_grid = numerator / denominator

grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
grid_df["p"] = ppc_grid["y"].mean(axis=0)
grid_df.sort_values("p", inplace=True)

p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()

We finally get the plot and the predictions on the test set:

fig, ax = plt.subplots()
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)
sns.scatterplot(
    x=x_test[:, 1].flatten(),
    y=x_test[:, 2].flatten(),
    hue=y_test,
    palette=[sns_c_div[0], sns_c_div[-1]],
    ax=ax,
)
sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax)
ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3)
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.lines[0].set_linestyle("dotted")
ax.set(
  title="Model Decision Boundary",
  xlim=(-9, 9),
  ylim=(-9, 9),
  xlabel="x1",
  ylabel="x2"
);

svg

Remark: Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and auc). One way of doing this is by storing and computing it inside the model definition as a Deterministic variable as in Bayesian Analysis with Python (Second edition) - Chapter 4.