In this notebook I explore the glm module of PyMC3. I am particularly interested in the model definition using patsy formulas, as it makes the model evaluation loop faster (it is easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample; many applications require predictions on out-of-sample data. This experiment was motivated by the discussion in the thread “Out of sample” predictions with the GLM sub-module on the (great!) forum discourse.pymc.io. Thank you all for your input!
📢 WARNING:
Since pymc v4 the GLM module has been deprecated in favor of Bambi, a more complete implementation of the GLM sub-module which also allows for mixed-effects models. If you are coming from pymc, you can take a look into a tutorial I wrote (together with Tomás Capretto) about how to port a hierarchical Bayesian model from pymc to bambi here. That being said, besides the syntax in the model specification, the rest of the concepts and code of this post remain valid! 😉 Moreover, you can find an updated version of this notebook (thanks Benjamin T. Vincent!) here, as part of the PyMC example gallery.
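For reference, the model fitted in this post could be written in Bambi along these lines (a minimal sketch; df and df_test stand for hypothetical train and test dataframes):

import bambi as bmb

# Formula-based specification; Bambi builds the design matrix internally.
model = bmb.Model("y ~ x1 * x2", df, family="bernoulli")
# Fit the model (returns an arviz InferenceData object).
idata = model.fit()
# Out-of-sample predictions on new data.
model.predict(idata, data=df_test)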
Resources
- PyMC3 documentation and examples; in particular check GLM: Logistic Regression
Acknowledgement: I would like to thank the pymc-devs team for their support and valuable input refining the initial version of this post.
Prepare Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style(
style="darkgrid",
rc={"axes.facecolor": ".9", "grid.color": ".8"}
)
sns.set_palette(palette="deep")
sns_c = sns.color_palette(palette="deep")
import arviz as az
import patsy
import pymc3 as pm
from pymc3 import glm
plt.rcParams["figure.figsize"] = [7, 6]
plt.rcParams["figure.dpi"] = 100
Generate Sample Data
We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
SEED = 42
np.random.seed(SEED)
# Number of data points.
n = 250
# Create features.
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
# Define target variable.
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = 1 / (1 + np.exp(-z))
y = np.random.binomial(n=1, p=p, size=n)
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))
df.head()
|   | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.993428 | -2.521768 | 0 |
| 1 | -0.276529 | 1.835724 | 0 |
| 2 | 1.295377 | 4.244312 | 1 |
| 3 | 3.046060 | 2.064931 | 1 |
| 4 | -0.468307 | -3.038740 | 1 |
Let us do some exploration of the data:
sns.pairplot(
data=df,
kind="scatter",
height=2,
plot_kws={"color": sns_c[1]},
diag_kws={"color": sns_c[2]}
);
- \(x_1\) and \(x_2\) are not correlated.
- \(x_1\) and \(x_2\) do not seem to separate the \(y\)-classes independently.
- The distribution of \(y\) is not highly unbalanced.
fig, ax = plt.subplots()
sns_c_div = sns.diverging_palette(240, 10, n=2)
sns.scatterplot(
    x="x1",
    y="x2",
    data=df,
    hue="y",
    palette=[sns_c_div[0], sns_c_div[-1]],
    ax=ax,
)
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(
title="Sample Data",
xlim=(-9, 9),
ylim=(-9, 9),
xlabel="x1",
ylabel="x2"
);
Prepare Data for Modeling
I wanted to use the classmethod from_formula (see documentation), but I was not able to generate out-of-sample predictions with this approach (if you find a way, please let me know!). As a workaround, I create the features from a formula using patsy directly and then use the class pymc3.glm.linear.GLM (this was motivated by looking into the source code).
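For reference, the from_formula approach would look as follows; it fits fine in-sample, but it builds the design matrix internally from the dataframe, so there is no shared data container we could later swap for out-of-sample prediction (a sketch):

# For reference only: the from_formula approach (in-sample only).
with pm.Model() as model_formula:
    glm.GLM.from_formula(
        formula="y ~ x1 * x2",
        data=df,
        family=glm.families.Binomial()
    )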
# Define model formula.
formula = "y ~ x1 * x2"
# Create features.
y, x = patsy.dmatrices(formula_like=formula, data=df)
y = np.asarray(y).flatten()
labels = x.design_info.column_names
x = np.asarray(x)
As pointed out in the thread (thank you @Nicky!), we need to keep the labels of the features in the design matrix.
print(f"labels = {labels}")
labels = ['Intercept', 'x1', 'x2', 'x1:x2']
Now we do a train-test split.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
x, y, train_size=0.7, random_state=SEED
)
Define the Model
We now specify the model in PyMC3.
with pm.Model() as model:
# Set data container.
data = pm.Data("data", x_train)
# Define GLM family.
family = pm.glm.families.Binomial()
# Set priors.
priors = {
"Intercept": pm.Normal.dist(mu=0, sd=10),
"x1": pm.Normal.dist(mu=0, sd=10),
"x2": pm.Normal.dist(mu=0, sd=10),
"x1:x2": pm.Normal.dist(mu=0, sd=10),
}
# Specify model.
glm.GLM(
y=y_train,
x=data,
family=family,
intercept=False,
labels=labels,
priors=priors
)
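For the curious reader, the GLM class above builds something roughly equivalent to the following hand-written model (a sketch using the same priors and the same data container):

with pm.Model() as model_explicit:
    # Shared data container (swappable via pm.set_data).
    data = pm.Data("data", x_train)
    # One normal prior per design-matrix column (intercept included).
    beta = pm.Normal("beta", mu=0, sd=10, shape=len(labels))
    # Linear predictor and Bernoulli likelihood with a logistic link.
    mu = pm.math.dot(data, beta)
    pm.Bernoulli("y", p=pm.math.sigmoid(mu), observed=y_train)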
Prior Checks
Before fitting the model we run some prior predictive checks on the training data.
# Sample from prior distribution.
with model:
prior_checks = pm.sample_prior_predictive(samples=100, random_seed=SEED)
We get samples for all our variables and the output:
prior_checks.keys()
dict_keys(['y', 'Intercept', 'x2', 'x1:x2', 'x1'])
Let us take a look into the samples. For the target variable y we have:
prior_checks["y"]
array([[0, 1, 0, ..., 1, 0, 0],
[0, 1, 0, ..., 1, 0, 0],
[1, 1, 0, ..., 0, 0, 0],
...,
[1, 0, 0, ..., 0, 1, 0],
[0, 1, 0, ..., 0, 0, 0],
[1, 1, 0, ..., 0, 0, 0]])
So we can summarize the model predictions (before seeing the data) by taking the mean over the samples. Let us store this in a dataframe in order to generate some plots.
train_prior_df = pd.DataFrame(
data={
"x1_train": x_train[:, 1],
"x2_train": x_train[:, 2],
"y_train": y_train,
"y_train_prior_mean": prior_checks["y"].mean(axis=0),
},
)
train_prior_df.sort_values("y_train", inplace=True)
First we plot the distribution of the means:
# Plot means distribution.
fig, ax = plt.subplots()
sns.kdeplot(
x="y_train_prior_mean",
data=train_prior_df,
hue="y_train",
palette=[sns_c[0], sns_c[3]],
fill=True,
alpha=0.1,
ax=ax,
)
ax.axvline(x=0.5, color="gray", linestyle="--")
ax.set(title="Prior Means Distribution (train)", xlim=(0, 1));
We clearly see that the model cannot distinguish between the two classes yet. This makes sense, as we have non-informative priors for this synthetic data set. We can also confirm this if we plot each point separately:
fig, ax = plt.subplots()
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)
sns.scatterplot(
x="x1_train",
y="x2_train",
data=train_prior_df,
hue="y_train_prior_mean",
hue_norm=(0, 1),
palette=cmap,
edgecolor="black",
style="y_train",
ax=ax,
)
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(
title="Prior Means (train)",
xlim=(-9, 9),
ylim=(-9, 9),
xlabel="x1",
ylabel="x2"
);
Fit Model
with model:
# Configure sampler.
trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED)
# Plot chains.
az.plot_trace(data=trace);
az.summary(trace)
|   | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | -0.536 | 0.262 | -1.043 | -0.060 | 0.002 | 0.002 | 14479.0 | 16189.0 | 1.0 |
| x1 | 0.616 | 0.211 | 0.223 | 1.013 | 0.002 | 0.001 | 14573.0 | 14635.0 | 1.0 |
| x2 | -0.865 | 0.221 | -1.285 | -0.466 | 0.002 | 0.001 | 12761.0 | 14415.0 | 1.0 |
| x1:x2 | 1.790 | 0.335 | 1.162 | 2.403 | 0.003 | 0.002 | 12686.0 | 14676.0 | 1.0 |
The chains look good.
Generate Out-Of-Sample Predictions
Now we generate predictions on the test set.
# Update data reference.
pm.set_data({"data": x_test}, model=model)
# Generate posterior samples.
ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000)
# Compute the point prediction by taking the mean
# and defining the category via a threshold.
p_test_pred = ppc_test["y"].mean(axis=0)
y_test_pred = (p_test_pred >= 0.5).astype("int")
Evaluate Model
First let us compute the accuracy on the test set.
from sklearn.metrics import accuracy_score
print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")
accuracy = 0.787
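Because we have a full set of posterior predictive samples rather than a single point prediction, we can also look at the distribution of the accuracy, one value per posterior predictive draw (a small sketch):

# Accuracy of each simulated test set against the observed labels.
acc_samples = (ppc_test["y"] == y_test).mean(axis=1)
print(f"accuracy quantiles = {np.quantile(a=acc_samples, q=[0.025, 0.5, 0.975])}")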
Next, we plot the ROC curve and compute the AUC.
from sklearn.metrics import RocCurveDisplay, auc, roc_curve
fpr, tpr, thresholds = roc_curve(
y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
)
roc_auc = auc(fpr, tpr)
fig, ax = plt.subplots()
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
ax.set(title="ROC");
The model is performing as expected (we of course know the data generating process, which is almost never the case in practical applications).
Model Decision Boundary
Finally, we describe and plot the model decision boundary, which is the set defined as
\[\mathcal{B} = \{(x_1, x_2) \in \mathbb{R}^2 \: | \: p(x_1, x_2) = 0.5\}\]
where \(p\) denotes the probability of belonging to the class \(y=1\) output by the model. To make this set explicit, we simply write the condition in terms of the model parametrization:
\[0.5 = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2))}\]
which implies
\[0 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2\]
Solving for \(x_2\) we get the formula
\[x_2 = - \frac{\beta_0 + \beta_1 x_1}{\beta_2 + \beta_{12}x_1}\]
Observe that this curve is a hyperbola with a singularity (vertical asymptote) at \(x_1 = - \beta_2 / \beta_{12}\).
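As a sanity check, we can evaluate this formula for the true parameters of the data generating process, for which the singularity sits at \(x_1 = -\beta_2/\beta_{12} = 1/2\):

# Decision boundary implied by the true parameters (sanity check).
x1_bd = np.linspace(start=-9, stop=9, num=300)
x2_bd_true = -(intercept + beta_x1 * x1_bd) / (beta_x2 + beta_interaction * x1_bd)
# Vertical asymptote at x1 = -beta_x2 / beta_interaction = 0.5.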
Let us now plot the model decision boundary using a grid:
# Construct grid.
x1_grid = np.linspace(start=-9, stop=9, num=300)
x2_grid = x1_grid
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
# Create features on the grid.
x_grid_ext = patsy.dmatrix(
formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1])
)
x_grid_ext = np.asarray(x_grid_ext)
# Generate model predictions on the grid.
pm.set_data({"data": x_grid_ext}, model=model)
ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000)
Now we compute the model decision boundary on the grid for visualization purposes.
numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid)
denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid
bd_grid = numerator / denominator
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
grid_df["p"] = ppc_grid["y"].mean(axis=0)
grid_df.sort_values("p", inplace=True)
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
We finally get the plot and the predictions on the test set:
fig, ax = plt.subplots()
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)
sns.scatterplot(
x=x_test[:, 1].flatten(),
y=x_test[:, 2].flatten(),
hue=y_test,
palette=[sns_c_div[0], sns_c_div[-1]],
ax=ax,
)
sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax)
ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3)
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.lines[0].set_linestyle("dotted")
ax.set(
title="Model Decision Boundary",
xlim=(-9, 9),
ylim=(-9, 9),
xlabel="x1",
ylabel="x2"
);
Remark: Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and AUC). One way of doing this is by storing and computing it inside the model definition as a Deterministic variable, as in Bayesian Analysis with Python (Second Edition), Chapter 4.
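A minimal sketch of this idea, writing the model by hand so the class probability is stored as a Deterministic whose posterior samples we can use directly:

with pm.Model() as model_det:
    data = pm.Data("data", x_train)
    beta = pm.Normal("beta", mu=0, sd=10, shape=len(labels))
    # Store the class probability so its full posterior is available,
    # e.g. to draw credible bands around the decision boundary.
    p = pm.Deterministic("p", pm.math.sigmoid(pm.math.dot(data, beta)))
    pm.Bernoulli("y", p=p, observed=y_train)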