In this notebook I want to describe how to create features inside scikit-learn pipelines using patsy-like formulas. I have used this approach to generate features in a previous post: GLM in PyMC3: Out-Of-Sample Predictions, so I will consider the same data set here for the sake of comparison.
Remark: Very recently (2021-09-01) I discovered there is an implementation of this transformer in scikit-lego, see PatsyTransformer. In addition, please refer to the great tutorial on patsy in calmcode.io.
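If you prefer the scikit-lego route, a rough usage sketch might look as follows (based on the scikit-lego documentation; not run in this notebook):

from sklego.preprocessing import PatsyTransformer

# The transformer takes the same patsy formula syntax we use below.
PatsyTransformer('x1 * x2 - 1')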
Prepare Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import patsy
import seaborn as sns
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.inspection import plot_partial_dependence
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay, accuracy_score, auc, plot_confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
sns.set_style(
    style='darkgrid',
    rc={'axes.facecolor': '0.9', 'grid.color': '0.8'}
)
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
plt.rcParams['figure.figsize'] = [7, 6]
plt.rcParams['figure.dpi'] = 100
Generate Sample Data
We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
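Concretely, the data-generating process implemented in the code below is

\[
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2, \qquad p = \frac{1}{1 + \exp(-z)}, \qquad y \sim \text{Bernoulli}(p),
\]

with \(\beta_0 = -0.5\), \(\beta_1 = 1\), \(\beta_2 = -1\) and \(\beta_{12} = 2\).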
SEED = 42
np.random.seed(SEED)
# Number of data points.
n = 250
# Create features.
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
# Define target variable.
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = 1 / (1 + np.exp(-z))
y = np.random.binomial(n=1, p=p, size=n)
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))
df.head()
| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.993428 | -2.521768 | 0 |
| 1 | -0.276529 | 1.835724 | 0 |
| 2 | 1.295377 | 4.244312 | 1 |
| 3 | 3.046060 | 2.064931 | 1 |
| 4 | -0.468307 | -3.038740 | 1 |
This data set is relatively well balanced:
df['y'].value_counts() / df['y'].shape[0]
0 0.54
1 0.46
Name: y, dtype: float64
Let us now split the data into the features and the target variable.
numeric_features = ['x1', 'x2']
x = df[numeric_features]
y = df['y']
Next, we do a random train-test split.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=SEED
)
Model Pipeline
Implement FormulaTransformer
We can follow scikit-learn’s guide to implement custom transformations inside a pipeline. The following transformer creates features from a formula as described here. Please note that by default it will add an Intercept term (i.e. a column of ones). If we do not want this feature, we can simply add a - 1 inside the formula.
class FormulaTransformer(BaseEstimator, TransformerMixin):
    """Generate model features from a patsy formula."""

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        # There is nothing to learn from the data: the formula fully defines the transformation.
        return self

    def transform(self, X):
        # Build the patsy design matrix and return it as a data frame with named columns.
        X_formula = patsy.dmatrix(formula_like=self.formula, data=X)
        columns = X_formula.design_info.column_names
        return pd.DataFrame(X_formula, columns=columns)
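To make the Intercept and - 1 behavior concrete, we can inspect the column names patsy generates for the formulas used in this notebook (the expected output is indicated as comments):

# The default formula adds an 'Intercept' column of ones.
patsy.dmatrix('x1 + x2', data=df).design_info.column_names
# ['Intercept', 'x1', 'x2']

# 'x1 * x2' expands into both main effects plus the interaction term, and '- 1' drops the intercept.
patsy.dmatrix('x1 * x2 - 1', data=df).design_info.column_names
# ['x1', 'x2', 'x1:x2']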
Define and Fit the Model Pipeline
Now we define and fit our model (see the Column Transformer with Mixed Types guide). We use an \(\ell_2\)-regularized logistic regression on a multiplicative interaction of the features. Note that we are not including the intercept in the formula but rather in the estimator itself.
numeric_transformer = Pipeline(steps=[
    ('formula_transformer', FormulaTransformer(formula='x1 * x2 - 1')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features)
])

estimator = LogisticRegression(
    fit_intercept=True,
    penalty='l2',
    class_weight='balanced'
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('logistic_regression', estimator)
])

param_grid = {
    'logistic_regression__C': np.logspace(start=-2, stop=2, num=20)
}

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=10
)
grid_search = grid_search.fit(X=x_train, y=y_train)
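Before evaluating the model, one can (optionally) inspect the best cross-validation score and the regularization strength selected by the grid search (output not shown here):

# Best mean cross-validated ROC AUC and the corresponding regularization parameter C.
grid_search.best_score_
grid_search.best_params_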
Remark: Note that one could have used the built-in PolynomialFeatures transformer for this particular example. Nevertheless, the formula approach gives additional flexibility to explore feature transformations. For example, note how easy it is to include numpy transformations (these could also be encoded with a FunctionTransformer, as shown in the sketch after the plot below):
formula_transformer = FormulaTransformer(formula='x1 + x2 + np.sin(x1) + np.abs(x2) - 1')
x_train_features = formula_transformer.fit_transform(X=x_train)
x_train_features.head()
| | x1 | x2 | np.sin(x1) | np.abs(x2) |
|---|---|---|---|---|
| 0 | 2.955788 | 0.151609 | 0.184737 | 0.151609 |
| 1 | -0.583387 | -0.770627 | -0.550854 | 0.770627 |
| 2 | -1.981073 | 1.744641 | -0.917011 | 1.744641 |
| 3 | -3.975138 | 1.256691 | 0.740319 | 1.256691 |
| 4 | 1.830804 | -1.650994 | 0.966388 | 1.650994 |
Let us visualize the transformed features:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 6))
sns.lineplot(x='x1', y='np.sin(x1)', data=x_train_features, marker='o', color=sns_c[1], ax=ax[0])
sns.lineplot(x='x2', y='np.abs(x2)', data=x_train_features, marker='o', color=sns_c[2], ax=ax[1])
ax[0].set(title=r'$x_1 \longmapsto \sin(x_1)$')
ax[1].set(title=r'$x_2 \longmapsto |x_2|$')
fig.suptitle('Feature Transformations', y=0.93);
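As mentioned in the remark above, these element-wise numpy transformations could also be expressed with scikit-learn's FunctionTransformer inside a ColumnTransformer. A minimal sketch (illustrative only, not used in the model pipeline above):

from sklearn.preprocessing import FunctionTransformer

# Reproduce the four columns from the formula above: x1, x2, sin(x1), |x2|.
function_features = ColumnTransformer(transformers=[
    ('identity', 'passthrough', ['x1', 'x2']),
    ('sin_x1', FunctionTransformer(func=np.sin), ['x1']),
    ('abs_x2', FunctionTransformer(func=np.abs), ['x2'])
])

# Unlike the FormulaTransformer, this returns a plain numpy array without column names.
function_features.fit_transform(x_train)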
Predictions and Evaluation Metrics
Now we simply evaluate the model by looking at its performance on out-of-sample data. The results are similar to the ones in the previous post. First, let us generate predictions (both class probabilities and class labels) and compute the accuracy:
p_test_pred = grid_search.predict_proba(X=x_test)[:, 1]
y_test_pred = grid_search.predict(X=x_test)
print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")
accuracy = 0.827
Now let us plot the confusion matrix:
fig, ax = plt.subplots()
plot_confusion_matrix(
    estimator=grid_search,
    X=x_test,
    y_true=y_test,
    normalize='all',
    cmap='viridis_r',
    ax=ax
)
ax.set(title='Confusion Matrix');
It seems the shares of false positives and false negatives are balanced.
Next, we look at the ROC curve:
fpr, tpr, thresholds = roc_curve(
    y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
)
roc_auc = auc(fpr, tpr)
fig, ax = plt.subplots()
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
ax.set(title='ROC - Test Set');
Partial Dependence Plot
Finally, let us compute the partial dependence plot for both features x1 and x2. First, we need to make sure the features are not highly correlated:
print(f'Input features correlation: {x_test.corr().loc["x1", "x2"]: 0.3f}')
Input features correlation: 0.086
Now let us generate the plot:
fig, ax = plt.subplots()
cmap = sns.color_palette('Spectral_r', as_cmap=True)
plot_partial_dependence(
    estimator=grid_search,
    X=x_test,
    features=[('x1', 'x2')],
    contour_kw={'cmap': cmap},
    ax=ax
)
ax.set(title='Partial Dependence Plot');
Note how similar it looks compared with the model decision boundary in the previous post.
Remark: Note that we have computed the partial dependence plot using the test set. For a deeper discussion of this subject, please see Section 5.5.2 Should I Compute Importance on Training or Test Data? (great reference!). We are following the same logic as in Interpretable Machine Learning with Python.
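If you want to compare, the analogous computation on the training set only requires swapping the data passed to the plotting function:

plot_partial_dependence(
    estimator=grid_search,
    X=x_train,
    features=[('x1', 'x2')],
    contour_kw={'cmap': cmap}
)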