In this short post, I show how to vectorize an scikit-learn
transformer over a numpy
array. That is, how to apply a transformer along a specific axes of a numpy
array. I have found this to be particularly useful when working with output sample posterior distributions from a bayesian model where I want to apply a transformer to each sample. This is not particularly difficult, but I always forget how to do it, so I thought I would write it down once and for all 😄.
Prepare Notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
from sklearn.preprocessing import FunctionTransformer
from numpy.random import RandomState
plt.style.use("bmh")
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.dpi"] = 100
plt.rcParams["figure.facecolor"] = "white"
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = "retina"
Generate Data
We generate a synthetic data set using the make_circles
function from scikit-learn
.
random_state = RandomState(seed=42)
x, _ = make_circles(n_samples=100, factor=0.5, noise=0.05, random_state=random_state)
x = x + 0.5
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(x[:, 0], x[:, 1], "o", c="C0")
ax.set(
title="Raw Data",
)
Let’s see the dimensions of our circles data set.
x.shape
(100, 2)
Next, we generate another synthetic data set on which we have a bunch of these circles indexed by vectors in a numpy
array
n = 6
z = np.array(
[
[make_circles(n_samples=(60, 60), random_state=random_state)[0] + i * 0.8]
for i in range(n)
]
)
z.shape
(6, 1, 120, 2)
We can visualize the data set by coloring each circles subset according to the index of the array.
fig, ax = plt.subplots(figsize=(8, 8))
for i in range(n):
ax.plot(z[i, :, :, 0], z[i, :, :, 1], "o", c=f"C{i + 1}")
ax.set(
title="Data Samples",
)
Define Transformer
We specify a simple custom transformer which maps any non-zero vector to the unit circle by normalizing it.
circle_transformer = FunctionTransformer(
func=lambda x: x / np.linalg.norm(x, axis=1)[..., None]
)
x_circle = circle_transformer.fit_transform(x)
We can plot the transformed data set.
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(x_circle[:, 0], x_circle[:, 1], "o", c="C0")
ax.set(
title="Transformed Data (projected onto unit circle)",
)
Vectorize Transformer
First let’s see what happens if we apply the transformer to the entire array.
try:
circle_transformer.transform(z)
except ValueError as e:
print(f"ValueError: {e}")
ValueError: operands could not be broadcast together with shapes (6,1,120,2) (6,120,2,1)
One could try to move some dimensions around:
z_shifted = np.moveaxis(a=z, source=[1], destination=[3])
z_shifted.shape
(6, 120, 2, 1)
try:
circle_transformer.transform(z_shifted)
except ValueError as e:
print(f"ValueError: {e}")
ValueError: operands could not be broadcast together with shapes (6,120,2,1) (6,2,1,1)
But things do not seem to work 😒. The problem is that the transformer does not understand how to brodcast the transformation over the array. We need to tell it how to do it. We can do this by using the np.vectorize
function.
vectorized_circle_transformer = np.vectorize(
pyfunc=circle_transformer.transform, # <- the function to vectorize
excluded=[2, 3], # <- the axes to exclude from the vectorization
signature="(m, n) -> (m, n)", # <- the signature of the function (input and output matrices)
)
Let’s apply the vectorized transformer to the entire array.
z_circle = vectorized_circle_transformer(z)
z.shape
(6, 1, 120, 2)
Yay! It works! 🎉
Finally, we can plot the transformed data set.
fig, ax = plt.subplots(figsize=(8, 8))
for i in range(n):
ax.plot(z_circle[i, :, :, 0], z_circle[i, :, :, 1], "o", c=f"C{i + 1}")
ax.set(
title="Data Samples Transformed",
)
As expected all the points are mapped to the unit circle centered at the origin.