A conceptual and practical viewpoint
Reference Notebook 📓
Motivation: Some Applications
Gaussian Processes
Hilbert Space Approximation Deep Dive
References
👉 A way to encode similarity between points.
\[ k(x, x') = a^2 \exp\left(-\frac{(x - x')^2}{2 \ell^2}\right) \]
Where \(a\) is the amplitude and \(\ell\) is the length-scale.
Note
Observe that the kernel depends only on the difference between points, \(r = x - x'\). This is the defining property of stationary kernels.
Let us denote by \((X, y)\) the training data and by \(X_{*}\) the test inputs for which we want to generate predictions. We denote by \(f\) and \(f_*\) the latent function values at the training and test inputs, respectively.
\[ \left( \begin{array}{c} y \\ f_* \end{array} \right) \sim \text{MultivariateNormal}(0, \boldsymbol{K}) \]
where
\[ \boldsymbol{K} = \left( \begin{array}{cc} K(X, X) + \sigma^2_n I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{array} \right) \]
\[ f_*|X, y, X_* \sim \text{MultivariateNormal}(\bar{f}_*, \text{cov}(f_*)) \]
where
\[ \begin{align*} \bar{f}_* &= K(X_*, X){\color{red}{(K(X, X) + \sigma^2_n I)^{-1}}} y \\ \text{cov}(f_*) & = K(X_*, X_*) - K(X_*, X){\color{red}{(K(X, X) + \sigma^2_n I)^{-1}}} K(X, X_*) \end{align*} \]
Important
Taking the inverse of the kernel matrix is the most computationally expensive part of the GP model. It is of order \(\mathcal{O}(n^3)\).
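For reference, a minimal JAX sketch of these predictive equations, using a linear solve instead of forming the inverse explicitly (the function and argument names are illustrative, not from the reference notebook):

import jax.numpy as jnp

def gp_predict(x_train, y_train, x_test, kernel_fn, noise_var):
    # Exact GP predictive mean and covariance; cost is cubic in the training size.
    k_xx = kernel_fn(x_train, x_train) + noise_var * jnp.eye(x_train.shape[0])
    k_xs = kernel_fn(x_train, x_test)
    k_ss = kernel_fn(x_test, x_test)
    # Solve (K(X, X) + sigma_n^2 I) alpha = y instead of inverting the matrix.
    alpha = jnp.linalg.solve(k_xx, y_train)
    mean = k_xs.T @ alpha
    cov = k_ss - k_xs.T @ jnp.linalg.solve(k_xx, k_xs)
    return mean, cov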
with pm.Model() as gp_pymc_model:
    x_data = pm.MutableData("x_data", value=x_train)
    kernel_amplitude = pm.InverseGamma("kernel_amplitude", ...)
    kernel_length_scale = pm.InverseGamma("kernel_length_scale", ...)
    noise = pm.InverseGamma("noise", ...)
    mean = pm.gp.mean.Zero()
    cov = kernel_amplitude**2 * pm.gp.cov.ExpQuad(
        input_dim=1, ls=kernel_length_scale
    )
    gp = pm.gp.Latent(mean_func=mean, cov_func=cov)
    f = gp.prior("f", X=x_data[:, None])
    pm.Normal("likelihood", mu=f, sigma=noise, observed=y_train_obs)
conditional method
with gp_pymc_model:
    x_star_data = pm.MutableData("x_star_data", x_test)
    f_star = gp.conditional("f_star", x_star_data[:, None])
    pm.Normal("likelihood_test", mu=f_star, sigma=noise)

    gp_pymc_idata.extend(
        pm.sample_posterior_predictive(
            trace=gp_pymc_idata,
            var_names=["f_star", "likelihood_test"],
            random_seed=rng_subkey[1].item(),
        )
    )
Problem with scaling Gaussian processes?
👉 Inverting the kernel matrix is computationally expensive!
Approach Summary 🤓
Strategy: Approximate the kernel matrix \(K\) with a matrix of smaller rank.
Key Idea: Interpret the covariance function as the kernel of a pseudo-differential operator and approximate it using Hilbert space methods.
Result: A reduced-rank approximation where the basis functions are independent of the covariance function and its parameters (plus asymptotic convergence).
\[ f(x) \approx \sum_{j = 1}^{m} \overbrace{\underbrace{\color{red}{\left(S(\sqrt{\lambda_j})\right)^{1/2}}}_{\text{All hyperparameters are here!}}}^{\text{Spectral density evaluated at the eigenvalues}} \times \overbrace{\underbrace{\color{blue}{\phi_{j}(x)}}_{\text{Easy to compute!}}}^{\text{Laplacian Eigenfunctions}} \times \overbrace{\color{green}{\beta_{j}}}^{\sim \: \text{Normal}(0,1)} \]
Recall that given a matrix \(A\) (or a linear operator) the eigenvalues and eigenvectors are the solutions to the equation
\[A v = \lambda v\]
where \(v \neq \vec{0}\) is the eigenvector and \(\lambda\) is the eigenvalue.
The spectrum of a matrix is the set of its eigenvalues.
\[ A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix} \]
The eigenvalues are \(\lambda_1 = 3\) and \(\lambda_2 = -1\) with eigenvectors:
\[ v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} \quad \text{and} \quad v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 1 \end{pmatrix} \]
Orthonormal Basis
\[ v_1^{T} v_2 = 0 \quad \text{and} \quad v_1^{T} v_1 = v_2^{T} v_2 = 1 \]
The spectral theorem states that we can always find such an orthonormal basis of eigenvectors for a symmetric matrix.
Change of Basis
Observe that if we consider the change-of-basis matrix
\[ Q = \begin{pmatrix} v_1 & v_2 \end{pmatrix} \]
then we can write the matrix \(A\) in the new basis as
\[ D = Q^{T} A Q = \begin{pmatrix} 3 & 0 \\ 0 & -1 \end{pmatrix} \]
import jax.numpy as jnp
A = jnp.array([[1.0, 2.0], [2.0, 1.0]])
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = jnp.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}\nEigenvectors:\n{eigenvectors}")
>> Eigenvalues: [ 3.+0.j -1.+0.j]
>> Eigenvectors:
>> [[ 0.70710678+0.j -0.70710678+0.j]
>> [ 0.70710678+0.j 0.70710678+0.j]]
# Change of basis matrix
Q = eigenvectors
Q.T @ A @ Q
>> Array([[ 3.00000000e+00+0.j, 4.44089210e-16+0.j],
>>        [ 6.10622664e-16+0.j, -1.00000000e+00+0.j]], dtype=complex128)
\[ A = \sum_{j=1}^{2} \color{red}{\lambda_j} \color{blue}{v_j v_j^{T}} \]
The operator \(P_{j} = v_j v_j^{T}\) satisfies \(P_j^2 = P_j\) and \(P_j^T = P_j\), i.e. it is an orthogonal projection.
The functional calculus generalizes the application of a function (for example, via its Taylor series) from scalars to operators. Given a function \(f\), we can define the operator \(f(A)\) as
\[ f(A) = \sum_{j=1}^{2} \color{red}{f(\lambda_j)} \color{blue}{v_j v_j^{T}} \]
Examples
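For instance, taking \(f = \exp\) yields the matrix exponential. A minimal JAX check on the matrix \(A\) above (using `eigh`, which assumes a symmetric matrix):

import jax.numpy as jnp
from jax.scipy.linalg import expm

A = jnp.array([[1.0, 2.0], [2.0, 1.0]])
# Spectral decomposition of the symmetric matrix A.
eigenvalues, eigenvectors = jnp.linalg.eigh(A)
# Functional calculus with f = exp: f(A) = sum_j exp(lambda_j) v_j v_j^T.
exp_A = (eigenvectors * jnp.exp(eigenvalues)) @ eigenvectors.T
# Agrees with the matrix exponential computed directly (up to numerical error).
print(jnp.allclose(exp_A, expm(A), atol=1e-5))  # True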
If the kernel function is stationary, we can use the spectral representation of the kernel (Bochner’s theorem) and write it as:
\[ k(r) = \frac{1}{(2 \pi)^{d}}\int_{\mathbb{R}^{d}} e^{i \omega^{T} r} {\color{red}{S(\omega)}} d\omega \]
where \(\color{red}{S(\omega)}\) is the spectral density corresponding to the covariance function.
Gaussian Processes for Machine Learning - Chapter 4: Covariance Functions.
(Here we are assuming the positive measure from Bochner’s theorem has a density, i.e. \(d\mu(\omega) = S(\omega)d\omega\))
For the squared exponential kernel, it can be shown that
\[ S(\omega) = a^2 (2 \pi \ell^2)^{d/2} \exp\left(-\frac{\ell^2 \omega^2}{2}\right) \]
# d = 1
def squared_exponential_spectral_density(w, amplitude, length_scale):
    c = amplitude**2 * jnp.sqrt(2 * jnp.pi) * length_scale
    e = jnp.exp(-0.5 * (length_scale**2) * (w**2))
    return c * e
Note the similarity between the spectral density and the Fourier transform. For Gaussian-like kernels, like the squared exponential, we expect the associated spectral density to also be Gaussian-like.
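As a quick numerical sanity check of Bochner’s formula for \(d = 1\), we can recover the kernel value at a given lag from the spectral density helper defined above by quadrature (the lag, hyperparameters, and grid are illustrative):

import jax.numpy as jnp

amplitude, length_scale, r = 1.0, 0.5, 0.3
# Squared exponential kernel evaluated directly at the lag r.
k_direct = amplitude**2 * jnp.exp(-0.5 * (r / length_scale) ** 2)
# k(r) = (1 / (2 pi)) * integral of exp(i omega r) S(omega) d omega, via a Riemann sum.
omega = jnp.linspace(-60.0, 60.0, 12_001)
integrand = jnp.exp(1j * omega * r) * squared_exponential_spectral_density(
    omega, amplitude, length_scale
)
k_from_spectral = jnp.real(jnp.sum(integrand)) * (omega[1] - omega[0]) / (2 * jnp.pi)
# The two values agree up to quadrature error.
print(k_direct, k_from_spectral)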
Let us assume the kernel function is isotropic, i.e. it just depends on the Euclidean norm \(||r||\).
Then \(S(\omega)\) is also isotropic, so we can write \(S(||\omega||) = \psi(||\omega||^2)\) for a suitable function \(\psi\).
We can expand \(\psi\) in a Taylor-like series:
\[ \begin{align*} \psi(||\omega||^2) = a_0 + a_1 (||\omega||^2) + a_2 (||\omega||^2)^2 + a_3 (||\omega||^2)^3 + \cdots \end{align*} \]
Recall that the (negative) Laplace operator is defined as
\[ - \nabla^2 f = - \sum_{i=1}^{d} \frac{\partial^2 f}{\partial x_i^2} \]
Properties
The operator \(-\nabla^2\) has many good properties:
Positive semi-definite
Elliptic
Self-adjoint (with appropriate boundary conditions)
One can verify that the Fourier transform of the Laplacian is
\[ \mathcal{F}[\nabla^2 f](\omega) = - ||\omega||^2 \mathcal{F}[f] \]
Formal expansion as powers of the Laplacian
We can write the integral operator associated with the kernel
\[ (\mathcal{K} f)(x) := \int_{\mathbb{R}^d} k(x, x')f(x') dx' \]
as
\[ \mathcal{K} = a_0 + a_1 (- \nabla^2) + a_2 (-\nabla^2)^2 + a_3 (-\nabla^2)^3 + \cdots \]
Dirichlet boundary conditions: we only consider functions that vanish on the boundary of the domain \(\Omega\).
With these boundary conditions, the negative Laplacian is a self-adjoint operator on the Hilbert space \(L^{2}(\Omega)\) equipped with the Lebesgue measure. That is,
\[ \int_{\Omega} {\color{purple}{(-\nabla^2 f(x))}} g(x) dx = \int_{\Omega} f(x) {\color{purple}{(-\nabla^2 g(x))}} dx \]
It has a discrete spectrum, with eigenvalues \({\color{red}{\lambda_j}} \rightarrow \infty\) and eigenfunctions \(\color{blue}{\phi_j}\) that form an orthonormal basis of \(L^2(\Omega)\).
For example, on the interval \(\Omega = [-L, L]\) the eigenvalue problem reads
\[ \frac{d^2 \phi}{dx^2} = - \lambda \phi \quad \text{with} \quad \phi(-L) = \phi(L) = 0 \]
Spectrum
\[ \phi_j(x) = \sqrt{\frac{1}{L}} \sin\left(\frac{\pi j (x + L)}{2L}\right) \quad \text{and} \quad \lambda_j = \left(\frac{j \pi}{2L}\right)^2 \quad \text{for} \: j = 1, 2, \ldots \]
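A small numerical check that these eigenfunctions are orthonormal on \([-L, L]\), as claimed above (the value of \(L\) and the grid are illustrative):

import jax.numpy as jnp

L = 2.0
x = jnp.linspace(-L, L, 10_001)
dx = x[1] - x[0]

def phi(j, x):
    # Dirichlet eigenfunctions of the negative Laplacian on [-L, L].
    return jnp.sqrt(1 / L) * jnp.sin(jnp.pi * j * (x + L) / (2 * L))

# Inner products approximated by a Riemann sum: ~1 on the diagonal, ~0 off it.
print(jnp.sum(phi(1, x) * phi(1, x)) * dx)  # approximately 1
print(jnp.sum(phi(1, x) * phi(2, x)) * dx)  # approximately 0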
In terms of its eigenfunctions, the negative Laplacian acts as an integral operator with kernel \(l(x, x')\):
\[ -\nabla^2 f(x) = \int_{\Omega} \overbrace{\left[\sum_{j} \lambda_{j} \phi_{j}(x) \phi_{j}(x')\right]}^{l(x, x')} f(x') dx' \]
We can construct powers of \(-\nabla^2\) as
\[ (- \nabla^2)^{s} f(x) = \int_{\Omega} {\color{red}{\overbrace{\left[\sum_{j} \lambda_{j}^{s} \phi_{j}(x) \phi_{j}(x')\right]}^{l^s(x, x')}}} f(x') dx'. \]
\[ \begin{align*} & \overbrace{\left[ a_{0} + a_{1}(-\nabla^2) + a_{2}(-\nabla^2)^2 + \cdots \right]}^{\mathcal{K}}f(x) = \\ &\int_{\Omega} \left[a_{0} + a_{1} l(x, x') + a_{2} l^{2}(x, x') + \cdots \right] f(x') dx' \end{align*} \]
This implies we can approximate the kernel as
\[ \begin{align*} k(x, x') \approx & \: a_{0} + a_{1} l(x, x') + a_{2} l^{2}(x, x') + \cdots \\ \approx & \sum_{j} {\color{red}{\underbrace{\left[ a_{0} + a_{1} \lambda_{j} + a_{2} \lambda_{j}^2 + \cdots \right]}_{S(\sqrt{\lambda_{j}})}}} \phi_{j}(x) \phi_{j}(x') \end{align*} \]
In summary, we have the following approximation formula for the kernel function:
\[ \boxed{ k(x, x') \approx \sum_{j=1}^{m} \color{red}{S(\sqrt{\lambda_j})}\color{blue}{\phi_{j}(x) \phi_{j}(x')} } \]
That is, the model of the Gaussian process \(f\) can be written as
\[ f(x) \sim \text{MultivariateNormal}(\boldsymbol{\mu}, \Phi\mathcal{D}\Phi^{T}) \]
where \(\mathcal{D} = \text{diag}(S(\sqrt{\lambda_1}), S(\sqrt{\lambda_2}), \ldots, S(\sqrt{\lambda_{m}}))\).
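A compact numerical illustration of this reduced-rank covariance \(\Phi \mathcal{D} \Phi^{T}\) for the squared exponential kernel (the domain half-width, number of basis functions, and grid below are illustrative choices):

import jax.numpy as jnp

amplitude, length_scale = 1.0, 0.3
l_max, m = 2.0, 20
x = jnp.linspace(-1.0, 1.0, 50)

# Square roots of the Laplacian eigenvalues and the eigenfunctions phi_j(x).
j = jnp.arange(1, m + 1)
sqrt_lambda = jnp.pi * j / (2 * l_max)
phi = jnp.sqrt(1 / l_max) * jnp.sin(sqrt_lambda * (x[:, None] + l_max))  # (n, m)

# Spectral density evaluated at sqrt(lambda_j): the diagonal of D.
spd = amplitude**2 * jnp.sqrt(2 * jnp.pi) * length_scale * jnp.exp(
    -0.5 * length_scale**2 * sqrt_lambda**2
)

# Reduced-rank approximation Phi D Phi^T vs. the exact kernel matrix.
k_approx = phi @ jnp.diag(spd) @ phi.T
k_exact = amplitude**2 * jnp.exp(
    -0.5 * ((x[:, None] - x[None, :]) / length_scale) ** 2
)
print(jnp.max(jnp.abs(k_exact - k_approx)))  # small for large enough m and L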
\[ f(x) \approx \sum_{j = 1}^{m} \overbrace{\color{red}{\left(S(\sqrt{\lambda_j})\right)^{1/2}}}^{\text{all hyperparameters are here!}} \times \underbrace{\color{blue}{\phi_{j}(x)}}_{\text{easy to compute!}} \times \overbrace{\color{green}{\beta_{j}}}^{\sim \: \text{Normal}(0,1)} \]
Key Observations
The only dependence on the hyperparameters is through the spectral density.
The computational cost of evaluating the log posterior density of univariate HSGPs scales as \(\mathcal{O}(nm + m)\).
\[ \begin{align*} f(x) \approx \sum_{j=1}^{m} & \color{red}{\overbrace{\left(a^2 \sqrt{2 \pi} \ell \exp\left(-\frac{\ell^2}{2}\left(\frac{\pi j}{2L}\right)^2 \right)\right)^{1/2}}^{\left(S(\sqrt{\lambda_j})\right)^{1/2}}} \\ & \times \color{blue}{\overbrace{\left(\sqrt{\frac{1}{L}} \sin\left(\frac{\pi j (x + L)}{2L}\right)\right)}^{\phi_{j}(x)}} \\ & \times \color{green}{\overbrace{\beta_{j}}^{\sim \: \text{Normal}(0, 1)}} \end{align*} \]
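The helpers `laplace_eigenfunctions` and `diag_squared_exponential_spectral_density` used in the non-centered parametrization below are defined in the reference notebook; a minimal sketch consistent with how they are called here (their exact implementations are assumed) is:

import jax.numpy as jnp

def laplace_eigenfunctions(x, l_max, m):
    # Dirichlet eigenfunctions phi_j(x) on [-l_max, l_max]; output shape (n, m).
    sqrt_lambda = jnp.pi * jnp.arange(1, m + 1) / (2 * l_max)
    return jnp.sqrt(1 / l_max) * jnp.sin(sqrt_lambda * (x[:, None] + l_max))

def diag_squared_exponential_spectral_density(amplitude, length_scale, l_max, m):
    # Spectral density S(sqrt(lambda_j)) for j = 1, ..., m; output shape (m,).
    sqrt_lambda = jnp.pi * jnp.arange(1, m + 1) / (2 * l_max)
    return squared_exponential_spectral_density(sqrt_lambda, amplitude, length_scale)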
def hs_approx_squared_exponential_ncp(x, amplitude, length_scale, l_max, m):
    phi = laplace_eigenfunctions(x, l_max, m)
    spd = jnp.sqrt(
        diag_squared_exponential_spectral_density(amplitude, length_scale, l_max, m)
    )
    with numpyro.plate("basis", m):
        beta = numpyro.sample("beta", dist.Normal(0, 1))
    return numpyro.deterministic("f", phi @ (spd * beta))
def hsgp_model(x, l_max, m, y=None) -> None:
    # --- Priors ---
    kernel_amplitude = numpyro.sample("kernel_amplitude", ...)
    kernel_length_scale = numpyro.sample("kernel_length_scale", ...)
    noise = numpyro.sample("noise", ...)
    # --- Parametrization ---
    f = hs_approx_squared_exponential_ncp(
        x, kernel_amplitude, kernel_length_scale, l_max, m
    )
    # --- Likelihood ---
    with numpyro.plate("data", x.shape[0]):
        numpyro.sample("likelihood", dist.Normal(loc=f, scale=noise), obs=y)
HSGP Class
with pm.Model() as hsgp_pymc_model:
    x_data = pm.MutableData("x_data", value=x_train)
    y_data = pm.MutableData("y_data", y_train_obs)
    kernel_amplitude = pm.InverseGamma("kernel_amplitude", ...)
    kernel_length_scale = pm.InverseGamma("kernel_length_scale", ...)
    noise = pm.InverseGamma("noise", ...)
    mean = pm.gp.mean.Zero()
    cov = kernel_amplitude**2 * pm.gp.cov.ExpQuad(
        input_dim=1, ls=kernel_length_scale
    )
    gp = pm.gp.HSGP(m=[20], L=[1.3], mean_func=mean, cov_func=cov)
    f = gp.prior("f", X=x_data[:, None])
    pm.Normal("likelihood", mu=f, sigma=noise, observed=y_data)
conditional method
with hsgp_pymc_model:
    x_star_data = pm.MutableData("x_star_data", x_test)
    f_star = gp.conditional("f_star", x_star_data[:, None])
    pm.set_data({"x_data": x_test, "y_data": np.ones_like(x_test)})

    hsgp_pymc_idata.extend(
        pm.sample_posterior_predictive(
            trace=hsgp_pymc_idata,
            var_names=["f_star", "likelihood"],
            random_seed=rng_subkey[1].item(),
        )
    )
HSGP.prior_linearized
with pm.Model() as hsgp_linearized_pymc_model:
    x_data = pm.MutableData("x_data", value=x_train)
    y_data = pm.MutableData("y_data", y_train_obs)
    kernel_amplitude = pm.InverseGamma("kernel_amplitude", ...)
    kernel_length_scale = pm.InverseGamma("kernel_length_scale", ...)
    noise = pm.InverseGamma("noise", ...)
    mean = pm.gp.mean.Zero()
    cov = kernel_amplitude**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=kernel_length_scale)
    gp = pm.gp.HSGP(m=[20], L=[1.3], mean_func=mean, cov_func=cov)
    # prior_linearized expects the inputs to be centered (mean-subtracted).
    phi, sqrt_psd = gp.prior_linearized(Xs=x_data_centered[:, None])
    beta = pm.Normal("beta", mu=0, sigma=1, size=gp._m_star)
    f = pm.Deterministic("f", phi @ (beta * sqrt_psd))
    pm.Normal("likelihood", mu=f, sigma=noise, observed=y_data)
pm.set_data method
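Since the linearized model is built on `pm.MutableData` containers, out-of-sample predictions can be generated by swapping in the test inputs with `pm.set_data`, following the same pattern as above. A sketch (assuming `x_data_centered` is computed from `x_data` inside the model, and with the trace name `hsgp_linearized_pymc_idata` assumed):

with hsgp_linearized_pymc_model:
    pm.set_data({"x_data": x_test, "y_data": np.ones_like(x_test)})

    hsgp_linearized_pymc_idata.extend(
        pm.sample_posterior_predictive(
            trace=hsgp_linearized_pymc_idata,
            var_names=["f", "likelihood"],
            random_seed=rng_subkey[1].item(),
        )
    )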