Where \(a\) is the amplitude and \(\ell\) is the length-scale.
Note
Observe that the kernel depends only on the difference between points, \(r = x - x'\). This is the defining property of stationary kernels.
Kernel Matrix
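As an illustration (a minimal NumPy sketch, not from the original slides), the Gram matrix induced by the squared exponential kernel on a set of input points can be computed as:

```python
import numpy as np

def exp_quad_kernel(x, x_prime, a=1.0, ell=0.3):
    """Squared exponential kernel k(x, x') = a^2 exp(-(x - x')^2 / (2 ell^2))."""
    return a**2 * np.exp(-((x[:, None] - x_prime[None, :]) ** 2) / (2 * ell**2))

x = np.linspace(0, 1, 50)
K = exp_quad_kernel(x, x)  # (50, 50) Gram matrix, symmetric positive semi-definite
```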
GP Model - Prior
Let us denote by \((x, y)\) the training data and by \(x_{*}\) the test set for which we want to generate predictions. We denote by \(f\) and \(f_*\) the latent functions evaluated at the training and test sets, respectively.
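Under the GP prior these latent values are jointly Gaussian. Assuming a zero mean function and kernel \(k\), the joint prior reads

\[
\begin{pmatrix} f \\ f_{*} \end{pmatrix}
\sim \mathcal{N}\left(0,
\begin{pmatrix}
K(x, x) & K(x, x_{*}) \\
K(x_{*}, x) & K(x_{*}, x_{*})
\end{pmatrix}
\right)
\]

where each block \(K(\cdot, \cdot)\) is the kernel evaluated on the corresponding pairs of inputs.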
Hilbert Space Gaussian Process (HSGP) Approximation
HSGP Deep Dive
Approach Summary 🤓
Strategy: Approximate the Gram matrix \(K\) by a matrix \(\tilde{K}\) of lower rank \(m < n\).
Key Idea: Interpret the covariance function as the kernel of a pseudo-differential operator and approximate it using Hilbert space methods.
Result: A reduced-rank approximation of the covariance function in which the basis functions are independent of the covariance function and its parameters (plus asymptotic convergence guarantees).
1. We recall the spectral density \(S(\omega)\) associated with a stationary kernel function \(k\).
2. We approximate the spectral density \(S(\omega)\) as a polynomial series in \(||\omega||^2\).
3. We can interpret these polynomial terms as powers of the Laplacian operator. The key observation is that, in the Fourier domain, the (negative) Laplacian operator acts as multiplication by \(||\omega||^2\).
4. Next, we impose Dirichlet boundary conditions on the Laplacian operator, which makes it self-adjoint with a discrete spectrum.
5. We identify the expansion in (2) with the sum of powers of the Laplacian operator in the eigenbasis of (4).
6. We arrive at the final approximation formula (sketched below) and explicitly compute the terms for the squared exponential kernel in the one-dimensional case.
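For reference, the resulting reduced-rank approximation has the form

\[
k(x, x') \approx \sum_{j=1}^{m} S\left(\sqrt{\lambda_j}\right) \phi_j(x) \phi_j(x')
\]

where \(\lambda_j\) and \(\phi_j\) are the eigenvalues and eigenfunctions of the Dirichlet Laplacian on the chosen domain, and \(S\) is the spectral density of the kernel.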
Eigenvalues and Eigenvectors
Recall that given a matrix \(A\) (or a linear operator) the eigenvalues and eigenvectors are the solutions to the equation
\[A v = \lambda v\]
where \(v \neq \vec{0}\) is the eigenvector and \(\lambda\) is the eigenvalue.
The spectrum of a matrix is the set of its eigenvalues.
Another important consequence of the spectral theorem is that we can decompose the matrix \(A\) as a sum of projections onto its eigenspaces. That is, we can re-write \(A\) as
\[
A = \sum_{j=1}^{n} \lambda_j v_j v_j^{T}
\]
Projections
Observe that the operator \(P_{j} = v_j v_j^{T}\) is indeed a projection operator, that is, \(P_j^2 = P_j\) and \(P_j^T = P_j\), since the \(v_j\) constitute an orthonormal basis.
The functional calculus is a generalization of the Taylor series of functions to operators. Given a function \(f\) we can define the operator \(f(A)\) as
\[
f(A) = \sum_{j=1}^{n} f(\lambda_j) v_j v_j^{T}
\]
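A quick NumPy illustration (not in the original slides) of the spectral decomposition, the projection property, and the functional calculus:

```python
import numpy as np

rng = np.random.default_rng(42)

# A random symmetric matrix, so the spectral theorem applies
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2

# np.linalg.eigh returns eigenvalues and orthonormal eigenvectors (as columns)
eigvals, eigvecs = np.linalg.eigh(A)

# Spectral decomposition: A = sum_j lambda_j v_j v_j^T
A_rec = sum(lam * np.outer(v, v) for lam, v in zip(eigvals, eigvecs.T))
assert np.allclose(A, A_rec)

# P_j = v_j v_j^T is a projection: P^2 = P and P^T = P
P0 = np.outer(eigvecs[:, 0], eigvecs[:, 0])
assert np.allclose(P0 @ P0, P0) and np.allclose(P0.T, P0)

# Functional calculus f(A) = sum_j f(lambda_j) v_j v_j^T; with f(x) = x^2
# it recovers the ordinary matrix product A @ A
A_sq = sum(lam**2 * np.outer(v, v) for lam, v in zip(eigvals, eigvecs.T))
assert np.allclose(A_sq, A @ A)
```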
Bochner's theorem states that a stationary kernel can be represented as
\[
k(r) = \int e^{i \omega^{T} r} \, d\mu(\omega)
\]
where \(\mu\) is a positive measure. If this measure has a density, it is called the spectral density \(S(\omega)\) corresponding to the covariance function, i.e. \(d\mu(\omega) = S(\omega) d\omega\).
Spectral Density - Example
For the squared exponential kernel in one dimension, it can be shown that
\[
S(\omega) = a^2 \sqrt{2\pi}\, \ell \, \exp\left(-\frac{\ell^2 \omega^2}{2}\right)
\]
(up to the normalization convention chosen for the Fourier transform).
The spectral density is, up to convention, the Fourier transform of the kernel. For Gaussian-like kernels, such as the squared exponential, we therefore expect the associated spectral density to also be Gaussian-like.
\(S(\omega)\) Squared Exponential
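As a quick numerical sanity check (illustrative, not from the original slides; using the convention \(S(\omega) = \int k(r) e^{-i\omega r} \, dr\)):

```python
import numpy as np

a, ell = 1.0, 0.5

# Dense grid; the kernel is numerically zero well before the endpoints
r = np.linspace(-20, 20, 20001)
dr = r[1] - r[0]
k = a**2 * np.exp(-(r**2) / (2 * ell**2))

omega = 2.0
# Numerical Fourier transform of the kernel at a single frequency
S_num = (k * np.exp(-1j * omega * r)).sum().real * dr
# Analytic spectral density of the squared exponential kernel
S_ana = a**2 * np.sqrt(2 * np.pi) * ell * np.exp(-(ell**2) * omega**2 / 2)
assert np.isclose(S_num, S_ana)
```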
Formal Power Expansion of the Spectral Density
Let us assume the kernel function is isotropic, i.e. it just depends on the Euclidean norm \(||r||\).
\(S(\omega)\) is also isotropic.
Write \(S(||\omega||) = \psi(||\omega||^2)\) for a suitable function \(\psi\).
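Expanding \(\psi\) as a (formal) power series then gives the polynomial approximation of step (2), with coefficients \(c_j\) determined by the kernel parameters:

\[
S(\omega) = \psi(||\omega||^2) \approx c_0 + c_1 ||\omega||^2 + c_2 \left(||\omega||^2\right)^2 + \cdots
\]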
The most common boundary conditions are the Dirichlet boundary conditions, which are defined by requiring that the function vanishes on the boundary of the domain.
\[
\begin{align*}
-\nabla^2 f &= \lambda f \quad \text{in} \quad \Omega \\
f &= 0 \quad \text{on} \quad \partial \Omega
\end{align*}
\]
Dirichlet’s Laplacian - Spectrum
It is a self-adjoint operator defined on the Hilbert space \(L^{2}(\Omega)\) equipped with the Lebesgue measure. That is,
\[
\langle -\nabla^2 f, g \rangle = \langle f, -\nabla^2 g \rangle
\]
for all \(f, g\) in its domain.
It has a discrete spectrum with eigenvalues \(\lambda_j \rightarrow \infty\) and eigenfunctions \(\phi_j\) that form an orthonormal basis of \(L^2(\Omega)\), so that \[
\int_{\Omega} \phi_{j}(x) \phi_{k}(x) dx = \delta_{jk}
\]
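For example, in the one-dimensional case \(\Omega = [-L, L]\), the eigenvalues and (normalized) eigenfunctions have the closed form

\[
\lambda_j = \left(\frac{j \pi}{2 L}\right)^2, \qquad
\phi_j(x) = \sqrt{\frac{1}{L}} \sin\left(\sqrt{\lambda_j} (x + L)\right), \qquad j = 1, 2, \ldots
\]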
We use a Gaussian process with a squared exponential kernel.
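A minimal PyMC sketch of how such a component could be set up (hypothetical variable names and priors; assumes a recent PyMC version with `pm.gp.HSGP`):

```python
import numpy as np
import pymc as pm

# Hypothetical input: an (n, 1) array of (scaled) time points
X_train = np.linspace(0, 1, 100)[:, None]

with pm.Model() as model:
    amplitude = pm.HalfNormal("amplitude", sigma=1.0)
    length_scale = pm.LogNormal("length_scale", mu=0.0, sigma=1.0)

    # Squared exponential kernel for the trend component
    cov = amplitude**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=length_scale)

    # HSGP approximation: m basis functions, boundary extension factor c
    gp = pm.gp.HSGP(m=[20], c=1.5, cov_func=cov)
    f_trend = gp.prior("f_trend", X=X_train)
```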
HSGP Model - Yearly Seasonality
We use a Gaussian process with a periodic kernel.
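The periodic kernel has a discrete spectral measure, so the generic HSGP formula does not apply directly; PyMC offers a dedicated approximation for it. A sketch (assuming a recent PyMC version with `pm.gp.HSGPPeriodic`; variable names are hypothetical):

```python
import numpy as np
import pymc as pm

X_train = np.linspace(0, 1, 100)[:, None]  # hypothetical (scaled) time points

with pm.Model() as model:
    scale = pm.HalfNormal("scale", sigma=1.0)
    ls = pm.LogNormal("ls", mu=0.0, sigma=1.0)

    # Periodic kernel with a known period (e.g., one year in scaled units)
    cov = pm.gp.cov.Periodic(input_dim=1, period=1.0, ls=ls)

    gp = pm.gp.HSGPPeriodic(m=20, scale=scale, cov_func=cov)
    f_yearly = gp.prior("f_yearly", X=X_train)
```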
HSGP Model - Weekly Seasonality
We use a ZeroSumNormal distribution to capture the relative difference across weekdays. We couple it with a multiplicative factor parametrized by a Gaussian process.
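One possible (hypothetical) encoding of this idea in PyMC: a zero-sum weekday effect scaled by a positive, slowly varying factor driven by an HSGP:

```python
import numpy as np
import pymc as pm

X_train = np.linspace(0, 1, 100)[:, None]  # hypothetical time points
weekday_idx = np.arange(100) % 7           # hypothetical weekday index per observation

with pm.Model(coords={"weekday": range(7)}) as model:
    # Relative weekday effects constrained to sum to zero
    b_weekday = pm.ZeroSumNormal("b_weekday", sigma=1.0, dims="weekday")

    # Multiplicative factor from an HSGP, exponentiated to stay positive
    cov = pm.gp.cov.ExpQuad(input_dim=1, ls=0.2)
    gp = pm.gp.HSGP(m=[10], c=1.5, cov_func=cov)
    log_factor = gp.prior("log_factor", X=X_train)

    f_weekly = pm.Deterministic(
        "f_weekly", b_weekday[weekday_idx] * pm.math.exp(log_factor)
    )
```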
HSGP Model - Special Days
We use a StudentT distribution over the day_of_year feature.
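The implementation details are not shown here; one plausible reading (a hypothetical sketch) is a heavy-tailed prior on per-calendar-day effects, so that most days stay near zero while a few special days can take large values:

```python
import numpy as np
import pymc as pm

day_of_year_idx = np.arange(100) % 366  # hypothetical day-of-year index per observation

with pm.Model(coords={"day_of_year": range(366)}) as model:
    # Heavy tails let a handful of special days escape the prior's bulk
    b_special = pm.StudentT("b_special", nu=3, mu=0.0, sigma=0.1, dims="day_of_year")

    f_special = pm.Deterministic("f_special", b_special[day_of_year_idx])
```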