In this notebook we run some experiments to demonstrate how we can use Gaussian Processes in the context of time series forecasting with scikit-learn. This material is part of a talk on Gaussian Process for Time Series Analysis presented at the PyCon DE & PyData 2019 Conference in Berlin.
Update: Additional material and plots were included for the Second Symposium on Machine Learning and Dynamical Systems at The Fields Institute (virtual event).
Videos
- PyData Berlin 2019
- Second Symposium on Machine Learning and Dynamical Systems 2020
Slides
Here you can find the slides of the talk.
Suggestions and comments are always welcome!
References:
- Gaussian Processes for Machine Learning, by Carl Edward Rasmussen and Christopher K. I. Williams.
- Gaussian Processes for Timeseries Modelling, by S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson and S. Aigrain.
- Bayesian Data Analysis, by Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin.
- Statistical Rethinking, by Richard McElreath.
- scikit-learn docs: 1.7. Gaussian Processes. In particular I recommend the example: Gaussian process regression (GPR) on Mauna Loa CO2 data.
- Blog Posts:
Prepare Notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns
sns.set_style(
style='darkgrid',
rc={'axes.facecolor': '.9', 'grid.color': '.8'}
)
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
%matplotlib inline
plt.rcParams['figure.figsize'] = [12, 6]
plt.rcParams['figure.dpi'] = 100
Example 1
In this first example we consider a seasonal stationary time series.
Generate Data
- Time Variable
Let us define a “time” variable \(t\in \mathbb{N}\).
# Number of samples.
n = 1000
# Generate "time" variable.
t = np.arange(n)
data_df = pd.DataFrame({'t' : t})
- Seasonal Component
Let us define a function to generate seasonal components (Fourier modes).
# Generate seasonal variables.
def seasonal(t, amplitude, period):
    """Generate a sinusoidal curve."""
    y1 = amplitude * np.sin((2*np.pi)*t/period)
    return y1
# Add seasonal component.
data_df['s1'] = data_df['t'].apply(lambda t : seasonal(t, amplitude=2, period=40))
# Define target variable.
data_df['y1'] = data_df['s1']
Let us plot this seasonal variable:
fig, ax = plt.subplots()
sns.lineplot(x='t', y='s1', data=data_df, color=sns_c[0], label='s1', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Seasonal Component', xlabel='t', ylabel='');
- Gaussian Noise
Finally, we add some noise.
# Set noise standard deviation.
sigma_n = 0.3
data_df['epsilon'] = np.random.normal(loc=0, scale=sigma_n, size=n)
# Add noise to target variable.
data_df['y1'] = data_df['y1'] + data_df['epsilon']
Let us plot the resulting data:
fig, ax = plt.subplots()
sns.lineplot(x='t', y='y1', data=data_df, color=sns_c[0], label='y1', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Sample Data 1', xlabel='t', ylabel='');
Define Model
Let us now define the kernel of the Gaussian process model. We include the following kernel components (recall that the sum of kernels is again a kernel):
- WhiteKernel to account for noise.
- ExpSineSquared to model the periodic component.
We add bounds to the kernel hyper-parameters which are optimized by maximizing the log-marginal-likelihood (see documentation).
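For reference, following the scikit-learn documentation, these two components can be written as

\[
k_{\text{White}}(x_i, x_j) = \sigma_n^2\,\delta_{ij}
\quad \text{and} \quad
k_{\text{ExpSineSquared}}(x_i, x_j) = \exp\left(-\frac{2\sin^2(\pi|x_i - x_j|/p)}{\ell^2}\right),
\]

where \(\sigma_n^2\) is the noise level, \(p\) the periodicity and \(\ell\) the length scale. In the kernel below, the periodic component is multiplied by a ConstantKernel, which simply scales it by a constant amplitude.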
from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared, ConstantKernel
k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))
k1 = ConstantKernel(constant_value=2) * \
ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))
kernel_1 = k0 + k1
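As a quick check (a small sketch, not part of the original code), we can list the hyperparameters of the composite kernel and their bounds before fitting:
# Inspect the hyperparameters (and bounds) of the composite kernel.
for hyperparameter in kernel_1.hyperparameters:
    print(hyperparameter)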
Next we initialize the GaussianProcessRegressor object.
from sklearn.gaussian_process import GaussianProcessRegressor
gp1 = GaussianProcessRegressor(
kernel=kernel_1,
n_restarts_optimizer=10,
normalize_y=True,
alpha=0.0
)
Split Data
We now prepare and split the data for the model.
X = data_df['t'].values.reshape(n, 1)
y = data_df['y1'].values.reshape(n, 1)
prop_train = 0.7
n_train = round(prop_train * n)
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]
Prior Sampling
Let us sample from the prior distribution and compare it with our given data.
gp1_prior_samples = gp1.sample_y(X=X_train, n_samples=100)
fig, ax = plt.subplots()
for i in range(100):
    sns.lineplot(x=X_train[...,0], y=gp1_prior_samples[:, i], color=sns_c[1], alpha=0.2, ax=ax)
sns.lineplot(x=X_train[...,0], y=y_train[..., 0], color=sns_c[0], label='y1', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='GP1 Prior Samples', xlabel='t');
We indeed see that the prior specification seems plausible for this model.
Model Fit + Predictions
Let us fit the model and generate predictions.
gp1.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0,
kernel=WhiteKernel(noise_level=0.09) + 1.41**2 * ExpSineSquared(length_scale=1, periodicity=40),
n_restarts_optimizer=10, normalize_y=True)
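The fitted object stores the optimized kernel. As a small sketch (not part of the original notebook), we can read off the optimized hyperparameters and the corresponding log-marginal-likelihood:
# Optimized kernel and the log-marginal-likelihood at the optimum.
print(gp1.kernel_)
print(f'Log-marginal-likelihood = {gp1.log_marginal_likelihood_value_: 0.3f}')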
# Generate predictions.
y_pred, y_std = gp1.predict(X, return_std=True)
data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - 2*data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + 2*data_df['y_std']
We plot the predictions.
fig, ax = plt.subplots()
ax.fill_between(
x=data_df['t'],
y1=data_df['y_pred_lwr'],
y2=data_df['y_pred_upr'],
color=sns_c[2],
alpha=0.15,
label='credible_interval'
)
sns.lineplot(x='t', y='y1', data=data_df, color=sns_c[0], label = 'y1', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, color=sns_c[2], label='y_pred', ax=ax)
ax.axvline(n_train, color=sns_c[3], linestyle='--', label='train-test split')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Prediction Sample 1', xlabel='t', ylabel='');
Let us compute the \(R^2\) (coefficient of determination, which can be negative) and the Mean Absolute Error of the prediction on the train and test set.
from sklearn.metrics import mean_absolute_error
print(f'R2 Score Train = {gp1.score(X=X_train, y=y_train): 0.3f}')
print(f'R2 Score Test = {gp1.score(X=X_test, y=y_test): 0.3f}')
print(f'MAE Train = {mean_absolute_error(y_true=y_train, y_pred=gp1.predict(X_train)): 0.3f}')
print(f'MAE Test = {mean_absolute_error(y_true=y_test, y_pred=gp1.predict(X_test)): 0.3f}')
R2 Score Train = 0.961
R2 Score Test = 0.960
MAE Train = 0.228
MAE Test = 0.236
Errors
errors = gp1.predict(X_test) - y_test
errors = errors.flatten()
errors_mean = errors.mean()
errors_std = errors.std()
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.regplot(x=y_test.flatten(), y=gp1.predict(X_test).flatten(), ax=ax[0])
sns.distplot(a=errors, ax=ax[1])
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--', label=f'$\mu$')
ax[1].axvline(x=errors_mean + 2*errors_std, color=sns_c[4], linestyle='--', label=f'$\mu \pm 2\sigma$')
ax[1].axvline(x=errors_mean - 2*errors_std, color=sns_c[4], linestyle='--')
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--')
ax[1].legend()
ax[0].set(title='Model 1 - Test vs Predictions (Test Set)', xlabel='y_test', ylabel='y_pred');
ax[1].set(title='Model 1 - Errors', xlabel='error', ylabel=None);
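As an additional sanity check (a sketch, not part of the original notebook), we can compute the empirical coverage of the \(\pm 2\sigma\) band on the test set; for well-calibrated Gaussian predictions it should be close to 0.95.
# Empirical coverage of the +/- 2 standard deviation band on the test set.
test_df = data_df[n_train:]
coverage = np.mean(
    (test_df['y1'] >= test_df['y_pred_lwr']) & (test_df['y1'] <= test_df['y_pred_upr'])
)
print(f'Empirical coverage (test) = {coverage: 0.3f}')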
Example 2
In this example we add a linear trend component.
- Linear Trend Component
# Generate trend component.
def linear_trend(beta, x):
    """Scale vector by a scalar."""
    trend_comp = beta * x
    return trend_comp
data_df['tr1'] = data_df['t'].apply(lambda x : linear_trend(0.01, x))
# Add trend to target variable y_1.
data_df['y2'] = data_df['y1'] + data_df['tr1']
Let us plot the resulting series:
fig, ax = plt.subplots()
sns.lineplot(x='t', y='y2', data=data_df, color=sns_c[0], label='y2', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Sample Data 2', xlabel='t', ylabel='');
Define Model
For this second example we add an RBF kernel to model the trend component.
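For reference, the RBF kernel (as in the scikit-learn documentation) is

\[
k_{\text{RBF}}(x_i, x_j) = \exp\left(-\frac{(x_i - x_j)^2}{2\ell^2}\right),
\]

which, for a large length scale \(\ell\), varies slowly and can therefore capture a smooth trend.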
from sklearn.gaussian_process.kernels import RBF
k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))
k1 = ConstantKernel(constant_value=12) * \
ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))
k2 = ConstantKernel(constant_value=10, constant_value_bounds=(1e-2, 1e3)) * \
RBF(length_scale=1e2, length_scale_bounds=(1, 1e3))
kernel_2 = k0 + k1 + k2
# Define GaussianProcessRegressor object.
gp2 = GaussianProcessRegressor(
kernel=kernel_2,
n_restarts_optimizer=10,
normalize_y=True,
alpha=0.0
)
Split Data
We split the data as above.
y = data_df['y2'].values.reshape(n, 1)
y_train = y[:n_train]
y_test = y[n_train:]
Prior Sampling
gp2_prior_samples = gp2.sample_y(X=X_train, n_samples=100)
fig, ax = plt.subplots()
for i in range(100):
    sns.lineplot(x=X_train[...,0], y=gp2_prior_samples[:, i], color=sns_c[1], alpha=0.2, ax=ax)
sns.lineplot(x=X_train[...,0], y=y_train[..., 0], color=sns_c[0], label='y2', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='GP2 Prior Samples', xlabel='t');
Model Fit + Predictions
gp2.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0,
kernel=WhiteKernel(noise_level=0.09) + 3.46**2 * ExpSineSquared(length_scale=1, periodicity=40) + 3.16**2 * RBF(length_scale=100),
n_restarts_optimizer=10, normalize_y=True)
# Generate predictions.
y_pred, y_std = gp2.predict(X, return_std=True)
data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - 2*data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + 2*data_df['y_std']
We plot the predictions.
fig, ax = plt.subplots()
ax.fill_between(
x=data_df['t'],
y1=data_df['y_pred_lwr'],
y2=data_df['y_pred_upr'],
color=sns_c[2],
alpha=0.15,
label='credible_interval'
)
sns.lineplot(x='t', y='y2', data=data_df, color=sns_c[0], label = 'y2', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, color=sns_c[2], label='y_pred', ax=ax)
ax.axvline(n_train, color=sns_c[3], linestyle='--', label='train-test split')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Prediction Sample 2', xlabel='t', ylabel='');
Let us compute the metrics:
print(f'R2 Score Train = {gp2.score(X=X_train, y=y_train): 0.3f}')
print(f'R2 Score Test = {gp2.score(X=X_test, y=y_test): 0.3f}')
print(f'MAE Train = {mean_absolute_error(y_true=y_train, y_pred=gp2.predict(X_train)): 0.3f}')
print(f'MAE Test = {mean_absolute_error(y_true=y_test, y_pred=gp2.predict(X_test)): 0.3f}')
R2 Score Train = 0.987
R2 Score Test = 0.958
MAE Train = 0.229
MAE Test = 0.274
errors = gp2.predict(X_test) - y_test
errors = errors.flatten()
errors_mean = errors.mean()
errors_std = errors.std()
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.regplot(x=y_test.flatten(), y=gp2.predict(X_test).flatten(), ax=ax[0])
sns.distplot(a=errors, ax=ax[1])
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--', label=f'$\mu$')
ax[1].axvline(x=errors_mean + 2*errors_std, color=sns_c[4], linestyle='--', label=f'$\mu \pm 2\sigma$')
ax[1].axvline(x=errors_mean - 2*errors_std, color=sns_c[4], linestyle='--')
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--')
ax[1].legend()
ax[0].set(title='Model 2 - Test vs Predictions (Test Set)', xlabel='y_test', ylabel='y_pred');
ax[1].set(title='Model 2 - Errors', xlabel='error', ylabel=None);
Example 3
In this third example we add a second seasonal component.
- Add Other Seasonal Component
# Create other seasonal component.
data_df['s2'] = data_df['t'].apply(lambda t : seasonal(t, amplitude=1, period=13.3))
# Add to y_2.
data_df['y3'] = data_df['y2'] + data_df['s2']
fig, ax = plt.subplots()
sns.lineplot(x='t', y='y3', data=data_df, color=sns_c[0], label='y3', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Sample Data 3', xlabel='t', ylabel='');
Define Model
We add another ExpSineSquared kernel to the kernel of Example 2.
k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))
k1 = ConstantKernel(constant_value=2) * \
ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))
k2 = ConstantKernel(constant_value=25, constant_value_bounds=(1e-2, 1e3)) * \
RBF(length_scale=100.0, length_scale_bounds=(1, 1e4))
k3 = ConstantKernel(constant_value=1) * \
ExpSineSquared(length_scale=1.0, periodicity=12, periodicity_bounds=(10, 15))
kernel_3 = k0 + k1 + k2 + k3
# Define GaussianProcessRegressor object.
gp3 = GaussianProcessRegressor(
kernel=kernel_3,
n_restarts_optimizer=10,
normalize_y=True,
alpha=0.0
)
Split Data
y = data_df['y3'].values.reshape(n, 1)
y_train = y[:n_train]
y_test = y[n_train:]
Prior Sampling
gp3_prior_samples = gp3.sample_y(X=X_train, n_samples=100)
fig, ax = plt.subplots()
for i in range(100):
    sns.lineplot(x=X_train[...,0], y=gp3_prior_samples[:, i], color=sns_c[1], alpha=0.2, ax=ax)
sns.lineplot(x=X_train[...,0], y=y_train[..., 0], color=sns_c[0], label='y3', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='GP3 Prior Samples', xlabel='t');
Model Fit + Predictions
gp3.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0,
kernel=WhiteKernel(noise_level=0.09) + 1.41**2 * ExpSineSquared(length_scale=1, periodicity=40) + 5**2 * RBF(length_scale=100) + 1**2 * ExpSineSquared(length_scale=1, periodicity=12),
n_restarts_optimizer=10, normalize_y=True)
y_pred, y_std = gp3.predict(X, return_std=True)
data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - 2*data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + 2*data_df['y_std']
We plot the predictions:
fig, ax = plt.subplots()
ax.fill_between(
x=data_df['t'],
y1=data_df['y_pred_lwr'],
y2=data_df['y_pred_upr'],
color=sns_c[2],
alpha=0.15,
label='credible_interval'
)
sns.lineplot(x='t', y='y3', data=data_df, color=sns_c[0], label = 'y3', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, color=sns_c[2], label='y_pred', ax=ax)
ax.axvline(n_train, color=sns_c[3], linestyle='--', label='train-test split')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Prediction Sample Data 3', xlabel='t', ylabel='');
print(f'R2 Score Train = {gp3.score(X=X_train, y=y_train): 0.3f}')
print(f'R2 Score Test = {gp3.score(X=X_test, y=y_test): 0.3f}')
print(f'MAE Train = {mean_absolute_error(y_true=y_train, y_pred=gp3.predict(X_train)): 0.3f}')
print(f'MAE Test = {mean_absolute_error(y_true=y_test, y_pred=gp3.predict(X_test)): 0.3f}')
R2 Score Train = 0.989
R2 Score Test = 0.960
MAE Train = 0.211
MAE Test = 0.293
Errors
errors = gp3.predict(X_test) - y_test
errors = errors.flatten()
errors_mean = errors.mean()
errors_std = errors.std()
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.regplot(x=y_test.flatten(), y=gp3.predict(X_test).flatten(), ax=ax[0])
sns.distplot(a=errors, ax=ax[1])
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--', label=f'$\mu$')
ax[1].axvline(x=errors_mean + 2*errors_std, color=sns_c[4], linestyle='--', label=f'$\mu \pm 2\sigma$')
ax[1].axvline(x=errors_mean - 2*errors_std, color=sns_c[4], linestyle='--')
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--')
ax[1].legend()
ax[0].set(title='Model 3 - Test vs Predictions (Test Set)', xlabel='y_test', ylabel='y_pred');
ax[1].set(title='Model 3 - Errors', xlabel='error', ylabel=None);
Example 4
In this last example we consider a non-linear trend component.
- Non-Linear Trend Component
# Generate trend component.
def non_linear_trend(x):
    """Scale and take square root."""
    trend_comp = 0.2 * np.power(x, 1/2)
    return trend_comp
# Compute non-linear trend.
data_df['tr2'] = data_df['t'].apply(non_linear_trend)
# Add trend to target variable.
data_df['y4'] = data_df['y3'] + data_df['tr2']
Let us plot the resulting series:
fig, ax = plt.subplots()
sns.lineplot(x='t', y='y4', data=data_df, color=sns_c[0], label='y4', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Sample Data 4', xlabel='t', ylabel='');
Define Model
Instead of an RBF kernel to model the trend, we use a RationalQuadratic kernel, which can be seen as a scale mixture (an infinite sum) of RBF kernels with different characteristic length scales.
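For reference, the RationalQuadratic kernel (as in the scikit-learn documentation) is

\[
k_{\text{RQ}}(x_i, x_j) = \left(1 + \frac{(x_i - x_j)^2}{2\alpha\ell^2}\right)^{-\alpha},
\]

with length scale \(\ell\) and scale-mixture parameter \(\alpha > 0\). As \(\alpha \to \infty\), it converges to the RBF kernel with length scale \(\ell\).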
from sklearn.gaussian_process.kernels import RationalQuadratic
k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))
k1 = ConstantKernel(constant_value=2) * \
ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))
k2 = ConstantKernel(constant_value=100, constant_value_bounds=(1, 500)) * \
RationalQuadratic(length_scale=500, length_scale_bounds=(1, 1e4), alpha=50.0, alpha_bounds=(1, 1e3))
k3 = ConstantKernel(constant_value=1) * \
ExpSineSquared(length_scale=1.0, periodicity=12, periodicity_bounds=(10, 15))
kernel_4 = k0 + k1 + k2 + k3
# Define GaussianProcessRegressor object.
gp4 = GaussianProcessRegressor(
kernel=kernel_4,
n_restarts_optimizer=10,
normalize_y=True,
alpha=0.0
)
Split Data
y = data_df['y4'].values.reshape(n, 1)
y_train = y[:n_train]
y_test = y[n_train:]
Prior Sampling
gp4_prior_samples = gp4.sample_y(X=X_train, n_samples=100)
fig, ax = plt.subplots()
for i in range(100):
    sns.lineplot(x=X_train[...,0], y=gp4_prior_samples[:, i], color=sns_c[1], alpha=0.2, ax=ax)
sns.lineplot(x=X_train[...,0], y=y_train[..., 0], color=sns_c[0], label='y4', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='GP4 Prior Samples', xlabel='t');
Model Fit + Predictions
gp4.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0,
kernel=WhiteKernel(noise_level=0.09) + 1.41**2 * ExpSineSquared(length_scale=1, periodicity=40) + 10**2 * RationalQuadratic(alpha=50, length_scale=500) + 1**2 * ExpSineSquared(length_scale=1, periodicity=12),
n_restarts_optimizer=10, normalize_y=True)
y_pred, y_std = gp4.predict(X, return_std=True)
data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - 2*data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + 2*data_df['y_std']
Finally, let us plot the prediction:
fig, ax = plt.subplots()
ax.fill_between(
x=data_df['t'],
y1=data_df['y_pred_lwr'],
y2=data_df['y_pred_upr'],
color=sns_c[2],
alpha=0.15,
label='credible_interval'
)
sns.lineplot(x='t', y='y4', data=data_df, color=sns_c[0], label = 'y4', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, color=sns_c[2], label='y_pred', ax=ax)
ax.axvline(n_train, color=sns_c[3], linestyle='--', label='train-test split')
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Prediction Sample Data 4', xlabel='t', ylabel='');
print(f'R2 Score Train = {gp4.score(X=X_train, y=y_train): 0.3f}')
print(f'R2 Score Test = {gp4.score(X=X_test, y=y_test): 0.3f}')
print(f'MAE Train = {mean_absolute_error(y_true=y_train, y_pred=gp4.predict(X_train)): 0.3f}')
print(f'MAE Test = {mean_absolute_error(y_true=y_test, y_pred=gp4.predict(X_test)): 0.3f}')
R2 Score Train = 0.993
R2 Score Test = 0.957
MAE Train = 0.238
MAE Test = 0.320
Errors
errors = gp4.predict(X_test) - y_test
errors = errors.flatten()
errors_mean = errors.mean()
errors_std = errors.std()
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.regplot(x=y_test.flatten(), y=gp4.predict(X_test).flatten(), ax=ax[0])
sns.distplot(a=errors, ax=ax[1])
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--', label=f'$\mu$')
ax[1].axvline(x=errors_mean + 2*errors_std, color=sns_c[4], linestyle='--', label=f'$\mu \pm 2\sigma$')
ax[1].axvline(x=errors_mean - 2*errors_std, color=sns_c[4], linestyle='--')
ax[1].axvline(x=errors_mean, color=sns_c[3], linestyle='--')
ax[1].legend()
ax[0].set(title='Model 4 - Test vs Predictions (Test Set)', xlabel='y_test', ylabel='y_pred');
ax[1].set(title='Model 4 - Errors', xlabel='error', ylabel=None);
Again, the aim of this notebook was to give a flavor of how to use Gaussian processes for time series forecasting. I strongly encourage you to go through the references presented above for a deeper and much more complete exposition of this subject.