
PyData Berlin 2019: Gaussian Processes for Timeseries Forecasting

In this notebook we run some experiments to demonstrate how we can use Gaussian Processes in the context of time series forecasting.

This material is part of a talk on Gaussian Process for Time Series Analysis presented at the PyCon DE & PyData 2019 Conference in Berlin.

Video

Slides

Here you can find the slides of the talk.

Suggestions and comments are always welcome!

References:

Prepare Notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
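
Since we draw random noise below, it can be convenient to fix the NumPy seed so that runs are reproducible (the original notebook does not set one, so the exact figures and scores may differ from run to run; the seed value is arbitrary):

# Fix the random seed for reproducibility.
np.random.seed(42)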

Example 1

In this first example we consider a seasonal stationary time series.

Generate Data

  • Time Variable

Let us define a “time” variable \(t\in \mathbb{N}\).

# Number of samples. 
n = 1000
# Generate "time" variable. 
t = np.arange(n)
data_df = pd.DataFrame({'t' : t})

  • Seasonal Component

Let us define a function to generate seasonal components (Fourier modes).

# Generate seasonal variables. 
def seasonal(t, amplitude, period):
    """Generate a sinusoidal curve."""
    y1 = amplitude * np.sin((2*np.pi)*t/period) 
    return y1

# Add two seasonal components. 
data_df['s1'] = data_df['t'].apply(lambda t : seasonal(t, amplitude=2, period=40))

# Define target variable. 
data_df['y1'] = data_df['s1']

Let us plot this seasonal variable:

plt.rcParams['figure.figsize'] = [13, 8]

fig, ax = plt.subplots()
sns.lineplot(x='t', y='s1', data=data_df, label='s1', ax=ax) 
ax.set(title='Seasonal Component', xlabel='t', ylabel='')
ax.legend(loc='lower left');
png

  • Gaussian Noise

Finally, we add some noise.

# Set noise standard deviation. 
sigma_n = 0.3
data_df['epsilon'] = np.random.normal(loc=0, scale=sigma_n, size=n)
# Add noise to target variable. 
data_df['y1'] = data_df['y1'] + data_df['epsilon']

Let us plot the resulting data:

fig, ax = plt.subplots()
sns.lineplot(x='t', y='y1', data=data_df, label='y1', ax=ax) 
ax.set(title='Sample Data 1', xlabel='t', ylabel='')
ax.legend(loc='lower left');
png

Define Model

Let us now define the kernel of the Gaussian process model. We include the following kernel components (recall that the sum of kernels is again a kernel):

  • A WhiteKernel to account for the noise.
  • An ExpSineSquared kernel, scaled by a ConstantKernel, to model the periodic component.

We add bounds to the kernel hyperparameters, which are optimized by maximizing the log-marginal-likelihood (see the documentation).
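
For reference, scikit-learn parametrizes these two kernels as

\[
k_{\text{White}}(x, x') = \sigma_n^2 \, \delta_{x, x'},
\qquad
k_{\text{ExpSineSquared}}(x, x') = \exp\left(-\frac{2 \sin^2\big(\pi |x - x'| / p\big)}{\ell^2}\right),
\]

where \(\sigma_n^2\) is the noise_level, \(p\) the periodicity and \(\ell\) the length_scale.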

from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared, ConstantKernel

k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))

k1 = ConstantKernel(constant_value=2) * \
    ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))

kernel_1 = k0 + k1
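
Kernels in scikit-learn are callable, so a quick sanity check (a sketch, not in the original notebook) is to evaluate the composite kernel on a few time points and inspect the resulting covariance matrix:

# Evaluate the composite kernel on the first three time points.
# The WhiteKernel contributes the noise variance on the diagonal.
print(kernel_1(t[:3].reshape(-1, 1)))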

Next we initialize the GaussianProcessRegressor object.

from sklearn.gaussian_process import GaussianProcessRegressor

gp1 = GaussianProcessRegressor(
    kernel=kernel_1, 
    n_restarts_optimizer=10, 
    normalize_y=True,
    alpha=0.0
)

Split Data

We now prepare and split the data for the model.

X = data_df['t'].values.reshape(n, 1)
y = data_df['y1'].values.reshape(n, 1)

prop_train = 0.7
n_train = round(prop_train*n)

X_train = X[:n_train]
y_train = y[:n_train]

X_test = X[n_train:]
y_test = y[n_train:]
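
Note that we split chronologically rather than at random, which is the appropriate choice for forecasting. As an aside (not part of the original talk), scikit-learn's TimeSeriesSplit implements this kind of ordered splitting for cross-validation:

from sklearn.model_selection import TimeSeriesSplit

# Each split trains only on observations that precede the test fold.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print(f'train: [0, {train_idx[-1]}], test: [{test_idx[0]}, {test_idx[-1]}]')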

Model Fit + Predictions

Let us fit the model and generate predictions.

gp1.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0, copy_X_train=True,
                         kernel=WhiteKernel(noise_level=0.09) + 1.41**2 * ExpSineSquared(length_scale=1, periodicity=40),
                         n_restarts_optimizer=10, normalize_y=True,
                         optimizer='fmin_l_bfgs_b', random_state=None)
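
After fitting, the kernel hyperparameters are set to the values found by maximizing the log-marginal-likelihood. It can be instructive to inspect them (the exact values depend on the run):

# Optimized kernel and corresponding log-marginal-likelihood.
print(gp1.kernel_)
print(gp1.log_marginal_likelihood_value_)
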
# Generate predictions.
y_pred, y_std = gp1.predict(X, return_std=True)

data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + data_df['y_std']
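
The band defined above is the mean plus/minus one predictive standard deviation, i.e. roughly a 68% interval under the Gaussian predictive distribution. For a ~95% band one would instead use (a sketch; the column names are just for illustration):

# ~95% interval under the Gaussian predictive distribution.
data_df['y_pred_lwr_95'] = data_df['y_pred'] - 1.96 * data_df['y_std']
data_df['y_pred_upr_95'] = data_df['y_pred'] + 1.96 * data_df['y_std']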

We plot the predictions.

fig, ax = plt.subplots()

ax.fill_between(
    x=data_df['t'], 
    y1=data_df['y_pred_lwr'], 
    y2=data_df['y_pred_upr'], 
    color='black', 
    alpha=0.15, 
    label='credible_interval'
)

sns.lineplot(x='t', y='y1', data=data_df, label='y1', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, label='y_pred', color='black', ax=ax)

ax.axvline(n_train, color='red', linestyle='--', label='train_test_split')
ax.set(title='Prediction Sample 1', xlabel='t', ylabel='')
ax.legend(loc='upper left');
png

Let us compute the \(R^2\) of the prediction on the test set.

gp1.score(X=X_test, y=y_test)
0.9569549925482237
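
Besides \(R^2\), one might also report error metrics on the test set (a sketch using sklearn.metrics; the values will vary with the run):

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Point predictions on the test set.
y_pred_test = gp1.predict(X_test)
print(f'MAE:  {mean_absolute_error(y_test, y_pred_test):0.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test)):0.3f}')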

Example 2

In this example we add a linear trend component.

  • Linear Trend Component

# Generate trend component.
def linear_trend(beta, x):
    """Scale vector by a scalar."""
    trend_comp = beta * x 
    return trend_comp

data_df['tr1'] = data_df['t'].apply(lambda x : linear_trend(0.01, x))
# Add trend to target variable y_1. 
data_df['y2'] = data_df['y1'] + data_df['tr1']

Let us see the trend plot:

fig, ax = plt.subplots()
sns.lineplot(x='t', y='y2', data=data_df, label='y2', ax=ax)
ax.set(title='Sample Data 2', xlabel='t', ylabel='')
ax.legend(loc='upper left');
png

Define Model

For this second example we add an RBF kernel to model the trend component.
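
Recall the form of the RBF kernel, \(k_{\text{RBF}}(x, x') = \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)\): with a long length_scale \(\ell\), correlations decay slowly, so this component can capture the slowly varying trend.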

from sklearn.gaussian_process.kernels import RBF

k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))

k1 = ConstantKernel(constant_value=2) * \
    ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))

k2 = ConstantKernel(constant_value=10, constant_value_bounds=(1e-2, 1e3)) * \
    RBF(length_scale=100.0, length_scale_bounds=(1, 1e4)) 

kernel_2 = k0 + k1 + k2
# Define GaussianProcessRegressor object. 
gp2 = GaussianProcessRegressor(
    kernel=kernel_2, 
    n_restarts_optimizer=10, 
    normalize_y=True,
    alpha=0.0
)

Split Data

We split the data as above.

y = data_df['y2'].values.reshape(n, 1)
y_train = y[:n_train]
y_test = y[n_train:]

Model Fit + Predictions

gp2.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0, copy_X_train=True,
                         kernel=WhiteKernel(noise_level=0.09) + 1.41**2 * ExpSineSquared(length_scale=1, periodicity=40) + 3.16**2 * RBF(length_scale=100),
                         n_restarts_optimizer=10, normalize_y=True,
                         optimizer='fmin_l_bfgs_b', random_state=None)
# Generate predictions.
y_pred, y_std = gp2.predict(X, return_std=True)

data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + data_df['y_std']

We plot the predictions.

fig, ax = plt.subplots()

ax.fill_between(
    x=data_df['t'], 
    y1=data_df['y_pred_lwr'], 
    y2=data_df['y_pred_upr'], 
    color='black', 
    alpha=0.15, 
    label='credible_interval'
)

sns.lineplot(x='t', y='y2', data=data_df, label='y2', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, label='y_pred', color='black', ax=ax)

ax.axvline(n_train, color='red', linestyle='--', label='train_test_split')
ax.set(title='Prediction Sample 2', xlabel='t', ylabel='')
ax.legend(loc='upper left');
png

Let us compute the \(R^2\) of the prediction on the test set for this second model.

gp2.score(X=X_test, y=y_test)
0.9498746360272532

Example 3

In this third example we add a second seasonal component.

  • Add Other Seasonal Component

# Create other seasonal component.
data_df['s2'] = data_df['t'].apply(lambda t : seasonal(t, amplitude=1, period=13.3))
# Add to y_2.
data_df['y3'] = data_df['y2'] + data_df['s2']

fig, ax = plt.subplots()
sns.lineplot(x='t', y='y3', data=data_df, label='y3', ax=ax) 
ax.set(title='Sample Data 3', xlabel='t', ylabel='')
ax.legend(loc='upper left');
png

Define Model

We add another ExpSineSquared kernel to the kernel of Example 2.

k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))

k1 = ConstantKernel(constant_value=2) * \
    ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))

k2 = ConstantKernel(constant_value=10, constant_value_bounds=(1e-2, 1e3)) * \
    RBF(length_scale=100.0, length_scale_bounds=(1, 1e4))

k3 = ConstantKernel(constant_value=1) * \
    ExpSineSquared(length_scale=1.0, periodicity=12, periodicity_bounds=(10, 15))

kernel_3 = k0 + k1 + k2 + k3
# Define GaussianProcessRegressor object. 
gp3 = GaussianProcessRegressor(
    kernel=kernel_3, 
    n_restarts_optimizer=10, 
    normalize_y=True,
    alpha=0.0
)

Split Data

y = data_df['y3'].values.reshape(n, 1)
y_train = y[:n_train]
y_test = y[n_train:]

Model Fit + Predictions

gp3.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0, copy_X_train=True,
                         kernel=WhiteKernel(noise_level=0.09) + 1.41**2 * ExpSineSquared(length_scale=1, periodicity=40) + 3.16**2 * RBF(length_scale=100) + 1**2 * ExpSineSquared(length_scale=1, periodicity=12),
                         n_restarts_optimizer=10, normalize_y=True,
                         optimizer='fmin_l_bfgs_b', random_state=None)
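
Recall that the second seasonal component was generated with period 13.3, while the corresponding kernel was initialized at periodicity=12 with bounds (10, 15). One can check whether the optimizer recovered it (a sketch; exact values depend on the run):

# Inspect the optimized kernel; the periodicity of the last
# ExpSineSquared component should move towards the true value 13.3.
print(gp3.kernel_)
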
y_pred, y_std = gp3.predict(X, return_std=True)

data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + data_df['y_std']

We plot the predictions:

fig, ax = plt.subplots()

ax.fill_between(
    x=data_df['t'], 
    y1=data_df['y_pred_lwr'], 
    y2=data_df['y_pred_upr'], 
    color='black', 
    alpha=0.15, 
    label='credible_interval'
)

sns.lineplot(x='t', y='y3', data=data_df, label='y3', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, label='y_pred', color='black', ax=ax)

ax.axvline(n_train, color='red', linestyle='--', label='train_test_split')
ax.set(title='Prediction Sample Data 3', xlabel='t', ylabel='')
ax.legend(loc='upper left');
png

# Compute R-squared. 
gp3.score(X=X_test, y=y_test)
0.9270716869235976

Example 4

In this last example we consider a non-linear trend component.

  • Non-Linear Trend Component

# Generate trend component.
def non_linear_trend(x):
    """Scale and take square root."""
    trend_comp = 0.2 * np.power(x, 1/2)
    return trend_comp

# Compute non-linear trend. 
data_df['tr2'] = data_df['t'].apply(non_linear_trend)

# Add trend to target variable.
data_df['y4'] = data_df['y3'] + data_df['tr2']

Let us see the trend plot:

fig, ax = plt.subplots()
sns.lineplot(x='t', y='y4', data=data_df, ax=ax)
ax.set(title='Sample Data 4', xlabel='t', ylabel='');
png

Define Model

Instead of an RBF kernel to model the trend, we use a RationalQuadratic kernel, which can be seen as a scale mixture (an infinite sum) of RBF kernels with different characteristic length-scales.
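
Concretely, \(k_{\text{RQ}}(x, x') = \left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}\); as \(\alpha \to \infty\) it converges to the RBF kernel with length-scale \(\ell\), while finite \(\alpha\) mixes several length-scales, which is useful for the non-linear trend.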

from sklearn.gaussian_process.kernels import RationalQuadratic

k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))

k1 = ConstantKernel(constant_value=2) * \
    ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))

k2 = ConstantKernel(constant_value=10) * \
    RationalQuadratic(length_scale=500, alpha=50.0, alpha_bounds=(1, 1e3))

k3 = ConstantKernel(constant_value=1) * \
    ExpSineSquared(length_scale=1.0, periodicity=12, periodicity_bounds=(10, 15))

kernel_4 = k0 + k1 + k2 + k3
# Define GaussianProcessRegressor object. 
gp4 = GaussianProcessRegressor(
    kernel=kernel_4, 
    n_restarts_optimizer=10, 
    normalize_y=True,
    alpha=0.0
)

Split Data

y = data_df['y4'].values.reshape(n, 1)
y_train = y[:n_train]
y_test = y[n_train:]

Model Fit + Predictions

gp4.fit(X_train, y_train)
GaussianProcessRegressor(alpha=0.0, copy_X_train=True,
                         kernel=WhiteKernel(noise_level=0.09) + 1.41**2 * ExpSineSquared(length_scale=1, periodicity=40) + 3.16**2 * RationalQuadratic(alpha=50, length_scale=500) + 1**2 * ExpSineSquared(length_scale=1, periodicity=12),
                         n_restarts_optimizer=10, normalize_y=True,
                         optimizer='fmin_l_bfgs_b', random_state=None)
y_pred, y_std = gp4.predict(X, return_std=True)

data_df['y_pred'] = y_pred
data_df['y_std'] = y_std
data_df['y_pred_lwr'] = data_df['y_pred'] - data_df['y_std']
data_df['y_pred_upr'] = data_df['y_pred'] + data_df['y_std']

Finally, let us plot the prediction:

fig, ax = plt.subplots()

ax.fill_between(
    x=data_df['t'], 
    y1=data_df['y_pred_lwr'], 
    y2=data_df['y_pred_upr'], 
    color='black', 
    alpha=0.15, 
    label='credible_interval'
)

sns.lineplot(x='t', y='y4', data=data_df, label='y4', ax=ax)
sns.lineplot(x='t', y='y_pred', data=data_df, label='y_pred', color='black', ax=ax)

ax.axvline(n_train, color='red', linestyle='--', label='train_test_split')
ax.set(title='Prediction Sample Data 4', xlabel='t', ylabel='')
ax.legend(loc='upper left');
png
# Compute R-squared. 
gp4.score(X=X_test, y=y_test)
0.9455375317908682
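
So far we have only predicted over the observed range of \(t\). As a final experiment (not in the original notebook), one could forecast beyond it; note that the credible band should widen quickly as we move away from the training data:

# Forecast 100 steps beyond the observed range (a sketch).
X_future = np.arange(n, n + 100).reshape(-1, 1)
y_future, y_future_std = gp4.predict(X_future, return_std=True)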

Again, the aim of this notebook was to give a flavour of how to use Gaussian processes for time series forecasting. I strongly encourage you to go to the references presented above to get a deeper and much more complete exposition of this subject.