
Exploring TensorFlow Probability STS Forecasting

In this notebook we explore the Structural Time Series (STS) module of TensorFlow Probability. We closely follow the use cases presented in their Medium blog post. As described there, an STS model expresses an observed time series as the sum of simpler components 1:

\[ f(t) = \sum_{k=1}^{N}f_{k}(t) + \varepsilon, \quad \text{where}\quad \varepsilon \sim N(0, \sigma^2). \]

Each summand \(f_{k}(t)\) has a particular structure, e.g. specific seasonality, trend, autoregressive terms, etc.
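
For instance, a local linear trend component (which we use later in this notebook) can be written, following the TensorFlow Probability documentation, as a random walk on a level and a slope:

\[ \begin{align} f_{k}(t) =& \: \text{level}_{t}, \\ \text{level}_{t} =& \: \text{level}_{t-1} + \text{slope}_{t-1} + \eta_{t}, \quad \eta_{t} \sim N(0, \sigma_{\text{level}}^2), \\ \text{slope}_{t} =& \: \text{slope}_{t-1} + \xi_{t}, \quad \xi_{t} \sim N(0, \sigma_{\text{slope}}^2). \end{align} \]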

In this notebook we generate a time series sample and then present some techniques to recover its component structure. This is indeed a crucial step since in real applications we are not given the components separately. We then show how to fit and generate predictions using variational inference. Finally we run some diagnostics of the errors on the test set.

Prepare Notebook

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid', {'axes.facecolor': '.9'})
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
%matplotlib inline

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

import tensorflow as tf
import tensorflow_probability as tfp
print(tf.__version__)
print(tfp.__version__)
2.1.0
0.8.0-rc0

Generate Data

We are going to generate daily data over the range 2017-01-01 to 2020-12-31.

# Create dataframe with a date range (4 years).
date_range = pd.date_range(start='2017-01-01', end='2020-12-31', freq='D')
# Create data frame.
df = pd.DataFrame(data={'date': date_range})
# Extract number of points.
n = df.shape[0]

We begin with Gaussian noise.

# Set seed.
np.random.seed(seed=42)

df['y'] = np.random.normal(loc=0.0, scale=0.5, size=n)

Next we define an external regressor x.

# External regressor:
df['x'] = np.random.uniform(low=0.0, high=1.0, size=n)
df['x'] = df['x'].apply(lambda x: x if abs(x) > 0.95 else 0.0)

df['y'] = df['y'] + 6*df['x']
plt.rcParams['figure.figsize'] = [12, 7]

fig, ax = plt.subplots()
sns.lineplot(x='date', y='x', label='x', data=df, ax=ax)
ax.legend(loc='upper left')
ax.set(title=r'External Regressor $x$');
png

Now, let us add a non-linear trend component of the form \(t\longmapsto t^{1/3}\).

df['y'] = df['y'] + np.power(df.index.values, 1/3)

Finally, we add some seasonal variables, which we encode as cyclic variables using \(\sin(z)\) and \(\cos(z)\) functions.

# Seasonal features:
df['day_of_month'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['daysinmonth'] = df['date'].dt.daysinmonth


df['y'] = df['y'] \
    + 2*np.cos(2*np.pi*df['month']/12) + 0.5*np.sin(2*np.pi*df['month']/12)\
    + 1.5*np.cos(2*np.pi*df['day_of_week']/7) \
    + 2*np.sin(2*np.pi*df['day_of_month']/ df['daysinmonth'])

df['y'] = df['y'].fillna(method='backfill').astype(np.float32)

df.head(10)
date y x day_of_month month day_of_week daysinmonth
0 2017-01-01 3.568240 0.00000 1 1 6 31
1 2017-01-02 5.201631 0.00000 2 1 0 31
2 2017-01-03 5.643587 0.00000 3 1 1 31
3 2017-01-04 11.252480 0.99181 4 1 2 31
4 2017-01-05 3.798210 0.00000 5 1 3 31
5 2017-01-06 4.099009 0.00000 6 1 4 31
6 2017-01-07 6.231933 0.00000 7 1 5 31
7 2017-01-08 7.211367 0.00000 8 1 6 31
8 2017-01-09 7.183468 0.00000 9 1 0 31
9 2017-01-10 7.064259 0.00000 10 1 1 31

Let us plot the generated data:

fig, ax = plt.subplots()
sns.lineplot(x='date', y='y', label='y', data=df, ax=ax)
ax.legend(loc='upper left')
ax.set(title='Dependent Variable');
png

Train - Test Split

We assume we have data up to 2020-06-30 and we will predict six months ahead. We also assume we have access to the x variable over the forecast horizon (e.g. a media spend plan).

threshold_date = pd.to_datetime('2020-07-01')
mask = df['date'] < threshold_date

# Copy the slices so that we can add new columns (e.g. predictions) later.
df_train = df[mask].copy()
df_test = df[~mask].copy()
fig, ax = plt.subplots()
sns.lineplot(x='date', y='y', label='y_train', data=df_train, ax=ax)
sns.lineplot(x='date', y='y', label='y_test', data=df_test, ax=ax)
ax.axvline(threshold_date, color=sns_c[3], linestyle='--', label='train test split')
ax.legend(loc='upper left')
ax.set(title='Dependent Variable');
png

Time Series Exploratory Analysis

We are going to assume we do not know how the data was generated. We present a sample of techniques to extract the seasonality and estimate the effect of the x variable on y. There is no unique way to do this; we focus on using intuition and common sense to extract the relevant features. Time series forecasting can be a very challenging problem, and spending time exploring the data is always a good investment for building meaningful models.

Warning: We only use the training data in this step so that we do not leak information from the test set.

Smoothing

We begin the analysis by studying and visualizing various levels of smoothness using a gaussian_filter. You can think of it as a weighted, centered moving average with weights given by a normal density. The level of smoothness is controlled by the parameter sigma, which represents the standard deviation of that density.
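
As a quick sanity check of this interpretation, the following sketch (our own illustration, assuming scipy's default kernel truncation at four standard deviations and its default 'reflect' boundary mode) reproduces the filter as an explicit convolution with normalized Gaussian weights:

import numpy as np
from scipy.ndimage import gaussian_filter

sigma = 3.5
radius = int(4.0 * sigma + 0.5)  # scipy's default kernel truncation radius.
t = np.arange(-radius, radius + 1)
weights = np.exp(-0.5 * (t / sigma) ** 2)
weights = weights / weights.sum()  # normalized Gaussian weights.

y_vals = df_train['y'].values
# Pad symmetrically to mimic the filter's boundary handling, then take the
# weighted centered moving average via a convolution.
manual_smooth = np.convolve(np.pad(y_vals, radius, mode='symmetric'), weights, mode='valid')
# The difference should be negligible (floating point error only).
print(np.abs(manual_smooth - gaussian_filter(y_vals, sigma=sigma)).max())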

from scipy.ndimage import gaussian_filter

df_smooth = df_train \
    .assign(y_smooth_1 = lambda x: gaussian_filter(input=x['y'], sigma=3.5)) \
    .assign(y_smooth_2 = lambda x: gaussian_filter(input=x['y'], sigma=15))

fig, ax = plt.subplots()
sns.lineplot(x='date', y='y', label='y_train', data=df_smooth, alpha=0.5, ax=ax)
sns.lineplot(x='date', y='y_smooth_1', label='y_train_smooth monthly', data=df_smooth, ax=ax)
sns.lineplot(x='date', y='y_smooth_2', label='y_train_smooth yearly', data=df_smooth, ax=ax)
ax.legend(loc='upper left')
ax.set(title='Dependent Variable - Smoothing', ylabel='y');
png

From this plot we see two clear seasonalities: yearly and monthly. The variable y_train_smooth yearly also includes a positive trend component, which does not look linear.

To understand how to model the monthly seasonality we plot the mean of y over the day of the month.

fig, ax = plt.subplots()
df_train.groupby('day_of_month').agg({'y': np.mean}).plot(ax=ax)
ax.set(title='Day of the month mean');
png

This suggests that a good first approximation is to model this component via a cyclic variable of the form \(z\longmapsto \sin(z)\).

Remove yearly and monthly seasonality

Let us remove these two seasonal components:

# Remove yearly seasonality.
y_no_year_season = df_smooth['y'] - df_smooth['y_smooth_2']
# Remove monthly seasonality.
y_no_year_month_season = y_no_year_season \
    - gaussian_filter(input=y_no_year_season, sigma=3.5)

# Plot components.
fig, ax = plt.subplots(3, 1, figsize=(12, 16))
sns.lineplot(x='date', y='y', label='y_train', data=df_smooth, alpha=0.5, ax=ax[0])
sns.lineplot(x='date', y='y_smooth_1', label='y_train_smooth monthly', data=df_smooth, ax=ax[0])
sns.lineplot(x='date', y='y_smooth_2', label='y_train_smooth yearly', data=df_smooth, ax=ax[0])
ax[0].set(title='y')

ax[1].plot(df_smooth['date'], y_no_year_season, c=sns_c[2])
ax[1].axhline(y_no_year_season.mean(), color=sns_c[3], linestyle='--', label='mean')
ax[1].legend()
ax[1].set(title='y no year seasonality')

ax[2].plot(df_smooth['date'], y_no_year_month_season, c=sns_c[1])
ax[2].axhline(y_no_year_month_season.mean(), color=sns_c[3], linestyle='--', label='mean')
ax[2].legend()
ax[2].set(title='y no year & month seasonality');

png

ACF and PACF

Let us now compute the autocorrelation and partial-autocorrelation of the remainder component.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, ax = plt.subplots(2, 1)
plot_acf(x=y_no_year_month_season, ax=ax[0])
plot_pacf(x=y_no_year_month_season, ax=ax[1]);
png

From the autocorrelation plot we observe a clear weekly seasonality component at lag 7. Moreover, the sinusoidal shape of the plot suggests that a cyclical encoding is appropriate for this component.

Correlations

We want to see whether there is a linear effect of x on y. A good first indication can be obtained by looking at correlations. Naively, one would simply compute:

np.corrcoef(df_train['y'], df_train['x'])[0, 1]
0.3773272976778308

However, this correlation does not reflect the true relation between x and y, since y still contains clear positive trend and seasonal components. A more meaningful indication is the correlation:

np.corrcoef(y_no_year_month_season, df_train['x'])[0, 1]
0.6714258581031259

Note that we still have not removed the 7-day (weekly) seasonality, which of course affects this correlation. To see this, let us consider the scatter plot:

fig, ax = plt.subplots()
sns.scatterplot(
    x=y_no_year_month_season, 
    y=df_train['x'], 
    hue=df_train['day_of_week'],
    palette='viridis',
    ax=ax, 
)

ax.set(
    title='y_no_year_month_season vs x', 
    xlabel='y_no_year_month_season', 
    ylabel='x'
);
png

We indeed see clear clusters at each level of x, corresponding to the days of the week.

One way of removing this seasonality is via a linear regression model. We use a one-hot encoding for the day of the week variable.

Remark: As we have seen before, a cyclical encoding might be a better choice. Nevertheless, given how the data was generated, we want to test the one-hot encoding approach (a sketch of the cyclical alternative follows).
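
As an aside, a minimal sketch of the cyclical-encoding version of this regression (not used in the rest of the notebook) could look like this:

import statsmodels.formula.api as smf

# Encode the day of the week as a sine/cosine pair and regress on it.
dow_cyc_df = pd.DataFrame({
    'y_no_year_month_season': y_no_year_month_season.values,
    'dow_sin': np.sin(2*np.pi*df_train['day_of_week'] / 7).values,
    'dow_cos': np.cos(2*np.pi*df_train['day_of_week'] / 7).values,
})

dow_cyc_mod = smf.ols(
    formula='y_no_year_month_season ~ dow_sin + dow_cos', 
    data=dow_cyc_df
)
print(dow_cyc_mod.fit().summary())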

# Prepare model data frame.
dow_df = df_train[['day_of_week']].copy()
dow_df['y_no_year_month_season'] = y_no_year_month_season
# One-hot encoding of the day of the week.
dow_dummies = pd.get_dummies(dow_df['day_of_week'], drop_first=True)
dow_dummies.columns = ['d' + str(i) for i in dow_dummies.columns]
dow_df = pd.concat([dow_df, dow_dummies], axis=1)
dow_df.head()
day_of_week y_no_year_month_season d1 d2 d3 d4 d5 d6
0 6 -2.105084 0 0 0 0 0 1
1 0 -0.513847 0 0 0 0 0 0
2 1 -0.143844 1 0 0 0 0 0
3 2 5.383627 0 1 0 0 0 0
4 3 -2.138967 0 0 1 0 0 0

Next, we fit the linear model.

import statsmodels.formula.api as smf

dow_mod = smf.ols(
    formula='y_no_year_month_season ~  d1 + d2 + d3 + d4 + d5 + d6', 
    data=dow_df
)

dow_res = dow_mod.fit()

print(dow_res.summary())
                        OLS Regression Results                              
==================================================================================
Dep. Variable:     y_no_year_month_season   R-squared:                       0.350
Model:                                OLS   Adj. R-squared:                  0.347
Method:                     Least Squares   F-statistic:                     114.2
Date:                    Tue, 11 Feb 2020   Prob (F-statistic):          2.64e-115
Time:                            11:32:43   Log-Likelihood:                -2214.7
No. Observations:                    1277   AIC:                             4443.
Df Residuals:                        1270   BIC:                             4479.
Df Model:                               6                                         
Covariance Type:                nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.4182      0.102     13.958      0.000       1.219       1.618
d1            -0.6520      0.144     -4.538      0.000      -0.934      -0.370
d2            -1.7017      0.144    -11.826      0.000      -1.984      -1.419
d3            -2.7558      0.144    -19.151      0.000      -3.038      -2.473
d4            -2.6323      0.144    -18.293      0.000      -2.915      -2.350
d5            -1.7910      0.144    -12.446      0.000      -2.073      -1.509
d6            -0.4125      0.144     -2.871      0.004      -0.694      -0.131
==============================================================================
Omnibus:                      665.602   Durbin-Watson:                   2.277
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3977.387
Skew:                           2.431   Prob(JB):                         0.00
Kurtosis:                      10.150   Cond. No.                         7.86
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We now compute the correlation of the regressor x with the model residuals:

np.corrcoef(dow_res.resid, df_train['x'])[0, 1]

0.8612502137457188

The correlation increased significantly.

Regressor Effect

A common question in time series analysis is how to estimate the effect (ROI) of a specific external regressor (e.g. what is the ROI of media spend on sales?). In this case, given the correlation above, we can estimate the effect of x by running a linear model on the residuals of the dow_mod model.

x_df = df_train[['x']].copy()
x_df['dow_model_resid'] = dow_res.resid

x_mod = smf.ols(formula='dow_model_resid ~ x', data=x_df)

x_res = x_mod.fit()

print(x_res.summary())
                      OLS Regression Results                            
==============================================================================
Dep. Variable:        dow_model_resid   R-squared:                       0.742
Model:                            OLS   Adj. R-squared:                  0.742
Method:                 Least Squares   F-statistic:                     3662.
Date:                Tue, 11 Feb 2020   Prob (F-statistic):               0.00
Time:                        11:32:44   Log-Likelihood:                -1350.3
No. Observations:                1277   AIC:                             2705.
Df Residuals:                    1275   BIC:                             2715.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.2907      0.020    -14.467      0.000      -0.330      -0.251
x              5.2360      0.087     60.515      0.000       5.066       5.406
==============================================================================
Omnibus:                       31.128   Durbin-Watson:                   1.226
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               34.638
Skew:                          -0.340   Prob(JB):                     3.01e-08
Kurtosis:                       3.435   Cond. No.                         4.45
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The estimated effect is then:

x_res.params['x']

5.236046936179519

Recall that the true effect used to generate the data is 6.0, so this estimate is biased downwards: part of the effect of x was absorbed by the seasonality-removal steps.

Periodogram

There is another way to extract seasonality patterns: using a periodogram to estimate the spectral density of a signal.

from scipy import signal

f, Pxx_den = signal.periodogram(
    x=df_train['y'], 
    detrend='linear', 
    nfft=int(7e2)
)

fig, ax = plt.subplots()
sns.lineplot(x=f, y=Pxx_den, ax=ax)
ax.set(title='Power Spectral Density');
png

Each of these peaks represents a seasonal component. To obtain the associated period in days we take the reciprocal of the corresponding frequency.

# Sort to get the peak values.
sort_freq_index = np.argsort(a=Pxx_den)[::-1]

periodogram_df = pd.DataFrame(
    {'sort_freq': f[sort_freq_index], 'Pxx_den': Pxx_den[sort_freq_index]}
)

periodogram_df.assign(days = lambda x: 1/x['sort_freq']).head(5)
sort_freq Pxx_den days
0 0.032857 1341.182617 30.434783
1 0.002857 966.455322 350.000000
2 0.142857 701.349792 7.000000
3 0.001429 191.129486 700.000000
4 0.030000 41.348232 33.333333

The first three values correspond to periods of roughly 30, 350 and 7 days, i.e. the monthly, yearly and weekly seasonalities (the yearly period appears as 350 rather than 365 because of the finite frequency resolution of the periodogram).

Time Series Decomposition

statsmodels has a built-in decomposition function based on moving averages. Let us use it to estimate the effect of the external regressor (which, as we already know, should show up in the residual component).

from statsmodels.tsa.seasonal import seasonal_decompose

decomposition_obj = seasonal_decompose(
    x=df_train[['date', 'y']].set_index('date'), 
    model='additive'
)

fig, ax = plt.subplots(4, 1, figsize=(12, 12))

decomposition_obj.observed.plot(ax=ax[0])
ax[0].set(title='observed')
decomposition_obj.trend.plot(ax=ax[1])
ax[1].set(title='trend')
decomposition_obj.seasonal.plot(ax=ax[2])
ax[2].set(title='seasonal')
decomposition_obj.resid.plot(ax=ax[3])
ax[3].set(title='residual')
plt.tight_layout()
png

Note that the trend component also includes the monthly and yearly seasonality. This might not always be desired. You can of course use the smoothing technique presented above or keep decomposing the resulting trend by specifying the period parameter in the seasonal_decompose function.
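
For example, a second pass over the extracted trend with an explicit, approximately monthly period could look like the following sketch (the period value of 30 is our choice, and the period argument requires a recent statsmodels version):

# Decompose the extracted trend once more to split off the monthly seasonality
# it still contains (the trend has NaNs at the boundaries, hence the dropna).
trend = decomposition_obj.trend.dropna()

trend_decomposition = seasonal_decompose(
    x=trend, 
    model='additive', 
    period=30
)

trend_decomposition.plot();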

  • Correlation
# We remove the first and last 3 entries as they are np.nan 
# coming from the moving average method. 
np.corrcoef(decomposition_obj.resid[3:-3], df_train['x'][3:-3])[0, 1]

0.8616875939727174

  • Effect Estimation
x_df['decomposition_resid'] = decomposition_obj.resid.values

x_mod2 = smf.ols(formula='decomposition_resid ~ x', data=x_df[3:-3])

x_res2 = x_mod2.fit()

print(x_res2.summary())
                       OLS Regression Results                            
===============================================================================
Dep. Variable:     decomposition_resid   R-squared:                       0.743
Model:                             OLS   Adj. R-squared:                  0.742
Method:                  Least Squares   F-statistic:                     3659.
Date:                 Tue, 11 Feb 2020   Prob (F-statistic):               0.00
Time:                         11:32:47   Log-Likelihood:                -1311.4
No. Observations:                 1271   AIC:                             2627.
Df Residuals:                     1269   BIC:                             2637.
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.2828      0.020    -14.404      0.000      -0.321      -0.244
x              5.1026      0.084     60.492      0.000       4.937       5.268
==============================================================================
Omnibus:                       73.490   Durbin-Watson:                   1.441
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               88.742
Skew:                          -0.568   Prob(JB):                     5.37e-20
Kurtosis:                       3.620   Cond. No.                         4.44
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We get similar results as above.

Remarks:

  • In real-life applications the effects of external regressors might include lags. A lagged correlation analysis should be done before modeling (see the sketch after this list).
  • Also, linear relations might serve as a good first approximation. Nevertheless, allowing for non-linear relations might improve the model significantly. Two concrete approaches that appear often are adstock transformations and generalized additive models.
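
As a rough illustration of these remarks, here is a minimal sketch that scans correlations between the day-of-week model residuals and lagged versions of x, and defines a simple geometric adstock transform (the lag range and the decay rate alpha are arbitrary choices for illustration):

# Correlation of the residuals with lagged versions of the external regressor.
resid = pd.Series(dow_res.resid.values)
x_series = df_train['x'].reset_index(drop=True)

lag_corr = pd.Series({lag: resid.corr(x_series.shift(lag)) for lag in range(0, 15)})
print(lag_corr.round(3))

# Simple geometric adstock transform (carry-over effect with decay rate alpha).
def geometric_adstock(x, alpha=0.5):
    out = np.zeros(len(x))
    carry = 0.0
    for i, value in enumerate(x):
        carry = value + alpha*carry
        out[i] = carry
    return out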

Define Model

Based on the previous analysis we have identified the main components of our time series:

  • A local_linear_trend

  • Seasonality:

    • month_of_year
    • day_of_week
    • day_of_month
  • External regressor x_var.

We define each of these components separately in order to build the STS model using TensorFlow Probability.

# Local linear trend. 
local_linear_trend = tfp.sts.LocalLinearTrend(
    observed_time_series=df_train['y'], 
    name='local_linear_trend',
)

# We need to pre-define the number of days in each month.
num_days_per_month = np.array(
  [
    [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
    [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
    [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
    [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31] # year with leap day.
  ] 
)  

# Define month of year seasonal variable.
month_of_year = tfp.sts.Seasonal(
  num_seasons=12,
  num_steps_per_season=num_days_per_month,
  name='month_of_year'
)

# Define day of week as seasonal variable.
day_of_week = tfp.sts.Seasonal(
    num_seasons=7, 
    num_steps_per_season=1,
    observed_time_series=df_train['y'], 
    name='day_of_week',
)

# Create cyclic variable for day of the month.
design_matrix_day_of_month = tf.reshape(
    np.sin(2*np.pi*df['day_of_month'] / df['daysinmonth']).values.astype(np.float32), 
    (-1, 1)
)

# Define day of the month as an external regressor.
# We do not encode it as seasonal as the number of steps is not uniform.
day_of_month = tfp.sts.LinearRegression(
    design_matrix=design_matrix_day_of_month,
    name='day_of_month'
)

# Define external regressor component. 
# We use the whole data set (df) as we expect to have these values in the future. 
design_matrix_x_var = tf.reshape(df['x'].values.astype(np.float32), (-1, 1))

x_var = tfp.sts.LinearRegression(
    design_matrix=design_matrix_x_var,
    name='x_var'
)

Now we build the model:

model_components = [
    local_linear_trend, 
    month_of_year, 
    day_of_week, 
    day_of_month, 
    x_var,
]

toy_model = tfp.sts.Sum(
    components=model_components, 
    observed_time_series=df_train['y']
)

Let us see the model parameters and their priors:

for p in toy_model.parameters:
    print('-'*140)
    print('Parameter: ' + p.name)
    print('Prior: ' + str(p.prior))
    
    --------------------------------------------------------------------------------------------------------------------------------------------
    Parameter: observation_noise_scale
    Prior: tfp.distributions.LogNormal("Sum_LogNormal", batch_shape=[], event_shape=[], dtype=float32)
    --------------------------------------------------------------------------------------------------------------------------------------------
    Parameter: local_linear_trend/_level_scale
    Prior: tfp.distributions.LogNormal("local_linear_trend_level_scale_prior", batch_shape=[], event_shape=[], dtype=float32)
    --------------------------------------------------------------------------------------------------------------------------------------------
    Parameter: local_linear_trend/_slope_scale
    Prior: tfp.distributions.LogNormal("local_linear_trend_slope_scale_prior", batch_shape=[], event_shape=[], dtype=float32)
    --------------------------------------------------------------------------------------------------------------------------------------------
    Parameter: month_of_year/_drift_scale
    Prior: tfp.distributions.LogNormal("month_of_year_LogNormal", batch_shape=[], event_shape=[], dtype=float32)
    --------------------------------------------------------------------------------------------------------------------------------------------
    Parameter: day_of_week/_drift_scale
    Prior: tfp.distributions.LogNormal("day_of_week_LogNormal", batch_shape=[], event_shape=[], dtype=float32)
    --------------------------------------------------------------------------------------------------------------------------------------------
    Parameter: day_of_month/_weights
    Prior: tfp.distributions.TransformedDistribution("day_of_month_day_of_month_identityday_of_month_StudentT", batch_shape=[], event_shape=[1], dtype=float32)
    --------------------------------------------------------------------------------------------------------------------------------------------
    Parameter: x_var/_weights
    Prior: tfp.distributions.TransformedDistribution("x_var_x_var_identityx_var_StudentT", batch_shape=[], event_shape=[1], dtype=float32)

Model Fit

Variational Inference (Short Intro)

We follow the strategy of the TensorFlow Probability use cases and fit the model using variational inference. I find the article Variational Inference: A Review for Statisticians a very good introduction to the subject. The main idea of variational inference is to approximate a conditional density by the member of a specified family of parametric densities \(\mathfrak{D}\) that is closest in Kullback-Leibler divergence, which turns inference into an optimization problem. Specifically, assume we are given observed variables \(x\) (i.e. data) and let \(z\) be a set of latent variables with joint density \(p(x, z)\). We are interested in computing the conditional density

\[ p(z|x) = \frac{p(x, z)}{p(x)}, \quad\text{where} \quad p(x)=\int p(x, z) dz. \]

This last integral (known as the evidence) is in general very hard to compute. The main idea of variational inference is to solve the optimization problem

\[ q^{*}(z)=\min_{q \in \mathfrak{D}} \text{KL}(q(z)||p(z|x)), \]

where

\[ \begin{align} \text{KL}(q(z)||p(z|x)) =& \text{E}[\log(q(z))] - \text{E}[\log(p(z|x))] \\ =& \text{E}[\log(q(z))] - \text{E}[\log(p(z,x))] + \text{E}[\log(p(x))] \\ =& \text{E}[\log(q(z))] - \text{E}[\log(p(z,x))] + \log(p(x)) \end{align} \]

is the Kullback-Leibler (KL) divergence (the expected values are taken with respect to \(q(z)\)). Note that this quantity contains the term \(\log(p(x))\) (the log evidence), which is hard to compute. Because we cannot compute the KL directly, we instead optimize an alternative objective that equals the negative KL up to an additive constant 2,

\[ \text{ELBO}(q) = \text{E}[\log(p(z, x))] - \text{E}[\log(q(z))] \]

ELBO stands for evidence lower bound.

Remark: Note that

\[ \text{ELBO}(q)= -\text{KL}(q(z)||p(z|x)) + \log(p(x)) \]

Moreover, it is easy to see that

\[ \text{ELBO}(q)= \text{E}[\log(p(x|z))]-\text{KL}(q(z)||p(z)) \]

The first term is the expected log-likelihood and the second term is the negative KL divergence between the variational density and the prior.
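
Indeed, writing \(\log(p(z, x)) = \log(p(x|z)) + \log(p(z))\) in the definition of the ELBO gives

\[ \text{ELBO}(q) = \text{E}[\log(p(x|z))] + \text{E}[\log(p(z))] - \text{E}[\log(q(z))] = \text{E}[\log(p(x|z))] - \text{KL}(q(z)||p(z)). \]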

For more details and enlightening comments, please refer to the article mentioned above (from which this short introduction was taken).

Variational Inference in TensorFlow Probability

First we build the variational surrogate posteriors, which consist of an independent normal distribution (transformed to the support of the parameter) with two trainable parameters, loc and scale, for each parameter in the model. That is, the family \(\mathfrak{D}\) above consists of factored normal distributions.

variational_posteriors = tfp.sts.build_factored_surrogate_posterior(
  model=toy_model, 
  seed=42
)

Let us sample from the surrogate posteriors before training and plot the densities of these initial distributions:

q_prior_samples = variational_posteriors.sample(1000)
num_parameters = len(toy_model.parameters)

fig, ax = plt.subplots(num_parameters, 1, figsize=(12, 21))

for i, param in enumerate(toy_model.parameters):
    
    param_mean = np.mean(q_prior_samples[param.name], axis=0)
    param_median = np.median(q_prior_samples[param.name], axis=0)
    param_std = np.std(q_prior_samples[param.name], axis=0)
    
    sns.distplot(a=q_prior_samples[param.name], rug=True, ax=ax[i])
    ax[i].set(title=param.name)
    ax[i].axvline(x= param_median, color=sns_c[1], linestyle='--', label='median')
    ax[i].axvline(x= param_mean, color=sns_c[2], linestyle='--', label='mean')
    ax[i].axvline(x= param_mean + 2*param_std, color=sns_c[3], linestyle='--', label=r'$\mu \pm 2\sigma$')
    ax[i].axvline(x= param_mean - 2*param_std, color=sns_c[3], linestyle='--')
    ax[i].legend()
    
plt.tight_layout(pad=3.0)
plt.suptitle('PRIOR DISTRIBUTIONS', y=0.99);
png

Next we run the optimization procedure.

num_variational_steps = int(200)

# Set optimizer.
optimizer = tf.optimizers.Adam(learning_rate=0.1)

# Using fit_surrogate_posterior to build and optimize 
# the variational loss function.
@tf.function(experimental_compile=True)
def train():
    # Build the joint density. 
    target_log_prob_fn = toy_model.joint_log_prob(
        observed_time_series=df_train['y']
    )
    
    elbo_loss_curve = tfp.vi.fit_surrogate_posterior(
        target_log_prob_fn=target_log_prob_fn,
        surrogate_posterior=variational_posteriors,
        optimizer=optimizer,
        num_steps=num_variational_steps,
        seed=42
    )
    
    return elbo_loss_curve

# Run optimization.
elbo_loss_curve = train()

Let us plot the elbo_loss_curve.

fig, ax = plt.subplots()
ax.plot(elbo_loss_curve, marker='.')
ax.set(title='ELBO Loss Curve', xlabel='iteration');
png

We see that the loss has essentially converged.

Now we sample from the variational posteriors obtained:

q_samples = variational_posteriors.sample(1000)

Let us plot the parameter posterior distributions:

num_parameters = len(toy_model.parameters)

fig, ax = plt.subplots(num_parameters, 1, figsize=(12, 21))

for i, param in enumerate(toy_model.parameters):
    
    param_mean = np.mean(q_samples[param.name], axis=0)
    param_median = np.median(q_samples[param.name], axis=0)
    param_std = np.std(q_samples[param.name], axis=0)
    
    sns.distplot(a=q_samples[param.name], rug=True, ax=ax[i])
    ax[i].set(title=param.name)
    ax[i].axvline(x= param_median, color=sns_c[1], linestyle='--', label='median')
    ax[i].axvline(x= param_mean, color=sns_c[2], linestyle='--', label='mean')
    ax[i].axvline(x= param_mean + 2*param_std, color=sns_c[3], linestyle='--', label=r'$\mu \pm 2\sigma$')
    ax[i].axvline(x= param_mean - 2*param_std, color=sns_c[3], linestyle='--')
    ax[i].legend()
    
plt.tight_layout(pad=3.0)
plt.suptitle('POSTERIOR DISTRIBUTIONS', y=0.99);
png
# Get mean and std for each parameter.
print('Inferred parameters:')
for param in toy_model.parameters:
    print('{}: {} +- {}'.format(
        param.name,
        np.mean(q_samples[param.name], axis=0),
        np.std(q_samples[param.name], axis=0))
    )
    Inferred parameters:
    observation_noise_scale: 0.5021326541900635 +- 0.007594980299472809
    local_linear_trend/_level_scale: 0.009583557024598122 +- 0.006971438881009817
    local_linear_trend/_slope_scale: 0.0018680891953408718 +- 0.0003248010470997542
    month_of_year/_drift_scale: 0.024883059784770012 +- 0.08357027918100357
    day_of_week/_drift_scale: 0.018588630482554436 +- 0.004957796074450016
    day_of_month/_weights: [1.9918456] +- [0.02078216]
    x_var/_weights: [5.9605927] +- [0.07460035]

Remark: Observe that the estimated effect of the regressor x is around 5.96, very close to the true value of 6 used to generate the data.
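
To quantify the uncertainty around this point estimate, we can, for example, compute a 95% credible interval from the posterior samples (a small sketch, using the parameter name shown in the printout above):

x_weight_samples = np.asarray(q_samples['x_var/_weights']).flatten()
# 95% credible interval for the weight of the external regressor x.
print(np.percentile(x_weight_samples, q=[2.5, 97.5]))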

Model Predictions

We now generate the forecast six months ahead.

# Compute number of days in the last 6 months of 2020.
forecast_window = num_days_per_month[-1][6:13].sum()

# Get forecast distribution.
forecast_dist = tfp.sts.forecast(
    toy_model,
    observed_time_series=df_train['y'],
    parameter_samples=q_samples,
    num_steps_forecast=forecast_window 
)
# Sample and compute mean and std. 
num_samples = 100

forecast_mean, forecast_scale, forecast_samples = (
    forecast_dist.mean().numpy().flatten(),
    forecast_dist.stddev().numpy().flatten(),
    forecast_dist.sample(num_samples).numpy()[..., 0]  # keep one row per sample path.
)

Next, we store the predictions in the df_test data frame.

df_test['y_pred'] = forecast_mean
df_test['y_pred_std'] = forecast_scale
df_test['errors'] = df_test['y'] - df_test['y_pred']

Let us plot the predictions:

fig, ax = plt.subplots()

ax.fill_between(
    x=df_test['date'],
    y1=df_test['y_pred']-2*df_test['y_pred_std'],
    y2=df_test['y_pred']+2*df_test['y_pred_std'],
    color=sns_c[2], 
    alpha=0.25,
    label=r'credible_interval ($\mu \pm 2\sigma$)'
)

sns.lineplot(x='date', y='y', label='train', data=df_train, ax=ax)
sns.lineplot(x='date', y='y', label='test', data=df_test, ax=ax)
sns.lineplot(x='date', y='y_pred', label='prediction', data=df_test, ax=ax)
ax.axvline(x= threshold_date, color=sns_c[3], linestyle='--', label='train-test split')
ax.legend(loc='upper left')
ax.set(title='STS Forecast');
png

Zooming in:

fig, ax = plt.subplots()

ax.fill_between(
    x=df_test['date'],
    y1=df_test['y_pred']-2*df_test['y_pred_std'],
    y2=df_test['y_pred']+2*df_test['y_pred_std'],
    color=sns_c[2], 
    alpha=0.25,
    label=r'credible_interval ($\mu \pm 2\sigma$)'
)
sns.lineplot(x='date', y='y', label='test', data=df_test, ax=ax)
sns.lineplot(x='date', y='y_pred', label='prediction', data=df_test, ax=ax)
ax.legend(loc='upper left')
ax.set(title='STS Forecast');
png

Let us see it as a scatter plot.

fig, ax = plt.subplots(figsize=(8,8))

# Generate diagonal line to plot. 
d_x = np.linspace(start=df_test['y'].min() - 1, stop=df_test['y'].max() + 1, num=100)

sns.regplot(x='y', y='y_pred', data=df_test, ax=ax)
ax.plot(d_x, d_x, color=sns_c[3], linestyle='--')
png

Error Analysis

It is important to emphasize that, for any modeling problem, getting the predictions is not the end of the story. It is very important to analyze the errors to understand where the model is not working as expected.
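
Before looking at the plots, a few summary metrics can be computed directly from the columns stored above (a small sketch; the coverage refers to the \(\mu \pm 2\sigma\) forecast band):

mae = df_test['errors'].abs().mean()
rmse = np.sqrt(np.mean(df_test['errors']**2))
# Share of test observations falling inside the mu +/- 2*sigma forecast band.
coverage = np.mean(
    (df_test['y'] > df_test['y_pred'] - 2*df_test['y_pred_std'])
    & (df_test['y'] < df_test['y_pred'] + 2*df_test['y_pred_std'])
)
print(f'MAE: {mae: 0.3f}, RMSE: {rmse: 0.3f}, 2-sigma coverage: {coverage: 0.3f}')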

  • Distribution
errors_mean = df_test['errors'].mean()
errors_std = df_test['errors'].std()

fig, ax = plt.subplots()

sns.distplot(a=df_test['errors'], ax=ax, bins=20, rug=True)
ax.axvline(x=errors_mean, color=sns_c[2], linestyle='--', label=r'$\mu$')
ax.axvline(x=errors_mean + 2*errors_std, color=sns_c[3], linestyle='--', label=r'$\mu \pm 2\sigma$')
ax.axvline(x=errors_mean - 2*errors_std, color=sns_c[3], linestyle='--')
ax.legend()
ax.set(title='Model Errors (Test Set)');
png

The errors look normally distributed and centered around zero.
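
As a small numerical complement to this visual impression, one could run a standard normality test on the errors (a sketch):

from scipy import stats

# D'Agostino-Pearson normality test on the test-set errors.
print(stats.normaltest(df_test['errors']))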

  • Autocorrelation
fig, ax = plt.subplots()

sns.scatterplot(x='index', y='errors', data=df_test.reset_index(), ax=ax)
ax.axhline(y=errors_mean, color=sns_c[2], linestyle='--', label=r'$\mu$ ')
ax.axhline(y=errors_mean + 2*errors_std, color=sns_c[3], linestyle='--', label=r'$\mu \pm 2\sigma$')
ax.axhline(y=errors_mean - 2*errors_std, color=sns_c[3], linestyle='--')
ax.legend()
ax.set(title='Model Errors (Test Set)');
png
fig, ax = plt.subplots(2, 1)
plot_acf(x=df_test['errors'], ax=ax[0])
plot_pacf(x=df_test['errors'], ax=ax[1]);
png

We do not see any significant (partial) autocorrelation or obvious patterns in the errors. They look independent and approximately normally distributed around zero.
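
Similarly, a Ljung-Box test gives a numerical counterpart to the (partial) autocorrelation plots (a sketch; the chosen lag is arbitrary):

from statsmodels.stats.diagnostic import acorr_ljungbox

# Null hypothesis: no autocorrelation in the errors up to the given lag.
print(acorr_ljungbox(x=df_test['errors'], lags=[7]))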

Model Fit Decomposition

Finally, let us see the individual model components.

component_dists = tfp.sts.decompose_by_component(
    model=toy_model,
    observed_time_series=df_train['y'],
    parameter_samples=q_samples
)
component_means, component_stddevs = (
    {k.name: c.mean() for k, c in component_dists.items()},
    {k.name: c.stddev() for k, c in component_dists.items()}
)
num_components = len(component_means)

fig, ax = plt.subplots(num_components, 1, figsize= (12, 15))

for i, component_name in enumerate(component_means.keys()):
    component_mean = component_means[component_name]
    component_stddev = component_stddevs[component_name]
    
    sns.lineplot(x=df_train['date'], y=component_mean, color=sns_c[0], ax=ax[i])
    
    ax[i].fill_between(
        x=df_train['date'],
        y1=component_mean-2*component_stddev,
        y2=component_mean+2*component_stddev,
        alpha=0.4, 
        color=sns_c[1]
    )
    
    ax[i].set(title=component_name)

plt.tight_layout()
png

This was a first exploration of the TensorFlow Probability STS module. We emphasized the exploratory analysis of the time series since, in real applications, we are never given its structure prior to modeling (that would be quite boring). I really want to keep exploring the tools and techniques implemented in this probabilistic programming framework.