Forecasting: From Structural Models to Transformers

Author

Vadim Sokolov

Published

November 19, 2023

This document is a copy (with some additions) of the blog post by Steven L. Scott (2017), available at https://www.unofficialgoogledatascience.com/2017/07/fitting-bayesian-structural-time-series.html

Introduction

Time series data are everywhere, but time series modeling is a fairly specialized area within statistics and data science. This post describes the bsts software package, which makes it easy to fit some fairly sophisticated time series models with just a few lines of R code.

Time series data appear in a surprising number of applications, ranging from business, to the physical and social sciences, to health, medicine, and engineering. Forecasting (e.g. next month’s sales) is common in problems involving time series data, but explanatory models (e.g. finding drivers of sales) are also important. Time series data are having something of a moment in the tech blogs right now, with Facebook announcing their “Prophet” system for time series forecasting (Sean J. Taylor and Ben Letham (2017)), and Google posting about its forecasting system in this blog (Eric Tassone and Farzan Rohani (2017)).

This post summarizes the bsts R package, a tool for fitting Bayesian structural time series models. These are a widely useful class of time series models, known in various literatures as “structural time series,” “state space models,” “Kalman filter models,” and “dynamic linear models,” among others. Though the models need not be fit using Bayesian methods, they have a Bayesian flavor and the bsts package was built to use Bayesian posterior sampling.

The bsts package is open source. You can download it from CRAN with the R command install.packages("bsts"). It shares some features with the Facebook and Google systems, but it was written with different goals in mind. The other systems were written to do “forecasting at scale,” a phrase that means something different in time series problems than in other corners of data science. The Google and Facebook systems focus on forecasting daily data into the distant future. The “scale” in question comes from having many time series to forecast, not from any particular time series being extraordinarily long. The bottleneck in both cases is the lack of analyst attention, so the systems aim to automate analysis as much as possible. The Facebook system accomplishes this using regularized regression, while the Google system works by averaging a large ensemble of forecasts. Both systems focus on daily data, and derive much of their efficiency through the careful treatment of holidays.

There are aspects of bsts which can be similarly automated, and a specifically configured version of bsts is a powerful member of the Google ensemble. However, bsts can also be configured for specific tasks by an analyst who knows whether the goal is short term or long term forecasting, whether or not the data are likely to contain one or more seasonal effects, and whether the goal is actually to fit an explanatory model, and not primarily to do forecasting at all.

The workhorse behind bsts is the structural time series model. These models are briefly described in the section Structural time series models. Then the software is introduced through a series of extended examples that focus on a few of the more advanced features of bsts. Example 1: Nowcasting includes descriptions of the local linear trend and seasonal state models, as well as spike and slab priors for regressions with large numbers of predictors. Example 2: Long term forecasting describes a situation where the local level and local linear trend models would be inappropriate. It offers a semilocal linear trend model as an alternative. Example 3: Recession modeling describes a model where the response variable is non-Gaussian. The goal in Example 3 is not to predict the future, but to control for serial dependence in an explanatory model that seeks to identify relevant predictor variables. A final section concludes with a discussion of other features in the package which we won’t have space (maybe “time” is a better word) to explore with fully fleshed out examples.

Bayesian Learning: Black Swans

We introduce the concept of Bayesian learning using the Black Swan inference problem. Suppose that after \(T\) trials, where \(T\) is large, you have only seen successes, and that you assess the probability of the next trial being a success as \((T+1)/(T+2)\), that is, almost certain. This is a model of observing White Swans and having never seen a Black Swan. Taleb (2007) makes it sound as if the rules of probability are not rich enough to handle Black Swan events. There is a related class of problems in finance known as Peso problems, where countries decide to devalue their currencies and there is little a priori evidence from recent history that such an event is going to happen.

To obtain such a probability assessment we use a Beta-Binomial conjugate Bayes updating model. The key point is that the same model also implies a large probability of a Black Swan event happening sometime in the future. An independence model has difficulty doing this.

The Bayesian learning Beta-Binomial model will have no problem. We model \(Y_{t}\in\{0,1\}\) with \(P\left( Y_{t}=1\mid \theta\right) =\theta\). This is the classic Bernoulli “coin-flipping” model and is a component of more general specifications such as regime switching or outlier-type models.

The likelihood for a sequence of Bernoulli observations is \[ p\left( y\mid \theta\right) =\prod_{t=1}^{T}p\left( y_{t}\mid \theta\right) =\theta^{\sum_{t=1}^{T}y_{t}}\left( 1-\theta\right) ^{T-\sum_{t=1}^{T}y_{t}}. \] The maximum likelihood estimator is the sample mean \[ \widehat{\theta} = (1/T)\sum_{t=1}^{T}y_{t}. \] This makes little sense when you just observe white swans. It predicts \(\widehat{\theta} = 1\) and gets shocked when it sees a black swan (zero probability event). Bayes, on the other hand, allows for ‘learning’.
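As a quick numerical check, here is a minimal R sketch of the MLE computation; the all-ones sequence of length 120 is purely illustrative.

Code
# Minimal sketch: the Bernoulli MLE is the sample mean.  With T consecutive
# white swans (all ones) it assigns probability zero to ever seeing a black swan.
y <- rep(1, 120)        # T = 120 successes, no failures observed
theta_hat <- mean(y)    # MLE = (1/T) * sum(y) = 1
theta_hat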

To do this we need a prior distribution for the ‘parameter’ \(\theta\). A natural choice is a Beta distribution, denoted by \(\theta\sim\text{Beta}\left( a,A\right)\), with pdf given by \[ p\left( \theta\mid a,A\right) =\frac{\theta^{a-1}\left( 1-\theta\right) ^{A-1}}{B\left( a,A\right) }, \] where \(B\left( a,A\right)\) denotes the Beta function. Since \(p\left( \theta\mid a,A\right)\) is a density and integrates to 1, we have

\[ B\left( a,A\right) =\int_{0}^{1}\theta^{a-1}\left( 1-\theta\right)^{A-1}d\theta. \]
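For intuition, the sketch below plots one Beta density and checks numerically that it integrates to one; the hyperparameter values \(a=2\), \(A=5\) are illustrative choices.

Code
# Minimal sketch: a Beta(a, A) prior density and a numerical check that it
# integrates to 1 (illustrative hyperparameter values).
a <- 2; A <- 5
curve(dbeta(x, a, A), from = 0, to = 1, lwd = 3, col = "blue",
      xlab = expression(theta), ylab = "prior density")
integrate(function(theta) theta^(a - 1) * (1 - theta)^(A - 1) / beta(a, A),
          lower = 0, upper = 1)   # should be approximately 1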

Bayes rule then tells us how to combine the likelihood and prior to obtain the posterior distribution of \(\theta \mid Y=y\): what do we believe about \(\theta\) given the observed sequence? Our prediction rule is then \(P(Y_{T+1} =1 \mid Y=y ) = \mathrm{E}(\theta \mid y)\). It is straightforward to show that the posterior distribution is again a Beta distribution, \[ \theta\mid y \sim \text{Beta}\left( a_{T},A_{T}\right), \quad \text{where } \; a_{T}=a+\sum_{t=1}^{T}y_{t}, \quad A_{T}=A+T-\sum_{t=1}^{T}y_{t}. \]

There is a “conjugate” form of the posterior: it is also a Beta distribution and the hyper-parameters \(a_{T}\) and \(A_{T}\) depend on the data only via the sufficient statistics, \(T\) and \(\sum_{t=1}^{T}y_{t}\). The posterior mean and variance are

\[ \mathrm{E}\left[ \theta\mid y\right] =\frac{a_{T}}{a_{T}+A_{T}} \;\text{ and }\; \mathrm{Var}\left( \theta\mid y\right) =\frac{a_{T}A_{T}}{\left( a_{T}+A_{T}\right) ^{2}\left( a_{T}+A_{T}+1\right)}, \] respectively. This implies that for large samples, \(\mathrm{E}(\theta\mid y) \approx \bar{y} = \widehat{\theta}\), the MLE!
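The conjugate update is easy to verify numerically. Below is a minimal R sketch, using simulated Bernoulli data and a uniform prior (both chosen for illustration), showing the posterior mean approaching the MLE as \(T\) grows.

Code
# Minimal sketch of the Beta-Binomial conjugate update on simulated data.
set.seed(1)
theta_true <- 0.9
T <- 500
y <- rbinom(T, size = 1, prob = theta_true)

a <- 1; A <- 1                         # uniform prior, Beta(1, 1)
a_T <- a + sum(y)                      # posterior hyperparameters
A_T <- A + T - sum(y)

post_mean <- a_T / (a_T + A_T)         # E[theta | y]
post_var  <- a_T * A_T / ((a_T + A_T)^2 * (a_T + A_T + 1))
c(posterior.mean = post_mean, mle = mean(y))   # nearly identical for large T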

Now, if we assume a uniform prior specification, \(\theta \sim \text{Beta}(1,1) = U(0,1)\), then we have the following probability assessment. After \(T\) trials, suppose that we have only seen \(T\) successes, namely, \(( y_1 , \ldots , y_T ) = ( 1 , \ldots , 1 )\). Then you assess the probability of the next trial being a success as \[ p( Y_{T+1} =1 \mid y_1=1 , \ldots , y_T=1 ) = \frac{T+1}{T+2}. \] This follows from the mean of the Beta posterior, \(\theta \mid y \sim \text{Beta}(1+T, 1)\), since \(P(Y_{T+1} = 1 \mid y) = \mathrm{E}_{\theta \mid y}\left[P(Y_{T+1}=1 \mid \theta) \right] = \mathrm{E}[\theta \mid y]\). For large \(T\) this is almost certain. Figure 1 shows the expected value of observing \(1\) after we observed \(T\) successful outcomes.

Code
# Posterior probability that the next swan is white after T consecutive
# white swans, under the uniform Beta(1, 1) prior: (T + 1) / (T + 2).
par(mar = c(4, 4, 1, 0), bty = 'n')
t = 1:120
plot(t, (t + 1) / (t + 2), type = 'l', lwd = 3, col = "blue",
     xlab = "T", ylab = "P(white swan)")

Figure 1: Posterior Probability of a White Swan

Now consider a future set of \(n\) trials, where \(n\) is also large. The probability of never seeing a Black Swan is then given by \[ p( y_{T+1} =1 , \ldots , y_{T+n} = 1 \mid y_1=1 , \ldots , y_T=1 ) = \frac{ T+1 }{ T+n+1 } \] For a fixed \(T\), and large \(n\), we have \(\frac{ T+1 }{ T+n+1 } \rightarrow 0\). Hence, we will see a Black Swan event with large probability — we just don’t know when! The exchangeable Beta-Binomial model then implies that a Black Swan event will eventually appear. One shouldn’t be that surprised when it actually happens.
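The decay of this probability with \(n\) is easy to visualize. The minimal R sketch below plots it, using \(T = 120\) as in Figure 1 (an arbitrary illustrative choice).

Code
# Minimal sketch: probability of seeing no black swan in the next n trials,
# given T consecutive white swans and a uniform prior: (T + 1) / (T + n + 1).
T <- 120
n <- 1:2000
plot(n, (T + 1) / (T + n + 1), type = 'l', lwd = 3, col = "blue",
     ylab = "P(no black swan in next n trials)")
# As n grows this tends to zero: a black swan eventually appears.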

Structural time series models

A structural time series model is defined by two equations. The observation equation relates the observed data \(y_t\) to a vector of latent variables \(\alpha_t\) known as the “state.” \[ y_t = Z_t^T\alpha_t + \epsilon_t. \]

The transition equation describes how the latent state evolves through time. \[ \alpha_{t+1} = T_t \alpha_t + R_t \eta_t. \]

The error terms \(\epsilon_t\) and \(\eta_t\) are Gaussian and independent of everything else. The arrays \(Z_t\) , \(T_t\) and \(R_t\) are structural parameters. They may contain parameters in the statistical sense, but often they simply contain strategically placed 0’s and 1’s indicating which bits of \(\alpha_t\) are relevant for a particular computation. An example will hopefully make things clearer.

The simplest useful model is the “local level model,” in which the vector \(\alpha_t\) is just a scalar \(\mu_t\). The local level model is a random walk observed in noise. \[\begin{align*} y_t = &\mu_t + \epsilon_t\\ \mu_{t+1} = &\mu_t + \eta_t. \end{align*}\] Here \(\alpha_t=\mu_t\) , and \(Z_t\) , \(T_t\), and \(R_t\) all collapse to the scalar value 1. Similar to Bayesian hierarchical models for nested data, the local level model is a compromise between two extremes. The compromise is determined by variances of \(\epsilon_t \sim N(0,\sigma^2)\) and \(\eta_t \sim N(0,\tau^2)\). If \(\tau^2=0\) then \(\mu_t\) is a constant, so the data are IID Gaussian noise. In that case the best estimator of \(y_{t+1}\) is the mean of \(y_1,\ldots,y_t\). Conversely, if \(\sigma^2=0\) then the data follow a random walk, in which case the best estimator of \(y_{t+1}\) is \(y_t\). Notice that in one case the estimator depends on all past data (weighted equally) while in the other it depends only on the most recent data point, giving past data zero weight. If both variances are positive then the optimal estimator of \(y_{t+1}\) winds up being “exponential smoothing,” where past data are forgotten at an exponential rate determined by the ratio of the two variances. Also notice that while the state in this model is Markov (i.e. it only depends on the previous state), the dependence among the observed data extends to the beginning of the series.
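To make this concrete, here is a minimal R sketch that simulates a local level series and fits it with bsts. The simulation parameters, the series length, and the number of MCMC iterations are illustrative choices, not package defaults.

Code
# Minimal sketch: simulate a local level series and fit it with bsts.
library(bsts)

set.seed(42)
n <- 200
sigma <- 1.0     # observation noise sd (illustrative)
tau   <- 0.3     # level innovation sd (illustrative)
mu <- cumsum(rnorm(n, sd = tau))        # random-walk level: mu_{t+1} = mu_t + eta_t
y  <- mu + rnorm(n, sd = sigma)         # observation equation: y_t = mu_t + eps_t

ss <- AddLocalLevel(list(), y)          # state specification: local level
model <- bsts(y, state.specification = ss, niter = 1000)

plot(model)                             # posterior of the state vs the data
pred <- predict(model, horizon = 24)    # forecast 24 steps ahead
plot(pred)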