Unit 5: Stochastic Processes
Vadim Sokolov
George Mason University
Spring 2025
The process then is the distribution over the space of functions from \(U\) to \(S\).
A one-dimensional Brownian motion (also known as a Wiener process) is a continuous-time stochastic process \((B(t))_{t\ge 0}\) with the following properties: \(B(0)=0\); increments over disjoint time intervals are independent; each increment \(B(t)-B(s)\), \(s<t\), is normally distributed with mean \(0\) and variance \(t-s\); and the sample paths are continuous.
Formally, Brownian motion is a stochastic process \(B(t)\), that is, a family of real random variables indexed by the nonnegative real numbers \(t \ge 0\).
Figure 1: Brownian Motion
Thus, for any times \(0 \leq t_1 < t_2 < ... < t_n\), the random variables \(B(t_2) - B(t_1)\), \(B(t_3) - B(t_2)\), …, \(B(t_n) - B(t_{n-1})\) are independent and the function \(t \mapsto B(t)\) is continuous almost surely.
The investor must obtain optimal filters for \((V_t,J_t,Z_t)\), and learn the posterior densities of the parameters \((\mu, \alpha_v, \beta_v, \sigma_v^2 , \lambda )\). These estimates will be conditional on the information available at each time.
\(k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]\) describes the amount of dependence between the values of the function at two different points in the input space.
\(k(x,x) = \sigma^2\) and \(k(x,x') \rightarrow 0\) as \(\|x-x'\|_2 \rightarrow \infty\).
We generate a sequence of 100 inputs (process indexes), define the mean function and the covariance function, and then build the covariance matrix.
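A minimal sketch of these steps, assuming a zero mean function and the squared exponential kernel with variance \(\sigma^2\) and length scale \(l\) (both set to 1 here); the helper sqexpcov is reused below.
n = 100
x = matrix(seq(0, 10, length = n), ncol = 1)   # 100 equally spaced inputs on [0, 10]
mu = rep(0, n)                                 # zero mean function
# squared exponential kernel k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 l^2))
sqexpcov = function(x, x1, l = 1, sigma = 1) sigma^2 * exp(-0.5 * (x - x1)^2 / l^2)
Sigma = outer(x[,1], x[,1], sqexpcov)          # n x n covariance matrix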
Generate a sample from the GP using the mvrnorm function from the MASS package and plot a sample.
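For example, a sketch using the objects defined above:
library(MASS)
f = mvrnorm(1, mu, Sigma)                      # one draw from N(mu, Sigma)
plot(x[,1], f, type = "l", xlab = "x", ylab = "f(x)")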
Figure 2: Sample from a Gaussian Process
A collection of 100 points of function \(f(x)\) sampled from a Gaussian Process with zero mean and squared exponential kernel for the set of 100 indexes \(x =(0,0.1,0.2,\ldots,10)\).
Let’s generate a few more samples from the same GP and plot them together
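A sketch: draw several functions at once (one per row) and overlay them.
fs = mvrnorm(5, mu, Sigma)                     # five independent draws from the same GP
matplot(x[,1], t(fs), type = "l", lty = 1, xlab = "x", ylab = "f(x)")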
Figure 3: Samples from a Gaussian Process
Stack the observed outputs \(Y\) at inputs \(X\) and the unknown outputs \(y_*\) at new inputs \(x_*\) into a single Gaussian vector with mean \((\mu, \mu_*)\) and covariance blocks \(K = K(X,X)\), \(K_* = K(X,x_*)\), \(K_{**} = K(x_*,x_*)\). The mean of the conditional distribution of \(y_*\) given \(Y\) is \[ \mu_{\mathrm{post}} = \mu_* + K_*^TK^{-1} (Y - \mu) \qquad(1)\] and the covariance is given by \[ \Sigma_{\mathrm{post}} = K_{**} - K_*^T K^{-1} K_*. \qquad(2)\]
Equation 1 and Equation 2 are convenient properties of a multivariate normal distribution.
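As an illustration (the training set behind the figures is not shown here, so this is an assumption of the sketch), take a handful of noise-free observations of \(\sin(x)\) — the blue curve in the plots below — and build \(K = K(X,X)\) with a small jitter eps on the diagonal; the value of eps is also an assumption.
# hypothetical training data: a few observations of sin(x)
n = 8                                          # number of training points
X = matrix(seq(0, 2*pi, length = n), ncol = 1)
Y = sin(X[,1])
eps = 1e-8                                     # small jitter added to the diagonal
K = outer(X[,1], X[,1], sqexpcov) + diag(eps, n)   # K(X, X) + eps * I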
diag(eps, n) \(=\epsilon I\) adds a diagonal matrix with a small \(\epsilon\) on the diagonal. Why? It keeps the covariance matrix numerically positive definite, so it can be inverted safely.
Now we generate a new set of inputs \(x_*\) and calculate the covariance matrices \(K_*\) and \(K_{**}\).
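A sketch of this step, continuing with the objects above:
q = 100                                            # number of new (test) inputs
XX = matrix(seq(0, 2*pi, length = q), ncol = 1)    # new inputs x_*
KX = outer(X[,1], XX[,1], sqexpcov)                # K_* = K(X, x_*), an n x q matrix
KXX = outer(XX[,1], XX[,1], sqexpcov) + diag(eps, q)   # K_** = K(x_*, x_*) + eps * I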
Notice that we did not add \(\epsilon I\) to the \(K_*\) = KX matrix, but we did add it to \(K_{**}\) = KXX to guarantee that the resulting posterior covariance matrix is non-singular (invertible).
Now, we can generate a sample from the posterior distribution over \(y_*\), given \(Y\)
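First compute the posterior mean and covariance from Equation 1 and Equation 2 (a sketch; the prior means are taken to be zero, as in the rest of this section):
Si = solve(K)                           # K^{-1}
mup = t(KX) %*% Si %*% Y                # posterior mean, Equation 1 with mu = mu_* = 0
Sigmap = KXX - t(KX) %*% Si %*% KX      # posterior covariance, Equation 2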
plot_gp = function(mup, Sigmap, X, Y, XX, YY){
  # 5% and 95% pointwise posterior quantiles (a 90% interval)
  q1 = mup + qnorm(0.05, 0, sqrt(diag(Sigmap)))
  q2 = mup + qnorm(0.95, 0, sqrt(diag(Sigmap)))
  matplot(XX, t(YY), type="l", col="gray", lty=1, xlab="x", ylab="y")  # posterior draws
  points(X, Y, pch=20, cex=2)        # observed data
  lines(XX, sin(XX), col="blue")     # true function
  lines(XX, mup, lwd=2)              # posterior mean
  lines(XX, q1, lwd=2, lty=2, col=2)
  lines(XX, q2, lwd=2, lty=2, col=2)
}
YY = mvrnorm(100, mup, Sigmap)
plot_gp(mup, Sigmap, X, Y, XX, YY)
Likelihood: \[ p(Y \mid X, \sigma, l) = \frac{1}{(2\pi)^{n/2} |K|^{1/2}} \exp \left ( -\frac{1}{2} Y^T K^{-1} Y \right ) \] where \(K = K(X,X)\) is the covariance matrix. We assume the mean is zero to simplify the formulas. The log-likelihood is given by \[ \log p(Y \mid X, \sigma, l) = -\frac{1}{2} \log |K| - \frac{1}{2} Y^T K^{-1} Y - \frac{n}{2} \log 2\pi. \]
Let’s implement a function that calculates the log-likelihood of the data given the hyper-parameters \(\sigma\) and \(l\) and use the optim function to find the maximum of the log-likelihood function.
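A sketch of this step: since optim minimizes by default, the function below returns the negative log-likelihood; it assumes the sqexpcov kernel and the jitter eps defined above, and par = c(sigma, l) to match the code that follows.
# negative log-likelihood of Y given X for hyper-parameters par = c(sigma, l)
loglik = function(par, X, Y) {
  sigma = par[1]; l = par[2]
  n = nrow(X)
  K = outer(X[,1], X[,1], sqexpcov, l, sigma) + diag(eps, n)
  ldetK = as.numeric(determinant(K, logarithm = TRUE)$modulus)
  # negative of: -1/2 log|K| - 1/2 Y' K^{-1} Y - n/2 log(2 pi)
  drop(0.5 * ldetK + 0.5 * t(Y) %*% solve(K) %*% Y + 0.5 * n * log(2 * pi))
}
par = optim(c(1, 1), fn = loglik, X = X, Y = Y, method = "BFGS")$par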
Make predictions about the output values at new inputs \(x_*\).
l = par[2]; sigma = par[1]   # estimated hyper-parameters
predplot = function(X, Y, XX, YY, l, sigma) {
  K = outer(X[,1], X[,1], sqexpcov, l, sigma) + diag(eps, n)       # K(X, X) + eps*I
  KX = outer(X[,1], XX[,1], sqexpcov, l, sigma)                    # K(X, x_*)
  KXX = outer(XX[,1], XX[,1], sqexpcov, l, sigma) + diag(eps, q)   # K(x_*, x_*) + eps*I
  Si = solve(K)
  mup = t(KX) %*% Si %*% Y                # posterior mean (we assume mu is 0)
  Sigmap = KXX - t(KX) %*% Si %*% KX      # posterior covariance
  YY = mvrnorm(100, mup, Sigmap)          # samples from the posterior
  plot_gp(mup, Sigmap, X, Y, XX, YY)
}
predplot(X, Y, XX, YY, l, sigma)
We can see that our uncertainty is much “tighter” once the hyper-parameters are estimated from the data.
Some matrix calculus: \[ \frac{\partial Y^T K^{-1} Y}{\partial \theta} = Y^T \frac{\partial K^{-1}}{\partial \theta} Y. \] The derivative of the inverse matrix \[ \frac{\partial K^{-1}}{\partial \theta} = -K^{-1} \frac{\partial K}{\partial \theta} K^{-1}. \] and the log of the determinant of a matrix \[ \frac{\partial \log |K|}{\partial \theta} = \mathrm{tr} \left ( K^{-1} \frac{\partial K}{\partial \theta} \right ), \]
Calculate the derivative of the log-likelihood function with respect to \(\theta\) \[ \frac{\partial \log p(Y \mid X,\theta)}{\partial \theta} = -\frac{1}{2}\frac{\partial \log |K|}{\partial \theta} - \frac{1}{2} Y^T \frac{\partial K^{-1}}{\partial \theta} Y. \] Putting it all together, we get \[ \frac{\partial \log p(Y \mid X,\theta)}{\partial \theta} = -\frac{1}{2} \mathrm{tr} \left ( K^{-1} \frac{\partial K}{\partial \theta} \right ) + \frac{1}{2} Y^T K^{-1} \frac{\partial K}{\partial \theta} K^{-1} Y. \]
\[ K_{ij} = k(x_i, x_j) = \sigma^2 \exp \left ( -\frac{1}{2} \frac{(x_i - x_j)^2}{l^2} \right ). \] The derivative of the covariance matrix with respect to \(\sigma\) is given by \[ \frac{\partial K_{ij}}{\partial \sigma} = 2\sigma \exp \left ( -\frac{1}{2} \frac{(x_i - x_j)^2}{l^2} \right );~\frac{\partial K}{\partial \sigma} = \dfrac{2}{\sigma}K. \] The derivative of the covariance matrix with respect to \(l\) is given by \[ \frac{\partial K_{ij}}{\partial l} = \sigma^2 \exp \left ( -\frac{1}{2} \frac{(x_i - x_j)^2}{l^2} \right ) \frac{(x_i - x_j)^2}{l^3};~ \left[\frac{\partial K}{\partial l}\right]_{ij} = \frac{(x_i - x_j)^2}{l^3} K_{ij}. \]
Now we can implement a function that calculates the derivative of the log-likelihood function with respect to \(\sigma\) and \(l\).
# Derivative of the log-likelihood function with respect to sigma
dloglik_sigma = function(par, X, Y) {
sigma = par[1]; l = par[2]
K = outer(X[,1],X[,1], sqexpcov,l,sigma) + diag(eps, n)
Si = solve(K)
dK = 2*K/sigma
tr = sum(diag(Si %*% dK))
return(-(-0.5 * tr + 0.5 * t(Y) %*% Si %*% dK %*% Si %*% Y))
}
# Derivative of the log-likelihood function with respect to l
dloglik_l = function(par, X, Y) {
sigma = par[1]; l = par[2]
K = outer(X[,1],X[,1], sqexpcov ,l,sigma) + diag(eps, n)
Si = solve(K)
dK = outer(X[,1],X[,1], function(x, x1) (x - x1)^2)/l^3 * K
tr = sum(diag(Si %*% dK))
return(-(-0.5 * tr + 0.5 * t(Y) %*% Si %*% dK %*% Si %*% Y))
}
# Gradient function that returns a vector of derivatives
gnlg = function(par,X,Y) {
return(c(dloglik_sigma(par, X, Y), dloglik_l(par, X, Y)))
}
Now we can use the optim function to find the maximum of the log-likelihood function and provide the derivative function we just implemented.
par1 = optim(c(1,1), fn=loglik, gr=gnlg ,X=X, Y=Y,method="BFGS")$par
l = par1[2]; sigma = par1[1]
print(par1)
[1] 1.48443 2.39151
The result is the same as when we called optim without the derivative function. Even the execution time is the same for our small problem. However, at a larger scale, the derivative-based optimization algorithm will be much faster.
R Package for Gaussian Processes
[1] 2.36025
Extrapolation: Posterior distribution over \(y_*\), given \(Y\)
We can see that outside of the range of the observed data, the model with \(\sigma=1\) is more “confident” in its predictions.
We first estimate the length scale parameter \(l\) using the laGP package.
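A sketch of the fitting step, assuming the motorcycle accident data (times as the input, acceleration as the output; for example mcycle from MASS) and laGP’s isotropic Gaussian kernel with length scale d. The starting value of d and the small fixed nugget g are assumptions of this sketch.
library(laGP)
library(MASS)                            # assumed source of the mcycle data
X = matrix(mcycle$times, ncol = 1)
Y = mcycle$accel
# newGP returns a handle to a GP object; dK = TRUE enables derivative-based MLE of d
gp = newGP(X, Y, d = 10, g = 1e-6, dK = TRUE)
mle = mleGP(gp, param = "d")             # maximum likelihood estimate of the length scale
mle$d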
Now we plot the data and the fit using the estimated length scale parameter \(l\).
library(ggplot2)
XX = matrix(seq(2.4, 55, length = 499), ncol=1)
p = predGP(gp, XX)
N = 499
q1 = qnorm(0.05, mean = p$mean, sd = sqrt(diag(p$Sigma)))
q2 = qnorm(0.95, mean = p$mean, sd = sqrt(diag(p$Sigma)))
q3 = qnorm(0.5, mean = p$mean, sd = sqrt(diag(p$Sigma)))
ggplot() + geom_point(aes(x=X,y=Y)) + geom_line(aes(x=XX,y=q3)) + geom_ribbon(aes(x=XX,ymin=q1, ymax=q2), alpha=0.2)
Motorcycle Accident Data. The black line is the mean of the posterior distribution over \(y_*\), given \(Y\); the shaded ribbon is the 90% interval (5% and 95% posterior quantiles).
Histogram of time values
The \(\sqrt{n}\) decay in variance of the posterior distribution is a property of the squared exponential kernel.
graph LR
NS[Decision on Next Samples]--Collect-->S[System Under Study]
S--Observe-->NS
Given a function \(f(x)\) that is not known analytically (it can represent, for example, the output of a complex computer program), the goal is to find \[ x^* = \arg\min_x f(x). \]
The Bayesian approach to this problem is the following: fit a GP surrogate (emulator) to the evaluations of \(f\) collected so far; use the surrogate’s posterior to choose the next input by maximizing an acquisition function; evaluate \(f\) at that input, update the surrogate, and repeat.
The expected improvement (EI) acquisition function. Let \[ f^* = \min_i y_i \] be our current best observed value. At a given point \(x\) with function value \(y = f(x)\), the expected improvement is defined as \[ a(x) = \mathbb{E}\left[\max(0, f^* - y)\right], \] where the expectation is taken over the GP posterior of \(y\) at \(x\). The function we take the expectation of, \[ u(x) = \max(0, f^* - y), \] is the utility function. Thus, the acquisition function is the expected value of the utility function.
\[ \int y \phi(y,\mu,\sigma^2) dy =\frac{1}{2} \mu \text{erf}\left(\frac{y-\mu }{\sqrt{2} \sigma }\right)-\frac{\sigma e^{-\frac{(y-\mu )^2}{2 \sigma ^2}}}{\sqrt{2 \pi }}, \] where \(\phi(y,\mu,\sigma^2)\) is the density and \(\Phi(y,\mu,\sigma^2)\) is the cumulative distribution function of the normal distribution. Thus, \[ \int_{-\infty}^{f^*} y \phi(y,\mu,\sigma^2) dy = \frac{1}{2} \mu \left(1+\text{erf}\left(\frac{f^*-\mu }{\sqrt{2} \sigma }\right)\right)-\frac{\sigma e^{-\frac{(f^*-\mu )^2}{2 \sigma ^2}}}{\sqrt{2 \pi}} = \mu \Phi(f^*,\mu,\sigma^2) - \sigma^2 \phi(f^*,\mu,\sigma^2). \]
we can write the acquisition function as \[ a(x) = \int_{-\infty}^{f^*} (f^* - y)\,\phi(y,\mu,\sigma^2)\,dy = (f^*-\mu)\,\Phi(f^*,\mu,\sigma^2) + \sigma^2 \phi(f^*,\mu,\sigma^2), \] where \(\mu\) and \(\sigma^2\) are the posterior mean and variance of the GP at \(x\).
Let’s implement it
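A minimal sketch, assuming the fitted laGP handle gp from above; the formula is written with the standard normal cdf and pdf via \(z = (f^* - \mu)/\sigma\).
# expected improvement at candidate inputs XX, given the current best value fstar
EI = function(gp, XX, fstar) {
  p = predGP(gp, XX)
  mu = p$mean
  s = sqrt(diag(p$Sigma))               # pointwise posterior standard deviation
  z = (fstar - mu) / s
  (fstar - mu) * pnorm(z) + s * dnorm(z)
}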
Taxi Simulator Visualization
Define a helper function to plot the GP emulator
plotgp = function(x,y,XX,p) {
  # posterior median and 5%/95% quantiles (90% pointwise interval) from the GP predictions
  q1 = qnorm(0.05, mean = p$mean, sd = sqrt(diag(p$Sigma)))
  q2 = qnorm(0.95, mean = p$mean, sd = sqrt(diag(p$Sigma)))
  q3 = qnorm(0.5, mean = p$mean, sd = sqrt(diag(p$Sigma)))
  ggplot() + geom_point(aes(x=x,y=y)) + geom_line(aes(x=XX,y=q3)) + geom_ribbon(aes(x=XX,ymin=q1, ymax=q2), alpha=0.2)
}
There is potentially a better value at around 50 taxis.
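For reference, a hypothetical sketch of the selection step inside nextsample: pick the candidate fleet size with the largest expected improvement (the actual nextsample presumably also runs the simulator at the chosen point and updates the GP; the argument names here are illustrative).
# hypothetical sketch: choose the next fleet size by maximizing expected improvement
nextsample = function(gp, y, lower = 1, upper = 100) {
  XX = matrix(lower:upper, ncol = 1)    # candidate numbers of taxis
  ei = EI(gp, XX, fstar = min(y))       # expected improvement at each candidate
  XX[which.max(ei), 1]                  # next point to evaluate with the simulator
}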
[1] 44
[1] 57
[1] 45
[1] 100
If we run nextsample one more time, we get 47, which is close to our current best of 45. Further, the model is confident at this location. It means that we can stop the algorithm and “declare a victory”.