Bayes AI

Unit 8: Bayesian Deep Learning

Vadim Sokolov
George Mason University
Spring 2025

Course Page, Slides

Bayes in Modern AI

Deep Learning

  • Rather than using shallow additive architectures common to most statistical models, deep learning uses layers of semi-affine input transformations to provide a predictive rule.
  • Applying these layers of transformations leads to a set of attributes (or, features) to which probabilistic statistical methods can be applied.
  • Deep learning is one of the most widely used machine learning methods for the analysis of large-scale, high-dimensional data sets.
  • Several deep learning architectures exist, each with its own uses and purposes. Convolutional Neural Networks (CNNs) handle two-dimensional inputs such as images and have been shown to outperform other techniques on such tasks. Recurrent Neural Networks (RNNs) have shown the best performance on speech and text analysis tasks.

Neural Network

  • Let \(f_1 , \ldots , f_L\) be given univariate activation functions for each of the \(L\) layers.
  • Activation functions are nonlinear transformations of weighted data.
  • A semi-affine activation rule is then defined by \[ f_l^{W_l,b_l} = f_l \left ( \sum_{j=1}^{N_l} W_{lj} X_j + b_l \right ) = f_l ( W_l X + b_l )\,, \] which implicitly requires the specification of the number of hidden units \(N_l\). Our deep predictor, given the number of layers \(L\), then becomes the composite map

\[ \hat{Y}(X) = F(X) = \left ( f_1^{W_1,b_1} \circ \ldots \circ f_L^{W_L,b_L} \right ) ( X)\,. \]

Neural Network

  • The fact that DL forms a universal “basis”, an idea we recognize in this formulation that dates back to Poincaré and Hilbert, is central.
  • From a practical perspective, given a large enough data set of “test cases”, we can empirically learn an optimal predictor.
  • Similar to a classic basis decomposition, the deep approach uses univariate activation functions to decompose a high dimensional \(X\).

Let \(Z^{(l)}\) denote the \(l\)th layer, and so \(X = Z^{(0)}\). The final output \(Y\) can be numeric or categorical. The explicit structure of a deep prediction rule is then \[ \begin{aligned} \hat{Y} (X) & = W^{(L)} Z^{(L)} + b^{(L)} \\ Z^{(1)} & = f^{(1)} \left ( W^{(0)} X + b^{(0)} \right ) \\ Z^{(2)} & = f^{(2)} \left ( W^{(1)} Z^{(1)} + b^{(1)} \right ) \\ \ldots & \\ Z^{(L)} & = f^{(L)} \left ( W^{(L-1)} Z^{(L-1)} + b^{(L-1)} \right )\,. \end{aligned} \]
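A minimal NumPy sketch of this composite map (the ReLU activations, layer sizes, and random weights here are illustrative assumptions, not part of the slides):

Code
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def deep_predictor(x, weights, biases):
    # hidden layers: Z^(l) = f(W^(l-1) Z^(l-1) + b^(l-1))
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)
    # linear output layer: Yhat = W^(L) Z^(L) + b^(L)
    return weights[-1] @ z + biases[-1]

# illustrative example: X in R^3, hidden layers with 4 and 2 units, scalar output
rng = np.random.default_rng(0)
weights = [rng.normal(size=s) for s in [(4, 3), (2, 4), (1, 2)]]
biases  = [rng.normal(size=n) for n in [4, 2, 1]]
print(deep_predictor(rng.normal(size=3), weights, biases))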

Motivating Example

  • Apply a feed-forward neural network with one hidden layer to a binary classification problem
Code
numSamples = 200 # total number of observations
radius = 10 # radius of the outer circle
noise = 0.0001 # amount of noise to be added to the data
d = matrix(0,ncol = 3, nrow = numSamples); # matrix to store our generated data

# Generate positive points inside the circle.
for (i in 1:(numSamples/2) ) {
  r = runif(1,0, radius * 0.4);
  angle = runif(1,0, 2 * pi);
  x = r * sin(angle);
  y = r * cos(angle);
  noiseX = runif(1,-radius, radius) * noise;
  noiseY = runif(1,-radius, radius) * noise;
  d[i,] = c(0, x + noiseX, y + noiseY)
}

# Generate negative points outside the circle.
for (i in (numSamples/2+1):numSamples ) {
  r = runif(1,radius * 0.8, radius);
  angle = runif(1,0, 2 * pi);
  x = r * sin(angle);
  y = r * cos(angle);
  noiseX = runif(1,-radius, radius) * noise;
  noiseY = runif(1,-radius, radius) * noise;
  d[i,] = c(1, x + noiseX, y + noiseY)
}
colnames(d) = c("label", "x1", "x2")
# Plot the training dataset
plot(d[,2],d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2")
Code
# utils$save_pdf("~/papers/twocultures/fig/","nn-circle-2")
x1 = seq(-11,11,length.out = 100)

Circle Example

Let’s try to use a simple logistic regression model to separate the two classes.

Code
# Fit a logistic regression model
fit = glm(label~x1+x2, data=as.data.frame(d), family=binomial(link='logit'))
# Plot the training dataset
plot(d[,2],d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2")
th = fit$coefficients
# Plot the decision boundary
abline(-th[1]/th[3], -th[2]/th[3], col=2)

We can see that logistic regression could not do it: it separates the two classes with a single line, and this data is not linearly separable.

Circle Example

However, we can use multiple lines to separate the data.

Code
plot(d[,2], d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2")
# Plot lines that separate once class (red) from another (green)
lines(x1, -x1 - 6); text(-4,-3,1)
lines(x1, -x1 + 6); text(4,3,2)
lines(x1,  x1 - 6); text(4,-3,3)
lines(x1,  x1 + 6); text(-3,4,4)

Circle Example

Now we do the same thing as in simple logistic regression and apply the logistic function to each of those lines

Code
# Define sigmoid function
sigmoid  = function(z) exp(z)/(1+exp(z))

# Define hidden layer of our neural network
features = function(x1,x2) {
  z1 =  6 + x1 + x2; a1 = sigmoid(z1)
  z2 =  6 - x1 - x2; a2 = sigmoid(z2)
  z3 =  6 - x1 + x2; a3 = sigmoid(z3)
  z4 =  6 + x1 - x2; a4 = sigmoid(z4)
  return(c(a1,a2,a3,a4))
}

Using matrix notation, we have \[ z = \sigma(Wx + b), ~ W = \begin{bmatrix} 1 & 1 \\ -1 & -1 \\ -1 & 1 \\ 1 & -1 \end{bmatrix}, ~ b = \begin{bmatrix} 6 \\ 6 \\ 6 \\ 6 \end{bmatrix}, ~ \sigma(z) = \frac{1}{1+e^{-z}} \]

The first layer of our NN maps \(R^2 \rightarrow R^4\); its output \(z\) is called a feature vector.

Circle Example

The feature vector \(z\) is then passed to the output layer, which applies simple logistic regression to the feature vector. \[ \hat{y} = \sigma(w^Tz + b), ~ w = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, ~ b = -3.1, ~ \sigma(z) = \frac{1}{1+e^{-z}} \]

The output of the output layer is the probability of the positive class.

Code
# Calculate prediction (classification) using our neural network
predict_prob = function(x){
  x1 = x[1]; x2 = x[2]
  z = features(x1,x2)
  # print(z)
  mu = sum(z) - 3.1
  # print(mu)
  sigmoid(mu)
}

Circle Example

We can now use our model to make predictions

Code
# Predict the probability of the positive class for a given point
predict_prob(c(0,0))
[1] 0.7089128
Code
predict_prob(c(0,10))
[1] 0.2565405

Circle Example

The model generates sensible predictions, let’s plot the decision boundary to see how well it separates the data.

Code
x1 = seq(-11,11,length.out = 100)
x2 = seq(-11,11,length.out = 100)
gr = as.matrix(expand.grid(x1,x2));
dim(gr)
[1] 10000     2
Code
yhat = apply(gr,1,predict_prob)
length(yhat)
[1] 10000
Code
image(x1,x2,matrix(yhat,ncol = 100), col = heat.colors(20,0.7))

Regression

How about a regression model? We will use a one-layer neural network to fit a quadratic function. We simulate noisy data from the following model \[ y = 0.5 + 0.3x^2 + \epsilon, ~ \epsilon \sim N(0,0.02^2) \]

  • 3 hidden units in the hidden layer
  • output layer is a single linear unit
  • ReLU activation function for the hidden layer (we try \(\tanh\) later)

The model is defined as follows

Code
relu = function(x) max(0,x)
nn = function(W,f=relu) {
    b0 = W[1]; w0=W[2:4];b1 = W[5]; w1 = W[6:8]
    z0 = apply(b0 + outer(x,w0,'*'),1:2,f)
    yhat = b1 + z0 %*% w1
    return(list(yhat = yhat[,1],z0=z0))
}

Regression

  • The output linear layer has a single output. Thus, the prediction yhat is generated as a linear model of the feature vector z0.
  • The model has 8 parameters
  • Let’s generate training data and fit the model
  • We will use the BFGS optimization algorithm to minimize the loss function (sum of squared errors, i.e. the negative log-likelihood under Gaussian noise).
Code
set.seed(99) #gretzky
params = c(0,rnorm(3),0,rnorm(3))
x = seq(-1,1,0.02)
y = 0.5 + 0.3*x^2 + rnorm(length(x),0,0.02)
loss = function(W) sum((nn(W)$yhat - y)^2)
res = optim(params, loss, method='BFGS')
res$par
[1] -0.2369376  1.3898195 -0.8211064  0.4558352  0.4974205  0.1733447  0.4632991
[8]  0.3879388

Regression

The solid black line is the neural network model, and the dashed lines are the basis functions. The model fits the data well.

Code
o = nn(res$par)
plot(x,y); lines(x,o$yhat, lwd=2)
lines(x,0.5+o$z0[,1],col=2, lwd=2, lty=2); lines(x,0.5+o$z0[,2],col=3, lwd=2, lty=2); lines(x,0.5+o$z0[,3],col=4, lwd=2, lty=2)

Figure 1: Noisy quadratic function approximated by a neural network with ReLU activation function.

Regression

Let’s try the \(\tanh\) function

Code
set.seed(8) #ovi
params = c(0,rnorm(3),0,rnorm(3))
loss = function(W) mean((nn(W,f=tanh)$yhat - y)^2)
res = optim(params, loss, method='BFGS')
res$par
[1] -0.9804458 -0.2331603  0.8336547 -1.1397178  0.8431593 -0.6489841  0.5880229
[8]  0.5269772
Code
o = nn(res$par, f=tanh)
plot(x,y, ylim=c(0.4,0.95)); lines(x,o$yhat, lwd=2);
lines(x,0.5*o$z0[,1]+0.9, lwd=2, lty=2, col=2); lines(x,0.5*o$z0[,2]+0.9, lwd=2, lty=2, col=3); lines(x,0.5*o$z0[,3]+0.9, lwd=2, lty=2, col=4)

Noisy quadratic function approximated by a neural network with tanh activation function.

Notice that we did not have to explicitly specify that our model needs a quadratic term; the model learned it from the data. This is the power of deep learning: the model learns the structure of the data from the data itself.

Regression

We can apply the same approach to interactions. Say the true model for the data is \[ y = 0.5 + 0.1x_1 + 0.2x_2 + 0.5x_1x_2+ \epsilon, ~ \epsilon \sim N(0,0.02^2) \] We can use the same model as above, but with two input variables. The model will learn the interaction term from the data.

Code
set.seed(99) #gretzky
x1 = seq(-1,1,0.01)
x2 = x1
y = 0.5 + 0.1*x1 + 0.2*x2 + 0.5*x1*x2 + rnorm(length(x1),0,0.02)
library("scatterplot3d")
s3d = scatterplot3d(x1,x2,y, pch=16)
x = cbind(x1,x2)
nn = function(W,f=relu) {
    b0 = W[1]; w0 = W[2:5]; b1 = W[6]; w1 = W[7:8]
    w0 = matrix(w0,nrow=2)
    z0 = apply(b0 + x%*%w0,1:2,f)
    yhat = b1 + z0 %*% w1
    return(list(yhat = yhat[,1],z0=z0))
}
W = c(0,rnorm(4),0,rnorm(2))
loss = function(W) sum((nn(W, f=tanh)$yhat - y)^2)
res = optim(W, fn=loss, method='BFGS')
res$par
[1]  0.7821996  0.5035370 -1.3927522  0.6322621 -0.9379012 -2.0564351 -2.8841640
[8]  6.7772229
Code
o = nn(res$par, f=tanh)
s3d$points3d(x1,x2,o$yhat, col=2, type='l', lwd=5)
s3d$points3d(x1,x2,o$z0[,1], col=3, type='l', lwd=5)
s3d$points3d(x1,x2,o$z0[,2], col=4, type='l', lwd=5)

Activation Functions

  • The last (output) layer of a neural network has a sigmoid activation function for a binary output variable (classification) and no activation function for a continuous output variable (regression).
  • The hidden layers can have different activation functions. The most common activation functions are the hyperbolic tangent function and the rectified linear unit (ReLU) function.

A typical approach is to use the same activation function for all hidden layers. The hyperbolic tangent function is defined as \[ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \] Notice that the hyperbolic tangent function is a shifted and rescaled sigmoid, \(\tanh(z) = 2\sigma(2z) - 1\), with \(\tanh(0) = 0\). It is a smooth function which is differentiable everywhere. However, it saturates for inputs that are large in magnitude, as discussed below.

Activation Functions

Plots of the common activation functions: tanh, hard tanh, softplus, ReLU, leaky ReLU, and sigmoid.
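For reference, a NumPy sketch of the activations listed above (the leaky-ReLU slope of 0.01 is a common default, chosen here for illustration):

Code
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)

def softplus(z):
    return np.log1p(np.exp(z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)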

Activation Functions

  • Typically \(\tanh\) is preferred to the sigmoid function because it is zero-centered.
  • The major drawback of sigmoid and \(\tanh\) functions is that they saturate when the input is very large or very small.
  • When we try to learn the weights of the network, the optimisation algorithm makes small steps in parameter space; when the weights are large, these small changes barely affect the layers’ outputs and the optimisation “stagnates”.
  • In the saturated regime the gradient is very small, which makes learning slow.

Activation Functions

The ReLU function is defined as \[ \text{ReLU}(z) = \max(0,z) \] It is a piecewise linear function that is computationally efficient and easy to optimize, and it is the most commonly used activation function in deep learning. ReLU is not differentiable at \(z=0\), but it is differentiable everywhere else, with derivative \[ \text{ReLU}'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \end{cases} \]

Quantile Linear Regression

  • Quantile regression is a statistical method that extends traditional linear regression
  • Models the relationship between predictor variables and specific quantiles of the response variable, rather than just the conditional mean.
  • Provides a more comprehensive view of the relationship between variables across different parts of the distribution.
  • Linear regression \[ E(Y|X) = X\beta \]

vs quantile regression \[ Q_Y(\tau|X) = X\beta(\tau) \] where \(\tau\) is the quantile of interest (0 < \(\tau\) < 1), and \(\beta(\tau)\) are the regression coefficients for the \(\tau\)th quantile.

What is a Quantile?

The distribution function (CDF) is \[ F(y) = P(Y \leq y) \] The \(\tau\)th quantile of \(Y\) is the value \(q\) with \(F(q) = \tau\), i.e. \[ q = F^{-1}(\tau) = \inf \{y : F(y) \geq \tau\} \]
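As a quick numeric illustration (a standard normal is assumed here): the \(\tau = 0.975\) quantile is \(q \approx 1.96\), and applying \(F\) to \(q\) recovers \(\tau\).

Code
from scipy.stats import norm

tau = 0.975
q = norm.ppf(tau)   # q = F^{-1}(tau)
print(q)            # approximately 1.96
print(norm.cdf(q))  # F(q) recovers tau = 0.975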

Quantile loss

  • Sum of absolute differences between the predicted and actual values.
  • It is used for regression problems with continuous variables. \[ \min_{\beta}\sum_{i=1}^n |y_i - f(x_i,\beta)| \] a.k.a. the quantile estimator with \(\tau=0.5\) (the median)

Quantile loss

Unconditional case: minimize \(\sum_{i=1}^n |y_i - \beta|\) over \(\beta\). Using \[ \frac{\mathrm{d} \left | x \right | }{\mathrm{d} x} = \operatorname{sign} \left( x \right), \] where \(\operatorname{sign}(x)\) is the sign function, differentiating the sum yields \[ \sum_{i=1}^n \operatorname{sign}(y_i - \beta). \] This equals zero only when the number of positive terms equals the number of negative terms, which happens when \(\beta\) is the median.

Quantile loss

A more rigorous and non-calculus proof is due to Schwertman (1990).

Let \(y_{(1)} \leq \ldots \leq y_{(n)}\) be the ordered data. For any \(\beta\) and any pair \((y_{(i)}, y_{(n+1-i)})\), \[ |y_{(i)} - \beta| + |y_{(n+1-i)} - \beta| \geq y_{(n+1-i)} - y_{(i)}, \] with equality if and only if \(y_{(i)} \leq \beta \leq y_{(n+1-i)}\). Summing over all such pairs shows that \(\sum_{i=1}^n |y_i - \beta|\) is minimized exactly when \(\beta\) lies in every interval \([y_{(i)}, y_{(n+1-i)}]\), i.e. in the innermost interval, which is precisely the set of medians of the data.

Quantile loss

The generalization of the median estimator to estimating the \(\tau\)th quantile is as follows \[ \min_{\beta}\sum_{i=1}^n \rho_{\tau}(y_i - \beta) \] where \(\rho_{\tau}(x) = x(\tau - \mathbb{I}(x < 0))\) is the quantile loss function. If we set \(\tau = 0.5\), the loss function becomes the absolute value function and we recover the median estimator. The expected loss is \[ E \rho_{\tau}(y - \beta) = (\tau-1)\int_{-\infty}^{\beta} (y-\beta)dF(y) + \tau\int_{\beta}^{\infty} (y-\beta)dF(y) \] Differentiating the expected loss with respect to \(\beta\) gives \(F(\beta) - \tau\); setting this to zero yields the quantile estimator \[ \hat{\beta}_{\tau} = F^{-1}(\tau) \]
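A small sanity check, as a sketch with simulated data: minimizing the empirical quantile loss over a scalar \(\beta\) recovers the empirical \(\tau\)th quantile.

Code
import numpy as np
from scipy.optimize import minimize_scalar

def rho(u, tau):
    # quantile ("pinball") loss: u * (tau - I(u < 0))
    return u * (tau - (u < 0))

rng = np.random.default_rng(1)
y = rng.normal(size=1000)
tau = 0.9
res = minimize_scalar(lambda b: rho(y - b, tau).sum(), bounds=(-5, 5), method='bounded')
print(res.x, np.quantile(y, tau))  # the two values should nearly coincide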

Bayesian Connection

Yu and Moyeed 2001

  • Turn loss function \[ \rho_{\tau}(x) = x(\tau - \mathbb{I}(x < 0)) \] into a likelihood function \[ f(x\mid \tau) \propto \exp\left(-\rho_{\tau}(x)\right) \] asymmetric Laplace distribution!

  • Can add location \(\mu\) and scale \(\sigma\) parameters to the likelihood function \[ f(x\mid \tau) = \frac{\tau(1 - \tau)}{\sigma} \exp\left(-\rho_\tau\left(\frac{x-\mu}{\sigma}\right)\right) \]
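A sketch of this asymmetric Laplace density in NumPy (reusing \(\rho_\tau\) from the previous slide); maximizing the corresponding likelihood in \(\mu\) is equivalent to minimizing the quantile loss.

Code
import numpy as np

def rho(u, tau):
    return u * (tau - (u < 0))

def ald_pdf(x, tau, mu=0.0, sigma=1.0):
    # f(x | tau) = tau (1 - tau) / sigma * exp(-rho_tau((x - mu) / sigma))
    return tau * (1.0 - tau) / sigma * np.exp(-rho((x - mu) / sigma, tau))

print(ald_pdf(np.array([-1.0, 0.0, 1.0]), tau=0.5))  # symmetric Laplace at tau = 0.5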

Estimation

Quantile regression estimates are obtained by minimizing the following objective function:

\[\min_{\beta} \sum_{i=1}^n \rho_\tau(y_i - x_i'\beta)\]

where \[\rho_\tau(u) = u(\tau - I(u < 0))\] is the tilted absolute value function.

Advantages and Applications

  1. Robustness to outliers
  2. Ability to capture heterogeneous effects across the distribution
  3. No assumptions about the distribution of the error terms

Quantile regression is particularly useful in:

  • Economics: Analyzing income disparities
  • Ecology: Studying species distributions
  • Healthcare: Developing growth charts
  • Finance: Assessing risk measures like Value at Risk (VaR)

Example in R

Let’s demonstrate quantile regression using the mtcars dataset in R:

Code
# Load required libraries
library(quantreg)
library(ggplot2)

# Load data
data(mtcars)

# Fit quantile regression models for different quantiles
quantiles <- c(0.1, 0.25, 0.5, 0.75, 0.9)
models <- lapply(quantiles, function(q) rq(mpg ~ wt, data = mtcars, tau = q))

# Create a plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = coef(lm(mpg ~ wt, data = mtcars))[1],
              slope = coef(lm(mpg ~ wt, data = mtcars))[2],
              color = "red", linetype = "dashed") +
  geom_abline(intercept = sapply(models, function(m) coef(m)[1]),
              slope = sapply(models, function(m) coef(m)[2]),
              color = c("blue", "green", "purple", "orange", "brown")) +
  labs(title = "Quantile Regression: MPG vs Weight",
       x = "Weight (1000 lbs)", y = "Miles per Gallon") +
  theme_minimal()
Code
# Print summary of median regression
summary(models[[3]])

Call: rq(formula = mpg ~ wt, tau = q, data = mtcars)

tau: [1] 0.5

Coefficients:
            coefficients lower bd upper bd
(Intercept) 34.23224     32.25029 39.74085
wt          -4.53947     -6.47553 -4.16390

This code fits quantile regression models for the 10th, 25th, 50th, 75th, and 90th percentiles of miles per gallon (mpg) based on car weight. The resulting plot shows how the relationship between weight and fuel efficiency varies across different quantiles of the mpg distribution.

Quantile Neural Network for Synthetic Data

Code
import numpy as np
import torch
import matplotlib.pyplot as plt
import scipy.stats 
# Sin
n = 10000
# x = np.linspace(-1,1, n)
np.random.seed(8)
x = np.random.uniform(-1,1,(n))
x = np.sort(x)
eps = np.random.normal(0,np.exp(1-x)/10)
mu = np.sin(np.pi*x)/(np.pi*x)
y  = mu + eps
def truef(tau):
    # true conditional tau-quantile: mean plus noise sd times the standard normal quantile;
    # note exp(1-x)/10 is the noise sd used to generate eps above
    return torch.sin(torch.pi*x)/(torch.pi*x) + (torch.exp(1-x)/10)*scipy.stats.norm.ppf(tau)

Convert Data to PyTorch Tensors

Code
x  = torch.as_tensor(x,dtype=torch.float32).view(-1,1)
y  = torch.as_tensor(y,dtype=torch.float32)
plt.scatter(x,y,s=1)    

Define the Model

Code
import torch.nn as nn
class QuantNet(nn.Module):
    def __init__(self, xsz=1):
        super(QuantNet, self).__init__()
        self.nh = 64
        hsz = 32
        hsz1 = 32
        self.fcx = nn.Linear(xsz, hsz)
        self.fctau = nn.Linear(self.nh, hsz)
        self.fcxtau = nn.Linear(hsz , hsz1)
        self.fcxtau1 = nn.Linear(hsz1 , hsz1)
        self.fc = nn.Linear(hsz1 , 2)
    def forward(self, x,tau):
        tau = torch.cos(torch.arange(start=0,end=self.nh)*torch.pi*tau)
        tau = torch.relu(self.fctau(tau)) # function phi from paper 
        x = torch.relu(self.fcx(x)) # function psi
        x = torch.relu(self.fcxtau(x*tau)) # first layer of function g
        x = torch.relu(self.fcxtau1(x)) # second layer of function g
        x = self.fc(x) # third layer of function g
        return x

Model Estimation (a.k.a. Training)

Code
def train(model, x,y,optimizer,epochs):
    lv = np.zeros(epochs)
    for t in range(epochs):
        tau = torch.rand(1).item()
        f = model(x,tau)
        e = y.view(-1,1)-f
        loss = 0.1*torch.mean(torch.square(e[:,0]))
        # loss = 0
        loss += torch.mean(torch.maximum(tau*e[:,1],(tau-1)*e[:,1]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lv[t] = loss.item()
        if t % 2000 == 0:
            print(f"Epoch {t}: Loss = {loss.item():>7f}")
    print(f"Epoch {t}: Loss = {loss.item():>7f}")
    return lv

Create Model

Code
torch.manual_seed(8) # ovi
<torch._C.Generator object at 0x10935f190>
Code
def init_weights(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)

model = QuantNet()
model.apply(init_weights)
QuantNet(
  (fcx): Linear(in_features=1, out_features=32, bias=True)
  (fctau): Linear(in_features=64, out_features=32, bias=True)
  (fcxtau): Linear(in_features=32, out_features=32, bias=True)
  (fcxtau1): Linear(in_features=32, out_features=32, bias=True)
  (fc): Linear(in_features=32, out_features=2, bias=True)
)
Code
optimizer = torch.optim.RMSprop(model.parameters())
print(model)
QuantNet(
  (fcx): Linear(in_features=1, out_features=32, bias=True)
  (fctau): Linear(in_features=64, out_features=32, bias=True)
  (fcxtau): Linear(in_features=32, out_features=32, bias=True)
  (fcxtau1): Linear(in_features=32, out_features=32, bias=True)
  (fc): Linear(in_features=32, out_features=2, bias=True)
)

Run the Estimation (Training)

Code
lv = train(model, x,y,optimizer,1000)
Epoch 0: Loss = 0.382400
Epoch 999: Loss = 0.135881
Code
plt.plot(lv)

Make Predictions and Plot

Code
plt.scatter(x,y,s=1, label='Data');
plt.plot(x, model(x,0.05).detach().numpy()[:,1],'g-', linewidth=2, label='5% Percentile');
plt.plot(x, truef(0.05),'g--', linewidth=2, label='True 5% Percentile');
plt.plot(x, model(x,0.95).detach().numpy()[:,1],'r-', linewidth=2, label='95% Percentile');
plt.plot(x, truef(0.95),'r--', linewidth=2, label='True 95% Percentile');
plt.plot(x, model(x,0.5).detach().numpy()[:,1],'b-', linewidth=2, label='50% Percentile');
plt.plot(x, truef(0.5),'b--', linewidth=2, label='True 50% Percentile');
plt.legend(loc='lower right')

Predict at a specific location

Code
xn = 0.5
print(np.exp(1-xn)/10)
0.1648721270700128
Code
ns = 5000
tau = torch.rand(ns)
yhat = np.empty(ns)
for i in range(ns):
    yhati  = model(torch.as_tensor([xn],dtype=torch.float32),tau[i])
    yhat[i] = yhati[1].detach().numpy()
print(np.std(yhat))
0.059542930636527304
Code
plt.hist(yhat,bins=50);

Fixed Quantile Neural Network

Code
class FixedQuantNet(nn.Module):
    def __init__(self, xsz=1, tau = [0.5]):
        super(FixedQuantNet, self).__init__()
        self.nh = 64
        hsz = 64
        self.fcx = nn.Linear(xsz, hsz)
        self.tau = tau
        self.nq = len(self.tau)
        self.fc = nn.Linear(hsz , self.nq + 1)
    def forward(self, x):
        x = torch.relu(self.fcx(x))
        x = self.fc(x)
        return x
    def train(self,x,y,optimizer,epochs):
        lv = np.zeros(epochs)
        for t in range(epochs):
            f = self(x)
            e = y.view(-1,1)-f
            loss = 0.1*torch.mean(torch.square(e[:,0]))
            # loss = 0
            for i in range(self.nq):
                loss += torch.mean(torch.maximum(self.tau[i]*e[:,i+1],(self.tau[i]-1)*e[:,i+1]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            lv[t] = loss.item()
            if t % 2000 == 0:
                print(f"Epoch {t}: Loss = {loss.item():>7f}")
        print(f"Epoch {t}: Loss = {loss.item():>7f}")
        return lv

Train the Model

Code
modelf = FixedQuantNet(tau=[0.05,0.5,0.95])
modelf.apply(init_weights)
FixedQuantNet(
  (fcx): Linear(in_features=1, out_features=64, bias=True)
  (fc): Linear(in_features=64, out_features=4, bias=True)
)
Code
optimizer = torch.optim.RMSprop(modelf.parameters())
lv = modelf.train(x,y,optimizer,200)
Epoch 0: Loss = 1.791370
Epoch 199: Loss = 0.221048
Code
yhat = modelf(x).detach().numpy()
plt.plot(lv)

Plots

Code
# model.train(x,y,optimizer,1000)
plt.scatter(x,y,s=1, label='Data');
plt.plot(x, yhat[:,1],'g-', linewidth=2, label='5% Percentile'); plt.plot(x, truef(0.05),'g--', linewidth=2, label='True 5% Percentile');
plt.plot(x, yhat[:,3],'r-', linewidth=2, label='95% Percentile'); plt.plot(x, truef(0.95),'r--', linewidth=2, label='True 95% Percentile');
plt.plot(x, yhat[:,2],'b-', linewidth=2, label='50% Percentile'); plt.plot(x, truef(0.5),'b--', linewidth=2, label='True 50% Percentile');
plt.legend(loc='lower right')

Hierarchical Bayesian Model

We consider the following model: \[\begin{align*} \tau &\sim \mathrm{Gamma}(0.5, 0.5) \\ \lambda_d &\sim \mathrm{Gamma}(0.5, 0.5) \\ \beta_d &\sim \mathcal{N}(0, 20) \\ y_n &\sim \mathrm{Bernoulli}(\sigma((\tau \lambda \odot \beta)^T x_n)), \end{align*}\] where

  • \(\tau\) is a scalar global coefficient scale
  • \(\lambda\) is a vector of local scales
  • \(\beta\) is the vector of unscaled coefficients

Code
import tensorflow_probability.substrates.jax as tfp
import numpy as np
import jax.numpy as jnp
tfd = tfp.distributions
import jax
import matplotlib.pyplot as plt

Apply to Iris Data

Code
import pandas as pd
iris = pd.read_csv("data/iris.csv")
print(iris.shape)
(150, 5)
Code
print(iris.head())
   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
Code
y = iris['variety']=="Setosa"
x = iris["sepal.length"].values
plt.scatter(x,y)
x = np.c_[np.ones(150),x]
print(x.shape)
(150, 2)

Log Density Function

For Hamiltonian MC, we only need to evaluate the joint log-density pointwise; gradients come from automatic differentiation (see below)

Code
def joint_log_prob(x, y, tau, lamb, beta):
    lp = tfd.Gamma(0.5, 0.5).log_prob(tau)
    lp += tfd.Gamma(0.5, 0.5).log_prob(lamb).sum() 
    lp += tfd.Normal(0., 20).log_prob(beta).sum() 
    logits = x @ (tau * lamb * beta)
    lp += tfd.Bernoulli(logits).log_prob(y).sum() 
    return lp   

tau = np.random.rand(1)[0]
lamb = np.random.rand(2)
beta = np.random.rand(2)
joint_log_prob(x, y, tau, lamb, beta)
Array(-126.7892, dtype=float32)

Change of Variables

\[ z\triangleq T^{-1}(\theta),\qquad \pi(z) = \pi(\theta) \left| \frac{\partial T}{\partial z} (z) \right|, \]

Taking the logarithm of both sides, we get \[ \log \pi(z) = \log \pi(\theta) + \log \left| \frac{\partial T }{\partial z}(z) \right| \] Use \(T(z)=e^z\), and \(\log|\frac{\partial T}{\partial z}(z)| = z\)

Change of Variables in Code

Code
def unconstrained_joint_log_prob(x, y, theta):
    ndims = x.shape[-1]
    unc_tau, unc_lamb, beta = jnp.split(theta, [1, 1 + ndims])
    unc_tau = unc_tau.reshape([])  # make unc_tau a scalar
    tau = jnp.exp(unc_tau)
    ldj = unc_tau
    lamb = jnp.exp(unc_lamb)
    ldj += unc_lamb.sum()
    return joint_log_prob(x, y, tau, lamb, beta) + ldj

Let’s check our function

Code
from functools import partial
target_log_prob = partial(unconstrained_joint_log_prob, x, y)
theta = np.r_[tau,lamb,beta]
target_log_prob(theta)
Array(-530.62317, dtype=float32)

Automatic Differentiation

Code
target_log_prob_and_grad = jax.value_and_grad(target_log_prob)
tlp, tlp_grad = target_log_prob_and_grad(theta)
print(tlp)
-530.62317
Code
tlp_grad
Array([ -511.1795  ,   -95.032646,  -416.80698 ,  -185.95103 ,
       -1782.1393  ], dtype=float32)

Hamiltonian Monte Carlo

Code
def leapfrog_step(target_log_prob_and_grad, step_size, i, leapfrog_state):
    z, m, tlp, tlp_grad = leapfrog_state
    m += 0.5 * step_size * tlp_grad
    z += step_size * m
    tlp, tlp_grad = target_log_prob_and_grad(z)
    m += 0.5 * step_size * tlp_grad
    return z, m, tlp, tlp_grad

def hmc_step(target_log_prob_and_grad, num_leapfrog_steps, step_size, z, seed):
    m_seed, mh_seed = jax.random.split(seed)
    tlp, tlp_grad = target_log_prob_and_grad(z)
    m = jax.random.normal(m_seed, z.shape)
    energy = 0.5 * jnp.square(m).sum() - tlp
    new_z, new_m, new_tlp, _ = jax.lax.fori_loop(
        0,
        num_leapfrog_steps,
        partial(leapfrog_step, target_log_prob_and_grad, step_size),
        (z, m, tlp, tlp_grad)) 
    new_energy = 0.5 * jnp.square(new_m).sum() - new_tlp
    log_accept_ratio = energy - new_energy
    is_accepted = jnp.log(jax.random.uniform(mh_seed, [])) < log_accept_ratio
    # select the proposed state if accepted
    z = jnp.where(is_accepted, new_z, z)
    hmc_output = {"z": z,
                  "is_accepted": is_accepted,
                  "log_accept_ratio": log_accept_ratio}
    # hmc_output["z"] has shape [num_dimensions]
    return z, hmc_output

def hmc(target_log_prob_and_grad, num_leapfrog_steps, step_size, num_steps, z,
        seed):
    # create a seed for each step
    seeds = jax.random.split(seed, num_steps)
    # this will repeatedly run hmc_step and accumulate the outputs
    _, hmc_output = jax.lax.scan(
        partial(hmc_step, target_log_prob_and_grad, num_leapfrog_steps, step_size),
        z, seeds)
    # hmc_output["z"] now has shape [num_steps, num_dimensions]
    return hmc_output

def scan(f, state, xs):
  # pure-Python reference implementation of jax.lax.scan, shown for exposition
  output = []
  for x in xs:
    state, y = f(state, x)
    output.append(y)
  return state, jnp.stack(output)

HMC

Code
num_leapfrog_steps=30
step_size = 0.008
from jax import random
seed = random.PRNGKey(92)
num_samples=10000
hmc_output = hmc(target_log_prob_and_grad, num_leapfrog_steps, step_size,
    num_samples, theta, seed)

Inspect the results

Code
ndims = x.shape[-1]
thetap = hmc_output['z']
taup, lambp, betap = jnp.split(thetap, [1, 1 + ndims],axis=1)
skip = 2000
slope = betap[skip:,1]
intercept = betap[skip:,0]
plt.hist(slope,bins=50);
plt.hist(intercept,bins=50);

Inspect the results

Code
print(np.quantile(slope,[0.05,0.95,0.5]))
[-14.40216732  -1.37361898  -5.5752933 ]
Code
print(np.quantile(intercept,[0.05,0.95,0.5]))
[17.57117882 38.81280651 28.22325706]

Distributional Reinforcement Learning

Reading list:

Mixture Density Networks (MDNs)

Mixture model head: outputs parameters for:

  • Means (\(\mu_1,\ldots,\mu_M\))
  • Variances (\(\sigma_1^2,\ldots,\sigma_M^2\))
  • Mixing coefficients (\(\pi_1,\ldots,\pi_M\))

\[ p(y\mid x) = \sum_{m=1}^M \pi_m \mathcal{N}(y\mid \mu_m, \sigma_m^2) \]

Loss function: \[ L(\theta) = -\frac{1}{N}\sum_{i=1}^N \log p(y_i\mid x_i, \theta) \] where \(\theta\) are the parameters of the model.
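A minimal PyTorch sketch of a mixture density head and its negative log-likelihood loss (the layer sizes and \(M\) are illustrative assumptions; see mdn-nn.ipynb for the full example):

Code
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, xsz=1, hsz=32, M=5):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(xsz, hsz), nn.Tanh())
        self.pi = nn.Linear(hsz, M)         # mixing coefficient logits
        self.mu = nn.Linear(hsz, M)         # component means
        self.log_sigma = nn.Linear(hsz, M)  # component log standard deviations

    def forward(self, x):
        h = self.hidden(x)
        return self.pi(h), self.mu(h), self.log_sigma(h)

def mdn_loss(pi_logits, mu, log_sigma, y):
    # -(1/N) sum_i log sum_m pi_m N(y_i | mu_m, sigma_m^2), computed in log space for stability;
    # y must have shape (n, 1) so it broadcasts against the M components
    log_comp = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(y)
    return -torch.logsumexp(torch.log_softmax(pi_logits, dim=-1) + log_comp, dim=-1).mean()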

mdn-nn.ipynb

Mean Field Variational Inference

  • Approximate Bayesian inference for deep learning models.
  • This approach offers a computationally efficient method to capture uncertainty in deep neural networks while maintaining scalability.

Given data \(\mathcal{D}\), the posterior distribution over parameters \(\boldsymbol{\theta}\) is given by:

\[ p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})}{p(\mathcal{D})} \]

Mean Field Variational Inference

  • Approximates the complex posterior distribution with a simpler, tractable distribution by minimizing the Kullback-Leibler (KL) divergence between them.
  • This is equivalent to maximizing the Evidence Lower Bound (ELBO), defined as:

\[ \text{ELBO}(q) = \int \left[ q(z) \log p(x,z) - q(z) \log q(z) \right] dz \]

where \(q(z)\) is the variational approximation to the true posterior \(p(z|x)\)

Assume that the variational family is fully factorized. For latent variables \(Z = \{Z_1, \ldots, Z_n\}\) and observed variables \(X = \{X_1, \ldots, X_n\}\), the mean field approximation takes the form:

\[ p(z|x) \approx \prod_{i=1}^{n} q_i(z_i) \]

i.e., the latent variables are assumed independent!

Coordinate Ascent Algorithm

\[ q_j(z_j) \propto \exp(E_{z_{i\neq j} \sim q_{i\neq j}}[\log p(z_j|x, z_{i\neq j})]) \]

When applying mean field variational inference to neural networks, we typically assume independence between all weight parameters:

\[ q(\boldsymbol{\theta}) = \prod_{i=1}^{|\boldsymbol{\theta}|} q_i(\theta_i) \]

Each weight distribution is commonly modeled as a Gaussian:

\[ q_i(\theta_i) = \mathcal{N}(\theta_i|\mu_i, \sigma_i^2) \]

The variational parameters \(\phi = \{\mu_i, \sigma_i\}\) are optimized to maximize the ELBO.

Practical Implementation

In practice, implementing mean field variational inference for BNNs involves:

  1. Specifying a prior distribution over weights (often Gaussian)
  2. Defining a likelihood function (typically Gaussian for regression tasks)
  3. Parameterizing the variational distribution as factorized Gaussians
  4. Optimizing the ELBO using stochastic gradient techniques

For example, in regression tasks, we might assume a Gaussian likelihood:

\[ p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{n}\mathcal{N}(y_n|f(x_n,\boldsymbol{\theta}), \sigma^2_{\text{noise}}) \]

And independent Gaussian priors over parameters:

\[ p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}|\boldsymbol{0}, \sigma_p^2\mathbf{I}) \]

where \(\sigma_p\) is a hyperparameter controlling the prior variance.
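A minimal sketch of these four steps for a single layer in PyTorch (Bayes-by-backprop style, using the reparameterization trick; the initialization and prior scale are illustrative assumptions):

Code
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldLinear(nn.Module):
    # linear layer with a factorized Gaussian q(theta) = prod_i N(theta_i | mu_i, sigma_i^2)
    def __init__(self, in_sz, out_sz, prior_sigma=1.0):
        super().__init__()
        self.mu = nn.Parameter(0.1 * torch.randn(out_sz, in_sz))
        self.rho = nn.Parameter(-3.0 * torch.ones(out_sz, in_sz))  # sigma = softplus(rho)
        self.prior_sigma = prior_sigma

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)  # reparameterized weight sample
        # KL(q || p) between factorized Gaussians, summed over all weights
        self.kl = (torch.log(self.prior_sigma / sigma)
                   + (sigma**2 + self.mu**2) / (2 * self.prior_sigma**2) - 0.5).sum()
        return x @ w.t()

The ELBO objective is then the negative log-likelihood of the data plus the sum of the layers’ kl terms (scaled appropriately when using mini-batches).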

In-between uncertainty is an issue!

Stochastic-gradient Markov chain Monte Carlo (SGMCMC)

Use mini-batches of data and Langevin dynamics for efficient exploration!

At each iteration, SGLD updates the model parameters using: \[ \theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \nabla_\theta \log p(\theta_t | \text{data}) + \eta_t \] where:

  • \(\epsilon_t\) is the step size,
  • \(\nabla_\theta \log p(\theta_t | \text{data})\) is the gradient of the log-posterior (often estimated using a mini-batch of data, as in SGD),
  • \(\eta_t \sim \mathcal{N}(0, \epsilon_t)\) is Gaussian noise.
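A sketch of one SGLD update in PyTorch (grad_log_post is a placeholder for your model’s mini-batch estimate of the log-posterior gradient; the decay schedule for \(\epsilon_t\) is left to the user):

Code
import torch

def sgld_step(theta, grad_log_post, eps):
    # theta_{t+1} = theta_t + (eps/2) * grad log p(theta | data) + N(0, eps) noise
    noise = eps**0.5 * torch.randn_like(theta)
    return theta + 0.5 * eps * grad_log_post(theta) + noise

# after a burn-in period, the iterates theta_t are treated as approximate posterior samples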

Other Approaches

Some Applications of Bayes Approaches in LLMs

  • Bayesian techniques are increasingly being used in the context of large language models (LLMs).
  • These developments highlight the growing synergy between Bayesian methods and large language models, offering improvements in model performance, uncertainty quantification, and interpretability.
  • Uncertainty Estimation: Bayesian Prompt Ensembles (BayesPE) have been proposed as a novel approach to obtain well-calibrated uncertainty estimates for black-box LLMs. This method uses a weighted ensemble of semantically equivalent prompts and applies Bayesian variational inference to estimate the weights.
  • Enhancing Bayesian Optimization: A new approach called LLAMBO integrates LLMs within Bayesian optimization frameworks. This method frames the optimization problem in natural language, allowing LLMs to propose and evaluate solutions based on historical data. LLAMBO has shown promise in improving surrogate modeling and candidate sampling, especially in early stages of search.

Some Applications of Bayes Approaches in LLMs

  • Probability Estimation: The BIRD framework incorporates abductive factors, LLM entailment, and learnable deductive Bayesian modeling to provide controllable and interpretable probability estimation for model decisions. This approach has demonstrated a 35% improvement over GPT-4 in aligning probability estimates with human judgments.
  • Natural Language Processing: Bayesian techniques have been applied to various NLP tasks, including word segmentation, syntax analysis, morphology, coreference resolution, and machine translation. These methods offer an elegant way to incorporate prior knowledge and manage uncertainty over parameters.
  • Deep Bayesian Learning: Researchers are exploring the integration of Bayesian principles with deep learning models for NLP applications. This includes the development of hierarchical Bayesian models, variational autoencoders, and [stochastic neural networks](https://aclanthology.org/P19-4006/).
  • [Prompt Optimisation](https://dl.acm.org/doi/10.1007/978-3-031-75623-8_28)

Bayesian Optimization

  • Bayesian optimization is a powerful tool for optimizing expensive-to-evaluate functions
  • RunAI was recently acquired by Nvidia for $700m
  • See bo-nn.ipynb example