Bayes AI

Unit 6: Markov Chain Monte Carlo

Vadim Sokolov
George Mason University
Spring 2025

Course Page, Slides

MCMC Simulation

Suppose that \(X \sim F_X ( x )\) and let \(Y = g (X)\).

How do we find \(F_Y ( y )\) and \(f_Y ( y )\) ?

  • von Neumann

Given a uniform \(U\), how do we find \(X= g(U)\)?

  • In the bivariate case \((X,Y) \rightarrow (U,V)\).

We need to find \(f_{(U,V)} ( u , v )\) from \(f_{X,Y}(x,y)\)

  • Applications: Simulation, MCMC and PF.

Transformations

The cdf identity gives \[ F_Y ( y) = \mathbb{P} ( Y \leq y ) = \mathbb{P} ( g( X) \leq y ) \]

  • Hence if the function \(g ( \cdot )\) is monotone we can invert to get

\[ F_Y ( y ) = \int_{ g( x) \leq y } f_X ( x ) dx \]

  • If \(g\) is increasing \(F_Y ( y ) = P( X \leq g^{-1} ( y ) ) = F_X ( g^{-1} ( y ) )\)

If \(g\) is decreasing \(F_Y ( y ) = P( X \geq g^{-1} ( y ) ) = 1 - F_X ( g^{-1} ( y ) )\)

Transformation Identity

  1. Theorem 1: Let \(X\) have pdf \(f_X ( x)\) and let \(Y=g(X)\). Then if \(g\) is a monotone function we have

\[ f_Y ( y) = f_X ( g^{-1} ( y ) ) \left | \frac{ d}{dy} g^{-1} ( y ) \right | \] There’s also a multivariate version of this that we’ll see later.

  • Suppose \(X\) is a continuous rv, what’s the pdf for \(Y = X^2\)?

  • Let \(X \sim N ( 0 ,1 )\), what’s the pdf for \(Y = X^2\)?
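For the \(N(0,1)\) case, \(g(x) = x^2\) is not monotone, so we split the range and use the cdf method: \[ F_Y(y) = \mathbb{P}( X^2 \leq y ) = \Phi( \sqrt{y} ) - \Phi( -\sqrt{y} ), \qquad f_Y(y) = \frac{1}{\sqrt{y}}\, \phi( \sqrt{y} ) = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \; y > 0, \] which is the \(\chi^2_1\) density.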

Probability Integral Transform

Theorem: Suppose that \(U \sim U[0,1]\), then for any continuous distribution function \(F\), the random variable \(X= F^{-1} (U)\) has distribution function \(F\).

  • Remember that for \(u \in [0,1]\), \(\mathbb{P} \left ( U \leq u \right ) = u\), so we have

\[ \mathbb{P} \left (X \leq x \right )= \mathbb{P} \left ( F^{-1} (U) \leq x \right )= \mathbb{P} \left ( U \leq F(x) \right )=F(x) \] Hence, \(X = F_X^{-1}(U)\).
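As a quick illustration, here is a minimal R sketch (variable names are mine) of inverse transform sampling for the Exponential(\(\lambda\)) distribution, whose cdf \(F(x) = 1 - e^{-\lambda x}\) inverts to \(F^{-1}(u) = -\log(1-u)/\lambda\):

# inverse transform sampling: Exponential(lambda) draws from Uniform(0,1)
lambda <- 2
u <- runif(10000)
x <- -log(1 - u) / lambda  # x = F^{-1}(u)
hist(x, freq=FALSE, breaks=50)
curve(dexp(x, rate=lambda), add=TRUE, col="red", lwd=2)  # should match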

Normal

Sometimes there are short-cut formulas to generate random draws.

Normal \(N(0,I_2)\): \(x_1,x_2\) uniform on \([0,1]\) then \[ \begin{aligned} y_1 = & \sqrt{-2\log x_1}\cos(2\pi x_2)\\ y_2 = & \sqrt{-2\log x_1}\sin(2\pi x_2) \end{aligned} \]
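A minimal R sketch of this Box-Muller transform (variable names are mine):

# Box-Muller: two independent N(0,1) draws from two Uniform(0,1) draws
x1 <- runif(5000); x2 <- runif(5000)
y1 <- sqrt(-2*log(x1)) * cos(2*pi*x2)
y2 <- sqrt(-2*log(x1)) * sin(2*pi*x2)
qqnorm(y1); qqline(y1)  # check normality of one coordinate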

Simulation and Transformations

An important application is the transformation of multiple random variables.

  • Suppose that we have random variables:

\[ ( X , Y ) \sim f_{ X , Y} ( x , y ) \] A transformation of interest given by: \[ U = g ( X , Y ) \; \; {\rm and} \; \; V = h ( X , Y ) \]

  • The problem is how to compute \(f_{ U , V } ( u , v )\)? The key tool is the Jacobian

\[ J = \frac{ \partial ( x , y ) }{ \partial ( u , v ) } = \left | \begin{array}{cc} \frac{ \partial x }{ \partial u} & \frac{ \partial x }{ \partial v} \\ \frac{ \partial y }{ \partial u} & \frac{ \partial y }{ \partial v} \end{array} \right | \]

Bivariate Change of Variable

  • Theorem: (change of variable)

\[ f_{ U , V } ( u , v ) = f_{ X , Y} ( h_1 ( u , v ) , h_2 ( u , v ) ) \left | \frac{ \partial ( x , y ) }{ \partial ( u , v ) } \right | \] The last term is the Jacobian.

This can be calculated in two ways.

\[ \left | \frac{ \partial ( x , y ) }{ \partial ( u , v ) } \right | = 1 / \left | \frac{ \partial ( u , v ) }{ \partial ( x , y ) } \right | \]

  • So we don’t always need the inverse transformation \(( x , y ) = ( g^{-1} ( u , v ) , h^{-1} ( u , v ) )\)

Inequalities and Identities

  1. Markov

\[ \mathbb{P} \left ( g( X ) \geq c \right ) \leq \frac{ \mathbb{E} ( g(X) ) }{c } \; \; {\rm where} \; \; g( X) \geq 0 \]

  2. Chebyshev

\[ \mathbb{P} \left ( | X - \mu | \geq c \right ) \leq \frac{ Var(X) }{c^2 } \]

  3. Jensen (for concave \(\phi\))

\[ \mathbb{E} \left ( \phi ( X ) \right ) \leq \phi \left ( \mathbb{E}( X ) \right ) \]

  4. Cauchy-Schwarz \[ | corr (X,Y) | \leq 1 \]

Chebyshev follows from Markov (apply Markov to \((X-\mu)^2\) with constant \(c^2\)). On Cauchy-Schwarz, see J. Michael Steele's book The Cauchy-Schwarz Master Class.

Markov Inequality

Let \(f\) be positive and non-decreasing \[ \begin{aligned} P ( Z > t ) &\leq P ( f(Z) \geq f(t) ) \\ & = E \left ( \mathbb{I} ( f( Z) \geq f(t ) ) \right ) \\ & \leq E \left ( \mathbb{I} ( f( Z) \geq f(t ) ) \frac{f(Z)}{f(t) } \right ) \\ & \leq E\left ( \frac{f(Z)}{f(t) } \right ) \end{aligned} \]

Concentration Inequalities

Law of Large Numbers: for \(Z = \sum_{i=1}^n X_i\) with i.i.d. \(X_i\), \[ \lim_{ n \rightarrow \infty } \mathbb{P} \left ( | Z - E(Z) | > n \epsilon \right ) = 0 \; \; \forall \epsilon > 0 \]

Central Limit Theorem (CLT): for \(Z\) as above with \(Var(X_i) = \sigma^2\), \[ \lim_{ n \rightarrow \infty } \mathbb{P} \left ( \frac{ Z - E(Z) }{ \sigma \sqrt{n} } \leq x \right ) = \Phi ( x ) \]

Posterior Concentration

Hoeffding and Bernstein

Let \(Z= \sum_{i=1}^n X_i\).

Hoeffding (for independent \(X_i \in [-1,1]\)) \[ P ( Z > E(Z) + t ) \leq \exp \left ( - \frac{ t^2}{2n} \right ) \]

Bernstein (for \(|X_i - E(X_i)| \leq 1\)) \[ P ( Z > E(Z) + t ) \leq \exp \left ( - \frac{ t^2}{ 2 ( Var(Z) + t/3 ) } \right ) \] These are finite-sample counterparts of large deviations theory (Varadhan).
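A small simulation sketch (my own, assuming independent Rademacher \(X_i \in \{-1,1\}\), for which the stated Hoeffding bound applies) comparing the empirical tail with the bound:

# empirical tail of Z = sum of n Rademacher variables vs Hoeffding bound
set.seed(1)
n <- 100; reps <- 10000
Z <- colSums(matrix(sample(c(-1, 1), n * reps, replace = TRUE), n, reps))
t <- 20
mean(Z > t)          # empirical P(Z > E(Z) + t), here E(Z) = 0
exp(-t^2 / (2 * n))  # Hoeffding bound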

Special Distributions

See Common Distributions

  1. Bernoulli and Binomial

  2. Hypergeometric

  3. Poisson

  4. Negative Binomial

  5. Normal Distribution

  6. Gamma Distribution

  7. Beta Distribution

  8. Multinomial Distribution

  9. Bivariate Normal Distribution

  10. Wishart Distribution

\(\ldots\)

Example: Markov Dependence

  • We can always factor a joint distribution as

\[ p( X_n , X_{n-1} , \ldots , X_1 ) = p( X_n | X_{n-1} , \ldots , X_1 ) \ldots p( X_2 | X_1 ) p( X_1 ) \]

  • A process has the Markov Property if

\[ p( X_n | X_{n-1} , \ldots , X_1 ) = p( X_n | X_{n-1} ) \]

  • Only the most recent value matters when determining the transition probabilities.

A real world probability model: Hidden Markov Models

Are stock returns a random walk?

Hidden Markov Models (Baum-Welch, Viterbi)

  • Daily returns on the SP500 stock market index.

Build a hidden Markov model to predict the ups and downs.

  • Suppose that stock market returns on the next four days are \(X_1 , \ldots , X_4\).

  • Let’s empirically determine the conditionals and marginals

SP500 Data

Marginal and Bivariate Distributions

  • Empirically, what do we get? Daily returns from \(1948-2007\).

| \(x\) | Down | Up |
|---|---|---|
| \(P(X_i = x)\) | 0.474 | 0.526 |
  • Finding \(p( X_2 | X_1 )\) takes twice as much computational effort: counting \(UU,UD,DU,DD\) transitions (a counting sketch follows the table).

| \(X_i\) | Down | Up |
|---|---|---|
| \(X_{i-1} =\) Down | 0.519 | 0.481 |
| \(X_{i-1} =\) Up | 0.433 | 0.567 |
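A sketch of how such transition frequencies can be counted. Since the SP500 series itself is not attached here, I simulate a stand-in up/down sequence; with the real data, replace `updown` by the signs of the daily returns:

# count D->D, D->U, U->D, U->U transitions in an up/down sequence
set.seed(1)
updown <- sample(c("D", "U"), 1000, replace=TRUE, prob=c(0.474, 0.526))
trans <- table(head(updown, -1), tail(updown, -1))  # rows: X_{i-1}, cols: X_i
prop.table(trans, margin=1)                         # estimates p(X_i | X_{i-1})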

Conditioned on two days

  • Let’s compute \(p( X_3 | X_2 , X_1 )\)

| \(X_{i-2}\) | \(X_{i-1}\) | Down | Up |
|---|---|---|---|
| Down | Down | 0.501 | 0.499 |
| Down | Up | 0.412 | 0.588 |
| Up | Down | 0.539 | 0.461 |
| Up | Up | 0.449 | 0.551 |

  • We could also compute the distribution \(p( X_2 , X_3 | X_1 )\). This is a joint, marginal and conditional distribution all at the same time.

Joint because it involves more than one variable \(( X_2 , X_3 )\), marginal because it ignores \(X_4\), and conditional because it is given \(X_1\).

Joint Probabilities

  • Under Markov dependence \[ \begin{aligned} P( UUD ) & = p( X_1 = U) p( X_2 = U | X_1 = U) p( X_3 = D | X_2 = U ) \\ & = ( 0.526 ) ( 0.567 ) ( 0.433) = 0.129 \end{aligned} \]

  • Under independence we would have got \[ \begin{aligned} P(UUD) & = P( X_1 = U) p( X_2 = U) p( X_3 = D ) \\ & = (.526)(.526)(.474) \\ & = 0.131 \end{aligned} \]

Markov Chain Monte Carlo

  • Think of the card-shuffling problem
  • There are 52! possible permutations of this deck.
  • Naive approach: generate a vector with all possible permutations and draw an element of this vector with probability 1/52!.

\(52! \approx 10^{68}\)

slightly less than the number of particles in the observable universe (\(10^{80}\)).

Bayes rule

\[ p(\theta \mid X,Y) = \dfrac{p(Y \mid \theta,X)p(\theta)}{\int p(Y \mid \theta,X)p(\theta)d\theta}. \]

  • MCMC algorithms generate a Markov chain \[ \left\{ \theta ^{\left( g\right)}\right\} _{g=1}^{G} \] whose stationary distribution is \(p\left( \theta \mid X,Y\right)\). Thus, the key to Bayesian inference is simulation rather than optimization.

\[ \begin{split} \widehat{E}\left( f\left( \theta\right) \mid X,Y\right)&=G^{-1}\sum_{g=1}^{G}f\left( \theta ^{\left( g\right) }\right)\\ & \approx \int f\left( \theta\right) p\left( \theta \mid X,Y\right)d\theta=E\left( f\left( \theta\right) \mid X,Y\right). \end{split} \]

MCMC for Discrete Random Variables

\[ P(X_{k+1} = i \mid X_k = j) = p_{ij}. \]

  • Transition matrix \(P = \{p_{ij}\}_{i,j=1}^n\) is column stochastic, i.e. \(p_{ij} \ge 0\) and \(\sum_i p_{ij} = 1\).
  • Let \(\pi_k \in R^n\) be the distribution of the random variable \(X_k\) at time \(k\), \(\pi_{ki} = P(X_k=i)\)
  • \(\pi_{k+1} = P\pi_k\).
  • The limiting distribution satisfies \(\pi = P\pi\), i.e. \(\pi\) is an eigenvector of \(P\) with eigenvalue 1.
  • Under the conditions below (Perron-Frobenius), the limiting distribution exists, is unique, and is the first eigenvector of \(P\). The limiting distribution is also called a stationary distribution.

MCMC for Discrete Random Variables

  • Power method: iterate \(\pi^{k} = P^k\pi^0 \rightarrow \pi\) (this is how Google’s PageRank is computed).
  • Perron-Frobenius theorem: if \(p_{ij}>0\) then \(\pi\) always exists and it is unique and \[ \Vert \pi^k - \pi\Vert \le c |\lambda_2|^k, \]
  • \(\lambda_2\) is the second largest eigenvalue of \(P\) in absolute value (the first eigenvalue is always 1)
  • \(\pi_i\) can be interpreted as the long-run fraction of time a random walk on the graph spends at vertex \(i\)

The mixing time of the Markov chain is given by \[ T=\dfrac{1}{\log(1/\lambda_2)} \] Roughly, it is the number of steps over which the deviation from the equilibrium distribution decreases by a factor of \(e\).

Three-State Example

Code
graph LR
    1 --0.5--> 1
    1 --0.25--> 2
    1 --0.25--> 3
    2 --0.2--> 1
    2 --0.1--> 2
    2 --0.7--> 3
    3 --0.25--> 2
    3 --0.25--> 1
    3 --0.5--> 3


 P <- rbind(c(0.50, 0.2, 0.25),
            c(0.25, 0.1, 0.25),
            c(0.25, 0.7, 0.50))

Check that it is column-stochastic

Code
 colSums(P)
[1] 1 1 1

Power Iterations

# Power iterations
iterate.P <- function(x, P, n) {
    res <- matrix(NA, n+1, length(x))
    res[1,] <- x
    for (i in 1:n)
        res[i+1,] <- x <- P %*% x
    res
}

Let’s start with the initial vector \(x = (1, 0, 0)\) (all mass on state 1) and iterate \(x \leftarrow P x\) for \(n=10\) steps.

n  = 10
y = iterate.P(c(1, 0, 0), P, n)

Power Iterations

We can compare the results to the eigenvector calculated using the built-in (QR decomposition-based) method.

Code
v <- eigen(P, FALSE)$vectors[,1]
v <- v/sum(v) # normalize eigenvector
knitr::kable(cbind(v, y[n+1,]), col.names=c("Eigenvector", "Power iterations"), digits=3)
| Eigenvector | Power iterations |
|---|---|
| 0.32 | 0.32 |
| 0.22 | 0.22 |
| 0.46 | 0.46 |

Power iterations converge to the eigenvector. Each color corresponds to a component of the eigenvector (the stationary distribution vector).

Power Iterations

Code
matplot(0:n, y, type="l", lty=1, xlab="Step", ylab="y", las=1)
abline(h=v, lty=2, col=1:3)
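The rate of convergence seen in the plot is governed by \(|\lambda_2|\). A quick sketch computing it and the mixing time for our \(P\):

# second eigenvalue controls the convergence (mixing) rate
lam <- eigen(P)$values
lam2 <- sort(abs(lam), decreasing=TRUE)[2]  # |lambda_2|
1 / log(1 / lam2)                           # mixing time T = 1/log(1/lambda_2)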

Power Iterations

We can also find the stationary distribution by doing a simple random walk on the graph.

run <- function(i, P, n) {
  res <- integer(n)
  for (t in seq_len(n))
    res[[t]] <- i <- sample(ncol(P), 1, prob = P[, i])
  res
}
samples <- run(1, P, 300)

Power Iterations

Now, we plot the fraction of time that we were in each state over time:

Code
cummean <- function(x) cumsum(x) / seq_along(x)
plot(cummean(samples == 1), type="l", ylim=c(0, 1), xlab="Step", ylab="y", las=1)
lines(cummean(samples == 2), col=2)
lines(cummean(samples == 3), col=3)
abline(h=v, lty=2, col=1:3)

English Alphabet

Code
library(jsonlite)
# http://norvig.com/mayzner.html
bg = read_json("data/bigrams.json", simplifyVector=T)
nbg = nrow(bg)
bgm = matrix(0,nrow=26,ncol=26)
rownames(bgm) = letters; colnames(bgm) = letters
for (i in 1:nbg) { # from j to i 
  idx = match(unlist(strsplit(bg[i,1], split="")), letters)
  bgm[idx[2],idx[1]] = as.numeric(bg[i,2])
}
# View(bgm)
bgm = bgm %*% diag(1/colSums(bgm))

image(bgm, axes=FALSE)
axis(3, at=seq(0,1, length=26), labels=letters)
axis(2, at=seq(1,0, length=26), labels=letters)
Code
ev = eigen(bgm)
v = ev$vectors[,1]
v <- as.numeric(v/sum(v))

English Alphabet

Code
barplot(v, names.arg = letters, ylim=c(0,0.13))
lf = read.csv("data/letterfreq.txt")
lf$freq = lf$freq/sum(lf$freq)
lf = lf[order(lf$letter),]
barplot(lf$freq, names.arg = letters, ylim=c(0,0.13))

barplot(bgm[,5], names.arg = letters)

From Bigrams

From Google

Transitions from letter e

Individual Letter Frequencies

How to Construct Transition Probabilities?

  • The reverse problem, is how to construct the transition probabilities so that the limiting distribution is \(\pi\).
  • For example, it is easy to see that any stochastic matrix which is symmetric has a uniform limiting distribution, i.e. \(\pi = e\), \(e_i = 1/n\), where \(n\) is the number of vertices.
  • Indeed, if \(P\) is symmetric then \(P = P^T\), and \(P^T\) is row-stochastic (each of its rows sums to 1), so \(P e = P^T e = e\).

In the general case we use the detailed balance condition: \[ \pi_i p_{ij} = \pi_j p_{ji}, ~ i,j=1,\ldots,n, \] (on this slide \(p_{ij}\) denotes the probability of moving from \(i\) to \(j\)), from which it follows that \[ \sum_i p_{ij}\pi_i = \pi_j \sum_i p_{ji} = \pi_j,~j=1,\ldots,n. \]

How to Construct Transition Probabilities?

Thus we need to find a Markov chain in which the transition probabilities satisfy the detailed balance condition (symmetry). Another way to write the condition is \[ \dfrac{p_{ij}}{p_{ji}} = \dfrac{\pi_j}{\pi_i}. \] Say we have some proposal (“trial”) transition probabilities \(p_{ij}^0\), and we assume those are symmetric. We want to find new probabilities \(p_{ij} = p_{ij}^0b_{ij}, ~ i\ne j\) and \(p_{ii} = 1 - \sum_{j:~j\ne i} p_{ij}\). We need to choose \(b_{ij}\) in such a way that \[ \dfrac{p_{ij}}{p_{ji}} = \dfrac{p_{ij}^0b_{ij}}{p_{ji}^0b_{ji}} = \dfrac{b_{ij}}{b_{ji}} = \dfrac{\pi_j}{\pi_i} \] We can choose \[ b_{ij} = F\left(\dfrac{\pi_j}{\pi_i}\right) \]

For the detailed balance condition to be satisfied, we need to choose \(F: R_+ \rightarrow [0,1]\) such that \[ \dfrac{F(z)}{F(1/z)} = z. \]

How to Construct Transition Probabilities?

An example of such a function is \(F(z) = \min(z,1)\). This leads to the Metropolis algorithm

  1. When at state \(i\), draw from \(p^{0}_{ij}\) to generate a candidate next state \(j\)
  2. Calculate \(a_{ij} = \min(1,\pi_j/\pi_i)\) (acceptance probability)
  3. Move to \(j\) with probability \(a_{ij}\), stay at \(i\) otherwise

When the move happens, we say the step is accepted; otherwise it is rejected. In the more general case (Metropolis-Hastings), when \(p^0\) is not symmetric, we calculate \[ a_{ij} = \min\left(1,\dfrac{\pi_j p^0_{ji}}{\pi_i p^0_{ij}}\right). \]
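A minimal R sketch of this Metropolis scheme for a discrete target (here \(\pi = (0.32, 0.22, 0.46)\), the stationary distribution from the three-state example), using a symmetric uniform proposal; variable names are mine:

# discrete Metropolis: target pi on {1,2,3}, symmetric uniform proposal
set.seed(1)
pi_target <- c(0.32, 0.22, 0.46)
n <- 50000
state <- 1
chain <- integer(n)
for (g in 1:n) {
  j <- sample(3, 1)  # proposal p0_ij = 1/3 (symmetric)
  if (runif(1) < min(1, pi_target[j] / pi_target[state])) state <- j
  chain[g] <- state
}
table(chain) / n  # should be close to pi_target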

Continuous Case

In the continuous case we use states \(S\) and \(T\) instead of indices \(i\) and \(j\). The detailed balance condition becomes \[ \pi(S)p(T \mid S) = \pi(T)p(S \mid T). \] Then \[ \int p(T \mid S)\pi(S)dS = \int p(S \mid T)\pi(T)dS = \pi(T)\int p(S\mid T)dS = \pi(T). \]

The following conditions \[ \forall S, \forall T:\pi(T)\ne 0:p(T \mid S) >0 \] are sufficient to guarantee uniqueness of \(\pi(T)\).

Metropolis-Hastings Algorithm

Specifically, the MH algorithm repeats the following two steps \(G\) times: given \(\theta ^{\left( g\right) }\) \[ \begin{aligned} &\text{Step 1. Draw }\theta^{\prime} \text{ from a proposal distribution } q(\theta^{\prime}|\theta ^{(g)}) \\ &\text{Step 2. Accept } \theta^{\prime} \text{ with probability } \alpha \left(\theta ^{(g)},\theta^{\prime}\right) \text{,} \end{aligned} \] where \[ \alpha \left( \theta ^{(g)},\theta^{\prime}\right) =\min \left( \frac{\pi(\theta^{\prime})}{\pi (\theta ^{(g)})}\frac{q(\theta^{(g)}|\theta^{\prime})}{q(\theta^{\prime}|\theta ^{(g)})},1\right) \text{.} \]

Metropolis-Hastings Algorithm

Implementation of the accept-reject step:

  • Draw a uniform random variable \(U\sim U \left[ 0,1\right]\), and set \(\theta ^{\left( g+1\right) }=\theta^{\prime}\) if \(U<\alpha \left( \theta ^{(g)},\theta^{\prime}\right)\); otherwise leave \(\theta ^{\left( g\right) }\) unchanged (\(\theta^{\left( g+1\right) }=\theta^{(g)}\)).
  • It is important to note that the denominator in the acceptance probability cannot be zero, provided the algorithm is started from a \(\pi\)-positive point, since \(q\) is always positive. The MH algorithm only requires that \(\pi\) can be evaluated up to proportionality.
  • The output of the algorithm, \(\left\{ \theta ^{\left( g\right) }\right\}_{g=1}^{\infty }\), is clearly a Markov chain. The key theoretical property is that, under mild regularity, the Markov chain has \(\pi \left( \theta\right)\) as its limiting distribution. We discuss two important special cases that depend on the choice of \(q\).

Independence MH

One special case draws a candidate independently of the previous state, \(q(\theta^{\prime}|\theta ^{(g)})=q(\theta^{\prime})\). In this independence MH algorithm, the acceptance criterion simplifies to \[ \alpha \left( \theta ^{(g)},\theta^{\prime}\right) =\min \left( \frac{\pi (\theta^{\prime})}{\pi (\theta ^{(g)})}\frac{q(\theta ^{(g)})}{q(\theta ^{\prime})},1\right). \] Even though \(\theta^{\prime}\) is drawn independently of the previous state, the generated sequence is not independent, since \(\alpha\) depends on previous draws. The criterion implies that a new draw is always accepted if the target density ratio \(\pi (\theta^{\prime})/\pi (\theta ^{(g)})\) increases more than the proposal ratio \(q(\theta^{\prime})/q(\theta ^{(g)})\). When this is not satisfied, a coin with acceptance probability \(\alpha\) is flipped to decide whether or not to accept the proposal.

Random-walk Metropolis

Random-walk (RW) Metropolis is the polar opposite of the independence MH algorithm. It draws a candidate from the following RW model, \[ \theta^{\prime}=\theta ^{\left( g\right) }+\sigma \varepsilon _{g+1}, \] where \(\varepsilon _{g+1}\) is an independent, mean zero, and symmetric error term, typically taken to be a normal or \(t\)-distribution, and \(\sigma\) is a scaling factor that must be tuned. Symmetry implies that \[ q\left( \theta^{\prime}|\theta ^{\left( g\right) }\right) =q\left( \theta ^{\left( g\right) }|\theta^{\prime}\right), \] and \[ \alpha \left( \theta ^{(g)},\theta^{\prime}\right) =\min \left( \pi (\theta^{\prime})/\pi (\theta ^{(g)}),1\right). \]

Implementation

set.seed(7)
mh = function(target, proposal, n, x0) {
  x = x0; p = length(x0)
  samples = matrix(NA, nrow=n, ncol=p)
  accept = rep(0, n)
  for (i in 1:n) {
    x_new = proposal(x)
    a = min(1,target(x_new) / target(x))
    # print(c(x, x_new, a))
    if (runif(1) < a) {accept[i]=1; x = x_new}
    samples[i,] = x
    # print(accept[i])
  }
  list(samples = samples, accept = accept)
}

Apply

We apply MCMC to a mixture of two normal distributions. This distribution is fairly straightforward to sample from directly, but let’s draw samples with MCMC instead.

Code
p <- 0.4
mu <- c(-1, 2)
sd <- c(.5, 2)
target <- function(x)
    p     * dnorm(x, mu[1], sd[1]) +
    (1-p) * dnorm(x, mu[2], sd[2])
curve(target(x), col="red", -4, 8, n=301, las=1)

Apply

Code
set.seed(7)
proposal = function(x) x + rnorm(1, 0, 0.25)
r = mh(target, proposal, 2000, x0=0)
# use all draws (repeats included); keeping only accepted draws would bias the histogram
hist(r$samples, freq=FALSE, breaks=35, col="lightblue", main="Metropolis-Hastings", xlab="x")
curve(target(x), add=TRUE, col="red", lwd=2)

Bivariate

Code
set.seed(92)
library(ggplot2)
# Metropolis-Hastings for Bivariate Normal distribution
# Target distribution
mu = c(0, 0)
sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)
target = function(x) exp(-0.5*t(x-mu) %*% solve(sigma) %*% (x-mu))
library(mixtools)
ellipse(mu, sigma, npoints = 1000, newplot = TRUE)
proposal = function(x) x + rnorm(2, 0, .5)
n = 5000 # Number of samples
x = c(0, 0)
samples = matrix(0, n, 2)
for (i in 1:n) {
  x_new = proposal(x)
  a = target(x_new) / target(x)
  accepted = (runif(1) < a)
  # draw the first 50 proposed moves: green = accepted, red = rejected
  if (i < 50) {
    arrows(x[1], x[2], x_new[1], x_new[2], col = ifelse(accepted, 'green', 'red'))
  }
  if (accepted) x = x_new

  samples[i,] = x
}

Paths of random samples generated by the Metropolis algorithm. Red arrows are rejected steps and green arrows are accepted ones.

Bivariate

Code
ggplot(mapping = aes(x=samples[,1], y=samples[,2])) + geom_point()+ geom_density_2d(size=1)

Gibbs Sampler: Clifford-Hammersley

  • Clifford-Hammersley (CH) theorem: a high-dimensional joint distribution, \(p\left( \theta\mid X,Y\right)\), is completely characterized by a set of lower-dimensional conditional distributions. Given this characterization, MCMC methods iteratively sample from these conditional distributions using standard sampling methods.
  • Example: bivariate posterior distribution \(p\left( \theta_1,\theta_{2} \mid X,Y \right)\), e.g. \(\theta_1\) as traditional static parameters and \(\theta_2\) as latent variables.

CH: \(p\left( \theta_1,\theta_2\right)\) is determined by \(p\left( \theta_{1} \mid \theta_{2}\right)\) and \(p\left( \theta_{2} \mid \theta_{1}\right)\), provided that \(p\left( \theta_{1},\theta_2\right)\), \(p\left( \theta_{1}\right)\) and \(p\left( \theta_{2}\right)\) have positive mass at all points.

Besag formula

For any pairs \((\theta_{1},\theta_{2})\) and \((\theta_1^{\prime},\theta_2^{\prime})\), \[ \frac{p(\theta_1,\theta_2)}{p(\theta_1^{\prime},\theta_2^{\prime})}=\frac{p(\theta_1 \mid \theta_2^{\prime})p(\theta_2 \mid \theta_1)}{p\left( \theta_1^{\prime} \mid \theta_2^{\prime}\right) p\left( \theta_2^{\prime} \mid \theta_1\right) }. \] The proof uses the fact that \(p\left( \theta_1,\theta_2\right)=p\left( \theta_2 \mid \theta_1\right) p\left( \theta_1\right)\) (applied to both \((\theta_1,\theta_2)\) and \((\theta_1^{\prime},\theta_2^{\prime})\)) and the Bayes rule: \[ p\left( \theta_1\right) =\frac{p\left( \theta_1 \mid \theta_2^{\prime }\right) p\left( \theta_2^{\prime}\right) }{p\left( \theta_2^{\prime} \mid \theta_1\right) }\text{.} \]
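A quick numeric sanity check of this identity in R, on a made-up \(2\times 2\) joint with positive mass (any such joint works):

# verify Besag's identity for a 2x2 joint p(theta1, theta2)
joint <- matrix(c(0.1, 0.2, 0.3, 0.4), 2, 2)   # rows: theta1, cols: theta2
p1g2 <- joint / rep(colSums(joint), each=2)    # p(theta1 | theta2)
p2g1 <- joint / rowSums(joint)                 # p(theta2 | theta1)
i <- 1; j <- 2; ip <- 2; jp <- 1               # (theta1,theta2) and (theta1',theta2')
lhs <- joint[i, j] / joint[ip, jp]
rhs <- p1g2[i, jp] * p2g1[i, j] / (p1g2[ip, jp] * p2g1[i, jp])
c(lhs, rhs)  # the two numbers should match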

Clifford-Hammersley: The general version

  • Partitioning a vector as \(\theta =\left( \theta_{1},\theta_{2},\theta_{3},\ldots ,\theta_{K}\right)\), the general CH theorem states that the conditionals \[ p\left( \theta _{i} \mid \theta _{-i}\right) := p\left( \theta _{i} \mid \theta_1,\ldots,\theta_{i-1},\theta _{i+1},...,\theta _{K}\right) , \] for \(i=1,...,K\), completely characterize \(p\left( \theta_1,\ldots,\theta_{K}\right)\).

Latent Variable Models

  • Fixed parameters \(\theta\) and latent variables \(x\). In this case CH implies that \[ p\left( \theta,x \mid y\right) \] is completely characterized by \(p\left( \theta \mid x,y\right)\) and \(p\left( x \mid \theta ,y\right)\).
  • The distribution \(p\left( \theta \mid x,y\right)\) is the posterior distribution of the parameters, conditional on the observed data and the latent variables. Similarly, \(p\left( x \mid \theta,y\right)\) is the smoothing distribution of the latent variables.

Gibbs Sampler

  • The Gibbs sampler simulates multi-dimensional posterior distributions by iteratively sampling from the lower-dimensional conditional posteriors.
  • Unlike the previous MH algorithms, the Gibbs sampler updates the chain one component at a time, instead of updating the entire vector.
  • This requires either that the conditional posterior distributions are discrete, or recognizable distributions (e.g. Normal) for which standard sampling algorithms apply, or that resampling methods, such as accept-reject, can be used.

Gibbs Sampler

In the case of \(p\left( \theta_1,\theta_2\right)\), given current draws,\(\left( \theta_1^{\left( g\right) },\theta_2^{\left( g\right)}\right)\), the Gibbs sampler consists of \[ \begin{aligned} \text{1. Draw }\theta_1^{\left( g+1\right) } &\sim &p\left( \theta _1|\theta_2^{\left( g\right) }\right) \\ \text{2. Draw }\theta_2^{\left( g+1\right) } &\sim &p\left( \theta_2|\theta_1^{\left( g+1\right) }\right) , \end{aligned} \] repeating \(G\) times.
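A minimal R sketch of this two-step scheme for a standard bivariate normal target with correlation \(\rho\), where both conditionals are normal, \(\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2, 1-\rho^2)\) (variable names are mine):

# two-step Gibbs sampler for a standard bivariate normal with correlation rho
set.seed(1)
rho <- 0.8; G <- 10000
theta <- matrix(0, G, 2)
t1 <- t2 <- 0
for (g in 1:G) {
  t1 <- rnorm(1, rho * t2, sqrt(1 - rho^2))  # draw theta1 | theta2
  t2 <- rnorm(1, rho * t1, sqrt(1 - rho^2))  # draw theta2 | theta1
  theta[g, ] <- c(t1, t2)
}
cor(theta)  # off-diagonal should be close to rho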

Gibbs Sampler

  • Transition kernel from state \(\theta\) to state \(\theta ^{\prime}\) is \[ p\left( \theta ,\theta^{\prime}\right) =p\left( \theta_{1}^{\prime}|\theta_2\right) p\left( \theta_2^{\prime}|\theta_{1}^{\prime}\right) \]
  • The limiting (stationary) distribution \(\pi\) satisfies \[ \pi \left( \theta^{\prime}\right) =\int p\left( \theta ,\theta^{\prime}\right) \pi\left( \theta\right) d\theta. \]

Gibbs Sampler

It is easy to verify that the stationary distribution of the Markov chain generated by the Gibbs sampler is the posterior distribution, \(\pi \left(\theta \right) =p\left( \theta_1,\theta_2\right)\): \[ \begin{aligned} \int p\left( \theta ,\theta^{\prime}\right) \pi\left( \theta\right) d\theta &= p\left( \theta_2^{\prime}|\theta_1^{\prime }\right) \int_{\theta_2}\int_{\theta_1}p\left( \theta_1^{\prime}|\theta_2\right) p\left( \theta_1,\theta_2\right) d\theta_{1}d\theta_2 \\ &= p\left( \theta_2^{\prime}|\theta_1^{\prime}\right) \int_{\theta_2}p\left( \theta_1^{\prime}|\theta_2\right) p\left( \theta_{2}\right) d\theta_2 \\&= p\left( \theta_2^{\prime}|\theta_1^{\prime}\right) p\left( \theta_{1}^{\prime}\right) =p\left( \theta_1^{\prime},\theta_2^{\prime}\right) =\pi \left( \theta^{\prime}\right) \text{.} \end{aligned} \]

Gibbs Sampler

The ergodic theorem holds: for a sufficiently integrable function \(f\) and for all starting points \(\theta ,\) \[ \underset{G\rightarrow \infty }{\lim }\frac{1}{G}\sum_{g=1}^{G}f\left(\theta ^{\left( g\right) }\right) =\int f\left( \theta \right) \pi \left(\theta \right) d\theta =E\left[ f\left( \theta \right) \right] \]

  • Run for an initial length, often called the burn-in
  • Then a secondary sample of size \(G\) is created for Monte Carlo inference

Gibbs Sampler for Normal-Gamma

Code
# summary statistics of sample
n <- 30
ybar <- 15
s2 <- 3

# sample from the joint posterior (mu, tau | data)
mu <- tau <- rep(NA, 11000)
T <- 1000    # burnin
tau[1] <- 1  # initialisation
for(i in 2:11000)
{   
    mu[i] <- rnorm(n = 1, mean = ybar, sd = sqrt(1 / (n * tau[i - 1])))    
    tau[i] <- rgamma(n = 1, shape = n / 2, scale = 2 / ((n - 1) * s2 + n * (mu[i] - ybar)^2))
}
mu <- mu[-(1:T)]   # remove burnin
tau <- tau[-(1:T)] # remove burnin
hist(mu)

Code
hist(tau)

Hybrid chains

  • Given a partition of the vector \(\theta\) via CH, a hybrid MCMC algorithm updates the chain one subset at a time, either by direct draws (Gibbs steps) or via MH.
  • Thus, a hybrid algorithm combines the features of the MH algorithm and the Gibbs sampler, providing significant flexibility in designing MCMC algorithms for different models.

  • \(p\left( \theta_2|\theta_1\right)\) is recognizable and can be directly sampled.

  • \(p\left( \theta_1|\theta_2\right)\) can only be evaluated and not directly sampled.

  • MH accept/reject based on \[ \alpha \left( \theta_1^{(g)},\theta_1^{\prime}\right) =\min \left( \frac{p\left( \theta_1^{\prime}|\theta_2^{\left( g\right) }\right) }{p\left( \theta_1^{\left ( g \right ) }|\theta_2^{\left( g\right) }\right) }\frac{q\left( \theta_1^{\left( g\right) }|\theta_1^{\prime},\theta_2^{\left( g\right) }\right) }{q\left( \theta_1^{\prime}|\theta_1^{\left( g\right) },\theta _2^{\left( g\right) }\right) },1\right) . \]

Hybrid chains

The general hybrid algorithm is as follows. Given \(\theta_1^{\left(g\right) }\) and \(\theta_2^{\left( g\right) }\), for \(g=1,\ldots, G\), \[ \begin{aligned} 1.\text{ Draw }\theta_1^{\left( g+1\right) } &\sim &MH\left[ q\left( \theta_1|\theta_1^{\left( g\right) },\theta_2^{\left( g\right) }\right) \right] \\ 2.\text{ Draw }\theta_2^{\left( g+1\right) } &\sim &p\left( \theta_2|\theta_1^{\left( g+1\right) }\right) . \end{aligned} \] In higher dimensional cases, a hybrid algorithm consists of any combination of Gibbs and Metropolis steps. Hybrid algorithms significantly increase the applicability of MCMC methods, as the only requirement is that the model generates posterior conditionals that can either be sampled or evaluated.

Hamiltonian Monte-Carlo

  • Adds a momentum variable to Metropolis sampling.
  • Uses the gradient of the un-normalized posterior to better navigate the surface defined by the posterior.
  • A momentum variable \(r\) is paired with each model variable \(\theta\). \[ p(\theta,r) \propto \exp\left(L(\theta) - 0.5 r^Tr\right), \]
  • \(L\) is the logarithm of the joint density of the variables of interest (the negative potential energy)
  • \(0.5 r^Tr\) is the kinetic energy of the particle
  • \(\log p(\theta,r)\) is the total negative energy

Hamiltonian Monte-Carlo

We can simulate the evolution over time of the Hamiltonian dynamics of this system via the “leapfrog” integrator, which takes a half step of the momentum, a full step of the position, and another half step of the momentum: \[ \begin{aligned} r^{+/2} & = r + (\epsilon/2) \nabla_{\theta}L(\theta) \\ \theta^{+} & = \theta + \epsilon\, r^{+/2}\\ r^{+} & = r^{+/2} + (\epsilon/2)\nabla_{\theta}L(\theta^{+}). \end{aligned} \]
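A tiny R sketch (my own) of one leapfrog trajectory for a standard normal target, \(L(\theta) = -\theta^2/2\), checking that the total energy is approximately conserved:

# leapfrog integration for L(theta) = -theta^2/2 (standard normal target)
grad_L <- function(theta) -theta
eps <- 0.1; theta <- 1; r <- 0.5
H0 <- 0.5 * r^2 + theta^2 / 2  # initial total energy (kinetic minus L)
for (s in 1:30) {
  r <- r + (eps / 2) * grad_L(theta)  # half step for momentum
  theta <- theta + eps * r            # full step for position
  r <- r + (eps / 2) * grad_L(theta)  # half step for momentum
}
H1 <- 0.5 * r^2 + theta^2 / 2
c(H0, H1)  # should be nearly equal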

Hierarchical Bayesian Model

We consider the following model: \[\begin{align*} \tau &\sim \mathrm{Gamma}(0.5, 0.5) \\ \lambda_d &\sim \mathrm{Gamma}(0.5, 0.5) \\ \beta_d &\sim \mathcal{N}(0, 20) \\ y_n &\sim \mathrm{Bernoulli}(\sigma((\tau \lambda \odot \beta)^T x_n)), \end{align*}\]

  • \(\tau\) is a scalar global coefficient scale
  • \(\lambda\) is a vector of local scales
  • \(\beta\) is the vector of unscaled coefficients.
Code
import tensorflow_probability.substrates.jax as tfp
import numpy as np
import jax.numpy as jnp
tfd = tfp.distributions
import jax
import matplotlib.pyplot as plt

Apply to Iris Data

Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np  
iris = pd.read_csv("data/iris.csv")
print(iris.shape)
(150, 5)
Code
print(iris.head())
   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
Code
y = iris['variety']=="Setosa"
x = iris["sepal.length"].values
plt.scatter(x,y)
x = np.c_[np.ones(150),x]
print(x.shape)
(150, 2)

Log Density Function

For Hamiltonian MC, we only need to evaluate the joint log-density pointwise

Code
def joint_log_prob(x, y, tau, lamb, beta):
    lp = tfd.Gamma(0.5, 0.5).log_prob(tau)
    lp += tfd.Gamma(0.5, 0.5).log_prob(lamb).sum() 
    lp += tfd.Normal(0., 20).log_prob(beta).sum() 
    logits = x @ (tau * lamb * beta)
    lp += tfd.Bernoulli(logits).log_prob(y).sum() 
    return lp   

tau = np.random.rand(1)[0]
lamb = np.random.rand(2)
beta = np.random.rand(2)
joint_log_prob(x, y, tau, lamb, beta)
Array(-134.03787, dtype=float32)

Change of Variables

\[ z\triangleq T^{-1}(\theta),\qquad \pi(z) = \pi(\theta) \left| \frac{\partial T}{\partial z} (z) \right|, \]

Taking the logarithm of both sides, we get \[ \log \pi(z) = \log \pi(\theta) + \log \left| \frac{\partial T }{\partial z}(z) \right| \] Use \(T(z)=e^z\), and \(\log|\frac{\partial T}{\partial z}(z)| = z\)

Change of Variables in Code

def unconstrained_joint_log_prob(x, y, theta):
    ndims = x.shape[-1]
    unc_tau, unc_lamb, beta = jnp.split(theta, [1, 1 + ndims])
    unc_tau = unc_tau.reshape([]) 
    # Make unc_tau a scalar 
    tau = jnp.exp(unc_tau)
    ldj = unc_tau
    lamb = jnp.exp(unc_lamb)
    ldj += unc_lamb.sum()
    return joint_log_prob(x, y, tau, lamb, beta) + ldj

Let’s check our function

from functools import partial
target_log_prob = partial(unconstrained_joint_log_prob, x, y)
theta = np.r_[tau,lamb,beta]
target_log_prob(theta)
Array(-904.6409, dtype=float32)

Automatic Differentiation

Code
target_log_prob_and_grad = jax.value_and_grad(target_log_prob)
tlp, tlp_grad = target_log_prob_and_grad(theta)
print(tlp)
-904.6409
Code
tlp_grad
Array([ -892.1147 ,  -100.52104,  -792.00354,  -233.65552, -1235.4655 ],      dtype=float32)

Hamiltonian Monte Carlo

Code
def leapfrog_step(target_log_prob_and_grad, step_size, i, leapfrog_state):
    z, m, tlp, tlp_grad = leapfrog_state
    m += 0.5 * step_size * tlp_grad
    z += step_size * m
    tlp, tlp_grad = target_log_prob_and_grad(z)
    m += 0.5 * step_size * tlp_grad
    return z, m, tlp, tlp_grad

def hmc_step(target_log_prob_and_grad, num_leapfrog_steps, step_size, z, seed):
    m_seed, mh_seed = jax.random.split(seed)
    tlp, tlp_grad = target_log_prob_and_grad(z)
    m = jax.random.normal(m_seed, z.shape)
    energy = 0.5 * jnp.square(m).sum() - tlp
    new_z, new_m, new_tlp, _ = jax.lax.fori_loop(
        0,
        num_leapfrog_steps,
        partial(leapfrog_step, target_log_prob_and_grad, step_size),
        (z, m, tlp, tlp_grad)) 
    new_energy = 0.5 * jnp.square(new_m).sum() - new_tlp
    log_accept_ratio = energy - new_energy
    is_accepted = jnp.log(jax.random.uniform(mh_seed, [])) < log_accept_ratio
    # select the proposed state if accepted
    z = jnp.where(is_accepted, new_z, z)
    hmc_output = {"z": z,
                  "is_accepted": is_accepted,
                  "log_accept_ratio": log_accept_ratio}
    # hmc_output["z"] has shape [num_dimensions]
    return z, hmc_output

def hmc(target_log_prob_and_grad, num_leapfrog_steps, step_size, num_steps, z,
        seed):
    # create a seed for each step
    seeds = jax.random.split(seed, num_steps)
    # this will repeatedly run hmc_step and accumulate the outputs
    _, hmc_output = jax.lax.scan(
        partial(hmc_step, target_log_prob_and_grad, num_leapfrog_steps, step_size),
        z, seeds)
    # hmc_output["z"] now has shape [num_steps, num_dimensions]
    return hmc_output

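# for reference: a pure-Python sketch of what jax.lax.scan does in hmc() above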
def scan(f, state, xs):
  output = []
  for x in xs:
    state, y = f(state, x)
    output.append(y)
  return state, jnp.stack(output)

HMC

num_leapfrog_steps=30
step_size = 0.008
from jax import random
seed = random.PRNGKey(92)
num_samples=10000
hmc_output = hmc(target_log_prob_and_grad, num_leapfrog_steps, step_size,
    num_samples, theta, seed)

Inspect the results

Code
ndims = x.shape[-1]
thetap = hmc_output['z']
taup, lambp, betap = jnp.split(thetap, [1, 1 + ndims],axis=1)
skip = 2000
slope = betap[skip:,1]
intercept = betap[skip:,0]
plt.hist(slope,bins=50);
plt.hist(intercept,bins=50);

Inspect the results

Code
print(np.quantile(slope,[0.05,0.95,0.5]))
[-14.34220815  -1.41647384  -5.76567411]
Code
print(np.quantile(intercept,[0.05,0.95,0.5]))
[17.46308355 39.23070431 28.39726067]
Code
print(np.mean(intercept),np.mean(slope))
28.044336 -6.913012
Code
y = iris$Species=="setosa"
glm(y~iris$Sepal.Length, family=binomial)

Call:  glm(formula = y ~ iris$Sepal.Length, family = binomial)

Coefficients:
      (Intercept)  iris$Sepal.Length  
            27.83              -5.18  

Degrees of Freedom: 149 Total (i.e. Null);  148 Residual
Null Deviance:      191 
Residual Deviance: 72   AIC: 76

Hierarchical Model

We will use the Normal-Normal model with known variance and unknown mean \[ \begin{aligned} \mu \sim & \mathrm{Normal}(0,1) \\ x\mid \mu \sim & \mathrm{Normal}(\mu,1) \end{aligned} \]
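With this conjugate prior the posterior is available in closed form (the standard Normal-Normal result, matching the `mu_post` and `sigma_post` computed in the code below): for \(n\) observations with known \(\sigma\) and prior \(N(\mu_0, \sigma_0^2)\), \[ \mu \mid x_{1:n} \sim \mathrm{Normal}\left( \frac{\mu_0/\sigma_0^2 + \sum_{i=1}^n x_i/\sigma^2}{1/\sigma_0^2 + n/\sigma^2},\; \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1} \right). \]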

Let’s sample

Code
n =20
data= sort(rnorm(n,0))
# hist(data)
samples = 15000
proposal_width = 0.5
mu_current = 0
sigma = 1
mu0 = 0
sigma0 = 1
posterior = c(mu_current)
for (i in 1:samples) {
  mu_proposal = rnorm(1,mu_current, proposal_width)
  likelihood_current =  prod(dnorm(data,mu_current,  sigma))
  likelihood_proposal = prod(dnorm(data,mu_proposal, sigma))
  prior_current =  dnorm(mu_current,mu0, sigma0)
  prior_proposal = dnorm(mu_proposal,mu0, sigma0)

  p_current = likelihood_current * prior_current
  p_proposal = likelihood_proposal * prior_proposal
  # Accept proposal?
  p_accept = p_proposal / p_current
  accept = (runif(1) < p_accept)
  if (accept) {
      mu_current = mu_proposal
  }
  # record the current state every iteration (repeats included);
  # appending only accepted draws would bias the sample
  posterior = c(posterior,mu_current)
}
posterior = posterior[-(1:500)]
plot(posterior, pch=16)

Code
mu_post = (mu0 / sigma0^2 + sum(data) / sigma^2) / (1. / sigma0^2 + n / sigma^2)
sigma_post = sqrt((1. / sigma0^2 + n / sigma^2)^(-1))
sprintf("Analytical mean: %.2f. MCMC Mean: %.2f", mu_post, mean(posterior))
[1] "Analytical mean: 0.03. MCMC Mean: 0.03"
Code
sprintf("Analytical sd: %.2f. MCMC sd: %.2f", sigma_post, sd(posterior))
[1] "Analytical sd: 0.22. MCMC sd: 0.23"
Code
x = seq(-1,1,by=0.01)
dpost = dnorm(x,mu_post,sigma_post)
hist(posterior, freq = F, breaks=30, ylim=c(0,2), main="Posterior Mean")
lines(density(posterior), ylim=c(0,2), main="Posterior Mean")
lines(x,dpost, type='l', lwd=3, col='red')

Let’s compare to HMC

Code
library(brms)
d = data.frame(y=data, sigma0=rep(1,20))
fit <- brm(data = d, 
            family = gaussian,
            y | se(sigma0) ~ 1,
    prior = c(prior(normal(0, 1), class = Intercept)),
    iter = 1000, refresh = 0, chains = 4)
posterior = as_draws_array(fit, variable = "b_Intercept")
hist(posterior, freq = F, breaks=30, ylim=c(0,2), main="Posterior Mean")
lines(density(posterior), ylim=c(0,2), main="Posterior Mean")
lines(x,dpost, type='l', lwd=3, col='red')

Problems of Metropolis-Hastings algorithm

MH is very simple and quite general. At the same time

  • too large a step size \(\sigma\) leads to a large fraction of rejected samples, while too small a \(\sigma\) makes very small steps, so it takes a long time to ‘explore the distribution’ (check this in the demonstration!)
  • in high-dimensional spaces (a very important use-case), MH explores the space very inefficiently because of its random-walk behavior. Guessing a good direction in 1000 dimensions is incomparably harder than doing so in 2 dimensions!
  • MH can’t travel long distances (significantly larger than \(\sigma\)) between isolated modes of the distribution

Sampling high-dimensional distributions with MH becomes very inefficient in practice. A more efficient scheme is called Hamiltonian Monte Carlo (HMC).