# Chapter 6 Optimization

We studied different ways to build parametric predictive models $$y = f_{\theta}(x) + \epsilon$$. Our predictor $$f_{\theta}(x)$$ is parametrized by a vector $$\theta \in \mathbb{R}^p$$. We also described three approaches to estimating the parameters of such a model, given a training data set $$D = (X,y) = \{(x_i,y_i)\}_{i=1}^n$$. Maximum likelihood, when $$\epsilon$$ is Normal (the regression problem), requires solving the optimization problem $\hat \theta = \arg\min_{\theta} \sum_{i=1}^{n}\left(y_i - f_{\theta}(x_i)\right)^2.$ MAP-based inference involves solving an optimization problem with the penalized negative log-likelihood as the objective $\hat \theta = \arg\min_{\theta} \sum_{i=1}^{n}\left(y_i - f_{\theta}(x_i)\right)^2 + \lambda\sum_{j=1}^{p}\phi(\theta_j).$ Here $$\phi(\theta)$$ is the negative log-likelihood of the prior on the model parameters $$\theta$$, and $$\lambda$$ is a global parameter that gauges the amount of regularization.

Full Bayesian inference computes the posterior distribution over model parameters using Bayes rule $p(\theta \mid X,y) = \dfrac{p(y\mid X, \theta)p(\theta)}{p(y\mid X)}.$

Unless we have a conjugate prior, an analytical formula for the posterior distribution is unavailable; instead we apply numerical methods, such as Markov Chain Monte Carlo, to sample from the posterior distribution.

In this chapter we will discuss optimization methods for performing MLE and MAP estimation as well as Markov Chain Monte Carlo for the full Bayesian inference calculations.

## 6.1 Preliminaries

Let us start with a general optimization problem which arises when maximum a posteriori (MAP) or maximum likelihood (MLE) estimation is used. To perform the inference, both approaches involve solving an unconstrained optimization problem of the form $\mathop{\mathrm{minimize \quad}}_{\theta} l(\theta, X,y) + \lambda\phi(\theta).$

Given training data $$(X,y)$$, the loss function $$l: \mathbb{R}^p \rightarrow \mathbb{R}$$ is a function of $$\theta$$ and measures the misfit between the model and the training data. The parameter $$\lambda$$ scales the regularization penalty function and is used to control the bias-variance trade-off.

The table below shows examples of loss and penalty functions.

| predictor | $$l(\theta,x,y)$$ | $$f_\theta(x)$$ | $$\phi(\theta)$$ |
|---|---|---|---|
| least-squares | $$(\theta^Tx - y)^2$$ | $$\theta^Tx$$ | $$0$$ |
| ridge regression | $$(\theta^Tx - y)^2$$ | $$\theta^Tx$$ | $$\Vert\theta\Vert _2^2$$ |
| lasso | $$(\theta^Tx - y)^2$$ | $$\theta^Tx$$ | $$\Vert\theta\Vert _1$$ |
| logistic classifier | $$\log (1+ \exp(-y\theta^Tx))$$ | $$\mathrm{sign}~\theta^Tx$$ | $$0$$ |
| SVM | $$(1-y\theta^Tx)_+$$ | $$\mathrm{sign}~\theta^Tx$$ | $$\Vert\theta\Vert _2^2$$ |

We can further mix and match the loss functions and the penalty functions. For example, an $$\ell_1$$ penalty can be added to the logistic classifier loss function to obtain a sparse model. To simplify our formulas, for now we will consider just the loss function and will develop a few optimization algorithms for finding optimal parameter values. A necessary condition for a point $$\hat \theta$$ to be optimal is $\nabla_{\theta} l(\hat \theta)= 0,$

where $$\nabla$$ stands for the gradient of the function, which is the vector of partial derivatives $\nabla_{\theta} l = \left(\dfrac{\partial l}{\partial \theta_1},\ldots,\dfrac{\partial l}{\partial \theta_p}\right).$

The partial derivative with respect to a parameter measures the change in the value of the function relative to a change in that input variable

$l(\theta + \delta_i) \approx l(\theta) + \delta \dfrac{\partial l}{\partial \theta_i}.$ Here $$\delta_i = (0,\ldots,0,\delta,0,\ldots,0)$$ is a small perturbation of the $$i$$th component of the vector $$\theta$$.

Thus, solving the unconstrained minimization problem is the same as finding all of the points where the derivative is zero; such a point is called critical. On rare occasions a critical point can be calculated analytically. For example, for the quadratic loss function $l(\theta,X,y) = \sum_{i=1}^{n}(\theta^Tx_i - y_i)^2 = (X\theta - y)^T(X\theta - y) = \theta^TX^TX\theta - 2y^TX\theta + y^Ty.$ The derivative is given by $\dfrac{\partial l}{\partial \theta_j} = \sum_{i=1}^{n}2(\theta^Tx_i - y_i)x_{ij} = 2\sum_{i=1}^{n}(\theta^Tx_i)x_{ij} - 2\sum_{i=1}^{n}y_ix_{ij},$ which we write in matrix form to yield $\nabla_{\theta} l = 2X^TX\theta - 2X^Ty.$ Setting this equal to zero gives the solution $$\hat\theta = (X^TX)^{-1} X^Ty$$. In many situations, it is impossible to derive an analytical solution to an optimization problem, and an iterative algorithm must be used. Optimization algorithms used in model fitting can only find one of the critical points. However, many useful learning models have a convex loss function, which has only one critical point. Thus, the algorithms considered in this section can solve this class of problems.
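As a sanity check, the closed-form solution above can be verified numerically. A minimal NumPy sketch, using simulated data (the problem sizes and coefficient values here are illustrative assumptions):

```python
import numpy as np

# Simulated regression problem (illustrative values)
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=n)

# Critical point of the quadratic loss: solve X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient 2 X^T X theta - 2 X^T y vanishes at the solution
grad = 2 * X.T @ X @ theta_hat - 2 * X.T @ y
print(theta_hat, np.linalg.norm(grad))
```

The gradient is zero at the computed solution up to floating-point error, confirming that $$\hat\theta$$ is a critical point.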

Many important learning problems lead to convex loss functions. Consider an exponential family of distributions $p(x \mid \theta) = \dfrac{l(x)}{g(\theta)}\exp(\theta^Tu(x)).$ For any $$u(x)$$, $$l(x)>0$$ and $$g(\theta) = \int l(x)\exp(\theta^Tu(x)) dx$$, $$p(x \mid \theta)$$ will be a probability density function. If a density function can be represented as a product $$p(x\mid\theta) = l(x)h(\theta)g(\theta,c(x))$$, then $$c(x)$$ is called a sufficient statistic of this distribution. Thus, for the exponential family the function $$u(x)$$ is the sufficient statistic. A sufficient statistic contains all of the information needed for estimating $$\theta$$: once we have calculated the sufficient statistics, we no longer need the sample data to estimate $$\theta$$. This has two immediate practical implications. First, when we need to estimate the parameter $$\theta$$ based on data distributed among multiple servers, we do not need to copy the data to a single location. Rather, we can calculate the sufficient statistics for each part of the data and use those to estimate $$\theta$$. Second, we can analyze the data sequentially by updating the sufficient statistics each time a new chunk of samples arrives, after which we do not need to store those samples. Note that $\dfrac{\partial\log g(\theta)}{\partial\theta_i} = \dfrac{1}{g(\theta)}\int l(x)u_i(x)\exp(\theta^Tu(x))dx = \operatorname{E}\left(u_i(x)\right).$

Can you derive sufficient statistics for the Gamma distribution? Taking the second derivative, $\dfrac{\partial^2\log g(\theta)}{\partial\theta_i\partial\theta_j} = \operatorname{Cov}\left(u_i(x),u_j(x)\right).$ Thus $$\log g(\theta)$$ is a convex function, since its Hessian is a covariance matrix and hence positive semi-definite! This guarantees concavity of the log-likelihood with respect to $$\theta$$. Indeed, $\log p(x\mid\theta) = \sum_{i}\log l(x_i) - n\log g(\theta) + \sum_i \theta^T u(x_i).$ The first term is a constant (does not depend on $$\theta$$), the second term is concave (the negative of a convex function), and the third term is linear, so the resulting function is concave. Thus, a numerical solver that searches for a local maximum will find the global solution.

## 6.2 Newton’s Method

Newton’s method solves the problem of finding zeros of a non-linear function. Thus, we can apply it to solve $$\nabla l(\theta) = 0$$. The method can be derived from the following formula $\nabla l(\theta + \delta) \approx \nabla l(\theta) + H(\theta)\delta.$ Here $$H(\theta) = \nabla^2 l(\theta)$$ is called the Hessian matrix, which is the matrix of second derivatives of the function $$l$$, with $$i,j$$th element $H_{ij}(\theta)= \dfrac{\partial^2 l}{\partial \theta_i\partial \theta_j}.$

Now, we want to find $$\delta$$ such that $$\nabla l(\theta) + H(\theta)\delta = 0$$; we do this by multiplying by $$H^{-1}$$ $\delta = -H(\theta)^{-1}\nabla l(\theta).$ Thus, the solution is approximated by $\hat \theta \approx \theta -H(\theta)^{-1}\nabla l(\theta).$ If we do this iteratively, looking for an approximate solution at every step, we get Newton’s method $\theta^+ = \theta -H(\theta)^{-1}\nabla l(\theta),$ where $$\theta^+$$ stands for the next iterate. The pure Newton’s method might not converge. In practice, we instead use the damped Newton’s method, which repeats $\theta^+ = \theta-t(\nabla^2 l(\theta))^{-1} \nabla l(\theta).$ Note that the pure method uses $$t = 1$$. The step sizes here are typically chosen by backtracking line search, with parameters $$0 < \alpha \leq 1/2, 0 <\beta < 1$$. At each iteration, we start with $$t = 1$$ and while $l(\theta+tv) > l(\theta) + \alpha t \nabla l(\theta)^T v$ we shrink $$t = \beta t$$; otherwise we perform the Newton update. Note that here $$v = -(\nabla^2 l (\theta))^{-1} \nabla l (\theta)$$, so $$\nabla l (\theta)^T v = -\lambda^2(\theta)$$.

### 6.2.1 Convergence analysis

There’s sometimes confusion about terminology for convergence. Here’s what optimizers generally say when talking about how fast a positive sequence of scalars $$t_k$$ is decreasing to zero.

• Sub-linear: $$t_k \rightarrow 0$$, but $$t_{k+1}/t_k \rightarrow 1$$. Example: $$1/k$$ rate, where $$t_k \le Q/k$$, for some constant $$Q$$.
• Linear: $$t_{k+1}/t_k \le r$$ for some $$r \in (0, 1)$$. Thus typically $$t_k \le Cr^k$$. Also called “geometric” or “exponential” (an oversell).
• Super-linear: $$t_{k+1}/t_k \rightarrow 0$$. That’s fast!
• Quadratic: $$t_{k+1} \le Ct_k^2$$. Really fast, typical of Newton’s method. The number of correct significant digits doubles at each iteration. There’s no point being faster than this!
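A toy illustration of these rates (the sequences below are assumed examples, not output of any solver):

```python
# Illustrative sequences exhibiting each rate
sublinear = [1.0 / k for k in range(1, 30)]          # t_{k+1}/t_k -> 1
linear    = [0.5 ** k for k in range(1, 30)]         # ratio fixed at r = 0.5
quadratic = [0.5 ** (2 ** k) for k in range(1, 7)]   # t_{k+1} = t_k^2

# For the quadratic sequence the exponent doubles each step,
# i.e., the number of correct digits doubles per iteration.
print(sublinear[-1], linear[-1], quadratic[-1])
```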

Assume that $$l$$ is convex and twice differentiable, with domain $$\mathbb{R}^n$$, and additionally:

• $$\nabla l$$ is Lipschitz with parameter $$L$$
$\Vert\nabla l(\theta) - \nabla l(y)\Vert _2 \leq L\Vert\theta-y\Vert _2 \text{ for any } \theta,y$

• $$l$$ is strongly convex with parameter $$m$$
$\nabla^2l(\theta) \succeq mI, ~~~\mbox{for all}~~\theta$

• $$\nabla^2 l$$ is Lipschitz with parameter $$M$$

Then Newton’s method with backtracking line search satisfies the following two-stage convergence bound $l(\theta^{(k)}) - l^* \leq \begin{cases} (l(\theta^{(0)})-l^*)- \gamma k & \text{if } k \leq k_0\\ \frac{2m^3}{M^2} \left(\frac{1}{2}\right)^{2^{k-k_0+1}} & \text{if } k > k_0. \end{cases}$ Here $$\gamma = \alpha \beta^2 \eta^2 m/L^2$$, $$\eta = \min\{1, 3(1- 2\alpha)\}m^2 /M$$, and $$k_0$$ is the number of steps until $$\Vert \nabla l(\theta^{(k_0 +1)} )\Vert _2 < \eta$$. In more detail, the analysis reveals that convergence proceeds in two stages.

• Damped phase: $$\Vert \nabla l(\theta^{(k)} )\Vert _2 \geq \eta$$, and
$l(\theta^{(k+1)}) - l(\theta^{(k)}) \leq -\gamma$

• Pure phase: $$\Vert\nabla l(\theta^{(k)} )\Vert _2 < \eta$$, backtracking selects $$t = 1$$, and
$\frac{M}{2m^2} \Vert\nabla l(\theta^{(k+1)})\Vert _2 \leq \left(\frac{M}{2m^2} \Vert\nabla l(\theta^{(k)})\Vert _2\right)^2$

Note that once we enter the pure phase, we won’t leave it, provided $$\eta \leq m^2 /M$$. Unraveling this result, what does it say? To reach $$l(\theta^{(k)} ) - \hat l \leq \epsilon$$, we need at most $\frac{l(\theta^{(0)}) - \hat l}{\gamma} + \log_2\log_2 (\epsilon_0/\epsilon)$ iterations, where $$\epsilon_0 = 2m^3 /M^2$$. This is called quadratic convergence. The above result is a local convergence rate, i.e., we are only guaranteed quadratic convergence after some number of steps $$k_0$$, where $$k_0 \leq \frac{l(\theta^{(0)})-\hat l}{\gamma}$$. Somewhat bothersome may be the fact that the above bound depends on $$L, m, M$$, and yet the algorithm itself does not. A scale-free analysis is possible for self-concordant functions: on $$\mathbb{R}$$, a convex function $$l$$ is called self-concordant if

$|l'''(\theta)| \leq 2l''(\theta)^{3/2} \text{ for all } \theta$

and on $$\mathbb{R}^n$$ it is called self-concordant if its projection onto every line segment is so. E.g., $$l(\theta) = -\log(\theta)$$ is self-concordant. Nesterov and Nemirovskii showed that for self-concordant functions, Newton’s method with backtracking line search requires at most $C(\alpha, \beta)(l(\theta^{(0)}) - \hat l) + \log\log (1/\epsilon)$ iterations to reach $$l(\theta^{(k)} ) - \hat l \leq \epsilon$$, where the constant $$C(\alpha, \beta)$$ depends only on $$\alpha, \beta$$.

## 6.3 Stochastic Gradient Descent

There is a simpler procedure, called gradient descent, that finds minima but does not require calculating the Hessian: $\theta^+ = \theta - t \nabla l (\theta),$ which converges, as we will show, to the solution for sufficiently small $$t$$. The gradient descent method is derived from Newton’s method by replacing $$H(\theta)$$ by $$\frac{1}{t} I$$, where $$I$$ is the identity matrix; the resulting iterations are $\theta^+ = \theta -t\nabla l(\theta).$ At each iteration, gradient descent minimizes $l (y) \approx l (\theta) + \nabla l (\theta)^T (y - \theta) + \frac{1}{2t}\Vert y - \theta\Vert^2_2.$ Here $$l (\theta) + \nabla l (\theta)^T (y - \theta)$$ is a linear approximation to $$l$$ and $$\frac{1}{2t}\Vert y - \theta\Vert^2_2$$ is a proximity term to $$\theta$$, with weight $$1/(2t)$$. We choose the next point $$y = \theta^+$$ to minimize this quadratic approximation: $\theta^+ = \theta - t \nabla l(\theta).$

Figure 6.2 shows the geometric interpretation of the gradient descent method. The blue point is the current iterate $$\theta$$ and the red point is the next iterate $$\theta^+$$:

$\theta^+ = \arg\min_y l(\theta) + \nabla l(\theta)^T (y - \theta) + \frac{1}{2t} \Vert y-\theta\Vert^2_2$

The parameter $$t$$ controls the step size (a.k.a. learning rate) and needs to be specified by the modeler. If the step size is too large, the algorithm will not converge. If we choose a step size that is too small, it will take many steps for the algorithm to converge to the solution. Figure 6.3 shows the effect of the step size on convergence for $$l(\theta) = (10\theta^2_1 + \theta^2_2 )/2$$.
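The effect can be reproduced with a few lines of NumPy; the step-size values below are illustrative assumptions (for this function the gradient is $$(10\theta_1, \theta_2)$$, so fixed-step gradient descent is stable only for $$t < 2/L = 0.2$$):

```python
import numpy as np

# Gradient descent on l(theta) = (10*theta_1^2 + theta_2^2)/2
def gd(t, theta0=(1.0, 1.0), iters=100):
    theta = np.array(theta0)
    A = np.array([10.0, 1.0])      # gradient is (10*theta_1, theta_2)
    for _ in range(iters):
        theta = theta - t * A * theta
    return theta

print(gd(0.05))   # small but safe: converges
print(gd(0.18))   # near the stability limit t < 0.2: still converges
print(gd(0.25))   # too large: diverges along the theta_1 axis
```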

Clearly there’s a tradeoff; the convergence analysis later will give us a better idea.

## 6.4 Automatic Differentiation (AD)

In general, there are three different ways to calculate these derivatives.

• Numerical differentiation, where the gradient is approximated by a finite difference $$f'(x) \approx (f(x+h)-f(x))/h$$; numerical differentiation is not backward stable.

• Symbolic differentiation uses a tree-form representation of a function and applies the chain rule.

• Automatic differentiation (AD), which, similar to symbolic differentiation, recursively applies the chain rule. It calculates the exact value of the derivative and thus avoids the problem of numerical instability.

AD provides the value of the derivative evaluated at a specific point rather than an analytical representation of the derivative.

AD does not require an analytical specification of the function. AD can differentiate complex functions that involve IF statements and loops, and AD can be implemented in either forward or backward mode.

    from math import exp

    def sigmoid(x, b, w):
        # forward pass broken into primitive operations
        v1 = w * x
        v2 = v1 + b
        v3 = 1 / (1 + exp(-v2))
        return v3

### 6.4.1 Computational Graph

DL models high-dimensional data using complex models. We represent both the model and the algorithm using a computational graph. In this graph, each node not only passively stores parameters but also has behavior, which, when combined, specifies the learning algorithm. The vertices of the graph contain parts of the code and store the state.

### 6.4.2 Back-Propagation

Back-propagation is the algorithm for calculating the derivative $$\nabla \mathcal{L}$$, which is required for the optimization iterations. We can think of the derivative of a function as the rate at which the function changes. A function has large derivatives at points where it is steep (changes fast) and small derivatives in valleys. The derivative can be approximated by a finite difference $f'(x) \approx \dfrac{f(x+h)-f(x)}{h}$ for some small $$h$$. It is simply the ratio of the change in output $$f$$ to the change in input $$x$$. For example, the derivative of a linear function $$f(x) = kx$$ is constant: $$f(x+h) - f(x) = kx + kh - kx = kh$$, thus $$f'(x) = k$$, a constant. Now, for a quadratic function we have $$f(x+h) - f(x) = (x+h)^2 - x^2 = 2xh + h^2$$, thus $$f'(x)\approx 2x+h$$; when $$h$$ is very small we ignore the $$h$$ term to get $$f'(x) = 2x$$.

Back-propagation is a specific type of automatic differentiation algorithm, applied in reverse (backward) mode.

Take `c = a+b; d = b+1; e = c*d`.
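A minimal sketch of the forward and backward passes for this expression, with assumed input values $$a=2, b=1$$; note that $$b$$ feeds both $$c$$ and $$d$$, so its two paths are summed:

```python
# Forward pass for e = c*d with c = a+b, d = b+1, at a = 2, b = 1
a, b = 2.0, 1.0
c = a + b          # c = 3
d = b + 1          # d = 2
e = c * d          # e = 6

# Backward pass: accumulate de/d(node) from the output down
dedc = d           # d(c*d)/dc = d = 2
dedd = c           # d(c*d)/dd = c = 3
deda = dedc * 1.0  # dc/da = 1, chain rule
dedb = dedc * 1.0 + dedd * 1.0  # b feeds both c and d: sum the two paths
print(deda, dedb)  # 2.0 5.0
```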

#### 6.4.2.2 Differentiation via Backward Calculations (Backprop)

$\frac{\partial}{\partial a}(a+b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} \; {\rm and} \; \frac{\partial}{\partial u}uv = u\frac{\partial v}{\partial u} + v\frac{\partial u}{\partial u}$

#### 6.4.2.3 Backprop

    # set some inputs
    x = -2; y = 5; z = -4

    # perform the forward pass
    q = x + y # q becomes 3
    f = q * z # f becomes -12

    # perform the backward pass (backpropagation)
    # in reverse order:
    # first backprop through f = q * z
    dfdz = q # df/dz = q, so gradient on z becomes 3
    dfdq = z # df/dq = z, so gradient on q becomes -4
    # now backprop through q = x + y
    dfdx = 1.0 * dfdq # dq/dx = 1.
    # And the multiplication here is the chain rule!
    dfdy = 1.0 * dfdq # dq/dy = 1

#### 6.4.2.5 Backprop for Regression

Suppose that we have a shallow two-layer neural net (a.k.a. regression).

The loss function is simply $$L^2$$, namely $l(W_1,W_2) = (W_2\sigma(W_1x)-y)^2.$ We need to know how to calculate $$\sigma'$$ and $$\frac{\partial a}{\partial W},\ \frac{\partial a}{\partial x}$$, where $$a = Wx$$.

To automate the process of calculating derivatives of complex functions, we represent our function as a computational graph. Consider the following expression: $$\mathcal{L}(a,b,c) = 3(bc+a)$$; its graph form is shown in @ref(fig:tree_form_simple).

Graph form allows us to combine derivatives of simple functions to calculate the derivative of a more complex expression. Say we want to calculate the gradient $$\nabla\mathcal{L}(a,b,c)$$ at the point $$(a,b,c) = (5,3,2)$$. First we evaluate the graph by moving from the bottom to the top, which is called the forward pass

1. $$u= bc = 6$$

2. $$v = a+u = 11$$

3. $$\mathcal{L} = 3v = 33$$

To calculate the derivative $$\dfrac{\partial \mathcal{L}}{\partial b}$$ we move in the backward direction

1. $$\dfrac{\partial \mathcal{L}(v)}{\partial b} = 3\dfrac{\partial v}{\partial b}$$

2. $$\dfrac{\partial v}{\partial b} = \dfrac{\partial u}{\partial b}$$

3. $$\dfrac{\partial u}{\partial b}= c = 2$$.

When we put all three expressions together, we get $\dfrac{\partial \mathcal{L}(v)}{\partial b} = 3\times2 = 6.$ In other words, when $$b$$ changes by one unit, $$\mathcal{L}$$ changes by six units. Let’s check it by replacing $$b = 3 \rightarrow 4$$; then $$\mathcal{L}(5,4,2) = 3(8+5) = 39$$, so it changed from 33 to 39 (6 units)!
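We can confirm this derivative with a finite difference (a quick sketch; the step $$h$$ is an arbitrary small value):

```python
# Finite-difference check of dL/db at (a, b, c) = (5, 3, 2)
f3 = lambda a, b, c: 3 * (b * c + a)   # the expression L(a,b,c) = 3(bc + a)
a, b, c = 5.0, 3.0, 2.0
h = 1e-6
fd = (f3(a, b + h, c) - f3(a, b, c)) / h
print(fd)                          # approximately 6

# a one-unit change in b changes the output by 6 units
print(f3(5, 4, 2) - f3(5, 3, 2))   # 6
```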

We calculated this derivative using the chain rule, which allows us to differentiate composite functions $\left(f(g(x))\right)' = f'(g(x))g'(x).$

Now consider an actual function we are interested in when training a deep learning model, a quadratic loss function with just two samples; mathematically it is $\mathcal{L}_w(x) = (w^Tx_1 - y_1)^2 + (w^Tx_2 - y_2)^2$

The forward pass

1. $$z_1 = w^Tx_1 - y_1$$, $$z_2 = w^Tx_2 - y_2$$

2. $$a_1 = z_1^2$$, $$a_2 = z_2^2$$

3. $$\mathcal{L}(a) = a_1 + a_2$$

And the backward pass

1. $$\dfrac{\partial \mathcal{L}(a)}{\partial w_i } = \dfrac{\partial a_1}{\partial w_i } + \dfrac{\partial a_2}{\partial w_i }$$

2. $$\dfrac{\partial a_j}{\partial w_i } = 2z_j \dfrac{\partial z_j}{\partial w_i }$$

3. $$\dfrac{\partial z_j}{\partial w_i } = x_{ji}$$

Putting it together, we get $\dfrac{\partial \mathcal{L}(x) }{\partial w_i} = 2(w^Tx_1-y_1)x_{1i} + 2(w^Tx_2-y_2)x_{2i}.$

To calculate the derivatives we traversed the computational graph from the top to the bottom, propagating derivatives backward; thus the name of the algorithm: back propagation, or backprop for short.

Now, let’s try to calculate the derivative of the loss function for logistic classification with just one sample $$(x,y)$$. Recall that it is a sum of the following terms $\mathcal{L}_w(x) = -\left[y \log\left(\sigma(w^Tx+b)\right)+ (1-y)\log\left(1-\sigma(w^Tx+b)\right)\right].$

To keep things simple, consider just the first term $$Q_w(x) = y \log\left(\sigma(w^Tx+b)\right)$$.

The forward pass

1. $$z = w^Tx + b$$

2. $$a = \sigma(z ) = 1/(1+e^{-z})$$

3. $$Q(a,y) = y\log a$$

and the backward pass

1. $$\dfrac{\partial Q(a,y) }{\partial w_i} = y\dfrac{1}{a}\dfrac{\partial a}{\partial w_i}$$

2. $$\dfrac{\partial a}{\partial w_i} = \sigma(z)(1-\sigma(z))\dfrac{\partial z}{\partial w_i}$$

3. $$\dfrac{\partial z}{\partial w_i} = x_i$$

For the second step we used the following identity: $\dfrac{\partial \sigma(z)}{\partial z} = \frac{e^{-z}}{\left(e^{-z}+1\right)^2} = \sigma(z)(1-\sigma(z)).$

Now, putting it all together

$\dfrac{\partial Q}{\partial w_i} = y(1-a)x_i.$ We can similarly calculate the derivative of the second term $$(1-y)\log\left(1-\sigma(w^Tx+b)\right)$$ and leave it for the reader as an exercise.
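A quick finite-difference check of this formula (the weights and inputs below are arbitrary assumed values):

```python
import numpy as np

# Check dQ/dw_i = y*(1-a)*x_i for Q = y*log(sigmoid(w^T x + b))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
Q = lambda w, x, b, y: y * np.log(sigmoid(w @ x + b))

w = np.array([0.3, -0.2]); x = np.array([1.0, 2.0]); b = 0.1; y = 1.0
a = sigmoid(w @ x + b)
analytic = y * (1 - a) * x             # the formula derived above

h = 1e-6
numeric = np.array([(Q(w + h * e, x, b, y) - Q(w, x, b, y)) / h
                    for e in np.eye(2)])
print(analytic, numeric)               # the two should agree closely
```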

When we have multiple samples $$\{x_i,y_i\}_{i=1}^n$$, the loss function is a sum $\mathcal{L}_w(x) = \sum_{i=1}^{n}\mathcal{L}_w(x_i), ~~ \dfrac{\partial \mathcal{L}_w(x)}{\partial w_j} = \sum_{i=1}^{n}\dfrac{\partial \mathcal{L}_w(x_i)}{\partial w_j}.$ We simply calculate each term of the derivative expression $$\dfrac{\partial \mathcal{L}_w(x_i)}{\partial w_j}$$ using the formulas we derived above.

### 6.4.3 Backprop for Classification

The loss function is now given by $l(F(x,\theta),y=k) = -\log\frac{\exp(f_k(x,\theta))}{\sum_{i=1}^{K}\exp(f_i(x,\theta))}.$ Useful identities: $$g(z) = e^z/(1+e^z) = 1/(1+e^{-z})$$, $g'(z) = g(z)(1-g(z)),$ $\left(\log g(z)\right)' = 1-g(z) = g(-z).$ Check your gradients using finite differences!

#### 6.4.3.1 Logistic Likelihood

\begin{aligned} L(\theta | x) &= Pr(Y | X; \theta) \\ &= \prod_i Pr(y_i | x_i; \theta) \\ &= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)}\end{aligned} Taking logs gives the cross-entropy loss $J(\theta) = - \sum_i \left(y_i \log( h_\theta(x_i )) + (1 - y_i) \log( 1 - h_\theta(x_i) ) \right),$ with partial derivatives $\frac{\partial J(\theta)}{\partial \theta_j} = \sum_i x^{(i)}_j (h_\theta(x^{(i)}) - y^{(i)}).$ Written in vector form, the entire gradient can be expressed as $\nabla_\theta J(\theta) = \sum_i x^{(i)} (h_\theta(x^{(i)}) - y^{(i)}).$ This is essentially the same as the gradient for linear regression, except that now $$h_\theta(x)= \sigma(\theta^Tx)$$.
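The vectorized gradient can be checked against the explicit per-sample sum; the data here are simulated, purely for illustration:

```python
import numpy as np

# Cross-entropy gradient in vector form: sum_i x_i (h(x_i) - y_i)
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = (rng.uniform(size=n) < 0.5).astype(float)
theta = rng.normal(size=p)

h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x_i) = sigma(theta^T x_i)
grad = X.T @ (h - y)                   # the vectorized gradient above

# Compare against the per-sample sum written out explicitly
grad_loop = sum(X[i] * (h[i] - y[i]) for i in range(n))
print(np.allclose(grad, grad_loop))
```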

#### 6.4.3.2 Multinomial Logistic regression

$p(y = k\mid x) = h_{\theta k}(x) = \frac{e^{\theta_k^T x}}{\sum_j e^{\theta_j^Tx}}$ So $$h$$ is now a vector \begin{aligned} h_\theta(x) = \begin{bmatrix} P(y = 1 | x; \theta) \\ P(y = 2 | x; \theta) \\ \vdots \\ P(y = K | x; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(\theta^{(j)\top} x) }} \begin{bmatrix} \exp(\theta^T_1 x ) \\ \exp(\theta^T_2 x) \\ \vdots \\ \exp(\theta^T_K x) \\ \end{bmatrix}\end{aligned}

Using one-hot encoding, i.e. $$y=1$$ becomes $$T = (1,0,\ldots,0)$$, we have the likelihood $p(T|\theta) = \prod_{i=1}^n\prod_{k=1}^K p(y_i = k \mid x_i)^{t_{ik}} = \prod_{i=1}^n\prod_{k=1}^K y_{ik}^{t_{ik}},$ where $$T$$ is the $$n \times K$$ matrix of target variables with elements $$t_{ik}$$ and $$y_{ik} = h_{\theta k}(x_i)$$. The negative log-likelihood is then $-\log p(T|\theta) = -\sum_{i=1}^n\sum_{k=1}^K t_{ik}\log y_{ik}$
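A small sketch of these quantities in NumPy; the parameter matrix, inputs, and one-hot targets are assumed toy values (the max-subtraction inside the softmax is a standard numerical-stability trick, not part of the derivation above):

```python
import numpy as np

# Softmax probabilities and negative log-likelihood for one-hot targets
def softmax(scores):
    z = scores - scores.max(axis=1, keepdims=True)  # stabilize exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

Theta = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]])  # K=3 classes, p=2
X = np.array([[1.0, 2.0], [0.0, -1.0]])                   # n=2 samples
T = np.array([[1, 0, 0], [0, 0, 1]])                      # one-hot targets

Y = softmax(X @ Theta.T)               # y_ik = p(y_i = k | x_i)
nll = -np.sum(T * np.log(Y))           # -sum_i sum_k t_ik log y_ik
print(Y.sum(axis=1), nll)              # rows of Y sum to 1
```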


### 6.4.4 Convergence analysis

Lipschitz continuity of $$\nabla l(\theta)$$ with constant $$L$$ implies $l(y) \leq l(\theta) + \nabla l(\theta)^T (y - \theta) + \frac{L}{2} \Vert y-\theta\Vert^2_2 \text{, for all }\theta, y.$

Consider the function $$g(t) = l(y + t(\theta - y))$$; then \begin{aligned}g'(t) = &\nabla l(y + t(\theta-y))^T(\theta-y)\\g'(0) = &\nabla l(y)^T(\theta-y)\end{aligned} Since $\int_{0}^{1}g'(t)dt = g(1) - g(0),$ we have \begin{aligned}\left|l(\theta) - l(y) - \nabla l(y)^T(\theta-y)\right| = &\left|\int_{0}^{1}g'(t)dt - g'(0)\right| \leq \int_{0}^{1}\left|g'(t) - g'(0)\right|dt \leq\\ &\int_{0}^{1} \left\Vert \nabla l(y+t(\theta-y)) - \nabla l(y) \right\Vert _2\left\Vert \theta-y \right\Vert _2dt\leq\\& \left\Vert \theta-y \right\Vert _2\int_{0}^{1}Lt\left\Vert \theta-y \right\Vert _2 dt = \\&\dfrac{L}{2}\left\Vert \theta-y \right\Vert _2^2\end{aligned}

Assume that $$l$$ is convex and differentiable, with domain $$\mathbb{R}^n$$, and additionally that $$\nabla l$$ is Lipschitz continuous with constant $$L > 0$$: $\Vert\nabla l(\theta) - \nabla l(y)\Vert _2 \leq L\Vert\theta-y\Vert _2 \text{ for any } \theta,y.$

Gradient descent with fixed step size $$t \leq 1/L$$ satisfies $l\left(\theta^{(k)}\right) - \hat l \leq \frac{\Vert\theta^{(0)}-\hat \theta\Vert^2_2}{2tk}$

We say gradient descent has convergence rate $$O(1/k)$$. To get $$l(\theta^{(k)} ) - \hat l \leq \epsilon$$, we need $$O(1/\epsilon)$$ iterations. Lipschitz continuity of $$\nabla l(\theta)$$ with constant $$L$$ implies $l(y) \leq l(\theta) + \nabla l(\theta)^T (y - \theta) + \frac{L}{2} \Vert y-\theta\Vert^2_2 \text{, for all }\theta, y.$ This quadratic approximation provides an upper bound for the original function. Thus, gradient descent iterations will decrease the value of the loss function as long as $$t \leq 1/L$$. By setting $$y = \theta^+ = \theta - t\nabla l(\theta)$$, we get \begin{aligned} l(\theta^+ ) &\leq l(\theta) + \nabla l(\theta)^T (\theta^+ - \theta) + \frac{L}{2} \Vert\theta^+-\theta\Vert^2_2 = \\ & l(\theta) + \nabla l(\theta)^T(\theta - t\nabla l(\theta) - \theta) + \dfrac{L}{2}\Vert\theta - t\nabla l(\theta) - \theta\Vert _2^2 = \\ & l(\theta) - t\nabla l(\theta)^T\nabla l(\theta) + \dfrac{L}{2}\Vert t\nabla l(\theta)\Vert _2^2=\\ & l(\theta) - t \Vert\nabla l(\theta)\Vert _2^2 + \dfrac{Lt^2}{2}\Vert\nabla l(\theta)\Vert _2^2=\\ & l(\theta) - t\left(1-\dfrac{Lt}{2}\right)\Vert\nabla l(\theta)\Vert _2^2\leq\\ & l(\theta) - \dfrac{t}{2}\Vert\nabla l(\theta)\Vert _2^2\end{aligned} Thus, for $$t\leq 1/L$$ the method does descend, i.e., $$l(\theta^+) < l(\theta)$$, unless $$\nabla l(\theta) = 0$$. Since $$l$$ is convex, we have $l(\theta) \leq l(\hat \theta) + \nabla l(\theta)^T(\theta - \hat \theta).$

Finally, \begin{aligned}l(\theta^+) \leq &l(\theta) - \dfrac{t}{2}\Vert\nabla l(\theta)\Vert _2^2 \leq \\&l(\hat \theta) + \nabla l(\theta)^T(\theta - \hat \theta) - \dfrac{t}{2}\Vert\nabla l(\theta)\Vert _2^2 = \\&l(\hat \theta) + \dfrac{1}{2t}(\left\Vert \theta - \hat \theta \right\Vert _2^2 - \left\Vert \theta^+ - \hat \theta \right\Vert _2^2)\end{aligned} Applying this at step $$i$$, we get $l\left(\theta^{(i+1)}\right) - l(\hat \theta) \leq \dfrac{1}{2t}\left(\left\Vert \theta^{(i)} - \hat \theta \right\Vert _2^2 - \left\Vert \theta^{(i+1)} - \hat \theta \right\Vert _2^2\right)$

Summing over iterations: \begin{aligned} \sum_{i=1}^k (l(\theta^{(i)})-\hat l) & \leq \frac{1}{2t} (\Vert\theta^{(0)}-\hat \theta\Vert^2_2 - \Vert\theta^{(k)}-\hat \theta\Vert^2_2)\\ & \leq \frac{1}{2t} \Vert\theta^{(0)}-\hat \theta\Vert _2^2 \end{aligned} Since $$l(\theta^{(i)} )$$ is nonincreasing in $$i$$, $l(\theta^{(k)}) - \hat l \leq \frac{1}{k} \sum_{i=1}^k (l(\theta^{(i)} ) - \hat l) \leq \frac{\Vert\theta^{(0)}-\hat \theta\Vert^2_2}{2tk}$

Let’s consider a quadratic problem in $$\mathbb{R}^2$$, $l(\theta) = (1/2)(\theta_1^2 + \gamma \theta_2^2),~~~~\gamma>0,$ with exact line search, starting at $$\theta^{(0)} = (\gamma, 1)$$: $\theta_1^{(k)} = \gamma \left(\dfrac{\gamma-1}{\gamma+1}\right)^k,~~~~~~~~\theta_2^{(k)} = \left(-\dfrac{\gamma-1}{\gamma+1}\right)^k.$ The convergence rate will be very slow for the $$\theta_1$$ component when $$\gamma \gg 1$$ and for the $$\theta_2$$ component when $$\gamma \ll 1$$. Figure 6.4 shows the trajectory of the gradient descent method for $$\gamma = 10$$.
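The closed-form iterates can be verified by running gradient descent with exact line search (for a quadratic, the exact step is $$t = g^Tg / g^TAg$$); a short sketch:

```python
import numpy as np

# Gradient descent with exact line search on l = (theta_1^2 + gamma*theta_2^2)/2
gamma = 10.0
A = np.array([1.0, gamma])             # diagonal of the Hessian
theta = np.array([gamma, 1.0])         # start at theta^(0) = (gamma, 1)
r = (gamma - 1) / (gamma + 1)

for k in range(1, 6):
    g = A * theta                      # gradient
    t = (g @ g) / (g @ (A * g))        # exact line search for a quadratic
    theta = theta - t * g
    closed = np.array([gamma * r**k, (-r)**k])
    print(k, theta, closed)            # the iterate matches the closed form
```

The alternating sign of $$\theta_2^{(k)}$$ produces the characteristic zigzag trajectory of Figure 6.4.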

Now, let’s consider a non-quadratic loss function $l(\theta) = e^{\theta_1 + 3\theta_2 - 0.1} + e^{\theta_1 - 3\theta_2 - 0.1} + e^{-\theta_1 - 0.1}.$ The shape of the loss function is defined by the training data set, which we can scale and offset any way we want. Data scaling helps to improve the conditioning. It is typical to scale the input variables to have mean zero and variance one. In the case of linear regression this leads to a loss function with circle-shaped level lines. When an input dimension varies by orders of magnitude, it is better to take the transform $\log(1 + x).$

In gradient descent the difference between two successive iterates is proportional to the value of the gradient $\theta^+ - \theta \propto \dfrac{\partial l}{\partial \theta}.$ If the average value of the inputs is large (say, 100), then the updates will be very large and correlated, which makes learning bad and slow. Keeping things zero-mean and with small variance simply makes everything work much better. Now, assume that we only have access to a noisy value of the gradient evaluated at $$\theta$$, and that the noise is zero-mean $g(\theta) = \nabla l(\theta) + \epsilon,~~~\operatorname{E}\left(\epsilon\right) = 0.$ A first approach is to get multiple noisy estimates of $$\nabla l(\theta)$$ and then average them $\nabla l(\theta) \approx \dfrac{1}{N}\sum_{i=1}^{N}g_i(\theta).$

Robbins and Monro suggested an alternative approach that requires only a single evaluation of the noisy gradient and updates the current iterate as $\theta_{i+1} = \theta_i - t_i g(\theta_i),$ where $$t_i$$ is a sequence of positive numbers that converges to zero, but slowly, so that $$\sum_i t_i = \infty$$. Stochastic gradient descent (SGD) replaces the gradient by its noisy approximation. Consider a sum of functions $\mathop{\mathrm{minimize}}_{\theta} \frac{1}{n} \sum_{i=1}^n l_i(\theta).$

Stochastic gradient descent applied to this problem repeats $\theta_{i+1} = \theta_i - t_i g(\theta_i),$ where $g(\theta_i) \approx \nabla l(\theta_i), \qquad g(\theta_i) = \dfrac{1}{|E_i|}\sum_{j \in E_i}^{}\nabla l_j(\theta_i),$ where $$E_i \subset\{1,\ldots,n \}$$ and $$|E_i|$$ is the number of elements in $$E_i$$. When $$|E_i| >1$$ the algorithm is called batch SGD, and simply SGD otherwise. Typically, the subset $$E_i$$ is chosen by going cyclically, picking consecutive elements of $$\{1,\ldots,n \}$$. The direction $$g(\theta_i)$$ provides an unbiased estimator of the gradient. Specifically, this leads to $\operatorname{E}\left(g(\theta_i)\right)= \nabla l(\theta_i).$

Given labels $$y_i \in \{0, 1\}$$ and features $$x_i \in \mathbb{R}^p,\ i = 1, \ldots, n$$, consider logistic regression with ridge regularization: $\mathop{\mathrm{minimize}}_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \left( -y_i \theta^T x_i + \log(1 + e^{\theta^T x_i}) \right) + \frac{\lambda}{2} \Vert\theta\Vert _2^2.$ Write the criterion as $l(\theta) = \frac{1}{n} \sum_{i=1}^n l_i(\theta),\ l_i(\theta) = -y_i\theta^T x_i + \log (1+e^{\theta^T x_i}) + \frac{\lambda}{2} \Vert\theta\Vert _2^2.$ The gradient computation $$\nabla l(\theta) = \frac{1}{n}\sum_{i=1}^n ( p_i (\theta) - y_i) x_i + \lambda \theta$$, where $$p_i(\theta) = \sigma(\theta^Tx_i)$$, is doable when $$n$$ is moderate, but not when $$n$$ is huge. Note that:

1. One batch update costs $$O(np)$$
2. One stochastic update costs $$O(p)$$
3. One mini-batch update costs $$O(bp)$$
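A minimal mini-batch SGD sketch for this problem on simulated data (the batch size, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

# Mini-batch SGD for ridge-regularized logistic regression (illustrative)
rng = np.random.default_rng(2)
n, p, lam = 1000, 5, 0.1
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = (rng.uniform(size=n) < 1/(1 + np.exp(-X @ theta_true))).astype(float)

def loss(theta):
    z = X @ theta
    return np.mean(np.log1p(np.exp(z)) - y * z) + lam/2 * theta @ theta

def grad_batch(theta, idx):                 # O(b p) work per update
    h = 1/(1 + np.exp(-X[idx] @ theta))
    return X[idx].T @ (h - y[idx]) / len(idx) + lam * theta

theta, b, t = np.zeros(p), 20, 0.1          # mini-batch size and step size
for step in range(2000):
    idx = rng.choice(n, size=b, replace=False)
    theta = theta - t * grad_batch(theta, idx)

print(loss(np.zeros(p)), loss(theta))       # the loss drops from log(2)
```

Each update touches only $$b$$ rows of $$X$$, which is the $$O(bp)$$ cost in the list above.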

We applied SGD with batch sizes of 1, 10 and 20 to a problem with $$n = 10{,}000$$ and $$p = 20$$. We used fixed step sizes and observed that diminishing step sizes give roughly similar results. Figure 6.5 shows the convergence of SGD for different batch sizes applied to a simulated data set. The method with batch size 1 requires more iterations to converge, as expected. However, on the $$b\times p$$ scale, SGD with a smaller batch size requires less computation to achieve the same level of performance. The amount of computing required for each step of SGD is linear in the batch size, $$O(b)$$. However, the uncertainty in the gradient’s approximation is only reduced by a factor of $$O(\sqrt{b})$$. Thus, there are diminishing returns to putting more samples in each mini-batch.

In practice, the choice of batch size depends on the computing architecture, specifically on the number of computing cores and the number of data samples that can be processed by each core at a time. Parallelization capabilities available in the hardware architecture can be utilized to perform a mini-batch SGD step in the same time as it would take to process a single data point. There are a few other considerations that need to be taken into account while choosing the batch size. Large batch sizes can lead to degradation of out-of-sample performance of a model. Empirically, it was shown that large batch sizes lead SGD to converge to sharp minimizers of the training function. Sharp minimizers lead to a model that does not generalize well. On the other hand, smaller batch sizes consistently converge to flat minima. Thus, the noise in the gradient approximations has a regularizing effect.

Let’s analyze the convergence rates. Recall that, under suitable step sizes, when $$l$$ is convex and has a Lipschitz gradient, full gradient (FG) descent satisfies $l(\theta^{(k)} ) - \hat l = O(1/k).$ What about stochastic gradient (SG) descent? Under diminishing step sizes, when $$l$$ is convex (plus other conditions), $\operatorname{E}\left(l(\theta^{(k)} )\right) - \hat l = O(1/ \sqrt{k}).$ Finally, what about mini-batch stochastic gradient? Again, under diminishing step sizes, for $$l$$ convex (plus other conditions), $\operatorname{E}\left(l(\theta^{(k)} )\right) - \hat l = O(1/ \sqrt{bk} + 1/k).$ But each iteration here is $$b$$ times more expensive, so (for small $$b$$) the rate does not improve in terms of flops. Recall that, under suitable step sizes, when $$l$$ is strongly convex with a Lipschitz gradient, gradient descent satisfies $l(\theta^{(k)} ) - \hat l = O(\rho^k ),$ where $$\rho < 1$$. But, under diminishing step sizes, when $$l$$ is strongly convex (plus other conditions), stochastic gradient descent gives $\operatorname{E}\left(l(\theta^{(k)} )\right) - \hat l = O(1/k).$ So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity. For a while, this was believed to be inevitable, as Nemirovski and others had established matching lower bounds, but these applied to stochastic minimization of criterions $$l(\theta) = \int l(\theta, \xi) d\xi$$. Can we do better for finite sums? First, we perform a convergence analysis under strong convexity. Strong convexity of $$l$$ means that $$l(\theta) - \frac{m}{2} \Vert\theta\Vert^2_2$$ is convex for some $$m > 0$$. If $$l$$ is twice differentiable, then this implies $\nabla^2 l(\theta) \succeq mI \text{ for any }\theta.$

Strong convexity gives a sharper lower bound than that from usual convexity: $l(y) \geq l(\theta) + \nabla l(\theta)^T (y -\theta) + \frac{m}{2} \Vert y-\theta\Vert^2_2 \text{ for all } \theta,y.$ Under the Lipschitz assumption as before, and also strong convexity: gradient descent with fixed step size $$t \leq 2/(m + L)$$ or with backtracking line search satisfies $l(\theta^{(k)} ) - \hat l \leq c^k \frac{L}{2} \Vert\theta^{(0)}-\hat \theta\Vert^2_2,$ where $$0 < c < 1$$. Thus, the rate with strong convexity is $$O(c^k )$$, exponentially fast! To get $$l(\theta^{(k)} ) - \hat l \leq \epsilon$$, we need $$O(\log(1/\epsilon))$$ iterations. This is called linear convergence, because it looks linear on a semi-log plot.
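Linear convergence is easy to verify numerically. A minimal sketch on a strongly convex quadratic, using the fixed step size $$t = 2/(m+L)$$ from the statement above (the setup and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
B = rng.standard_normal((p, p))
H = B @ B.T + np.eye(p)              # Hessian; adding I guarantees m >= 1
b = rng.standard_normal(p)
theta_hat = np.linalg.solve(H, b)    # minimizer of l(th) = 0.5 th'H th - b'th

eigs = np.linalg.eigvalsh(H)
m, L = eigs[0], eigs[-1]
t = 2.0 / (m + L)                    # fixed step size from the theorem

theta = np.zeros(p)
errs = []
for _ in range(80):
    theta = theta - t * (H @ theta - b)          # gradient step
    errs.append(np.linalg.norm(theta - theta_hat))

# on a semi-log plot errs would look linear: the per-step ratio is a constant c < 1
ratios = [errs[k + 1] / errs[k] for k in range(60, 79)]
```

The ratio of successive errors settles near the contraction factor $$(L-m)/(L+m)$$, illustrating the dependence on the condition number $$L/m$$.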

The constant $$c$$ depends adversely on the condition number $$L/m$$ (a higher condition number $$\Rightarrow$$ a slower rate). A look at the conditions for a simple problem, $$l(\theta ) = \frac{1}{2} \Vert y - X \theta\Vert _2^2$$. Lipschitz continuity of $$\nabla l$$:

• This means $$\nabla^2 l(\theta) \preceq LI$$
• As $$\nabla^2 l(\theta)= X^T X$$, we have $$L = \sigma^2_{\max}(X)$$

Strong convexity of $$l$$ :

• This means $$\nabla^2 l(\theta) \succeq mI$$
• As $$\nabla^2 l(\theta ) = X^T X$$, we have $$m =\sigma^2_{\min}(X)$$
• If $$X$$ is wide, i.e., $$X$$ is $$n \times p$$ with $$p > n$$, then $$\sigma_{\min} (X) = 0$$, and $$l$$ can’t be strongly convex
• Even if $$\sigma_{\min} (X) > 0$$, we can have a very large condition number $$L/m = \sigma_{\max}^2 (X)/ \sigma_{\min}^2 (X)$$.

A function $$l$$ having a Lipschitz gradient and being strongly convex satisfies $mI \preceq \nabla^2 l(\theta)\preceq LI \text{ for all } \theta \in \mathbb{R}^p ,$ for constants $$L > m > 0$$. The condition of $$l$$ being sandwiched between two quadratics may seem like a strong condition to hold globally (for all $$\theta \in \mathbb{R}^p$$). But a careful look at the proofs shows that we only need Lipschitz gradients/strong convexity over the sublevel set $S = \{\theta : l(\theta) \leq l(\theta^{(0)} )\},$ which is less restrictive. Some practicalities of gradient descent include the stopping rule: stop when $$\Vert \nabla l(\theta)\Vert _2$$ is small.

• Recall $$\nabla l(\hat \theta ) = 0$$ at the solution $$\hat \theta$$
• If $$l$$ is strongly convex with parameter $$m$$, then

$\Vert\nabla l(\theta)\Vert _2 \leq \sqrt{2m \epsilon} \Rightarrow l(\theta) - \hat l \leq \epsilon$

Pros and cons of gradient descent:

• Pro: simple idea, and each iteration is cheap
• Pro: very fast for well-conditioned, strongly convex problems
• Con: often slow, because interesting problems aren’t strongly convex or well-conditioned
• Con: can’t handle nondifferentiable functions
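A minimal gradient-descent loop with backtracking line search and the gradient-norm stopping rule discussed above (a sketch; the function names and the test problem are ours):

```python
import numpy as np

def gd_backtracking(f, grad_f, theta0, alpha=0.3, beta=0.5, tol=1e-8, max_iter=1000):
    """Gradient descent with backtracking line search.

    Stops when the gradient norm is small, per the stopping rule above."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) <= tol:
            break
        t = 1.0
        # shrink t until a sufficient decrease is achieved
        while f(theta - t * g) > f(theta) - alpha * t * (g @ g):
            t *= beta
        theta = theta - t * g
    return theta

# hypothetical test problem: f(theta) = 0.5 ||theta - c||^2
c = np.array([1.0, -2.0, 3.0])
sol = gd_backtracking(lambda th: 0.5 * np.sum((th - c) ** 2),
                      lambda th: th - c,
                      np.zeros(3))
```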

### 6.4.5 Polyak’s Momentum

Recall the $$O(1/\epsilon)$$ rate for gradient descent over the problem class of convex, differentiable functions with Lipschitz continuous gradients. The updates $$\theta^{(k)}$$ lie in $\theta^{(0)} + \text{span}\{\nabla l(\theta^{(0)} ), \nabla l(\theta^{(1)} ), \ldots, \nabla l(\theta^{(k-1)})\}.$ Polyak’s momentum introduces an additional term $$\mu\left(\theta^{(k)} - \theta^{(k-1)}\right)$$. If we imagine the trajectory of the gradient descent iterations as a rolling ball, then the current speed should be proportional to the previous one. The momentum update is then $\theta^{(k+1)} = \theta^{(k)} - t\nabla l\left(\theta^{(k)}\right) + \mu\left(\theta^{(k)} - \theta^{(k-1)}\right).$ Here $$0\le \mu <1$$. Polyak’s momentum algorithm (or heavy-ball method) was shown to not converge for some simple functions. For example, it was shown that the heavy-ball method does not converge for a one-dimensional function with the following gradient $\nabla l(\theta) = \begin{cases} 25\theta & \theta < 1\\\theta + 24 & 1\le \theta< 2\\25\theta -24 & \theta \ge 2 \end{cases}$ When using an initial condition in the interval $$3.07 \le \theta_0 \le 3.46$$, the heavy-ball method produces a limit cycle with oscillations that never damp out. The first 50 iterates for $$\theta_0= 3.3$$ are shown in Figure 6.7 (left), and a plot of $$l(\theta)$$ with the limit cycle overlaid is shown in Figure 6.7 (right).
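The limit cycle can be reproduced numerically. A sketch of the heavy-ball iteration on this gradient, assuming Polyak's standard tuning for $$m=1$$, $$L=25$$, i.e. $$t = 4/(\sqrt{L}+\sqrt{m})^2 = 1/9$$ and $$\mu = \left((\sqrt{L}-\sqrt{m})/(\sqrt{L}+\sqrt{m})\right)^2 = 4/9$$ (these parameter values are our assumption about the original example):

```python
import numpy as np

def grad(theta):
    # the piecewise-linear gradient from the text (m = 1, L = 25)
    if theta < 1:
        return 25.0 * theta
    elif theta < 2:
        return theta + 24.0
    return 25.0 * theta - 24.0

def heavy_ball(theta0, t=1.0 / 9, mu=4.0 / 9, iters=500):
    prev = cur = theta0
    traj = []
    for _ in range(iters):
        prev, cur = cur, cur - t * grad(cur) + mu * (cur - prev)
        traj.append(cur)
    return np.array(traj)

traj = heavy_ball(3.3)
# the iterates keep cycling instead of damping toward the minimizer at 0
tail = traj[-50:]
```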

Nesterov showed that this problem can be alleviated and introduced the look-ahead strategy \begin{aligned}y = & \theta^{(k)} + \mu\left(\theta^{(k)} - \theta^{(k-1)}\right)\\\theta^{(k+1)} = &y- t\nabla l\left(y\right)\end{aligned}

Polyak’s method evaluates the gradient before adding momentum, while Nesterov’s algorithm evaluates the gradient after applying momentum. For any $$k \leq (n-1)/2$$ and any starting point $$\theta^{(0)}$$, there is a function $$l$$ in the problem class such that any first-order method satisfies $l\left(\theta^{(k)} \right) - \hat l \geq \frac{3L\Vert\theta^{(0)}- \hat \theta \Vert _2^2}{32(k + 1)^2}$

Can we attain the rate $$O(1/k^2 )$$, or $$O(1/ \sqrt{\epsilon})$$? Say you are a retail company and you sell a product, and you have a model $p(y \mid x, \theta), \text{ where } y \in \{\mbox{buy}, \mbox{not buy}\},$ where $$x$$ contains the characteristics of the buyer (age, gender, previous purchasing behavior) and product characteristics, including price, and $$\theta$$ is the model parameters (neural net, logistic regression). Stochastic gradient descent is the mechanism to update $$\theta$$ every time a user visits a product page. You adjust the price according to the user’s characteristics! SGD also can be implemented in parallel, particularly within a MapReduce environment: Gradient descent $\theta^+ = \theta - t \nabla l(\theta)$

MAP

• Machine 1: Use $$1,...,n/4$$ samples to calculate $$\theta^+_1$$
• Machine 2: Use $$n/4+1,...,n/2$$ samples to calculate $$\theta^+_2$$
• Machine 3: Use $$n/2+1,...,3n/4$$ samples to calculate $$\theta^+_3$$
• Machine 4: Use $$3n/4+1,...,n$$ samples to calculate $$\theta^+_4$$

REDUCE - Average the results. For a neural network, the speedup is at least #nodes/$$\log$$(#rows) (http://arxiv.org/abs/1209.4129). This can also be used on multi-core machines.
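The MAP/REDUCE pattern above can be sketched in a few lines: a single averaged gradient step on simulated least-squares data (names and data are ours). With equal-sized chunks, the average of the per-machine steps coincides exactly with the full-data step:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, t = 400, 5, 0.1
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def gd_step(theta, Xc, yc):
    """One least-squares gradient step using only the data chunk (Xc, yc)."""
    return theta - t * Xc.T @ (Xc @ theta - yc) / len(yc)

theta = np.zeros(p)
# MAP: each of four "machines" takes one step on its quarter of the rows
chunks = np.array_split(np.arange(n), 4)
updates = [gd_step(theta, X[c], y[c]) for c in chunks]
# REDUCE: average the results
theta_avg = np.mean(updates, axis=0)
```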

### 6.4.6 SGD vs Newton

At a high-level:

• Memory: each iteration of Newton’s method requires $$O(n^2 )$$ storage ($$n \times n$$ Hessian); each gradient iteration requires $$O(n)$$ storage ($$n$$-dimensional gradient)
• Computation: each Newton iteration requires $$O(n^3 )$$ flops (solving a dense $$n \times n$$ linear system); each gradient iteration requires $$O(n)$$ flops (scaling/adding n-dimensional vectors)
• Backtracking: backtracking line search has roughly the same cost, both use $$O(n)$$ flops per inner backtracking step
• Conditioning: Newton’s method is not affected by a problem’s conditioning, but gradient descent can seriously degrade
• Fragility: Newton’s method may be empirically more sensitive to bugs/numerical errors, gradient descent is more robust

Back to the logistic regression example: now the x-axis is parametrized in terms of time.

Each gradient descent step is $$O(p)$$, but each Newton step is $$O(p^3)$$.
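A sketch contrasting the two updates on a small logistic regression (the data and names are ours): the Newton step solves a $$p \times p$$ linear system, while the gradient step only scales and adds vectors once the gradient is formed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = (rng.random(n) < 0.5).astype(float)

def loss(theta):
    z = X @ theta
    return np.mean(np.log1p(np.exp(z)) - y * z)

def grad_hess(theta):
    pr = 1.0 / (1.0 + np.exp(-X @ theta))
    g = X.T @ (pr - y) / n                     # gradient: O(np) flops
    W = pr * (1.0 - pr)
    H = (X * W[:, None]).T @ X / n             # Hessian: O(np^2) flops
    return g, H

theta = np.zeros(p)
g, H = grad_hess(theta)
theta_gd = theta - 0.1 * g                     # gradient step: cheap
theta_newton = theta - np.linalg.solve(H, g)   # Newton step: O(p^3) solve
```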

### 6.4.7 Proximal Gradient Methods

Let $$\theta$$ denote a $$p$$-dimensional parameter of interest, $$y$$ an$$n$$-vector of outcomes, and $$X$$ a fixed $$n \times p$$ matrix whose rows are the design points (or “features”) $$x_i^T$$. All vectors are column vectors by default. In this section, we develop a set of proximal gradient techniques to solve the optimization problem that arises in MAP inference.

$$$\mathop{\mathrm{minimize}}_{\theta \in \mathbb{R}^p} l(\theta) + \phi(\theta) \tag{6.1}$$$

where $$l(\theta)$$ is a loss function depending implicitly on $$X$$ and $$y$$, and $$\phi(\theta)$$ is a penalty function or regularization term. Importantly, we consider the case where $$\phi(\theta)$$ may have points in its domain where it fails to be differentiable. Thus we focus on methods that are capable of handling non-smooth objectives. From the Bayesian perspective, $$l(\theta)$$ and $$\phi(\theta)$$ correspond to the negative logarithms of the likelihood and prior distribution in the hierarchical model $p(y \mid \theta) \propto \exp\{-l(\theta)\} \; , \quad p(\theta) \propto \exp\{ -\phi(\theta) \} \, .$ Note that $$p(\theta)$$ is not necessarily a proper prior without further constraints on $$\phi(\theta)$$. Even if it isn’t, however, the posterior distribution $$p(\theta \mid y) \propto p(y \mid \theta) p(\theta)$$ may still be proper. A function $$g(\theta)$$ is said to majorize another function $$l(\theta)$$ at $$\theta_0$$ if $$g(\theta_0) = l(\theta_0)$$ and $$g(\theta) \geq l(\theta)$$ for all $$\theta \neq \theta_0$$. If the same relation holds with the inequality sign flipped, $$g(\theta)$$ is instead said to be a minorizing function for $$l(\theta)$$. The convex conjugate of $$\phi(\theta)$$ is the function $$\psi(z) = \sup_{\theta} \{ z^T \theta - \phi(\theta) \}$$. As $$\psi$$ is the pointwise supremum of a family of affine (and therefore convex) functions, it is convex even when $$\phi(\theta)$$ is not. But if $$\phi(\theta)$$ is convex (and closed and proper), then the following dual relationship holds between $$\phi$$ and its conjugate: \begin{aligned} \phi(\theta) &=& \sup_{\lambda} \{ \lambda^T \theta - \psi(\lambda) \} \\ \psi(\lambda) &=& \sup_{\theta} \{ \lambda^T \theta - \phi(\theta) \} \, .\end{aligned} If $$\phi(\theta)$$ is differentiable, the maximizing value of $$\lambda$$ in the first equation is $$\hat{\lambda}(\theta) = \nabla \phi(\theta)$$.

### 6.4.8 Proximal operators and Moreau envelopes

Let $$l(\theta)$$ be a proper convex function. The Moreau envelope $$E_{\gamma l}(\theta)$$ and proximal mapping $$\mathop{\mathrm{prox}} _{\gamma l}(\theta)$$ for parameter $$\gamma > 0$$ are defined as \begin{aligned}E_{\gamma l} (\theta) &=& \inf_{z } \left\{l(z) + \frac{1}{2\gamma} \left\Vert z - \theta \right\Vert _2^2 \right\} \leq l(\theta) \\ \mathop{\mathrm{prox}} _{\gamma l} (\theta) &=& \arg \min_{z } \left\{ l(z)+ \frac{1}{2\gamma} \left\Vert z - \theta \right\Vert _2^2 \right\} \, .\end{aligned} Intuitively, the Moreau envelope is a regularized version of $$l$$. It approximates $$l$$ from below and has the same set of minimizing values as $$l$$. The proximal mapping returns the value that solves the minimization problem defined by the Moreau envelope. It balances two goals: minimizing $$l$$, and staying near $$\theta$$. The proximal operator generalizes the notion of Euclidean projection onto a convex set: if $$l(\theta) = \iota_C(\theta)$$, the indicator function of a convex set $$C$$, then $$\mathop{\mathrm{prox}} _l(\theta) = \arg \min_{z \in C} \Vert \theta-z \Vert _2^2$$. Let $$l(\theta) = |\theta|$$. Then its Moreau envelope is just the Huber function that is used in robust statistics $E_{\gamma l} (\theta) = \inf_{z } \left\{|z|+ \frac{1}{2\gamma} \left\Vert z - \theta \right\Vert _2^2 \right\} = \begin{cases}\dfrac{1}{2\gamma}\theta^2, & |\theta|\le \gamma\\|\theta| - \dfrac{\gamma}{2}, & |\theta| > \gamma\end{cases}$
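The identity between the Moreau envelope of the absolute value and the Huber function can be checked numerically. A brute-force sketch (the infimum is approximated by a grid search; all names are ours):

```python
import numpy as np

def moreau_abs(theta, gamma):
    """Grid approximation of E_{gamma l}(theta) for l(z) = |z|."""
    z = np.linspace(-10.0, 10.0, 200001)
    return np.min(np.abs(z) + (z - theta) ** 2 / (2.0 * gamma))

def huber(theta, gamma):
    """The closed-form Huber function from the text."""
    if abs(theta) <= gamma:
        return theta ** 2 / (2.0 * gamma)
    return abs(theta) - gamma / 2.0

gamma = 0.5
points = [-2.0, -0.3, 0.0, 0.25, 1.7]
pairs = [(moreau_abs(th, gamma), huber(th, gamma)) for th in points]
```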

##### 6.4.8.0.1 Connection with gradient descent.

One of the simplest proximal algorithms is the proximal-gradient method, which provides an important starting point for the more advanced techniques we describe in subsequent sections. Suppose, as in Equation (6.1), that the objective function is $$F(\theta) = l(\theta) + \phi(\theta)$$, where $$l(\theta)$$ is differentiable but $$\phi(\theta)$$ is not. An archetypal case is that of a generalized linear model with a non-differentiable penalty designed to encourage sparsity. The proximal gradient method is well suited for such problems. It has only two basic steps, which are iterated until convergence.

1. Gradient step: Define an intermediate point $$v^t$$ by taking a gradient step with respect to the differentiable term $$l(\theta)$$:
$v^t = \theta^t - \gamma \nabla l(\theta^{t}) \, .$

2. Proximal operator step: Evaluate the proximal operator of the non-differentiable term $$\phi(\theta)$$ at the intermediate point $$v^t$$:
$$$\theta^{t+1} = \mathop{\mathrm{prox}} _{\gamma \phi} (v^t) = \mathop{\mathrm{prox}} _{\gamma \phi}\{ \theta^t - \gamma \nabla l(\theta^{t}) \} \, .$$$

This can be motivated in at least two ways.
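A minimal sketch of the two steps for an $$\ell^1$$-penalized least-squares problem, using the soft-thresholding operator as the proximal map of the $$\ell^1$$ penalty (the data and names are ours):

```python
import numpy as np

def soft_threshold(v, s):
    """Componentwise prox of s*||.||_1: sgn(v)(|v| - s)_+."""
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def prox_gradient_lasso(X, y, tau, gamma, iters=500):
    """Proximal gradient for 0.5||y - X theta||^2 + tau ||theta||_1."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        v = theta - gamma * X.T @ (X @ theta - y)    # 1. gradient step on l
        theta = soft_threshold(v, gamma * tau)       # 2. prox step on phi
    return theta

rng = np.random.default_rng(4)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0], beta[1] = 3.0, -2.0                         # sparse ground truth
y = X @ beta + 0.1 * rng.standard_normal(n)

gamma = 1.0 / np.linalg.norm(X, 2) ** 2              # step 1/L, L = sigma_max^2(X)
theta_hat = prox_gradient_lasso(X, y, tau=5.0, gamma=gamma)
```

The recovered coefficients are close to the two nonzero entries of the ground truth, with the remaining components shrunk to (near) zero.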

##### 6.4.8.0.2 As an MM algorithm.

Suppose that $$l(\theta)$$ has a Lipschitz-continuous gradient with modulus $$L$$. This allows us to construct a majorizing function: whenever $$\gamma \in (0, 1/L]$$, we have the majorization $l(\theta) + \phi(\theta) \leq l(\theta_0) + (\theta - \theta_0)^T \nabla l(\theta_0) + \frac{1}{2\gamma} \left\Vert \theta - \theta_0 \right\Vert _2^2 + \phi(\theta) \, ,$ with equality at $$\theta=\theta_0$$. Simple algebra shows that the optimum value of the right-hand side is $\hat{\theta} = \arg\min_{\theta} \left\{ \phi(\theta) + \frac{1}{2\gamma} \left\Vert \theta - u \right\Vert _2^2 \right\} \, , \quad \text{ where } \quad u = \theta_0 - \gamma \nabla l(\theta_0) \, .$ This is nothing but the proximal operator of $$\phi$$, evaluated at an intermediate gradient-descent step for $$l(\theta)$$. The fact that we may write this method as an MM algorithm leads to the following basic convergence result. Suppose that

1. $$l(\theta)$$ is convex with domain $$\mathbb{R}^p$$.
2. $$\nabla l(\theta)$$ is Lipschitz continuous with modulus $$L$$, i.e. $\left\Vert \nabla l(\theta) - \nabla l(z) \right\Vert _2 \leq L \left\Vert \theta-z \right\Vert _2 \quad \forall \theta,z \, .$
3. $$\phi$$ is closed and convex, ensuring that $$\mathop{\mathrm{prox}} _{\gamma \phi}$$ makes sense.
4. the optimal value is finite and obtained at $$\hat \theta$$.

If these conditions are met, then the proximal gradient method converges at rate $$O(1/t)$$ with fixed step size $$\gamma = 1/L$$.

##### 6.4.8.0.3 As the fixed point of a “forward-backward” operator.

The proximal gradient method can also be interpreted as a means for finding the fixed point of a “forward-backward” operator derived from the standard optimality conditions of subdifferential calculus. This has connections (not pursued here) with the forward-backward method for solving partial differential equations. A necessary and sufficient condition that $$\hat \theta$$ minimizes $$l(\theta) + \phi(\theta)$$ is that $$$0 \in \partial \left\{ l(\hat \theta) + \phi(\hat \theta)\right\} = \nabla l(\hat \theta) + \partial \phi(\hat \theta) \, , \tag{6.2}$$$ the sum of a point and a set. We will use this fact to characterize $$\hat \theta$$ as the fixed point of the following operator:

$$$\hat \theta = \mathop{\mathrm{prox}} _{\gamma \phi}\{ \hat \theta - \gamma \nabla l(\hat \theta) \} \, . \tag{6.3}$$$

To see this, let $$I$$ be the identity operator. Observe that the optimality condition (6.2) implies \begin{aligned} 0 &\in \gamma \nabla l(\hat \theta) - \hat \theta + \hat \theta + \gamma \partial \phi(\hat \theta) \\ \hat \theta - \gamma \nabla l(\hat \theta) &\in \hat \theta + \gamma \partial \phi(\hat \theta) \\ (I-\gamma \nabla l)\hat \theta &\in (I + \gamma \partial \phi) \hat \theta \\ \hat \theta &= (I + \gamma \partial \phi)^{-1} (I-\gamma \nabla l) \hat \theta \\ &= \mathop{\mathrm{prox}} _{\gamma \phi} (\hat \theta-\gamma \nabla l(\hat \theta)) \, , \end{aligned} the composition of two operators. The final line appeals to the fact (see below) that the proximal operator is the resolvent of the subdifferential operator: $$\mathop{\mathrm{prox}} _{\gamma \phi}(\theta) = (I + \gamma \partial \phi)^{-1} (\theta)$$. Thus to find the solution, we repeatedly apply the operator having $$\hat \theta$$ as a fixed point:

$\theta^{t+1} = \mathop{\mathrm{prox}} _{\gamma^{t} \phi}\{ \theta^{t} - \gamma^{t} \nabla l(\theta^{t}) \} \, .$ This is precisely the proximal gradient method. We now show that the proximal operator is the resolvent of the subdifferential operator. By definition, if $$z \in (I + \gamma \partial l)^{-1} \theta$$, then \begin{aligned} \theta &\in (I + \gamma \partial l)z \\ \theta &\in z + \gamma \partial l(z) \\ 0 &\in \frac{1}{\gamma}(z-\theta) + \partial l(z) \\ 0 &\in \partial_z \left\{ \frac{1}{2\gamma}\left\Vert z-\theta \right\Vert _2^2 + l(z) \right\} \, . \end{aligned} But for $$0$$ to be in the subdifferential (with respect to $$z$$) of the function on the right-hand side it is necessary and sufficient for $$z$$ to satisfy $z = \arg\min_u \left\{ \frac{1}{2\gamma}\left\Vert u-\theta \right\Vert _2^2 + l(u) \right\} = \mathop{\mathrm{prox}} _{\gamma l}(\theta) \, .$ Therefore $$z = \mathop{\mathrm{prox}} _{\gamma l}(\theta)$$ if and only if $$z \in (I + \gamma \partial l)^{-1} \theta$$. It is interesting that $$(I + \gamma \partial l)^{-1}$$ is single-valued and therefore a function, even though $$\partial l$$ is set-valued. The proximal framework also applies to some non-convex regularisation penalties, e.g. the $$L^q$$-norm for $$0 \leq q < 1$$.
##### 6.4.8.0.4 Proof of Nesterov’s GD Convergence Rates

We need to solve $\mathop{\mathrm{minimize}}_{\theta} f(\theta)$ where $f(\theta) = l(\theta) + \phi(\theta).$ Proximal gradient descent iterations are $\theta^+ = \mathop{\mathrm{prox}} _{t\phi}\left(\theta - t \nabla l (\theta)\right)$ with $\mathop{\mathrm{prox}} _{t\phi}(x) = \arg\min_{z} \dfrac{1}{2t}\left\Vert x-z \right\Vert _2^2 + \phi(z).$

Accelerated GD method is \begin{aligned}y = & \theta^{(k)} + \dfrac{k-2}{k+1}\left(\theta^{(k)} - \theta^{(k-1)}\right)\\\theta^{(k+1)} = & \mathop{\mathrm{prox}} _{t\phi}\left(y- t\nabla l\left(y\right)\right)\end{aligned}

We can rewrite it as \begin{aligned}y = & (1-a_k)\theta^{(k)} + a_k u^{(k)}\\\theta^{(k+1)} = & \mathop{\mathrm{prox}} _{t\phi}\left(y- t\nabla l\left(y\right)\right)\\u^{(k)} = & \theta^{(k-1)} + \dfrac{1}{a_k}\left(\theta^{(k)} - \theta^{(k-1)}\right)\end{aligned} with $$a_k = 2/(k+1)$$. This is equivalent to the formulation of the accelerated generalized gradient method. The accelerated generalized gradient method with fixed step size $$t \le 1/L$$ satisfies $l(\theta^{(k)}) - l(\hat \theta) \le \dfrac{2\left\Vert \theta^{(0)} - \hat \theta \right\Vert _2^2}{t(k+1)^2}$
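A sketch of the accelerated iteration next to the plain proximal gradient method, on an $$\ell^1$$-penalized least-squares problem with fixed step $$t = 1/L$$ (the data and names are ours; both variants reach the same minimizer):

```python
import numpy as np

def soft_threshold(v, s):
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def objective(X, y, tau, theta):
    return 0.5 * np.sum((y - X @ theta) ** 2) + tau * np.sum(np.abs(theta))

def prox_grad(X, y, tau, t, iters, accelerate):
    """Proximal gradient, optionally with the (k-2)/(k+1) momentum weights."""
    p = X.shape[1]
    theta_prev = np.zeros(p)
    theta = np.zeros(p)
    for k in range(1, iters + 1):
        mom = (k - 2.0) / (k + 1.0) if accelerate else 0.0
        yk = theta + mom * (theta - theta_prev)     # look-ahead point
        v = yk - t * X.T @ (X @ yk - y)             # gradient step at yk
        theta_prev, theta = theta, soft_threshold(v, t * tau)
    return theta

rng = np.random.default_rng(5)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0], beta[1] = 3.0, -2.0
y = X @ beta + 0.1 * rng.standard_normal(n)

t = 1.0 / np.linalg.norm(X, 2) ** 2                 # step 1/L
theta_acc = prox_grad(X, y, 5.0, t, 200, accelerate=True)
theta_plain = prox_grad(X, y, 5.0, t, 200, accelerate=False)
```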

$$\nabla l$$ being Lipschitz with constant $$L>0$$ and $$t \le 1/L$$ implies $l(\theta^+) \le l(y) + \nabla l(y)^T (\theta^+ - y) + \dfrac{1}{2t}\left\Vert \theta^+ - y \right\Vert _2^2.$ We use the fact that $\phi(v)\le \phi(z)+\dfrac{1}{t}(v-w)^T(z-v), \mbox{ for all } z,w, \mbox{ and } v= \mathop{\mathrm{prox}} _{t\phi}(w),$ which follows from the definition of the $$\mathop{\mathrm{prox}}$$ operator, which finds $$v$$ that solves $\mathop{\mathrm{minimize}}\dfrac{1}{2t}\left\Vert w-v \right\Vert _2^2 + \phi(v),$ meaning that $$0 = \dfrac{1}{t}(v-w) + \nabla \phi(v)$$, thus $$-\dfrac{1}{t}(v-w) = \nabla \phi(v)$$ (with the gradient replaced by a subgradient when $$\phi$$ is not differentiable). Using this bound, we get $\phi(\theta^+) \le \phi(z) + \dfrac{1}{t}(\theta^+ - y)^T(z-\theta^+) + \nabla l(y)^T(z-\theta^+).$ Adding these together and using convexity of $$l$$, $f(\theta^+) \le f(z) + \dfrac{1}{t}(\theta^+ - y)^T(z-\theta^+) + \dfrac{1}{2t}\left\Vert \theta^+ - y \right\Vert _2^2.$ Using this bound at $$z = \theta$$ and $$z = \hat \theta$$, we get \begin{aligned}f(\theta^+) - &f(\hat \theta) - (1-a)\left(f(\theta) - f(\hat \theta)\right) \le\\& \dfrac{1}{t}\left(\theta^+-y\right)^T\left(a\hat \theta + (1-a)\theta - \theta^+\right) + \dfrac{1}{2t}\left\Vert \theta^+ - y \right\Vert _2^2=\\& \dfrac{a^2}{2t}\left(\left\Vert u - \hat \theta \right\Vert _2^2 - \left\Vert u^+ - \hat \theta \right\Vert _2^2\right),\end{aligned} where we replaced $$y$$ in terms of the intermediate steps \begin{aligned}au & = y - (1-a)\theta\\au^+ & = \theta^+ - (1-a)\theta.\end{aligned} Thus, at iteration $$k$$, we get \begin{aligned}\dfrac{t}{a_k^2}&\left(f\left(\theta^{(k)}\right) - f\left(\hat \theta\right)\right) + \dfrac{1}{2}\left\Vert u^{(k)} - \hat \theta \right\Vert _2^2\leq\\& \dfrac{(1-a_k)t}{a_k^2}\left(f\left(\theta^{(k-1)}\right) - f\left(\hat \theta\right)\right)+ \dfrac{1}{2}\left\Vert u^{(k-1)} - \hat \theta \right\Vert _2^2\end{aligned} Using the fact that $$(1-a_i)/a_i^2 \leq 1/a_{i-1}^2$$ and iterating this inequality, we get
\begin{aligned}\dfrac{t}{a_k^2}&\left(f\left(\theta^{(k)}\right) - f\left(\hat \theta\right)\right) + \dfrac{1}{2}\left\Vert u^{(k)} - \hat \theta \right\Vert _2^2\leq\\& \dfrac{(1-a_1)t}{a_1^2}\left(f\left(\theta^{(0)}\right) - f\left(\hat \theta\right)\right)+ \dfrac{1}{2}\left\Vert u^{(0)} - \hat \theta \right\Vert _2^2 = \\& \dfrac{1}{2}\left\Vert \theta^{(0)} - \hat \theta \right\Vert _2^2\end{aligned} Thus, $\left(f\left(\theta^{(k)}\right) - f\left(\hat \theta\right)\right) \le \dfrac{a_k^2}{2t}\left\Vert \theta^{(0)} - \hat \theta \right\Vert _2^2 = \dfrac{2}{t(k+1)^2}\left\Vert \theta^{(0)} - \hat \theta \right\Vert _2^2.$

### 6.4.9 Examples: Proximal Operators

Statistical optimization algorithms (e.g. MLE, EM, ECME, etc.) can be written very compactly in terms of proximal operators of log-likelihoods and penalty functions. We now compile a list of useful examples for later use.

##### 6.4.9.0.1 Quadratic functions.

Quadratic log-likelihoods are very common in statistics: $l(\theta) = \frac{1}{2} \theta^T P \theta + q^T \theta + r.$ They correspond to conditionally Gaussian sampling models and arise in least-squares and weighted-least-squares problems, along with ridge regression and in EM algorithms based on mixtures of normals, to name a few models. For example, if $$(y \mid \theta) \sim N(A\theta, \Omega^{-1})$$, then $$l(\theta) = (y-A\theta)^T \Omega (y-A\theta) /2$$, or $P=A^T \Omega A \;, \quad q= - A^T \Omega y \;, \quad r=y^T \Omega y/2$ in the general form given above. If $$l(\theta)$$ takes this form, its proximal operator (with parameter $$1/\gamma$$) is easily computed, assuming the relevant inverse exists, as $\mathop{\mathrm{prox}} _{l/\gamma}(\theta) = (P+\gamma I)^{-1} (\gamma \theta - q).$

##### 6.4.9.0.2 $$\ell^1$$ norm.

Let $$\phi(\theta) = \tau \Vert \theta \Vert _1$$.
In this case the proximal operator is clearly separable in the components of $$\theta$$, and the problem that must be solved for each component is $\underset{z \in R}{\text{minimize}} \; \; \left\{ \tau |z| + \frac{\gamma}{2} (z-\theta)^2 \right\}.$ We can solve this without appealing to subdifferential calculus. First consider the case $$\theta \geq 0$$. In this case, the best choice of $$z$$ is clearly nonnegative, and we may optimize over $$z \geq 0$$, where $$|z| = z$$ is differentiable. Likewise, if $$\theta < 0$$, the best choice of $$z$$ is nonpositive, and we may optimize over $$z \leq 0$$, where $$|z| = -z$$. In either case, a little bit of algebra yields the solution $$$\hat{z} = \mathop{\mathrm{prox}} _{\tau|\cdot|/\gamma} (\theta) = \mathop{\mathrm{sgn}}(\theta)(|\theta| - \tau/\gamma)_+ = S_{\tau/\gamma}(\theta) \, , \tag{6.4}$$$ the soft-thresholding operator with parameter $$\tau/\gamma$$.

##### 6.4.9.0.3 Other penalties.

Many nondifferentiable penalty functions have proximal operators that are easily computed using the same trick of splitting into cases by sign. For example, let $$\phi_a(\theta) = \tau \log(1 + |\theta|/a)$$, the component-wise penalty function corresponding to the double-Pareto prior. By an argument similar to the above, we reach the conclusion that $\mathop{\mathrm{prox}} _{\phi/\gamma}(\theta) = \frac{\mathop{\mathrm{sgn}}(\theta)}{2} \left\{ |\theta| - a + \sqrt{ (a- |\theta|)^2 + 4 r(\theta) } \right\} \; , \quad r(\theta) = (a|\theta| - \tau/\gamma)_+ \, .$
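The closed form (6.4) can be verified against a brute-force minimization of the one-dimensional prox objective (a sketch; the arg min is approximated by a grid search, and all names are ours):

```python
import numpy as np

def soft_threshold(theta, s):
    """sgn(theta) * (|theta| - s)_+, the claimed closed form."""
    return np.sign(theta) * max(abs(theta) - s, 0.0)

def prox_numeric(theta, tau, gamma):
    """Brute-force argmin_z { tau|z| + (gamma/2)(z - theta)^2 } on a fine grid."""
    z = np.linspace(theta - 5.0, theta + 5.0, 400001)
    obj = tau * np.abs(z) + 0.5 * gamma * (z - theta) ** 2
    return z[np.argmin(obj)]

tau, gamma = 1.0, 2.0   # threshold tau/gamma = 0.5
checks = [(soft_threshold(th, tau / gamma), prox_numeric(th, tau, gamma))
          for th in (-3.0, -0.2, 0.0, 0.4, 2.5)]
```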
### 6.4.10 Iterative shrinkage thresholding

As an example, consider the proximal gradient method applied to our general problem given by Equation (6.1) with a quadratic log-likelihood, as in a weighted least-squares problem $l(\theta) = \frac{1}{2} (y - A\theta)^T \Omega (y - A\theta) \, .$ This gives $$\nabla l(\theta) = A^T \Omega A \theta - A^T \Omega y$$, and the proximal gradient method becomes $\theta^{k+1} = \mathop{\mathrm{prox}} _{\gamma^k \phi}\{ \theta^k - \gamma^{k} A^T \Omega ( A \theta^k - y ) \} \, .$ This algorithm has been widely studied under the name of iterative shrinkage thresholding (IST). Its primary computational costs at each iteration are: (1) multiplying the current iterate $$\theta^k$$ by $$A$$, and (2) multiplying the residual $$A \theta^k - y$$ by $$A^T \Omega$$. Typically the proximal operator for $$\phi$$ will be simple to compute, as in the case of a quadratic or $$\ell^1$$ penalty, and will contribute a negligible amount to the overall complexity of the algorithm. A useful approach is redundant parameterizations induced by variable splitting.
To see how this works, we re-write our original problem, given by \begin{aligned}& \underset{\theta}{\text{minimize}}& & l(\theta) + \phi(\theta),\end{aligned} as an equivalent constrained optimization problem of the form \begin{align} & \underset{\theta, z}{\text{minimize}} & & l(z) + \phi(\theta) \\ & \text{subject to} & & \theta-z=0 \, , \tag{6.7} \end{align} which we refer to as the “primal” problem. We have introduced $$z$$ as a redundant parameter (or “slack variable”), and encoded a consensus requirement in the form of the affine constraint $$\theta-z=0$$. Other redundant parameterizations are certainly possible. For example, consider the case of an exponential-family model for outcome $$y$$ with cumulant-generating function $$\psi(z)$$ and with natural parameter $$z$$: $p(y) = p_0(y) \exp \{ yz - \psi(z)\} \, .$ In a generalized linear model, the natural parameter for outcome $$y_i$$ is a linear regression on covariates, $$z_i = a_i^T \theta$$. In this case $$l(\theta)$$ may be written as $l(\theta) = \sum_{i=1}^N l_i(\theta) \; , \quad \mbox{where} \quad l_i(\theta) = \psi(a_i^T \theta) - y_i (a_i^T \theta) \, ,$ up to an additive constant not depending on $$\theta$$. Now introduce slack variables $$z_i = a_i^T \theta$$. This leads to the equivalent primal problem \begin{aligned}& \underset{\theta, z}{\text{minimize}}& & \sum_{i=1}^N \{\psi(z_i) - y_i z_i \} + \phi(\theta) \\& \text{subject to}& & A\theta - z = 0 \, .\end{aligned} The advantage of such a variable-splitting approach is that now the fit and penalty terms are de-coupled in the objective function of the primal problem. A standard tactic for exploiting this fact is to write down and solve the dual problem corresponding to the original (primal) constrained problem. This is sometimes referred to as dualization. Many well-known references exist on this topic. For this reason we focus on problem formulation and algorithms for solving (6.7), avoiding standard material on duality or optimality conditions.
### 6.4.11 Dual ascent

We first start with the simple problem \begin{aligned}& \underset{\theta}{\text{minimize}}& & l(\theta) \\& \text{subject to}& & A\theta = z \, .\end{aligned} The Lagrangian for this problem is $L(\theta, \lambda) = l(\theta) + \lambda^T(A\theta - z) = l(\theta) + (A^T \lambda)^T \theta - \lambda^T z \, .$ Let $$l^{\star}$$ be the convex conjugate of $$l$$. The dual function is $g(\lambda) = \inf_\theta L(\theta,\lambda) = - l^{\star} (-A^T \lambda) - z^T \lambda \, ,$ and the dual problem is \begin{aligned}& \underset{\lambda}{\text{maximize}}& & g(\lambda) \, .\end{aligned} Let $$p^\star$$ and $$d^\star$$ be the optimal values of the primal and dual problems, respectively. Assuming that strong duality holds,[^1] the optimal values of the primal and dual problems are the same. Moreover, we may recover a primal-optimal point $$\theta^\star$$ from a dual-optimal point $$\lambda^\star$$ via $\theta^\star = \arg \min_\theta L(\theta, \lambda^\star) \quad \iff \quad 0 \in \partial_\theta L(\theta^\star, \lambda^\star) \, .$ The idea of dual ascent is to solve the dual problem using gradient ascent. Note that $\nabla g(\lambda) = \nabla_\lambda L(\hat{\theta}_\lambda, \lambda) \; , \quad \mbox{where} \quad \hat{\theta}_\lambda = \arg \min_\theta L(\theta, \lambda) \, .$ This gradient is simply the residual for the constraint: $$\nabla_\lambda L(\theta, \lambda) = A\theta - z$$.
Therefore dual ascent involves iterating two steps with an appropriate step size $$\alpha_k$$: \begin{aligned}\theta^{k+1} &= \arg \min_\theta L(\theta, \lambda^k) \\\lambda^{k+1} &= \lambda^k + \alpha_k (A \theta^{k+1} - z) \, .\end{aligned}

### 6.4.12 Augmented Lagrangian

Take the same problem as before, \begin{aligned}& \underset{\theta}{\text{minimize}}& & l(\theta) \\& \text{subject to}& & A\theta = z \, ,\end{aligned} with Lagrangian $L(\theta, \lambda) = l(\theta) + \lambda^T(A\theta - z) \, .$ The augmented-Lagrangian approach (or method of multipliers) seeks to stabilize the intermediate steps by adding a ridge-like term to the Lagrangian: $L_{c}(\theta, \lambda) = l(\theta) + \lambda^T(A\theta - z) + \frac{c}{2} \Vert A\theta - z \Vert _2^2 \, .$ One way of viewing this is as the standard Lagrangian for the equivalent problem \begin{aligned}& \underset{\theta}{\text{minimize}}& & l(\theta) + \frac{c}{2} \left\Vert A\theta-z \right\Vert _2^2 \\& \text{subject to}& & A\theta = z \, .\end{aligned} For any primal-feasible $$\theta$$, the new objective remains unchanged, and thus has the same minimum as the original problem. The dual function $g_{c}(\lambda) = \inf_\theta L_c(\theta, \lambda)$ is differentiable and strongly convex under mild conditions. We can now use dual ascent for the modified problem, iterating \begin{aligned}\theta^{k+1} &= \arg \min_\theta \left\{ l(\theta) + (\lambda^k)^T(A\theta - z) + \frac{c}{2} \Vert A\theta - z \Vert _2^2 \right\} \\\lambda^{k+1} &= \lambda^k + c (A \theta^{k+1} - z) \, .\end{aligned} Therefore, the dual-variable update doesn’t change compared to standard dual ascent. But the $$\theta$$ update has a regularization term added to it, whose magnitude depends upon the tuning parameter $$c$$. Notice that the penalty parameter $$c$$ also serves as the step size in the dual-update step.

##### 6.4.12.0.1 Scaled form.
Now re-scale the dual variable as $$u = c^{-1} \lambda$$, and rewrite the augmented Lagrangian as \begin{aligned}L_{c}(\theta, u) &= l(\theta) + c u^T (A\theta - z) + \frac{c}{2} \Vert A\theta - z \Vert _2^2 \\&= l(\theta) + \frac{c}{2} \Vert r + u \Vert _2^2 - \frac{c}{2} \Vert u \Vert _2^2 \, ,\end{aligned} where $$r = A\theta - z$$. This leads to the following update formulas: \begin{aligned}\theta^{k+1} &= \arg \min_\theta \left\{ l(\theta) + \frac{c}{2} \Vert A\theta - z + u^k \Vert _2^2 \right\} \\u^{k+1} &= u^k + (A \theta^{k+1} - z) \, .\end{aligned} Notice that the re-scaled dual variable is the running sum of the residuals $$r^k = A\theta^k - z$$ from the primal constraint. This is handy because the formulas are often shorter when working with re-scaled dual variables.

### 6.4.13 Alternating Direction Method of Multipliers (ADMM)

One of the most powerful ideas in optimization is that of alternating directions. We combine the idea of variable splitting with the augmented Lagrangian in order to arrive at a method called ADMM (alternating-direction method of multipliers) for solving problem (6.7). The scaled-form augmented Lagrangian for this problem is $L_c(\theta, z, u) = l(z) + \phi(\theta) + \frac{c}{2} \left\Vert \theta - z + u \right\Vert _2^2 - \frac{c}{2} \left\Vert u \right\Vert _2^2 \, .$ ADMM is just like dual ascent for this problem, except that we optimize the Lagrangian in $$\theta$$ and $$z$$ individually, rather than jointly, in each pass (hence “alternating direction”): \begin{aligned}z^{k+1} &= \arg \min_z \left\{ l(z) + \frac{c}{2} \Vert \theta^k - z + u^k \Vert _2^2 \right\} \\\theta^{k+1} &= \arg \min_\theta \left\{ \phi(\theta) + \frac{c}{2} \Vert \theta - z^{k+1} + u^k \Vert _2^2 \right\}\\u^{k+1} &= u^k + \theta^{k+1} - z^{k+1} \, .\end{aligned} Notice that the first two steps are the proximal operators of $$l$$ and $$\phi$$, respectively.
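For a quadratic fit term $$l(z) = \frac{1}{2}\Vert y - Az \Vert_2^2$$ and an $$\ell^1$$ penalty $$\phi(\theta) = \lambda \Vert\theta\Vert_1$$, both proximal steps are closed-form: the $$z$$-update is a ridge-like linear solve, and the $$\theta$$-update is soft thresholding. A minimal numpy sketch (sizes, $$\lambda$$, $$c$$, and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, y, lam, c=1.0, iters=200):
    # minimize 0.5*||y - A z||^2 + lam*||theta||_1  subject to  theta - z = 0
    p = A.shape[1]
    theta = np.zeros(p)
    u = np.zeros(p)
    M = np.linalg.inv(A.T @ A + c * np.eye(p))  # z-update system, factored once
    Aty = A.T @ y
    for _ in range(iters):
        z = M @ (Aty + c * (theta + u))          # prox of l (quadratic term)
        theta = soft_threshold(z - u, lam / c)   # prox of phi (ell-1 penalty)
        u = u + theta - z                        # scaled dual update
    return theta

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
theta_true = np.zeros(10)
theta_true[[2, 7]] = 2.0, -1.5
y = A @ theta_true
theta_hat = admm_lasso(A, y, lam=0.1)
```

The matrix inverse in the $$z$$-update depends only on $$A$$ and $$c$$, so it can be computed once and reused across iterations.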
One way of interpreting ADMM is as a variant on the proximal gradient method for the dual problem corresponding to (6.7). As a result, Nesterov-type acceleration methods may also be applied, with extra care about regularity conditions (in particular, strong convexity of $$l(\theta)$$).

### 6.4.14 Bregman divergence and exponential families

The augmented Lagrangian method for solving $$\ell^1$$ problems is called “Bregman iteration” in the compressed-sensing literature. Here the goal is to solve the “exact recovery” problem via basis pursuit: \begin{aligned} & \underset{\theta}{\text{minimize}} & & \Vert \theta \Vert _1 \\ & \text{subject to} & & A\theta = z \, , \end{aligned} where $$z$$ is measured, $$\theta$$ is the unknown signal, and $$A$$ is a known “short and fat” matrix (meaning more coordinates of $$\theta$$ than there are observations). The scaled-form augmented Lagrangian corresponding to this problem is \begin{aligned} L_{c}(\theta, u) &= \Vert \theta \Vert _1 + \frac{c}{2} \Vert A\theta - z + u \Vert _2^2 - \frac{c}{2} \Vert u \Vert _2^2 \, , \end{aligned} with steps \begin{aligned} \theta^{k+1} &= \arg \min_\theta \left\{ \Vert \theta \Vert _1 + \frac{c}{2} \Vert A\theta - z^k \Vert _2^2 \right\} \\ z^{k+1} &= z + z^k - A \theta^{k+1} \, , \end{aligned} where we have redefined $$z^{k} = z - u^k$$ compared to the usual form of the dual update. Thus each intermediate step of Bregman iteration is like a lasso regression problem. (In the compressed-sensing literature, this algorithm is motivated in a different way, by appealing to Bregman divergences. But it’s the same algorithm.) Let $$d(\theta)$$ be a strictly convex differentiable function with Legendre dual $$b$$. The Bregman divergence from $$\theta$$ to $$y$$ induced by $$d$$ is $D_{d}(\theta, y) = d(\theta) - d(y) - d'(y) (\theta-y) \geq 0 \, .$ This is the vertical distance at $$\theta$$ between $$d(\theta)$$ and the extrapolated “guess” for $$d(\theta)$$ based on the tangent line at $$y$$.
In the multivariate case everything carries through with gradients/planes replacing derivatives/lines. There is a unique Bregman divergence associated with every exponential family. It corresponds precisely to the relationship between the natural parameterization and the mean-value parameterization. Suppose that $p(y; \theta) = p_0(y) \exp \{ y \theta - b(\theta) \} \, .$ The expected value of $$y$$, as a function of $$\theta$$, is given in terms of the cumulant-generating function as $$\mu(\theta) = b'(\theta)$$. This is sometimes referred to as Tweedie’s formula, and has come up repeatedly in a variety of different contexts.[^2] A simple proof follows from differentiating the identity $$\int_{\mathbb{R}} p(y; \theta) \, dy = 1$$ under the integral. Let $$d(\mu)$$ be the convex conjugate of $$b(\theta)$$. Then \begin{aligned}d(\mu) = \sup_\theta \{\theta \mu - b(\theta) \}, &~~b(\theta) = \sup_\mu \{\theta \mu - d(\mu) \} \, .\end{aligned} By the envelope formula, the maximizing value of $$\mu$$ in the second equation (as a function of $$\theta$$) satisfies $$\mu(\theta) = b'(\theta)$$. This lets us recognize the dual variable $$\mu$$ as the mean-value parameterization; that is, the natural and mean-value parameterizations form a Legendre pair. Re-writing the log likelihood in terms of the mean parameter, and appealing to the mapping $$\theta = d'(\mu)$$ along with its inverse $$\mu = b'(\theta)$$, leads to \begin{aligned}\theta y - b(\theta) &= \theta y - \theta \mu + d(\mu) \\&= d(\mu) + (y-\mu) \theta \\&= d(\mu) + (y-\mu) d'(\mu) - d(y) + d(y) \\&= -D_d(y,\mu) + d(y) \, .\end{aligned} Remember that we are thinking of this as a distribution for $$y$$ given $$\theta$$ (or $$\mu$$), and thus there are no Jacobians to worry about, just a bijective mapping from one parameterization to another.
This gives us two different ways of writing an exponential-family model: (1) in terms of the natural parameter $$\theta$$ and the cumulant-generating function $$b$$, $$p(y; \theta, b)$$; or (2) in terms of the mean-value parameter $$\mu$$ and the Bregman divergence induced by the Legendre dual $$d = b^{\star}$$, $$p(y ; \mu, d)$$.

##### 6.4.14.0.1 Splitting on the mean-value parameter.

Another use of variable splitting is when we write the model in terms of its mean-value parameterization: $p(y_i ; \mu_i(\theta) ) \propto \exp\{-D_{d}[y_i, \mu_i(\theta)]\} \, ,$ where $$d = b^{\star}$$ and $$\mu_i(\theta) = E(y_i ; \theta)$$ is the expected value, given the parameter. Assuming we are still using the canonical link, we may now write the model in terms of a penalized Bregman divergence with split variables: \begin{aligned}& \underset{\theta, z}{\text{minimize}}& & \sum_{i=1}^N D_{d}(y_i, z_i) + \phi(\theta) \\& \text{subject to}& & \mu(a_i^T \theta) - z_i = 0 \, ,\end{aligned} where $$\phi(\theta)$$ is the penalty function. In a Poisson model $$y_i \sim \mbox{Pois}(\mu_i)$$, $$\mu_i = \exp(\theta_i)$$ for natural parameter $$\theta_i = a_i^T \theta$$. The cumulant generating function is $$b(\theta) = \exp(\theta)$$, and thus $$d(\mu) = \mu \log \mu - \mu$$. Thus, up to an additive term that depends only on $$y$$, we have $$D_d(y, \mu) = \mu - y \log \mu$$. The optimization problem can be split as \begin{aligned} & \underset{\theta, z}{\text{minimize}} & & \sum_{i=1}^N (z_i - y_i \log z_i) + \phi(\theta) \\ & \text{subject to} & & a_i^T \theta = \log z_i \, . \end{aligned}

## 6.5 Expectation Maximization (EM)

Typically we train a model given data $$X,Y$$: a discriminative model $$p(y \mid x,\theta)$$ or a generative model $$p(x,y \mid \theta)$$. In the unsupervised case, we want to learn the generative model $$p(x,y \mid \theta)$$ while only observing $$X$$.
We will look for $\theta_{\mathrm{MAP}} = \arg\max_{\theta}p(\theta \mid X).$ We will start with a simpler case, when there is no prior over the parameters $$\theta$$ and we try to find $\theta_{\mathrm{ML}} = \arg\max_{\theta}p(X \mid \theta).$ The EM algorithm should be used when knowing $$Y$$ would make solving our problem easier. We start by writing \begin{aligned}\log p(X \mid \theta) = &\int q(Y) \log p(X \mid \theta)dY = \int q(Y) \log \dfrac{p(X,Y \mid \theta)}{p(Y \mid X,\theta)}dY = \int q(Y) \log \left(\dfrac{p(X,Y \mid \theta)}{q(Y)}\dfrac{q(Y)}{p(Y \mid X,\theta)}\right)dY = \\& \underbrace{\int q(Y) \log \dfrac{p(X,Y \mid \theta)}{q(Y)}dY}_{\text{variational lower bound}} + \int q(Y)\log \dfrac{q(Y)}{p(Y \mid X,\theta)}dY = \\&L(q,\theta) + \mathrm{KL}(q(Y)\Vert p(Y \mid X,\theta)) \, .\end{aligned} The name variational lower bound comes from the fact that the KL-divergence is always non-negative. The KL-divergence is a measure of dissimilarity between two distributions. It has the following definition $\mathrm{KL}(p(x)\Vert q(x)) = \int p(x) \log \dfrac{p(x)}{q(x)}dx$ and has the following properties

1. $$\mathrm{KL}(p(x)\Vert q(x)) \ge 0$$, with equality if and only if $$p(x) = q(x)$$.
2. $$\mathrm{KL}(p(x)\Vert q(x)) \ne \mathrm{KL}(q(x)\Vert p(x))$$ in general.

The first property follows from Jensen’s inequality, which states that $\int q(x)f(x)dx \le f\left(\int q(x)x\,dx\right), ~\text{when}~q(x)\ge 0, ~~\int q(x)dx = 1,~~f \text{ is concave.}$ Sometimes it is written as $E_{x\sim q(x)}f(x) \le f(E_{x\sim q(x)}x) \, .$ \begin{aligned}-\mathrm{KL}(p(x)\Vert q(x)) = -\int p(x) \log \dfrac{p(x)}{q(x)}dx = \int p(x) \log \dfrac{q(x)}{p(x)}dx \le \log \int p(x) \dfrac{q(x)}{p(x)}dx = 0.\end{aligned} Since $$\log$$ is strictly concave, we have equality to $$0$$ only when $$q(x) = p(x)$$.
Using the first property of the KL-divergence, we have $\log p(X \mid \theta) \ge L(q,\theta),~~\text{for any } q(Y) \, .$ The lower bound $$L(q,\theta)$$ depends on the distribution $$q$$, hence the "variational" in the name. Now, the EM algorithm performs coordinate ascent, alternating optimization with respect to $$q(Y)$$ and $$\theta$$. We rely on the following fact: when the objective $$f(\theta)$$ is bounded below by another function, $$f(\theta) \ge h(\theta,\xi)$$, and for any $$\theta$$ there exists $$\xi(\theta)$$ such that $$f(\theta) = h(\theta,\xi(\theta))$$, then we can replace the optimization problem $$f(\theta) \rightarrow \max_\theta$$ with an iterative algorithm that solves two optimization problems at each iteration \begin{aligned}\theta^+ = &\arg\max_\theta h(\theta,\xi)\\\xi^+ = & \arg\max_{\xi} h(\theta^+,\xi) = \xi(\theta^+).\end{aligned} The EM algorithm then has two steps

1. E-step: $$L(q,\theta) \rightarrow \max_{q(Y)}$$
2. M-step: $$L(q,\theta) \rightarrow \max_{\theta}$$

The E-step is a variational optimization problem. However, we can get an exact solution by solving an equivalent problem $\mathrm{KL}(q(Y)\Vert p(Y \mid X,\theta)) \rightarrow \min_{q(Y)} \, .$ We know that the KL-divergence is always non-negative and equals 0 only when the distributions are equal, thus $q^*(Y) = \arg\min_{q(Y)}\mathrm{KL}(q(Y)\Vert p(Y \mid X,\theta)) = p(Y \mid X,\theta).$ We have an analytical solution! Now for the M-step $L(q,\theta) = \int q(Y)\log p(X,Y \mid \theta)dY + c = E_{Y\sim q(Y)}\log p(X,Y \mid \theta)+c \, ,$ where $$c = -\int q(Y)\log q(Y)dY$$ does not depend on $$\theta$$. Now, when $$p(X,Y \mid \theta)$$ is from an exponential family (which is one of the assumptions of the EM algorithm) then $$\log p(X,Y \mid \theta)$$ is concave in $$\theta$$, and so is $$E_{Y\sim q(Y)}\log p(X,Y \mid \theta)$$! See R example. A simplified approach uses the space of delta functions to find $$q$$ on the E-step, i.e.
$$q(Y) \in \Delta = \{q(Y) \mid q(Y) = \delta(Y - Y_0)\}$$. Then $\mathrm{KL}(q(Y)\Vert p(Y \mid X,\theta)) \rightarrow \min_{q(Y) \in \Delta}$ is solved by taking $$Y_0 = Y_{\mathrm{MAP}} = \arg\max_{Y} p(Y \mid X,\theta) = \arg\max_{Y} p(X,Y \mid \theta)$$. In this case the integral in the M-step collapses to the value of the integrand at a single point, and the M-step is to solve $\log p(X,Y_0 \mid \theta) \rightarrow \max_{\theta}.$ The resulting algorithm is called hard EM. Note that when we constrain the class of possible functions $$q$$ to the set $$\Delta$$ of $$\delta$$-functions, the KL-divergence will never be equal to zero at the E-step. Another simplification of the EM algorithm uses $$q(Y) \in F = \{q(Y) \mid q(Y) = \prod_{i=1}^{m}q_i(y_i)\}$$ and is called variational EM. We do not assume any specific functional form for any of the $$q_i$$. Further, $$q_i$$ can be multivariate, and each index $$i$$ corresponds to a subset of the latent variables. The resulting update is $\log q_i(y_i) = E_{q_j,\,j\ne i}\log p(X,Y \mid \theta)+c \, .$ We calculate each factor while taking the expectation with respect to the rest of the factors, and we do it iteratively for each of the factors. Block-conjugacy is when $$p(Y \mid X,\theta)$$, viewed as a function of a single block $$y_i$$ with the other blocks $$y_j,~j\ne i$$ held fixed, is conjugate to the corresponding factor. If $$q$$ is block-conjugate with $$p(Y \mid X,\theta)$$, we can find an analytical solution to $\mathrm{KL}\left(\prod_{i=1}^{m} q_i(y_i) \Vert p(Y \mid X,\theta) \right) \rightarrow \min_{q} \, .$ Another simplification relies on generating samples from $$Y \mid X,\theta$$, which can be done via MCMC. It can be used in the case when there is no block-conjugacy. The last variant of the EM algorithm relies on finding an approximate maximum on the M-step.
We note that $p(X \mid \theta) = \prod_{i=1}^{n}p(x_i \mid \theta) = \prod_{i=1}^{n}\int p(x_i \mid y_i,\theta)p(y_i)dy_i.$ Then we replace the E-step with its stochastic counterpart $q(y_i) = \dfrac{p(x_i,y_i \mid \theta)}{\int p(x_i,y_i \mid \theta) dy_i}.$ The stochastic M-step is then $\theta^+ = \theta + \alpha E_{y_i\sim q(y_i)}\nabla_\theta\log p(x_i,y_i \mid \theta) \, .$ We continue such an EM process for $$i=1,\ldots,n$$. Let’s consider the PCA example. The probabilistic PCA model is $p(X,Y \mid W,\sigma) = \prod_{i=1}^{n}p(x_i \mid y_i,W,\sigma)p(y_i) = \prod_{i=1}^{n}N(x_i \mid Wy_i,\sigma^2I)N(y_i \mid 0,I),~x_i\in \mathbb{R}^D,~y_i\in \mathbb{R}^d,~D\gg d.$ The PCA algorithm searches for $$W$$ and $$\sigma$$, given $$X$$. There is a linear algebra algorithm that solves this problem (SVD) and it has complexity $$O(nD^2)$$. We can also formulate the problem as maximizing the marginal likelihood $\int p(X \mid Y,W,\sigma)p(Y)dY = p(X \mid W,\sigma) \rightarrow \max_{W,\sigma}.$ However, we can formulate the problem of finding $$W$$ as an EM algorithm, with the E-step being $p(Y \mid X,W,\sigma) = \dfrac{\prod_{i} p(x_i \mid y_i,W,\sigma)p(y_i)}{\int \prod_{i} p(x_i \mid y_i,W,\sigma)p(y_i) dY} = \prod_{i=1}^{n}N(y_i \mid \mu_i,\Sigma_i).$ It can be done analytically due to conjugacy! The M-step is to find $(W,\sigma) = \arg\max_{W,\sigma} E_Y \log p(X,Y \mid W,\sigma).$ It can also be done analytically! The complexity of an EM iteration is $$O(nDd)$$. For small $$d$$ we can find the solution much faster than with the SVD algorithm. Another advantage of EM is that it can work with missing data. On the E-step we then need to find the distribution over $$Y$$ and over the missing components of $$X$$.
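A minimal numpy sketch of this EM iteration for probabilistic PCA (dimensions, noise level, and iteration count are illustrative assumptions). The E-step uses the conjugate Gaussian posterior $$N(y_i \mid M^{-1}W^Tx_i,\ \sigma^2 M^{-1})$$ with $$M = W^TW + \sigma^2 I$$, and the M-step maximizes the expected complete-data log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(4)
n, D, d = 500, 10, 2
W_true = rng.normal(size=(D, d))
X = rng.normal(size=(n, d)) @ W_true.T + 0.1 * rng.normal(size=(n, D))

W = rng.normal(size=(D, d))     # random initialization
sigma2 = 1.0
for _ in range(200):
    # E-step: posterior moments of the latent y_i given x_i
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(d))
    Ey = X @ W @ Minv                       # n x d matrix of E[y_i | x_i]
    Eyy = n * sigma2 * Minv + Ey.T @ Ey     # sum_i E[y_i y_i^T]
    # M-step: closed-form updates for W and sigma^2
    W = X.T @ Ey @ np.linalg.inv(Eyy)
    sigma2 = (np.sum(X ** 2) - 2 * np.sum((X @ W) * Ey)
              + np.trace(Eyy @ W.T @ W)) / (n * D)
```

Each iteration costs $$O(nDd)$$ plus small $$d \times d$$ solves, matching the complexity claim above.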
Further, we can derive a non-linear extension by introducing another, discrete, latent variable $$Z$$: $p(X,Y,Z \mid W,\sigma) = \prod_{i=1}^{n}p(x_i \mid y_i,z_i,W,\sigma)p(y_i)p(z_i),~z_i\in \{1,\ldots,K\}.$ Here $$x_i \mid y_i,z_i,W,\sigma \sim N(x_i \mid W_{z_i}y_i,\sigma^2I)$$, $$y_i \sim N(0,I)$$, and $$p(z_i=k) = \pi_k$$. Now we need to find each $$W_k$$. This leads to mixture PCA. A further extension allows searching over the dimensionality as well. We call this extension Bayes PCA. We start with $$W \in \mathbb{R}^{D\times D}$$ and introduce a prior distribution over the columns of $$W$$, $p(W) = \prod_{i=1}^{D}N(W_i \mid 0, \alpha_i^{-1}I) \, .$ We can then find $\alpha = \arg\max_{\alpha} p(X \mid \alpha) = \arg\max_{\alpha}\int p(W)\int p(Y)p(X \mid Y,W,\alpha)\,dY\,dW \, .$ Consider the Gaussian mixture example: \begin{aligned}p(X,Z,\mu,\Lambda,\pi) &= \left(\prod_{i=1}^{n}p(x_i \mid z_i,\mu,\Lambda)p(z_i \mid \pi)\right)\left(\prod_{k=1}^{K}p(\mu_k,\Lambda_k)\right)p(\pi) \\&= \left(\prod_{i=1}^{n}N(x_i \mid \mu_{z_i},\Lambda_{z_i})\pi_{z_i}\right)\left(\prod_{k=1}^{K}\mathrm{GW}(\mu_k,\Lambda_k)\right)\mathrm{Dir}(\pi) \, .\end{aligned}

We can calculate the variational approximation $p(Z,\mu,\Lambda,\pi \mid X) \approx q(Z)q(\mu,\Lambda,\pi).$ The right-hand side can be calculated analytically due to conjugacy. For details, see the books by Bishop and by Murphy. Consider another example, namely LDA (latent Dirichlet allocation). First proposed for topic modeling, this method was later found to be applicable in a much wider set of applications. Say we have $$D$$ documents; consider the following probabilistic model $p(W,Z,\theta \mid \phi,\lambda) = \prod_{d=1}^{D}p(\theta_d \mid \lambda)\prod_{n=1}^{N_d}p(w_{d_n} \mid z_{d_n})p(z_{d_n} \mid \theta_d ) = \prod_{d=1}^{D}\mathrm{Dir}(\theta_d \mid \lambda)\prod_{n=1}^{N_d} \theta_{d,z_{d_n}}\phi_{z_{d_n},w_{d_n}}.$ Here $$W$$ is the vector of words, $$N_d$$ is the number of words in document $$d$$, $$\theta_d \in \mathbb{R}^{T}$$ is a topic profile for document $$d$$, $$z_{d_n} \in \{1,\ldots,T\}$$ is the index of a topic, $$w_{d_n}$$ is the index of a word, and $$\phi \in \mathbb{R}^{T\times \Omega}$$, where $$\Omega$$ is the vocabulary size. We assume each word is associated with a specific topic $$z_{d_n}$$. Then the variational EM algorithm is $p(Z,\theta \mid W,\phi,\lambda) \approx q(Z)q(\theta),$ $(\phi,\lambda) = \arg\max_{\phi,\lambda} E_{Z,\theta}\log p(W,Z,\theta \mid \phi,\lambda).$

One application is collaborative filtering, which is widely used to build recommendation systems: treat users as documents and products as words in those documents; topics then correspond to buyer preferences.

## 6.6 Markov Chain Monte Carlo (MCMC)

Bayesian model building requires computing the conditional distribution of unobserved variables given observed data. For a supervised learning problem, we need to learn the posterior distribution over model parameters, given the training data $$D = (X,Y) = \{x_i,y_i\}_{i=1}^n$$.

The training can be performed via the Bayes rule $p(\theta \mid D) = \dfrac{p(Y \mid \theta,X)p(\theta)}{\int p(Y \mid \theta,X)p(\theta)d\theta}.$ Characterizing this distribution, however, is often difficult. Beyond a limited set of conjugate prior models, $$p(\theta \mid D)$$ is complicated and high-dimensional, implying that standard sampling methods either do not apply or are prohibitively expensive in terms of computing time. Markov Chain Monte Carlo (MCMC) methods provide a simulation-based approach for sampling from these high-dimensional distributions. These samples can then be used for estimation and prediction.

We start with the shuffling problem, which demonstrates the usefulness of sampling techniques. Imagine we have a deck of cards. There are 52! possible permutations of the cards in the deck. If we want to generate a random draw from the space of all possible permutations (we want to shuffle), we could generate a vector of all possible permutations and draw an element of this vector with probability 1/52!. However, $$52! \approx 10^{68}$$, which is not far below the estimated number of particles in the observable universe ($$10^{80}$$). We simply cannot generate such an array. A plausible solution is to generate a random shuffle using a random walk in the space of possible shuffles. To do that, we take the top card and randomly place it in the deck. After doing this 200-300 times we will have generated a random draw from the space of possible shuffles! This is how Markov Chain Monte Carlo works. You run a Markov chain (shuffle the top card at each step) to generate a Monte Carlo sample from a complex distribution (the distribution over all possible shuffles).
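The top-to-random walk is easy to simulate. A small sketch with a 4-card deck (deck size, move count, and number of walks are illustrative assumptions); the empirical distribution over the $$4! = 24$$ permutations comes out close to uniform:

```python
import random
from collections import Counter

def top_to_random(deck, moves, rng):
    # Markov-chain steps: remove the top card, reinsert it at a random position
    deck = list(deck)
    for _ in range(moves):
        top = deck.pop(0)
        deck.insert(rng.randrange(len(deck) + 1), top)
    return tuple(deck)

rng = random.Random(0)
# many independent walks; count how often each permutation appears
counts = Counter(top_to_random(range(4), 50, rng) for _ in range(12_000))
```

With 12,000 walks each permutation is expected about 500 times, and the observed counts scatter tightly around that value.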

Given the unknown variables $$\theta, x$$, MCMC algorithms generate a Markov chain, $$\left\{ \theta ^{\left( g\right)},x^{\left( g\right) }\right\} _{g=1}^{G}$$, whose stationary distribution is $$p\left( \theta ,x \mid y\right)$$. Thus, the key to Bayesian inference is simulation rather than optimization.

The simulations are used to estimate, via Monte Carlo, integrals that naturally arise in Bayesian inference. Common examples include posterior moments of parameters, $$E\left[ \theta \mid y\right]$$, or state variables, $$E\left[ x \mid y\right]$$, or even expected utility. Monte Carlo estimates are given by $\widehat{E}\left( f\left( \theta ,x\right) \mid y\right)=G^{-1}\sum_{g=1}^{G}f\left( \theta ^{\left( g\right) },x^{\left( g\right)}\right) \approx \int f\left( \theta ,x\right) p\left( \theta ,x \mid y\right)d\theta dx=E\left( f\left( \theta ,x\right) \mid y\right) \text{.}$
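For instance, if the draws $$\theta^{(g)}$$ happened to be exact samples from a standard normal posterior, the Monte Carlo estimates of $$E[\theta \mid y]$$ and $$E[\theta^2 \mid y]$$ would be simple sample averages (a toy sketch; the "posterior" here is a stand-in, not a claim about any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(size=100_000)     # stand-in for posterior draws theta^(g)

mean_hat = draws.mean()              # estimate of E[theta | y]  (true value 0)
second_hat = (draws ** 2).mean()     # estimate of E[theta^2 | y] (true value 1)
```

The estimator error shrinks at the usual $$G^{-1/2}$$ Monte Carlo rate.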

Monte Carlo methods were first applied in the late 1940s (the MANIAC paper) to simulate physical processes of particle motion and binary collisions. They were subsequently used as computational techniques to compute multi-dimensional integrals. More recently they have been widely applied in data science applications. Markov chain Monte Carlo techniques became rather popular in modern applied mathematics and are frequently used for approximating integrals of complicated functions (Diaconis 2009).

### 6.6.1 Metropolis-Hastings Algorithm

The Metropolis-Hastings (MH) algorithm provides a general approach for sampling from a given target density, $$\pi \left( \theta \right)$$. MH uses an accept-reject approach, drawing a candidate from a distribution $$q$$ over $$\theta$$ that is accepted or rejected based on an acceptance probability. Unlike traditional accept-reject algorithms, which repeatedly sample until acceptance, the MH algorithm samples only once at each iteration. If the candidate is rejected, the algorithm keeps the current value. In this original form of the algorithm, the entire vector $$\theta$$ is updated at once. Below, modifications are discussed that update $$\theta$$ in blocks, using the intuition from the Clifford-Hammersley theorem (CH).

To design the accept-reject approach, we consider a discrete distribution with a finite number of states. Then a sequence of draws from this distribution can be seen as a random walk on a graph whose vertices are the possible values of the random variable. The same approach is used in optimisation to generate points that are uniformly distributed within a set. We start by looking at the problem of generating random walks on graphs.

Consider an oriented graph $$G(V,E)$$, where $$V$$ is the set of vertices, $$E$$ is the set of edges, and $$p_{ij}$$ is the probability of transition from vertex $$j$$ to vertex $$i$$ (we reversed the order to avoid matrix transpositions in the formulas). The transition matrix $$P = \{p_{ij}\}_{i,j=1}^n$$ is column stochastic, i.e. $$p_{ij} \ge 0$$ and $$\sum_i p_{ij} = 1$$. The problem is to find a limiting probability distribution (a.k.a. stationary distribution) $$\hat \theta = P\hat \theta$$, $$\hat\theta\in S$$, where $$S$$ is the standard simplex. We can use the power method $$\theta^{k+1} = P\theta^k$$, which converges to $$\hat \theta$$ (this is also the basic idea behind Google’s PageRank). The Perron-Frobenius theorem says that if $$p_{ij}>0$$ then $$\hat \theta$$ always exists, is unique, and $$\Vert \theta^k - \hat \theta\Vert \le c |\lambda_2|^k$$, where $$\lambda_2$$ is the second eigenvalue of $$P$$ (the first eigenvalue is always $$1$$). We can also think of $$\hat\theta_i$$ as the long-run fraction of time the random walk spends at vertex $$i$$. Let’s consider a simple example.

The transition matrix and the stationary distribution are $P = \left(\begin{array}{ccccccc} 0 & 0 & 1/3 & 0 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1/2 & 1 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1/3 & 1/2 &0 &0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1/3 & 0 & 0 & 1 & 0 \end{array}\right),~~~ \hat \theta = \left(\begin{array}{c}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1/2 \\ 1/2\end{array} \right)$
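We can verify this numerically. Note that in this example some $$p_{ij} = 0$$ (the Perron-Frobenius positivity condition fails, and the recurrent class $$\{6,7\}$$ is periodic), so plain power iteration oscillates; instead we read $$\hat\theta$$ off the eigenvector of $$P$$ for eigenvalue $$1$$:

```python
import numpy as np

# column-stochastic transition matrix from the example (p_ij: from j to i)
P = np.array([
    [0,   0, 1/3, 0,   0, 0, 0],
    [1/2, 0, 0,   0,   0, 0, 0],
    [1/2, 1, 0,   1/2, 0, 0, 0],
    [0,   0, 0,   0,   1, 0, 0],
    [0,   0, 1/3, 1/2, 0, 0, 0],
    [0,   0, 0,   0,   0, 0, 1],
    [0,   0, 1/3, 0,   0, 1, 0],
])

w, V = np.linalg.eig(P)
k = np.argmin(np.abs(w - 1.0))    # locate the eigenvalue closest to 1
pi = np.real(V[:, k])
pi = pi / pi.sum()                # normalise the eigenvector to sum to 1
```

The computed vector places mass $$1/2$$ on each of vertices 6 and 7 and zero elsewhere, matching $$\hat\theta$$ above.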

The reverse problem is: how can we achieve a uniform limiting distribution over the graph $$G$$, i.e. $$\hat \theta = e$$, $$e_i = 1/n$$, where $$n$$ is the number of vertices? We can easily see that $$e$$ is an eigenvector of $$P^T$$: since $$P$$ is column-stochastic, each row of $$P^T$$ sums to 1, so $$P^T e = e$$, and for the power iterations $$y^{k+1} = P^Ty^k$$ we have $$y^k \rightarrow e$$. In particular, if our probability transition matrix is symmetric, $$P^T = P$$, then the limiting distribution is uniform.

Our goal is to build a Markov chain (random walk) whose limiting distribution is a distribution $$\pi$$ that is known a priori. First we need to consider what is called the detailed balance condition (recall that $$p_{ij}$$ is the probability of moving from $$j$$ to $$i$$): $\pi_j p_{ij} = \pi_i p_{ji}, ~ i,j=1,\ldots,n,$

from which it follows that $\sum_i p_{ji}\pi_i = \pi_j \sum_i p_{ij} = \pi_j,~j=1,\ldots,n.$ Thus we need to find a Markov chain whose transition probabilities satisfy the detailed balance condition. Another way to write the condition is $\dfrac{p_{ij}}{p_{ji}} = \dfrac{\pi_i}{\pi_j}.$

Say we have some “trial” transition probabilities $$p_{ij}^0$$, and assume those are symmetric. We want to find new probabilities $$p_{ij} = p_{ij}^0b_{ij}, ~ i\ne j$$ and $$p_{ii} = 1 - \sum_{j:~j\ne i} p_{ij}$$. We need to choose $$b_{ij}$$ in such a way that $\dfrac{p_{ij}}{p_{ji}} = \dfrac{p_{ij}^0b_{ij}}{p_{ji}^0b_{ji}} = \dfrac{b_{ij}}{b_{ji}} = \dfrac{\pi_i}{\pi_j} \, .$ We can choose $b_{ij} = F\left(\dfrac{\pi_i}{\pi_j}\right)$

For the detailed balance condition to be satisfied, we need to choose $$F: \mathbb{R}_+ \rightarrow [0,1]$$ such that $\dfrac{F(z)}{F(1/z)} = z.$

An example of such a function is $$F(z) = \min(z,1)$$. This leads to the Metropolis algorithm

1. When at state $$i$$, draw a candidate next state $$j$$, proposed with probability $$p^{0}_{ji}$$
2. Calculate the acceptance probability $$a = \min(1,\pi_j/\pi_i)$$
3. Move to $$j$$ with probability $$a$$; stay at $$i$$ otherwise

When the move happens, we say the step is accepted; otherwise it is rejected. In the more general case (Metropolis-Hastings), when $$p^0$$ is not symmetric, we calculate $a = \min\left(1,\dfrac{\pi_j p^0_{ij}}{\pi_i p^0_{ji}}\right).$
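The discrete Metropolis walk is a few lines of code. Here the proposal draws a state uniformly at random (hence symmetric), and the target $$\pi$$, chain length, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # target distribution over 5 states
n = len(pi)

state = 0
counts = np.zeros(n)
steps = 200_000
for _ in range(steps):
    j = rng.integers(n)                      # symmetric proposal: uniform state
    if rng.random() < min(1.0, pi[j] / pi[state]):
        state = j                            # accept the move; otherwise stay
    counts[state] += 1

freq = counts / steps                        # empirical visit frequencies
```

The visit frequencies converge to $$\pi$$ even though the walk only ever evaluates ratios $$\pi_j/\pi_i$$, so $$\pi$$ need only be known up to a constant.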

In the continuous case we use variables $$S$$ and $$T$$ instead of the indices $$i$$ and $$j$$. The detailed balance condition becomes $\pi(S)p(T \mid S) = \pi(T)p(S \mid T).$

Then $\int p(T \mid S)\pi(S)dS = \int p(S \mid T)\pi(T)dS = \pi(T)\int p(S\mid T)dS = \pi(T).$

The following condition $\forall S, \forall T \text{ such that } \pi(T)\ne 0:\quad p(T \mid S) >0$ is sufficient to guarantee uniqueness of $$\pi(T)$$.

Specifically, the MH algorithm repeats the following two steps $$G$$ times: given $$\theta ^{\left( g\right) }$$ \begin{aligned} &&\text{Step 1. Draw }\theta ^{\prime }\text{ from a proposal distribution } q(\theta ^{\prime }|\theta ^{(g)}) \text{,} \\ &&\text{Step 2. Accept }\theta ^{\prime }\text{ with probability }\alpha \left( \theta ^{(g)},\theta ^{\prime }\right) \text{,} \end{aligned} where $\alpha \left( \theta ^{(g)},\theta ^{\prime }\right) =\min \left( \frac{\pi(\theta ^{\prime })}{\pi (\theta ^{(g)})}\frac{q(\theta ^{(g)}|\theta ^{\prime })}{q(\theta ^{\prime }|\theta ^{(g)})},1\right) \text{.}$

To implement the accept-reject step, draw a uniform random variable, $$U\sim U \left[ 0,1\right]$$, and set $$\theta ^{\left( g+1\right) }=\theta ^{\prime }$$ if $$U<\alpha \left( \theta ^{(g)},\theta ^{\prime }\right)$$, leaving $$\theta ^{\left( g\right) }$$ unchanged otherwise ($$\theta^{\left( g+1\right) }=\theta^{(g)}$$). It is important to note that the denominator in the acceptance probability cannot be zero, provided the algorithm is started from a $$\pi$$-positive point, since $$q$$ is always positive. The MH algorithm only requires that $$\pi$$ can be evaluated up to proportionality.

The output of the algorithm, $$\left\{ \theta ^{\left( g\right) }\right\}_{g=1}^{\infty }$$, is clearly a Markov chain. The key theoretical property is that the Markov chain, under mild regularity, has $$\pi \left( \theta\right)$$ as its limiting distribution. We discuss two important special cases that depend on the choice of $$q$$.

### 6.6.2 Independence MH

One special case draws a candidate independently of the previous state, $$q(\theta ^{\prime }|\theta ^{(g)})=q(\theta ^{\prime })$$. In this independence MH algorithm, the acceptance criterion simplifies to $\alpha \left( \theta ^{(g)},\theta ^{\prime }\right) =\min \left( \frac{\pi (\theta ^{\prime })}{\pi (\theta ^{(g)})}\frac{q(\theta ^{(g)})}{q(\theta ^{\prime })},1\right).$

Even though the candidate $$\theta ^{\prime }$$ is drawn independently of the previous state, the sequence generated is not independent, since $$\alpha$$ depends on previous draws. The criterion implies a new draw is always accepted if the target density ratio, $$\pi (\theta ^{\prime })/\pi (\theta ^{(g)})$$, exceeds the proposal ratio, $$q(\theta ^{\prime })/q(\theta ^{(g)})$$. When this is not satisfied, an unbalanced coin with acceptance probability $$\alpha$$ is flipped to decide whether or not to accept the proposal.

When using independence MH, it is common to pick the proposal density to closely match certain properties of the target distribution. One common criterion is to ensure that the tails of the proposal density are thicker than the tails of the target density. By “blanketing” the target density, it is less likely that the Markov chain will get trapped in a low-probability region of the state space.

### 6.6.3 Random-walk Metropolis

Random-walk (RW) Metropolis is the polar opposite of the independence MH algorithm. It draws a candidate from the following RW model, $\theta ^{\prime }=\theta ^{\left( g\right) }+\sigma \varepsilon _{g+1},$ where $$\varepsilon _{g+1}$$ is an independent, mean-zero, and symmetric error term, typically taken to follow a normal or $$t$$-distribution, and $$\sigma$$ is a scaling factor. The algorithm must be tuned via the choice of the scaling factor $$\sigma$$. Symmetry implies that $q\left( \theta ^{\prime }|\theta ^{\left( g\right) }\right) =q\left( \theta ^{\left( g\right) }|\theta ^{\prime }\right),$

and $\alpha \left( \theta ^{(g)},\theta ^{\prime }\right) =\min \left( \pi (\theta ^{\prime })/\pi (\theta ^{(g)}),1\right).$

The RW algorithm, unlike the independence algorithm, learns about the density $$\pi \left( \theta \right)$$ via small symmetric steps; it “randomly walks” around the support of $$\pi$$. If a candidate draw has a higher target density value than the current draw, $$\pi (\theta ^{\prime })>\pi (\theta ^{(g)})$$, the draw is always accepted. If $$\pi (\theta ^{\prime })<\pi (\theta ^{(g)})$$, then an unbalanced coin is flipped.
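A small sketch of RW Metropolis (illustrative; the N(0,1) target and the trial values of $$\sigma$$ are assumptions) shows how the scaling factor trades off step size against acceptance rate: a tiny $$\sigma$$ accepts almost everything but moves slowly, while a huge $$\sigma$$ proposes wild jumps that are mostly rejected.

```python
import numpy as np

def rw_metropolis(log_pi, sigma, theta0, n_iter, seed=0):
    """Random-walk Metropolis: symmetric proposal, so alpha = min(pi'/pi, 1)."""
    rng = np.random.default_rng(seed)
    theta, draws, accepts = theta0, np.empty(n_iter), 0
    for g in range(n_iter):
        cand = theta + sigma * rng.normal()
        if np.log(rng.uniform()) < log_pi(cand) - log_pi(theta):
            theta, accepts = cand, accepts + 1
        draws[g] = theta
    return draws, accepts / n_iter

log_pi = lambda t: -0.5 * t**2   # N(0,1) target, up to a constant

# The scaling factor sigma must be tuned: compare acceptance rates
for sigma in (0.1, 2.4, 20.0):
    draws, rate = rw_metropolis(log_pi, sigma, 0.0, 20000)
    print(f"sigma={sigma:5.1f}  acceptance={rate:.2f}")
```

Moderate acceptance rates (neither near 0 nor near 1) typically indicate a well-tuned chain.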

### 6.6.4 Gibbs Sampler

The starting point is the Clifford-Hammersley (CH) theorem, which states that a high-dimensional joint distribution, $$p\left( \theta ,x \mid y\right)$$, is completely characterized by a set of lower-dimensional conditional distributions. Given this characterization, MCMC methods iteratively sample from these conditional distributions using standard sampling methods.

To develop the foundations of MCMC in the simplest setting, we consider sampling from a bivariate posterior distribution $$p\left( \theta _{1},\theta_{2} \mid y\right)$$, and suppress the dependence on the data for parsimony. For intuition, it is useful to think of $$\theta _{1}$$ as traditional static parameters and $$\theta _{2}$$ as latent variables.

The Clifford-Hammersley theorem (CH) proves that the joint distribution, $$p\left( \theta _{1},\theta _{2}\right)$$, is completely determined by the conditional distributions, $$p\left( \theta _{1} \mid \theta _{2}\right)$$ and $$p\left( \theta _{2} \mid \theta _{1}\right)$$, under a positivity condition. The positivity condition requires that $$p\left( \theta _{1},\theta _{2}\right)$$, $$p\left( \theta _{1}\right)$$ and $$p\left( \theta _{2}\right)$$ have positive mass for all points. These results are useful in practice because in most cases, $$p\left( \theta _{1},\theta _{2}\right)$$ is only known up to proportionality and cannot be directly sampled. CH implies that the same information can be extracted from the lower-dimensional conditional distributions, breaking the “curse of dimensionality” by transforming the higher-dimensional problem of sampling from $$p\left( \theta _{1},\theta _{2}\right)$$ into the easier problems of sampling from $$p\left( \theta _{1} \mid \theta _{2}\right)$$ and $$p\left( \theta _{2} \mid \theta _{1}\right)$$.

The CH theorem is based on the Besag formula: for any pairs $$(\theta_{1},\theta _{2})$$ and $$(\theta _{1}^{\prime },\theta _{2}^{\prime })$$, $\frac{p\left( \theta _{1},\theta _{2}\right) }{p\left( \theta _{1}^{\prime },\theta _{2}^{\prime }\right) }=\frac{p(\theta _{1} \mid \theta _{2}^{\prime })p(\theta _{2} \mid \theta _{1})}{p\left( \theta _{1}^{\prime } \mid \theta _{2}^{\prime }\right) p\left( \theta _{2}^{\prime } \mid \theta _{1}\right) }.$

The proof uses the fact that $$p\left( \theta _{1},\theta _{2}\right)=p\left( \theta _{2} \mid \theta _{1}\right) p\left( \theta _{1}\right)$$ (applied to both $$(\theta _{1},\theta _{2})$$ and $$(\theta _{1}^{\prime },\theta _{2}^{\prime })$$) and Bayes rule: $p\left( \theta _{1}\right) =\frac{p\left( \theta _{1} \mid \theta _{2}^{\prime }\right) p\left( \theta _{2}^{\prime }\right) }{p\left( \theta _{2}^{\prime } \mid \theta _{1}\right) }\text{.}$

The general version of CH follows by analogy. Partitioning a vector as $$\theta =\left( \theta_{1},\theta_{2},\theta_{3},\ldots ,\theta_{K}\right)$$, the general CH theorem states that the conditional distributions $p\left( \theta _{i} \mid \theta _{-i}\right) := p\left( \theta _{i} \mid \theta _{1},\ldots ,\theta _{i-1},\theta _{i+1},\ldots ,\theta _{K}\right) ,$ for $$i=1,\ldots ,K$$, completely characterize $$p\left( \theta _{1},\ldots,\theta_{K}\right)$$.

An important case arises frequently in models with latent variables. Here, the posterior is defined over vectors of static parameters, $$\theta$$, and latent variables, $$x$$. In this case, CH implies that $$p\left( \theta,x \mid y\right)$$ is completely characterized by $$p\left( \theta \mid x,y\right)$$ and $$p\left( x \mid \theta ,y\right)$$. The distribution $$p\left( \theta \mid x,y\right)$$ is the posterior distribution of the parameters, conditional on the observed data and the latent variables. Similarly, $$p\left( x \mid \theta,y\right)$$ is the smoothing distribution of the latent variables.

The Gibbs sampler simulates multi-dimensional posterior distributions by iteratively sampling from the lower-dimensional conditional posteriors. Unlike the previous MH algorithms, the Gibbs sampler updates the chain one component at a time, instead of updating the entire vector. This requires either that the conditional posterior distributions are discrete, that they are recognizable distributions (e.g., Normal) for which standard sampling algorithms apply, or that resampling methods, such as accept-reject, can be used.

In the case of $$p\left( \theta _{1},\theta _{2}\right)$$, given current draws, $$\left( \theta _{1}^{\left( g\right) },\theta _{2}^{\left( g\right)}\right)$$, the Gibbs sampler consists of \begin{aligned} \text{1. Draw }\theta _{1}^{\left( g+1\right) } &\sim &p\left( \theta _{1}|\theta _{2}^{\left( g\right) }\right) \\ \text{2. Draw }\theta _{2}^{\left( g+1\right) } &\sim &p\left( \theta _{2}|\theta _{1}^{\left( g+1\right) }\right) , \end{aligned} repeating $$G$$ times. The draws generated by the Gibbs sampler form a Markov chain, as the distribution of $$\theta ^{\left( g+1\right) }$$ conditional on $$\theta ^{\left( g\right) }$$ is independent of past draws. Higher-dimensional cases follow by analogy. The Gibbs transition kernel from state $$\theta$$ to state $$\theta ^{\prime}$$ is $$p\left( \theta ,\theta ^{\prime }\right) =p\left( \theta_{1}^{\prime }|\theta _{2}\right) p\left( \theta _{2}^{\prime }|\theta_{1}^{\prime }\right),$$ and by definition, $$\int p\left( \theta ,\theta ^{\prime }\right) d\theta ^{\prime }=1$$. The densities $$p\left( \theta _{1}|\theta _{2}\right)$$ and $$p\left(\theta _{2}|\theta _{1}\right)$$ will typically have either discrete or continuous support, and in nearly all cases the chain can reach any point or set in the state space in one step. To establish convergence, we first identify the limiting distribution. A stationary probability distribution, $$\pi$$, satisfies the integral equation $\pi \left( \theta ^{\prime }\right) =\int p\left( \theta ,\theta^{\prime }\right) \pi \left( \theta \right) d\theta.$

If the chain converges, then $$\pi$$ is also called the limiting distribution. It is easy to verify that the stationary distribution of the Markov chain generated by the Gibbs sampler is the posterior distribution, $$\pi \left(\theta \right) =p\left( \theta _{1},\theta _{2}\right)$$: \begin{aligned} \int p\left( \theta ,\theta ^{\prime }\right) \pi \left( \theta \right) d\theta &=&p\left( \theta _{2}^{\prime }|\theta _{1}^{\prime }\right) \int_{\theta _{2}}\int_{\theta _{1}}p\left( \theta _{1}^{\prime}|\theta _{2}\right) p\left( \theta _{1},\theta _{2}\right) d\theta_{1}d\theta _{2} \\ &=&p\left( \theta _{2}^{\prime }|\theta _{1}^{\prime }\right) \int_{\theta _{2}}p\left( \theta _{1}^{\prime }|\theta _{2}\right) p\left( \theta_{2}\right) d\theta _{2} \\&=&p\left( \theta _{2}^{\prime }|\theta _{1}^{\prime }\right) p\left( \theta_{1}^{\prime }\right) =p\left( \theta _{1}^{\prime },\theta _{2}^{\prime}\right) =\pi \left( \theta ^{\prime }\right) \text{.} \end{aligned}
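A small sketch of the two-step scheme above (illustrative; the bivariate normal target with correlation 0.9 is an assumption): for $$(\theta_1,\theta_2)$$ jointly normal, both full conditionals are univariate normal and can be drawn directly.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]]).

    Both full conditionals are univariate normal:
      theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and symmetrically.
    """
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho**2)
    t1 = t2 = 0.0
    draws = np.empty((n_iter, 2))
    for g in range(n_iter):
        t1 = rng.normal(rho * t2, s)   # step 1: draw theta1 | theta2
        t2 = rng.normal(rho * t1, s)   # step 2: draw theta2 | new theta1
        draws[g] = (t1, t2)
    return draws

draws = gibbs_bivariate_normal(rho=0.9, n_iter=50000)
burned = draws[5000:]                       # discard burn-in
print(burned.mean(axis=0))                  # ~ (0, 0)
print(np.corrcoef(burned.T)[0, 1])          # ~ 0.9
```

The stronger the correlation $$\rho$$, the more slowly the one-component-at-a-time updates traverse the joint support, which is why mixing diagnostics matter in practice.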

To establish convergence to the limiting distribution, the chain must satisfy certain regularity conditions on how it traverses the state space. Starting from an initial $$\pi$$-positive point, the Markov chain in Gibbs samplers can typically reach any set in the state space in one step, implying that states communicate and the chain is irreducible. This does not imply that a chain starting from a given point will return to that point or visit nearby states frequently. Well-behaved chains are not only irreducible, but stable, in the sense that they make many return visits to states. Chains that visit states or sets frequently are recurrent. Under very mild conditions, the Gibbs sampler generates an irreducible and recurrent chain. In most cases, a measure-theoretic condition called Harris recurrence is also satisfied, which implies that the chains converge for any starting values. In this case, the ergodic theorem holds: for a sufficiently integrable function $$f$$ and for all starting points $$\theta$$, $\underset{G\rightarrow \infty }{\lim }\frac{1}{G}\sum_{g=1}^{G}f\left(\theta ^{\left( g\right) }\right) =\int f\left( \theta \right) \pi \left(\theta \right) d\theta =E\left[ f\left( \theta \right) \right]$

almost surely. Notice the two subtle modes of convergence: there is the convergence of the Markov chain to its stationary distribution, and Monte Carlo convergence, which is the convergence of the partial sums to the integral. In practice, a chain is typically run for an initial length, often called the burn-in, to remove any dependence on the initial conditions. Once the chain has converged, a secondary sample of size $$G$$ is created for Monte Carlo inference.

### 6.6.5 Hybrid chains

Given a partition of the vector $$\theta$$ via CH, a hybrid MCMC algorithm updates the chain one subset at a time, either by direct draws (Gibbs steps) or via MH. Thus, a hybrid algorithm combines the features of the MH algorithm and the Gibbs sampler, providing significant flexibility in designing MCMC algorithms for different models.

To see the mechanics, consider the two-dimensional example. First, assume that the distribution $$p\left( \theta _{2}|\theta _{1}\right)$$ is recognizable and can be directly sampled. Second, suppose that $$p\left( \theta _{1}|\theta _{2}\right)$$ can only be evaluated and not directly sampled. Thus we use a Metropolis step to update $$\theta _{1}$$ given $$\theta_2$$. For the MH step, the candidate is drawn from $$q\left( \theta _{1}^{\prime }|\theta _{1}^{\left( g\right) },\theta _{2}^{\left( g\right) }\right)$$, which indicates that the step can depend on the past draw for $$\theta _{1}$$. We denote the Metropolis step as $$MH \left[ q\left( \theta _{1}|\theta _{1}^{\left( g\right) },\theta_{2}^{\left( g\right) }\right) \right]$$, which implies that we draw $$\theta_{1}^{\left( g+1\right) }\sim q\left( \theta _{1}^{\prime }|\theta_{1}^{\left( g\right) },\theta _{2}^{\left( g\right) }\right)$$ and accept/reject based on $\alpha \left( \theta _{1}^{(g)},\theta _{1}^{\prime }\right) =\min \left( \frac{p\left( \theta _{1}^{\prime }|\theta _{2}^{\left( g\right) }\right) }{% p\left( \theta _{1}^{\left ( g \right ) }|\theta _{2}^{\left( g\right) }\right) }\frac{q\left( \theta _{1}^{\left( g\right) }|\theta _{1}^{\prime },\theta _{2}^{\left( g\right) }\right) }{q\left( \theta _{1}^{\prime }|\theta _{1}^{\left( g\right) },\theta _{2}^{\left( g\right) }\right) },1\right) .$

The general hybrid algorithm is as follows. Given $$\theta _{1}^{\left(g\right) }$$ and $$\theta _{2}^{\left( g\right) }$$, for $$g=1,\ldots,G$$, \begin{aligned} 1.\text{ Draw }\theta _{1}^{\left( g+1\right) } &\sim &MH\left[ q\left( \theta _{1}|\theta _{1}^{\left( g\right) },\theta _{2}^{\left( g\right) }\right) \right] \\ 2.\text{ Draw }\theta _{2}^{\left( g+1\right) } &\sim &p\left( \theta _{2}|\theta _{1}^{\left( g+1\right) }\right) . \end{aligned}

In higher-dimensional cases, a hybrid algorithm consists of any combination of Gibbs and Metropolis steps. Hybrid algorithms significantly increase the applicability of MCMC methods, as the only requirement is that the model generates posterior conditionals that can either be sampled or evaluated.
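A minimal sketch of the two-step hybrid scheme (the joint density below is an assumed toy example, not from the text): $$\theta_2 \mid \theta_1$$ is normal and sampled directly, while $$\theta_1 \mid \theta_2$$ has a non-standard form that can only be evaluated, so it gets a random-walk MH step.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy joint density (up to a constant): pi(t1, t2) ∝ exp(-t1^4/4 - (t2 - t1)^2/2).
# theta2 | theta1 ~ N(theta1, 1) can be sampled directly (Gibbs step);
# theta1 | theta2 ∝ exp(-t1^4/4 - (t2 - t1)^2/2) can only be evaluated (MH step).
log_cond1 = lambda t1, t2: -t1**4 / 4.0 - (t2 - t1)**2 / 2.0

t1, t2, sigma = 0.0, 0.0, 1.0
draws = np.empty((30000, 2))
for g in range(len(draws)):
    # 1. Random-walk MH update of theta1 given theta2
    cand = t1 + sigma * rng.normal()
    if np.log(rng.uniform()) < log_cond1(cand, t2) - log_cond1(t1, t2):
        t1 = cand
    # 2. Direct Gibbs update of theta2 given the new theta1
    t2 = rng.normal(t1, 1.0)
    draws[g] = (t1, t2)

burned = draws[3000:]
print(burned.mean(axis=0))   # both marginal means ~ 0 by symmetry
```

Only the ability to *evaluate* $$p(\theta_1 \mid \theta_2)$$ up to proportionality was needed for the MH step, which is exactly the flexibility the text describes.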

### 6.6.6 Missing values in a normal random sample

A key assumption is that the data are missing at random. Two methods are available:

1. The EM algorithm (Dempster, Laird and Rubin, 1977, JRSSB)
2. The Markov chain Monte Carlo (MCMC) method

Some references:

1. *The EM Algorithm and Extensions* by G. J. McLachlan and T. Krishnan (2008), John Wiley.
2. *The EM Algorithm and Related Statistical Models* by M. Watanabe and K. Yamaguchi (2003), CRC Press.
3. *Bayesian Data Analysis*, 2nd Ed., by Gelman, Carlin, Stern and Rubin (2003), CRC Press.
4. *Bayes and Empirical Bayes Methods for Data Analysis*, 2nd Ed., by B.P. Carlin and T.A. Louis (2001), CRC Press.

The EM algorithm iterates between an Expectation step and a Maximization step.

- E-step: For each data point with missing values, use the conditional distribution
$\theta_1 \mid \theta_2 \sim N_k(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(\theta_2-\mu_2),\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}),$ where $$k = \mathrm{dim}(\theta_1)$$ and the partition is based on $$\theta = (\theta_1,\theta_2)$$. Here $$\theta_1$$ denotes the missing components and $$\theta_2$$ the observed components.

- M-step: Perform MLE estimation based on the completed data. It is helpful to make use of sufficient statistics.

The MCMC method also makes use of the same conditional distribution. However, instead of using the expectation, one draws a random sample from the conditional distribution to fill in the missing values.

Basic result used in the MCMC method: suppose that $$\theta_1,\ldots, \theta_n$$ form a random sample from $$N(\mu,\Sigma)$$, where $$\Sigma$$ is known, and that the prior distribution of $$\mu$$ is $$N(\mu_0,\Sigma_0)$$. Then the posterior distribution of $$\mu$$ is also multivariate normal with mean $$\mu_*$$ and covariance matrix $$\Sigma_*$$, where $\Sigma_*^{-1} = \Sigma_0^{-1} + n\Sigma^{-1},~~\mu_* = \Sigma_*(\Sigma_0^{-1}\mu_0 + n \Sigma^{-1}\bar{\theta}),$ and $$\bar{\theta} = \sum_{i=1}^{n}\theta_i/n$$.

### Bayesian Deep Learning

We considered Bayesian predictive models previously and described several approaches to perform inference, namely MAP (maximum a posteriori) and MCMC (Markov chain Monte Carlo). However, full Bayesian inference for non-conjugate models that relies on MCMC hardly scales to the dimensions that arise when using deep learning models. DL models simply have a much larger number of parameters than MCMC can handle in a computationally efficient manner. Recent successful approaches to developing efficient Bayesian inference algorithms for deep learning networks are based on reparameterization techniques for calculating Monte Carlo gradients while performing variational inference.
Given the data $$D = (X,Y)$$, variational inference relies on approximating the posterior $$p(\theta \mid D)$$ with a variational distribution $$q(\theta \mid \phi)$$, where $$\theta = (W,b)$$. Then $$q$$ is found by minimizing the Kullback-Leibler divergence between the approximate distribution and the posterior, namely $\text{KL}(q \mid\mid p) = \int q(\theta \mid \phi)\log \dfrac{q(\theta \mid \phi)}{p(\theta\mid D)}d\theta.$ The Kullback-Leibler divergence is a functional defined on pairs of probability density functions. Some important properties of the KL divergence include

1. $$\text{KL}(q \mid\mid p) \ge 0$$, with equality only when $$p = q$$
2. $$\text{KL}(q \mid\mid p) \ne \text{KL}(p \mid\mid q)$$

The second property is why it is called a divergence and not a distance. It is easy to prove the first property using Jensen's inequality: \begin{aligned}- \text{KL}(p \mid\mid q) = & -\int p(x) \log \dfrac{p(x)}{q(x)}dx = \int p(x) \log \dfrac{q(x)}{p(x)}dx \\\le & \log \int p(x) \dfrac{q(x)}{p(x)}dx = \log \int q(x)dx = 0.\end{aligned}

We want to find parameters $$\phi$$ that minimize the divergence. We observe that the log evidence can be decomposed as $$$\log p(D) = \mathcal{L}(\phi) + \text{KL}(q \mid\mid p) \tag{6.8}$$$ where $\mathcal{L}(\phi) = \int q(\theta \mid\phi)\log \dfrac{p(Y\mid X,\theta)p(\theta)}{q(\theta \mid \phi)}d\theta.$ The left-hand side does not depend on $$\phi$$, thus minimizing $$\text{KL}(q \mid\mid p)$$ is the same as maximizing $$\mathcal{L}(\phi)$$. Also, since $$\text{KL}(q \mid\mid p) \ge 0$$, which follows from Jensen's inequality, we have $$\log p(D) \ge \mathcal{L}(\phi)$$; hence the name evidence lower bound (ELBO). The resulting maximization problem $$\mathcal{L}(\phi) \rightarrow \max_{\phi}$$ is solved using stochastic gradient descent.
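The decomposition (6.8) can be verified numerically on a toy model (a two-state discrete example; the probability values are assumptions chosen for illustration): for any variational distribution $$q$$, the ELBO plus the KL divergence to the posterior equals the log evidence exactly.

```python
import numpy as np

# Two-state toy model: theta in {0, 1}
prior = np.array([0.5, 0.5])
lik = np.array([0.2, 0.9])            # p(D | theta), illustrative values
evidence = np.sum(lik * prior)        # p(D)
post = lik * prior / evidence         # posterior p(theta | D) by Bayes rule

q = np.array([0.3, 0.7])              # an arbitrary variational distribution

elbo = np.sum(q * np.log(lik * prior / q))   # L(phi), the evidence lower bound
kl = np.sum(q * np.log(q / post))            # KL(q || posterior)

print(np.log(evidence), elbo + kl)    # the two quantities are identical
```

Because KL is non-negative, the printed ELBO alone is always at or below the log evidence, with equality exactly when $$q$$ matches the posterior.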

### 6.6.7 Stochastic Variational Inference

Another way to define the lower bound is $$$\mathcal{L}(\phi) = \operatorname{E}_{\theta \sim q(\theta \mid \phi)}\left(\log p(Y,\theta\mid X) - \log q(\theta \mid \phi)\right) \tag{6.9}$$$ Intuitively, the first term rewards variational distributions that place high mass on configurations of the latent variables that also explain the observations; the second term rewards variational distributions that are entropic, i.e., that maximize uncertainty by spreading their mass over many configurations. Practitioners derive variational algorithms to maximize the ELBO over the variational parameters by expanding the expectation in (6.9) and then computing gradients to use in an optimization procedure. Closed-form coordinate-ascent updates are available, for a small set of variational families, in conditionally conjugate exponential-family models, where the distribution of each latent variable given its Markov blanket falls in the same family as the prior. The gradient of the lower bound is not an expectation and thus cannot be calculated directly using Monte Carlo methods. The idea is to represent the gradient as an expectation of some random variable, so that Monte Carlo techniques can be used to calculate it. There are two standard ways to do this.
First, the log-derivative trick uses the following identity: $\nabla_x \log f(x) = \nabla_x f(x)/f(x),$ and thus $\nabla_x f(x) = f(x) \nabla_x \log f(x).$ Now, let us apply this identity to the gradient: \begin{aligned}\nabla_{\phi} \operatorname{E}_{\theta \sim q(\theta \mid \phi)}\left(f(\theta)\right) = & \int \nabla_{\phi}q(\theta\mid \phi)f(\theta)d\theta\\= & \int \dfrac{q(\theta\mid \phi)}{q(\theta\mid \phi)}\nabla_{\phi}q(\theta\mid \phi)f(\theta)d\theta\\= & \int q(\theta\mid \phi)\nabla_{\phi}\log q(\theta\mid \phi) f(\theta)d\theta\\ = & \operatorname{E}_{\theta \sim q(\theta\mid \phi)}\left(f(\theta)\nabla_{\phi}\log q(\theta\mid \phi)\right).\end{aligned} By setting $$f(\theta) = \log p(Y,\theta\mid X) - \log q(\theta \mid \phi)$$, we get $\nabla_{\phi}\mathcal{L}= \operatorname{E}_{\theta \sim q(\theta\mid \phi)}\left((\log p(Y,\theta\mid X) - \log q(\theta \mid \phi))\nabla_{\phi}\log q(\theta\mid \phi)\right).$ Now we can use the Monte Carlo method to approximate the derivative: $\nabla_{\phi}\mathcal{L}\approx \dfrac{1}{N}\sum_{s=1}^{N}(\log p(Y,\theta^{(s)}\mid X) - \log q(\theta^{(s)} \mid \phi))\nabla_{\phi}\log q(\theta^{(s)}\mid \phi),~~~~\theta^{(s)} \sim q(\theta\mid \phi).$ Thus, if we select $$q(\theta \mid\phi)$$ so that it is easy to compute its derivative and to generate samples from it, the gradient can be efficiently calculated using the Monte Carlo technique. The problem is that this approximation of the gradient is too noisy. Thus, for an optimization algorithm to converge, the high-variance gradients would require small learning rates. A modified estimator that involves control variates can be used.
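Before turning to variance reduction, a small numerical check of the log-derivative (score-function) estimator on an assumed toy objective: the gradient of $$\operatorname{E}_{\theta \sim N(\phi,1)}[\theta^2]$$ with respect to $$\phi$$, whose true value is $$2\phi$$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy objective: grad of E_{theta ~ N(phi, 1)}[theta^2] w.r.t. phi; true value 2*phi.
phi, N = 1.0, 100000
theta = rng.normal(phi, 1.0, size=N)

f = theta**2
score = theta - phi                 # grad_phi log N(theta | phi, 1) = (theta - phi)/1
grad_samples = f * score            # per-sample score-function estimator
grad_est = grad_samples.mean()
print(grad_est)                     # ~ 2.0
print(grad_samples.var())           # the per-sample variance is large
```

Even for this one-dimensional problem the per-sample variance is substantial, which is the noise problem the text describes.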
The idea is to find a different function with the same mean but lower variance, $\operatorname{E}_{q}\left(f\right) = \operatorname{E}_{q}\left(\hat f\right),~~~\operatorname{Var}\left(\hat f\right) < \operatorname{Var}\left(f\right).$ One approach is to define $$\hat f$$ as $\hat f (\theta) = f(\theta) - a(h(\theta) - \operatorname{E}\left(h(\theta)\right)).$ The variance of $$\hat f$$ can be written as $\operatorname{Var}\left(\hat f\right) = \operatorname{Var}\left(f\right) + a^2 \operatorname{Var}\left(h\right) - 2a\operatorname{Cov}\left(f,h\right).$ The minimum of this function is achieved at the point where the derivative with respect to $$a$$ is zero, namely $a^* = \operatorname{Cov}\left(f,h\right)/\operatorname{Var}\left(h\right),$ which can be estimated empirically.

Yet another way to calculate the gradient of the expectation is to use the reparametrization trick. It simply replaces a random variable by a deterministic transformation of a simpler random variable. Say we want to sample a random variable $$\theta \sim p(\theta)$$; instead, we generate a sample by drawing $$\epsilon \sim p(\epsilon)$$ and then calculating $$g(\epsilon, \theta)$$. The most popular approach of this type uses the inverse CDF (cumulative distribution function) as the function $$g$$ and a uniform $$p(\epsilon)$$. The Box-Muller transform is another example; it relies on polar coordinates and generates a sample from a bivariate standard normal by setting $x_1 = \sqrt{-2\log \epsilon_1 }\cos(2\pi \epsilon_2),~~ x_2 = \sqrt{-2\log \epsilon_1 }\sin(2\pi \epsilon_2),~~~\epsilon_{1,2}\sim U[0,1].$
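A minimal numerical illustration of the control-variate construction above (the integrand $$e^z$$ and the control $$h(z)=z$$ are assumptions chosen so that $$\operatorname{E}[h]$$ is known exactly):

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(size=50000)

# Estimate E[exp(z)] for z ~ N(0,1); the true value is exp(0.5).
f = np.exp(z)
h = z                                      # control variate with known mean E[h] = 0

a_star = np.cov(f, h)[0, 1] / np.var(h)    # empirical a* = Cov(f,h)/Var(h)
f_hat = f - a_star * (h - 0.0)             # f_hat = f - a*(h - E[h])

print(f.mean(), f_hat.mean())   # both estimate exp(0.5) = 1.6487...
print(f.var(), f_hat.var())     # f_hat has noticeably lower variance
```

Both estimators are unbiased for the same mean, but the control-variate version shrinks the Monte Carlo noise, which is exactly what the ELBO gradient estimator needs.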

Table 6.1: One-liner representations of some common distributions.
| Distribution | Parameters ($$\theta$$) | $$\epsilon$$ | $$g(\epsilon, \theta)$$ |
|---|---|---|---|
| Exponential | rate $$\lambda$$ | $$U(0,1)$$ | $$-\frac{\log (1-\epsilon)}{\lambda }$$ |
| Cauchy | location $$x_0$$, scale $$\gamma$$ | $$U(0,1)$$ | $$x_0+\gamma \tan \left(\pi \left(\epsilon-\frac{1}{2}\right)\right)$$ |
| Laplace | mean $$\mu$$, scale $$\beta$$ | $$U(0,1)$$ | $$\beta \log (2 \epsilon )+\mu$$ |
| Gaussian | mean $$\mu=0$$, scale $$\sigma=1$$ | $$\epsilon_{1,2}\sim U(0,1)$$ | $$\sqrt{-2\ln(\epsilon_1)}\cos(2\pi \epsilon_2)$$ |
| Gaussian | mean $$\mu$$, scale $$\sigma$$ | $$N(0,1)$$ | $$\mu + \sigma\epsilon$$ |
| Multivariate Gaussian | mean $$\mu$$, variance $$\Sigma = RR^T$$ | $$\epsilon_i \sim N(0,1)$$ | $$\mu+R\epsilon$$ |
| Log-Normal | mean $$\mu$$, scale $$\sigma$$ | $$N(\mu,\sigma^2)$$ | $$\exp(\epsilon)$$ |
| Inverse Gamma | shape $$\alpha$$, scale $$\beta$$ | $$\Gamma(\alpha,\beta)$$ | $$1/\epsilon$$ |

The representations shown in Table 6.1 are used by the random number generation libraries available in most languages. Now, let us see how these reparametrizations can be used to solve the stochastic optimization problem that arises in variational inference. We want to find the gradient of an expectation, $\nabla_{\phi} \operatorname{E}_{\theta \sim q(\theta \mid \phi)}\left(f(\theta)\right) = \int \nabla_{\phi}q(\theta\mid \phi)f(\theta)d\theta.$ The goal is to find a re-parametrization of our random variable $$\theta$$ so that $\nabla_{\phi} \operatorname{E}_{\theta \sim q(\theta \mid \phi)}\left(f(\theta)\right) = \operatorname{E}_{\epsilon \sim p(\epsilon)}\left(\nabla_{\phi}f(g(\epsilon,\phi))\right).$ Re-expressing the distribution over $$\theta$$, we get \begin{aligned}\nabla_{\phi} \operatorname{E}_{\theta \sim q(\theta \mid \phi)}\left(f(\theta)\right) = & \nabla_{\phi} \int q(\theta\mid \phi)f(\theta)d\theta\\& = \nabla_{\phi} \int p(\epsilon)f(g(\epsilon,\phi))d\epsilon\\& = \int p(\epsilon)\nabla_{\phi}f(g(\epsilon,\phi))d\epsilon\\& = \operatorname{E}_{\epsilon \sim p(\epsilon)}\left(\nabla_{\phi} f(g(\epsilon,\phi))\right).\end{aligned} Now the distribution over $$\epsilon$$ does not depend on $$\phi$$ and, thus, we can interchange the gradient and expectation operators. Here we need to assume that such a re-parametrization $$g$$ exists and that $$p(\epsilon)$$ is easy to sample from. Now, we can apply Monte Carlo integration to calculate the derivative: $\operatorname{E}_{p(\epsilon)}\left(\nabla_\phi f(g(\epsilon, \phi))\right) \approx \frac{1}{S}\sum_{s=1}^{S}\nabla_\phi f(g(\epsilon^{(s)}, \phi)), \quad \epsilon^{(s)}\sim p(\epsilon).$

We can apply the reparametrization trick by representing $$\theta$$ as the value of a deterministic function, $$\theta = g(\epsilon,x,\phi)$$, where $$\epsilon \sim r(\epsilon)$$ does not depend on $$\phi$$. The derivative is given by \begin{aligned}\nabla_{\phi} E_q\log p(Y\mid X, \theta) &= \int r(\epsilon)\nabla_{\phi}\log p(Y\mid X, g(\epsilon,x,\phi))d \epsilon\\& = E_{\epsilon} [\nabla_{g}\log p(Y\mid X, g(\epsilon,x,\phi))\nabla_{\phi}g(\epsilon,x,\phi)].\end{aligned} The reparametrization is trivial when $$q(\theta \mid\phi) = N(\theta \mid \mu(D,\phi), \Sigma(D,\phi))$$ and $$\theta = \mu(D,\phi) + \epsilon\Sigma(D,\phi),~\epsilon\sim N(0,I)$$. One approach uses $$\Sigma(D,\phi) = I$$ and represents $$\mu(D,\phi)$$ as the output of a neural network (multi-layer perceptron); the resulting approach is called the variational auto-encoder. A generalized reparametrization combines both the log-derivative and reparametrization techniques by allowing $$\epsilon$$ to depend on $$\phi$$. In the operations research literature, the re-parametrization is called perturbation analysis and the resulting estimator is called the pathwise derivative (PD) (Fu 2015).

### 6.6.8 Auto-Encoding Variational Bayes

A Variational Auto-Encoder (VAE) is a generative model that allows one to generate draws from a joint high-dimensional distribution. It uses ideas from the auto-encoder model. An auto-encoder is a deep learning model which trains the architecture to approximate $$X$$ by itself (i.e., $$X=Y$$) via a bottleneck structure. It has two modules, the encoder $$g$$ and the decoder $$h$$. This means we select a model $$F^{W,b} (X) = h(g(X))$$ which aims to concentrate the information required to recreate $$X$$. Put differently, an auto-encoder creates a more cost-effective representation of $$X$$. For example, under an $$L^2$$-loss function, we wish to find ${\rm arg \; min}_{W,b} \; \Vert F^{W,b} (X) - X \Vert^2_2$ subject to a regularization penalty on the weights and offsets. In an auto-encoder, for a training data set $$\{ X_1 , X_2 , \ldots \}$$, we set the target values as $$Y_i = X_i$$. A static auto-encoder with two linear layers, akin to a traditional factor model, can be written as a deep learner as \begin{aligned}z^{(2)} & = W^{(1)} X + b^{(1)},\\a^{(2)} & = f_2 ( z^{(2)} ),\\z^{(3)} & = W^{(2)} a^{(2)} + b^{(2)},\\Y = F^{W,b}(X) & = a^{(3)} = f_3 ( z^{(3)} ),\end{aligned} where $$a^{(2)} , a^{(3)}$$ are activation levels. It is common to set $$a^{(1)} = X$$. The goal is to learn the weight matrices $$W^{(1)} , W^{(2)}$$. If $$X \in \mathbb{R}^{N}$$, then $$W^{(1)} \in \mathbb{R}^{M,N}$$ and $$W^{(2)} \in \mathbb{R}^{N,M}$$, where $$M \ll N$$ provides the auto-encoding at a lower-dimensional level. If $$W_2$$ is estimated from the structure of the training data matrix, then we have a traditional factor model, and the $$W_1$$ matrix provides the factor loadings. (We note that PCA in particular falls into this category, as we have seen in Equation @ref(eq:PCA_eq).)
If $$W_2$$ is estimated based on the pair $$\hat{X}=\{Y,X\}=X$$ (which means estimation of $$W_2$$ based on the structure of the training data matrix with the specific auto-encoder objective), then we have a sliced inverse regression model. If $$W_1$$ and $$W_2$$ are simultaneously estimated based on the training data $$X$$, then we have a two-layer deep learning model.

A dynamic one-layer auto-encoder for a financial time series $$(Y_t)$$ can, for example, be written as a coupled system of the form $Y_t = W_x X _t + W_y Y_{t-1} \; \; {\rm and} \; \; \left ( \begin{array}{c}X_t\\Y_{t-1}\end{array}\right ) = W Y_t\,.$ We then need to learn the weight matrices $$W_x$$ and $$W_y$$. Here, the state equation encodes and the matrix $$W$$ decodes the $$Y_t$$ vector into its history $$Y_{t-1}$$ and the current state $$X_t$$.

The auto-encoder demonstrates nicely that in deep learning we do not have to model the variance-covariance matrix explicitly, as our model is already directly in predictive form. (Given an estimated non-linear combination of deep learners, there is an implicit variance-covariance matrix, but that is not the driver of the method.)

The idea of the VAE is to replace the deterministic lower-dimensional representation $$z$$ of $$X$$ with a random variable, which we will denote $$z$$ as well; then we can calculate the distribution over the input by marginalizing out the latent variable $p(x) = \int p(x\mid z)p(z)dz.$ In a VAE a normal likelihood is assumed, e.g. $x\mid z \sim N(h_{\phi}(z),\sigma I).$ The function $$h$$ is called the decoder. Now, we need to choose $$p(z)$$. If we choose it arbitrarily, then for most of the possible values of $$z$$, $$p(x\mid z)$$ will be zero (the distributions will not overlap). The key idea is to make $$p(z)$$ data-driven and depend on the observed samples. Thus, we want values of $$z$$ that are likely to produce values of $$x$$ under the decoder. Thus, we will use a new distribution $z\mid x \sim q(z\mid x).$

This distribution is called the encoder and is also assumed to be normal, $z\mid x \sim N(\mu(x),\Sigma(x))$

Thus, we have a different distribution over $$z$$ for different values of $$x$$. As previously, we want to construct $$q(z\mid x)$$ to be as close as possible to $$p(z)$$ (regularization); further, we want the reconstruction to be as good as possible:

$$$l(x,\theta,\phi) = -\operatorname{E}_{z\sim q_{\theta}(z\mid x)}\left(\log p_{\phi}(x\mid z)\right) + \text{KL}\left(q_{\theta}(z\mid x) || p(z)\right) \tag{6.10}$$$

The loss function is simply the negative of a lower bound on the log probability $$\log p(x)$$. Now, we choose $$q(z\mid x)$$ to be normal with mean and variance being outputs of the deterministic encoder part of the VAE model: $\mu,\Sigma = g_{\theta}(X), ~~~~ z \sim N(\mu,\Sigma), ~~\hat{X} = h_{\phi}(z).$

The advantages of this choice are computational, as they make it clear how to compute the right-hand side. The last term, $$\mathrm{KL}\left[q(z|X)\|p(z)\right]$$, is now a KL divergence between two multivariate Gaussian distributions, which can be computed in closed form as: $\begin{array}{c}\mathrm{KL}[N(\mu_0,\Sigma_0) \| N(\mu_1,\Sigma_1)] = \hspace{20em}\\\hspace{5em}\frac{ 1 }{ 2 } \left( \mathrm{tr} \left( \Sigma_1^{-1} \Sigma_0 \right) + \left( \mu_1 - \mu_0\right)^\top \Sigma_1^{-1} ( \mu_1 - \mu_0 ) - k + \log \left( \frac{ \det \Sigma_1 }{ \det \Sigma_0 } \right) \right)\end{array}$ where $$k$$ is the dimensionality of the distribution. In our case, this simplifies to: $\begin{array}{c}\mathrm{KL}[N(\mu(X),\Sigma(X)) \| N(0,I)] = \hspace{20em}\\\hspace{6em}\frac{ 1 }{ 2 } \left( \mathrm{tr} \left( \Sigma(X) \right) + \left( \mu(X)\right)^\top ( \mu(X) ) - k - \log\det\left( \Sigma(X) \right) \right).\end{array}$
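The simplified formula can be checked numerically (the encoder outputs $$\mu$$ and a diagonal $$\Sigma$$ below are assumed illustrative values): the closed form should match a Monte Carlo estimate of $$\operatorname{E}_q[\log q(z) - \log p(z)]$$.

```python
import numpy as np

rng = np.random.default_rng(7)

mu = np.array([0.5, -1.0])
var = np.array([0.8, 1.5])    # diagonal Sigma(X), as a VAE encoder typically outputs
k = len(mu)

# Closed form: 0.5 * ( tr(Sigma) + mu^T mu - k - log det Sigma )
kl_closed = 0.5 * (var.sum() + mu @ mu - k - np.log(var).sum())

# Monte Carlo check: KL = E_q[log q(z) - log p(z)], z ~ q = N(mu, diag(var))
z = mu + np.sqrt(var) * rng.normal(size=(200000, k))
log_q = -0.5 * (((z - mu)**2 / var).sum(axis=1) + np.log(var).sum() + k * np.log(2 * np.pi))
log_p = -0.5 * ((z**2).sum(axis=1) + k * np.log(2 * np.pi))
kl_mc = (log_q - log_p).mean()

print(kl_closed, kl_mc)       # agree to within Monte Carlo error
```

Having this term in closed form means only the reconstruction term of the VAE loss needs sampling, which is the point made next.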

The first term on the right-hand side of Equation (6.10) is a bit more tricky. We could use sampling to estimate $$E_{z\sim q}\left[\log p(x\mid z) \right]$$, but getting a good estimate would require passing many samples of $$z$$ through the decoder, which would be expensive. Hence, as is standard in stochastic gradient descent, we take one sample of $$z$$ and treat $$p(x\mid z)$$ for that $$z$$ as an approximation of $$E_{z\sim q}\left[\log p(x\mid z) \right]$$.

After all, we are already doing stochastic gradient descent over different values of $$x$$. At test time, when we want to generate new samples, we simply input values of $$z\sim\mathcal{N}(0,I)$$ into the decoder. That is, we remove the encoder, including the multiplication and addition operations that would change the distribution of $$z$$. Further, since the negative of the loss function is a lower bound on $$\log p(x)$$, we can use it as a proxy for $$p(x)$$ and evaluate the probability of a given observation.

We apply the auto-encoder to the MNIST data set.

### 6.6.9 Generative Models

Ty Roberts created a program called the Verbasizer, which David Bowie used in the studio to generate song ideas, titles, and lyrics. The Verbasizer was an attempt to mimic the actual process David Bowie used to write his songs. Bowie would handwrite words from the newspaper, cut them up, throw them into a hat, and then arrange the fragments on pieces of paper. He would then reorder those words and remove those that would not fit the lyrics. The Verbasizer did exactly that. It would take word inputs and arrange them in columns. Some columns could be reserved for specific types of words, for example nouns, which helped build grammatically correct sentences. Further, each column was assigned a weight. Then the program would build sentences by randomly picking a word from each column. The Verbasizer was used in the creation of "Outside", which employed additional creative techniques for bypassing one's usual methods of artmaking that fall into patterns that are otherwise difficult to avoid. Figure 6.9 shows a version of the Verbasizer which was re-created recently for an exhibition dedicated to David Bowie.

With the return of David Bowie, everybody got in on the action, including London's esteemed Victoria and Albert Museum, which prepared to debut "David Bowie Is." The exhibition "will feature more than 300 objects that include handwritten lyrics, original costumes, fashion, photography, film, music videos, set designs and Bowie's own instruments." Though many details of the exhibition remained secret, Ty Roberts created a contemporary version of the Verbasizer for a section of the show possibly titled "David in the Studio." Roberts now works with the Gracenote R&D group's media and technology lab, which recreated the software for an iMac with a more automated process than the original Verbasizer. Since Bowie drew on newspaper content as one of his sources, the Verbasizer 2.0 draws in news feeds and automatically generates lyrics. It is another step removed from the cut-up by hand, but it caused Roberts to recall Brian Eno's interest in generative processes for creating music, which were a bit ahead of their time back in the 90s. Back then Roberts actually worked more with Brian Eno, and that work included developing software to auto-generate visuals, though not music, because the computers available for their use were not up to the task. Of course, Brian Eno went on to develop generative music apps such as Scape for iOS.

There is a lot one could take from these tales, including the fact that pre-Web ideas offer a lot of potential for those who are aware of the history of art as well as technology. Many of these ideas are unavailable to the typical Silicon Valley tech startup full of engineers and coders. Though the current mantra is that everyone should learn to code, with the counter that programmers should learn about design and art, interdisciplinary teams who can collaborate with mutual respect and understanding have the potential to make new moves inaccessible to teams dominated by designers, programmers, or artists.

### 6.6.10 Shallow Factor Models

Almost all shallow data reduction techniques can be viewed as consisting of a low-dimensional auxiliary variable $$Z$$ and a prediction rule specified by a composition of functions \begin{aligned}\hat{Y} &= f_1^{W_1,b_1}\big(f_2(W_2 X + b_2)\big)\\ &= f_1^{W_1,b_1}(Z),\ \text{ where } Z := f_2(W_2 X + b_2)\,.\end{aligned} The problem of high-dimensional data reduction in general is to find the $$Z$$-variable and to estimate the layer functions $$(f_1, f_2)$$ correctly. In the layers, we want to uncover the low-dimensional $$Z$$-structure in a way that does not disregard information about predicting the output $$Y$$. Principal component analysis (PCA), reduced rank regression (RRR), linear discriminant analysis (LDA), projection pursuit regression (PPR), and logistic regression are all shallow learners. For example, PCA reduces $$X$$ to $$f_2(X)$$ using a singular value decomposition of the form

$$Z = f_2(X) = W^T X + b, (\#eq:PCA_eq)$$

where the columns of the weight matrix $$W$$ form an orthogonal basis for the directions of greatest variance (which is in effect an eigenvector problem). Similarly, for the case $$X=(x_1,\ldots,x_p)$$, PPR reduces $$X$$ to $$f_2(X)$$ by setting $Z=f_2(X) = \sum^{N_1}_{i=1}g_i(w_{i1}x_1 + \ldots + w_{ip}x_p)\,.$ As stated before, these types of dimension reduction are independent of $$y$$ and can easily discard information that is valuable for predicting the desired output. Sliced inverse regression (SIR) overcomes this drawback somewhat by estimating the layer function $$f_2$$ using data on both $$Y$$ and $$X$$, but it still operates independently of $$f_1$$. If $$l(\theta)$$ and $$\phi(\theta)$$ are both convex, then Slater's condition, the existence of a strictly feasible point in the relative interior of the domain, ensures that strong duality holds; this condition is satisfied in nearly all statistical problems of this form. Similar expressions have arisen in studies of admissibility, dynamic models, outlier detection, scale-mixture priors, and sparsity.

### References

Armagan, Artin, David B Dunson, and Jaeyong Lee. 2013. “Generalized Double Pareto Shrinkage.” Statistica Sinica 23 (1): 119.
Beck, Amir, and Marc Teboulle. 2009. “Gradient-Based Algorithms with Applications to Signal Recovery.” Convex Optimization in Signal Processing and Communications, 42–88.
Bertsekas, Dimitri P. 1999. Nonlinear Programming. Belmont, Massachusetts: Athena Scientific.
Brown, Lawrence D. 1971. “Admissible Estimators, Recurrent Diffusions, and Insoluble Boundary Value Problems.” The Annals of Mathematical Statistics 42 (3): 855–903.
Carlin, Bradley P, and Nicholas G Polson. 1991. “Inference for Nonconjugate Bayesian Models Using the Gibbs Sampler.” Canadian Journal of Statistics 19 (4): 399–405.
Carvalho, Carlos M, Nicholas G Polson, and James G Scott. 2010. “The Horseshoe Estimator for Sparse Signals.” Biometrika 97 (2): 465–80.
Efron, Bradley. 2011. “Tweedie’s Formula and Selection Bias.” Journal of the American Statistical Association 106 (496): 1602–14.
Fu, Michael C. 2015. “Stochastic Gradient Estimation.” In Handbook of Simulation Optimization, 105–47. Springer.
Ghahramani, Zoubin, and Matthew J Beal. 2001. “Propagation Algorithms for Variational Bayesian Learning.” In Advances in Neural Information Processing Systems, 507–13.
Griva, Igor, Stephen G Nash, and Ariela Sofer. 2009. Linear and Nonlinear Optimization. Vol. 108. SIAM.
Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.” arXiv Preprint arXiv:1609.04836.
Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114.
Lessard, Laurent, Benjamin Recht, and Andrew Packard. 2016. “Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints.” SIAM Journal on Optimization 26 (1): 57–95.
Masreliez, C. 1975. “Approximate Non-Gaussian Filtering with Linear State and Observation Relations.” IEEE Transactions on Automatic Control 20 (1): 107–10.
Robbins, H. 1955. “An Empirical Bayes Approach to Statistics.” In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.
Robbins, Herbert, and Sutton Monro. 1985. “A Stochastic Approximation Method.” In Herbert Robbins Selected Papers, 102–9. Springer.
Rockafellar, R Tyrrell, and Roger J-B Wets. 2009. Variational Analysis. Vol. 317. Springer Science & Business Media.
Ruiz, Francisco R, Michalis Titsias RC AUEB, and David Blei. 2016. “The Generalized Reparameterization Gradient.” In Advances in Neural Information Processing Systems, 460–68.
West, Mike. 1984. “Outlier Models and Prior Distributions in Bayesian Linear Regression.” Journal of the Royal Statistical Society. Series B (Methodological), 431–39.