Regularization is a technique for incorporating prior knowledge about the model parameters into the estimation process. Consider an example in which regularization allows us to solve the hard problem of filtering noisy traffic data.
Example 17.7 (Traffic) Consider traffic flow speed measured by an in-ground sensor installed on interstate I-55 near Chicago. Figure 17.9 shows the measured speeds, averaged over five-minute intervals on one of the weekdays.
Speed measurements are noisy and prone to outliers. There are two sources of noise. The first is measurement noise, caused by the inherent limitations of the sensor’s hardware. The second is sampling error: vehicles observed in the specific lane where the sensor is installed might not represent traffic in the other lanes well.
One way to extract the underlying signal is trend filtering, a variation of the well-known Hodrick-Prescott filter. In this case, the trend estimate is the minimizer of the weighted-sum objective function \[
(1/2) \sum_{t=1}^{n}(y_t - x_t)^2 + \lambda \sum_{t=1}^{n-1}|x_{t-1} - 2x_t + x_{t+1}|,
\]
Trend Filtering for Traffic Speed Data
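A minimal sketch of how this objective could be solved numerically with cvxpy; the synthetic speed series, the length \(n=288\), and the value \(\lambda=50\) below are placeholders for illustration rather than the actual I-55 data.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 288                                   # e.g. 5-minute intervals over a day
y = 55 + 10 * np.sin(np.linspace(0, 3, n)) + rng.normal(0, 5, n)  # synthetic speeds

# Second-difference matrix D, so that (D x)_t = x_{t-1} - 2 x_t + x_{t+1}
D = np.diff(np.eye(n), n=2, axis=0)

lam = 50.0
x = cp.Variable(n)
objective = 0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x)
cp.Problem(cp.Minimize(objective)).solve()
trend = x.value                           # the filtered trend estimate
```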
Ridge Regression
Gauss invented the concept of least squares and developed algorithms to solve the optimization problem \[
\underset{\beta}{\mathrm{minimize}}\quad ||y- X\beta||_2^2
\] where \(\beta = (\beta_1 , \ldots , \beta_p )\). The problem can be solved with standard linear algebra, and the solution is given by \[
\hat{\beta} = ( X^T X )^{-1} X^T y
\] This can be numerically unstable when \(X^T X\) is ill-conditioned, which often happens when \(p\) is large. Ridge regression addresses this problem by adding an extra term to the \(X^TX\) matrix \[
\hat{\beta}_{\text{ridge}} = ( X^T X + \lambda I )^{-1} X^T y.
\] The corresponding optimization problem is \[
\underset{\beta}{\mathrm{minimize}}\quad ||y- X\beta||_2^2 + \lambda||\beta||_2^2.
\] An alternative formulation is the constrained problem \[
\underset{\beta}{\mathrm{minimize}}\quad ||y- X\beta||_2^2 ~~~~\mbox{s.t.}~~ ||\beta||_2^2 \le t.
\] We can think of the constraint as a budget \(t\) on the size of \(\beta\).
We then choose \(\lambda\) over a regularization path. The penalty in ridge regression forces the coefficients \(\beta\) to be close to 0: the penalty is large for large coefficients and very small for small ones. The tuning parameter \(\lambda\) controls the trade-off between how well the model fits the data and how small the \(\beta\)’s are. Different values of \(\lambda\) lead to different models, and we select \(\lambda\) using cross-validation, as sketched below.
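A short sketch, on simulated data, of the closed-form ridge solution \((X^TX + \lambda I)^{-1}X^Ty\) together with a cross-validated choice of \(\lambda\); note that scikit-learn calls the penalty parameter alpha, and the data below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.concatenate([rng.uniform(0.5, 1.0, 5), np.zeros(p - 5)])
y = X @ beta_true + rng.normal(size=n)

lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # (X'X + lam I)^{-1} X'y
beta_ols   = np.linalg.solve(X.T @ X, X.T @ y)
print("||beta_ols|| =", np.linalg.norm(beta_ols),
      " ||beta_ridge|| =", np.linalg.norm(beta_ridge))   # ridge norm is smaller

lambdas = np.logspace(-3, 3, 50)                 # the regularization path
fit = RidgeCV(alphas=lambdas).fit(X, y)          # leave-one-out CV by default
print("lambda selected by cross-validation:", fit.alpha_)
```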
Example 17.4 (Shrinkage) Consider simulated data with \(n=50\), \(p=30\), and \(\sigma^2=1\). The true model is linear with \(10\) large coefficients between \(0.5\) and \(1\).
Our approximator \(\hat f_{\beta}\) is a linear regression. We can estimate the bias empirically via the squared loss \(\frac{1}{n}||y -\hat y||_2^2\), and the variance empirically as \(\frac{1}{n}\sum_i (\bar{\hat{y}} - \hat y_i)^2\).
For the least squares fit, the squared bias is \(\mathrm{Bias}(\hat{y})^2=0.006\) and the variance is \(\mathrm{Var}(\hat{y}) =0.627\). Thus, the prediction error is \(1 + 0.006 + 0.627 = 1.633\).
We can do better by shrinking the coefficients to reduce the variance. Let’s estimate how big a gain we get with ridge regression.
We now examine the accuracy of the model as a function of \(\lambda\).
At the best value of \(\lambda\), ridge regression gives a squared bias of \(0.077\) and a variance of \(0.403\).
The prediction error is therefore \(1 + 0.077 + 0.403 = 1.48\).
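One way such numbers could be obtained is a Monte Carlo sketch like the following, which holds the design fixed and averages over repeated noise draws. The seed, number of replications, and grid of penalties are arbitrary choices, so the exact values will differ from those quoted above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, sigma2, reps = 50, 30, 1.0, 500
beta_true = np.zeros(p)
beta_true[:10] = rng.uniform(0.5, 1.0, 10)   # 10 large coefficients
X = rng.normal(size=(n, p))                  # fixed design across replications
f_true = X @ beta_true

def bias_variance(lam):
    preds = np.empty((reps, n))
    for r in range(reps):
        y = f_true + rng.normal(scale=np.sqrt(sigma2), size=n)
        preds[r] = Ridge(alpha=lam, fit_intercept=False).fit(X, y).predict(X)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f_true) ** 2)   # squared bias
    var = np.mean((preds - mean_pred) ** 2)      # variance of the fit
    return bias2, var

for lam in [1e-6, 1.0, 10.0, 50.0]:              # lam ~ 0 recovers least squares
    b2, v = bias_variance(lam)
    print(f"lambda={lam:6.1e}  bias^2={b2:.3f}  var={v:.3f}  pred.err={sigma2 + b2 + v:.3f}")
```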
The additional term \(\lambda||\beta||_2^2\) in the optimization problem is called the regularization term. There are several ways to regularize an optimization problem. Most of these techniques were developed in the middle of the last century to fit physical models to observed data, a problem that arises frequently in physics and engineering applications. Here are a few examples of such regularization techniques.
Ivanov regularization \[
\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}}\quad ||y - X\beta||_2^2~~~~ \mbox{s.t.}~~||\beta||_l \le k
\]
Morozov regularization \[
\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}}\quad ||\beta||_l~~~~ \mbox{s.t.}~~ ||y - X\beta||_2^2 \le \tau
\] Here \(\tau\) reflects the so-called noise level, i.e. an estimate of the error made during the measurement of \(y\).
Tikhonov regularization \[
\underset{\beta\in \mathbb{R}^p}{\mathrm{minimize}}\quad ||y - X\beta||_2^2 + \lambda||\beta||_l
\]
- Tikhonov regularization with \(l=1\) is the lasso
- Tikhonov regularization with \(l=2\) is ridge regression
- lasso + ridge = elastic net
\(\ell_1\) Regularization (LASSO)
The Least Absolute Shrinkage and Selection Operator (LASSO) uses an \(\ell_1\)-norm penalty and, in the case of linear regression, leads to the following optimization problem \[
\underset{\beta}{\mathrm{minimize}}\quad ||y- X\beta||_2^2 + \lambda||\beta||_1
\]
In the one-dimensional case, the LASSO solves the following optimization problem \[
\underset{\beta}{\mathrm{minimize}}\quad \frac{1}{2} (y-\beta)^2 + \lambda | \beta |
\] The solution is given by the soft-thresholding operator defined by \[
\hat{\beta} = \mathrm{soft} (y; \lambda) = \mathrm{sgn}(y)\,( |y| - \lambda )_+.
\] Here \(\mathrm{sgn}\) is the sign function and \(( x )_+ = \max (x,0)\). To see how this solution is derived, we can introduce a slack variable \(z = | \beta |\) and solve the resulting constrained optimization problem, which is differentiable.
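To make the formula concrete, here is a small numeric check: a soft-thresholding function and a brute-force grid minimizer of the one-dimensional objective above. The grid and the value of \(\lambda\) are arbitrary illustration choices.

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft-thresholding operator: sgn(y) * max(|y| - lam, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def lasso_1d(y, lam, grid=np.linspace(-5, 5, 100001)):
    """Brute-force minimizer of 0.5*(y - b)^2 + lam*|b| over a fine grid."""
    obj = 0.5 * (y - grid) ** 2 + lam * np.abs(grid)
    return grid[np.argmin(obj)]

lam = 1.0
for y in [-3.0, -0.5, 0.2, 2.5]:
    print(y, soft_threshold(y, lam), lasso_1d(y, lam))  # the two answers agree
```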
Graphically, the soft-thresholding operator is shown below.
LASSO has the nice feature that it forces some of the \(\hat{\beta}\)’s to exactly zero, so it performs automatic variable selection. Finding the optimal solution is computationally fast: it is a convex, though non-smooth, optimization problem. As in ridge regression, we still have to pick \(\lambda\) via cross-validation. Visually, the process can be represented using a regularization path graph, as in the following example, where we model prostate cancer data using LASSO.
MSE and Regularization path for Prostate Cancer data using LASSO
Now with ridge regression
MSE and Regularization path for Prostate Cancer data using Ridge
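The following sketch shows how such regularization paths and a cross-validated \(\lambda\) could be computed with scikit-learn. A small synthetic dataset stands in for the prostate data, and scikit-learn calls the penalty parameter alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, lasso_path
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; swap in the real prostate design matrix X and response y
# to reproduce the figures above.
X, y = make_regression(n_samples=97, n_features=8, n_informative=4,
                       noise=1.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

lambdas, coefs, _ = lasso_path(Xs, y)      # coefficient profiles over the lambda grid
cv_fit = LassoCV(cv=10).fit(Xs, y)         # 10-fold CV over the path
print("lambda chosen by CV:", cv_fit.alpha_)
print("nonzero coefficients:", np.count_nonzero(cv_fit.coef_))
```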
Example 17.5 (Horse race prediction using logistic regression) We use the run.csv data from Kaggle (https://www.kaggle.com/gdaley/hkracing). This dataset contains the conditions of horse races in Hong Kong, including race course, distance, track condition and dividends paid. We want to use individual variables to predict a horse's chance of winning. For simplicity of computation, we only consider horses with id \(\leq 500\), and train the model with \(\ell_1\)-regularized logistic regression.
We include lengths_behind, horse_age, horse_country, horse_type, horse_rating, horse_gear, declared_weight, actual_weight, draw, win_odds, and place_odds as predictor variables in our model.
Since most of the variables, such as country, gear, and type, are categorical, after expanding them into binary indicators we have more than 800 columns in the design matrix.
We try two logistic regression models. The first one includes the win_odds given by the gambling company. The second one does not include win_odds, and we instead use win_odds to test the power of our model. We tune both models with 10-fold cross-validation to find the best penalty parameter \(\lambda\), as sketched below.
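A sketch of how such a model could be tuned in Python with scikit-learn. The column names horse_id and won, and the preprocessing choices below, are assumptions about the dataset rather than the exact pipeline used here; scikit-learn parameterizes the penalty through C, roughly the inverse of \(\lambda\).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV

runs = pd.read_csv("run.csv")
runs = runs[runs["horse_id"] <= 500]               # assumed name of the id column
predictors = ["lengths_behind", "horse_age", "horse_country", "horse_type",
              "horse_rating", "horse_gear", "declared_weight", "actual_weight",
              "draw", "win_odds", "place_odds"]
X = pd.get_dummies(runs[predictors], drop_first=True)  # expand categoricals to indicators
y = runs["won"].astype(int)                        # assumed binary win indicator column

# 10-fold CV over a grid of l1 penalties
fit = LogisticRegressionCV(Cs=20, cv=10, penalty="l1", solver="liblinear",
                           scoring="neg_log_loss", max_iter=1000).fit(X, y)
print("selected C (inverse penalty strength):", fit.C_[0])
```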
In the first model, we fit the logistic regression on the full dataset. The best \(\lambda\) we find is \(5.699782\times 10^{-6}\).
Logistic regression for full data
In the second model, we randomly partition the dataset into training (70%) and testing (30%) parts. We fit the logistic regression on the training dataset. The best \(\lambda\) we find is \(4.792637\times 10^{-6}\).
Logistic regression for test data
The out-of-sample mean squared error for win_odds is 0.0668.
Elastic Net combines Ridge and Lasso and chooses coefficients \(\beta_1,\ldots,\beta_p\) for the input variables by minimizing the sum-of-squared residuals plus a penalty of the form \[
\lambda||\beta||_1 + \alpha||\beta||_2^2.
\]
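A minimal scikit-learn sketch of the elastic net on simulated data. Note that scikit-learn parameterizes the penalty as a single strength alpha together with a mixing weight l1_ratio, which can be mapped onto the \(\lambda\) and \(\alpha\) weights above; the values used here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = X[:, :5] @ rng.uniform(0.5, 1.0, 5) + rng.normal(size=100)

# sklearn objective: 1/(2n)||y - Xb||^2 + alpha*l1_ratio*||b||_1
#                    + 0.5*alpha*(1 - l1_ratio)*||b||_2^2
fit = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(fit.coef_))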
Bayesian Regularization
From the Bayesian perspective, regularization is nothing but the incorporation of prior information into the model. Remember that a Bayesian model is specified by likelihood and prior distributions. Bayesian regularization methods include the Bayesian bridge, horseshoe regularization, the Bayesian lasso, the Bayesian elastic net, the spike-and-slab lasso, and global-local shrinkage priors. Bayesian \(l_0\) regularization is an attractive solution for high-dimensional variable selection as it directly penalizes the number of predictors. The caveat is that a full solution requires enumeration over all possible models, which is NP-hard.
In the Bayesian approach, regularization requires the specification of a loss, denoted by \(l\left(\beta\right)\), and a penalty function, denoted by \(\phi_{\lambda}(\beta)\), where \(\lambda\) is a global regularization parameter. From a Bayesian perspective, \(l\left(\beta\right)\) and \(\phi_{\lambda}(\beta)\) correspond to the negative logarithms of the likelihood and prior distribution, respectively. Regularization leads to a maximum a posteriori (MAP) optimization problem of the form \[
\underset{\beta \in R^p}{\mathrm{minimize}\quad}
l\left(\beta\right) + \phi_{\lambda}(\beta) \; .
\] Taking a probabilistic approach leads to a Bayesian hierarchical model \[
p(y \mid \beta) \propto \exp\{-l(\beta)\} \; , \quad p(\beta) \propto \exp\{ -\phi_{\lambda}(\beta) \} \ .
\] The solution to the regularized minimization problem corresponds to the posterior mode, \(\hat{\beta} = \mathrm{arg\,max}_\beta \; p( \beta \mid y)\), where \(p(\beta \mid y)\) denotes the posterior distribution. Consider a normal means problem with \[
\label{eqn:linreg}
y = \theta+ e \ , \ \ \text{where } e \sim N(0, \sigma^2),~-\infty \le \theta \le \infty \ .
\] What prior \(p(\theta)\) should we place on \(\theta\) to be able to separate the “signal” \(\theta\) from the “noise” \(e\), when we know that there is a good chance that \(\theta\) is sparse (i.e. equal to zero)? In the multivariate case we have \(y_i = \theta_i + e_i\) and sparseness is measured by the number of zeros in \(\theta = (\theta_1,\ldots,\theta_p)\). The Bayesian Lasso assumes a double exponential (a.k.a. Laplace) prior distribution with density \[
p(\theta_i \mid b) = \frac{1}{2b}\exp(-|\theta_i|/b).
\] We use Bayes rule to calculate the posterior as the product of the Normal likelihood and the Laplace prior, so that the negative log-posterior satisfies \[
-\log p(\theta \mid y, b) \propto ||y-\theta||_2^2 + \dfrac{2\sigma^2}{b}||\theta||_1.
\] For fixed \(\sigma^2\) and \(b>0\), the posterior mode is equivalent to the Lasso estimate with \(\lambda = 2\sigma^2/b\). A large prior scale \(b\) corresponds to a small penalty weight \(\lambda\) in the Lasso objective function.
The Laplace distribution can be represented as a scale mixture of Normal distributions (Andrews and Mallows 1974) \[
\begin{aligned}
\theta_i \mid \sigma^2,\tau \sim &N(0,\tau^2\sigma^2)\\
\tau^2 \mid \alpha \sim &\mathrm{Exp}(\alpha^2/2)\\
\sigma^2 \sim & \pi(\sigma^2).\end{aligned}
\] We can show equivalence by integrating out \(\tau\) \[
p(\theta_i\mid \sigma^2,\alpha) = \int_{0}^{\infty} \dfrac{1}{\sqrt{2\pi \sigma^2\tau}}\exp\left(-\dfrac{\theta_i^2}{2\sigma^2\tau}\right)\dfrac{\alpha^2}{2}\exp\left(-\dfrac{\alpha^2\tau}{2}\right)d\tau = \dfrac{\alpha}{2\sigma}\exp\left(-\dfrac{\alpha}{\sigma}|\theta_i|\right).
\] Thus it is a Laplace distribution with location 0 and rate \(\alpha/\sigma\) (scale \(\sigma/\alpha\)). Representing the Laplace prior as a scale mixture of Normals allows us to apply an efficient numerical algorithm for computing samples from the posterior distribution. This algorithm is called the Gibbs sampler; it iteratively samples from \(\theta \mid b,y\) and \(b\mid \theta,y\) to estimate the joint posterior over \((\theta, b)\). Thus, we do not need to apply cross-validation to find the optimal value of \(b\); the Bayesian algorithm does it “automatically”. We will discuss the Gibbs algorithm later in the book.
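The scale-mixture representation suggests a simple Gibbs sampler. The sketch below is a minimal version for the normal-means model above, holding \(\sigma^2=1\) and the Laplace rate \(\alpha\) fixed (a full Bayesian lasso sampler would also update them); the conditional for \(1/\tau_i^2\) is Inverse-Gaussian, drawn here with numpy's wald generator, and the sparse true \(\theta\) is an arbitrary illustration choice.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 100
theta_true = np.concatenate([rng.uniform(2, 3, 10), np.zeros(p - 10)])
y = theta_true + rng.normal(size=p)

alpha, n_iter = 2.0, 2000
theta = y.copy()
draws = np.empty((n_iter, p))
for it in range(n_iter):
    # 1/tau_i^2 | theta ~ Inverse-Gaussian(mean = alpha/|theta_i|, shape = alpha^2)
    inv_tau2 = rng.wald(alpha / np.maximum(np.abs(theta), 1e-10), alpha ** 2)
    # theta_i | tau_i^2, y ~ N( y_i/(1 + 1/tau_i^2), 1/(1 + 1/tau_i^2) )  with sigma^2 = 1
    var = 1.0 / (1.0 + inv_tau2)
    theta = rng.normal(var * y, np.sqrt(var))
    draws[it] = theta

# Posterior means after a burn-in period; the large signals survive, the zeros shrink.
print(draws[1000:].mean(axis=0)[:12].round(2))
```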
When the prior is Normal, \(\theta_i \sim N(0,\sigma_{\theta}^2)\), the posterior mode is equivalent to the ridge estimate. The penalty weight in ridge regression is inversely proportional to the prior variance, \(\lambda\propto 1/\sigma_{\theta}^2\).
Spike-and-Slab Prior
The Bayesian formulation allows us to specify a wide range of regularized formulations for a regression problem. In this section we consider a Bayesian model for variable selection. Consider a linear regression problem \[
y = \beta_1x_1+\ldots+\beta_px_p + e \ , \ \ \text{where } e \sim N(0, \sigma^2),~-\infty \le \beta_i \le \infty \ .
\] We would like to solve the problem of variable selections, i.e. identify which input variables \(x_i\) to be used in our model. The gold standard for Bayesian variable selection are spike-and-slab priors, or Bernoulli-Gaussian mixtures. Whilst spike-and-slab priors provide full model uncertainty quantification, they can be hard to scale to very high dimensional problems and can have poor sparsity properties. On the other hand, techniques like proximal algorithms can solve non-convex optimization problems which are fast and scalable, although they generally don’t provide a full assessment of model uncertainty.
To perform a model selection, we would like to specify a prior distribution \(p\left(\beta\right)\), which imposes a sparsity assumption on \(\beta\), where only a small portion of all \(\beta_i\)’s are non-zero. In other words, \(\|\beta\|_0 = k \ll p\), where \(\|\beta\|_0 \defeq \#\{i : \beta_i\neq0\}\), the cardinality of the support of \(\beta\), also known as the \(l_0\) (pseudo)norm of \(\beta\). A multivariate Gaussian prior (\(l_2\) norm) leads to poor sparsity properties in this situation. Sparsity-inducing prior distributions for \(\beta\) include the double exponential (lasso).
Under spike-and-slab, each \(\beta_i\) exchangeably follows a mixture prior consisting of \(\delta_0\), a point mass at \(0\), and a Gaussian distribution centered at zero. Hence we write,
\[
\label{eqn:ss}
\beta_i | \theta, \sigma_\beta^2 \sim (1-\theta)\delta_0 + \theta N\left(0, \sigma_\beta^2\right) \ .
\] Here \(\theta\in \left(0, 1\right)\) controls the overall sparsity in \(\beta\) and \(\sigma_\beta^2\) accommodates non-zero signals. This family is termed the Bernoulli-Gaussian mixture model in the signal processing community.
A useful re-parameterization represents the parameter vector \(\beta\) via two independent random vectors \(\gamma = \left(\gamma_1, \ldots, \gamma_p\right)'\) and \(\alpha = \left(\alpha_1, \ldots, \alpha_p\right)'\) such that \(\beta_i = \gamma_i\alpha_i\), with the probabilistic structure \[
\label{eq:bg}
\begin{array}{rcl}
\gamma_i\mid\theta & \sim & \text{Bernoulli}(\theta) \ ;
\\
\alpha_i \mid \sigma_\beta^2 &\sim & N\left(0, \sigma_\beta^2\right) \ .
\\
\end{array}
\] Since \(\gamma_i\) and \(\alpha_i\) are independent, the joint prior density becomes \[
p\left(\gamma_i, \alpha_i \mid \theta, \sigma_\beta^2\right) =
\theta^{\gamma_i}\left(1-\theta\right)^{1-\gamma_i}\frac{1}{\sqrt{2\pi}\sigma_\beta}\exp\left\{-\frac{\alpha_i^2}{2\sigma_\beta^2}\right\}
\ , \ \ \ \text{for } 1\leq i\leq p \ .
\] The indicator \(\gamma_i\in \{0, 1\}\) can be viewed as a dummy variable to indicate whether \(\beta_i\) is included in the model.
Let \(S = \{i: \gamma_i = 1\} \subseteq \{1, \ldots, p\}\) be the “active set” of \(\gamma\), and \(\|\gamma\|_0 = \sum\limits_{i = 1}^p\gamma_i\) be its cardinality. The joint prior on the vector \(\{\gamma, \alpha\}\) then factorizes as \[
\begin{array}{rcl}
p\left(\gamma, \alpha \mid \theta, \sigma_\beta^2\right) & = & \prod\limits_{i = 1}^p p\left(\alpha_i, \gamma_i \mid \theta, \sigma_\beta^2\right) \\
& = &
\theta^{\|\gamma\|_0}
\left(1-\theta\right)^{p - \|\gamma\|_0}
\left(2\pi\sigma_\beta^2\right)^{-\frac p2}\exp\left\{-\frac1{2\sigma_\beta^2}\sum\limits_{i = 1}^p\alpha_i^2\right\} \ .
\end{array}
\]
Let \(X_\gamma \defeq \left[X_i\right]_{i \in S}\) be the set of “active explanatory variables” and \(\alpha_\gamma \defeq \left(\alpha_i\right)'_{i \in S}\) be their corresponding coefficients. We can write \(X\beta = X_\gamma \alpha_\gamma\). The likelihood can be expressed in terms of \(\gamma\), \(\alpha\) as \[
p\left(y \mid \gamma, \alpha, \theta, \sigma_e^2\right)
=
\left(2\pi\sigma_e^2\right)^{-\frac n2}
\exp\left\{
-\frac1{2\sigma_e^2}\left\|y - X_\gamma \alpha_\gamma\right\|_2^2
\right\} \ .
\]
Under this re-parameterization by \(\left\{\gamma, \alpha\right\}\), the posterior is given by
\[
\begin{array}{rcl}
p\left(\gamma, \alpha \mid \theta, \sigma_\beta^2, \sigma_e^2, y\right) & \propto &
p\left(\gamma, \alpha \mid \theta, \sigma_\beta^2\right)
p\left(y \mid \gamma, \alpha, \theta, \sigma_e^2\right)\\
& \propto &
\exp\left\{-\frac1{2\sigma_e^2}\left\|y - X_\gamma \alpha_\gamma\right\|_2^2
-\frac1{2\sigma_\beta^2}\left\|\alpha\right\|_2^2
-\log\left(\frac{1-\theta}{\theta}\right)
\left\|\gamma\right\|_0
\right\} \ .
\end{array}
\] Our goal then is to find the regularized maximum a posteriori (MAP) estimator \[
\arg\max\limits_{\gamma, \alpha}p\left(\gamma, \alpha \mid \theta, \sigma_\beta^2, \sigma_e^2, y \right) \ .
By construction, \(\gamma \in\left\{0, 1\right\}^p\) directly performs variable selection. A fully Bayesian treatment of the spike-and-slab prior, on the other hand, samples the full posterior and calculates posterior probabilities of variable inclusion. Finding the MAP estimator is equivalent to minimizing over \(\left\{\gamma, \alpha\right\}\) the regularized least squares objective function
\[
\label{obj:map}
\min\limits_{\gamma, \alpha}\left\|y - X_\gamma \alpha_\gamma\right\|_2^2
+ \frac{\sigma_e^2}{\sigma_\beta^2}\left\|\alpha\right\|_2^2
+ 2\sigma_e^2\log\left(\frac{1-\theta}{\theta}\right)
\left\|\gamma\right\|_0 \ .
\] This objective possesses several interesting properties:
- The first term is essentially the least squares loss function.
- The second term looks like a ridge regression penalty and is connected to the signal-to-noise ratio (SNR) \(\sigma_\beta^2/\sigma_e^2\). A smaller SNR shrinks the estimates more strongly towards \(0\). If \(\sigma_\beta^2 \gg \sigma_e^2\), the prior uncertainty on the size of non-zero coefficients is much larger than the noise level; that is, the SNR is sufficiently large and this term can be ignored. This is a common assumption in the spike-and-slab framework, where one usually wants \(\sigma_\beta \to \infty\) or to be “sufficiently large” in order to avoid imposing harsh shrinkage on non-zero signals.
- If we further assume that \(\theta < \frac12\), meaning that the coefficients are known to be sparse a priori, then \(\log\left(\left(1-\theta\right) / \theta\right) > 0\), and the third term can be seen as an \(l_0\) regularization.
Therefore, Bayesian MAP inference is connected to \(l_0\)-regularized least squares, which we summarize in the following proposition.
(Spike-and-slab MAP & \(l_0\) regularization)
Assuming \(\theta < \frac12\) and \(\sigma_\beta^2 \gg \sigma_e^2\), the Bayesian MAP estimate defined by ([obj:map]) is equivalent to the \(l_0\)-regularized least squares objective, for some \(\lambda > 0\), \[
\label{obj:l0}
\min\limits_{\beta}
\frac12\left\|y - X\beta\right\|_2^2
+ \lambda
\left\|\beta\right\|_0 \ .
\]
First, assuming that \[
\theta < \frac12, \ \ \ \sigma_\beta^2 \gg \sigma_e^2, \ \ \ \frac{\sigma_e^2}{\sigma_\beta^2}\left\|\alpha\right\|_2^2 \to 0 \ ,
\] gives us an objective function of the form \[
\min\limits_{\gamma, \alpha}
\label{obj:vs}
\frac12 \left\|y - X_\gamma \alpha_\gamma\right\|_2^2
+ \lambda
\left\|\gamma\right\|_0, \ \ \ \ \text{where } \lambda \defeq \sigma_e^2\log\left(\left(1-\theta\right) / \theta\right) > 0 \ .
\]
Equation ([obj:vs]) can be seen as a variable selection version of equation ([obj:l0]). The interesting fact is that ([obj:l0]) and ([obj:vs]) are equivalent. To show this, we need only to check that the optimal solution to ([obj:l0]) corresponds to a feasible solution to ([obj:vs]) and vice versa. This is explained as follows.
On the one hand, assuming \(\hat\beta\) is an optimal solution to ([obj:l0]), then we can correspondingly define \(\hat\gamma_i \defeq I\left\{\hat\beta_i \neq 0\right\}\), \(\hat\alpha_i \defeq \hat\beta_i\), such that \(\left\{\hat\gamma, \hat\alpha\right\}\) is feasible to ([obj:vs]) and gives the same objective value as \(\hat\beta\) gives ([obj:l0]).
On the other hand, if \(\left\{\hat\gamma, \hat\alpha\right\}\) is optimal for ([obj:vs]), then all of the elements in \(\hat\alpha_\gamma\) must be non-zero, otherwise a new \(\tilde\gamma_i \defeq I\left\{\hat\alpha_i \neq 0\right\}\) would give a lower objective value in ([obj:vs]). As a result, if we define \(\hat\beta_i \defeq \hat\gamma_i\hat\alpha_i\), then \(\hat\beta\) is feasible for ([obj:l0]) and gives the same objective value as \(\left\{\hat\gamma, \hat\alpha\right\}\) gives ([obj:vs]).
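Since the \(l_0\) objective requires a search over supports, a brute-force sketch for small \(p\) makes the connection concrete: enumerate all supports \(\gamma\), fit least squares on each active set, and keep the best. The data and \(\lambda\) below are arbitrary illustration choices, and this enumeration is only feasible for small \(p\) (the NP-hardness mentioned earlier).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, lam = 50, 8, 2.0
beta_true = np.array([1.5, -2.0, 0, 0, 1.0, 0, 0, 0])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

best = (np.inf, None)
for k in range(p + 1):
    for S in combinations(range(p), k):
        S = list(S)
        coef = np.linalg.lstsq(X[:, S], y, rcond=None)[0] if S else np.zeros(0)
        resid = y - (X[:, S] @ coef if S else 0.0)
        obj = 0.5 * np.sum(resid ** 2) + lam * len(S)   # 0.5||y - X_S a_S||^2 + lam*|S|
        if obj < best[0]:
            best = (obj, S)
print("selected support:", best[1])
```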
Horseshoe Prior
The sparse normal means problem is concerned with inference for the parameter vector \(\theta = ( \theta_1 , \ldots , \theta_p )\) where we observe data \(y_i = \theta_i + \epsilon_i\) and the level of sparsity might be unknown. From both a theoretical and empirical viewpoint, regularized estimators have won the day. This still leaves open the question of how to specify a penalty, denoted by \(\pi_{HS}\) (a.k.a. the negative log-prior, \(- \log p_{HS}\)). The lasso simply uses an \(L^1\)-norm, \(\sum_{i=1}^p | \theta_i |\), whereas the horseshoe prior (essentially) uses the penalty \[
\pi_{HS} ( \theta_i | \tau ) = - \log p_{HS} ( \theta_i | \tau ) = - \log \log \left ( 1 + \frac{2 \tau^2}{\theta_i^2} \right ) .
\] The motivation for the horseshoe penalty arises from analyzing the prior mass and its influence on the posterior, both in the tails and at the origin. The latter is the key determinant of the sparsity properties of the estimator.
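As a quick illustration of this behaviour, the sketch below evaluates the horseshoe penalty above against the lasso penalty for a few values of \(\theta\), with \(\tau\) and \(\lambda\) fixed at 1 (arbitrary choices): the horseshoe penalty is unbounded below near the origin and grows only slowly in the tails, while the lasso penalty grows linearly.

```python
import numpy as np

def horseshoe_penalty(theta, tau=1.0):
    # -log log(1 + 2*tau^2/theta^2), the (approximate) horseshoe penalty above
    return -np.log(np.log1p(2.0 * tau ** 2 / theta ** 2))

def lasso_penalty(theta, lam=1.0):
    return lam * np.abs(theta)

for t in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"theta={t:7.2f}  horseshoe={horseshoe_penalty(t):8.3f}  lasso={lasso_penalty(t):8.3f}")
```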
From a historical perspective, James-Stein estimation (a.k.a. \(L^2\)-regularisation) (Stein 1964) is only a global shrinkage rule, in the sense that there are no local parameters to learn about sparsity. A simple sparsity example shows the issue with \(L^2\)-regularisation: the sparse \(r\)-spike example below shows the problem with focusing solely on rules that apply the same shrinkage weight to every coordinate (albeit benefiting from pooling of information).
Let the true parameter value be \(\theta_p = \left ( \sqrt{d/p} , \ldots , \sqrt{d/p} , 0 , \ldots , 0 \right )\). James-Stein is equivalent to the model \[
y_i = \theta_i + \epsilon_i \; \mathrm{ and} \; \theta_i \sim \mathcal{N} \left ( 0 , \tau^2 \right )
\] The James-Stein estimator dominates the plain MLE, but is itself inadmissible. This is due to the fact that a “plug-in” estimate of the global shrinkage \(\hat{\tau}\) is used. Tiao and Tan’s original “closed-form” analysis is particularly relevant here: they point out that the mode of \(p(\tau^2 \mid y)\) is zero exactly when the shrinkage weight turns negative (their condition 6.6). From a risk perspective, \(E \Vert \hat{\theta}^{JS} - \theta \Vert^2 \leq p\) for all \(\theta\), showing the inadmissibility of the MLE. At the origin the risk is \(2\), but \[
\frac{p \Vert \theta \Vert^2}{p + \Vert \theta \Vert^2} \leq R \left ( \hat{\theta}^{JS} , \theta_p \right ) \leq
2 + \frac{p \Vert \theta \Vert^2}{ d + \Vert \theta \Vert^2}.
\] This implies that \(R \left ( \hat{\theta}^{JS} , \theta_p \right ) \geq p/2\). Hence, a simple thresholding rule beats James-Stein here, with a risk of order \(\sqrt{\log p }\). This simple example shows that the choice of penalty should not be taken for granted, as different estimators have different risk profiles.
Bayesian Inference
Consider a linear regression without a bias term \[
f(x) = x^T\beta + \epsilon,~~~\epsilon \sim N(0,\sigma_e^2).
\] We put a zero mean Gaussian prior on the model parameters \[
\beta \sim N(0,\Sigma).
Bayesian inference calculates the posterior given the data \[
p(\beta\mid y,X) = \dfrac{p(y\mid X,\beta)p(\beta)}{p(y\mid X)}.
The product of two Gaussian density functions leads to another Gaussian \[
\begin{aligned}
p(\beta\mid y,X) & \propto \exp\left(-\dfrac{1}{2\sigma_e^2}(y-X^T\beta)^T(y-X^T\beta)\right)\exp\left(-\dfrac{1}{2}\beta^T\Sigma^{-1}\beta\right)\\
& \propto \exp\left(-\dfrac{1}{2}(\beta - \bar\beta)^T\left(\sigma_e^{-2}XX^T + \Sigma^{-1}\right)(\beta-\bar\beta)\right)
\end{aligned}
\]
Thus, the posterior is \[
\beta\mid X,y \sim N(\bar\beta,A^{-1}),
\] where \(A = \sigma_e^{-2}XX^T + \Sigma^{-1}\) and \(\bar\beta = \sigma_e^{-2}A^{-1}Xy\). Here the convention, implied by the formulas, is that \(X\) is the \(p\times n\) matrix whose columns are the observed inputs.
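A small numpy sketch of these formulas, using the same \(p\times n\) convention for \(X\); the prior covariance, noise level, and simulated data are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma_e = 2, 20, 1.0
Sigma = np.eye(p)                                # prior covariance of beta
beta_true = np.array([1.0, -0.5])
X = rng.normal(size=(p, n))                      # one column per observation
y = X.T @ beta_true + sigma_e * rng.normal(size=n)

A = X @ X.T / sigma_e ** 2 + np.linalg.inv(Sigma)     # posterior precision
beta_bar = np.linalg.solve(A, X @ y) / sigma_e ** 2   # posterior mean
post_cov = np.linalg.inv(A)                           # posterior covariance
print("posterior mean:", beta_bar)
print("posterior covariance:\n", post_cov)
```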
Posterior
Consider a model with \(p = 1\) \[
y = \beta x + \epsilon, ~~ \beta \sim N(0,1),~~~\sigma_e = 1
Let us plot a sample of functions drawn from the prior.
Now, say we observe two points, \((1,1)\) and \((2,2)\). We can then calculate the posterior \(\beta \mid X,y \sim N(0.833, 0.166)\).
Why is our posterior mean not 1?
Bayesian Model Selection
When analyzing data, we deal with three types of quantities
- \(X\) = observed variables
- \(Y\) = hidden variable
- \(\theta\) = parameters of the model that describes the relation between \(X\) and \(Y\).
The probabilistic models of interest are the joint probability distribution \(p(D,\theta)\) (called a generative model) and \(p(Y,\theta \mid X)\) (a discriminative model). Discriminative models are easier to build and are more frequently used in practice. A generative model requires modeling a distribution over the set of observed variables, which makes the model more complicated. Text analysis provides an illustrative example: the task of identifying the topic of an article can be solved with a discriminative distribution, while the problem of generating a new article requires a generative model.
When performing data analysis using learning algorithms, we carry out two tasks, namely training and prediction, which are summarized in the table below.
| Task | Observed | Unknown | Distribution of interest |
|------|----------|---------|---------------------------|
| Training | \(D = (X,Y) = \{x_i,y_i\}_{i=1}^n\) | \(\theta\) | \(p(\theta \mid D)\) |
| Prediction | \(x_{\text{new}}\) | \(y_{\text{new}}\) | \(p(y_{\text{new}} \mid x_{\text{new}}, D)\) |
The training can be performed via the Bayes rule \[
p(\theta \mid D) = \dfrac{p(Y \mid \theta,X)p(\theta)}{\int p(Y \mid \theta,X)p(\theta)d\theta}.
\] Now to perform the second step (prediction), we calculate \[
p(y_{\text{new}} \mid x_{\text{new}}, D) = \int p(y_{\text{new}} \mid x_{\text{new}},\theta)p(\theta \mid D)d\theta
\] Thus, full Bayesian inference requires calculating two integrals, which might be difficult. We mentioned earlier that MAP allows us to avoid those calculations by approximating the posterior with \[
p(\theta \mid D) \approx \delta(\theta_{\text{MAP}}),~~\theta_{\text{MAP}} \in \argmax_{\theta}p(\theta \mid D)
\] To calculate \(\theta_{\text{MAP}}\), we do not need to know the normalizing constant of the posterior, since the solution of the optimization problem does not depend on this constant. Further, the second integral for prediction becomes degenerate and gets approximated by \[
p(y_{\text{new}} \mid x_{\text{new}}, D) = \int p(y_{\text{new}} \mid x_{\text{new}},\theta)p(\theta \mid D)d\theta \approx p(y_{\text{new}} \mid x_{\text{new}},\theta_{\text{MAP}}).
\]
Now consider the case when we have several candidate models for performing the prediction \[
p_1(Y,\theta \mid X), ~~p_2(Y,\theta \mid X),\ldots
How do we choose the best model? We can choose the model with the highest evidence value (an approach due to David MacKay) \[
j = \argmax_j p_j(Y \mid X) = \argmax_j \int p_j(Y \mid X,\theta)p(\theta)d\theta.
\] Note that formally, instead of \(p(\theta)\), we should write \(p(\theta \mid X)\); however, since \(\theta\) does not depend on \(X\), we omit the conditioning.
Can you think of how the prior \(p(\theta)\), posterior \(p(\theta \mid D)\) and the evidence \(p(Y \mid X)\) distributions will look like? Which model is the best? Which model will have the highest \(\theta_{\text{MAP}}\)?
Example 17.6 (Racial discrimination) Say we want to analyze racial discrimination by the US courts. We have three variables:
- Murderer: \(m \in \{0,1\}\) (black/white)
- Victim: \(v \in \{0,1\}\) (black/white)
- Verdict: \(d \in \{0,1\}\) (prison/death penalty)
Say we have the data
| \(m\) | \(v\) | \(d\) | count |
|---|---|---|---|
| 0 | 0 | 0 | 132 |
| 0 | 0 | 1 | 19 |
| 0 | 1 | 0 | 9 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 52 |
| 1 | 0 | 1 | 11 |
| 1 | 1 | 0 | 97 |
| 1 | 1 | 1 | 6 |
We would like to establish a causal relation between the race and verdict variables. For this, we consider several models:
- \(p(d \mid m,v) = p(d) = \theta\)
- \(p(d \mid m,v) = p(d \mid v)\); \(p(d \mid v=0) = \alpha, ~p(d \mid v=1)=\beta\)
- \(p(d \mid v,m) = p(d \mid m)\); \(p(d \mid m=0) = \gamma,~p(d \mid m=1) = \delta\)
- \(p(d\mid v,m)\) cannot be reduced; its parameters are given by
|  | \(m=0\) | \(m=1\) |
|---|---|---|
| \(v=0\) | \(\tau\) | \(\chi\) |
| \(v=1\) | \(\nu\) | \(\zeta\) |
To determine which model describes the data best, we calculate the evidences. We first need to specify the discriminative model \[
p(Y ,\theta \mid X) = p(Y \mid X,\theta)p(\theta \mid X)
\] Here \(X\) is the number of cases, and \(Y\) is the number of death penalties. We use the uninformative prior \(\theta \sim U[0,1]\). To specify the likelihood, we use the Binomial distribution \[
Y \mid X,\theta \sim B(X,\theta),~~B(Y \mid X,\theta) = \binom{X}{Y}\theta^Y(1-\theta)^{X-Y}
\] Now let us calculate the evidence \[
p(Y \mid X) = \int p(Y \mid X,\theta)p(\theta)d\theta
\] for each of the four models
- \(p(Y \mid X) = \int B(19 \mid 151,\theta)B(0 \mid 9,\theta)B(11 \mid 63,\theta)B(6 \mid 103,\theta)d\theta\) \(\propto \int_0^{1} \theta^{36}(1-\theta)^{290}d\theta = B(37,291) = 2.8\times 10^{-51}\)
- \(p(Y \mid X) = \int\int B(19 \mid 151,\alpha)B(0 \mid 9,\beta)B(11 \mid 63,\alpha)B(6 \mid 103,\beta)d\alpha d\beta \propto 4.7\times 10^{-51}\)
- For \(p(d \mid v,m) = p(d \mid m)\): \(p(Y \mid X) = \int\int B(19 \mid 151,\gamma)B(0 \mid 9,\gamma)B(11 \mid 63,\delta)B(6 \mid 103,\delta)d\gamma d\delta \propto 0.27\times10^{-51}\)
- For the full model: \(p(Y \mid X) = \int\int\int\int B(19 \mid 151,\tau)B(0 \mid 9,\nu)B(11 \mid 63,\chi)B(6 \mid 103,\zeta)d\tau d\nu d\chi d\zeta \propto 0.18\times10^{-51}\)
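These integrals are Beta functions, so a quick numerical check is straightforward. The sketch below drops the binomial coefficients, which are common to all four models and therefore cancel when comparing evidences.

```python
import numpy as np
from scipy.special import betaln

def log_ev(groups):
    """groups: (deaths, cases) pairs that share a single Uniform(0,1) parameter."""
    d = sum(g[0] for g in groups)
    c = sum(g[1] for g in groups)
    return betaln(d + 1, c - d + 1)   # log of integral theta^d (1-theta)^(c-d)

# (m, v) -> (deaths, cases), read off the data table above
data = {(0, 0): (19, 151), (0, 1): (0, 9), (1, 0): (11, 63), (1, 1): (6, 103)}

m1 = log_ev(data.values())                                                # p(d)
m2 = log_ev([data[0, 0], data[1, 0]]) + log_ev([data[0, 1], data[1, 1]])  # p(d | v)
m3 = log_ev([data[0, 0], data[0, 1]]) + log_ev([data[1, 0], data[1, 1]])  # p(d | m)
m4 = sum(log_ev([g]) for g in data.values())                              # p(d | m, v)
for name, le in zip(["p(d)", "p(d|v)", "p(d|m)", "p(d|m,v)"], [m1, m2, m3, m4]):
    print(f"{name:10s} evidence (up to constants) ~ {np.exp(le):.2e}")
```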
The last model is too complex: it can explain any relation in the data and thus has the lowest evidence score. However, if we were to use maximum likelihood estimates, the fourth model would have the highest likelihood. The Bayesian approach allows us to avoid over-fitting! You can also see that this data set exhibits Simpson’s paradox. Check it! A related problem is Bertrand’s gold box problem.