Unit 9: Bayesian Double Descent and Model Selection: Modern Approach to Bias-Variance Tradeoff
Vadim Sokolov
George Mason University
Spring 2025
We have two alternatives \(H_1\) and \(H_2\), and the plausibility of one relative to the other given the data is calculated as \[ \dfrac{P(H_1\mid D)}{P(H_2\mid D)} = \dfrac{P(D\mid H_1)}{P(D\mid H_2)}\dfrac{P(H_1)}{P(H_2)} \]
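A quick numeric illustration of this rule, with made-up evidence values and equal prior plausibility:

```python
# Sketch with made-up numbers: posterior odds = Bayes factor x prior odds.
bayes_factor = 0.01 / 0.001   # P(D|H1) / P(D|H2)
prior_odds = 1.0              # P(H1) / P(H2): equal prior plausibility
posterior_odds = bayes_factor * prior_odds
print(posterior_odds)         # 10.0: H1 is ten times more plausible than H2
```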
A simple example: \(D = (-1,3,7,11)\). Consider the hypothesis \(H_a\) that the sequence is an arithmetic progression with first term \(n_0\) and common difference \(n\).
Assuming \(n_0,n \in \{-50,-49,\ldots,50\}\), we have \[ P(D\mid H_a) = \dfrac{1}{101^2} \approx 0.0001. \]
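A minimal sketch of this calculation, assuming (as the numbers above imply) that exactly one parameter pair \((n_0, n) = (-1, 4)\) reproduces \(D\):

```python
# Enumerate the uniform prior grid over (n0, n) and count the parameter
# settings under which the arithmetic progression reproduces D exactly.
from itertools import product

D = (-1, 3, 7, 11)
grid = range(-50, 51)  # 101 values for each parameter

def generates(n0, n):
    return all(n0 + i * n == x for i, x in enumerate(D))

evidence = sum(generates(n0, n) for n0, n in product(grid, grid)) / 101**2
print(evidence)  # 1/101^2 ≈ 9.8e-05, i.e. about 0.0001
```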
In a more general case, the evidence (a.k.a. marginal likelihood) is calculated as follows \[ P(D\mid H_i) = \int P(D\mid \theta, H_i)P(\theta\mid H_i)d\theta. \]
This can be approximated by the Laplace approximation. In the one-dimensional case, we have \[ P(D\mid H_i) \approx P(D\mid \hat{\theta}, H_i)P(\hat \theta\mid H_i)\sigma_{\theta\mid D}, \] where \(\hat{\theta}\) is the maximum a posteriori (MAP) estimate of the parameters and \(\sigma_{\theta\mid D} = \sqrt{\mathrm{Var}(\theta\mid D)}\).
More generally, in the high-dimensional case with \(k = \dim(\theta)\), we have \[ P(D\mid H_i) \approx P(D\mid \hat{\theta}, H_i)P(\hat \theta\mid H_i)\sqrt{\dfrac{(2\pi)^k}{\det(\mathbf{H}(\hat\theta))}}, \] where \(\mathbf{H}(\hat\theta) = -\nabla^2\log P(\theta\mid D,H_i)\big|_{\theta=\hat\theta}\) is the negative Hessian of the log-posterior at the MAP estimate. As the amount of data collected increases, this Gaussian approximation is expected to become increasingly accurate.
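A sketch comparing this approximation with an exactly computable evidence, using a toy Beta-Binomial model (the model choice is an assumption for illustration): \(k_s\) successes in \(n\) trials with a uniform prior on \(\theta\), so the exact evidence is the Beta function \(B(k_s+1, n-k_s+1)\):

```python
# Laplace approximation to the evidence vs. the exact Beta-function value.
import numpy as np
from scipy.special import betaln

n, ks = 30, 9
theta_hat = ks / n  # MAP = MLE under the uniform prior

# Exact evidence: integral of theta^ks (1-theta)^(n-ks) over [0, 1].
log_exact = betaln(ks + 1, n - ks + 1)

# A = negative second derivative of the log-posterior at theta_hat.
A = ks / theta_hat**2 + (n - ks) / (1 - theta_hat)**2

# Laplace: P(D|H) ≈ P(D|theta_hat) * P(theta_hat) * sqrt(2*pi / A),
# with log prior = 0 for the uniform prior on [0, 1].
log_lik_hat = ks * np.log(theta_hat) + (n - ks) * np.log(1 - theta_hat)
log_laplace = log_lik_hat + 0.5 * np.log(2 * np.pi / A)

print(log_exact, log_laplace)  # ≈ -19.91 vs ≈ -19.89
```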
The posterior is \[ P(\theta\mid D, H_i) = \dfrac{P(D\mid \theta, H_i)P(\theta\mid H_i)}{P(D\mid H_i)}. \]
\(P(D\mid H_i)\) is called the evidence or marginal likelihood.
Laplace approximation: find \(\hat{\theta}\) that maximizes the posterior \(P(\theta\mid D, H_i)\) (MAP) and approximate the posterior as a Gaussian centered at \(\hat{\theta}\) with covariance matrix given by the inverse of the negative Hessian of the log-posterior, \[ \Sigma^{-1} = A = -\nabla^2\log P(\theta\mid D, H_i)\big|_{\theta=\hat\theta}. \] We then Taylor-expand the log-posterior around \(\hat{\theta}\) and get \[ P(\theta\mid D, H_i) \approx P(\hat{\theta}\mid D, H_i)\exp\left(-\dfrac{1}{2}(\theta - \hat{\theta})^TA(\theta - \hat{\theta})\right). \]
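A numerical sketch of this construction for a toy model (Poisson counts with an Exponential(1) prior; the model and data are assumptions for illustration): find the MAP by optimization and estimate \(A\) by finite differences:

```python
# Build the Gaussian (Laplace) approximation N(theta_hat, 1/A) numerically.
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([3, 5, 4, 2, 6])  # made-up Poisson counts

def neg_log_post(theta):
    # -log P(theta | D), up to the evidence constant
    log_lik = np.sum(data) * np.log(theta) - len(data) * theta
    log_prior = -theta  # Exponential(1) prior
    return -(log_lik + log_prior)

# MAP estimate: maximize the (unnormalized) posterior.
theta_hat = minimize_scalar(neg_log_post, bounds=(1e-6, 50), method="bounded").x

# A = -d^2/dtheta^2 log P(theta | D) at theta_hat, by central differences.
h = 1e-4
A = (neg_log_post(theta_hat + h) - 2 * neg_log_post(theta_hat)
     + neg_log_post(theta_hat - h)) / h**2

print(theta_hat, np.sqrt(1 / A))  # posterior ≈ N(theta_hat, 1/A)
```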
Model posterior is \[ P(H_i\mid D) = \dfrac{P(D\mid H_i)P(H_i)}{P(D)}. \]
The evidence \(P(D\mid H_i)\) plays the role of the likelihood function in model selection.
The total data probability \[ P(D) = \sum_{i=1}^k P(D\mid H_i)P(H_i) \] is the same for all models, so we can ignore it in model selection.
Assuming \(P(H_i) = 1/k\), we simply choose the model with the highest evidence.
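A minimal sketch of this rule, with hypothetical log-evidence values:

```python
# Turn per-model log-evidences into model posteriors under P(H_i) = 1/k.
import numpy as np

log_evidence = np.array([-19.9, -23.4, -21.1])  # hypothetical log P(D | H_i)

# With a uniform model prior, P(H_i | D) is proportional to P(D | H_i);
# normalize stably in log space.
log_post = log_evidence - np.logaddexp.reduce(log_evidence)
print(np.exp(log_post), np.argmax(log_post))  # posteriors; pick the argmax
```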
The evidence is the normalizing constant in the parameter posterior and is the model selection criterion \[ P(D\mid H_i) = \int P(D\mid \theta, H_i)P(\theta\mid H_i)d\theta. \] The Laplace approximation for the one-dimensional case gives \[ P(D\mid H_i) \approx P(D\mid \hat{\theta}, H_i)P(\hat \theta\mid H_i)\sigma_{\theta\mid D}, \]
that is, Likelihood \(\times\) Occam factor, where the Occam factor is \(P(\hat \theta\mid H_i)\sigma_{\theta\mid D}\).
\(\sigma_{\theta\mid D}\) is the posterior uncertainty
For a uniform prior of width \(\sigma_{\theta}\), \(P(\hat \theta\mid H_i) = 1/\sigma_{\theta}\), so \[ \text{Occam Factor} = \dfrac{\sigma_{\theta\mid D}}{\sigma_{\theta}}. \]
Ratio of posterior and prior volumes!
This models a trade-off between minimizing model complexity (the prior width \(\sigma_{\theta}\)) and minimizing the data misfit (the posterior width \(\sigma_{\theta\mid D}\)).
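A toy illustration of this trade-off (the model is an assumption for illustration): \(y_i \sim N(\theta, 1)\) with a uniform prior of width \(\sigma_\theta\), so \(\sigma_{\theta\mid D} \approx 1/\sqrt{n}\):

```python
# The Occam factor sigma_{theta|D} / sigma_theta shrinks as the prior widens.
import numpy as np

n = 50                       # number of observations
sigma_post = 1 / np.sqrt(n)  # posterior width sigma_{theta|D}

for sigma_prior in (1, 10, 100):  # prior width sigma_theta
    occam = sigma_post / sigma_prior
    # A wider prior (a more flexible model) is penalized more heavily.
    print(sigma_prior, occam)
```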
For example, in MacKay's example of guessing how many boxes are behind a tree, the posterior odds are \[ \dfrac{P(H_1\mid D)}{P(H_2\mid D)} = \dfrac{P(D\mid H_1)P(H_1)}{P(D\mid H_2)P(H_2)} \approx 1000/1 \] in favour of the one-box hypothesis: the two-box hypothesis requires the highly suspicious coincidences that the two box heights match exactly and the two colours match exactly.
Definition (BIC approximation to the evidence): \[ p(D\mid M) = \int p(D\mid \theta, M)p(\theta\mid M)d\theta \approx n^{-k/2}p(D\mid \hat \theta) \] where \(k\) is the number of parameters in the model and \(n\) is the number of data points.
BIC uses the Laplace approximation to the evidence: the likelihood is evaluated at the MAP estimate \(\hat \theta\), and with a uniform prior the Occam factor reduces to \(n^{-k/2}\).
\[ p(M\mid D) = \dfrac{p(D\mid M)p(M)}{p(D)} \] Assuming \(P(M)\) is uniform, we get
\[ \log p(M\mid D) \approx \mathrm{BIC}(M) + \text{const}, \] where \[ \mathrm{BIC}(M) = -\dfrac{k}{2}\log n + \log p(D\mid \hat \theta, M) \approx \log p(D\mid M). \]
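A sketch of BIC-based model selection on synthetic polynomial regression data (the data-generating setup is an assumption for illustration):

```python
# Rank polynomial models by log p(D|M) ≈ -(k/2) log n + log p(D|theta_hat, M).
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.linspace(-1, 1, n)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, n)  # true degree is 2

for degree in range(1, 6):
    coef = np.polyfit(x, y, degree)
    resid = y - np.polyval(coef, x)
    sigma2 = resid @ resid / n  # MLE of the Gaussian noise variance
    log_lik = -n / 2 * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2              # polynomial coefficients + noise variance
    bic = -k / 2 * np.log(n) + log_lik
    print(degree, round(bic, 2))  # the BIC should peak at degree 2
```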
AIC: \[ -k + \log p(D\mid \hat \theta, M) \] No \(\log n\) in the penalty! AIC is not consistent for model selection for large \(n\) (Woodroofe 1982).
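A quick comparison of the two penalties as \(n\) grows (in the notation above, AIC penalizes by \(k\) and BIC by \((k/2)\log n\)):

```python
# The AIC penalty is constant in n; the BIC penalty grows like log n.
import numpy as np

k = 3  # number of parameters
for n in (10, 100, 1000, 10000):
    print(n, k, 0.5 * k * np.log(n))  # AIC penalty vs BIC penalty
```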
A model should be as simple as possible, but not simpler.