Bayes AI

Unit 3: Bayesian Inference with Conjugate Pairs: Single Parameter Models

Vadim Sokolov
George Mason University
Spring 2025

EPL Odds

English Premier League: EPL

How do we calculate the odds for the possible scores in a match? \[ 0-0, \; 1-0, \; 0-1, \; 1-1, \; 2-0, \ldots \] Let \(X=\) Goals scored by Arsenal

\(Y=\) Goals scored by Liverpool

What are the odds of a team winning? \(\; \; \; P \left ( X> Y \right )\) And of a draw? \(\; \; \; P \left ( X = Y \right )\)

z1 = rpois(100, 0.6)    # simulated goals for team 1 in 100 matches
z2 = rpois(100, 1.4)    # simulated goals for team 2
sum(z1 < z2)/100        # Team 2 wins
sum(z1 == z2)/100       # Draw

Chelsea EPL 2017

Let’s take a historical set of data on scores. Then estimate \(\lambda\) with the sample mean of the home and away scores.

Home team  Score  Visiting team
Chelsea 2 1 West Ham
Chelsea 5 1 Sunderland
Watford 1 2 Chelsea
Chelsea 3 0 Burnley
\(\dots\)

EPL Chelsea

[Figures: empirical distributions of goals scored against and by Chelsea]

Our Poisson model fits the empirical data!!

EPL: Attack and Defence Strength

Each team gets an “attack” strength and a “defence” weakness rating. Adjust the home and away average goal estimates accordingly.

EPL: Hull vs ManU

Poisson Distribution

ManU average away goals \(=1.47\). Prediction: \(1.47 \times 1.46 \times 1.37 = 2.95\)

Average goals times the attack strength times the opponent’s defence weakness (here Hull’s).

Hull average home goals \(=1.47\). Prediction: \(1.47 \times 0.85 \times 0.52 = 0.65\).

Simulation of 100 games (an R sketch follows the table):

Team       Expected Goals    0    1    2    3    4    5
Man U      2.95              7   22   26   12   11   13
Hull City  0.65             49   41   10    0    0    0
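This simulation is easy to reproduce; a minimal R sketch using the expected-goal values above (with only 100 draws the counts vary from run to run, so they will not match the table exactly):

set.seed(1)                           # counts change with the seed
manu = rpois(100, 2.95)               # simulated Man U goals in 100 games
hull = rpois(100, 0.65)               # simulated Hull City goals
table(factor(manu, levels = 0:5))     # goal-count distribution for Man U
table(factor(hull, levels = 0:5))     # goal-count distribution for Hull
mean(manu > hull)                     # fraction of simulated games Man U wins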

EPL Predictions

A model is only as good as its predictions

  • In our simulation Man U wins 88 games out of 100, so a bet on Man U is attractive whenever the bookmaker’s odds imply a win probability below 88/100.

  • Most likely outcome is 0-3 (12 games out of 100)

  • The actual outcome was 0-1 (they played on August 27, 2016)

  • In our simulation 0-1 was the fourth most probable outcome (9 games out of 100)

Hierarchical Distributions

Bayesian Methods

Modern Statistical/Machine Learning

  • Bayes Rule and Probabilistic Learning
  • Computationally challenging: MCMC and Particle Filtering
  • Many applications in Finance:

Asset pricing and corporate finance problems.

Lindley, D.V. Making Decisions

Bernardo, J. and A.F.M. Smith Bayesian Theory

Bayesian Books

  • Hierarchical Models and MCMC

  • Bayesian Nonparametrics

Machine Learning

  • Dynamic State Space Models \(\ldots\)

Nate Silver: 538 and NYT

Silver (2012): The Signal and The Noise

  • Presidential Elections

  • Bayes dominant methodology

  • Predicting College Basketball/Oscars \(\ldots\)

Things to Know

Explosion of Models and Algorithms starting in the 1950s

  • Bayesian Regularisation and Sparsity

  • Hierarchical Models and Shrinkage

  • Hidden Markov Models

  • Nonlinear Non-Gaussian State Space Models

Algorithms

  • Monte Carlo Method (von Neumann and Ulam, 1940s)

  • Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970)

  • Gibbs Sampling (Geman and Geman, Gelfand and Smith, 1980s)

  • Sequential Particle Filtering

Probabilistic Reasoning

  • Bayesian Probability (Ramsey, 1926, de Finetti, 1931)
    1. Beta-Binomial Learning: Black Swans
    2. Elections: Nate Silver
    3. Baseball: Kenny Lofton and Derek Jeter
  • Monte Carlo (von Neumann and Ulam, Metropolis, 1940s)
  • Shrinkage Estimation (Lindley and Smith, Efron and Morris, 1970s)

Bayesian Inference

Key Idea: Explicit use of probability for summarizing uncertainty.

  1. A probability distribution for data given parameters \[ f(y| \theta ) \; \; \text{Likelihood} \]

  2. A probability distribution for unknown parameters \[ p(\theta) \; \; \text{Prior} \]

  3. Inference for unknowns conditional on observed data

Inverse probability (Bayes’ theorem); formal decision making (loss, utility).

Posterior Inference

Bayes theorem to derive posterior distributions \[ \begin{aligned} p( \theta | y ) & = \frac{p(y| \theta)p( \theta)}{p(y)} \\ p(y) & = \int p(y| \theta)p( \theta)d \theta \end{aligned} \] Allows you to make probability statements

  • They can be very different from p-values!

Hypothesis testing and Sequential problems

  • Markov chain Monte Carlo (MCMC) and Filtering (PF)

Conjugate Priors

  • Definition: Let \({\cal F}\) denote the class of distributions \(f ( y | \theta )\).

A class \(\Pi\) of prior distributions is conjugate for \({\cal F}\) if the posterior distribution is in the class \(\Pi\) for all \(f \in {\cal F} , \pi \in \Pi , y \in {\cal Y}\).

  • Example: Binomial/Beta:

Suppose that \(Y_1 , \ldots , Y_n \sim Ber ( p )\).

Let \(p \sim Beta ( \alpha , \beta )\) where \(( \alpha , \beta )\) are known hyper-parameters.

The beta-family is very flexible

Prior mean \(E ( p ) = \frac{\alpha}{ \alpha + \beta }\).

Bayes Learning: Beta-Binomial

How do I update my beliefs about a coin toss?

Likelihood for Bernoulli \[ p\left( y|\theta\right) =\prod_{t=1}^{T}p\left( y_{t}|\theta\right) =\theta^{\sum_{t=1}^{T}y_{t}}\left( 1-\theta\right) ^{T-\sum_{t=1}^{T}y_{t}}. \] Initial prior distribution \(\theta\sim\mathcal{B}\left( a,A\right)\) given by \[ p\left( \theta|a,A\right) =\frac{\theta^{a-1}\left( 1-\theta\right) ^{A-1}}{B\left( a,A\right) } \]

Bayes Learning: Beta-Binomial

Updated posterior distribution is also Beta \[ p\left( \theta|y\right) \sim\mathcal{B}\left( a_{T},A_{T}\right) \; {\rm and} \; a_{T}=a+\sum_{t=1}^{T}y_{t} , A_{T}=A+T-\sum_{t=1}^{T}y_{t} \] The posterior mean and variance are \[ E\left[ \theta|y\right] =\frac{a_{T}}{a_{T}+A_{T}}\text{ and }var\left[ \theta|y\right] =\frac{a_{T}A_{T}}{\left( a_{T}+A_{T}\right) ^{2}\left( a_{T}+A_{T}+1\right) } \]

Binomial-Beta

\(p ( p | \bar{y} )\) is the posterior distribution for \(p\)

\(\bar{y}\) is a sufficient statistic.

  • Bayes theorem gives \[ \begin{aligned} p ( p | y ) & \propto f ( y | p ) p ( p | \alpha , \beta )\\ & \propto p^{\sum y_i} (1 - p )^{n - \sum y_i } \cdot p^{\alpha - 1} ( 1 - p )^{\beta - 1} \\ & \propto p^{ \alpha + \sum y_i - 1 } ( 1 - p )^{ n - \sum y_i + \beta - 1} \\ & \sim Beta ( \alpha + \sum y_i , \beta + n - \sum y_i ) \end{aligned} \]

  • The posterior mean is a shrinkage estimator

Combination of sample mean \(\bar{y}\) and prior mean \(E( p )\)

\[ E(p|y) = \frac{\alpha + \sum_{i=1}^n y_i}{\alpha + \beta + n} = \frac{n}{n+ \alpha +\beta} \bar{y} + \frac{\alpha + \beta}{\alpha + \beta+n} \frac{\alpha}{\alpha+\beta} \]
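As a concrete check, a minimal R sketch of this Beta-Binomial update (the data vector and the Beta(2, 2) prior are illustrative, not from the slides):

y = c(1, 0, 1, 1, 0, 1, 1, 1)        # illustrative Bernoulli data
alpha = 2; beta = 2                  # illustrative Beta(2, 2) prior
a_post = alpha + sum(y)              # posterior Beta(a_post, b_post)
b_post = beta + length(y) - sum(y)
a_post/(a_post + b_post)             # posterior mean E(p | y)
(length(y)/(length(y) + alpha + beta))*mean(y) +
  ((alpha + beta)/(length(y) + alpha + beta))*(alpha/(alpha + beta))   # same value, written as shrinkage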

Black Swans

Taleb, The Black Swan: the Impact of the Highly Improbable

Suppose you have only ever seen White Swans, never a Black Swan.

What is the probability of a Black Swan event sometime in the future?

Suppose that after \(T\) trials you have only seen successes \(( y_1 , \ldots , y_T ) = ( 1 , \ldots , 1 )\). The probability that the next trial is also a success is \[ p( y_{T+1} =1 | y_1 , \ldots , y_T ) = \frac{T+1}{T+2} \] For large \(T\) this is almost certain. Here \(a=A=1\).
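A couple of lines of R reproduce this rule-of-succession probability (here with an illustrative \(T = 100\) white swans and the flat prior \(a = A = 1\)):

T = 100                       # number of white swans seen so far
aT = 1 + T; AT = 1 + 0        # posterior is Beta(a + T, A + 0)
aT/(aT + AT)                  # P(next swan is white) = (T + 1)/(T + 2)
# [1] 0.9901961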

Black Swans

Principle of Induction (Hume)

The probability of never seeing a Black Swan is given by \[ p( y_{T+1} =1 , \ldots , y_{T+n} = 1 | y_1 , \ldots , y_T ) = \frac{ T+1 }{ T+n+1 } \rightarrow 0 \]

A Black Swan will eventually happen – don’t be surprised when it actually does.

Bayesian Learning: Poisson-Gamma

Poisson/Gamma: Suppose that \(Y_1 , \ldots , Y_n \mid \lambda \sim Poi ( \lambda )\).

Let \(\lambda \sim Gamma ( \alpha , \beta )\)

\(( \alpha , \beta )\) are known hyper-parameters.

  • The posterior distribution is

\[ \begin{aligned} p ( \lambda | y ) & \propto \exp ( - n \lambda ) \lambda^{ \sum y_i } \lambda^{ \alpha - 1 } \exp ( - \beta \lambda ) \\ & \sim Gamma ( \alpha + \sum y_i , n + \beta ) \end{aligned} \]
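A short R sketch of this Poisson-Gamma update (the counts and the Gamma(1, 1) prior are illustrative, not taken from the slides):

y = c(2, 0, 1, 3, 1)                     # illustrative counts (e.g. goals in 5 matches)
alpha = 1; beta = 1                      # illustrative Gamma(1, 1) prior for lambda
a_post = alpha + sum(y)                  # posterior is Gamma(a_post, b_post)
b_post = beta + length(y)
a_post/b_post                            # posterior mean of lambda
qgamma(c(0.025, 0.975), shape = a_post, rate = b_post)   # 95% credible interval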

Example: Clinical Trials

Novick and Grizzle: Bayesian Analysis of Clinical Trials

Four treatments for duodenal ulcers.

Doctors assess the state of the patient.

Sequential data (\(\alpha\)-spending function: one can only look at prespecified times).

Treat  Excellent  Fair  Death
A          76      17     7
B          89      10     1
C          86      13     1
D          88       9     3

Conclusion: Cannot reject at the 5% level

Conjugate binomial/beta model + sensitivity analysis.

Binomial-Beta

Let \(p_i\) be the death rate proportion under treatment \(i\).

  • To compare treatment \(A\) to \(B\) directly compute \(P ( p_1 > p_2 | D )\).

  • Prior \(beta ( \alpha , \beta )\) with prior mean \(E ( p ) = \frac{\alpha}{\alpha + \beta }\).

Posterior \(beta ( \alpha + \sum x_i , \beta + n - \sum x_i )\)

  • For \(A\), \(beta ( 1 , 1 ) \rightarrow beta ( 8 , 94 )\)

For \(B\), \(beta ( 1 , 1 ) \rightarrow beta ( 2 , 100 )\)

  • Inference: \(P ( p_1 > p_2 | D ) \approx 0.98\)
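These posterior probabilities are easy to check by Monte Carlo; a sketch using the Beta posteriors quoted above:

set.seed(1)
p1 = rbeta(100000, 8, 94)      # posterior draws for treatment A's death rate
p2 = rbeta(100000, 2, 100)     # posterior draws for treatment B's death rate
mean(p1 > p2)                  # Monte Carlo estimate of P(p1 > p2 | D), roughly 0.98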

Sensitivity Analysis

Important to do a sensitivity analysis.

Treat  Excellent  Fair  Death
A          76      17     7
B          89      10     1
C          86      13     1
D          88       9     3

Poisson-Gamma: prior \(\Gamma ( m , z)\), with \(\lambda_i\) the expected death rate under treatment \(i\).

Compute \(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > c | D \right )\)

Prob  (0, 0)  (100, 2)  (200, 5)
\(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.3 | D \right )\) 0.95 0.88 0.79
\(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.6 | D \right )\) 0.91 0.80 0.64
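The ratio probabilities in this table can be computed the same way, by Monte Carlo from the two Gamma posteriors. A generic sketch follows; the posterior shapes and rates in the example call are placeholders, since the slide does not spell out the counts and prior parameterization behind each column:

ratio_prob = function(a1, b1, a2, b2, c, n = 100000) {
  l1 = rgamma(n, shape = a1, rate = b1)   # posterior draws for lambda_1
  l2 = rgamma(n, shape = a2, rate = b2)   # posterior draws for lambda_2
  mean(l1/l2 > c)                         # Monte Carlo estimate of P(lambda_1/lambda_2 > c | D)
}
ratio_prob(a1 = 7, b1 = 100, a2 = 1, b2 = 100, c = 1.3)   # placeholder posterior parameters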

Bayesian Learning: Normal-Normal

Using Bayes rule we get \[ p( \mu | y ) \propto p( y| \mu ) p( \mu ) \]

  • Posterior is given by

\[ p( \mu | y ) \propto \exp \left ( - \frac{1}{2 \sigma^2} \sum_{i=1}^n ( y_i - \mu )^2 - \frac{1}{2 \tau^2} ( \mu - \mu_0 )^2 \right ) \] Hence \(\mu | y \sim N \left ( \hat{\mu}_B , V_{\mu} \right )\) where

\[ \hat{\mu}_B = \frac{ n / \sigma^2 }{ n / \sigma^2 + 1 / \tau^2 } \bar{y} + \frac{ 1 / \tau^2 }{ n / \sigma^2 + 1 / \tau^2 }\mu_0 \; \; {\rm and} \; \; V_{\mu}^{-1} = \frac{n}{ \sigma^2 } + \frac{1}{\tau^2} \] A shrinkage estimator.
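A small R helper computing this posterior mean and variance (the numbers in the example call are those of the Chicago Bears example that appears later in the deck):

normal_posterior = function(ybar, n, sigma2, mu0, tau2) {
  prec = n/sigma2 + 1/tau2                        # posterior precision V_mu^{-1}
  mu_B = (n/sigma2*ybar + mu0/tau2)/prec          # shrinkage posterior mean
  c(mean = mu_B, var = 1/prec)
}
normal_posterior(ybar = -9.22, n = 9, sigma2 = 18.34^2, mu0 = 0, tau2 = 4^2)   # mean about -2.76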

SAT Scores

SAT scores (\(200-800\)): 8 high schools; estimate the coaching effects.

School  Estimated effect \(y_j\)  Std. error \(\sigma_j\)  Average treatment effect \(\theta_j\)
A 28 15 ?
B 8 10 ?
C -3 16 ?
D 7 11 ?
E -1 9 ?
F 1 11 ?
G 18 10 ?
H 12 18 ?
  • \(\theta_j\) average effects of coaching programs

  • \(y_j\): estimated treatment effect for school \(j\), with standard error \(\sigma_j\).

Estimates

Two programs appear to work (improvements of \(18\) and \(28\))

  • Large standard errors. Overlapping Confidence Intervals?

  • Classical hypothesis test fails to reject the hypothesis that the \(\theta_j\)’s are equal.

  • Pooled estimate has standard error of \(4.2\) with

\[ \hat{\theta} = \frac{ \sum_j ( y_j / \sigma_j^2 ) }{ \sum_j ( 1 / \sigma_j^2 ) } = 7.9 \]

  • Neither the separate nor the pooled estimates seem sensible.

Bayesian shrinkage!

Hierarchical Model

Hierarchical Model (\(\sigma_j^2\) known) is given by \[ \bar{y}_j | \theta_j \sim N ( \theta_j , \sigma_j^2 ) \] Unequal variances–differential shrinkage.

  • Prior Distribution: \(\theta_j \sim N ( \mu , \tau^2 )\) for \(1 \leq j \leq 8\).

Traditional random effects model.

Exchangeable prior for the treatment effects.

As \(\tau \rightarrow 0\) (complete pooling) and as \(\tau \rightarrow \infty\) (separate estimates).

  • Hyper-prior Distribution: \(p( \mu , \tau^2 ) \propto 1 / \tau\).

The posterior \(p( \mu , \tau^2 | y )\) can be used to “estimate” \(( \mu , \tau^2 )\).

Posterior

Joint Posterior Distribution \(y = ( y_1 , \ldots , y_J )\) \[ p( \theta , \mu , \tau | y ) \propto p( y| \theta ) p( \theta | \mu , \tau )p( \mu , \tau ) \] \[ \propto p( \mu , \tau^2) \prod_{j=1}^8 N ( \theta_j | \mu , \tau^2 ) \prod_{j=1}^8 N ( y_j | \theta_j , \sigma_j^2 ) \] \[ \propto \tau^{-9} \exp \left ( - \frac{1}{2} \sum_j \frac{1}{\tau^2} ( \theta_j - \mu )^2 - \frac{1}{2} \sum_j \frac{1}{\sigma_j^2} ( y_j - \theta_j )^2 \right ) \] MCMC! (A minimal Gibbs sketch follows.)
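A minimal Gibbs sampler sketch for this posterior, using the conditional draws implied by the hyper-prior \(p(\mu, \tau^2) \propto 1/\tau\); the data are the SAT table above, and the run length and burn-in are arbitrary choices:

y     = c(28, 8, -3, 7, -1, 1, 18, 12)      # estimated effects y_j
sigma = c(15, 10, 16, 11, 9, 11, 10, 18)    # standard errors sigma_j
J = length(y); S = 5000
theta = matrix(0, S, J); mu = numeric(S); tau2 = numeric(S)
mu[1] = mean(y); tau2[1] = var(y)
set.seed(1)
for (s in 2:S) {
  V = 1/(1/sigma^2 + 1/tau2[s - 1])                        # conditional variances of theta_j
  m = V*(y/sigma^2 + mu[s - 1]/tau2[s - 1])                # conditional means of theta_j
  theta[s, ] = rnorm(J, m, sqrt(V))                        # theta_j | mu, tau2, y
  mu[s] = rnorm(1, mean(theta[s, ]), sqrt(tau2[s - 1]/J))  # mu | theta, tau2
  tau2[s] = 1/rgamma(1, shape = (J - 1)/2,
                     rate = sum((theta[s, ] - mu[s])^2)/2) # tau2 | theta, mu
}
round(t(apply(theta[-(1:1000), ], 2, quantile, c(.025, .25, .5, .75, .975))), 1)

The resulting quantiles should roughly match the table on the next slide, up to Monte Carlo error.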

Posterior Inference

Report posterior quantiles

School     2.5%   25%   50%   75%   97.5%
A           -2     6    10    16     32
B           -5     4     8    12     20
C          -12     3     7    11     22
D           -6     4     8    12     21
E          -10     2     6    10     19
F           -9     2     6    10     19
G           -1     6    10    15     27
H           -7     4     8    13     23
\(\mu\)     -2     5     8    11     18
\(\tau\)   0.3   2.3   5.1   8.8     21

Schools \(A\) and \(G\) are similar!

Bayesian Shrinkage

Bayesian shrinkage provides a way of modeling complex datasets.

  1. Baseball batting averages: Stein’s Paradox

  2. Batter-pitcher match-up: Kenny Lofton and Derek Jeter

  3. Bayes Elections

  4. Toxoplasmosis

  5. Bayes MoneyBall

  6. Bayes Portfolio Selection

Example: Baseball

Batter-pitcher match-up?

Prior information on overall ability of a player.

Small sample size, pitcher variation.

  • Let \(p_i\) denote Jeter’s ability. Observed number of hits \(y_i\)

\[ (y_i | p_i ) \sim Bin ( T_i , p_i ) \; \; {\rm with} \; \; p_i \sim Be ( \alpha , \beta ) \] where \(T_i\) is the number of at-bats against pitcher \(i\). A priori \(E( p_i ) = \alpha / (\alpha+\beta ) = \bar{p}_i\).

  • The extra heterogeneity leads to a prior variance \(Var (p_i ) = \bar{p}_i (1 - \bar{p}_i ) \phi\) where \(\phi = ( \alpha + \beta + 1 )^{-1}\).

Sports Data: Baseball

Kenny Lofton hitting versus individual pitchers.

Pitcher        At-bats  Hits  ObsAvg
J.C. Romero          9     6    .667
S. Lewis             5     3    .600
B. Tomko            20    11    .550
T. Hoffman           6     3    .500
K. Tapani           45    22    .489
A. Cook              9     4    .444
J. Abbott           34    14    .412
A.J. Burnett        15     6    .400
K. Rogers           43    17    .395
A. Harang            6     2    .333
K. Appier           49    15    .306
R. Clemens          62    14    .226
C. Zambrano          9     2    .222
N. Ryan             10     2    .200
E. Hanson           41     7    .171
E. Milton           19     1    .056
M. Prior             7     0    .000
Total             7630  2283    .299

Baseball

Kenny Lofton

Kenny Lofton (career \(.299\) average, and a current \(.308\) average for the \(2006\) season) was facing the pitcher Milton (current record \(1\) for \(19\)).

  • Is putting in a weaker player really a better bet?

  • Over-reaction to bad luck?

\(\mathbb{P}\left ( \leq 1 \; {\rm hit \; in \; } 19 \; {\rm attempts} | p = 0.3 \right ) = 0.01\)

An unlikely \(1\)-in-\(100\) event.
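This tail probability is a one-line check in R:

pbinom(1, size = 19, prob = 0.3)    # P(at most 1 hit in 19 at-bats given p = 0.3), about 0.01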

Baseball

Kenny Lofton

Bayes solution: shrinkage. Borrow strength across pitchers

Bayes estimate: use the posterior mean

Lofton’s Bayes batting estimates vary from \(.265\) to \(.340\), the lowest being against Milton.

\(.265 < .275\)

Conclusion: resting Lofton against Milton was justified!!

Bayes Batter-pitcher match-up

Here’s our model again ...

  • Small sample sizes and pitcher variation.

  • Let \(p_i\) denote Lofton’s ability. Observed number of hits \(y_i\)

\[ (y_i | p_i ) \sim Bin ( T_i , p_i ) \; \; {\rm with} \; \; p_i \sim Be ( \alpha , \beta ) \] where \(T_i\) is the number of at-bats against pitcher \(i\).

Estimate \(( \alpha , \beta )\)

Example: Derek Jeter

Derek Jeter 2006 season versus individual pitchers.

Pitcher At-bats Hits ObsAvg EstAvg 95% Int
R. Mendoza 6 5 .833 .322 (.282, .394)
H. Nomo 20 12 .600 .326 (.289, .407)
A.J.Burnett 5 3 .600 .320 (.275, .381)
E. Milton 28 14 .500 .324 (.291, .397)
D. Cone 8 4 .500 .320 (.218, .381)
R. Lopez 45 21 .467 .326 (.291, .401)
K. Escobar 39 16 .410 .322 (.281, .386)
J. Wettland 5 2 .400 .318 (.275, .375)
T. Wakefield 81 26 .321 .318 (.279, .364)
P. Martinez 83 21 .253 .312 (.254, .347)
K. Benson 8 2 .250 .317 (.264, .368)
T. Hudson 24 6 .250 .315 (.260, .362)
J. Smoltz 5 1 .200 .314 (.253, .355)
F. Garcia 25 5 .200 .314 (.253, .355)
B. Radke 41 8 .195 .311 (.247, .347)
D. Kolb 5 0 .000 .316 (.258, .363)
J. Julio 13 0 .000 .312 (.243, .350 )
Total 6530 2061 .316


Bayes Estimates

Stern estimates \(\hat{\phi} = ( \alpha + \beta + 1 )^{-1} = 0.002\) for Jeter.

Jeter’s ability doesn’t vary much across the population of pitchers.

The extreme observed averages are shrunk the most, as are the match-ups with the smallest sample sizes.

Jeter had a season \(.308\) average.

Bayes estimates vary from \(.311\) to \(.327\)–he’s very consistent.

If all players had a similar record then a constant batting average would make sense.
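Using the \(\hat{\phi} = 0.002\) and the overall \(.316\) average quoted above, the table’s estimates can be approximately reproduced by moment-matching \((\alpha, \beta)\) and applying the Beta-Binomial update; a hedged sketch (small discrepancies are expected, since the exact values Stern used are rounded here):

phi = 0.002; pbar = 0.316             # phi-hat and overall average from the slides
ab = 1/phi - 1                        # alpha + beta implied by phi = 1/(alpha + beta + 1)
alpha = pbar*ab; beta = (1 - pbar)*ab
(alpha + 14)/(ab + 28)                # posterior mean vs E. Milton (28 AB, 14 hits): about .326
(alpha + 26)/(ab + 81)                # posterior mean vs T. Wakefield (81 AB, 26 hits): about .317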

Bayes Elections: Nate Silver

Multinomial-Dirichlet

Predicting the Electoral Vote (EV)

  • Multinomial-Dirichlet: \((\hat{p} | p) \sim Multi (p), ( p | \alpha ) \sim Dir (\alpha)\)

\[ p_{Obama} = ( p_{1}, \ldots ,p_{51} | \hat{p}) \sim Dir \left ( \alpha + \hat{p} \right ) \]

  • Flat uninformative prior \(\alpha\equiv 1\).

http://www.electoral-vote.com/evp2012/Pres/prespolls.csv

Bayes Elections: Nate Silver

Simulation

Calculate probabilities via simulation: rdirichlet \[ p \left ( p_{j,O} | {\rm data} \right ) \;\; {\rm and} \; \; p \left ( EV >270 | {\rm data} \right ) \]

The election vote prediction is given by the sum \[ EV =\sum_{j=1}^{51} EV(j) \mathbb{E} \left ( p_{j} | {\rm data} \right ) \] where \(EV(j)\) are for individual states
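A sketch of this simulation in R, using rdirichlet from the gtools package (MCMCpack provides an equivalent); the two states and their poll numbers are taken from the polling table below, and 10,000 draws per state is an arbitrary choice:

library(gtools)                                    # provides rdirichlet
polls = data.frame(state = c("Florida", "Colorado"),
                   O = c(50, 52), M = c(46, 45), EV = c(27, 9))
win_prob = sapply(1:nrow(polls), function(j) {
  other = 100 - polls$O[j] - polls$M[j]            # remaining share
  draws = rdirichlet(10000, 1 + c(polls$O[j], polls$M[j], other))  # flat prior, alpha = 1
  mean(draws[, 1] > draws[, 2])                    # P(Obama share > Mitt share) in state j
})
sum(polls$EV * win_prob)                           # expected electoral votes from these two states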

Polling Data: electoral-vote.com

Electoral Vote (EV), Polling Data: Mitt and Obama percentages

State M.pct O.pct EV
Alabama 58 36 9
Alaska 55 37 3
Arizona 50 46 10
Arkansas 51 44 6
California 33 55 55
Colorado 45 52 9
Connecticut 31 56 7
Delaware 38 56 3
D.C. 13 82 3
Florida 46 50 27
Georgia 52 47 15
Hawaii 32 63 4
Idaho 68 26 4
Illinois 35 59 21
Indiana 48 48 11
Iowa 37 54 7
Kansas 63 31 6
Kentucky 51 42 8
Louisiana 50 43 9
Maine 35 56 4
Maryland 39 54 10
Massachusetts 34 53 12
Michigan 37 53 17
Minnesota 42 53 10
Mississippi 46 33 6

Polling Data:

Election 2008 Prediction. Obama \(370\)

Election 2012 Prediction. Obama \(332\).

Chicago Bears 2014-2015 Season

Bayes Learning: Update our beliefs in light of new information

  • In the 2014-2015 season the Bears suffered back-to-back defeats in which the opponents scored over \(50\) points:

Patriots-Bears \(51-23\)

Packers-Bears \(55-14\)

  • Their next game was at home against the Minnesota Vikings.

Current line against the Vikings was \(-3.5\) points.

Slightly over a field goal

What’s the Bayes approach to learning the line?

Hierarchical Model

Hierarchical model for the current average win/lose this year \[ \begin{aligned} \bar{y} | \theta & \sim N \left ( \theta , \frac{\sigma^2}{n} \right ) \sim N \left ( \theta , \frac{18.34^2}{9} \right )\\ \theta & \sim N( 0 , \tau^2 ) \end{aligned} \] Here \(n =9\) games so far. With \(s = 18.34\) points

Pre-season prior mean \(\mu_0 = 0\), standard deviation \(\tau = 4\).

Record so far: \(\bar{y} = -9.22\).

Chicago Bears

Bayes Shrinkage estimator \[ \mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = \frac{ \tau^2 }{ \tau^2 + \frac{\sigma^2}{n} } \bar{y} \]

The Shrinkage factor is \(0.3\)!!

That’s quite a bit of shrinkage. Why?

  • Our updated estimator is

\[ \mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = - 2.76 > -3.5 \] where the current line is \(-3.5\).

  • Based on our hierarchical model this is an over-reaction.

One point change on the line is about \(3\)% on a probability scale.

Alternatively, calculate a market-based \(\tau\) given line \(=-3.5\).

Chicago Bears

In the last two defeats (2014-15 season) the opponents scored over \(50\) points.

bears = c(-3, 8, 8, -21, -7, 14, -13, -28, -41)  # score differentials in the first 9 games
mean(bears)
# [1] -9.222222
sd(bears)
# [1] 18.34242
tau = 4                                          # prior standard deviation
sig2 = sd(bears)*sd(bears)/9                     # sigma^2/n
tau^2/(sig2 + tau^2)                             # shrinkage factor
# [1] 0.2997225
pnorm(-2.76/18)                                  # P(margin > 0) under N(-2.76, 18^2)
# [1] 0.4390677

Home advantage is worth \(3\) points; the Vikings had an average record.

Result: Bears 21, Vikings 13

Stein’s Paradox

Stein’s paradox: it is possible to improve uniformly on the MLE in terms of MSE.

  • Mistrust of the statistical interpretation of Stein’s result.

In particular, the loss function.

  • Difficulties in adapting the procedure to special cases

  • Long familiarity with good properties for the MLE

Any gains from a “complicated” procedure could not be worth the extra trouble (Tukey: savings of no more than 10% in practice)

For \(k\ge 3\), we have the remarkable inequality \[ MSE(\hat \theta_{JS},\theta) < MSE(\bar y,\theta) \; \forall \theta \] Bias-variance explanation! Inadmissibility of the classical estimator.

Baseball Batting Averages

Data: 18 major-league players after 45 at bats (1970 season)

Player  \(\bar{y}_i\)  \(E ( p_i | D )\)  Season average
Clemente 0.400 0.290 0.346
Robinson 0.378 0.286 0.298
Howard 0.356 0.281 0.276
Johnstone 0.333 0.277 0.222
Berry 0.311 0.273 0.273
Spencer 0.311 0.273 0.270
Kessinger 0.311 0.268 0.263
Alvarado 0.267 0.264 0.210
Santo 0.244 0.259 0.269
Swoboda 0.244 0.259 0.230
Unser 0.222 0.254 0.264
Williams 0.222 0.254 0.256
Scott 0.222 0.254 0.303
Petrocelli 0.222 0.254 0.264
Rodriguez 0.222 0.254 0.226
Campaneris 0.200 0.259 0.285
Munson 0.178 0.244 0.316
Alvis 0.156 0.239 0.200

Baseball Data

First Shrinkage Estimator: Efron and Morris

Baseball Shrinkage

Shrinkage

Let \(\theta_i\) denote the end of season average

  • Lindley: shrink to the overall grand mean

\[ c = 1 - \frac{ ( k - 3 ) \sigma^2 }{ \sum ( \bar{y}_i - \bar{y} )^2 } \] where \(\bar{y}\) is the overall grand mean and

\[ \hat{\theta} = c \bar{y}_i + ( 1 - c ) \bar{y} \]

  • Baseball data: \(c = 0.212\) and \(\bar{y} = 0.265\).

Compute \(\sum_i ( \hat{\theta}_i - \theta_i )^2\), where \(\theta_i\) is the end-of-season average, and see which is lower: \[ MLE = 0.077 \; \; STEIN = 0.022 \] That’s a factor of \(3.5\) better!
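The shrinkage estimates are immediate once \(c\) and the grand mean are in hand; a small R sketch with the values quoted above:

cc = 0.212; ybar = 0.265              # shrinkage factor and grand mean from the slide
shrink = function(y) cc*y + (1 - cc)*ybar
shrink(0.400)                         # Clemente: about 0.294, as on the next slide
shrink(0.156)                         # Alvis: pulled up toward the grand mean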

Batting Averages

Baseball Paradoxes

Shrinkage on Clemente too severe: \(z_{Cl} = 0.265 + 0.212 ( 0.400 - 0.265) = 0.294\).

The \(0.212\) seems a little severe

  • Limited translation rules: bound the maximum shrinkage, e.g. at 80%

  • Not enough shrinkage eg O’Connor ( \(y = 1 , n = 2\)). \(z_{O'C} = 0.265 + 0.212 ( 0.5 - 0.265 ) = 0.421\).

Still better than Ted Williams \(0.406\) in 1941.

  • Adding an unrelated series such as foreign car sales (\(k = 19\)) will further improve the overall MSE performance! It will also change the shrinkage factors.

  • Clearly an improvement over the Stein estimator is

\[ \hat{\theta}_{S+} = \max \left ( \left ( 1 - \frac{k-2}{ \sum \bar{Y}_i^2 } \right ) , 0 \right ) \bar{Y}_i \]

Baseball Prior

Include extra prior knowledge

Empirical distribution of all major league players \[ \theta_i \sim N ( 0.270 , 0.015 ) \] The \(0.270\) provides another origin to shrink to and the prior variance \(0.015\) would give a different shrinkage factor.

To fully understand this, we should build a probabilistic model and use the posterior mean as our estimator of the unknown parameters.

Shrinkage: Unequal Variances

Model \(Y_i | \theta_i \sim N ( \theta_i , D_i )\) where \(\theta_i \sim N ( \theta_0 , A ) = N ( 0.270 , 0.015 )\).

  • The \(D_i\) can be different – unequal variances

  • Bayes posterior means are given by

\[ E ( \theta_i | Y ) = ( 1 - B_i ) Y_i \; \; {\rm where} \; \; B_i = \frac{ D_i }{ D_i + A } \] where \(\hat{A}\) is estimated from the data, see Efron and Morris (1975).

  • Different shrinkage factors as different variances \(D_i\).

\(D_i \propto n_i^{-1}\) and so smaller sample sizes are shrunk more.

Makes sense.

Example: Toxoplasmosis Data

Disease of Blood that is endemic in tropical regions.

Data: \(5000\) people in El Salvador (varying sample sizes) from \(36\) cities.

  • Estimate “true” prevalences \(\theta_i\) for \(1 \leq i \leq 36\)

  • Allocation of Resources: should we spend funds on the city with the highest observed occurrence of the disease? Same shrinkage factors?

  • Shrinkage Procedure (Efron and Morris, p. 315) \[ z_i = c_i y_i \] where \(y_i\) are the observed relative rates (normalized so that \(\bar{y} = 0\)). The smaller sample sizes get shrunk more.

The gentlest shrinkage factors are in the range \(0.6\) to \(0.9\), but some are as small as \(0.1\) to \(0.3\).

Bayes Portfolio Selection

de Finetti and Markowitz: Mean-variance portfolio shrinkage: \(\frac{1}{\gamma} \Sigma^{-1} \mu\)

Different shrinkage factors for different history lengths.

Portfolio Allocation in the SP500 index

Entry/exit; splits; spin-offs etc. For example, 73 replacements to the SP500 index in period 1/1/94 to 12/31/96.

Advantage: \(E ( \alpha | D_t ) = 0.39\), that is 39 bps per month which on an annual basis is \(\alpha = 468\) bps.

The posterior mean for \(\beta\) is \(E ( \beta | D_t ) = 0.745\).

\(\bar{x}_{M} = 12.25 \%\) and \(\bar{x}_{PT} = 14.05 \%\).

SP Composition

Date Symbol 6/96 12/89 12/79 12/69
General Electric GE 2.800 2.485 1.640 1.569
Coca Cola KO 2.342 1.126 0.606 1.051
Exxon XON 2.142 2.672 3.439 2.957
ATT T 2.030 2.090 5.197 5.948
Philip Morris MO 1.678 1.649 0.637 *****
Royal Dutch RD 1.636 1.774 1.191 *****
Merck MRK 1.615 1.308 0.773 0.906
Microsoft MSFT 1.436 ***** ***** *****
Johnson/Johnson JNJ 1.320 0.845 0.689 *****
Intel INTC 1.262 ***** ***** *****
Procter and Gamble PG 1.228 1.040 0.871 0.993
Walmart WMT 1.208 1.084 ***** *****
IBM IBM 1.181 2.327 5.341 9.231
Hewlett Packard HWP 1.105 0.477 0.497 *****
Pepsi PEP 1.061 0.719 ***** *****

SP Composition

Date Symbol 6/96 12/89 12/79 12/69
Pfizer PFE 0.918 0.491 0.408 0.486
Dupont DD 0.910 1.229 0.837 1.101
AIG AIG 0.910 0.723 ***** *****
Mobil MOB 0.906 1.093 1.659 1.040
Bristol Myers Squibb BMY 0.878 1.247 ***** 0.484
GTE GTE 0.849 0.975 0.593 0.705
General Motors GM 0.848 1.086 2.079 4.399
Disney DIS 0.839 0.644 ***** *****
Citicorp CCI 0.831 0.400 0.418 *****
BellSouth BLS 0.822 1.190 ***** *****
Motorola MOT 0.804 ***** ***** *****
Ford F 0.798 0.883 0.485 0.640
Chevron CHV 0.794 0.990 1.370 0.966
Amoco AN 0.733 1.198 1.673 0.758
Eli Lilly LLY 0.720 0.814 ***** *****

SP Composition

Date Symbol 6/96 12/89 12/79 12/69
Abbott Labs ABT 0.690 0.654 ***** *****
AmerHome Products AHP 0.686 0.716 0.606 0.793
FedNatlMortgage FNM 0.686 ***** ***** *****
McDonald’s MCD 0.686 0.545 ***** *****
Ameritech AIT 0.639 0.782 ***** *****
Cisco Systems CSCO 0.633 ***** ***** *****
CMB CMB 0.621 ***** ***** *****
SBC SBC 0.612 0.819 ***** *****
Boeing BA 0.598 0.584 0.462 *****
MMM MMM 0.581 0.762 0.838 1.331
BankAmerica BAC 0.560 ***** 0.577 *****
Bell Atlantic BEL 0.556 0.946 ***** *****
Gillette G 0.535 ***** ***** *****
Kodak EK 0.524 0.570 1.106 *****
Chrysler C 0.507 ***** ***** 0.367
Home Depot HD 0.497 ***** ***** *****
Colgate COL 0.489 0.499 ***** *****
Wells Fargo WFC 0.478 ***** ***** *****
Nations Bank NB 0.453 ***** ***** *****
Amer Express AXP 0.450 0.621 ***** *****

Keynes versus Buffett: CAPM

keynes = 15.08 + 1.83 market

buffett = 18.06 + 0.486 market

Year Keynes Market
1928 -3.4 7.9
1929 0.8 6.6
1930 -32.4 -20.3
1931 -24.6 -25.0
1932 44.8 -5.8
1933 35.1 21.5
1934 33.1 -0.7
1935 44.3 5.3
1936 56.0 10.2
1937 8.5 -0.5
1938 -40.1 -16.1
1939 12.9 -7.2
1940 -15.6 -12.9
1941 33.5 12.5
1942 -0.9 0.8
1943 53.9 15.6
1944 14.5 5.4
1945 14.6 0.8

King’s College Cambridge

Keynes vs Cash

Brief List of Conjugate Models

Likelihood                            Prior            Posterior
Binomial                              Beta             Beta
Negative Binomial                     Beta             Beta
Poisson                               Gamma            Gamma
Geometric                             Beta             Beta
Exponential                           Gamma            Gamma
Normal (mean unknown)                 Normal           Normal
Normal (variance unknown)             Inverse Gamma    Inverse Gamma
Normal (mean and variance unknown)    Normal/Gamma     Normal/Gamma
Multinomial                           Dirichlet        Dirichlet