Unit 3: Bayesian Inference with Conjugate Pairs: Single Parameter Models
How do we calculate odds for the possible scores in a match? \[ 0-0, \; 1-0, \; 0-1, \; 1-1, \; 2-0, \ldots \] Let \(X=\) Goals scored by Arsenal
\(Y=\) Goals scored by Liverpool
What’s the odds of a team winning? \(\; \; \; P \left ( X> Y \right )\) Odds of a draw? \(\; \; \; P \left ( X = Y \right )\)
z1 = rpois(100, 0.6)   # simulated Arsenal goals in 100 matches
z2 = rpois(100, 1.4)   # simulated Liverpool goals
sum(z1 < z2) / 100     # Team 2 (Liverpool) wins
sum(z1 == z2) / 100    # Draw
Let’s take a historical set of data on scores, then estimate \(\lambda\) with the sample means of the home and away scores.
home team | home goals | away goals | visiting team
---|---|---|---
Chelsea | 2 | 1 | West Ham
Chelsea | 5 | 1 | Sunderland
Watford | 1 | 2 | Chelsea
Chelsea | 3 | 0 | Burnley
\(\dots\) | | |
Our Poisson model fits the empirical data!!
Each team gets an “attack” strength and a “defence” weakness rating. These ratings adjust the home and away average goal estimates.
Man U average away goals \(=1.47\). Prediction: average away goals \(\times\) Man U’s attack strength \(\times\) Hull’s defence weakness \(= 1.47 \times 1.46 \times 1.37 = 2.95\).
Hull average home goals \(=1.47\). Prediction: \(1.47 \times 0.85 \times 0.52 = 0.65\).
Simulation (see the R sketch after the table below).
Team | Expected Goals | 0 | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---|---
Man U | 2.95 | 7 | 22 | 26 | 12 | 11 | 13
Hull City | 0.65 | 49 | 41 | 10 | 0 | 0 | 0
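A minimal simulation sketch in R using the expected-goals values above (Man U 2.95, Hull 0.65) and 100 Poisson draws to mirror the table; the seed is an arbitrary choice.

```r
set.seed(42)
n    <- 100
manu <- rpois(n, 2.95)     # Man U goals per simulated match
hull <- rpois(n, 0.65)     # Hull goals per simulated match
mean(manu > hull)          # proportion of Man U wins
mean(manu == hull)         # proportion of draws
table(hull, manu) / n      # distribution over exact scorelines
```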
A model is only as good as its predictions
In our simulation Man U wins 88 games out of 100, so we should bet whenever the market odds imply a Man U win probability below \(88/100\).
Most likely outcome is 0-3 (12 games out of 100)
The actual outcome was 0-1 (they played on August 27, 2016)
In our simulation 0-1 was the fourth most probable outcome (9 games out of 100)
Modern Statistical/Machine Learning
Asset pricing and corporate finance problems.
Lindley, D.V. Making Decisions
Bernardo, J. and A.F.M. Smith Bayesian Theory
Hierarchical Models and MCMC
Bayesian Nonparametrics
Machine Learning
McGrayne (2012): The Theory that would not Die
History of Bayes-Laplace
Code breaking
Bayes search: Air France \(\ldots\)
Silver (2012): The Signal and The Noise
Presidential Elections
Bayes dominant methodology
Predicting College Basketball/Oscars \(\ldots\)
Explosion of Models and Algorithms starting in the 1950s
Bayesian Regularisation and Sparsity
Hierarchical Models and Shrinkage
Hidden Markov Models
Nonlinear Non-Gaussian State Space Models
Algorithms
Monte Carlo Method (von Neumann and Ulam, 1940s)
Metropolis-Hastings (Metropolis, 1950s)
Gibbs Sampling (Geman and Geman, Gelfand and Smith, 1980s)
Sequential Particle Filtering
Key Idea: Explicit use of probability for summarizing uncertainty.
A probability distribution for data given parameters \[ f(y| \theta ) \; \; \text{Likelihood} \]
A probability distribution for unknown parameters \[ p(\theta) \; \; \text{Prior} \]
Inference for unknowns conditional on observed data
Inverse probability (Bayes Theorem);
Formal decision making (Loss, Utility)
Bayes theorem to derive posterior distributions \[ \begin{aligned} p( \theta | y ) & = \frac{p(y| \theta)p( \theta)}{p(y)} \\ p(y) & = \int p(y| \theta)p( \theta)d \theta \end{aligned} \] Allows you to make probability statements
Hypothesis testing and Sequential problems
A class \(\Pi\) of prior distributions is conjugate for \({\cal F}\) if the posterior distribution is in the class \(\Pi\) for all \(f \in {\cal F} , \pi \in \Pi , y \in {\cal Y}\).
Suppose that \(Y_1 , \ldots , Y_n \sim Ber ( p )\).
Let \(p \sim Beta ( \alpha , \beta )\) where \(( \alpha , \beta )\) are known hyper-parameters.
The beta-family is very flexible
Prior mean \(E ( p ) = \frac{\alpha}{ \alpha + \beta }\).
How do I update my beliefs about a coin toss?
Likelihood for Bernoulli \[ p\left( y|\theta\right) =\prod_{t=1}^{T}p\left( y_{t}|\theta\right) =\theta^{\sum_{t=1}^{T}y_{t}}\left( 1-\theta\right) ^{T-\sum_{t=1}^{T}y_{t}}. \] Initial prior distribution \(\theta\sim\mathcal{B}\left( a,A\right)\) given by \[ p\left( \theta|a,A\right) =\frac{\theta^{a-1}\left( 1-\theta\right) ^{A-1}}{B\left( a,A\right) } \]
Updated posterior distribution is also Beta \[ p\left( \theta|y\right) \sim\mathcal{B}\left( a_{T},A_{T}\right) \; {\rm and} \; a_{T}=a+\sum_{t=1}^{T}y_{t} , A_{T}=A+T-\sum_{t=1}^{T}y_{t} \] The posterior mean and variance are \[ E\left[ \theta|y\right] =\frac{a_{T}}{a_{T}+A_{T}}\text{ and }var\left[ \theta|y\right] =\frac{a_{T}A_{T}}{\left( a_{T}+A_{T}\right) ^{2}\left( a_{T}+A_{T}+1\right) } \]
\(p ( p | \bar{y} )\) is the posterior distribution for \(p\)
\(\bar{y}\) is a sufficient statistic.
Bayes theorem gives \[ \begin{aligned} p ( p | y ) & \propto f ( y | p ) p ( p | \alpha , \beta )\\ & \propto p^{\sum y_i} (1 - p )^{n - \sum y_i } \cdot p^{\alpha - 1} ( 1 - p )^{\beta - 1} \\ & \propto p^{ \alpha + \sum y_i - 1 } ( 1 - p )^{ n - \sum y_i + \beta - 1} \\ & \sim Beta ( \alpha + \sum y_i , \beta + n - \sum y_i ) \end{aligned} \]
The posterior mean is a shrinkage estimator
Combination of sample mean \(\bar{y}\) and prior mean \(E( p )\)
\[ E(p|y) = \frac{\alpha + \sum_{i=1}^n y_i}{\alpha + \beta + n} = \frac{n}{n+ \alpha +\beta} \bar{y} + \frac{\alpha + \beta}{\alpha + \beta+n} \frac{\alpha}{\alpha+\beta} \]
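A small R sketch of the conjugate Beta-Bernoulli update; the data (7 heads in 10 tosses) and the Beta(2, 2) prior are illustrative choices, not from the notes. It checks that the posterior mean equals the shrinkage combination above.

```r
a <- 2; b <- 2                                # prior Beta(a, b)
y <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)          # hypothetical coin tosses
n <- length(y); s <- sum(y)
a_post <- a + s                               # posterior Beta(a + sum(y), b + n - sum(y))
b_post <- b + n - s
a_post / (a_post + b_post)                    # posterior mean
# the same number written as a weighted average of ybar and the prior mean
n / (n + a + b) * mean(y) + (a + b) / (n + a + b) * a / (a + b)
```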
Taleb, The Black Swan: the Impact of the Highly Improbable
Suppose you have only ever seen a sequence of White Swans, having never seen a Black Swan.
What’s the Probability of Black Swan event sometime in the future?
Suppose that after \(T\) trials you have only seen successes \(( y_1 , \ldots , y_T ) = ( 1 , \ldots , 1 )\). The next trial being a success has \[ p( y_{T+1} =1 | y_1 , \ldots , y_T ) = \frac{T+1}{T+2} \] For large \(T\) this is almost certain. Here \(a=A=1\).
Principle of Induction (Hume)
The probability of seeing no Black Swan in the next \(n\) trials is given by \[ p( y_{T+1} =1 , \ldots , y_{T+n} = 1 | y_1 , \ldots , y_T ) = \frac{ T+1 }{ T+n+1 } \rightarrow 0 \; \text{ as } \; n \rightarrow \infty \]
Black Swan will eventually happen – don’t be surprised when it actually happens.
Poisson/Gamma: Suppose that \(Y_1 , \ldots , Y_n \mid \lambda \sim Poi ( \lambda )\).
Let \(\lambda \sim Gamma ( \alpha , \beta )\)
\(( \alpha , \beta )\) are known hyper-parameters.
\[ \begin{aligned} p ( \lambda | y ) & \propto \exp ( - n \lambda ) \lambda^{ \sum y_i } \lambda^{ \alpha - 1 } \exp ( - \beta \lambda ) \\ & \sim Gamma ( \alpha + \sum y_i , n + \beta ) \end{aligned} \]
Novick and Grizzle: Bayesian Analysis of Clinical Trials
Four treatments for duodenal ulcers.
Doctors assess the state of the patient.
Sequential data
(\(\alpha\)-spending function, can only look at prespecified times).
Treat | Excellent | Fair | Death
---|---|---|---
A | 76 | 17 | 7
B | 89 | 10 | 1
C | 86 | 13 | 1
D | 88 | 9 | 3
Conclusion: Cannot reject at the 5% level
Conjugate binomial/beta model+sensitivity analysis.
Let \(p_i\) be the death rate proportion under treatment \(i\).
To compare treatment \(A\) to \(B\) directly compute \(P ( p_1 > p_2 | D )\).
Prior \(beta ( \alpha , \beta )\) with prior mean \(E ( p ) = \frac{\alpha}{\alpha + \beta }\).
Posterior \(beta ( \alpha + \sum x_i , \beta + n - \sum x_i )\)
For \(B\), \(beta ( 1 , 1 ) \rightarrow beta ( 2 , 100 )\)
Important to do a sensitivity analysis.
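A Monte Carlo sketch of \(P( p_A > p_B \mid D)\) under the Beta(1,1) prior, taking 7 deaths in 100 patients for treatment \(A\) and 1 in 100 for \(B\) from the table (so the posteriors are Beta(8, 94) and Beta(2, 100), consistent with the update shown above for \(B\)).

```r
set.seed(1)
pA <- rbeta(1e5, 1 + 7, 1 + 100 - 7)    # posterior draws of treatment A's death rate
pB <- rbeta(1e5, 1 + 1, 1 + 100 - 1)    # posterior draws of treatment B's death rate
mean(pA > pB)                           # posterior probability that A has the higher death rate
```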
Treat | Excellent | Fair | Death
---|---|---|---
A | 76 | 17 | 7
B | 89 | 10 | 1
C | 86 | 13 | 1
D | 88 | 9 | 3
Poisson-Gamma model: prior \(\Gamma ( m , z)\), and let \(\lambda_i\) be the expected death rate under treatment \(i\).
Compute \(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > c | D \right )\)
Prob | (0, 0) | (100, 2) | (200, 5)
---|---|---|---
\(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.3 | D \right )\) | 0.95 | 0.88 | 0.79
\(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.6 | D \right )\) | 0.91 | 0.80 | 0.64
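A simulation sketch for these ratio probabilities. With a \(\Gamma(m, z)\) prior and \(y\) deaths among \(n\) patients, the posterior is \(\Gamma(m + y, z + n)\); the counts (7 and 1 deaths in 100 patients, from the table) and the choice \(m = z = 0\) are one reading of the table's first column, so the simulated values need not reproduce it exactly.

```r
set.seed(1)
m <- 0; z <- 0                         # one of the priors considered in the table
lamA <- rgamma(1e5, m + 7, z + 100)    # posterior draws of treatment A's rate
lamB <- rgamma(1e5, m + 1, z + 100)    # posterior draws of treatment B's rate
mean(lamA / lamB > 1.3)
mean(lamA / lamB > 1.6)
```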
Using Bayes rule we get \[ p( \mu | y ) \propto p( y| \mu ) p( \mu ) \]
\[ p( \mu | y ) \propto \exp \left ( - \frac{1}{2 \sigma^2} \sum_{i=1}^n ( y_i - \mu )^2 - \frac{1}{2 \tau^2} ( \mu - \mu_0 )^2 \right ) \] Hence \(\mu | y \sim N \left ( \hat{\mu}_B , V_{\mu} \right )\) where
\[ \hat{\mu}_B = \frac{ n / \sigma^2 }{ n / \sigma^2 + 1 / \tau^2 } \bar{y} + \frac{ 1 / \tau^2 }{ n / \sigma^2 + 1 / \tau^2 }\mu_0 \; \; {\rm and} \; \; V_{\mu}^{-1} = \frac{n}{ \sigma^2 } + \frac{1}{\tau^2} \] A shrinkage estimator.
SAT scores (range \(200-800\)): 8 high schools; estimate the coaching effects.
School | Estimated \(y_j\) | St. Error \(\sigma_j\) | Average Treatment \(\theta_j\) |
---|---|---|---|
A | 28 | 15 | ? |
B | 8 | 10 | ? |
C | -3 | 16 | ? |
D | 7 | 11 | ? |
E | -1 | 9 | ? |
F | 1 | 11 | ? |
G | 18 | 10 | ? |
H | 12 | 18 | ? |
\(\theta_j\) average effects of coaching programs
\(y_j\) estimated treatment effects, for school \(j\), standard error \(\sigma_j\).
Two programs appear to work (improvements of \(18\) and \(28\))
Large standard errors. Overlapping Confidence Intervals?
Classical hypothesis test fails to reject the hypothesis that the \(\theta_j\)’s are equal.
Pooled estimate has standard error of \(4.2\) with
\[ \hat{\theta} = \frac{ \sum_j ( y_j / \sigma_j^2 ) }{ \sum_j ( 1 / \sigma_j^2 ) } = 7.9 \]
Bayesian shrinkage!
Hierarchical Model (\(\sigma_j^2\) known) is given by \[ \bar{y}_j | \theta_j \sim N ( \theta_j , \sigma_j^2 ) \] Unequal variances–differential shrinkage.
Traditional random effects model.
Exchangeable prior for the treatment effects.
As \(\tau \rightarrow 0\) (complete pooling) and as \(\tau \rightarrow \infty\) (separate estimates).
The posterior \(p( \mu , \tau^2 | y )\) can be used to “estimate” \(( \mu , \tau^2 )\).
Joint Posterior Distribution \(y = ( y_1 , \ldots , y_J )\) \[ p( \theta , \mu , \tau | y ) \propto p( y| \theta ) p( \theta | \mu , \tau )p( \mu , \tau ) \] \[ \propto p( \mu , \tau^2) \prod_{j=1}^8 N ( \theta_j | \mu , \tau^2 ) \prod_{j=1}^8 N ( y_j | \theta_j , \sigma_j^2 ) \] \[ \propto \tau^{-9} \exp \left ( - \frac{1}{2} \sum_j \frac{1}{\tau^2} ( \theta_j - \mu )^2 - \frac{1}{2} \sum_j \frac{1}{\sigma_j^2} ( y_j - \theta_j )^2 \right ) \] MCMC!
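A minimal Gibbs sampler sketch for this model in R. It assumes a prior that is flat on \(\mu\) and flat on \(\log \tau^2\) (so that \(\tau^2\) has an inverse-gamma full conditional); this is a convenient assumption and may differ slightly from the prior behind the quantiles reported below. The \(\sigma_j\) are treated as known.

```r
y     <- c(28, 8, -3, 7, -1, 1, 18, 12)     # estimated effects y_j
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)   # known standard errors sigma_j
J <- length(y)

n_iter <- 10000
theta  <- matrix(NA, n_iter, J)
mu     <- tau2 <- numeric(n_iter)
theta[1, ] <- y; mu[1] <- mean(y); tau2[1] <- var(y)

for (t in 2:n_iter) {
  # theta_j | mu, tau2, y : conjugate normal update with unequal variances
  prec <- 1 / sigma^2 + 1 / tau2[t - 1]
  m    <- (y / sigma^2 + mu[t - 1] / tau2[t - 1]) / prec
  theta[t, ] <- rnorm(J, m, sqrt(1 / prec))
  # mu | theta, tau2 : Normal(mean(theta), tau2 / J)
  mu[t] <- rnorm(1, mean(theta[t, ]), sqrt(tau2[t - 1] / J))
  # tau2 | theta, mu : Inverse-Gamma(J/2, sum((theta - mu)^2)/2)
  tau2[t] <- 1 / rgamma(1, J / 2, rate = sum((theta[t, ] - mu[t])^2) / 2)
}

burn <- 1:2000
apply(theta[-burn, ], 2, quantile, c(.025, .25, .5, .75, .975))   # per-school quantiles
quantile(mu[-burn],         c(.025, .25, .5, .75, .975))
quantile(sqrt(tau2[-burn]), c(.025, .25, .5, .75, .975))
```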
Report posterior quantiles
School | 2.5% | 25% | 50% | 75% | 97.5%
---|---|---|---|---|---
A | -2 | 6 | 10 | 16 | 32
B | -5 | 4 | 8 | 12 | 20
C | -12 | 3 | 7 | 11 | 22
D | -6 | 4 | 8 | 12 | 21
E | -10 | 2 | 6 | 10 | 19
F | -9 | 2 | 6 | 10 | 19
G | -1 | 6 | 10 | 15 | 27
H | -7 | 4 | 8 | 13 | 23
\(\mu\) | -2 | 5 | 8 | 11 | 18
\(\tau\) | 0.3 | 2.3 | 5.1 | 8.8 | 21
Schools \(A\) and \(G\) are similar!
Bayesian shrinkage provides a way of modeling complex datasets.
Baseball batting averages: Stein’s Paradox
Batter-pitcher match-up: Kenny Lofton and Derek Jeter
Bayes Elections
Toxoplasmosis
Bayes MoneyBall
Bayes Portfolio Selection
Batter-pitcher match-up?
Prior information on overall ability of a player.
Small sample size, pitcher variation.
\[ (y_i | p_i ) \sim Bin ( T_i , p_i ) \; \; {\rm with} \; \; p_i \sim Be ( \alpha , \beta ) \] where \(T_i\) is the number of at-bats against pitcher \(i\). A priori \(E( p_i ) = \alpha / (\alpha+\beta ) = \bar{p}_i\).
Kenny Lofton hitting versus individual pitchers.
Pitcher | At-bats | Hits | ObsAvg
---|---|---|---
J.C. Romero | 9 | 6 | .667
S. Lewis | 5 | 3 | .600
B. Tomko | 20 | 11 | .550
T. Hoffman | 6 | 3 | .500
K. Tapani | 45 | 22 | .489
A. Cook | 9 | 4 | .444
J. Abbott | 34 | 14 | .412
A.J. Burnett | 15 | 6 | .400
K. Rogers | 43 | 17 | .395
A. Harang | 6 | 2 | .333
K. Appier | 49 | 15 | .306
R. Clemens | 62 | 14 | .226
C. Zambrano | 9 | 2 | .222
N. Ryan | 10 | 2 | .200
E. Hanson | 41 | 7 | .171
E. Milton | 19 | 1 | .056
M. Prior | 7 | 0 | .000
Total | 7630 | 2283 | .299
Kenny Lofton (career \(.299\) average, and a current \(.308\) average for the \(2006\) season) was facing the pitcher Milton (current record: \(1\) for \(19\)).
Is putting in a weaker player really a better bet?
Over-reaction to bad luck?
\(\mathbb{P}\left ( \leq 1 \; {\rm hit \; in \; } 19 \; {\rm attempts} | p = 0.3 \right ) = 0.01\)
An unlikely \(1\)-in-\(100\) event.
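A quick check of this tail probability in R (at most one hit in 19 at-bats for a true \(.300\) hitter):

```r
pbinom(1, size = 19, prob = 0.3)   # about 0.01
```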
Bayes solution: shrinkage. Borrow strength across pitchers
Bayes estimate: use the posterior mean
Lofton’s batting estimates vary from \(.265\) to \(.340\).
The lowest being against Milton.
\(.265 < .275\)
Conclusion: resting Lofton against Milton was justified!!
Here’s our model again ...
Small sample sizes and pitcher variation.
Let \(p_i\) denote Lofton’s ability. Observed number of hits \(y_i\)
\[ (y_i | p_i ) \sim Bin ( T_i , p_i ) \; \; {\rm with} \; \; p_i \sim Be ( \alpha , \beta ) \] where \(T_i\) is the number of at-bats against pitcher \(i\).
Estimate \(( \alpha , \beta )\)
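A hedged sketch of the resulting shrinkage once \(( \alpha , \beta )\) are fixed. Here the prior mean is set to Lofton’s career \(.299\) and \(\alpha + \beta = 499\) (borrowing the order of magnitude implied by the \(\hat{\phi} \approx 0.002\) quoted later for Jeter); the actual fitted \(( \alpha , \beta )\) for Lofton will differ, so the number is only indicative of the direction of the shrinkage.

```r
ab    <- 499                       # assumed alpha + beta
pbar  <- 0.299                     # prior mean: career average
alpha <- pbar * ab; beta <- (1 - pbar) * ab
hits <- 1; atbats <- 19            # the Milton match-up
(alpha + hits) / (ab + atbats)     # posterior mean: pulled almost all the way back towards .299
```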
Derek Jeter 2006 season versus individual pitchers.
Pitcher | At-bats | Hits | ObsAvg | EstAvg | 95% Int |
---|---|---|---|---|---|
R. Mendoza | 6 | 5 | .833 | .322 | (.282, .394) |
H. Nomo | 20 | 12 | .600 | .326 | (.289, .407) |
A.J.Burnett | 5 | 3 | .600 | .320 | (.275, .381) |
E. Milton | 28 | 14 | .500 | .324 | (.291, .397) |
D. Cone | 8 | 4 | .500 | .320 | (.218, .381) |
R. Lopez | 45 | 21 | .467 | .326 | (.291, .401) |
K. Escobar | 39 | 16 | .410 | .322 | (.281, .386) |
J. Wettland | 5 | 2 | .400 | .318 | (.275, .375) |
T. Wakefield | 81 | 26 | .321 | .318 | (.279, .364) |
P. Martinez | 83 | 21 | .253 | .312 | (.254, .347) |
K. Benson | 8 | 2 | .250 | .317 | (.264, .368) |
T. Hudson | 24 | 6 | .250 | .315 | (.260, .362) |
J. Smoltz | 5 | 1 | .200 | .314 | (.253, .355) |
F. Garcia | 25 | 5 | .200 | .314 | (.253, .355) |
B. Radke | 41 | 8 | .195 | .311 | (.247, .347) |
D. Kolb | 5 | 0 | .000 | .316 | (.258, .363) |
J. Julio | 13 | 0 | .000 | .312 | (.243, .350 ) |
Total | 6530 | 2061 | .316 |
Stern estimates \(\hat{\phi} = ( \alpha + \beta + 1 )^{-1} = 0.002\) for Jeter
Doesn’t vary much across the population of pitchers.
The extremes are shrunk the most, as are the match-ups with the smallest sample sizes.
Jeter had a season \(.308\) average.
Bayes estimates vary from \(.311\) to \(.327\)–he’s very consistent.
If all players had a similar record then a constant batting average would make sense.
Predicting the Electoral Vote (EV)
\[ p_{Obama} = ( p_{1}, \ldots ,p_{51} | \hat{p}) \sim Dir \left ( \alpha + \hat{p} \right ) \]
http://www.electoral-vote.com/evp2012/Pres/prespolls.csv
Calculate probabilities via simulation: rdirichlet
\[
p \left ( p_{j,O} | {\rm data} \right ) \;\; {\rm and} \; \;
p \left ( EV >270 | {\rm data} \right )
\]
The election vote prediction is given by the sum \[ EV =\sum_{j=1}^{51} EV(j) \mathbb{E} \left ( p_{j} | {\rm data} \right ) \] where \(EV(j)\) is the number of electoral votes for state \(j\)
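A simulation sketch of this calculation using rdirichlet (available in the gtools or MCMCpack packages). The three “states” below, their poll counts and their electoral votes are made-up illustration values, not the 51-state data from the CSV above; with the full data the final line gives \(P(EV > 270 \mid {\rm data})\).

```r
library(gtools)                              # for rdirichlet
set.seed(1)
polls <- data.frame(state = c("S1", "S2", "S3"),
                    O  = c(52, 46, 50),      # Obama poll percentage
                    M  = c(45, 50, 47),      # Romney poll percentage
                    EV = c(29, 38, 20))      # electoral votes
alpha <- c(1, 1)                             # Dirichlet prior over the two-party split
n_sim <- 10000
ev <- replicate(n_sim, {
  win <- apply(polls[, c("O", "M")], 1, function(x) {
    p <- rdirichlet(1, alpha + x)            # posterior draw of (Obama, Romney) shares
    p[1] > p[2]                              # does Obama carry this state in the draw?
  })
  sum(polls$EV[win])                         # Obama's electoral votes in this draw
})
mean(ev > sum(polls$EV) / 2)                 # with the full 51-state data: P(EV > 270 | data)
```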
electoral-vote.com
Electoral Vote (EV), Polling Data: Mitt and Obama percentages
State | M.pct | O.pct | EV |
---|---|---|---|
Alabama | 58 | 36 | 9 |
Alaska | 55 | 37 | 3 |
Arizona | 50 | 46 | 10 |
Arkansas | 51 | 44 | 6 |
California | 33 | 55 | 55 |
Colorado | 45 | 52 | 9 |
Connecticut | 31 | 56 | 7 |
Delaware | 38 | 56 | 3 |
D.C. | 13 | 82 | 3 |
Florida | 46 | 50 | 27 |
Georgia | 52 | 47 | 15 |
Hawaii | 32 | 63 | 4 |
Idaho | 68 | 26 | 4 |
Illinois | 35 | 59 | 21 |
Indiana | 48 | 48 | 11 |
Iowa | 37 | 54 | 7 |
Kansas | 63 | 31 | 6 |
Kentucky | 51 | 42 | 8 |
Louisiana | 50 | 43 | 9 |
Maine | 35 | 56 | 4 |
Maryland | 39 | 54 | 10 |
Massachusetts | 34 | 53 | 12 |
Michigan | 37 | 53 | 17 |
Minnesota | 42 | 53 | 10 |
Mississippi | 46 | 33 | 6 |
Bayes Learning: Update our beliefs in light of new information
The Bears gave up over \(50\) points in back-to-back defeats.
Patriots-Bears \(51-23\)
Packers-Bears \(55-14\)
Current line against the Vikings was \(-3.5\) points.
Slightly over a field goal
What’s the Bayes approach to learning the line?
Hierarchical model for the current average win/lose this year \[ \begin{aligned} \bar{y} | \theta & \sim N \left ( \theta , \frac{\sigma^2}{n} \right ) \sim N \left ( \theta , \frac{18.34^2}{9} \right )\\ \theta & \sim N( 0 , \tau^2 ) \end{aligned} \] Here \(n =9\) games so far, with \(s = 18.34\) points.
Pre-season prior mean \(\mu_0 = 0\), standard deviation \(\tau = 4\).
Record so far: data \(\bar{y} = -9.22\).
Bayes Shrinkage estimator \[ \mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = \frac{ \tau^2 }{ \tau^2 + \frac{\sigma^2}{n} } \bar{y} \]
The Shrinkage factor is \(0.3\)!!
That’s quite a bit of shrinkage. Why?
\[ \mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = - 2.76 > -3.5 \] where the current line is \(-3.5\).
One point change on the line is about \(3\)% on a probability scale.
Alternatively, calculate a market-based \(\tau\) given line \(=-3.5\).
In the last two defeats the opponents scored over \(50\) points (2014-15 season).
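A hedged reconstruction of the calculation behind the output printed below, working from the summary statistics rather than the game-by-game margins (which are not listed here); it reproduces the mean, standard deviation and shrinkage factor, while the fourth printed value is not reproduced here.

```r
ybar <- -9.222222                       # average win/lose margin over the first n games
s    <- 18.34242                        # sample standard deviation of the margins
n    <- 9                               # games played so far
tau  <- 4                               # pre-season prior standard deviation
shrink <- tau^2 / (tau^2 + s^2 / n)     # shrinkage factor, about 0.30
shrink
shrink * ybar                           # posterior mean, about -2.76, versus the -3.5 line
```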
[1] -9.222222
[1] 18.34242
[1] 0.2997225
[1] 0.4390677
Home advantage is worth about \(3\) points; the Vikings have an average record.
Result: Bears 21, Vikings 13
Stein paradox: possible to make a uniform improvement on the MLE in terms of MSE.
In particular, the improvement depends on the choice of loss function (total squared error).
Difficulties in adapting the procedure to special cases
Long familiarity with good properties for the MLE
Any gains from a “complicated” procedure could not be worth the extra trouble (Tukey, savings not more than 10 % in practice)
For \(k\ge 3\), we have the remarkable inequality \[ MSE(\hat \theta_{JS},\theta) < MSE(\bar y,\theta) \; \forall \theta \] Bias-variance explanation! Inadmissibility of the classical estimator.
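A small simulation sketch of this dominance in R, using the positive-part James-Stein estimator shown later (shrinking towards 0); \(k = 10\), the true means and the unit variance are arbitrary illustration choices.

```r
set.seed(1)
k     <- 10
theta <- rnorm(k, 0, 1)          # true means (arbitrary)
n_rep <- 5000
sse <- matrix(NA, n_rep, 2, dimnames = list(NULL, c("MLE", "JS")))
for (r in 1:n_rep) {
  y  <- rnorm(k, theta, 1)                     # one observation per mean, sd = 1
  js <- max(1 - (k - 2) / sum(y^2), 0) * y     # positive-part James-Stein estimate
  sse[r, ] <- c(sum((y - theta)^2), sum((js - theta)^2))
}
colMeans(sse)                    # JS has the smaller total squared error
```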
Data: 18 major-league players after 45 at bats (1970 season)
Player | \(\bar{y}_i\) | \(E ( p_i | D )\) | average season |
---|---|---|---|
Clemente | 0.400 | 0.290 | 0.346 |
Robinson | 0.378 | 0.286 | 0.298 |
Howard | 0.356 | 0.281 | 0.276 |
Johnstone | 0.333 | 0.277 | 0.222 |
Berry | 0.311 | 0.273 | 0.273 |
Spencer | 0.311 | 0.273 | 0.270 |
Kessinger | 0.311 | 0.268 | 0.263 |
Alvarado | 0.267 | 0.264 | 0.210 |
Santo | 0.244 | 0.259 | 0.269 |
Swoboda | 0.244 | 0.259 | 0.230 |
Unser | 0.222 | 0.254 | 0.264 |
Williams | 0.222 | 0.254 | 0.256 |
Scott | 0.222 | 0.254 | 0.303 |
Petrocelli | 0.222 | 0.254 | 0.264 |
Rodriguez | 0.222 | 0.254 | 0.226 |
Campaneris | 0.200 | 0.259 | 0.285 |
Munson | 0.178 | 0.244 | 0.316 |
Alvis | 0.156 | 0.239 | 0.200 |
Let \(\theta_i\) denote the end of season average
\[ c = 1 - \frac{ ( k - 3 ) \sigma^2 }{ \sum ( \bar{y}_i - \bar{y} )^2 } \] where \(\bar{y}\) is the overall grand mean and
\[ \hat{\theta} = c \bar{y}_i + ( 1 - c ) \bar{y} \]
Compute \(\sum_i ( \hat{\theta}_i - \theta_i )^2\) for both estimators, using the end-of-season averages as \(\theta_i\), and see which is lower: \[ MLE = 0.077 \; \; STEIN = 0.022 \] That’s a factor of \(3.5\) times better!
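An R sketch of this calculation from the table above; \(\sigma^2\) is approximated by \(\bar{p}(1-\bar{p})/45\) at the grand mean, so the shrinkage factor and the squared errors come out close to, but not exactly, the \(0.212\), \(0.077\) and \(0.022\) quoted.

```r
ybar_i <- c(.400, .378, .356, .333, .311, .311, .311, .267, .244, .244,
            .222, .222, .222, .222, .222, .200, .178, .156)     # first 45 at-bats
season <- c(.346, .298, .276, .222, .273, .270, .263, .210, .269, .230,
            .264, .256, .303, .264, .226, .285, .316, .200)     # rest-of-season averages
k      <- length(ybar_i)
gbar   <- mean(ybar_i)                            # grand mean, about 0.266
sigma2 <- gbar * (1 - gbar) / 45                  # binomial variance of a 45-at-bat average
c_js   <- 1 - (k - 3) * sigma2 / sum((ybar_i - gbar)^2)
c_js                                              # shrinkage factor
theta_hat <- c_js * ybar_i + (1 - c_js) * gbar    # Stein estimates, shrunk to the grand mean
sum((ybar_i    - season)^2)                       # squared error of the MLE
sum((theta_hat - season)^2)                       # squared error of the Stein estimator
```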
Shrinkage on Clemente too severe: \(z_{Cl} = 0.265 + 0.212 ( 0.400 - 0.265) = 0.294\).
The \(0.212\) seems a little severe
Limited translation rules: cap the maximum shrinkage, e.g. at 80%.
Not enough shrinkage, e.g. O’Connor (\(y = 1 , n = 2\)): \(z_{O'C} = 0.265 + 0.212 ( 0.5 - 0.265 ) = 0.421\).
Still better than Ted Williams \(0.406\) in 1941.
Adding an unrelated series such as foreign car sales (\(k = 19\)) will further improve total MSE performance! But it changes the shrinkage factors.
Clearly an improvement over the Stein estimator is
\[ \hat{\theta}_{S+} = \max \left ( \left ( 1 - \frac{k-2}{ \sum \bar{Y}_i^2 } \right ) , 0 \right ) \bar{Y}_i \]
Include extra prior knowledge
Empirical distribution of all major league players \[ \theta_i \sim N ( 0.270 , 0.015 ) \] The \(0.270\) provides another origin to shrink to and the prior variance \(0.015\) would give a different shrinkage factor.
To fully understand this, perhaps we should build a full probabilistic model and use the posterior mean as our estimator of the unknown parameters.
Model \(Y_i | \theta_i \sim N ( \theta_i , D_i )\) where \(\theta_i \sim N ( \theta_0 , A ) \sim N ( 0.270 , 0.015 )\).
The \(D_i\) can be different – unequal variances
Bayes posterior means are given by
\[ E ( \theta_i | Y ) = ( 1 - B_i ) Y_i \; \; {\rm where} \; \; B_i = \frac{ D_i }{ D_i + A } \] with \(A\) replaced by an estimate \(\hat{A}\) from the data; see Efron and Morris (1975).
\(D_i \propto n_i^{-1}\) and so smaller sample sizes are shrunk more.
Makes sense.
A disease of the blood that is endemic in tropical regions.
Data: \(5000\) people in El Salvador (varying sample sizes) from \(36\) cities.
Estimate “true” prevalences \(\theta_i\) for \(1 \leq i \leq 36\)
Allocation of Resources: should we spend funds on the city with the highest observed occurrence of the disease? Same shrinkage factors?
Shrinkage procedure (Efron and Morris, p. 315): \[ z_i = c_i y_i \] where \(y_i\) are the observed relative rates (normalized so that \(\bar{y} = 0\)). The smaller sample sizes get shrunk more.
The gentlest shrinkage factors are in the range \(0.6\) to \(0.9\), but some are as low as \(0.1\) to \(0.3\).
de Finetti and Markowitz: Mean-variance portfolio shrinkage: \(\frac{1}{\gamma} \Sigma^{-1} \mu\)
Different shrinkage factors for different history lengths.
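A toy R sketch of the mean-variance rule \(w = \frac{1}{\gamma} \Sigma^{-1} \mu\); the three assets, their expected excess returns, the covariance matrix and the risk aversion \(\gamma\) are made-up illustration values.

```r
mu    <- c(0.08, 0.05, 0.03)                    # expected excess returns
Sigma <- matrix(c(0.040, 0.010, 0.005,
                  0.010, 0.030, 0.008,
                  0.005, 0.008, 0.020), 3, 3)   # return covariance matrix
gamma <- 4                                      # risk aversion
w <- solve(Sigma, mu) / gamma                   # (1/gamma) * Sigma^{-1} mu
w
```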
Portfolio Allocation in the SP500 index
Entry/exit; splits; spin-offs etc. For example, 73 replacements to the SP500 index in period 1/1/94 to 12/31/96.
Advantage: \(E ( \alpha | D_t ) = 0.39\), that is 39 bps per month which on an annual basis is \(\alpha = 468\) bps.
The posterior mean for \(\beta\) is \(E ( \beta | D_t ) = 0.745\)
\(\bar{x}_{M} = 12.25 \%\) and \(\bar{x}_{PT} = 14.05 \%\).
Date | Symbol | 6/96 | 12/89 | 12/79 | 12/69 |
---|---|---|---|---|---|
General Electric | GE | 2.800 | 2.485 | 1.640 | 1.569 |
Coca Cola | KO | 2.342 | 1.126 | 0.606 | 1.051 |
Exxon | XON | 2.142 | 2.672 | 3.439 | 2.957 |
ATT | T | 2.030 | 2.090 | 5.197 | 5.948 |
Philip Morris | MO | 1.678 | 1.649 | 0.637 | ***** |
Royal Dutch | RD | 1.636 | 1.774 | 1.191 | ***** |
Merck | MRK | 1.615 | 1.308 | 0.773 | 0.906 |
Microsoft | MSFT | 1.436 | ***** | ***** | ***** |
Johnson/Johnson | JNJ | 1.320 | 0.845 | 0.689 | ***** |
Intel | INTC | 1.262 | ***** | ***** | ***** |
Procter and Gamble | PG | 1.228 | 1.040 | 0.871 | 0.993 |
Walmart | WMT | 1.208 | 1.084 | ***** | ***** |
IBM | IBM | 1.181 | 2.327 | 5.341 | 9.231 |
Hewlett Packard | HWP | 1.105 | 0.477 | 0.497 | ***** |
Pepsi | PEP | 1.061 | 0.719 | ***** | ***** |
Date | Symbol | 6/96 | 12/89 | 12/79 | 12/69 |
---|---|---|---|---|---|
Pfizer | PFE | 0.918 | 0.491 | 0.408 | 0.486 |
Dupont | DD | 0.910 | 1.229 | 0.837 | 1.101 |
AIG | AIG | 0.910 | 0.723 | ***** | ***** |
Mobil | MOB | 0.906 | 1.093 | 1.659 | 1.040 |
Bristol Myers Squibb | BMY | 0.878 | 1.247 | ***** | 0.484 |
GTE | GTE | 0.849 | 0.975 | 0.593 | 0.705 |
General Motors | GM | 0.848 | 1.086 | 2.079 | 4.399 |
Disney | DIS | 0.839 | 0.644 | ***** | ***** |
Citicorp | CCI | 0.831 | 0.400 | 0.418 | ***** |
BellSouth | BLS | 0.822 | 1.190 | ***** | ***** |
Motorola | MOT | 0.804 | ***** | ***** | ***** |
Ford | F | 0.798 | 0.883 | 0.485 | 0.640 |
Chevron | CHV | 0.794 | 0.990 | 1.370 | 0.966 |
Amoco | AN | 0.733 | 1.198 | 1.673 | 0.758 |
Eli Lilly | LLY | 0.720 | 0.814 | ***** | ***** |
Date | Symbol | 6/96 | 12/89 | 12/79 | 12/69 |
---|---|---|---|---|---|
Abbott Labs | ABT | 0.690 | 0.654 | ***** | ***** |
AmerHome Products | AHP | 0.686 | 0.716 | 0.606 | 0.793 |
FedNatlMortgage | FNM | 0.686 | ***** | ***** | ***** |
McDonald’s | MCD | 0.686 | 0.545 | ***** | ***** |
Ameritech | AIT | 0.639 | 0.782 | ***** | ***** |
Cisco Systems | CSCO | 0.633 | ***** | ***** | ***** |
CMB | CMB | 0.621 | ***** | ***** | ***** |
SBC | SBC | 0.612 | 0.819 | ***** | ***** |
Boeing | BA | 0.598 | 0.584 | 0.462 | ***** |
MMM | MMM | 0.581 | 0.762 | 0.838 | 1.331 |
BankAmerica | BAC | 0.560 | ***** | 0.577 | ***** |
Bell Atlantic | BEL | 0.556 | 0.946 | ***** | ***** |
Gillette | G | 0.535 | ***** | ***** | ***** |
Kodak | EK | 0.524 | 0.570 | 1.106 | ***** |
Chrysler | C | 0.507 | ***** | ***** | 0.367 |
Home Depot | HD | 0.497 | ***** | ***** | ***** |
Colgate | COL | 0.489 | 0.499 | ***** | ***** |
Wells Fargo | WFC | 0.478 | ***** | ***** | ***** |
Nations Bank | NB | 0.453 | ***** | ***** | ***** |
Amer Express | AXP | 0.450 | 0.621 | ***** | ***** |
Fitted regressions of annual fund returns on the market return:
keynes = 15.08 + 1.83 market
buffett = 18.06 + 0.486 market
Year | Keynes | Market |
---|---|---|
1928 | -3.4 | 7.9 |
1929 | 0.8 | 6.6 |
1930 | -32.4 | -20.3 |
1931 | -24.6 | -25.0 |
1932 | 44.8 | -5.8 |
1933 | 35.1 | 21.5 |
1934 | 33.1 | -0.7 |
1935 | 44.3 | 5.3 |
1936 | 56.0 | 10.2 |
1937 | 8.5 | -0.5 |
1938 | -40.1 | -16.1 |
1939 | 12.9 | -7.2 |
1940 | -15.6 | -12.9 |
1941 | 33.5 | 12.5 |
1942 | -0.9 | 0.8 |
1943 | 53.9 | 15.6 |
1944 | 14.5 | 5.4 |
1945 | 14.6 | 0.8 |
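A sketch fitting the Keynes regression quoted above from the annual returns in the table (1928-1945); since the table may be a rounded or partial version of the original data, the fitted coefficients need not match \(15.08\) and \(1.83\) exactly.

```r
keynes <- c(-3.4, 0.8, -32.4, -24.6, 44.8, 35.1, 33.1, 44.3, 56.0,
             8.5, -40.1, 12.9, -15.6, 33.5, -0.9, 53.9, 14.5, 14.6)
market <- c( 7.9, 6.6, -20.3, -25.0, -5.8, 21.5, -0.7,  5.3, 10.2,
            -0.5, -16.1, -7.2, -12.9, 12.5,  0.8, 15.6,  5.4,  0.8)
coef(lm(keynes ~ market))    # intercept ("alpha") and slope ("beta")
```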
King’s College Cambridge
Likelihood | Prior | Posterior |
---|---|---|
Binomial | Beta | Beta |
Negative Binomial | Beta | Beta |
Poisson | Gamma | Gamma |
Geometric | Beta | Beta |
Exponential | Gamma | Gamma |
Normal (mean unknown) | Normal | Normal |
Normal (variance unknown) | Inverse Gamma | Inverse Gamma |
Normal (mean and variance unknown) | Normal/Gamma | Normal/Gamma |
Multinomial | Dirichlet | Dirichlet |