Machine Learning: From Probability to Prediction
Vadim Sokolov
George Mason University
Why Should Executives Care?
| Business Question | Distribution |
|---|---|
| Will the customer buy? | Binomial |
| How many orders today? | Poisson |
| What’s the forecast error? | Normal |
Choosing the right distribution is the first step in building a reliable model. Wrong distribution = wrong predictions!
Models the number of successes in \(n\) independent trials, each with probability \(p\)
\[P(X=k) = \binom{n}{k} p^k(1-p)^{n-k}\]
Key Parameters:
Examples: A/B test conversions, click-through rates, quality defects
The Patriots won 19 out of 25 coin tosses in 2014-15. How likely?
The “Law of Large Numbers” Perspective:
With 32 NFL teams over 20+ years, some team will have a suspicious streak!
Key insight: The probability that the Patriots specifically win 19 of 25 tosses is about 0.5%. But the probability that some team, among 32 over many seasons, shows such a streak is much higher!
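Both numbers can be checked directly in R. The second calculation is a rough sketch that treats the 32 teams' toss records as independent:

```r
# Probability the Patriots specifically win exactly 19 of 25 fair tosses
p_patriots <- dbinom(19, size = 25, prob = 0.5)
round(p_patriots, 4)   # about 0.005, i.e. roughly 0.5%

# Rough sketch: chance at least one of 32 teams shows such a streak,
# treating teams as independent (an approximation)
p_some_team <- 1 - (1 - p_patriots)^32
round(p_some_team, 2)  # roughly 0.16
```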
Business lesson: When auditing for fraud or anomalies:
Looking at enough data, you’ll always find something “unusual”
How many goals will a team score? Historical EPL data:
home_team_name away_team_name home_score guest_score
1 Arsenal Liverpool 3 4
2 Bournemouth Manchester United 1 3
3 Burnley Swansea 0 1
4 Chelsea West Ham 2 1
5 Crystal Palace West Bromwich Albion 0 1
Each row = one match with final scores.
The Business Problem:
Sports betting: $200+ billion industry.
Our approach:
Who uses this? FiveThirtyEight, ESPN, DraftKings, Betfair, team analytics
A key signature of Poisson data: the mean equals the variance.
Model Diagnostics: Mean vs Variance
| Relationship | Suggests |
|---|---|
| Variance ≈ Mean | Poisson ✓ |
| Variance > Mean | Overdispersion (Negative Binomial) |
| Variance < Mean | Underdispersion (rare) |
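This diagnostic is easy to run on simulated counts (rates below are made up for illustration):

```r
set.seed(1)
x <- rpois(10000, lambda = 1.4)          # Poisson counts: variance tracks the mean
c(mean = mean(x), variance = var(x))

y <- rnbinom(10000, mu = 1.4, size = 2)  # Negative Binomial: overdispersed
c(mean = mean(y), variance = var(y))     # variance exceeds the mean
```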
Other Poisson Applications:
Poisson is the “go-to” for count data!
goals <- c(epl$home_score, epl$guest_score)
lambda <- mean(goals)
x <- 0:8
observed <- table(factor(goals, levels = x)) / length(goals)
expected <- dpois(x, lambda = lambda)
barplot(rbind(observed, expected), beside = TRUE,
        names.arg = x, col = c("steelblue", "coral"),
        xlab = "Goals Scored", ylab = "Proportion",
        legend.text = c("Observed", "Poisson Model"))

Model Validation:
The Poisson model (coral bars) fits the observed data (blue bars) remarkably well!
What this tells us:
Slight discrepancy at 0 goals: Real matches have slightly fewer 0-0 draws than Poisson predicts (teams try harder when level!)
Models count of random events: goals, arrivals, defects, clicks
\[P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
Business Applications:
If events are rare and independent, Poisson is your model!
A single \(\lambda\) for all teams is too simple. Better model:
\[\lambda_{ij} = \text{Attack}_i \times \text{Defense}_j \times \text{HomeAdvantage}\]
This is how real sports analytics works:
Same framework applies to:
To predict a specific match, we estimate each team’s scoring rate:
For Arsenal vs Liverpool at home, we estimate Arsenal will score about 1.8 goals on average. Liverpool’s away \(\lambda\) would be calculated similarly.
# Simple estimate: average goals scored and conceded
arsenal_attack <- mean(epl$home_score[epl$home_team_name == "Arsenal"])
liverpool_defense <- mean(epl$home_score[epl$away_team_name == "Liverpool"])
league_avg <- mean(goals)
# Arsenal's expected goals vs Liverpool (simplified)
lambda_arsenal <- arsenal_attack * (liverpool_defense / league_avg)
lambda_arsenal
[1] 1.851998
Once we have \(\lambda\) for each team, we can simulate the match thousands of times.
For Arsenal (\(\lambda=1.8\)) vs Liverpool (\(\lambda=1.5\)), running 10,000 simulations gives:
This is how betting companies set their odds!
set.seed(42)
n_sims <- 10000
# Simulate Arsenal vs Liverpool
arsenal_goals <- rpois(n_sims, lambda = 1.8) # λ for Arsenal
liverpool_goals <- rpois(n_sims, lambda = 1.5) # λ for Liverpool
# Match outcomes
c(Arsenal_Win = mean(arsenal_goals > liverpool_goals),
Draw = mean(arsenal_goals == liverpool_goals),
  Liverpool_Win = mean(arsenal_goals < liverpool_goals))

  Arsenal_Win          Draw Liverpool_Win
       0.4538        0.2254        0.3208
Each simulation draws random goals from Poisson distributions
This is how FiveThirtyEight and bookmakers build their models!
Monte Carlo Applications:
When math is too hard, simulate!
The most important theorem in statistics:
The average of many independent random events tends toward a Normal distribution, regardless of the original distribution.
Why it matters: Stock returns, measurement errors, test scores — all tend to be Normal because they’re sums of many small effects.
Practical Implications:
Rule of thumb: Sample size ≥ 30 usually sufficient for CLT to kick in
This is why the Normal distribution is everywhere!
Suppose the true vote share in Michigan is 51%. What happens when we poll voters?
set.seed(42)
true_p <- 0.51
# Poll of 10 voters
hist(replicate(1000, mean(rbinom(10, 1, true_p))), breaks = 20,
main = "Poll: 10 Voters", xlab = "Vote Share", col = "steelblue",
freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)
# Poll of 100 voters
hist(replicate(1000, mean(rbinom(100, 1, true_p))), breaks = 20,
main = "Poll: 100 Voters", xlab = "Vote Share", col = "steelblue",
freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)
# Poll of 1000 voters
hist(replicate(1000, mean(rbinom(1000, 1, true_p))), breaks = 20,
main = "Poll: 1000 Voters", xlab = "Vote Share", col = "steelblue",
freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)

Larger samples → tighter Normal distribution around the true value (red line)
The “bell curve” — the most important distribution in statistics
The 68-95-99.7 Rule:
Why it’s everywhere: Central Limit Theorem guarantees that averages of many random events become Normal
Applications: Quality control, financial risk, test scores, measurement error
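The 68-95-99.7 rule is easy to verify numerically with `pnorm()`:

```r
# Probability mass within 1, 2, and 3 standard deviations of the mean
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
# approximately 0.683, 0.954, 0.997
```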
Male heights follow a Normal distribution: mean = 70 inches, sd = 3 inches
R Functions for Normal Distribution:
| Function | Purpose | Example |
|---|---|---|
| `pnorm()` | Probability ≤ x | P(height ≤ 73) |
| `qnorm()` | Find percentile | 95th percentile |
| `dnorm()` | Density at x | Height of curve |
| `rnorm()` | Random samples | Simulate data |
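Applied to the heights example (mean 70 inches, sd 3 inches), the four functions look like this:

```r
mu <- 70; s <- 3
pnorm(73, mean = mu, sd = s)   # P(height <= 73): about 0.84
qnorm(0.95, mean = mu, sd = s) # 95th percentile: about 74.9 inches
dnorm(70, mean = mu, sd = s)   # height of the density curve at the mean
rnorm(5, mean = mu, sd = s)    # five simulated heights
```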
Business Applications:
How extreme was the October 1987 crash of -21.76%?
Conclusion: The model is wrong — stock returns have “fat tails.” Banks using Normal-based VaR dramatically underestimate risk.
The Problem with Normal Assumptions:
Stock returns have more extreme events than the Normal distribution predicts.
| Event | Normal Probability | Actually Happened |
|---|---|---|
| 1987 Crash (-22%) | 1 in \(10^{160}\) | Yes |
| 2008 Crisis | “Impossible” | Yes |
| 2020 COVID Crash | “Impossible” | Yes |
Implications for Risk Management:
Finding the relationship between variables
\[y = \beta_0 + \beta_1 x + \epsilon\]
Goal: Minimize sum of squared prediction errors
Business Questions Regression Answers:
Regression quantifies relationships and enables prediction.
Using Saratoga County housing data, we fit a model:
Price = f(Living Area)
A 2,000 sq ft house: $13K + (2000 × $113) = $239,000
Interpreting Coefficients:
| Coefficient | Meaning |
|---|---|
| Intercept ($13K) | Value of land without house |
| Slope ($113/sqft) | Price increase per sqft |
Making Predictions:
\[\text{Price} = 13,439 + 113 \times \text{SqFt}\]
| House Size | Predicted Price |
|---|---|
| 1,500 sqft | $183,000 |
| 2,500 sqft | $296,000 |
| 3,500 sqft | $409,000 |
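The table values follow directly from the fitted equation:

```r
sqft <- c(1500, 2500, 3500)
price <- 13439 + 113 * sqft   # intercept and slope from the fitted model
round(price, -3)              # 183000 296000 409000
```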
What the plot shows:
Key observations:
The line minimizes the sum of squared vertical distances
The Capital Asset Pricing Model (CAPM) asks: Does a stock follow the market or beat it?
\[\text{Google Return} = \alpha + \beta \times \text{Market Return}\]
Call:
lm(formula = goog ~ spy)
Coefficients:
(Intercept) spy
0.0003211 1.1705808
Our Findings:
| Beta | Interpretation |
|---|---|
| \(\beta < 1\) | Less volatile (utilities, healthcare) |
| \(\beta = 1\) | Moves with market (index funds) |
| \(\beta > 1\) | More volatile (tech, small caps) |
Conclusion: Google tracked the market without consistent alpha in 2017-2023. High beta = higher risk, potentially higher reward.
How does advertising affect price sensitivity? We model sales as a function of price and whether the product was featured in ads.
Key finding: The interaction term (log(price):feat) is negative and significant — advertising changes how customers respond to price!
Finding: Advertising increases price sensitivity
| Condition | Price Elasticity |
|---|---|
| No advertising | -0.96 |
| With advertising | -0.96 + (-0.98) = -1.94 |
Why? Ads coincide with promotions → attract price-sensitive shoppers
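The interaction model can be sketched on simulated data. The coefficients below are hypothetical, chosen to mirror the elasticities above; the real analysis would be fit to scanner data:

```r
# Sketch: log-log demand model with a price-by-advertising interaction
set.seed(7)
n <- 500
feat <- rbinom(n, 1, 0.3)      # 1 if the product was featured in ads
log_price <- runif(n)
log_sales <- 5 - 0.96 * log_price - 0.98 * feat * log_price + rnorm(n, 0, 0.2)
fit <- lm(log_sales ~ log_price * feat)
round(coef(fit), 2)  # log_price near -0.96; log_price:feat near -0.98
```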
Key Lessons:
Correlation ≠ Causation: Ads don’t cause sensitivity; they coincide with promotions
Selection effects: Who responds to ads? Price hunters!
Confounding variables: Promotions happen during ad campaigns
Managerial insight: Don’t blame advertising for price sensitivity — it’s the promotion strategy
Always ask: What’s really driving the relationship?
What if the outcome is yes/no?
\[P(y=1 \mid x) = \frac{1}{1 + e^{-\beta^T x}}\]
Why not just use linear regression?
Can Vegas point spreads predict game outcomes? We fit a logistic regression using historical NBA data.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| spread | 0.156 | 0.014 | 11.332 | 0 |
Interpretation: For each additional point in the spread, log-odds of favorite winning increases by 0.16. The p-value < 0.001 confirms spreads are highly predictive.
Using our model, we can predict win probability for any point spread:
| Spread | P(Favorite Wins) |
|---|---|
| 4 points | 65% |
| 8 points | 78% |
| 12 points | 87% |
        1         2
0.6511238 0.7769474
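The table values can be reproduced from the fitted spread coefficient of 0.156, assuming an intercept near zero (consistent with the fitted probabilities shown):

```r
b_spread <- 0.156            # fitted coefficient from the regression table
spread <- c(4, 8, 12)
round(plogis(b_spread * spread), 2)  # 0.65 0.78 0.87
```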
Same approach used for: credit scoring, churn prediction, marketing response, fraud detection — any binary outcome.
How accurate is our model? The confusion matrix shows predictions vs. actual outcomes.
       Predicted
Actual   1
  0    131
  1    422
Our model achieves about 66% accuracy — better than a coin flip!
Reading the Matrix:
| Pred: 0 | Pred: 1 | |
|---|---|---|
| Actual: 0 | TN (correct!) | FP (oops) |
| Actual: 1 | FN (oops) | TP (correct!) |
Sports Betting Reality:
But past performance ≠ future results
| Predicted: Win | Predicted: Lose | |
|---|---|---|
| Actual: Win | True Positive (TP) | False Negative (FN) |
| Actual: Lose | False Positive (FP) | True Negative (TN) |
Key Metrics:
Caution: Accuracy can mislead! A spam filter predicting “not spam” for everything has 99% accuracy but catches zero spam. Choose metrics based on business costs.
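The spam-filter warning is easy to see in numbers (counts below are made up for illustration):

```r
# Hypothetical counts: 990 legitimate emails, 10 spam, filter flags nothing
TP <- 0; FP <- 0; FN <- 10; TN <- 990
accuracy <- (TP + TN) / (TP + TN + FP + FN)  # 0.99, looks great
recall   <- TP / (TP + FN)                   # 0: catches no spam at all
c(accuracy = accuracy, recall = recall)
```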
Understanding the ROC Curve:
Area Under Curve (AUC):
| AUC | Model Quality |
|---|---|
| 0.5 | Random (useless) |
| 0.6-0.7 | Poor |
| 0.7-0.8 | Fair |
| 0.8-0.9 | Good |
| 0.9+ | Excellent |
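AUC has a handy interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A sketch on simulated model scores:

```r
set.seed(3)
score_pos <- rnorm(500, mean = 1)  # model scores for actual positives
score_neg <- rnorm(500, mean = 0)  # model scores for actual negatives
# AUC = P(random positive outranks random negative)
auc <- mean(outer(score_pos, score_neg, ">"))
round(auc, 2)  # around 0.76 for this amount of separation
```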
The optimal threshold depends on business costs:
There is no universal “correct” threshold
Framework for Threshold Selection:
Example — Credit Card Fraud:
Let business economics guide your model decisions
| Concept | Key Insight |
|---|---|
| Distributions | Binomial (binary), Poisson (counts), Normal (continuous) |
| Poisson | Mean = Variance — the fingerprint of count data |
| Normal | CLT makes it universal for averages |
| Linear Regression | Coefficients = effect sizes |
| Logistic Regression | Outputs probabilities for classification |
| ROC/AUC | Trade-off between false positives and false negatives |
| Threshold | Business costs should drive the choice |
Statistics is the science of decision-making under uncertainty
Online Articles:
Key Insight from HBR: A simple A/B test at Bing generated over $100M annually by testing a “low priority” idea
Books for Further Study:
Online Courses: