2  Bayes Rule

“When the facts change, I change my mind. What do you do, sir?” (John Maynard Keynes)

One of the key questions in the theory of learning is: how do you update your beliefs in the presence of new information? Bayes rule provides the answer. Conditional probability can be interpreted as updating your probability of event \(A\) after you have learned that \(B\) has occurred. In this sense, probability is also the language for describing how opinions change in the light of new evidence. For example, consider finding the probability that a die roll is odd, given the information that the number is less than 4. The intuitive answer of \(1/2\) changes once we incorporate this new evidence: \[ P(\text{Odd} \mid <4) = \frac{P(\{1,3\})}{P(\{1,2,3\})} = \frac{2}{3}. \]

Probability rules allow us to change our mind when the facts change. For example, suppose that our evidence \(E = \{ E_1 , E_2 \}\) consists of two pieces of information and that we are interested in identifying a cause \(C\), that is, in \(P(C\mid E_1,E_2)\). Bayes rule lets you calculate this conditional probability in a sequential fashion. First, conditioning on the information contained in \(E_1\) gives \[ P( C\mid E_1 ) = \frac{ P( E_1 \mid C ) P( C) }{ P( E_1 ) }. \] Then, using the posterior probability \(P( C\mid E_1 )\) as the “new” prior for the next piece of information \(E_2\), we find \[ P( C\mid E_1 , E_2 ) = \frac{ P( E_2 \mid E_1 , C ) P( C \mid E_1 ) }{ P( E_2 \mid E_1 ) }. \] Hence, we need assessments of the two conditional probabilities \(P( E_1 \mid C )\) and \(P( E_2 \mid E_1 , C )\). In many situations, the latter will simply be \(P( E_2 \mid C )\) and not involve \(E_1\); in that case the events \(( E_1, E_2 )\) are said to be conditionally independent given \(C\).

This concept generalizes to a sequence of events \(E = \{ E_1,\ldots, E_n \}\). When learning from data we will use this property all the time. An illustrative example is the Black Swan problem, which we discuss later.
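
Before moving on, here is a minimal R sketch of this sequential updating. The prior and the two likelihoods are hypothetical numbers chosen only to illustrate the mechanics, and the second update treats \(E_2\) as conditionally independent of \(E_1\) given \(C\).

# Sequential Bayesian updating (minimal sketch with hypothetical numbers)
update <- function(prior, lik_C, lik_notC) {
  # one application of Bayes rule for a binary cause C
  prior * lik_C / (prior * lik_C + (1 - prior) * lik_notC)
}
p <- 0.10                    # prior P(C)
p <- update(p, 0.8, 0.3)     # condition on E1: P(E1 | C) = 0.8, P(E1 | not C) = 0.3
p <- update(p, 0.7, 0.2)     # then on E2, assuming conditional independence given C
p                            # posterior P(C | E1, E2), about 0.51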

Bayes’ rule is a fundamental concept in probability theory and statistics. It describes how to update our beliefs about an event based on new evidence. We start with an initial belief about the probability of an event (called the prior probability). We then observe some new information (the evidence). We use Bayes’ rule to update our initial belief based on the evidence, resulting in a new belief called the posterior probability. Remember, the formula is \[ P(A\mid B) = \dfrac{P(B\mid A) P(A)}{P(B)} \] where:

Note: Posterior probability

\(P(A\mid B)\) is the posterior probability of event \(A\) occurring given that \(B\) has occurred. This is the probability we’re trying to find.

Note: Likelihood

\(P(B\mid A)\) is the likelihood of observing event \(B\) if event \(A\) has occurred.

Note: Prior probability

\(P(A)\) is the prior probability of event \(A\) occurring. This is our initial belief about the probability of \(A\) before we see any evidence.

Note: Marginal probability

\(P(B)\) is the marginal probability of observing event \(B\). This is the probability of observing B regardless of whether \(A\) occurs.

The ability to use Bayes rule sequentially is key in many applications where we need to update our beliefs in the presence of new information. For example, Bayesian learning was used by the mathematician Alan Turing at Bletchley Park in England to break the German Enigma code, a development that helped the Allies win the Second World War (Simpson 2010). Turing called his procedure Banburismus; it used sequential conditional probability to infer information about the likely settings of the Enigma machine.

Dennis Lindley argued that we should all be trained in Bayes rule, and that conditional probability can simply be viewed as disciplined probability accounting, akin to the way market odds change as evidence changes. However, human intuition is rarely naturally calibrated for Bayesian reasoning; it is a skill that must be learned, much like literacy.

2.1 Law of Total Probability

The Law of Total Probability is a fundamental rule relating marginal probabilities to conditional probabilities. It’s particularly useful when you’re dealing with a set of mutually exclusive and collectively exhaustive events.

Suppose you have a set of events \(B_1, B_2, ..., B_n\) that are mutually exclusive (i.e., no two events can occur at the same time) and collectively exhaustive (i.e., at least one of the events must occur). The Law of Total Probability states that for any other event \(A\), the probability of \(A\) occurring can be calculated as the sum of the probabilities of \(A\) occurring given each \(B_i\), multiplied by the probability of each \(B_i\) occurring.

Mathematically, it is expressed as:

\[ P(A) = \sum_{i=1}^{n} P(A\mid B_i) P(B_i) \]

Example 2.1 (Total Probability) Let’s consider a simple example to illustrate this. Suppose you have two bags of balls. Bag 1 contains 3 red and 7 blue balls, while Bag 2 contains 6 red and 4 blue balls. You randomly choose one of the bags and then randomly draw a ball from that bag. What is the probability of drawing a red ball?

Here, the events \(B_1\) and \(B_2\) are choosing Bag 1 and Bag 2, respectively. You want to find the probability of drawing a red ball (event \(A\)).

Applying the law:

  • \(P(A\mid B_1)\) is the probability of drawing a red ball from Bag 1, which is \(\frac{3}{10}\).
  • \(P(A\mid B_2)\) is the probability of drawing a red ball from Bag 2, which is \(\frac{6}{10}\).
  • Assume the probability of choosing either bag is equal, so \(P(B_1) = P(B_2) = \frac{1}{2}\).

Using the Law of Total Probability: \[ P(A) = P(A\mid B_1) \times P(B_1) + P(A\mid B_2) \times P(B_2)= \frac{3}{10} \times \frac{1}{2} + \frac{6}{10} \times \frac{1}{2} = \frac{9}{20} \]

So, the probability of drawing a red ball in this scenario is \(\frac{9}{20}\).
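
A quick Monte Carlo check in R (a minimal sketch) agrees with the calculation.

# Simulate the two-bag experiment and check the law of total probability
set.seed(42)
n     <- 100000
bag   <- sample(1:2, n, replace = TRUE)        # choose a bag at random
p_red <- ifelse(bag == 1, 3/10, 6/10)          # P(red | chosen bag)
red   <- rbinom(n, 1, p_red)                   # draw a ball from that bag
mean(red)                                      # close to 9/20 = 0.45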

This law is particularly useful in complex probability problems where direct calculation of probability is difficult. By breaking down the problem into conditional probabilities based on relevant events, it simplifies the calculation and helps to derive a solution.

2.2 Intuition and Simple Examples

Example 2.2 (Intuition) Our intuition is not well trained to make use of Bayes rule. Suppose I tell you that Steve was selected at random from a representative sample, that he is 6 feet 2 inches tall and an excellent basketball player, and that he goes to the gym every day and practices hard. Do you think Steve is a custodian at a factory or an NBA player? Most people assume Steve is an NBA player, which is wrong: the ratio of NBA players to custodians is very small, so probabilistically Steve is far more likely to be a custodian. The key is to condition on the right information and to consider the prior probability! Even though the proportion of people who practice basketball hard is much higher among NBA players (it is essentially 1) than among custodians, custodians vastly outnumber NBA players in the US.

\[\begin{align*} P(\text{Practice hard} \mid \text{Play in NBA}) \approx 1\\ P( \text{Play in NBA} \mid \text{Practice hard}) \approx 0. \end{align*}\]

Even though you practice hard, the odds of playing in the NBA are low (roughly \(450\) players out of a population of \(8\) billion). But given you’re in the NBA, you no doubt practice very hard. To understand this further, let’s look at the conditional probability implication and apply Bayes rule \[ P \left ( \text{Play in NBA} \mid \text{Practice hard} \right ) = \dfrac{P \left ( \text{Practice hard} \mid \text{Play in NBA} \right )}{P(\text{Practice hard})}P( \text{Play in NBA}). \] This is written in the form \[ \text{Posterior} = \frac{\text{Likelihood}}{\text{Marginal}}\times \text{Prior} = \text{Bayes Factor} \times \text{Prior}. \] The Likelihood/Marginal ratio is called the Bayes Factor. As we will see throughout the text, one of the key advantages of Bayesian reasoning over classical approaches is the ability to sequentially update our beliefs as new evidence appears. It allows for disciplined probability accounting in “real time”. With the advent of prediction markets and data science, this kind of real-time updating has become increasingly important.

The initial (a.k.a. prior) probability is \(P(\text{Play in NBA} ) = 450/(8 \cdot 10^9) = 5.625 \times 10^{-8}\), assuming a global population of around 8 billion. This makes the conditional (so-called posterior) probability also very small: \[ P \left ( \text{Play in NBA} \mid \text{Practice hard} \right ) \approx 0, \] since \(P(\text{practice hard})\) is not that small while \(P(\text{practice hard} \mid \text{play in NBA})=1\). Hence, when one ‘reverses the conditioning’ one gets a very small probability. This makes sense!

The Steve example illustrates how our intuition fails us, but let’s consider an even more striking case that demonstrates the power of Bayes rule with extreme probabilities. Consider the question: what is the probability that a randomly selected 7-foot-tall American male plays in the NBA?

Most people’s intuition suggests this probability should be quite high - after all, being exceptionally tall seems like the primary qualification for professional basketball. However, Bayes rule reveals a more nuanced picture that depends critically on the base rates involved.

To calculate \(P(\text{NBA player} \mid \text{7 feet tall})\) using Bayes rule, we need to carefully estimate each component:

\[P(\text{NBA} \mid \text{7ft}) = \frac{P(\text{7ft} \mid \text{NBA}) \times P(\text{NBA})}{P(\text{7ft})}\]

The prior probability \(P(\text{NBA})\) represents the baseline chance of being an NBA player. With approximately 450 active players drawn from roughly 40 million American males of playing age, this gives us \(P(\text{NBA}) \approx 1.1 \times 10^{-5}\) - an extraordinarily small number.

The likelihood \(P(\text{7ft} \mid \text{NBA})\) asks what fraction of NBA players are 7 feet or taller. Roster counts put this at a little under 10% of the league (we return to an explicit count below), so \(P(\text{7ft} \mid \text{NBA}) \approx 0.09\).

The marginal probability \(P(\text{7ft})\) requires us to estimate how rare 7-foot-tall men are in the general population. Male height is roughly normally distributed with mean 69 inches and standard deviation 3 inches. At 84 inches (7 feet), we’re looking at a z-score of 5.0, which corresponds to roughly 1 in 3.5 million men, giving \(P(\text{7ft}) \approx 2.9 \times 10^{-7}\). This Gaussian estimate should be treated with suspicion, however: it implies only about a dozen 7-footers among the roughly 40 million American men of playing age, fewer than the NBA alone employs, so the normal model clearly understates the extreme upper tail.

Plugging these numbers directly into \[P(\text{NBA} \mid \text{7ft}) = \frac{P(\text{7ft} \mid \text{NBA}) \times P(\text{NBA})}{P(\text{7ft})}\] gives a value greater than one, which is impossible and signals that the inputs are mutually inconsistent. Bayes rule itself supplies the consistency check: since every 7-foot NBA player is also a 7-foot American male, we must have \(P(\text{7ft}) \geq P(\text{7ft} \mid \text{NBA})\, P(\text{NBA}) \approx 0.09 \times 1.1 \times 10^{-5} \approx 10^{-6}\). Using a heavier-tailed, more realistic marginal of a few 7-footers per million men yields a posterior somewhere in the range of 10 to 20 percent.

That is a dramatic increase from the baseline probability of about 0.001%, yet still far from a certainty: even for 7-footers, making the NBA is the exception rather than the rule.

Let’s also consider a direct count-based calculation. As of September 2025, there are 39 players in the NBA who are 7 feet tall or taller, out of a total of 450 NBA players. This means the probability that a randomly selected NBA player is at least 7 feet tall is:

\[ P(\text{7ft} \mid \text{NBA}) = \frac{39}{450} = 0.0867 \]

This empirical count pins down the likelihood used above, and it highlights the rarity of extreme height even among elite basketball players.

Regardless of the exact calculation, this example powerfully demonstrates how Bayes rule forces us to account for base rates. Even when height provides enormous predictive value for NBA success (the likelihood ratio is massive), the extreme rarity of both 7-foot-tall individuals and NBA players means that most 7-footers will not be professional basketball players. This counterintuitive result exemplifies why disciplined probabilistic reasoning through Bayes rule is essential for making accurate inferences in the presence of rare events.
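
A short R sketch makes the accounting explicit. The prior and likelihood come from the counts above; the marginal \(P(\text{7ft})\) is an assumed, heavier-tailed value chosen purely for illustration.

# Bayes rule for the 7-footer example; inputs are rough, order-of-magnitude assumptions
p_nba     <- 450 / 40e6                 # prior: ~450 players among ~40 million men of playing age
p_7ft_nba <- 39 / 450                   # likelihood, from the roster count above

pnorm(84, mean = 69, sd = 3, lower.tail = FALSE)   # Gaussian tail: ~2.9e-7, too small to be consistent
p_7ft_nba * p_nba                                   # lower bound on P(7ft): about 1e-6

p_7ft <- 5e-6                           # assumed heavier-tailed marginal, for illustration only
p_7ft_nba * p_nba / p_7ft               # posterior P(NBA | 7ft): roughly 0.2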

Example 2.3 (Craps) Craps is a fast-moving dice game with a complex betting layout. It’s highly volatile, but eventually your bankroll will drift towards zero. Let’s look at the pass-line bet; the expectation \(E(X)\) governs the long run. On the first (come-out) roll, when 7 or 11 comes up, you win. When 2, 3 or 12 comes up (known as “craps”), you lose. When 4, 5, 6, 8, 9 or 10 comes up, that number is called the “point”, and the bettor continues to roll until either a 7 comes up (you lose) or the point comes up (you win).

We need to know the probability of winning. The payout and probability for a $1 pass-line bet are:

Payout Probability
+1 0.4929
-1 0.5071

This leads to an edge in favor of the house as \[ E(X) = 1 \cdot 0.4929 + (- 1) \cdot 0.5071 = -0.014 \] The house has a 1.4% edge.

To calculate the probability of winning: \(P( \text{Win} )\) let’s use the law of total probability \[ P( \text{Win} ) = \sum_{ \mathrm{Point} } P ( \text{Win} \mid \mathrm{Point} ) P ( \mathrm{Point} ) \] The set of \(P( \mathrm{Point} )\) are given by

Value Probability Percentage
2 1/36 2.78%
3 2/36 5.56%
4 3/36 8.33%
5 4/36 11.1%
6 5/36 13.9%
7 6/36 16.7%
8 5/36 13.9%
9 4/36 11.1%
10 3/36 8.33%
11 2/36 5.56%
12 1/36 2.78%

The conditional probabilities \(P( \text{Win} \mid \mathrm{Point} )\) are harder to calculate: \[ P( \text{Win} \mid 7 \; \mathrm{or} \; 11 ) = 1 \; \; \mathrm{and} \; \; P( \text{Win} \mid 2 , 3 \; \mathrm{or} \; 12 ) = 0 . \] We still have to work out the probability of winning given each point. Suppose the point is \(4\): \[ P( \text{Win} \mid 4 ) = P ( 4 \; \mathrm{before} \; 7 ) = \dfrac{P(4)}{P(7)+P(4)} = \frac{3}{9} = \frac{1}{3} , \] since there are 6 ways of getting a 7 and 3 ways of getting a 4, for a total of 9 relevant outcomes. Doing the same for every point and summing, together with the immediate wins and losses, gives \[ P( \text{Win}) = 0.4929 , \] as the short calculation below confirms.
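
Here is a minimal R check of that sum, using the point probabilities from the table above.

# Pass-line win probability via the law of total probability
ways  <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)   # ways to roll a total of 2..12 with two dice
p_sum <- ways / 36
names(p_sum) <- 2:12

p_win <- p_sum["7"] + p_sum["11"]             # immediate win on the come-out roll
for (pt in c(4, 5, 6, 8, 9, 10)) {            # otherwise, win if the point repeats before a 7
  k <- as.character(pt)
  p_win <- p_win + p_sum[k] * p_sum[k] / (p_sum[k] + p_sum["7"])
}
unname(p_win)                                  # 0.4929 (exactly 244/495)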

Example 2.4 (Coin Jar) A large jar contains 1024 fair coins and one two-headed coin. You pick one at random, flip it \(10\) times, and get all heads. What’s the probability that the coin is the two-headed one? The probability of initially picking the two-headed coin is 1/1025. There is a 1/1024 chance of getting \(10\) heads in a row from a fair coin. Therefore, it’s a \(50/50\) bet.

Let’s do the formal Bayes rule math. Let \(E\) be the event that you get \(10\) Heads in a row, then

\[ P \left ( \mathrm{two \; headed} \mid E \right ) = \frac{ P \left ( E \mid \mathrm{ two \; headed} \right )P \left ( \mathrm{ two \; headed} \right )} {P \left ( E \mid \mathrm{ fair} \right )P \left ( \mathrm{ fair} \right ) + P \left ( E \mid \mathrm{ two \; headed} \right )P \left ( \mathrm{ two \; headed} \right )} \] Therefore, the posterior probability \[ P \left ( \mathrm{two \; headed} \mid E \right ) = \frac{ 1 \times \frac{1}{1025} }{ \frac{1}{1024} \times \frac{1024}{1025} + 1 \times \frac{1}{1025} } = 0.50 \] What’s the probability that the next toss is a head? Using the law of total probability gives

\[\begin{align*} P( H \mid E ) &= P( H \mid \mathrm{ two \; headed} )P( \mathrm{ two \; headed} \mid E ) + P( H \mid \mathrm{ fair} )P( \mathrm{ fair} \mid E) \\ & = 1 \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} = \frac{3}{4} \end{align*}\]
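
The arithmetic is easy to check directly in R.

# Coin-jar posterior and predictive probability, checking the numbers above
prior_two_headed <- 1 / 1025
prior_fair       <- 1024 / 1025
lik_two_headed   <- 1               # P(10 heads | two-headed)
lik_fair         <- (1/2)^10        # P(10 heads | fair)
post_two_headed  <- lik_two_headed * prior_two_headed /
  (lik_two_headed * prior_two_headed + lik_fair * prior_fair)
post_two_headed                     # 0.5
# probability the next toss is a head, by the law of total probability
post_two_headed * 1 + (1 - post_two_headed) * 1/2   # 0.75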

Example 2.5 (Monty Hall Problem) This is another example where the correct probability calculation is counterintuitive. The Monty Hall problem was named after the host of the long-running TV show Let’s Make a Deal. A correct solution was famously given by Marilyn vos Savant in her column, and many mathematicians wrote in to insist she was wrong!

The game set-up is as follows. A contestant is given the choice of 3 doors. There is a prize (a car, say) behind one of the doors and something worthless behind the other two doors: two goats. The game is as follows:

  1. You pick a door.
  2. Monty then opens one of the other two doors, revealing a goat. He can’t open your door or reveal the car.
  3. You have the choice of switching doors.

The question is, is it advantageous to switch? The answer is yes. The probability of winning if you switch is 2/3 and if you don’t switch is 1/3.

Conditional probabilities allow us to answer this question. Suppose you pick Door 1. Let \(A\) be the event that the car is behind Door 2, and let \(B\) be the event that the host opens Door 3 and shows a goat; we need \(P(A\mid B)\), the probability that switching wins. The prior probability that the car is behind Door 2 is \(P(A) = 1/3\), and \(P(B\mid A) = 1\): if the car is behind Door 2 and you chose Door 1, the host has no choice but to open Door 3. The overall probability of the host opening Door 3 is \[ P(B) = \tfrac{1}{3} \times \tfrac{1}{2} + \tfrac{1}{3} \times 1 + \tfrac{1}{3} \times 0 = \tfrac{1}{2}, \] since the host opens Door 2 or Door 3 at random when the car is behind your Door 1, must open Door 3 when the car is behind Door 2, and cannot open Door 3 when the car is behind it. Bayes rule then gives \[ P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)} = \frac{1/3}{1/2} = \frac{2}{3}. \]

The posterior probability that the car is behind Door 2 after the host opens Door 3 is 2/3. It is to your advantage to switch doors.
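
A short Monte Carlo simulation in R (a minimal sketch of the game as described above) confirms the 2/3 versus 1/3 split.

# Monte Carlo check of the Monty Hall switching strategy
set.seed(1)
n    <- 100000
car  <- sample(1:3, n, replace = TRUE)    # door hiding the car
pick <- sample(1:3, n, replace = TRUE)    # contestant's initial pick
host <- mapply(function(car_i, pick_i) {
  opts <- setdiff(1:3, c(car_i, pick_i))  # host opens a door that is neither the pick nor the car
  if (length(opts) == 1) opts else sample(opts, 1)
}, car, pick)
switch_to <- 6 - pick - host              # the remaining unopened door (door labels sum to 6)
mean(switch_to == car)                    # about 2/3: switching wins
mean(pick == car)                         # about 1/3: staying wins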

2.3 Real World Bayes

Search and Rescue

Example 2.6 (USS Scorpion) The USS Scorpion sank on 5 June 1968 in the middle of the Atlantic. Experts placed bets on the possible failure scenarios and on how each would have affected where the submarine came to rest. Undersea soundings gave a prior on location. Bayes rule, with \(L\) the location and \(S\) the scenario, gives \[ P (L \mid S) = \frac{ P(S \mid L) P(L)}{P(S)} \] The Navy had spent \(5\) months looking and found nothing. Once a probability map was built, the submarine was found within \(5\) days, about \(220\) yards from the most probable location!

A similar story played out during the search for the Air France flight from Rio to Paris that crashed in the Atlantic in 2009.

Example 2.7 (Wald and Airplane Safety) Many lives were saved by an analysis of conditional probabilities performed by Abraham Wald during the Second World War. He was analyzing damage to US planes that came back from bombing missions over Germany. Someone suggested analyzing the distribution of hits over different parts of the plane: the idea was to find a pattern in the damage and design a reinforcement strategy.

After examining hundreds of damaged airplanes, researchers came up with the following table

Location Number of Planes
Engine 53
Cockpit 65
Fuel system 96
Wings, fuselage, etc. 434

We can convert those counts to probabilities

Location Probability
Engine 0.08
Cockpit 0.1
Fuel system 0.15
Wings, fuselage, etc. 0.67

We can conclude that the most likely area to be damaged on the returning planes was the wings and fuselage. \[ P(\mbox{hit on wings or fuselage } \mid \mbox{returns safely}) = 0.67 \] Wald realized that analyzing damage only on the planes that survived is not the right approach. Instead, he argued that the essential quantity is the inverse probability \[ P(\mbox{returns safely} \mid \mbox{hit on wings or fuselage }) = ? \] To estimate it, he interviewed many engineers and pilots and performed a lot of field experiments. He analyzed likely attack angles and studied the properties of the shrapnel cloud from a flak gun. He even suggested that the army fire thousands of dummy bullets at a plane sitting on the tarmac. In this way Wald carefully constructed a probability model to reconstruct an estimate of the joint probabilities. The table below shows the results.

Hit Returned Shot Down
Engine 53 57
Cockpit 65 46
Fuel system 96 16
Wings, fuselage, etc. 434 33

This allows us to estimate joint probabilities, for example \[ P(\mbox{outcome = returns safely} , \mbox{hit = engine }) = 53/800 = 0.066 . \] We can also calculate the conditional probabilities now: \[ P(\mbox{outcome = returns safely} \mid \mbox{hit = wings or fuselage }) = \dfrac{434}{434+33} = 0.929. \] Should we reinforce the wings and fuselage? Which part of the airplane needs to be reinforced? Compare with \[ P(\mbox{outcome = returns safely} \mid \mbox{hit = engine }) = \dfrac{53}{53+57} = 0.48 . \] A hit to the engine is far more likely to bring a plane down, so the armor belongs on the engines and cockpit, precisely the places where the returning planes show the fewest hits; the calculation below works this out for every location.
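
A few lines of R reproduce the comparison for all four locations from the reconstructed counts.

# Conditional survival probabilities from Wald's reconstructed counts
hits <- data.frame(
  location  = c("Engine", "Cockpit", "Fuel system", "Wings, fuselage, etc."),
  returned  = c(53, 65, 96, 434),
  shot_down = c(57, 46, 16, 33)
)
hits$p_return <- with(hits, returned / (returned + shot_down))  # P(returns safely | hit location)
hits[order(hits$p_return), ]    # engine and cockpit hits are the most dangerous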

2.5 First Application: Naive Bayes

Use of Bayes rule allows us to build our first predictive model, the Naive Bayes classifier. Naive Bayes is a collection of classification algorithms based on Bayes theorem. It is not a single algorithm but a family of algorithms that all share a common principle: every feature being classified is assumed to be independent of every other feature, given the class. For example, a fruit may be considered to be an apple if it is red, round, and about 3” in diameter. A Naive Bayes classifier considers each of these “features” (red, round, 3” in diameter) to contribute independently to the probability that the fruit is an apple, regardless of any correlations between features. Features, however, aren’t always independent, which is often seen as a shortcoming of the Naive Bayes algorithm, and this is why it’s labeled “naive”.

Color, Shape and Size of Fruits

Although it’s a relatively simple idea, Naive Bayes can often outperform other more sophisticated algorithms and is extremely useful in common applications like spam detection and document classification. In a nutshell, the algorithm allows us to predict a class, given a set of features using probability. So in another fruit example, we could predict whether a fruit is an apple, orange or banana (class) based on its colour, shape etc (features). In summary, the advantages are:

  • It’s relatively simple to understand and build
  • It’s easily trained, even with a small dataset
  • It’s fast!
  • It’s not sensitive to irrelevant features

The main disadvantage is that it assumes every feature is independent, which isn’t always the case.

Let’s say we have data on 1000 pieces of fruit. Each fruit is a Banana, an Orange or some Other fruit, and we know 3 features of each: whether it is Long, Sweet and Yellow, as displayed in the table below:

Fruit Long Sweet Yellow Total
Banana 400 350 450 500
Orange 0 150 300 300
Other 100 150 50 200
Total 500 650 800 1000

From this data we can calculate marginal probabilities

  • 50% of the fruits are bananas
  • 30% are oranges
  • 20% are other fruits

Based on our training set we can also say the following:

  • From 500 bananas 400 (0.8) are Long, 350 (0.7) are Sweet and 450 (0.9) are Yellow
  • Out of 300 oranges 0 are Long, 150 (0.5) are Sweet and 300 (1) are Yellow
  • From the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet and 50 (0.25) are Yellow

So let’s say we’re given the features of a piece of fruit and we need to predict the class. If we’re told that the additional fruit is Long, Sweet and Yellow, we can classify it using the following formula and subbing in the values for each outcome, whether it’s a Banana, an Orange or Other Fruit. The one with the highest probability (score) is the winner.

Given the evidence \(E\) (\(L\) = Long, \(S\) = Sweet and \(Y\) = Yellow) we can calculate the probability of each class \(C\) (\(B\) = Banana, \(O\) = Orange or \(F\) = Other Fruit) using Bayes’ Theorem: \[\begin{align*} P(B \mid E) = & \frac{P(L \mid B)P(S \mid B)P(Y \mid B)P(B)}{P(L)P(S)P(Y)}\\ =&\frac{0.8\times 0.7\times 0.9\times 0.5}{P(E)}=\frac{0.252}{P(E)} \end{align*}\]

Orange: \[ P(O\mid E)=0. \]

Other Fruit: \[\begin{align*} P(F \mid E) & = \frac{P(L \mid F)P(S \mid F)P(Y \mid F)P(F)}{P(L)P(S)P(Y)}\\ =&\frac{0.5\times 0.75\times 0.25\times 0.2}{P(E)}=\frac{0.01875}{P(E)} \end{align*}\]

In this case, based on the higher score, we can assume this Long, Sweet and Yellow fruit is, in fact, a Banana.

Notice, we did not have to calculate \(P(E)\) because it is a normalizing constant and it cancels out when we calculate the ratio

\[ \dfrac{P(B \mid E)}{P(F \mid E)} = 0.252/0.01875 = 13.44 > 1. \]
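
The whole calculation can be reproduced in a few lines of R directly from the count table above.

# Naive Bayes scores for a Long, Sweet, Yellow fruit
counts <- data.frame(
  row.names = c("Banana", "Orange", "Other"),
  Long   = c(400, 0, 100),
  Sweet  = c(350, 150, 150),
  Yellow = c(450, 300, 50),
  Total  = c(500, 300, 200)
)
prior      <- counts$Total / sum(counts$Total)                        # P(class)
likelihood <- counts[, c("Long", "Sweet", "Yellow")] / counts$Total   # P(feature | class)
score      <- apply(likelihood, 1, prod) * prior                      # numerator of Bayes rule
score / sum(score)                                                    # normalized posteriors: Banana wins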

Now that we’ve seen a basic example of Naive Bayes in action, you can easily see how it can be applied to Text Classification problems such as spam detection, sentiment analysis and categorization. By looking at documents as a set of words, which would represent features, and labels (e.g. “spam” and “ham” in case of spam detection) as classes we can start to classify documents and text automatically.

Example 2.13 (Spam Filtering) Early spam-filtering algorithms were based on Naive Bayes. The “naive” aspect of Naive Bayes comes from the assumption that inputs (words, in the case of text classification) are conditionally independent given the class label. Naive Bayes treats each word independently, so the model doesn’t capture the sequential or structural information inherent in language. It does not consider grammatical relationships or syntactic structures; the algorithm doesn’t understand the grammatical rules that dictate how words combine to form meaningful sentences. Further, it doesn’t understand the context in which words appear. For example, it may treat the word “bank” the same whether it refers to a financial institution or the side of a river. Despite its simplicity and the naive assumption, Naive Bayes often performs well in practice, especially in text classification tasks.

We start by collecting a dataset of emails labeled as “spam” or “not spam” (ham) and calculate the prior probabilities of spam (\(P(\text{spam})\)) and not spam (\(P(\text{ham})\)) based on the training dataset, by simply counting the proportions of each in the data.

Then each email gets converted into a bag-of-words representation (ignoring word order and considering only word frequencies). Then, we create a vocabulary of unique words from the entire dataset, \(w_1,w_2,\ldots,w_N\), and calculate the conditional probabilities \[ P(\mathrm{word}_i \mid \text{spam}) = \frac{\text{Number of spam emails containing }\mathrm{word}_i}{\text{Total number of spam emails}}, ~ i=1,\ldots,N \] \[ P(\mathrm{word}_i \mid \text{ham}) = \frac{\text{Number of ham emails containing }\mathrm{word}_i}{\text{Total number of ham emails}}, ~ i=1,\ldots,N \]

Now, we are ready to use our model to classify new emails. We do it by calculating the posterior probability using Bayes’ theorem. Say an email has a set of \(k\) words \(\text{email} = \{w_{e1},w_{e2},\ldots, w_{ek}\}\), then \[ P(\text{spam} \mid \text{email}) = \frac{P(\text{email} \mid \text{spam}) \times P(\text{spam})}{P(\text{email})} \] Here \[ P(\text{email} \mid \text{spam}) = P( w_{e1} \mid \text{spam})P( w_{e2} \mid \text{spam})\ldots P( w_{ek} \mid \text{spam}) \] We calculate \(P(\text{ham} \mid \text{email})\) in a similar way.

Finally, we classify the email as spam or ham based on the class with the highest posterior probability.

Suppose you have a spam email with the word “discount” appearing. Using Naive Bayes, you’d calculate the probability that an email containing “discount” is spam \(P(\text{spam} \mid \text{discount})\) and ham \(P(\text{ham} \mid \text{discount})\), and then compare these probabilities to make a classification decision.
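
Here is a toy version in R. The word counts are invented for illustration, and we add Laplace (add-one) smoothing, a standard refinement not described above, so that unseen words do not produce zero probabilities.

# Toy Naive Bayes spam scorer (made-up counts, for illustration only)
vocab      <- c("discount", "meeting", "free", "report")
spam_count <- c(30, 2, 40, 1);   n_spam <- 50    # spam emails containing each word, out of 50
ham_count  <- c(5, 60, 10, 55);  n_ham  <- 100   # ham emails containing each word, out of 100
p_w_spam <- (spam_count + 1) / (n_spam + 2)      # Laplace smoothing
p_w_ham  <- (ham_count  + 1) / (n_ham  + 2)
p_spam   <- n_spam / (n_spam + n_ham)
p_ham    <- 1 - p_spam

log_score <- function(words, p_w, p_class) {
  log(p_class) + sum(log(p_w[match(words, vocab)]))   # logs avoid numerical underflow
}
email <- c("discount", "free")
c(spam = log_score(email, p_w_spam, p_spam),
  ham  = log_score(email, p_w_ham,  p_ham))           # classify as the class with the larger log score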

While the naive assumption simplifies the model and makes it computationally efficient, it comes at the cost of a more nuanced understanding of language. More sophisticated models, such as transformers, have been developed to address these limitations by considering the sequential nature of language and capturing contextual relationships between words.

In summary, naive Bayes, due to its simplicity and the naive assumption of independence, is not capable of understanding the rules of grammar, the order of words, or the intricate context in which words are used. It is a basic algorithm suitable for certain tasks but may lack the complexity needed for tasks that require a deeper understanding of language structure and semantics.

2.6 Sensitivity and Specificity

Conditional probabilities are used to define two fundamental metrics used for many probabilistic and statistical learning models, namely sensitivity and specificity.

Sensitivity and specificity are two key metrics used to evaluate the performance of diagnostic tests, classification models, or screening tools. These metrics help assess how well a test can correctly identify individuals with a condition (true positives) and those without the condition (true negatives). Let’s break down each term:

  1. Sensitivity (true‐positive rate or recall) is the ability of a test \(T\) to correctly identify individuals who have a particular condition or disease (\(D\)), \(P ( T=1 \mid D=1 )\), the probability of a positive test given that the individual has the disease. It is calculated as the ratio of true positives to the sum of true positives and false negatives. \[ P(T=1\mid D=1) = \dfrac{P(T=1,D=1)}{P(D=1)}. \] A high sensitivity indicates that the test is good at identifying individuals with the condition, minimizing false negatives.
  2. Specificity (true‐negative rate) is the ability of a test to correctly identify individuals who do not have a particular condition or disease, \(P (T=0 \mid D=0 )\). It is calculated as the ratio of true negatives to the sum of true negatives and false positives. \[ P(T=0\mid D=0) = \dfrac{P(T=0,D=0)}{P(D=0)} \] A high specificity indicates that the test is good at correctly excluding individuals without the condition, minimizing false positives.

Sensitivity and specificity typically trade off against each other: increasing one tends to decrease the other. Which one you favor depends on the consequences of false positives and false negatives in a particular application.

Consider a medical test designed to detect a certain disease. If the test has high sensitivity, it means that it is good at correctly identifying individuals with the disease. On the other hand, if the test has high specificity, it is good at correctly identifying individuals without the disease. The goal is often to strike a balance between sensitivity and specificity based on the specific needs and implications of the test results.

Sensitivity is often called the power of a procedure (a.k.a. test). Type I and Type II errors are fundamental concepts in hypothesis testing; they are the complements of specificity and sensitivity, respectively.

Note: Type I error (false positive rate)

\(P(T=1\mid D=0)\) is the probability that a healthy person tests positive; it is the mistake of thinking something is true when it is not.

Note: Type II error (false negative rate)

\(P(T=0\mid D=1)\) is the probability that a sick person tests negative; it is the mistake of thinking something is not true when in fact it is.

We would like our test to control both conditional probabilities. We also care about the reverse question: if someone tests positive, how likely is it that they actually have the disease? There are two errors one can make: falsely diagnosing a healthy person, or failing to detect the disease.

In the stock market, one can think of type I error as not selling a losing stock quickly enough, and a type II error as failing to buy a growing stock, e.g. Amazon or Google.

\(P(T=1\mid D=1)\) Sensitivity True Positive Rate \(1-\beta\)
\(P(T=0\mid D=0 )\) Specificity True Negative Rate \(1-\alpha\)
\(P(T=1\mid D=0)\) 1-Specificity False Positive Rate \(\alpha\) (type I error)
\(P(T=0\mid D =1)\) 1-Sensitivity False Negative Rate \(\beta\) (type II error)

Often it is convenient to write those four values in the form of a two-by-two matrix, called the confusion matrix:

Actual/Predicted Positive Negative
Positive TP FN
Negative FP TN

where TP = true positive, FN = false negative, FP = false positive, and TN = true negative.
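
As a quick illustration, both metrics follow directly from the confusion-matrix counts; the counts below are made up purely for the sake of the example.

# Sensitivity and specificity from hypothetical confusion-matrix counts
TP <- 80; FN <- 20; FP <- 99; TN <- 891
sensitivity <- TP / (TP + FN)               # P(T=1 | D=1), true positive rate
specificity <- TN / (TN + FP)               # P(T=0 | D=0), true negative rate
c(sensitivity = sensitivity, specificity = specificity)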

We will extensively use the concepts of errors, specificity and sensitivity later in the book, when describing A/B testing and predictive models. These examples illustrate why people commonly miscalculate and misinterpret probabilities. All of these quantities can be calculated using Bayes rule.

Medical Diagnostics

Example 2.14 (Alice Mammogram) Alice is a 40-year-old woman. What is the chance that she really has breast cancer when she gets a positive mammogram result, given the following conditions:

  1. The prevalence of breast cancer among people like Alice is 1%.
  2. The test has an 80% detection rate.
  3. The test has a 10% false-positive rate.

We want to calculate the posterior probability \(P(\text{cancer} \mid \text{positive mammogram})\).

Figure 2.2: Frequency Tree: Medical Diagnosis Scenario (Mammogram)

Using the frequency tree in Figure 2.2, we can see that out of 1000 cases:

  • Number of actual cancer cases = 10
  • Number of healthy cases = 990

The test detects 8 out of the 10 cancer cases (True Positives). The test falsely flags 100 out of the 990 healthy cases (False Positives).

The total number of positive mammograms is thus \(8 + 100 = 108\). The number of these that are actually cancer is 8.

Therefore, the posterior probability is: \[ P(\text{cancer} \mid \text{positive}) = \frac{8}{108} \approx 0.074. \] There is only about a 7.4% chance Alice has cancer, despite the positive test result.
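
The same answer, without the rounding used in the frequency tree, comes straight from Bayes rule in R.

# Mammogram posterior via Bayes rule
prev <- 0.01; sens <- 0.80; fpr <- 0.10          # prevalence, detection rate, false-positive rate
sens * prev / (sens * prev + fpr * (1 - prev))   # about 0.075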

Example 2.15 (Apple Watch Series 4 ECG and Bayes’ Theorem) The Apple Watch Series 4 can perform a single-lead ECG and detect atrial fibrillation. The software can correctly identify 98% of cases of atrial fibrillation (true positives) and 99% of cases of non-atrial fibrillation (true negatives) (Kim et al. 2024; Bumgarner et al. 2018).

Predicted / Actual atrial fibrillation no atrial fibrillation Total
atrial fibrillation 1960 980 2940
no atrial fibrillation 40 97020 97060
Total 2000 98000 100000

However, what is the probability of a person having atrial fibrillation when atrial fibrillation is identified by the Apple Watch Series 4? We use Bayes theorem to answer this question. \[ P(\text{atrial fibrillation}\mid \text{atrial fibrillation is identified }) = \frac{0.01960}{ 0.02940} = 0.6667 \]

The conditional probability of having atrial fibrillation when the Apple Watch Series 4 detects atrial fibrillation is about 67%.

Among younger users, however, the prevalence of atrial fibrillation is much lower, and the Apple Watch’s positive predictive value is just 19.6 percent. That means in this group, which constitutes more than 90 percent of users of wearable devices like the Apple Watch, the app incorrectly diagnoses atrial fibrillation roughly 80 percent of the time. (You can try the calculation yourself using a Bayesian calculator: entering 0.02 for prevalence, 0.98 for sensitivity, and 0.99 for specificity reproduces the 67 percent above; the much lower positive predictive value for younger users comes from their much lower prevalence.)

The electrocardiogram app becomes more reliable in older individuals: The positive predictive value is 76 percent among users between the ages of 60 and 64, 91 percent among those aged 70 to 74, and 96 percent for those older than 85.

In the case of medical diagnostics, the sensitivity is the ratio of people who have the disease and test positive to the total number of people with the disease: \[ P(T=1\mid D=1) = \dfrac{P(T=1,D=1)}{P(D=1)} = 0.0196/0.02 = 0.98 . \] The specificity is given by \[ P(T=0\mid D=0) = \dfrac{P(T=0,D=0)}{P(D=0)} = 0.9702/0.98 = 0.99. \] As we see, the test is highly sensitive and specific. However, only about 67% of those who test positive will have the disease. This is because the number of sick people is much smaller than the number of healthy people, combined with the presence of false positives (type I errors).
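
A small R function makes the dependence on prevalence explicit; the 0.25% prevalence used for the younger group is an illustrative assumption.

# Positive predictive value as a function of prevalence,
# with the Apple Watch sensitivity and specificity from the text
ppv <- function(prev, sens = 0.98, spec = 0.99) {
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}
ppv(0.02)      # about 0.67: the 2% prevalence used in the table above
ppv(0.0025)    # about 0.20: an assumed low-prevalence (younger) group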

2.7 Advanced Applications

Example 2.16 (Obama Elections) This example demonstrates a Bayesian approach to election forecasting using polling data from the 2012 US presidential election. The goal is to predict the probability of Barack Obama winning the election by combining polling data across different states.

The data used includes polling data from various pollsters across all 50 states plus DC. Each state has polling percentages for Republican (GOP) and Democratic (Dem) candidates along with their electoral vote counts. The data is aggregated by state, taking the most recent polls available.

The techniques applied involve Bayesian simulation using a Dirichlet distribution to model uncertainty in polling percentages.

The Dirichlet distribution is a probability distribution over probabilities. While a standard distribution like the Normal distribution tells you the likelihood of a variable taking a certain real value (like height), the Dirichlet distribution is used when your variables are a set of probabilities that must sum to 1 (like the vote shares of candidates in an election). In this example, for each state, we have vote shares for Obama, Romney, and Others. These three numbers must sum to 100%. The Dirichlet distribution allows us to sample possible election results that respect this constraint while reflecting the uncertainty inherent in the polling data.

Monte Carlo simulation runs 10,000 simulations of the election to estimate win probabilities. The analysis is conducted state-by-state, calculating Obama’s probability of winning each individual state. Electoral college modeling combines state probabilities with electoral vote counts to determine the overall election outcome. The simulation runs the entire election multiple times to account for uncertainty and determines the likelihood of Obama reaching the required 270 electoral votes to win. This approach demonstrates how pattern matching through statistical modeling can be used for prediction, showing how polling data can be transformed into probabilistic forecasts of election outcomes.

We start by loading the data and aggregating it by state. We then run the simulation and plot probabilities by state.

library(plyr)
# Source: "http://www.electoral-vote.com/evp2012/Pres/pres_polls.csv"
election.2012 = read.csv("../data/pres_polls.csv")
# Optionally remove a pollster:
# elect2012 <- election.2012[!grepl('Rasmussen', election.2012$Pollster),]
elect2012 <- election.2012
# Aggregate the data: keep the most recent polls in each state, then average
elect2012 <- ddply(elect2012, .(state), subset, Day == max(Day))
elect2012 <- ddply(elect2012, .(state), summarise, R.pct = mean(GOP), O.pct = mean(Dem), EV = mean(EV))
knitr::kable(elect2012[1:5,], longtable = TRUE)
knitr::kable(elect2012[47:51,], longtable = TRUE)
state R.pct O.pct EV
Alabama 61 38 9
Alaska 55 42 3
Arizona 54 44 11
Arkansas 61 37 6
California 38 59 55
state R.pct O.pct EV
47 Virginia 48 51 13
48 Washington 42 56 12
49 West Virginia 62 36 5
50 Wisconsin 46 53 10
51 Wyoming 69 28 3

Election 2012 Data (first 5 states and last 5 states)

library(MCMCpack)
# For each state, draw vote shares from a Dirichlet posterior and record
# the fraction of draws in which Obama leads Romney
prob.Obama <- function(mydata) {
  p <- rdirichlet(1000, 500 * c(mydata$R.pct, mydata$O.pct,
                                100 - mydata$R.pct - mydata$O.pct) / 100 + 1)
  mean(p[, 2] > p[, 1])
}
win.probs <- ddply(elect2012, .(state), prob.Obama)
win.probs$Romney <- 1 - win.probs$V1
names(win.probs)[2] <- "Obama"
win.probs$EV <- elect2012$EV
win.probs <- win.probs[order(win.probs$EV), ]
rownames(win.probs) <- win.probs$state

We then plot the probabilities of Obama winning by state.

library(usmap)
plot_usmap(data = win.probs, values = "Obama") + 
  scale_fill_continuous(low = "red", high = "blue", name = "Obama Win Probability", label = scales::comma) + theme(legend.position = "right")

Probabilities of Obama winning by state

We use those probabilities to simulate the probability of Obama winning the election. First, we calculate the probability of Obama having 270 EV or more

sim.election <- function(win.probs) {
    winner <- rbinom(51, 1, win.probs$Obama)
    sum(win.probs$EV * winner)
}

sim.EV <- replicate(10000, sim.election(win.probs))
oprob <- sum(sim.EV >= 270)/length(sim.EV)
oprob
## 0.96
library(lattice)
# Lattice Graph
densityplot(sim.EV, plot.points = "rug", xlab = "Electoral Votes for Obama", 
  panel = function(x, ...) {
      panel.densityplot(x, ...)
      panel.abline(v = 270)
      panel.text(x = 285, y = 0.01, "270 EV to Win")
      panel.abline(v = 332)
      panel.text(x = 347, y = 0.01, "Actual Obama")
}, main = "Electoral College Results Probability")

The same approach can be applied to the 2008 United States Presidential Election between Barack Obama and John McCain, using the state polls from that year.

## Dirichlet simulation for the 2008 election.
## M.pct (McCain share), O.pct (Obama share) and EV are assumed to come from
## the 2008 state-polls data, loaded and aggregated as for 2012 above.
prob.Obama = function(j) {
  p = rdirichlet(5000, 500 * c(M.pct[j], O.pct[j], 100 - M.pct[j] - O.pct[j]) / 100 + 1)
  mean(p[, 2] > p[, 1])
}
## Obama win probability for all 51 "states" (including DC)
Obama.win.probs = sapply(1:51, prob.Obama)
## Simulate one election: flip a coin for each state with its win probability
sim.election = function() {
  winner = rbinom(51, 1, Obama.win.probs)
  sum(EV * winner)
}
sim.EV = replicate(1000, sim.election())
## Histogram of the simulated electoral-vote totals
hist(sim.EV, min(sim.EV):max(sim.EV), col = "blue", prob = T)
abline(v = 365, lwd = 3)            # Obama received 365 electoral votes
text(375, 0.01, "Actual \n Obama \n total")
Figure 2.3: Histogram of simulated election

The analysis of the 2008 U.S. Presidential Election data reveals several key insights about the predictive power of state-level polling and the uncertainty inherent in electoral forecasting. The actual result of 365 electoral votes falls within the simulated range, demonstrating the model’s validity. The 270-vote threshold needed to win the presidency is clearly marked and serves as a critical reference point.

We used a relatively simple model to simulate the election outcome. The model uses Dirichlet distributions to capture uncertainty in state-level polling percentages. Obama’s win probabilities vary significantly across states, reflecting the competitive nature of the election. The simulation approach accounts for both sampling uncertainty and the discrete nature of electoral vote allocation. The histogram of simulated results shows the distribution of possible outcomes. The actual Obama total of 365 electoral votes is marked and falls within the reasonable range of simulated outcomes. This validates the probabilistic approach to election forecasting.

This analysis demonstrates how Bayesian methods can be effectively applied to complex prediction problems with multiple sources of uncertainty, providing both point estimates and uncertainty around those estimates.

2.8 Graphical Representation of Probability and Conditional Independence

We can use the telescoping property of conditional probabilities to write the joint probability distribution as a product of conditional probabilities. This is the essence of the chain rule of probability. It is given by \[ P(x_1, x_2, \ldots, x_n) = P(x_1)P(x_2 \mid x_1)P(x_3 \mid x_1, x_2) \ldots P(x_n \mid x_1, x_2, \ldots, x_{n-1}). \] The expression on the right hand side can be simplified if some of the variables are conditionally independent. For example, if \(x_3\) is conditionally independent of \(x_2\), given \(x_1\), then we can write \[ P(x_3 \mid x_1, x_2) =P(x_3 \mid x_1). \]

In a high-dimensional case, when we have a joint distribution over a large number of random variables, we can often simplify the expression by using independence or conditional independence assumptions. Sometimes it is convenient to represent these assumptions in a graphical form. This is the idea behind the concept of a Bayesian network. Essentially, the graph is a compact representation of a set of independencies that hold in the distribution.

Let’s consider an example of a joint distribution with three random variables: \[ P(a,b,c) = P(a\mid b,c)P(b\mid c)P(c) \]

Graphically, we can represent the relations between the variables with a Directed Acyclic Graph (DAG), known as a Bayesian network. Each node represents a random variable and the arrows represent conditional dependencies between the variables; when two nodes are connected, they are not (in general) independent. Consider the following three cases:

\[ P(b\mid c,a) = P(b\mid c),~ P(a,b,c) = P(a)P(c\mid a)P(b\mid c) \]

Line Structure

\[ P(a\mid b,c) = P(a\mid c), ~ P(a,b,c) = P(a\mid c)P(b\mid c)P(c) \]

Lambda Structure

\[ P(a\mid b) = P(a),~ P(a,b,c) = P(c\mid a,b)P(a)P(b) \]

V-structure

Although the graph shows the conditional independence assumptions we made, we can also derive other independencies from it. An interesting question is what happens when two nodes are connected only through a third node. In the first case (a), \(a\) and \(b\) are connected through \(c\): \(a\) can influence \(b\), but once \(c\) is known, \(a\) and \(b\) are independent. In case (b) the logic is similar: \(a\) can influence \(b\) through \(c\), but once \(c\) is known, \(a\) and \(b\) are independent. In the third case (c), \(a\) and \(b\) are independent, but once \(c\) is known, \(a\) and \(b\) are no longer independent. You can formally derive these statements from the graph by comparing \(P(a,b\mid c)\) and \(P(a\mid c)P(b\mid c)\).

Example 2.17 (Bayes Home Diagnostics) Suppose that a house alarm system sends me a text notification when motion is detected inside my house. It detects motion when there is a person inside (a burglar) or during an earthquake. Say, from prior data, we know that during an earthquake the alarm is triggered in 10% of cases. Once I receive a text message, I start driving back home. While driving, I hear on the radio about a small earthquake in our area. We want to know \(P(b \mid a)\) and \(P(b \mid a,r)\), where \(b\) = burglary, \(e\) = earthquake, \(a\) = alarm, and \(r\) = radio report of a small earthquake.

The joint distribution is then given by \[ P(b,e,a,r) = P(r \mid a,b,e)P(a \mid b,e)P(b\mid e)P(e). \] Since we know the causal relations, we can simplify this expression \[ P(b,e,a,r) = P(r \mid e)P(a \mid b,e)P(b)P(e). \] The \(P(a \mid b,e)\) distribution is defined by

Table 2.1: Conditional probability of alarm given burglary and earthquake
\(P(a=1 \mid b,e)\) b e
0 0 0
0.1 0 1
1 1 0
1 1 1

Graphically, we can represent these relations as the DAG, i.e. the Bayesian network, shown in Figure 2.4.

Figure 2.4: Bayesian network for alarm.

Now we can easily calculate \(P(a=0 \mid b,e)\), from the property of a probability distribution \(P(a=1 \mid b,e) + P(a=0 \mid b,e) = 1\). In addition, we are given \(P(r=1 \mid e=1) = 0.5\) and \(P(r=1 \mid e=0) = 0\). Further, based on historic data we have \(P(b) = 2\cdot10^{-4}\) and \(P(e) = 10^{-2}\). Note that causal relations allowed us to have a more compact representation of the joint probability distribution. The original naive representation requires specifying \(2^4\) parameters.

To answer our original question, calculate \[ P(b \mid a) = \dfrac{P(a \mid b)P(b)}{P(a)},~~P(a) = P(a=1 \mid b=1)P(b=1) + P(a=1 \mid b=0)P(b=0). \] We have everything but \(P(a \mid b)\). This is obtained by marginalizing \(P(a=1 \mid b,e)\), to yield \[ P(a \mid b) = P(a \mid b,e=1)P(e=1) + P(a \mid b,e=0)P(e=0). \] We can calculate \[ P(a=1 \mid b=1) = 1, ~P(a=1 \mid b=0) = 0.1*10^{-2} + 0 = 10^{-3}. \] This leads to \(P(b \mid a) = 2\cdot10^{-4}/(2\cdot10^{-4} + 10^{-3}(1-2\cdot10^{-4})) = 1/6\).

This result is somewhat counterintuitive. We get such a low probability of burglary because its prior is very low compared to the prior probability of an earthquake. What happens to the posterior if we live in an area with a higher crime rate, say \(P(b) = 10^{-3}\)? Figure 2.5 shows the relationship between the prior and the posterior: \[ P(b \mid a) = \dfrac{P(b)}{P(b) + 10^{-3}(1-P(b))} \]

# Posterior probability of burglary as a function of the prior P(b)
prior <- seq(0, .1, length.out = 200)
post  <- prior / (prior + 0.001 * (1 - prior))
plot(prior, post, type = "l", lwd = 3, col = "red",
     xlab = "Prior P(b)", ylab = "Posterior P(b | a)")
Figure 2.5: Relationship between the prior and posterior

Now, suppose that you hear on the radio about a small earthquake while driving. Then, using Bayesian conditioning, \[ P(b=1 \mid a=1,r=1) = \dfrac{P(a=1,r=1 \mid b=1)P(b=1)}{P(a=1,r=1)} = \dfrac{\sum_e P(b=1,e,a=1,r=1)}{\sum_b\sum_e P(b,e,a=1,r=1)} \] \[ = \dfrac{\sum_e P(r=1 \mid e)P(a=1 \mid b=1,e)P(b=1)P(e)}{\sum_b\sum_e P(r=1 \mid e)P(a=1 \mid b,e)P(b)P(e)}, \] which is \(\approx 0.2\%\) in our case. The radio report sharply lowers the probability of a burglary: the earthquake accounts for the alarm. This effect is called explaining away, whereby new evidence that explains an observation reduces the need for an alternative explanation.
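
A brute-force enumeration of the joint distribution in R (a minimal check using the parameters above) reproduces both posteriors.

# Enumerate the alarm network and verify P(b | a) and P(b | a, r)
p_b <- 2e-4; p_e <- 1e-2
joint <- expand.grid(b = 0:1, e = 0:1, a = 0:1, r = 0:1)
p_a1  <- ifelse(joint$b == 1, 1, ifelse(joint$e == 1, 0.1, 0))   # P(a=1 | b, e)
p_r1  <- ifelse(joint$e == 1, 0.5, 0)                            # P(r=1 | e)
joint$p <- ifelse(joint$b == 1, p_b, 1 - p_b) *
           ifelse(joint$e == 1, p_e, 1 - p_e) *
           ifelse(joint$a == 1, p_a1, 1 - p_a1) *
           ifelse(joint$r == 1, p_r1, 1 - p_r1)
sum(joint$p[joint$b == 1 & joint$a == 1]) / sum(joint$p[joint$a == 1])        # about 0.167 = 1/6
sum(joint$p[joint$b == 1 & joint$a == 1 & joint$r == 1]) /
  sum(joint$p[joint$a == 1 & joint$r == 1])                                   # about 0.002: explaining away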