Chapter 3 Probability and Uncertainty

Data analysis projects have to deal with randomness. The act of data science is being able to separate the signal from the noise. For example, when analyzing buyers' decisions and their reaction to an incentive, we need to account for randomness in the observed human behavior. A random phenomenon, by its very nature, lacks deterministic regularity in its behaviour, making prediction hard. Surprisingly, random events often still possess statistical regularity. For example, if we flip a coin, it is hard to predict the outcome (head or tail) of an individual flip, but if we flip a coin many times and count the proportion of heads, that proportion will converge to something close to \(1/2\). Probability is a language that lets you communicate information about uncertain outcomes and events. By assigning a numeric value between zero and one to an event, or a collection of outcomes, in its simplest form, probability measures how likely an event is to occur. Probability is immensely useful and adheres to only a few basic rules. The first known applications of probability are in games of chance, e.g. dice rolls or card draws. A classic definition of probability (due to J. Bernoulli) is the ratio of the number of favorable outcomes \(m\) to the total number of outcomes \(n\), which we simply denote \[ P = \dfrac{m}{n}. \] For an event \(A\), we will write \(P(A)\); for two events \(A\), \(B\), we will calculate \(P(A \text{ or } B) = P(A \cup B)\) and \(P(A \text{ and } B) = P(A \cap B)\) using Kolmogorov's rules.

The study of probability arose in part due to interest in understanding games of chance, like cards or dice. These games provide useful examples of many statistical concepts, because when we repeat these games the likelihood of different outcomes remains (mostly) the same (statistical regularity). The first rigorous treatment of probability was presented by J. Bernoulli in his treatise “Ars Conjectandi” (the art of conjecturing), where he claims that to make a guess is the same thing as to measure a probability. However, there are deep questions about the meaning of probability that we will not address here; see the Suggested Readings at the end if you are interested in learning more about this fascinating topic and its history. Assigning probabilities to events is a challenging problem. There are a number of rules that probabilities must follow and we will show how to use them. Arguably the most important rule is Bayes rule for conditional probability. In the age of AI, it has certainly proved to be by far the most useful. One of the key properties of probabilities is that they are updated as you learn new information; conditional here means given the specific characteristics of the situation at hand, and one can argue that all probabilities are conditional in some way. This is known as Bayesian updating, and is central to how machines learn from observed data. Rational human behavior ought to adhere to Bayes rule, although there is much literature documenting the contrary.

3.1 Random variables

Our goal here is to introduce you to the concepts of probability, conditional probability and their governing rules. Understanding these concepts serves as a basis for more complex data analysis and machine learning algorithms. Building probabilistic models has many challenges and real-world applications. You are about to learn about practical examples from fields as diverse as medical diagnosis, chess, and racetrack odds.

Random variables are quantities that we are not certain about. The simplest version of a random variable is a binary yes/no outcome. A random variable that can take a finite or a countable number of values is called a discrete random variable. Otherwise, it is a continuous random variable.

A random variable describes an uncertain quantity, denoted by \(X\), by attaching a numeric value to the occurrence of an event. Two examples of discrete random variables are

  1. Will a user click-through on a Google ad?
  2. Who will win the 2020 US elections?

Random variables are denoted by quantities such as \(X\) and are constructed by assigning specific values to events such as \(\{X=x\}\), which corresponds to the outcomes where \(X\) equals a specific number \(x\). Associated with the possible outcomes are probabilities, numbers between zero and one. We will use \(\operatorname{P}\left(X=x\right)\) to denote the probability that the random variable \(X\) is equal to \(x\). A map from all possible values of a discrete random variable to their probabilities is called a probability mass function \(p(x)\).

Example 3.1 (Coin Toss) The quintessential random variable is the outcome of a coin toss. The set of all possible outcomes, known as the sample space, is \(S = \{H,T\}\), and \(p(X = H) = p(X = T) = 1/2\). On the other hand, a single outcome can be an element of many different events. For example, there are four possible outcomes of two coin tosses, HH, TT, HT, TH, which are equally likely, each with probability 1/4. The probability mass function over the number of heads \(X\) out of two coin tosses is

\(x\) \(p(x)\)
0 1/4
1 1/2
2 1/4

Given the probability mass function we can, for example, calculate the probability of at least one head as \(\operatorname{P}\left(X \geq 1\right) = \operatorname{P}\left(X =1\right) + \operatorname{P}\left(X =2\right) = 3/4\).
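This bookkeeping is easy to verify in R; a minimal sketch encoding the probability mass function above:

x = 0:2                       # possible numbers of heads in two tosses
p = c(1/4, 1/2, 1/4)          # probability mass function p(x)
sum(p)                        # probabilities must total one
#> [1] 1
sum(p[x >= 1])                # P(X >= 1)
#> [1] 0.75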

An important property of the probability mass function is that \[ \sum_{x\in S} p(x) = 1. \] Here \(S\) denotes the set of all possible values of the random variable \(X\). Clearly, all probabilities have to be greater than or equal to zero, so that \(p(x)\ge 0\).

Example 3.2 (Pete Rose) Pete Rose of the Cincinnati Reds set a National League record by hitting safely in \(44\) consecutive games. How likely is such a long sequence of safe hits to be observed? If you were a bookmaker, what odds would you offer on such an event? Hitting safely means that he reached first base after hitting the ball into fair territory, without the benefit of an error or a fielder’s choice, at least once in each of those 44 games. There are a couple of facts we know about him:

  • Rose was a \(.300\) hitter: he hits safely 3 times out of 10 at bats, so \(P(\text{hit}) = 0.3\).
  • Each at bat is assumed to be independent, i.e., the current at bat doesn’t affect the outcome of the next.

Assuming he comes to bat \(4\) times each game, what probability might reasonably be associated with that hitting streak? First we define notation. We use \(A_i\) to denote the event of hitting safely in game \(i\); then \[ \begin{aligned} & P( {\rm Rose \; Hits \; Safely \; in \;44 \; consecutive \; games} ) = \\ & P ( A_1 \; \text{and} \; A_2 \ldots \text{and} \; A_{44} ) = P ( A_1 ) P ( A_2 ) \ldots P ( A_{44} ), \end{aligned} \] using independence across games. We now need to find \(P(A_i)\), where \(P (A_i ) = 1 - P ( \text{not} \; A_i )\).

\[\begin{aligned} P ( A_1 ) & = 1 - P ( {\rm not} \; A_1 ) \\ & = 1 - P ( {\rm Rose \; makes \; 4 \; outs } ) \\ & = 1 - ( 0.7)^4 = 0.76 \end{aligned}\]

For the 44-game streak, we then have \((0.76)^{44} = 0.0000057\), a very low probability. In terms of odds, there are three basic inferences:

  • This means that the odds against a particular player as good as Pete Rose starting a hitting streak today are \(175,470\) to \(1\).
  • This doesn’t mean that the run of \(44\) won’t be beaten by some player at some time: the Law of Very Large Numbers.
  • Joe DiMaggio’s record is 56 games. He was a \(.325\) hitter, thus we have \((0.792)^{56} = 2.13 \times 10^{-6}\), or \(455,962\) to \(1\). It’s going to be hard to beat.
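The streak calculation is easy to reproduce in R; a short sketch of the arithmetic above:

p_game = 1 - 0.7^4            # P(Rose hits safely in a game of 4 at bats)
p_game
#> [1] 0.7599
signif(p_game^44, 2)          # P(hitting safely in 44 consecutive games)
#> [1] 5.7e-06
signif(1/p_game^44, 3)        # odds against; close to the 175,470 quoted above
#> [1] 176000

The small difference from the quoted odds comes from rounding \(0.7599\) to \(0.76\) in the text.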

Example 3.3 (Patriots)

Let’s consider another example and calculate the probability of winning 19 coin tosses out of 25. The Patriots won 19 out of 25 coin tosses in the 2014-15 NFL season. What is the probability of this happening a priori?

Let \(X\) be a random variable equal to \(1\) if the Patriots win and \(0\) otherwise. It’s reasonable to assume \(P(X = 1) = \frac{1}{2}\). We can code a typical sequence of wins and losses as \[ 1,1,1,\ldots,1,0,0,\ldots,0. \] The probability of observing any particular sequence, say 1 in the first 19 positions and 0 afterwards, is \((1/2)^{25}\).

Now there are \(177,100\) different sequences of 25 games where the Patriots win 19. Each potential sequence has probability \(0.5^{25}\), thus
\[ P \left ( \text{Patriots win 19 out 25 tosses} \right ) = 177,100 \times 0.5^{25} = 0.005 \]

So betting odds of \(1:1\) give \(P = \frac{1}{2}\); odds of \(2:1\) (I give \(2\) for each \(1\) you bet) give \(P = \frac{1}{3}\).

There are \(25! = 25 \times 24 \times \ldots \times 1\) possible ways to rearrange a sequence of 25 numbers. However, the order of the 1s and 0s within their groups is irrelevant: there are \(19!\) ways to rearrange the 1s and \(6!\) ways to rearrange the 0s, thus the number of distinct winning sequences is \[ \dfrac{25!}{19!\times 6!} = 177,100. \]
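In R, choose() gives the binomial coefficient directly, so the whole calculation takes two lines:

choose(25, 19)                # number of sequences with 19 wins in 25 tosses
#> [1] 177100
choose(25, 19) * 0.5^25       # probability the Patriots win 19 out of 25
#> [1] 0.005277991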

Probability measures the uncertainty of an event. But how do we measure probability? One school of thought takes probability as subjective, namely personal to the observer. de Finetti famously concluded that “Probability does not exist”: it is not like mass, which is a property of an object; the measurement is personal to the observer. If two different observers have differing “news”, then there is an incentive for them to bet (exchange contracts), and the terms of the bets they will accept lead to an assessment of probability. Ramsey (1926) takes this view.

Much of data science and analytics is then the art of building probability models to study phenomena. For many events most people will agree on their probabilities, for example \(p(H) = 0.5\) and \(p(T) = 0.5\). In the subjective view of probability we can measure or elicit a personal probability as a “willingness to play.” Namely, are you willing to bet $1 to get $2 if Tails lands and $0 if Heads occurs?

3.1.1 Conditional Probability

Suppose that we have two random variables \(X\) and \(Y\), which can be related to each other. Knowing \(X\) would change your news about \(Y\). For example, as a first pass, psychologists who study the phenomenon of happiness may be interested in understanding its relation to income level. Now we need a single probability mass function (a.k.a. probabilistic model) that describes all possible values of those two variables. Joint distributions do exactly that. Formally, the joint distribution of two variables \(X\) and \(Y\) is a function given by \[p(x,y) = \operatorname{P}\left(X=x,Y=y\right).\] This maps all combinations of possible values of these two variables to a probability in the interval [0,1].

Let’s look at an example. Suppose that we want to model the relationship between two quantities, salary \(Y\) and happiness \(X\). After running a survey, we summarize our results using the joint distribution, described by the following “happiness index” table as a function of salary.

Table 3.1: Results of the Gallup survey. Rows are salary (\(Y\)) and columns are happiness (\(X\))
X = 0 (low) X = 1 (medium) X = 2 (high)
Y = low (0) 0.03 0.13 0.14
Y = medium (1) 0.12 0.11 0.01
Y = high (2) 0.07 0.01 0.09
Y = very high (3) 0.02 0.13 0.14

Each cell of the table is a joint probability; e.g., 14% of people have a very high income level and are highly happy. These joint probabilities are calculated by simple counting and computing the proportions.

Now suppose we want to answer the question: what percentage of the population are high earners? For that we need to calculate what is called a marginal probability, \(p(y = 2)\). We can calculate the proportion of high earners \(p(y = 2)\) by summing up the entries in the third row of the table, which is 0.17 in our case.

0.07 +  0.01 + 0.09
#> [1] 0.17

Formally, the marginal probability over \(y\) is calculated by summing the joint probability over the other variable, \(x\), \[ p(y) = \sum_{x \in S}p(x,y) \] where \(S\) is the set of all possible values of the random variable \(X\).

Another, question of interest is whether happiness depends on income level. To answer those types of questions, we need to introduce an important concept, which is the conditional probability of \(X\) given that value of variable \(Y\) is known. This is denoted by \(p(X=x\mid Y=y)\) or simply \(p(x\mid y)\), where \(\mid\) reads as “given” or “conditional upon.”

The conditional probability \(p(x\mid y)\) also has an interpretation as updating your probability over \(X\) after you have learned the new information about \(Y\). In this sense, probability is also the language of how you change opinions in light of new evidence. The proportion of highly happy people among high earners is given by the conditional probability \(\operatorname{P}\left(X=2\mid Y=2\right)\) and can be calculated by dividing the proportion of those who are high earners and highly happy by the proportion of high earners

\[ \operatorname{P}\left(X=2\mid Y=2\right) = \dfrac{\operatorname{P}\left(X=2,Y=2\right)}{\operatorname{P}\left(Y=2\right)} = \dfrac{0.09}{0.17} = 0.53. \]

Now, if we compare it with the proportion of highly happy people \(\operatorname{P}\left(X = 2\right) = 0.38\), we see that on average you are more likely to be happy given your income is high.
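We can reproduce these calculations in R by storing Table 3.1 as a matrix; a minimal sketch:

# joint distribution p(x, y): rows are salary y = 0,...,3, columns are happiness x = 0, 1, 2
p = matrix(c(0.03, 0.13, 0.14,
             0.12, 0.11, 0.01,
             0.07, 0.01, 0.09,
             0.02, 0.13, 0.14), nrow = 4, byrow = TRUE)
p_y = rowSums(p)              # marginal distribution of salary
p_y[3]                        # p(y = 2), the high earners
#> [1] 0.17
p[3, 3]/p_y[3]                # p(x = 2 | y = 2)
#> [1] 0.5294118
colSums(p)[3]                 # marginal p(x = 2)
#> [1] 0.38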

Formally, conditional probability is calculated from the joint probability as follows: \[ p(x\mid y) = \dfrac{p(x,y)}{p(y)}. \] The formula above allows us to calculate the conditional probability from the joint and the marginal. However, we can use the same identity to calculate the joint from the conditional and the marginal, \[ p(x,y) = p(x\mid y)p(y). \] Using the symmetry of the joint distribution \(p(x,y) = p(y,x)\), we have \[ p(y\mid x)p(x) = p(x\mid y)p(y). \] We can re-write the above identity as \[ p(x\mid y) = \dfrac{p(y\mid x)p(x)}{p(y)}. \] This identity is called the Bayes rule. The important implication is that you can calculate \(p( x \mid y )\) from \(p( y \mid x )\) as long as you know the marginal probabilities \(p(x)\) and \(p(y)\). The marginal probabilities are given by \[ p(x) = \sum_{y \in S} p(x,y) \;\text{, and }\; p(y) = \sum_{x \in S} p(x,y) \] respectively.

3.2 Expectation

An expected value of a random variable is a weighted average: each possible value of the random variable is weighted by its probability. For example, Google Maps uses expected values when calculating travel times. We might compare two different routes by their expected travel time. Typically, a forecast, or expected value, is all that is required, and these expected values can be updated in real time as we travel. Say I am interested in the travel time from Washington National airport to Fairfax in Virginia. The histogram below shows the travel times observed on weekday evenings, obtained from Uber.

Example 3.4 (Google Maps & Uber)
d = read.csv("data/dc_travel_time.csv") 
# use evening travel times (column 18) and convert from seconds to minutes 
evening_tt = d[,18]/60; day_tt = d[,15]/60; 
evening_tt = evening_tt[!is.na(evening_tt)] # remove missing observations 
hist(evening_tt, freq = FALSE, nclass = 20, main = "Travel times in the evening", xlab = "Travel Time [min]", col = "blue")

From this dataset, we can empirically estimate the probabilities of observing different values of travel times

Travel Time Probability
18 0.05
22 0.77
28 0.18

From this sample, there is a small chance (5%) that I can get to Fairfax in 18 minutes, which probably happens on a holiday, and a non-trivial chance (18%) of traveling for 28 minutes, possibly due to a sports game or bad weather. Most of the time (77%) the travel time is 22 minutes. However, when Uber shows you the travel time, it uses the expected value as a forecast rather than the full distribution. Specifically, you will be given an expected travel time of about 23 minutes.

0.05*18 + 0.77*22 + 0.18*28
#> [1] 22.88

It is a simple summary that takes into account accidents and other events that can affect travel time, as best as it can.

The expected value \(\operatorname{E}\left(X\right)\) of a discrete random variable \(X\), which takes possible values \(\{x_1,\ldots, x_n\}\), is calculated using

\[ \operatorname{E}\left(X\right) =\sum_{i=1}^{n}\operatorname{P}\left(X = x_i\right)x_i \]

For example, in a binary scenario, if \(X\in \{0,1\}\) and \(P(X=1)=p\), then \(\operatorname{E}\left(X\right) = 0\times(1-p)+1\times p = p\). The expected value of a Bernoulli random variable is simply the probability of success. In many binary scenarios, a probabilistic forecast is sufficient.

3.3 Variance

3.3.1 Conditional Probability & Independence

Using the notion of conditional probability, we can define independence of two variables. Two random variables \(X\) and \(Y\) are said to be independent if \[ \operatorname{P}\left(Y = y \mid X = x\right) = \operatorname{P}\left(Y = y\right), \] for all possible \(x\) and \(y\) values. That is, learning the information \(X=x\) doesn’t affect our probabilistic assessment of \(Y\) for any value \(y\). Clearly, this is a strong condition and it rarely holds in practice.

Conditional probabilities are counterintuitive. For example, one of the most important properties is that typically \(p( x \mid y ) \neq p( y\mid x )\). Confusing the two is known as the Prosecutor’s Fallacy, as it arises when probability is used as evidence in a court of law. Specifically, the probability of innocence given the evidence is not the same as the probability of the evidence given innocence. It is very important to ask the question “what exactly are we conditioning on?” Usually, it is the observed evidence or data. Probability of guilt given evidence was one of the first applications of Bayes rule, and it is central to personalized probability.

We just derived an important relation that allows us to calculate the conditional probability \(p(x \mid y)\) when we know the joint probability \(p(x,y)\) and the marginal probability \(p(y)\). The total probability, or evidence, can be calculated as usual via \(p(y) = \sum_{x}p(x,y)\).

We will see that independence leads to a different conclusion than the Bayes conditional probability decomposition: independence yields \(p( x,y ) = p(x ) p(y )\), while Bayes rule says \(p( x ,y ) = p( x \mid y ) p( y )\).

Two random variables \(X\) and \(Y\) are independent if \[ p (Y = y \mid X = x) = p(Y = y), \] for all possible \(x\) and \(y\) values. In that case we only need to specify a distribution for each of the variables separately, and the joint distribution is given by \[ p(x,y) = p(x)p(y). \] If \(X\) and \(Y\) are independent, then the probability of the event \(X\) and the event \(Y\) happening at the same time is the product of the individual probabilities. From the conditional distribution formula it follows that \[ p(x \mid y) = \dfrac{p(x,y)}{p(y)} = \dfrac{p(x)p(y)}{p(y)} = p(x). \] Another way to think of independence is to say that knowing the value of \(Y\) doesn’t tell us anything about the possible values of \(X\). For example, when tossing a coin twice, the probability of getting \(H\) in the second toss does not depend on the outcome of the first toss.

The definition of independence expresses the fact that knowing one toss tells you nothing about the other. In the coin tossing example, if \(X\) is the outcome of the first toss and \(Y\) is the outcome of the second toss, \[ p ( X=H \mid Y=T ) = p (X=H \mid Y=H ) = p (X=H). \]

Let’s do a similar example which illustrates this point clearly. Most people would agree with the following conditional probability assessments

Prosecutor’s Fallacy

\[\begin{align*} \operatorname{P}\left(\text{Practice hard} \mid \text{Play in NBA}\right) \approx 1\\ \operatorname{P}\left( \text{Play in NBA} \mid \text{Practice hard}\right) \approx 0. \end{align*}\]

Even though you practice hard, the odds of playing in the NBA are low (about \(1000\) players out of \(7\) billion people). But given you’re in the NBA, you no doubt practice very hard. To understand this further, let’s look at the conditional probability implication and apply Bayes rule \[ p \left ( \text{Play in NBA} \mid \text{Practice hard} \right ) = \dfrac{p \left ( \text{Practice hard} \mid \text{Play in NBA} \right )p( \text{Play in NBA})}{p(\text{Practice hard})}. \] The initial (a.k.a. prior) probability \(p(\text{Play in NBA} ) = \frac{1000}{7 \cdot 10^9} \approx 0\) makes the conditional (or, so called, posterior) probability also very small, \[ p \left ( \text{Play in NBA} \mid \text{Practice hard} \right ) \approx 0, \] since \(P(\text{practice hard})\) is not that small while \(P(\text{practice hard} \mid \text{play in NBA})\approx 1\). Hence, when one ‘reverses the conditioning’ one gets a very small probability. This makes sense!
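To see the asymmetry numerically, here is a back-of-the-envelope R version of this Bayes calculation; the base rate for practicing hard is a made-up illustrative value:

p_nba = 1000/7e9              # prior: P(Play in NBA)
p_practice_given_nba = 1      # P(Practice hard | Play in NBA)
p_practice = 0.1              # P(Practice hard): assumed for illustration
p_practice_given_nba * p_nba / p_practice  # P(Play in NBA | Practice hard)
#> [1] 1.428571e-06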

3.3.2 Sensitivity and Specificity

Conditional probabilities are used to define two fundamental metrics of many probabilistic and statistical learning models, namely sensitivity and specificity. We explain them in a medical diagnosis set-up, where \(D\) indicates disease status and \(T\) the test result.

  • Sensitivity measures the true positive rate, or recall: the percentage of sick people who are correctly identified, \(P ( T=1 \mid D=1 )\).
  • Specificity measures the true negative rate: the percentage of healthy people correctly identified as such, \(P (T=0 \mid D=0 )\).

Sensitivity is often called the power of a procedure (a.k.a. test). There are two kinds of errors (type I and type II); they are the dual concepts to specificity and sensitivity.

  • A type I error (false positive rate) is the % of healthy people who test positive, \(p(T=1\mid D=0)\); it is the mistake of thinking something is true when it is not.
  • A type II error (false negative rate) is the % of sick people who test negative, \(p(T=0\mid D=1)\); it is the mistake of thinking something is not true when in fact it is true.

We would like to control both conditional probabilities with our test. We also want to know: if someone tests positive, how likely is it that they actually have the disease? There are two ‘errors’ one can make: falsely diagnosing a healthy person, or failing to find the disease in a sick one.

In the stock market, one can think of a type I error as not selling a losing stock quickly enough, and a type II error as failing to buy a growing stock, e.g. Amazon or Google.

Probability            Name            Rate                  Symbol
\(p(T=1\mid D=1)\)     Sensitivity     True Positive Rate    \(1-\beta\)
\(p(T=0\mid D=0 )\)    Specificity     True Negative Rate    \(1-\alpha\)
\(p(T=1\mid D=0)\)     1-Specificity   False Positive Rate   \(\alpha\) (type I error)
\(p(T=0\mid D =1)\)    1-Sensitivity   False Negative Rate   \(\beta\) (type II error)

We will extensively use the concepts of errors, specificity and sensitivity later in the book, when describing AB testing and predictive models.

These examples illustrate why people commonly miscalculate and misinterpret probabilities.

Example 3.5 (Wald and Airplane Safety)

Many lives were saved by an analysis of conditional probabilities performed by Abraham Wald during the Second World War. He was analyzing damage on US planes that came back from bombing missions over Germany. Somebody suggested analyzing the distribution of hits over different parts of the plane. The idea was to find a pattern in the damage and design a reinforcement strategy.

After examining hundreds of damaged airplanes, researchers came up with the following table

Location Number of Planes
Engine 53
Cockpit 65
Fuel system 96
Wings, fuselage, etc. 434

We can convert those counts to probabilities

Location Probability
Engine 0.08
Cockpit 0.1
Fuel system 0.15
Wings, fuselage, etc. 0.67

We can conclude that the most likely area to be damaged on the returned planes was the wings and fuselage, \[ p(\mbox{hit on wings or fuselage } \mid \mbox{returns safely}) = 0.67 . \] Wald realized that analyzing damage only on the planes that survived is not the right approach. Instead, he suggested that it is essential to calculate the inverse probability \[ p(\mbox{returns safely} \mid \mbox{hit on wings or fuselage }) = ? \] To calculate that, he interviewed many engineers and pilots, and he performed many field experiments. He analyzed likely attack angles. He studied the properties of a shrapnel cloud from a flak gun. He suggested to the army that they fire thousands of dummy bullets at a plane sitting on the tarmac. Wald carefully constructed a ‘probability model’ to reconstruct an estimate of the joint probabilities. The table below shows the results.

Hit Returned Shot Down
Engine 53 57
Cockpit 65 46
Fuel system 96 16
Wings, fuselage, etc. 434 33

This allows us to estimate joint probabilities, for example \[ p(\mbox{outcome = returns safely} , \mbox{hit = engine }) = 53/800 = 0.066 . \] We can also calculate the conditional probabilities now, \[ p(\mbox{outcome = returns safely} \mid \mbox{hit = wings or fuselage }) = \dfrac{434}{434+33} = 0.93 . \]

Should we reinforce the wings and fuselage? Which part of the airplane does need to be reinforced?

\[ p(\mbox{outcome = returns safely} \mid \mbox{hit = engine }) = \dfrac{53}{53+57} = 0.48 \] A plane hit in the engine returned safely only 48% of the time, compared with 93% for hits on the wings and fuselage: the returning planes show few engine hits precisely because planes hit there tended not to come back. So it is the engine, not the wings, that should be reinforced. A similar insight about selection in observed data appears in the economics literature, where George Akerlof won the Nobel Prize for the concept of asymmetric information.
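Wald’s reversal of the conditioning is a one-liner once the full table is in R; a sketch:

hits = data.frame(
  location  = c("Engine", "Cockpit", "Fuel system", "Wings, fuselage, etc."),
  returned  = c(53, 65, 96, 434),
  shot_down = c(57, 46, 16, 33))
# p(returns safely | hit location), location by location
round(hits$returned/(hits$returned + hits$shot_down), 2)
#> [1] 0.48 0.59 0.86 0.93

The engine is the worst place to be hit, even though it shows the fewest hits among planes that made it back.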

Example 3.6 (Sally Clark Case: Independence or Bayes Rule?) To show that independence can lead to dramatically different results from Bayes conditional probabilities, consider the Sally Clark case. Sally Clark was accused and convicted of killing her two children. One explanation is murder; the other is that they both died of sudden infant death syndrome (SIDS). How can we use conditional probability to figure out a reasonable assessment of the probability that she murdered her children? First, some known probability assessments:

  1. The chance of a family of non-smokers having a SIDS death is \(1\) in \(8,500\).

  2. The chance of a second SIDS death is \(1\) in \(100\).

  3. The chance of a mother killing her two children is around \(1\) in \(1,000,000\).

Under Bayes \[\begin{align*} \operatorname{P}\left(\mathrm{both} \; \; \mathrm{SIDS}\right) & = \operatorname{P}\left(\mathrm{first} \; \mathrm{SIDS}\right) \operatorname{P}\left(\mathrm{second} \; \mathrm{SIDS} \mid \mathrm{first} \; \mathrm{SIDS}\right)\\ & = \frac{1}{8500} \cdot \frac{1}{100} = \frac{1}{850,000}. \end{align*}\]

The \(1/100\) comes from taking into account the genetic properties of SIDS. Independence, as implemented by the court, gets you to a probabilistic assessment of \[ P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) = (1/8500) (1/8500) = 1/73,000,000. \] This is an exceedingly low probability. But it is still not the answer to the question of interest; we need \(P(\text{guilt} \mid \text{evidence})\), and this is where Bayes rule comes in.

First, a general comment on the likelihood ratio calculation used to assess the weight of evidence in favor of guilty versus innocent given the evidence. Under Bayes we’ll find that there is reasonable evidence that she would be acquitted. We need the relative odds ratio. Let \(I\) denote the event that Sally Clark is innocent and \(G\) that she is guilty. Let \(E\) denote the evidence. In most cases, \(E\) contains a sequence \(E_1, E_2, \ldots\) of ‘facts’ and we have to use the likelihood ratios in turn; Bayes rule then tells you to combine them in a multiplicative fashion. If the likelihood ratio is greater than \(1\), the odds shift toward innocence; if it is less than \(1\), guilt becomes more likely. By Bayes rule \[ \frac{p(I\mid E)}{p(G\mid E)} = \frac{p( E, I)}{p( E, G)}. \] If we further decompose \(p(E , I) = p(E\mid I )p(I)\) then we have to discuss the prior probability of innocence, namely \(p(I)\). This is one subtle advantage of the above decomposition.

The underlying intuition that Bayes gives us in this example is that, given two possible explanations of the data, both of which are unlikely, it is their relative likelihood that should matter. Here is a case where the \(p\)-value would be nonsensical (\(p(E\mid I) \neq p(I\mid E)\)): we are effectively comparing two rare-event probabilities from the two possible models or explanations.

Hence putting these two together gives the odds of innocence as \[ \frac{p(I\mid E)}{p(G\mid E)} = \frac{1/850,000}{1/1,000,000} \approx 1.18. \] Solving for the posterior probability of guilt yields \[ p( G\mid E) = \frac{1}{1 + O(I\mid E)} = \frac{1}{1+1.18} \approx 0.46. \] Basically a \(50/50\) bet. Not enough to definitively convict! But remember that our initial prior probability on guilt \(p(G)\) was \(10^{-6}\). So there has been a dramatic increase to a posterior probability of \(0.46\). So it’s not as if Bayes rule treats this as evidence in the suspect’s favor; but the magnitude is still not in the \(0.999\) range, where most jurors would have to be to feel comfortable with a guilty verdict.

If you use the ‘wrong’ model of independence (as the court did) you get \[ P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) = \frac{1}{8500} \cdot\frac{1}{8500} = \frac{1}{73,000,000}. \] With the independence assumption, you make the assessment \[ \frac{p(I\mid E)}{p(G\mid E)} = \frac{1}{73} \; {\rm and} \; p( G\mid E) \approx 0.99. \] Given these probability assumptions, the suspect looks guilty with probability 99%.
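The two competing models are easy to compare in R; a sketch of both calculations:

p_guilt = 1/1e6                        # prior: P(mother kills both children)
p_sids_bayes = (1/8500) * (1/100)      # dependent SIDS model: 1/850,000
p_sids_indep = (1/8500)^2              # court's independence model: ~1/73,000,000
1/(1 + p_sids_bayes/p_guilt)           # P(G | E) under the Bayes model
#> [1] 0.4594595
1/(1 + p_sids_indep/p_guilt)           # P(G | E) under independence
#> [1] 0.9863482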

Experts also misinterpreted the evidence by saying there was a 1 in 73 million chance that it was someone else. This is clearly false and misleading to the jury, and it has led to appeals.

Example 3.7 (O. J. Simpson Case (Dershowitz Fallacy)) \(p ( G \mid B ) \neq p( G \mid B , M )\). Alan Dershowitz (O. J. Simpson’s lawyer) stated to the press that fewer than \(1\) in \(2000\) batterers go on to murder their wives (in any given year).

He used this probability to argue that, because so few husbands who batter their wives actually go on to murder them, O. J. is highly likely to be not guilty. This leaves out the most relevant conditioning information: we also know that Nicole Brown Simpson was actually murdered. Put in formula form, \(p ( G \mid B ) \neq p( G \mid B , M )\).

It is not hard to come to a wrong conclusion if you don’t take into account all the relevant conditional information.

For example, in the famous O. J. Simpson case, let \(G\) denote guilt, \(M\) denote the event that the wife was murdered, and \(B\) the event that she was battered by her husband.

He intended this information to exonerate O. J. However, given background rates of homicide, \(p( M \mid \text{not } G , B ) = p( M\mid \text{not } G ) = \frac{1}{20,000}\). The Bayes factor is then \[ \frac{ p( G \mid M , B ) }{ p( \text{not } G \mid M , B ) } = \frac{ 1/2000 }{1 /20,000} = 10, \] which implies posterior probabilities \[ p( \text{not } G \mid M , B ) = \frac{1}{11} \; {\rm and} \; p( G \mid M , B ) = \frac{10}{11} . \] Hence there is over a \(90\)% chance that O. J. is guilty based on this information!
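In R, the same odds calculation takes two lines:

bf = (1/2000)/(1/20000)       # Bayes factor for guilt given M and B
bf
#> [1] 10
bf/(1 + bf)                   # P(G | M, B)
#> [1] 0.9090909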

3.4 Bayesian Learning

How do you update your beliefs in the presence of new information? This is one of the key questions in the theory of probabilistic learning. Bayes rule provides the answer. When calculating probabilities of an outcome, we often need to account for additional information. Usually the problem is to find the probability of an event \(X=x\) after we have observed some data outcome \(Y=y\). For example, suppose we need to find the probability that a thrown die shows an odd number on its upper surface, and we have found out that the number shown is less than 4. Of the possible outcomes \(\{1,2,3\}\), two are odd, so we write \(p(X = \text{odd} \mid Y = \text{less than 4})= 2/3\).

Conditional probability can be interpreted as updating your probability of event \(X=x\) after you have learned the new information that \(Y=y\) has occurred. We denote this by \(P(X=x \mid Y=y)\). Bayes rule then allows us to update a predictive rule given observed data. In this sense, probability is also the language that will let you change opinions in the light of new evidence. Probability rules allow us to change our mind if the facts change, a point famously attributed to Keynes.

Dennis Lindley argued that we should all be trained in Bayes rule: conditional probability can simply be viewed as disciplined probability accounting, akin to how market odds change as evidence changes. One issue is that human behaviour and intuition are not naturally trained in how to use Bayes rule; it has to be learned, akin to learning the alphabet! Daniel Kahneman’s Thinking, Fast and Slow discusses the intuitive way of making calculations (fast) versus the deep thinking process (slow). The author demonstrates that use of Bayes rule requires us to switch to the slow mode, which is hard and, in fact, painful for many people. Kahneman considers the following question; please assume that Steve was selected at random from a representative sample. An individual has been described by a neighbor as follows: “Steve is very shy and withdrawn, invariably helpful but with little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail.” Is Steve more likely to be a librarian or a farmer? According to Kahneman, most people assume Steve is a librarian, which is wrong. Since there are 20 male farmers for each male librarian in the United States, probabilistically Steve is more likely to be a farmer. Let’s look at it graphically.


Figure 3.1: Librarian or Farmer

Even though the proportion of shy people is much higher among librarians than among farmers, the much larger population of farmers means we still have more shy farmers in the US than shy librarians. Graphically, the purple rectangle is larger than the blue one.
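The picture can be turned into a quick Bayes calculation; the shyness rates below are made-up illustrative numbers, only the 20-to-1 base rate comes from the text:

p_librarian = 1/21            # base rate: 20 male farmers per male librarian
p_farmer = 20/21
p_shy_lib = 0.40              # assumed: fraction of librarians fitting the description
p_shy_farm = 0.10             # assumed: fraction of farmers fitting it
p_librarian * p_shy_lib / (p_librarian * p_shy_lib + p_farmer * p_shy_farm)
#> [1] 0.1666667

Even with librarians four times as likely to be shy, the posterior probability that Steve is a librarian is only about \(1/6\).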

Bayesian learning was used by the mathematician Alan Turing at Bletchley Park in England to break the German Enigma code, a development that helped the Allies win the Second World War (Simpson 2010). Turing called his algorithm Banburismus (David MacKay’s book has a detailed discussion of how to calculate the Bayes factors in this case); it is a process he invented which used sequential conditional probability to infer information about the likely settings of the Enigma machine.

These quantities can be calculated using Bayes rule. For example, consider a test where the joint probability of testing positive and being sick is \(p(T=1,D=1)=0.019\) and the disease prevalence is \(p(D=1)=0.02\). The sensitivity is \[ p(T=1\mid D=1) = \dfrac{p(T=1,D=1)}{p(D=1)} = 0.019/0.02 = 0.95 \] The specificity is given by \[ p(T=0\mid D=0) = \dfrac{p(T=0,D=0)}{p(D=0)} = 0.9702/0.98 = 0.99. \] As we see, the test is highly sensitive and specific. However, only 66% of those who test positive will have the disease. This is because the number of sick people is much smaller than the number of healthy people, combined with the presence of type I errors.

Example 3.8 (Apple Watch Series 4 ECG and Bayes’ Theorem) The Apple Watch Series 4 can perform a single-lead ECG and detect atrial fibrillation. The software can correctly identify 98% of cases of atrial fibrillation (true positives) and 99% of cases of non-atrial fibrillation (true negatives).

Predicted \ Actual        atrial fibrillation   no atrial fibrillation   Total
atrial fibrillation       1960                  980                      2940
no atrial fibrillation    40                    97020                    97060
Total                     2000                  98000                    100000

However, what is the probability of a person having atrial fibrillation when atrial fibrillation is identified by the Apple Watch Series 4? We use Bayes theorem to answer this question.

\[ p(\text{atrial fibrillation}\mid \text{atrial fibrillation is identified }) = \frac{0.01960}{ 0.02940} = 0.6667 \] The conditional probability of having atrial fibrillation when the Apple Watch Series 4 detects atrial fibrillation is about 67%.

In people younger than 55, the Apple Watch’s positive predictive value is just 19.6 percent. That means in this group, which constitutes more than 90 percent of users of wearable devices like the Apple Watch, roughly 80 percent of the app’s atrial fibrillation diagnoses are false positives. (You can try the calculation yourself using a Bayesian calculator: enter 0.001 for prevalence, 0.98 for sensitivity, and 0.996 for specificity.)
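You can reproduce these positive predictive values with a small R function implementing Bayes rule:

ppv = function(prev, sens, spec) {
  # P(disease | positive test) via Bayes rule
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}
ppv(prev = 0.001, sens = 0.98, spec = 0.996)  # users younger than 55
#> [1] 0.1969453
ppv(prev = 0.02, sens = 0.98, spec = 0.99)    # the population in the table above
#> [1] 0.6666667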

The electrocardiogram app becomes more reliable in older individuals: The positive predictive value is 76 percent among users between the ages of 60 and 64, 91 percent among those aged 70 to 74, and 96 percent for those older than 85.

Example 3.9 (USS Scorpion) The USS Scorpion sank on 5 June, 1968, in the middle of the Atlantic. Experts placed bets on each possible casualty scenario and how each would affect the sinking. Undersea soundings gave a prior on location. Bayes rule, where \(L\) is location and \(S\) is scenario: \[ p (L \mid S) = \frac{ p(S \mid L) p(L)}{p(S)} \] The Navy had spent \(5\) months looking and found nothing. After a probability map was built, the submarine was found within \(5\) days, just \(220\) yards from the most probable location!

A similar story happened during the search of an Air France plane that flew from Rio to Paris.

3.4.1 Enigma machine: Code-breaking

This analysis follows Turing and Good (1940s), as presented by MacKay. Consider an alphabet of \(A = 26\) letters. Let \(x\) and \(y\) be two coded messages of length \(T\). We count how many letters match (\(M\)) and don’t match (\(N\)) in these sequences. Even though the codes describe different sentences, when the underlying letters are the same and the same code settings are in use, the coded sequences will show a match. To compute the Bayes factor we need the joint probabilities \[ P( x,y\mid \mathcal{H}_0 ) \; \; {\rm and} \; \; P( x,y\mid \mathcal{H}_1 ), \] where under \(\mathcal{H}_0\) they are different codes, in which case the joint probability is \(( 1 / A )^{2T}\). For \(\mathcal{H}_1\) we first need to know the chance of the same letter matching. If \(p_i\) denotes the frequencies of use of English letters, then the match probability is \(m = \sum_{i} p_i^2\), which is about \(2/26\). Hence for a particular pair of letters \[ P( x_i , y_i \mid \mathcal{H}_1 ) = \frac{m}{A} \; {\rm if} \; x_i =y_i \; \; {\rm and} \; \; P( x_i , y_i \mid \mathcal{H}_1 ) = \frac{1-m}{A(A-1)} \; {\rm if} \; x_i \neq y_i . \] Hence the log Bayes factor is \[\begin{align*} \ln \frac{P( x,y\mid \mathcal{H}_1 )}{P( x,y\mid \mathcal{H}_0 )} & = M \ln \frac{ m/A}{1/A^2} +N \ln \frac{ ( 1-m ) / A(A-1) }{ 1/ A^2} \\ & = M \ln mA + N \ln \frac{ ( 1-m )A }{A-1 } \end{align*}\] Each match increases the Bayes factor by a large amount, about \(3.0\) on the deciban (\(10\log_{10}\)) scale; each non-match decreases it by about \(0.18\) decibans.

Example 3.10 For example, \(M=4\) matches and \(N=47\) non-matches out of \(T=51\) letters gives evidence of about \(2.5\) to \(1\) in favor of \(\mathcal{H}_1\).

How long a sequence do you need to look at? Calculate the expected log odds. Turing and Good figured you needed sequences of about length \(400\). One can also look at doubles and triples. See MacKay for further details.
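A sketch of the deciban bookkeeping, using the match probability \(m \approx 2/26\) from the text:

A = 26; m = 2/26              # alphabet size and per-letter match probability
match_score = 10 * log10(m * A)                   # decibans gained per match
nomatch_score = 10 * log10((1 - m) * A/(A - 1))   # decibans lost per non-match
c(match_score, nomatch_score)
#> [1]  3.0103000 -0.1772877
log_bf = 4 * match_score + 47 * nomatch_score     # M = 4 matches, N = 47 non-matches
round(10^(log_bf/10), 2)      # Bayes factor, roughly the 2.5-to-1 of Example 3.10
#> [1] 2.35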

3.5 Play the Winner Rule

Suppose you are betting on red or black at roulette. There are \(38\) slots: \(18\) black, \(18\) red and \(2\) green. So the probabilities and expectation are stacked against you. Suppose you have just seen a run of all Reds, \[ RRRRRRRRRR \] Most people will want to bet black in this case, when the right answer is don’t bet. In the 1920s Wilson recorded which number most frequently occurred on the roulette wheel and used the play-the-winner rule. The wheels of the time were physically biased, and this was a winning strategy. Ethier (1982) provides further examples.

3.5.1 Black Swans

A related problem is the Black Swan inference problem. Suppose that after \(T\) trials, where \(T\) is large, you have only seen successes, and that you assess the probability of the next trial being a success as \((T+1)/(T+2)\), that is, almost certain. This is a model of observing White Swans and having never seen a Black Swan (Taleb, 2008, The Black Swan: The Impact of the Highly Improbable). Taleb makes it sound as if the rules of probability are not rich enough to handle Black Swan events. There is a related class of problems in finance known as Peso problems, where countries decide to devalue their currencies and there is little a priori evidence from recent history that such an event is going to happen.

To obtain such a probability assessment we use a Binomial/Beta conjugate Bayes updating model. The key point is that it can also tell us that there is still a large probability of a Black Swan event happening sometime in the future. The independence model has difficulty doing this.

The Bayes learning Beta-Binomial model will have no problem. We model a sequence of binary outcomes \(Y_{t}=0\) or \(1\), with probability \(P\left( Y_{t}=1\mid \theta\right) =\theta\). This is the classic Bernoulli “coin-flipping” model and is a component of more general specifications such as regime switching or outlier-type models.

The likelihood for a sequence of Bernoulli observations is \[ p\left( y\mid \theta\right) =\prod_{t=1}^{T}p\left( y_{t}\mid \theta\right) =\theta^{\sum_{t=1}^{T}y_{t}}\left( 1-\theta\right) ^{T-\sum_{t=1}^{T}y_{t}}. \] The maximum likelihood estimator is the sample mean, \(\widehat{\theta} = T^{-1}\sum_{t=1}^{T}y_{t}\). This makes little sense when you just observe white swans. It predicts \(\hat{\theta} = 1\) and gets shocked when it sees a black swan (zero probability event). Bayes, on the other hand, allows for ‘learning.’

To do this we need a prior distribution for the ‘parameter’ \(\theta\). A natural choice is a Beta distribution, denoted by \(\theta\sim\text{Beta}\left( a,A\right)\), whose pdf is given by \[ p\left( \theta\mid a,A\right) =\frac{\theta^{a-1}\left( 1-\theta\right) ^{A-1}}{B\left( a,A\right) }, \] where \(B\left( a,A\right)\) denotes the Beta function. Since \(p\left( \theta\mid a,A\right)\) is a density and integrates to 1, we have

\[ B\left( a,A\right) =\int_{0}^{1}\theta^{a-1}\left( 1-\theta\right) ^{A-1}d\theta . \]

Bayes rule then tells us how to combine the likelihood and the prior to obtain the posterior distribution of \(\theta \mid Y=y\), namely what we believe about \(\theta\) given the observed sequence. Our prediction rule is then \(P(Y_{T+1} =1 \mid Y=y ) = \mathbb{E}(\theta \mid y)\). It is straightforward to show that the posterior distribution is again a Beta distribution, \[ \theta \mid y \sim\mathcal{B}\left( a_{T},A_{T}\right) \; {\rm where} \; a_{T}=a+\sum_{t=1}^{T}y_{t} \; , \; A_{T}=A+T-\sum_{t=1}^{T}y_{t} . \] This is the “conjugate” form of the posterior: it is also a Beta distribution, and the hyper-parameters \(a_{T}\) and \(A_{T}\) depend on the data only via the sufficient statistics \(T\) and \(\sum_{t=1}^{T}y_{t}\). The posterior mean and variance are

\[ \mathbb{E}\left[ \theta\mid y\right] =\frac{a_{T}}{a_{T}+A_{T}} \;\text{ and }\; Var\left[ \theta\mid y\right] =\frac{a_{T}A_{T}}{\left( a_{T}+A_{T}\right) ^{2}\left( a_{T}+A_{T}+1\right) }\text{,} \]

respectively. This implies that for large samples, \(\operatorname{E}\left(\theta\mid y\right) \approx \bar{y} = \widehat{\theta}\), the MLE.


To obtain the \((T+1)/(T+2)\) probability assessment above, a natural model is the Beta-Binomial conjugate Bayesian updating model. With it we can also assess the probability of a Black Swan event happening sometime in the future.

For the purpose of illustration, start with a uniform prior specification, \(\theta \sim \mathcal{U}(0,1)\); then we have the following probability assessment. After \(T\) trials, suppose that we have only seen successes, namely \(( y_1 , \ldots , y_T ) = ( 1 , \ldots , 1 )\). Then you assess the probability of the next trial being a success as \[ p( Y_{T+1} =1 \mid y_1=1 , \ldots , y_T=1 ) = \frac{T+1}{T+2} . \] This follows from the mean of the Beta posterior \(\theta \mid y \sim \text{Beta}(T+1, 1)\), since \(P(Y_{T+1} = 1 \mid y) = \mathbb{E}_{\theta \mid y}\left[P(Y_{T+1}=1 \mid \theta) \right] = \mathbb{E}[\theta \mid y]\). For large \(T\) this is almost certain.

Now consider a future set of \(n\) trials, where \(n\) is also large. The probability of never seeing a Black Swan is then given by \[ p( y_{T+1} =1 , \ldots , y_{T+n} = 1 \mid y_1=1 , \ldots , y_T=1 ) = \frac{ T+1 }{ T+n+1 } \] For a fixed \(T\), and large \(n\), we have \(\frac{ T+1 }{ T+n+1 } \rightarrow 0\). Hence, we will see a Black Swan event with large probability — we just don’t know when! The exchangeable Beta-Binomial model then implies that a Black Swan event will eventually appear. One shouldn’t be that surprised when it actually happens.
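A numerical illustration of both formulas, with a hypothetical \(T = 100\) past trials:

T_trials = 100                     # past trials, all successes (white swans)
(T_trials + 1)/(T_trials + 2)      # P(next trial is a success)
#> [1] 0.9901961
n = 10000                          # a long future horizon
(T_trials + 1)/(T_trials + n + 1)  # P(no Black Swan in the next n trials)
#> [1] 0.00999901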

Example 3.11 (Bayes Home Diagnostics) Suppose I have a house alarm which sends me a text notification when motion inside the house is detected. It detects motion when there is a person inside (a burglar, \(b\)) or during an earthquake (\(e\)). Say, from prior data, we know that during an earthquake the alarm (\(a\)) is triggered in 10% of the cases. Once I receive a text message, I start driving back home. While driving, I hear a report on the radio (\(r\)) about a small earthquake in our area. We want to know \(p(b \mid a)\) and \(p(b \mid a,r)\).

The joint distribution is then given by \[ p(b,e,a,r) = p(r \mid a,b,e)p(a \mid b,e)p(b)p(e). \] Since we know the causal relations, we can simplify this expression \[ p(b,e,a,r) = p(r \mid e)p(a \mid b,e)p(b)p(e). \] The joint distribution is defined by

\(p(a=1 \mid b,e)\) b e
0 0 0
0.1 0 1
1 1 0
1 1 1

Graphically, we can represent the relations between the variables as a Directed Acyclic Graph (DAG), known as a Bayesian network.

Now we can easily calculate \(p(a=0 \mid b,e)\) from the property of a probability distribution, \(p(a=1 \mid b,e) + p(a=0 \mid b,e) = 1\). We set \(p(r=1 \mid e=1) = 0.5\) and \(p(r=1 \mid e=0) = 0\), and, say, based on historic data we have \(p(b) = 2\cdot10^{-4}\) and \(p(e) = 10^{-2}\). Note that the causal relations allowed us to have a more compact representation of the joint probability distribution; the original naive representation requires specifying \(2^4\) parameters.

To answer our original question, calculate \[ p(b \mid a) = \dfrac{p(a \mid b)p(b)}{p(a)},~~p(a) = p(a=1 \mid b=1)p(b=1) + p(a=1 \mid b=0)p(b=0). \] We have everything but \(p(a \mid b)\). This is obtained by marginalizing over \(e\), to yield \[ p(a \mid b) = p(a \mid b,e=1)p(e=1) + p(a \mid b,e=0)p(e=0). \] We can calculate \[ p(a=1 \mid b=1) = 1, ~p(a=1 \mid b=0) = 0.1\times 10^{-2} + 0 = 10^{-3}. \] This leads to \(p(b \mid a) = 2\cdot10^{-4}/(2\cdot10^{-4} + 10^{-3}(1-2\cdot10^{-4})) = 1/6\).

This result is somewhat counterintuitive. We get such a low probability of burglary because its prior is very low compared to the prior probability of an earthquake. What happens to the posterior if we live in an area with higher crime rates, say \(p(b) = 10^{-3}\)? Figure 3.2 shows the relationship between the prior and posterior, \[ p(b \mid a) = \dfrac{p(b)}{p(b) + 10^{-3}(1-p(b))}. \]

Figure 3.2: Relationship between the prior and posterior

Now, suppose that you hear on the radio about a small earthquake while driving. Then, using Bayesian conditioning, \[\begin{align*} p(b=1 \mid a=1,r=1) = & \dfrac{p(a,r \mid b)p(b)}{p(a,r)}\\ = &\dfrac{\sum_e p(b=1,e,a=1,r=1)}{\sum_b\sum_ep(b,e,a=1,r=1)} \\ = &\dfrac{\sum_ep(r=1 \mid e)p(a=1 \mid b=1,e)p(b=1)p(e)}{\sum_b\sum_ep(r=1 \mid e)p(a=1 \mid b,e)p(b)p(e)} \end{align*}\] which is \(\approx 0.2\%\) in our case. This effect is called explaining away: the new information (the radio report of an earthquake) explains the alarm and drives down the probability of the competing cause (a burglar).
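The full network calculation can be checked by brute-force enumeration in R; a sketch:

# enumerate all (b, e) combinations; a = 1 and r = 1 are the observed events
g = expand.grid(b = 0:1, e = 0:1)
p_b = c(0.9998, 2e-4)[g$b + 1]              # p(b), with p(b = 1) = 2e-4
p_e = c(0.99, 1e-2)[g$e + 1]                # p(e), with p(e = 1) = 1e-2
p_a = c(0, 0.1, 1, 1)[g$b * 2 + g$e + 1]    # p(a = 1 | b, e) from the table
p_r = c(0, 0.5)[g$e + 1]                    # p(r = 1 | e)
joint_a = p_a * p_b * p_e                   # p(b, e, a = 1)
sum(joint_a[g$b == 1])/sum(joint_a)         # p(b = 1 | a = 1), about 1/6
#> [1] 0.1666944
joint_ar = p_r * joint_a                    # p(b, e, a = 1, r = 1)
sum(joint_ar[g$b == 1])/sum(joint_ar)       # p(b = 1 | a = 1, r = 1)
#> [1] 0.001996407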

3.6 Continuous Random Variables

If we want to build a probabilistic model of a stock price or return, we need to use a continuous random variable that can take any value in an interval. Instead of a frequency function, we use a density function \(p(x)\) to describe a continuous variable. Unlike the discrete case, \(p(x)\) is not the probability that the random variable takes value \(x\). Rather, we need to talk about the value being inside an interval. For example, the probability of \(X\) with density \(p(x)\) being inside an interval \([a,b]\), with \(a<b\), is given by \[ P(a < X < b) = \int_{a}^{b}p(x)dx. \] The total probability is one, \(\int_{-\infty}^\infty p(x) dx=1\). The simplest continuous random variable is the uniform. A uniform distribution describes a variable which takes on any value in a range as likely as any other. For example, if you are asked what the temperature in Chicago will be on July 4 of next year, you might say anywhere between 20 and 30 C. The density function of the corresponding uniform distribution is then \[ p(x) = \begin{cases} 1/10, ~~20 \le x \le 30\\0, ~~~~~\mbox{otherwise.}\end{cases} \]

Under this model, the probability of the temperature being between 25 and 27 degrees is \[ P(25 \le X \le 27) = \int_{25}^{27} p(x)dx = (27-25)/10 = 0.2. \]


Figure 3.3: Probability of temperature being between 25 and 27
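In R, the uniform interval probability is built in:

punif(27, min = 20, max = 30) - punif(25, min = 20, max = 30)  # P(25 <= X <= 27)
#> [1] 0.2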

3.6.1 Standard Deviation and Covariance

Variance measures the spread of a random variable around its expected value \[ \operatorname{Var}\left(X\right) = \operatorname{E}\left((X-\operatorname{E}\left(X\right))^2\right) = \int_{-\infty}^\infty (x-\mu) ^2 p_X(x)dx, \] where \(\mu = \mathbb{E}(X)=\int_{-\infty}^{\infty}x\,p_X(x)dx\). The standard deviation is the square root of the variance, \(\operatorname{sd}\left(X\right) = \sqrt{\operatorname{Var}\left(X\right)}\). It has the desirable property that it is measured in the same units as the random variable \(X\) itself, which makes it the more convenient and useful measure.

3.6.1.1 Covariance.

Suppose that we have two random variables \(X\) and \(Y\). We need to measure whether they move together or in opposite directions. The covariance is defined by \[ \operatorname{Cov}\left(X,Y\right) = \operatorname{E}\left[ \left( X- \operatorname{E}\left(X\right)\right)\left(Y- \operatorname{E}\left(Y\right)\right)\right]. \]

When \(X\) and \(Y\) are discrete and we are given the joint probability distribution, we calculate \[ \operatorname{Cov}\left(X,Y\right) = \sum_{x,y} ( x - \operatorname{E}\left(X\right) )(y - \operatorname{E}\left(Y\right))p(x,y). \] Covariance is measured in units of \(X\) times units of \(Y\). This can be inconvenient and makes it hard to compare covariances of different pairs of variables. A more convenient metric is the correlation, which is defined by \[ \operatorname{Corr}\left(X,Y\right)= \frac{ \operatorname{Cov}\left(X,Y\right) }{ \operatorname{sd}\left(X\right) \operatorname{sd}\left(Y\right) }. \] Correlation, \(\operatorname{Corr}\left(X,Y\right)\), is unitless and takes values between \(-1\) and \(1\).

For example, if \(\operatorname{Cov}\left(\mbox{Apple Sales, GDP}\right) = 8\), \(\operatorname{sd}\left( \mbox{Apple Sales} \right) = 2\) and \(\operatorname{sd}\left(\mbox{GDP} \right) = 5\), then there’s an \(80\)% correlation between Apple Sales and GDP, as \[ \operatorname{Corr}\left( \mbox{Apple Sales, GDP} \right) = \frac{8}{2 \times 5} = \frac{8}{10} = 0.8. \]
Example 3.12 (Google Stock Returns)
goog = read.csv("data/GOOG2019.csv") 
rgoog = goog$Adj.Close[2:251]/goog$Adj.Close[1:250] - 1 
sp = read.csv("data/SP2019.csv");   rsp = sp$Adj.Close[2:251]/sp$Adj.Close[1:250] - 1 
plot(rgoog, rsp, col="lightblue", pch=21, bg="grey", xlab="GOOG return", ylab="SP500 return") 

var_goog = mean((rgoog - mean(rgoog))^2) 
var_sp = mean((rsp - mean(rsp))^2) 
cov = mean((rgoog - mean(rgoog))*(rsp - mean(rsp))); print(cov) 
#> [1] 8e-05
cor = cov/(sqrt(var_goog)*sqrt(var_sp)); print(cor)
#> [1] 0.67

3.6.1.2 Portfolios: Linear combinations

Calculating means and standard deviations of combinations of random variables is a central tool in probability. It is known as the portfolio problem. Let \(P\) be your portfolio, which comprises a mix of two assets \(X\) and \(Y\), typically stocks and bonds, \[P = aX + bY,\] where \(a\) and \(b\) are the portfolio weights, typically with \(a+b=1\), as we are allocating our total capital. Imagine that you have placed \(a\) dollars on the random outcome \(X\) and \(b\) dollars on \(Y\). The portfolio \(P\) measures your total weighted outcome.

Key portfolio rules: The expected value and variance follow the relations \[\begin{align*} \operatorname{E}\left(aX + bY\right) = & a\operatorname{E}\left(X\right)+b\operatorname{E}\left(Y\right)\\ \operatorname{Var}\left( aX + bY \right) = & a^2 \operatorname{Var}\left(X\right) + b^2 \operatorname{Var}\left(Y\right) + 2 ab \operatorname{Cov}\left(X,Y \right), \end{align*}\] with covariance defined by \[ \operatorname{Cov}\left(X,Y\right) = \operatorname{E}\left( ( X- \operatorname{E}\left(X\right) )(Y- \operatorname{E}\left(Y\right))\right). \] Expectation and variance help us to understand the long-run behavior. When we make long-term decisions, we need to use the expectations to avoid biases.

The covariance is related to the correlation by \(\operatorname{Cov}\left(X,Y\right) = \operatorname{Corr}(X, Y) \cdot \sqrt{\operatorname{Var}\left(X\right) \cdot \operatorname{Var}\left(Y\right)}\).

Example 3.13 (Historical Stock and Bond Returns) Consider a portfolio of \(60\%\) stocks and \(40\%\) bonds, as sketched below. The portfolio rules let us compute its expected return and risk; historically, the interesting question is how such a mix weathered the worst drawdowns, such as the 2008 financial crisis and the Coronavirus crisis.
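A numerical sketch of such a portfolio; the return, risk and correlation figures below are illustrative assumptions, not historical estimates:

a = 0.6; b = 0.4              # portfolio weights: 60% stocks, 40% bonds
mu_s = 0.08; mu_b = 0.03      # assumed annual expected returns
sd_s = 0.20; sd_b = 0.07      # assumed annual standard deviations
rho = 0.2                     # assumed stock-bond correlation
cov_sb = rho * sd_s * sd_b
mu_p = a * mu_s + b * mu_b                                # E(aX + bY)
var_p = a^2 * sd_s^2 + b^2 * sd_b^2 + 2 * a * b * cov_sb  # Var(aX + bY)
c(mu_p, sqrt(var_p))          # portfolio expected return and risk
#> [1] 0.0600000 0.1285613

Under these assumptions, the 60/40 mix keeps most of the stock’s expected return while cutting the risk well below the stock’s 20%: the diversification benefit.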

Example 3.14 (Tortoise and Hare) Consider Tortoise and Hare, who are selling cars. Let \(X\) be the number of cars sold; the probability distributions, means and variances are given by the following table.

          \(p(X=0)\)  \(p(X=1)\)  \(p(X=2)\)  \(p(X=3)\)  \(\operatorname{E}\left(X\right)\)  \(\operatorname{Var}\left(X\right)\)  \(\sqrt{\operatorname{Var}\left(X\right)}\)
Tortoise  0           0.5         0.5         0           1.5         0.25        0.5
Hare      0.5         0           0           0.5         1.5         2.25        1.5

Let’s do Tortoise expectations and variances \[\begin{align*} \operatorname{E}\left(T\right) & = (1/2) (1) + (1/2)(2) = 1.5 \\ \operatorname{Var}\left(T\right) & = \operatorname{E}\left(T^2\right) - \operatorname{E}\left(T\right)^2 \\ & = (1/2)(1)^2 + (1/2)(2)^2 - (1.5)^2 = 0.25 \end{align*}\]

Now the Hare’s \[\begin{align*} \operatorname{E}\left(H\right) & = (1/2)(0) + (1/2)(3) = 1.5 \\ \operatorname{Var}\left(H\right) & = (1/2)(0)^2 + (1/2)(3)^2- (1.5)^2 = 2.25 \end{align*}\]

What do these tell us about the long run behavior?

  1. Tortoise and Hare have the same expected number of cars sold.

  2. Tortoise is more predictable than Hare. He has a smaller variance.

The standard deviations \(\sqrt{\operatorname{Var}\left(X\right)}\) are \(0.5\) and \(1.5\), respectively. Given two equal means, you always want to pick the lower variance. If we are to invest into one of those, we prefer Tortoise.

3.7 Common Distributions

3.7.1 Binomial Distribution

Bernoulli modelled the notion of probability for a coin toss, now known as the Bernoulli distribution, where \(X \in \{0,1\}\) and \(P(X=1)=p\), \(P(X=0) = 1-p\). Laplace gave us the principle of insufficient reason: you list out the possibilities and then place equal probability on each of the outcomes. This is essentially the discrete uniform distribution on the set of possible outcomes.

A Bernoulli trial relates to an experiment with the following conditions

  1. The result of each trial is either a success or failure.

  2. The probability \(p\) of a success is the same for all trials.

  3. The trials are assumed to be independent.

A Binomial distribution arises from a sequence of \(n\) Bernoulli trials, and assigns probability to \(X\), the number of successes. Its probability distribution is calculated via \[ p (X=x) = {n \choose x} p^x(1-p)^{n-x}. \] Here \({n \choose x} = \frac{n!}{x!(n-x)!}\), where \(n!=n(n-1)(n-2)\ldots 2 \cdot 1\), is the combinatorial function that counts the number of ways of getting \(x\) successes in \(n\) trials.

Mean and Variance of the Binomial Distribution
Expected value \(\mu = E(X) = n p\)
Variance \(\sigma^2 = Var(X) = n p ( 1 - p )\)
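With these formulas in hand, the Patriots calculation from Example 3.3 is a single call to R’s built-in binomial pmf:

dbinom(19, size = 25, prob = 0.5)  # P(X = 19) for X ~ Bin(25, 1/2)
#> [1] 0.005277991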

Suppose we are about to toss two coins. Let \(X\) denote the number of heads. Then the following table specifies the probability distribution \(p(x)\) for all possible values \(x\) of \(X\).

Outcomes of two coin flips
\(x\) \(p(X=x)\)
0 1/4
1 1/2
2 1/4

Thus, most likely we will see one Head after two tosses. Now, let’s look at a more complex example and introduce our first probability distribution, namely the Binomial distribution.

Let \(X\) be the number of heads in three flips. Each possible outcome (“realization”) of \(X\) defines an event. Now consider the event of getting exactly two heads, \[ \{ X= 2\} = \{ HHT, HTH, THH \} . \] The probability distribution of \(X\) is Binomial with parameters \(n = 3, p= 1/2\), where \(n\) denotes the sample size (a.k.a. number of trials) and \(p\) is the probability of heads; here \(p = 1/2\) since we have a fair coin. The notation that we will use is \(X \sim \mathrm{Bin} \left ( n = 3 , p = \frac{1}{2} \right )\), where the sign \(\sim\) is read as “distributed as.” For large sample sizes \(n\), this distribution is approximately normal with mean \(np\) and variance \(np(1-p)\).

Outcomes of three coin flips
Result \(X\) \(p(X=x)\)
HHH 3 \(p^3\)
HHT 2 \(p^2 ( 1- p)\)
HTH 2 \(p^2 ( 1 - p)\)
THH 2 \((1-p)p^2\)
HTT 1 \(p( 1-p)^2\)
THT 1 \(p ( 1-p)^2\)
TTH 1 \((1-p)^2 p\)
TTT 0 \((1-p)^3\)

3.7.2 Normal or Gaussian Distribution

The Normal or Gaussian distribution is central to probability and statistical inference. Suppose that we are trying to predict tomorrow’s return on the S&P500. There’s a number of questions that come to mind

  1. What is the random variable of interest?

  2. How can we describe our uncertainty about tomorrow’s outcome?

  3. How do we assign probabilities? Instead of listing all possible values, we work with intervals; the probability of an interval is the area under the probability density function.

Returns are continuous (as opposed to discrete) random variables. Hence a normal distribution may be appropriate - but on what scale? We will see that on the log scale a Normal distribution provides a good approximation.

The most widely used model for a continuous random variable is the normal distribution. The standard normal random variable \(Z\) has the following properties.

The standard Normal has mean \(0\) and variance \(1\), and is written as \[ Z \sim N(0,1). \] Then we have the probability statements of interest \[\begin{align*} P(-1 < Z < 1) &= 0.68\\ P(-1.96 < Z < 1.96) &= 0.95 \end{align*}\]

In R, we can find probabilities

pnorm(1.96)
#> [1] 0.98

and quantiles

qnorm(0.9750)  # returns 1.96, printed here to two significant digits
#> [1] 2

The quantile function qnorm is the inverse of pnorm.
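
A quick check of this inverse relationship:

all.equal(qnorm(pnorm(1.96)), 1.96)  # TRUE, up to floating-point error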

A random variable that follows a normal distribution with general mean and variance, \(X \sim \mbox{N}(\mu, \sigma^2)\), has the following properties \[\begin{align*} P(\mu - 2.58 \sigma < X < \mu + 2.58 \sigma) &= 0.99 \\ P(\mu - 1.96 \sigma < X < \mu + 1.96 \sigma) &= 0.95 \, . \end{align*}\] The chance that \(X\) will be within \(2.58 \sigma\) of its mean is \(99\%\), and the chance that it will be within \(1.96\sigma\) (roughly \(2\sigma\)) of its mean is \(95\%\).

The probability model is written \(X \sim N(\mu,\sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance. This can be transformed to a standardized normal via \[ Z =\frac{X-\mu}{\sigma} \sim N(0,1). \] For a normal distribution, we know that \(X \in [\mu-1.96\sigma,\mu+1.96\sigma]\) with probability 95%. We can make similar, though weaker, claims for any other distribution using Chebyshev’s inequality, which is valid for any population:

  1. At least 75% probability lies within 2\(\sigma\) of the mean \(\mu\)

  2. At least 89% lies within 3\(\sigma\) of the mean \(\mu\)

  3. At least \(100(1-1/m^2)\)% lies within \(m\times \sigma\) of the mean \(\mu\).

These bounds also hold for the Normal distribution, where the actual percentages are much higher: about \(95.4\%\) within \(2\sigma\) of the mean and \(99.7\%\) within \(3\sigma\).
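
As a quick numerical comparison of the two rules:

# exact Normal coverage within m standard deviations vs Chebyshev's lower bound
m = 2:3
pnorm(m) - pnorm(-m)  # Normal coverage: 0.954, 0.997
1 - 1/m^2             # Chebyshev lower bound: 0.75, 0.889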
Example 3.15 (Google Stock 2019) Consider observations of daily log-returns of Google stock for 2019. The daily log-return on day \(t\) is calculated by taking the logarithm of the ratio of the closing price on day \(t\) to the closing price on day \(t-1\) \[ y_t = \log\left(\dfrac{P_t}{P_{t-1}}\right). \] For example, if yesterday’s close was 778.81 and today’s close is 786.14, the log-return is \(\log(786.14/778.81) = 0.0094\). It has been empirically observed that log-returns approximately follow a Normal distribution. This observation is the basis for the Black-Scholes model, which is used to model the future returns of a stock.
p = read.csv("data/GOOG2019.csv")$Adj.Close; n = length(p)  # adjusted closing prices
r = log(p[2:n]/p[1:(n-1)])  # daily log-returns
hist(r, breaks=30, col="lightblue")

Observations on the far right correspond to days when positive news was released, and those on the far left to days with bad news. Typically, those are days when the quarterly earnings reports are released.

To estimate the expected value \(\mu\) (return) and the standard deviation \(\sigma\) (a measure of risk), we simply calculate their sample counterparts \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, ~\mathrm{ and }~ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x} )^2. \] The empirical (or sample) values \(\bar x\) and \(s^2\) are called the sample mean and sample variance. Here we simply view them as our best guess about the mean and variance of the normal distribution model; our probabilistic model for the next day’s return is then \[ R \sim N(\bar x, s^2). \]

This method of estimating the parameters of the distribution is called the method of moments: we estimate the mean and the variance (the first two moments of a distribution) by simply calculating them from the observed sample.

We can formalise this fully in the Normal/Normal Bayesian learning model as follows. Assume a prior \(\theta \sim \mathcal{N}(\theta_0, \tau^2)\) and likelihood \(p(\bar{y} \mid \theta, \sigma^2)\), where \(\bar{y} \mid \theta, \sigma^2 \sim \mathcal{N}(\theta, \sigma^2/n)\). The posterior is again normal, \[ \theta \mid \bar{y} \sim \mathcal{N}\left( w \bar{y} + (1-w)\theta_0, \; w\,\sigma^2/n \right), \quad \text{where} \quad w = \frac{\tau^2}{\tau^2 + \sigma^2/n}, \] a precision-weighted average of the sample mean and the prior mean.

Notice that we have reduced the data by sufficiency: we only need \(\bar{y}\) rather than the full dataset \(\left(y_1,y_2, \ldots, y_n \right)\). The Normal/Normal Bayesian learning model provides the basis for shrinkage estimation of multiple means, and for the Kalman filter used to dynamically track the path of an object.
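
As an illustration, here is a minimal sketch of the Normal/Normal update in R. The function name normal_update and the prior hyperparameters theta0 and tau2 are our own illustrative choices, not part of the original analysis.

# Posterior for theta given ybar from n observations with known variance sigma2.
# theta0 and tau2 are hypothetical prior hyperparameters.
normal_update = function(ybar, n, sigma2, theta0, tau2) {
  w = tau2/(tau2 + sigma2/n)            # weight on the data
  list(mean = w*ybar + (1 - w)*theta0,  # precision-weighted average
       var  = w*sigma2/n)               # posterior variance
}
# e.g. with the Google return estimates computed below:
normal_update(ybar = 0.00098, n = 251, sigma2 = 0.00023, theta0 = 0, tau2 = 1e-4)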

Say we are interested in investing in Google and would like to calculate the expected return of our investment, as well as the risk associated with it. We assume that the behavior of the returns in the future will be the same as in 2019.

n = length(r) 
rbar = sum(r)/n; print(rbar) 
#> [1] 0.00098
s2 = sum((r-rbar)^2)/(n-1); print(s2) 
#> [1] 0.00023
x = seq(-0.08,0.08, length.out = 200) 
hist(r, breaks=30, col="lightblue", freq = F) 
lines(x,dnorm(x,rbar,sqrt(s2)), col="red", lwd=2)

Now, assume I invest my entire portfolio in Google. With 251 trading days in the year, the predicted annual return is \(251 \times 9.8\times 10^{-4} \approx 0.25\), i.e. 25%, and the predicted risk (volatility) is \(\sqrt{251 \times s^2} \approx 0.24\), i.e. 24% a year.
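
In R, assuming 251 trading days (the same assumption used above):

annual_return = 251*rbar; print(annual_return)
#> [1] 0.25
annual_vol = sqrt(251*s2); print(annual_vol)
#> [1] 0.24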

Using this model, the predicted probability of losing 3% or more in one day is about 1.9%.

pnorm(log(1-0.03), rbar, sqrt(s2))*100  # P(daily log-return <= log(0.97)), in percent
#> [1] 1.9
sp = read.csv("data/SPMonthly.csv")$Adj.Close; n = length(sp) 
spret = sp[602:n]/sp[601:(n-1)]-1 # Calculate  1977-1987 returns 
mean(spret) 
#> [1] 0.012
sd(spret)
#> [1] 0.043
Example 3.16 (Stock market crash 1987) Prior to the October 1987 crash, SP500 monthly returns averaged 1.2% with a risk/volatility of 4.3%. The question is: how extreme was the 1987 crash of \(-21.76\)%? On the percent scale, our model is \[ X \sim N \left(1.2, 4.3^2 \right ). \] This probability distribution can be standardized to yield \[ Z =\frac{X-\mu}{\sigma} = \frac{X - 1.2}{4.3} \sim N(0,1) . \] Now we calculate the observed \(Z\), given the outcome of the crash event, \[ Z = \frac{-21.76 - 1.2}{4.3} = -5.34. \] That is a \(5\)-sigma event in terms of the distribution of \(X\), meaning that \(-21.76\)% is more than five standard deviations below the mean. Under a normal model this is equivalent to \(P(X < -21.76)\), a probability on the order of \(10^{-8}\).
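
In R, on the percent scale:

pnorm(-21.76, mean = 1.2, sd = 4.3)  # P(X < -21.76)
#> [1] 4.7e-08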

3.7.3 Poisson Distribution

The Poisson distribution is obtained as a limit of the Binomial when \(p\) is small and \(n\) is large. In applications, the Poisson models count data. Suppose we want to model the arrival rate of users at one of our stores. Let \(\lambda = np\) be fixed and take the limit as \(n \rightarrow \infty\). The relationship between \(p(x)\) and \(p(x+1)\) is given by \[ \dfrac{p(x+1)}{p(x)}= \dfrac{\binom{n}{x+1}p^{x+1}(1-p)^{n-x-1}}{\binom{n}{x}p^{x}(1-p)^{n-x}} = \dfrac{n-x}{x+1}\cdot\dfrac{p}{1-p} \approx \dfrac{np}{x+1}. \] If we use the approximation \(p(x+1)\approx \lambda p(x)/(x+1)\) with \(\lambda=np\), then by induction we obtain the Poisson pmf \(p(x) = p(0)\lambda^x/x!\). To ensure that \(\sum_{x=0}^\infty p(x) = 1\), we set \[ p(0) = \dfrac{1}{\sum_{x=0}^{\infty}\lambda^x/x!} = e^{-\lambda}. \] The above equality follows from the power series of the exponential function \[ e^{\lambda} = \sum_{x=0}^{\infty}\dfrac{\lambda^x}{x!}. \] The Poisson distribution counts the occurrence of events. Given a rate parameter, denoted by \(\lambda\), we calculate probabilities as follows \[ p( X = x ) = \frac{ e^{-\lambda} \lambda^x }{x!} \; {\rm where} \; x=0,1,2,3, \ldots \] The mean and variance of the Poisson are given by:

Poisson Distribution Parameters
Expected value \(\mu = \operatorname{E}\left(X\right) = \lambda\)
Variance \(\sigma^2 = \operatorname{Var}\left(X\right) = \lambda\)

Here \(\lambda\) denotes the rate of occurrence of an event.
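
We can check the Binomial-to-Poisson limit numerically; with \(n\) large and \(p = \lambda/n\) small, the two pmfs nearly coincide:

# Binomial(n, lambda/n) approaches Poisson(lambda) as n grows
lambda = 2; n = 1000
rbind(binomial = dbinom(0:4, n, lambda/n),
      poisson  = dpois(0:4, lambda))  # rows agree to about 3 decimal places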

Consider the problem of modeling soccer scores in English Premier League (EPL) games. We use data from Betfair, a website that posts odds on many football games. The goal is to calculate odds for the possible scores in a match \[ 0-0, \; 1-0, \; 0-1, \; 1-1, \; 2-0, \ldots \] Another question we might ask is: what are the odds of a team winning?

If \(X\) and \(Y\) denote the numbers of goals scored by the two teams, the probability of the first team winning is given by \(P\left ( X> Y \right )\), and the probability of a draw by \(P \left ( X = Y \right )\).

Professional sports bettors rely on sophisticated statistical models to predict outcomes. Here, instead, we present a simple but useful model for predicting the outcomes of EPL games. We follow the methodology given in Spiegelhalter and Ng (2009).

First, load the data and then model the number of goals scored using a Poisson distribution.

library(dplyr)  # provides %>%, select, group_by, summarise, full_join
df = read.csv("data/epl.csv") %>% select(home_team_name,away_team_name,home_score,guest_score)
head(df)
#>   home_team_name       away_team_name home_score guest_score
#> 1        Arsenal            Liverpool          3           4
#> 2    Bournemouth    Manchester United          1           3
#> 3        Burnley              Swansea          0           1
#> 4        Chelsea             West Ham          2           1
#> 5 Crystal Palace West Bromwich Albion          0           1
#> 6        Everton            Tottenham          1           1

Let’s look at the empirical distribution of the number of goals scored by Manchester United.

team_name = "Manchester United" 
# combine goals scored by MU in home and away games
team_for  = c(df[df$home_team_name==team_name,"home_score"],df[df$away_team_name==team_name,"guest_score"]) 
n = length(team_for) 
for_byscore = table(team_for)/n   # empirical distribution of goals per game
barplot(for_byscore, col="coral", main="Histogram of Goals Scored by MU")

The historical data fit a Poisson distribution closely. The parameter \(\lambda\) describes the average number of goals scored, and we estimate it by the sample mean, which is also the maximum likelihood estimate. A Bayesian alternative, which assumes a Gamma prior for \(\lambda\), is also available; it lets you incorporate outside information into the predictive model.

lambda_for = mean(team_for) 
barplot(rbind(dpois(0:4, lambda = lambda_for),for_byscore),beside = T, col=c("aquamarine3","coral"), xlab="Goals", ylab="probability", main="Histogram vs Poisson Model Prediction of Goals Scored by MU") 
legend("topright", c("Poisson","MU"), pch=15, col=c("aquamarine3", "coral"), bty="n")
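
For completeness, here is a minimal sketch of the Bayesian alternative mentioned above; the Gamma hyperparameters a and b are illustrative assumptions, not values from the source analysis.

# Conjugate Gamma-Poisson update: prior lambda ~ Gamma(a, b) (rate b),
# Poisson likelihood for the observed goal counts in team_for.
a = 1; b = 1                    # hypothetical weak prior, mean a/b = 1 goal per game
a_post = a + sum(team_for)      # posterior shape
b_post = b + length(team_for)   # posterior rate
a_post/b_post                   # posterior mean; shrinks mean(team_for) toward a/b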

Now we will use the Poisson model and Monte Carlo simulation to predict possible outcomes of the MU vs Hull games. First we estimate the rate parameter for goals by MU, lmb_mu, and goals by Hull, lmb_h. Each team plays a home and an away game against every other team, so each team plays 38 games in total. We calculate the average by dividing the total number of goals scored by the number of games.

sumdf = df %>% 
  group_by(home_team_name) %>% 
  summarise(Goals_For_Home = sum(home_score)) %>%
  full_join(df %>% 
              group_by(away_team_name) %>% 
              summarise(Goals_For_Away = sum(guest_score)), by = c("home_team_name" = "away_team_name")
            ) %>%
  full_join(df %>% 
              group_by(home_team_name) %>% 
              summarise(Goals_Against_Home = sum(guest_score))
            ) %>%
  full_join(df %>% 
              group_by(away_team_name) %>%
              summarise(Goals_Against_Away = sum(home_score)), by = c("home_team_name" = "away_team_name")
            ) %>%
  rename(Team=home_team_name)
#> Joining, by = "home_team_name"
sumdf[sumdf$Team %in% c("Manchester United", "Hull"),]
#> # A tibble: 2 × 5
#>   Team         Goals_For_Home Goals_For_Away Goals_Against_Ho… Goals_Against_Aw…
#>   <chr>                 <int>          <int>             <int>             <int>
#> 1 Hull                     28              9                35                45
#> 2 Manchester …             26             28                12                17
lmb_mu = (26+28)/38; print(lmb_mu)
#> [1] 1.4
lmb_h = (28+9)/38; print(lmb_h)
#> [1] 0.97

Now we simulate 100 games between the teams

x = rpois(100,lmb_mu)
y = rpois(100,lmb_h)
sum(x>y)
#> [1] 39
sum(x==y)
#> [1] 26
table(x,y)
#>    y
#> x    0  1  2  3  4
#>   0 10 11  8  1  1
#>   1  9 12  5  5  1
#>   2  8  7  3  1  2
#>   3  3  6  1  1  0
#>   4  2  2  1  0  0

In our simulation MU wins 39 times and 26 games end in a draw. The actual outcomes were 0-0 (Hull at MU) and 0-1 (MU at Hull). Thus our model gives a reasonable prediction.

The model can be improved by calculating separate averages for home and away games. For example, Hull does much better in home games than in away games. Further, we can include the characteristics of the opposing team to account for interactions between attack strength (goals scored) and the defensive weakness of the opponent. We modify the expected goals for each team by calculating \[ \hat \lambda = \lambda \times \text{Defense weakness} \]

Let’s model the MU at Hull game. The average away goals for MU is \(28/19 = 1.47\) and the defensive weakness of Hull at home is \(35/19 = 1.84\), so the adjusted expected number of goals scored by MU is \(28/19 \times 35/19 = 2.71\). Similarly, the adjusted number of goals Hull is expected to score is \(28/19 \times 17/19 = 1.32\).

As a result of the simulation, we obtain

set.seed(1)
x = rpois(100, 28/19*35/19)
y = rpois(100,28/19*17/19)
table(x,y)
#>    y
#> x    0  1  2  3  4  5
#>   0  1  3  0  0  0  0
#>   1  3  5  6  1  1  0
#>   2  4 16  7  3  0  0
#>   3  6  7  2  3  0  0
#>   4  4  7  5  2  1  0
#>   5  2  3  1  2  0  2
#>   6  1  0  0  1  0  0
#>   7  1  0  0  0  0  0
image(z=table(x,y),x = 0:7,y = 0:5, xlab="MU Score", ylab="Hull Score")

sum(x>y)
#> [1] 67

A model is only as good as its predictions. Let’s see how our model did in out-of-sample prediction:

  • MU wins 67 games out of 100, so we should bet on MU whenever the market’s implied probability of an MU win is below 0.67.
  • The most likely outcome is 1-2 (16 games out of 100).
  • The actual outcome was 0-1 (the game was played on August 27, 2016).
  • In our simulation the actual outcome 0-1 (that is, \(x=1\), \(y=0\)) occurred 3 times out of 100.

References

Simpson, Edward. 2010. “Edward Simpson: Bayes at Bletchley Park.” Significance 7 (2): 76–80.
Spiegelhalter, David, and Yin-Lam Ng. 2009. “One Match to Go!” Significance 6 (4): 151–53.

  1. Odds, \(O(A)\), is the ratio of the probability of an event not happening to it happening, \[ O(A) = (1 - P(A)) / P(A), \] or equivalently, \[ P(A) = \frac{1}{1 + O(A)}. \] For example, \(P(A) = 0.25\) gives odds \(O(A) = 3\), i.e. 3-to-1 against.↩︎

  2. Aside: In the case of independence, \(P(x \mid y) = P(x)\) and \(P(y \mid x) = P(y)\).↩︎