Unit 1: Introduction: AI Today and in the Past. Probability and Bayes Rule
On this Day (January 27):
Robots and Automatic Machines Have Long Been Marvels of Ingenuity: Al-Jazari (XII Century)
Hesdin Castle (Robert II of Artois), Leonardo’s robot…
Jaquet-Droz automata (XVIII century):
It takes a lot to create an AI system:
We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.
Old AI
If rain outside, then take umbrella
This rule cannot be learned from data. It does not allow inference: we cannot say anything about rain outside if we see an umbrella.
New AI
Probability of taking umbrella, given there is rain
A conditional probability rule can be learned from data. It allows for inference: we can calculate the probability of rain outside if we see an umbrella.
Definition:
A computer program is said to learn from accumulating data, with respect to a class of tasks \(T\) and a performance measure \(P\), if the quality of its solutions to these tasks (as measured by \(P\)) improves as it gains experience.
There are no correct answers, only data, e.g. clustering:
In shadows of data, uncertainty reigns,
Bayesian whispers, where knowledge remains.
With prior beliefs, we start our quest,
Updating with evidence, we strive for the best.
A dance of the models, predictions unfold,
Inferences drawn, from the new and the old.
Through probabilities, we find our way,
In the world of AI, it’s the Bayesian sway.
So gather your data, let prior thoughts flow,
In the realm of the unknown, let your insights grow.
For in this approach, with each little clue,
We weave understanding, both rich and true.
[Image: a humorous scene of a hockey player sitting on a bench in full gear, holding a hockey stick in one hand and a whiteboard marker in the other]
Old AI: Deep Blue (1997) vs. Garry Kasparov
Subjective Probability (de Finetti, Ramsey, Savage, von Neumann, ... )
Principle of Coherence:
A set of subjective probability beliefs must avoid sure loss
Use probability to describe outcomes involving more than one variable at a time. Need to be able to measure what we think will happen to one variable relative to another.
In general the notation is ...
Relationship between the joint and conditional ... \[ \begin{aligned} P(x,y) & = P(x) P(y \mid x) \\ & = P(y) P(x \mid y) \end{aligned} \]
Relationship between the joint and marginal ... \[ \begin{aligned} P(x) & = \sum_y P(x,y) \\ P(y) & = \sum_x P(x,y) \end{aligned} \]
The computation of \(P(x \mid y)\) from \(P(x)\) and \(P(y \mid x)\) is called Bayes theorem ... \[ P(x \mid y) = \frac{P(y,x)}{P(y)} = \frac{P(y,x)}{\sum_x P(y,x)} = \frac{P(y \mid x)P(x)}{\sum_x P(y \mid x)P(x)} \]
This shows how the conditional distribution is related to the joint and marginal distributions.
You’ll be given all the quantities on the r.h.s.
Key fact: \(P(x \mid y)\) is generally different from \(P(y \mid x)\)!
Example: Most people would agree \[ \begin{aligned} Pr & \left ( Practice \; hard \mid Play \; in \; NBA \right ) \approx 1\\ Pr & \left ( Play \; in \; NBA \mid Practice \; hard \right ) \approx 0 \end{aligned} \]
The main reason for the difference is that \(P( Play \; in \; NBA ) \approx 0\).
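As a quick numerical check of these relationships, here is a minimal Python sketch; the joint probabilities for the rain/umbrella example are made-up illustrative numbers, not values from the lecture.

```python
# Toy joint distribution P(x, y): x = weather, y = umbrella use.
# The probabilities below are illustrative assumptions only.
joint = {
    ("rain", "umbrella"): 0.25,
    ("rain", "no_umbrella"): 0.05,
    ("no_rain", "umbrella"): 0.10,
    ("no_rain", "no_umbrella"): 0.60,
}

# Marginals: P(x) = sum_y P(x, y) and P(y) = sum_x P(x, y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Conditionals via the product rule: P(y | x) = P(x, y) / P(x), P(x | y) = P(x, y) / P(y)
p_umbrella_given_rain = joint[("rain", "umbrella")] / p_x["rain"]
p_rain_given_umbrella = joint[("rain", "umbrella")] / p_y["umbrella"]

print(f"P(umbrella | rain) = {p_umbrella_given_rain:.3f}")  # ~0.833
print(f"P(rain | umbrella) = {p_rain_given_umbrella:.3f}")  # ~0.714, a different number
```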
Two random variables \(X\) and \(Y\) are independent if \[ P(Y = y \mid X = x) = P (Y = y) \] for all possible \(x\) and \(y\) values. Knowing \(X=x\) tells you nothing about \(Y\)!
Example: Tossing a coin twice. What’s the probability of getting \(H\) in the second toss given we saw a \(T\) in the first one?
Source: The Secret Betting Strategy That Beats Online Bookmakers
We can express probabilities in terms of Odds via \[ O(A) = \frac{ 1- P(A) }{ P(A) } \; \; {\rm or} \; \; P(A) = \frac{ 1 }{ 1 + O(A) } \]
For example, odds of \(O(A) = 2\) (a 2-to-1 bet against \(A\)) correspond to a probability of \(P(A) = \frac{1}{3}\).
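A minimal sketch of this conversion in Python; the helper names are mine, not from the source.

```python
def prob_to_odds(p):
    """Odds against an event A with probability p: O(A) = (1 - p) / p."""
    return (1.0 - p) / p

def odds_to_prob(o):
    """Probability of A recovered from the odds against A: P(A) = 1 / (1 + O(A))."""
    return 1.0 / (1.0 + o)

print(prob_to_odds(1 / 3))  # 2.0 -> a 2-to-1 bet against A
print(odds_to_prob(2.0))    # 0.333...
```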
The following problem is known as the “exchange paradox”.
You know that \(y = \frac{1}{2} x\) or \(y = 2 x\). You are thinking about whether you should switch your opened envelope for the unopened envelope of your friend. It is tempting to do an expected value calculation as follows \[ E( y) = \frac{1}{2} \cdot \frac{1}{2} x + \frac{1}{2} \cdot 2 x = \frac{5}{4} x > x \] Therefore, it looks as if you should switch no matter what value of \(x\) you see. A consequence of this, following the logic of backwards induction, is that even if you didn't open your envelope you would want to switch!
Where's the flaw in this argument? Use Bayes rule to update the probabilities of which envelope your opponent has! Assume a prior \(p(m)\) for the amount \(m\) of dollars placed in the envelope by the swami.
Such an assumption then allows us to calculate an odds ratio \[ \frac{ p \left ( y = \frac{1}{2} x | x \right ) }{ p \left ( y = 2 x | x \right ) } \] concerning the likelihood of which envelope your opponent has.
Then, the expected value is given by
\[ E(y) = p \left ( y = \frac{1}{2} x \; \vert \; x \right ) \cdot \frac{1}{2} x + p \left ( y = 2 x | x \right ) \cdot 2 x \] and the condition \(E( y) > x\) becomes a decision rule.
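A minimal sketch of this decision rule, under a purely illustrative prior in which the swami puts \(2^k\) dollars in the smaller envelope with weight proportional to \(r^k\); the prior, the truncation point, and the function names are all assumptions, not part of the lecture.

```python
# Exchange paradox: a sketch under an assumed prior (not part of the lecture).
# Assume the swami puts m = 2**k dollars in the smaller envelope with prior
# weight proportional to r**k for k = 0, 1, ..., K (a truncated geometric prior).
r, K = 0.4, 20
prior = {k: r**k for k in range(K + 1)}

def expected_other(j):
    """Expected value of the unopened envelope given you observed x = 2**j dollars."""
    x = 2.0 ** j
    w_other_is_2x = prior.get(j, 0.0)        # your x is the smaller amount (m = x)
    w_other_is_half = prior.get(j - 1, 0.0)  # your x is the larger amount (m = x / 2)
    total = w_other_is_2x + w_other_is_half
    return (w_other_is_2x / total) * 2 * x + (w_other_is_half / total) * 0.5 * x

for j in (0, 1, 5):
    x, ev = 2.0 ** j, expected_other(j)
    print(f"x = {x:5.0f}: E[y | x] = {ev:6.2f} -> {'switch' if ev > x else 'keep'}")
```

Here the expected value exceeds \(x\) only for the smallest possible amount, so with a proper prior the switching decision depends on the \(x\) you observe; the \(\frac{5}{4} x\) argument implicitly treats both cases as equally likely for every \(x\), which no proper prior can deliver.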
Three prisoners \({\cal A} , {\cal B} , {\cal C}\).
Two will be executed and one will be pardoned; each believes he is equally likely to be set free.
Prisoner \({\cal A}\) goes to the warden \({\cal W}\) and asks if he is the one getting axed.
The warden cannot tell \({\cal A}\) anything about \({\cal A}\)'s own fate.
He provides the new information: \({\cal WB}\) = “\({\cal B}\) is to be executed”
Uniform Prior Probabilities: \[ \begin{array}{c|ccc} Prior & {\cal A} & {\cal B} & {\cal C} \\\hline {\cal P} ( {\rm Pardon} ) & 0.33 & 0.33 & 0.33 \end{array} \]
Posterior: Compute \(P ( {\cal A} | {\cal WB} )\)?
What happens if \({\cal C}\) overhears the conversation?
Compute \(P ( {\cal C} | {\cal WB} )\)?
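A minimal Python sketch of both posteriors; it assumes the standard convention that the warden names \({\cal B}\) or \({\cal C}\) with equal probability when \({\cal A}\) is the one being pardoned.

```python
from fractions import Fraction

# Prior: each prisoner is equally likely to be pardoned.
prior = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}

# Likelihood of the warden saying "B is to be executed" (event WB), given who is
# actually pardoned. Assumes the warden never names A and flips a fair coin
# between B and C when A is the one pardoned.
likelihood_WB = {"A": Fraction(1, 2), "B": Fraction(0), "C": Fraction(1)}

p_WB = sum(prior[w] * likelihood_WB[w] for w in prior)           # law of total probability
posterior = {w: prior[w] * likelihood_WB[w] / p_WB for w in prior}

print(posterior)   # A: 1/3 (unchanged), B: 0, C: 2/3
```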
The Monty Hall problem is named after the host of the long-running TV show Let's Make a Deal.
There is a prize (a car, say) behind one of the doors and something worthless behind the other two doors: two goats.
The game is as follows:
You pick a door.
Monty then opens one of the other two doors, revealing a goat.
You have the choice of switching doors.
Is it advantageous to switch?
Assume you pick door \(A\) at random. Then \(P(A) = 1/3\).
You need to figure out \(P( A | MB )\) after Monty reveals \(B\) is a goat.
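A quick Monte Carlo sketch (my own simulation, not from the lecture) that checks the stay-versus-switch probabilities:

```python
import random

def play(switch, trials=100_000):
    """Simulate the Monty Hall game and return the empirical win rate."""
    wins = 0
    for _ in range(trials):
        doors = ["A", "B", "C"]
        car = random.choice(doors)
        pick = random.choice(doors)
        # Monty opens a door that is neither your pick nor the car.
        opened = random.choice([d for d in doors if d != pick and d != car])
        if switch:
            remaining = [d for d in doors if d != pick and d != opened]
            pick = remaining[0]
        wins += (pick == car)
    return wins / trials

print("stay  :", play(switch=False))  # ~ 1/3
print("switch:", play(switch=True))   # ~ 2/3
```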
Many problems in decision making can be solved using Bayes rule in its simplest form.
Bayes Rule: \[ \mbox{P}(A|B) = \frac{\mbox{P}(A \cap B)}{\mbox{P}(B)} = \frac{ \mbox{P}(B|A) \mbox{P}(A)}{ \mbox{P}(B)} \] Law of Total Probability: \[ \mbox{P}(B) = \mbox{P}(B|A) \mbox{P}(A ) + \mbox{P}(B| \bar{A} ) \mbox{P}(\bar{A} ) \]
The Apple Watch Series 4 can perform a single-lead ECG and detect atrial fibrillation. The software can correctly identify 98% of cases of atrial fibrillation (true positives) and 99% of cases of non-atrial fibrillation (true negatives).
However, what is the probability of a person having atrial fibrillation when atrial fibrillation is identified by the Apple Watch Series 4?
Bayes’ Theorem: \[ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \]
Assuming a population of 100,000 with a 2% prevalence of atrial fibrillation:

Predicted \ Actual | atrial fibrillation | no atrial fibrillation |
---|---|---|
atrial fibrillation | 1960 | 980 |
no atrial fibrillation | 40 | 97020 |
\[ P(\text{atrial fibrillation} \mid \text{detected}) = \frac{0.98\cdot 0.02}{0.98\cdot 0.02 + 0.01\cdot 0.98} = \frac{0.0196}{0.0294} \approx 0.6667 \]
The conditional probability of having atrial fibrillation when the Apple Watch Series 4 detects atrial fibrillation is about 67%.
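The same calculation as a short Python sketch; the 2% prevalence is the value implied by the table above, not a separately quoted figure.

```python
sensitivity = 0.98   # P(detect AF | AF)
specificity = 0.99   # P(no detection | no AF)
prevalence = 0.02    # P(AF): assumed value implied by the table above

# Law of total probability for P(detect AF)
p_detect = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes rule: P(AF | detect AF)
ppv = sensitivity * prevalence / p_detect
print(f"P(detect) = {p_detect:.4f}, P(AF | detect) = {ppv:.4f}")  # 0.0294, ~0.6667
```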
How Abraham Wald improved aircraft survivability. Raw Reports from the Field
Type of damage suffered | Returned (316 total) | Shot down (60 total) |
---|---|---|
Engine | 29 | ? |
Cockpit | 36 | ? |
Fuselage | 105 | ? |
None | 146 | 0 |
This fact would allow Wald to estimate: \[ P(\text{damage on fuselage} \mid \text{returns safely}) = 105/316 \approx 32\% \] You need the inverse probability : \[ P(\text{returns safely} \mid \text{damage on fuselage}) \] Completely different!
Imputation: fill in the missing data.
Type of damage suffered | Returned (316 total) | Shot down (60 total) |
---|---|---|
Engine | 29 | 31 |
Cockpit | 36 | 21 |
Fuselage | 105 | 8 |
None | 146 | 0 |
Then Wald got: \[ \begin{aligned} P(\text{returns safely} \mid \text{damage on fuselage}) & =\frac{105}{105+8}\approx 93\%\\ P(\text{returns safely} \mid \text{damage on engine}) & =\frac{29}{29+31}\approx 48\% \end{aligned} \]
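A short sketch that reproduces these conditional probabilities from the imputed table:

```python
# Imputed counts: (returned, shot down) for each damage location.
counts = {"Engine": (29, 31), "Cockpit": (36, 21), "Fuselage": (105, 8)}

for location, (returned, shot_down) in counts.items():
    p_return = returned / (returned + shot_down)
    print(f"P(returns safely | damage on {location.lower()}): {p_return:.0%}")
# Fuselage hits are survivable (~93%), engine hits are not (~48%): the engines need the armor.
```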
Many Business Applications!! Suggestions vs Search….
evidence: known facts about criminal (e.g. blood type, DNA, ...)
suspect: matches a trait with evidence at scene of crime
Let \({\cal G}\) denote the event that the suspect is the criminal.
Bayes computes the conditional probability of guilt
\[ P ( {\cal G} | {\rm evidence} ) \] Evidence \({\cal E}\): suspect and criminal possess a common trait
Bayes Theorem yields \[ P ( {\cal G} | {\rm evidence} ) = \frac{ P ( {\rm evidence} | {\cal G} ) P ( {\cal G} ) }{ P ( {\rm evidence} )} \]
In terms of relative odds \[ \frac{ P ( {\cal I} | {\rm evidence} ) }{ P ( {\cal G} | {\rm evidence} ) } = \frac{ P ( {\rm evidence} | {\cal I} ) }{ P ( {\rm evidence} | {\cal G} ) } \frac{ P ( {\cal I} ) }{ P ( {\cal G} ) } \]
There are two terms on the right-hand side: the likelihood ratio (Bayes factor) and the prior odds.
How many people on the island?
Sensitivity “what if” analysis?
The most common fallacy is confusing \[ P ( {\rm evidence} | {\cal G} ) \; \; {\rm with} \; \; P ( {\cal G} | {\rm evidence} ) \]
Bayes rule yields \[ P ( {\cal G} | {\rm evidence} ) = \frac{ P ( {\rm evidence} | {\cal G} ) p( {\cal G} )}{ P ( {\rm evidence} )} \] Your assessment of \(P( {\cal G} )\) will matter.
Suppose there's a criminal on an island of \(N+1\) people.
Bayes factors are likelihood ratios
The Bayes factor is given by \[ \frac{p(E|I)}{p(E|G)}=p \]
If we start with a uniform prior distribution over the \(N+1\) islanders we have
\[ p(G)=\frac{1}{N+1} , \;\; p(I)=\frac{N}{N+1} \;\;\mathrm{and}\;\; odds(I)=\frac{p(I)}{p(G)}=N \]
Posterior Probability related to Odds \[ p(G|y)=\frac {1}{1+odds(I|y)} \;\;\mathrm{where}\;\; odds(I|y)= \frac{p(y|I)}{p(y|G)} \cdot odds(I) = p \cdot N \]
The posterior probability \(p(I|y)\neq p(y|I)=p\).
With \(N = 10^{3}\) and a match probability \(p = 10^{-3}\), \[ p( I|y) = p( G|y) = \frac{1}{1 + 10^3 \cdot 10^{-3}} = \frac{1}{2} \]
The posterior odds on innocence are \(odds(I|y)=1\).
There’s a \(50/50\) chance that the criminal has been found.
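A minimal sketch of the island calculation, with the number of islanders \(N\) and the match probability \(p\) as inputs:

```python
def posterior_guilt(N, match_prob):
    """P(G | match) on an island of N+1 people with trait match probability p.

    Prior: uniform over the N+1 islanders, so p(G) = 1/(N+1) and odds(I) = N.
    Likelihood: the criminal matches for sure; an innocent matches with probability p.
    """
    posterior_odds_innocence = N * match_prob   # odds(I | y) = N * p
    return 1.0 / (1.0 + posterior_odds_innocence)

print(posterior_guilt(N=1_000, match_prob=1e-3))   # 0.5 -> a 50/50 chance
```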
Sally Clark was accused and convicted of killing her two children
They could have both died of SIDS.
The chance of a SIDS death in a family of non-smokers aged over \(25\) is around \(1\) in \(8,500\).
The chance of a family which has already had a SIDS death having a second is around \(1\) in \(100\).
The chance of a mother killing her two children is around \(1\) in \(1,000,000\).
Under Bayes \[ \begin{aligned} P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) & = P \left( \mathrm{first} \; \mathrm{SIDS} \right) P \left( \mathrm{Second} \; \; \mathrm{SIDS} | \mathrm{first} \; \mathrm{SIDS} \right) \\ & = \frac{1}{8500} \cdot \frac{1}{100} = \frac{1}{850,000} \end{aligned} \] The \(\frac{1}{100}\) comes from taking into account genetics.
Assuming independence, as the court did, gives
\[ P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) = (1/8500) (1/8500) \approx 1/73,000,000 \]
\[ \frac{p(I|E)}{p(G|E)} = \frac{P( E \cap I)}{P( E \cap G)} \] \(P( E \cap I) = P(E|I )P(I)\) needs discussion of \(p(I)\).
\[ \frac{p(I|E)}{p(G|E)} = \frac{1/850,000}{1/1,000,000} = \frac{1,000,000}{850,000} \approx 1.18 \] In terms of posterior probabilities
\[ p( G|E) = \frac{1}{1 + O(G|E)} = \frac{1}{1+1.18} \approx 0.46 \]
Under the court's independence assumption instead, \[ \frac{p(I|E)}{p(G|E)} = \frac{1/73,000,000}{1/1,000,000} = \frac{1}{73} \; {\rm and} \; p( G|E) \approx 0.99 \] The suspect looks guilty.
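A short sketch contrasting the dependent and independent treatments of the second SIDS death, using the figures quoted above:

```python
p_first_sids = 1 / 8_500
p_second_given_first = 1 / 100     # takes the genetics into account
p_double_murder = 1 / 1_000_000    # chance of a mother killing her two children

def posterior_guilt(p_both_sids):
    """P(G | E) from the odds ratio p(I|E)/p(G|E) = P(both SIDS) / P(double murder)."""
    odds_innocence = p_both_sids / p_double_murder
    return 1.0 / (1.0 + odds_innocence)

print(posterior_guilt(p_first_sids * p_second_given_first))  # dependent deaths: ~0.46
print(posterior_guilt(p_first_sids ** 2))                    # court's independence assumption: ~0.99
```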
The O.J. Simpson trial was possibly the trial of the century.
He was accused of the June 1994 murder of his ex-wife, Nicole Brown Simpson, and a friend, Ron Goldman, and the trial dominated the TV networks.
DNA evidence and probability: \(p( E| G)\)
Bayes Theorem: \(p( G | E )\)
Prosecutor’s Fallacy: \(p( G|E ) \neq p(E|G)\)
The odds-ratio form of Bayes rule gives \[ \frac{ p( I|E) }{ p ( G | E ) } = \frac{ p( E|I )}{ p( E|G) } \frac{ p(I) }{p(G ) } \] Prior odds conditioned on background information.
Suppose that you are a juror in a murder case of a husband who is accused of killing his wife.
The husband is known to have battered her in the past.
Consider the three events:
\(G\) “husband murders wife in a given year”
\(M\) “wife is murdered in a given year”
\(B\) “husband is known to batter his wife”
Conditional on eventually murdering his wife, there is a one in ten chance that it happens in a given year.
In 1994, \(5000\) women were murdered, \(1500\) of them by their husbands.
Given a population of \(100\) million women at the time \[ p( M | I ) = \frac{ 3500 }{ 10^8 } \approx \frac{1}{30,000} . \] We’ll also need \(p( M | I , B ) = p( M | I )\)
What’s the “match probability” for a rare event?
Bayes theorem in Odds \[ \frac{p(G|M,B)}{p(I|M,B)} = \frac{p(M|G,B)}{p(M|I,B)} \frac{p(G|B)}{p(I|B)} \]
By assumption,
\[ \frac{p(G|B)}{p(I|B)} = \frac{1}{999} \]
Since \(G\) implies \(M\), \(p(M \mid G , B) = 1\). Therefore, \[ \frac{p(G|M,B)}{p(I|M,B)} = \frac{1}{p(M|I,B)} \cdot \frac{p(G|B)}{p(I|B)} = \frac{1/999}{1/30,000} \approx 30 \; {\rm and} \; p(G|M,B) = \frac{30}{31} \approx 97\% \] More than a 50/50 chance that your spouse murdered you!
The defense stated to the press: in any given year
“Fewer than \(1\) in \(2000\) of batterers go on to murder their wives”.
Now estimate \(p( M | \bar{G} , B ) = p( M| \bar{G} ) = \frac{1}{20,000}\).
The posterior odds are then
\[ \frac{ p( G | M , B ) }{ p( \bar{G} | M , B ) } = \frac{ 1/999 }{1 /20,000} \approx 20 \] which implies posterior probabilities
\[ p( \bar{G} | M , B ) = \frac{1}{1+20} \; {\rm and} \; p( G | M , B ) = \frac{20}{21} \] Hence there is an over 95% chance that O.J. is guilty based on this information!
The defense intended this information to exonerate O.J.
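A minimal sketch of the posterior-odds calculation; the only step taken beyond the quoted figures is the observation, used above, that \(G\) implies \(M\), so \(p(M \mid G, B) = 1\).

```python
prior_odds_guilt = 1 / 999         # p(G|B) / p(I|B), the assumption quoted above
p_murder_if_innocent = 1 / 30_000  # p(M | I, B) = p(M | I), from the 1994 statistics

# G ("husband murders wife in a given year") implies M ("wife is murdered in a given year"),
# so p(M | G, B) = 1 and the likelihood ratio is 1 / p(M | I, B).
posterior_odds = (1 / p_murder_if_innocent) * prior_odds_guilt
print(f"odds ~ {posterior_odds:.0f}, P(G | M, B) ~ {posterior_odds / (1 + posterior_odds):.0%}")  # ~30, ~97%

# With the revised estimate p(M | not G, B) = 1/20,000 the odds drop to ~20 and P(G | M, B) ~ 95%.
posterior_odds_2 = 20_000 * prior_odds_guilt
print(f"odds ~ {posterior_odds_2:.0f}, P(G | M, B) ~ {posterior_odds_2 / (1 + posterior_odds_2):.0%}")
```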
A witness is \(80\%\) certain that they saw a “checker” (\(C\)) taxi in the accident.
What’s your \(P ( C | E )\) ?
Need \(P ( C )\). Say \(P( C ) = 0.2\) and \(P( E | C) = 0.8\).
Then your posterior is
\[ P ( C | E ) = \frac{0.8 \cdot 0.2}{ 0.8 \cdot 0.2 + 0.2 \cdot 0.8 } = 0.5 \]
Therefore \(O ( C \mid E ) = 1\): a 50/50 bet.
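The same update as a short sketch; the 20% error rate \(P(E \mid \bar{C}) = 0.2\) is the value implied by the denominator above.

```python
p_checker = 0.2              # prior P(C)
p_e_given_checker = 0.8      # P(E | C): witness says "checker" when it was a checker
p_e_given_other = 0.2        # P(E | not C): witness says "checker" when it was not

# Bayes rule with the law of total probability in the denominator
p_evidence = p_e_given_checker * p_checker + p_e_given_other * (1 - p_checker)
print(p_e_given_checker * p_checker / p_evidence)   # 0.5
```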
Most people don’t update quickly enough in light of new data
Ward Edwards (1960s)
When you have a small sample size, Bayes rule still updates probabilities
Two players: \(A\) is either a \(70\%\) player or a \(30\%\) player (wins \(70\%\) or \(30\%\) of games against \(B\))
Observe \(A\) beats \(B\) \(3\) times out of \(4\).
What’s \(P ( A = 70 \% \; {\rm player} )\) ?
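A minimal sketch of this update, assuming a 50/50 prior over the two hypotheses (the prior is not stated above) and binomial likelihoods for the observed record:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k wins in n games when the per-game win probability is p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

prior = 0.5                         # assumed P(A is the 70% player)
like_70 = binom_pmf(3, 4, 0.7)      # P(3 wins out of 4 | 70% player)
like_30 = binom_pmf(3, 4, 0.3)      # P(3 wins out of 4 | 30% player)

posterior = like_70 * prior / (like_70 * prior + like_30 * prior)
print(f"P(A is the 70% player | 3 wins out of 4) = {posterior:.3f}")   # ~0.845
```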