Introduction: AI Today and in the Past. Probability and Bayes Rule
Vadim Sokolov
George Mason University
Robots and Automatic Machines Have a Long, Inventive History: Al-Jazari (XII century)
Hesdin Castle (Robert II of Artois), Leonardo’s robot…
Jaquet-Droz automata (XVIII century):
Logic machine of Ramon Llull (XIII-XIV centuries)
Old AI
If rain outside, then take umbrella
This rule cannot be learned from data, and it does not allow inference: seeing an umbrella tells us nothing about whether it is raining outside.
New AI
Probability of taking umbrella, given there is rain
A conditional probability rule can be learned from data and allows for inference: we can calculate the probability of rain outside if we see an umbrella.
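In the simplest case, the conditional probability can be estimated just by counting co-occurrences. A minimal sketch in Python (the rain/umbrella observations are made-up illustrative data):

```python
# Estimate P(umbrella | rain) and P(rain | umbrella) by counting.
# Each observation is (rain, umbrella); 1 = yes, 0 = no.
# The data below is made up purely for illustration.
data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

n_rain = sum(r for r, _ in data)
n_umbrella = sum(u for _, u in data)
n_both = sum(1 for r, u in data if r and u)

p_umbrella_given_rain = n_both / n_rain        # the rule, learned from data
p_rain_given_umbrella = n_both / n_umbrella    # inference in the other direction

print(p_umbrella_given_rain, p_rain_given_umbrella)  # 0.75 0.75
```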
Definition:
A computer program learns from accumulating data with respect to a class of tasks \(T\) and a performance measure \(P\) if its quality in solving these tasks (as measured by \(P\)) improves with new experience.
There are no correct answers, only data, e.g. clustering:
In shadows of data, uncertainty reigns,
Bayesian whispers, where knowledge remains.
With prior beliefs, we start our quest,
Updating with evidence, we strive for the best.
A dance of the models, predictions unfold,
Inferences drawn, from the new and the old.
Through probabilities, we find our way,
In the world of AI, it’s the Bayesian sway.
So gather your data, let prior thoughts flow,
In the realm of the unknown, let your insights grow.
For in this approach, with each little clue,
We weave understanding, both rich and true.
Old AI: Deep Blue (1997) vs. Garry Kasparov
New AI: AlphaGo vs. Lee Sedol: Move 37 by AlphaGo in Game Two
Probability lets us talk efficiently about things that we are uncertain about.
All these involve estimating or predicting unknowns!!
Random Variables are numbers that we are not sure about: there is a list of potential outcomes, and we assign a probability to each one.
Example: Suppose that we are about to toss two coins. Let \(X\) denote the number of heads. We call \(X\) the random variable that stands for the potential outcome.
Probability is a language designed to help us communicate about uncertainty. We assign a number between \(0\) and \(1\) measuring how likely an event is to occur. It’s immensely useful, and there are only a few basic rules.
We describe the behavior of random variables with a Probability Distribution
Example: Suppose we are about to toss two coins. Let \(X\) denote the number of heads.
\[X = \left\{ \begin{array}{ll} 0 \text{ with prob. } 1/4\\ 1 \text{ with prob. } 1/2\\ 2 \text{ with prob. } 1/4 \end{array} \right.\]
\(X\) is called a Discrete Random Variable
Question: What is \(P(X=0)\)? How about \(P(X \geq 1)\)?
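A quick check by direct calculation and simulation; a minimal sketch in Python, assuming fair, independent coins:

```python
import random

def two_coin_distribution(n=100_000, seed=0):
    """Estimate the distribution of X = number of heads in two fair coin tosses."""
    rng = random.Random(seed)
    counts = {0: 0, 1: 0, 2: 0}
    for _ in range(n):
        x = (rng.random() < 0.5) + (rng.random() < 0.5)  # heads count
        counts[x] += 1
    return {k: v / n for k, v in counts.items()}

probs = two_coin_distribution()
print(probs)         # roughly {0: 0.25, 1: 0.50, 2: 0.25}
print(1 - probs[0])  # P(X >= 1) = 1 - P(X = 0), roughly 0.75
```

So \(P(X=0) = 1/4\) and \(P(X \geq 1) = 3/4\).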
Example: the joint distribution of a “happiness index” \(Y\) as a function of salary \(X\):
| Salary (\(X\)) | \(Y = 0\) (low) | \(Y = 1\) (medium) | \(Y = 2\) (high) |
|---|---|---|---|
| low 0 | 0.03 | 0.12 | 0.07 |
| medium 1 | 0.02 | 0.13 | 0.11 |
| high 2 | 0.01 | 0.13 | 0.14 |
| very high 3 | 0.01 | 0.09 | 0.14 |
Is \(P(Y=2 \mid X=3) > P(Y=2)\)?
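Reading the answer off the table: the marginal \(P(Y=2)\) sums the last column, and the conditional divides the joint cell by the row total for \(X=3\).

\[ P(Y=2) = 0.07 + 0.11 + 0.14 + 0.14 = 0.46 \]
\[ P(Y=2 \mid X=3) = \frac{P(X=3, Y=2)}{P(X=3)} = \frac{0.14}{0.01 + 0.09 + 0.14} \approx 0.58 \]

So yes: knowing the salary is very high raises the probability of high happiness from 0.46 to about 0.58.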
The computation of \(P(x \mid y)\) from \(P(x)\) and \(P(y \mid x)\) is called Bayes’ theorem: \[ P(x \mid y) = \frac{P(y,x)}{P(y)} = \frac{P(y\mid x)P(x)}{P(y)} \]
This shows how the conditional distribution is related to the joint and marginal distributions.
You’ll be given all the quantities on the r.h.s.
Key fact: \(P(x \mid y)\) is generally different from \(P(y \mid x)\)!
Example: Most people would agree
\[\begin{align*} \Pr & \left( \text{practice hard} \mid \text{play in NBA} \right) \approx 1\\ \Pr & \left( \text{play in NBA} \mid \text{practice hard} \right) \approx 0 \end{align*}\]
The main reason for the difference is that \(P(\text{play in NBA}) \approx 0\).
Two random variables \(X\) and \(Y\) are independent if \[ P(Y = y \mid X = x) = P (Y = y) \] for all possible \(x\) and \(y\) values. Knowing \(X=x\) tells you nothing about \(Y\)!
Example: Tossing a coin twice. What’s the probability of getting \(H\) on the second toss given we saw a \(T\) on the first one? It is still \(1/2\): the two tosses are independent.
Sally Clark was accused and convicted of killing her two children (the conviction was later overturned).
They could have both died of SIDS.
The chance of a SIDS death in a family of non-smokers with parents over 25 is around 1 in 8,500.
The chance of a family which has already had a SIDS death having a second is around 1 in 100.
The chance of a mother killing her two children is around 1 in 1,000,000.
The \(\frac{1}{100}\) comes from taking genetics into account.
Assuming (incorrectly) that the two deaths are independent: \[ P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) = (1/8500) \times (1/8500) \approx 1/73{,}000{,}000 \]
Let \(E\) be the evidence (both children died), \(I\) innocence, and \(G\) guilt. Then \[ \frac{P(I \mid E)}{P(G \mid E)} = \frac{P(E \cap I)}{P(E \cap G)} \] where \(P(E \cap I) = P(E \mid I)P(I)\), which requires a discussion of the prior \(P(I)\).
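Treating the figures above as rough estimates, using the \(1/100\) conditional for the second SIDS death and \(P(I) \approx 1\):

\[ P(E \cap I) \approx \frac{1}{8500} \times \frac{1}{100} = \frac{1}{850{,}000}, \qquad P(E \cap G) \approx \frac{1}{1{,}000{,}000} \]
\[ \frac{P(I \mid E)}{P(G \mid E)} \approx \frac{1{,}000{,}000}{850{,}000} \approx 1.18 \]

Given the evidence, innocence is slightly more likely than guilt; nothing remotely like one in 73 million.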
The expected value of a random variable is simply a weighted average of the possible values X can assume.
The weights are the probabilities of occurrence of those values.
\[E(X) = \sum_x xP(X=x)\]
With \(n\) equally likely outcomes with values \(x_1, \ldots, x_n\), \(P(X = x_i) = 1/n\)
\[E(X) = \frac{x_1+x_2+\ldots+x_n}{n}\]
Example (European roulette, 37 pockets): bet \$1 on a single number; the payout is 36 if your number comes up. \[E(X) = \frac{1}{37}\times 36 + \frac{36}{37}\times 0 = 0.97\] Now bet \$1 on red instead; the payout is 2, and 18 of the 37 pockets are red. \[E(X) = \frac{18}{37}\times 2 + \frac{19}{37}\times 0 = 0.97\]
The casino is guaranteed to make money in the long run: each \$1 bet returns only \$0.97 on average!
The variance is calculated as
\[Var(X) = E\left((X - E(X))^2\right)\]
A simpler calculation is \(Var(X) = E(X^2) - E(X)^2\).
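This identity follows from expanding the square and using linearity of expectation:

\[ Var(X) = E\left(X^2 - 2XE(X) + E(X)^2\right) = E(X^2) - 2E(X)^2 + E(X)^2 = E(X^2) - E(X)^2 \]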
The standard deviation is the square-root of variance.
\[sd(X) = \sqrt{Var(X)}\]
For the single-number bet: \[Var(X) = \frac{1}{37}\times (36 - 0.97)^2 + \frac{36}{37}\times (0 - 0.97)^2 \approx 34\] For the red bet: \[Var(X) = \frac{18}{37}\times (2 - 0.97)^2+ \frac{19}{37}\times (0- 0.97)^2 \approx 1\]
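A quick numeric check of all four figures; a minimal sketch in Python for the two bets above on a 37-pocket European wheel:

```python
# Exact mean, variance, and sd of the payoff of a $1 roulette bet,
# computed from a list of (payoff, probability) outcomes.
def moments(outcomes):
    mean = sum(x * p for x, p in outcomes)
    var = sum((x - mean) ** 2 * p for x, p in outcomes)
    return mean, var, var ** 0.5

single_number = [(36, 1 / 37), (0, 36 / 37)]  # payout 36 with prob 1/37
red = [(2, 18 / 37), (0, 19 / 37)]            # payout 2 with prob 18/37

print(moments(single_number))  # approx (0.973, 34.08, 5.84)
print(moments(red))            # approx (0.973, 1.00, 1.00)
```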
If your goal is to spend as much time as possible in the casino (free drinks!), place small bets on black/red: the expected loss is the same, but the variance is far lower, so your bankroll drains slowly and steadily instead of in big swings.
Tortoise and Hare are selling cars. Probability distributions, means, and variances for \(X\), the number of cars sold:

| cars sold \(X\) | 0 | 1 | 2 | 3 | \(E(X)\) | \(Var(X)\) | \(\sqrt{Var(X)}\) |
|---|---|---|---|---|---|---|---|
| Tortoise | 0 | 0.5 | 0.5 | 0 | 1.5 | 0.25 | 0.5 |
| Hare | 0.5 | 0 | 0 | 0.5 | 1.5 | 2.25 | 1.5 |
Let’s verify the Tortoise’s and the Hare’s expectations and variances.
The Tortoise \[\begin{align*} E(T) &= (1/2)(1) + (1/2)(2) = 1.5 \\ Var(T) &= E(T^2) - E(T)^2 \\ &= (1/2)(1)^2 + (1/2)(2)^2 - (1.5)^2 = 0.25 \end{align*}\]
Now the Hare’s \[\begin{align*} E(H) &= (1/2)(0) + (1/2)(3) = 1.5 \\ Var(H) &= (1/2)(0)^2 + (1/2)(3)^2- (1.5)^2 = 2.25 \end{align*}\]
What do these tell us about the long-run behavior?
Two key properties: let \(a, b\) be given constants. Then
\[E(aX + bY) = aE(X) + bE(Y)\]
\[Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab\,Cov(X,Y)\]
where \(Cov(X,Y)\) is the covariance between the random variables.
What about combining the Tortoise and the Hare? We need to know \(Cov(T,H)\). Let’s take \(Cov(T,H) = -1\) and see what happens.
Suppose \(a = \frac{1}{2}\), \(b = \frac{1}{2}\). Expectation and variance:
\[\begin{align*} E\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{2} E(T) + \frac{1}{2} E(H) = \frac{1}{2} \times 1.5 + \frac{1}{2} \times 1.5 = 1.5 \\ Var\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{4}\times 0.25 + \frac{1}{4}\times 2.25 + 2 \times \frac{1}{2} \times \frac{1}{2} \times (-1) = 0.625 - 0.5 = 0.125 \end{align*}\]
Much lower!
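A minimal sketch in Python of this diversification effect, plugging different covariances into the variance formula above (the value \(-1\) is the hypothetical one we assumed):

```python
# Var(aT + bH) = a^2 Var(T) + b^2 Var(H) + 2ab Cov(T, H)
a, b = 0.5, 0.5
var_t, var_h = 0.25, 2.25  # Tortoise and Hare variances from the table

def portfolio_var(cov):
    return a**2 * var_t + b**2 * var_h + 2 * a * b * cov

for cov in [-1.0, -0.5, 0.0, 0.5]:  # -1.0 is the assumed value above
    print(cov, portfolio_var(cov))
# cov = -1.0 gives 0.125; the negative covariance is what makes the mix so stable
```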
Alice is a 40-year-old woman who gets a positive mammogram result. What is the chance that she really has breast cancer?
We want the posterior probability \(P(\text{cancer} \mid \text{positive mammogram})\).
Of 1000 cases:
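The specific rates did not survive on this slide, so as a sketch assume the standard textbook numbers: 1% prevalence, 90% sensitivity, and a 9% false-positive rate (these are assumptions, not the slide’s values). Then Bayes’ rule gives:

```python
# Bayes' rule for P(cancer | positive mammogram).
# Assumed illustrative rates (not from the slide):
p_cancer = 0.01             # prevalence among 40-year-old women
p_pos_given_cancer = 0.90   # sensitivity
p_pos_given_healthy = 0.09  # false-positive rate

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(p_cancer_given_pos)  # about 0.092
```

In natural frequencies: of 1000 women, about 10 have cancer and 9 of them test positive, while about 89 of the 990 healthy women also test positive, so only roughly 9 of the 98 positives (about 9%) actually have cancer.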
Conditional probability is how AI systems express judgments in a way that reflects their partial knowledge.
Personalization runs on conditional probabilities, all of which must be estimated from massive data sets in which you are the conditioning event.
Many Business Applications! Suggestions vs. Search, …
Will a subscriber like Saving Private Ryan, given that he or she liked the HBO series Band of Brothers?
Both are epic dramas about the Normandy invasion and its aftermath.
There are 100 people in your database, and every one of them has seen both films.
Their viewing histories come in the form of a big “ratings matrix”.
|  | Liked Band of Brothers | Didn’t like it |
|---|---|---|
| Liked Saving Private Ryan | 56 subscribers | 6 subscribers |
| Didn’t like it | 14 subscribers | 24 subscribers |
\[P(\text{likes Saving Private Ryan} \mid \text{likes Band of Brothers})=\frac{56}{56+14}=80\%\]
But the real problem is much more complicated:
The solution to all three issues is careful modeling.
The fundamental equation is: \[\text{Predicted Rating} =\text{Overall Average} + \text{Film Offset} + \text{User Offset} + \text{User-Film Interaction}\]
The first three terms provide a baseline for a given user/film pair:
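A minimal sketch in Python of the three baseline terms; the tiny ratings matrix is made-up illustrative data, with `np.nan` marking films a user has not rated:

```python
import numpy as np

# Made-up ratings matrix: rows = users, columns = films, np.nan = not rated.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 2.0]])

overall = np.nanmean(R)                        # overall average rating
film_offset = np.nanmean(R, axis=0) - overall  # each film vs. the average
user_offset = np.nanmean(R, axis=1) - overall  # each user vs. the average

def baseline(user, film):
    """Predicted rating before the user-film interaction term is added."""
    return overall + film_offset[film] + user_offset[user]

print(baseline(2, 0))  # baseline prediction for user 2 on film 0
```

The user-film interaction term is then estimated on top of this baseline.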