Chief AI Officer Program: ML Essentials

Introduction: AI Today and in the Past. Probability and Bayes Rule

Vadim Sokolov

George Mason University

Brief History of AI

Mechanical machines

Robots and Automatic Machines Were Often Remarkably Inventive: Al-Jazari (XII Century)

Hesdin Castle (Robert II of Artois), Leonardo’s robot…

Mechanical machines

Jaquet-Droz automata (XVIII century):

Mechanical machines

  • But this was all in mechanics; in mathematics and logic, AI remained quite rudimentary for a long time

Logic machine of Ramon Llull (XIII-XIV centuries)

  • Starting with Dr. Frankenstein, AI appears constantly in literature from then on …

Shannon’s Theseus

  • YouTube Video
  • Early 1950s: Claude Shannon (the father of information theory) demonstrates Theseus
  • A life-sized magnetic mouse, controlled by relay circuits, that learns its way around a maze.

1956-1960: Great hopes

  • An optimistic time. It seemed that we were almost there…
  • Allen Newell, Herbert A. Simon, and Cliff Shaw: Logic Theorist.
  • Automated reasoning.
  • It was able to prove many of the theorems in Principia Mathematica, in some places more elegantly than Russell and Whitehead.

1956-1960: Great hopes

  • General Problem Solver - a program that tried to think like a person
  • Many programs that could do limited things in restricted domains (microworlds):
    • Analogy (IQ tests with multiple-choice questions)
    • Student (algebra word problems)
    • Blocks World (rearranging 3D blocks).

1970s: Knowledge Based Systems

  • The idea: accumulate a fairly large set of rules and knowledge about the subject area, then draw conclusions from them.
  • First success: MYCIN - diagnosis of blood infections:
    • about 450 rules
    • Its results were on par with experienced doctors and significantly better than novice doctors.

1980-2010: Commercial Applications, Industry AI

  • The first AI department was at DEC (Digital Equipment Corporation). It is estimated that by 1986 it was saving DEC about $10 million per year.
  • The boom ended by the late 1980s, when many companies failed to live up to the high expectations.

Rule-Based System vs Bayes

Old AI


If it is raining outside, then take an umbrella.

This rule cannot be learned from data, and it does not allow inference in reverse: it says nothing about rain outside if I see an umbrella.


 

New AI

Probability of taking an umbrella, given that there is rain.

A conditional probability rule can be learned from data and allows inference: we can calculate the probability of rain outside if we see an umbrella.

  • Bayesian approach is a powerful statistical framework based on the work of Thomas Bayes and later Laplace.
  • It provides a probabilistic approach to reasoning and learning, allowing us to update our beliefs about the world as we gather new data.
  • This makes it a natural fit for artificial intelligence, where we often need to deal with uncertainty and incomplete information.
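
Here is a minimal sketch of the umbrella example as a Bayes-rule calculation in Python; the prior P(rain) and the conditional probabilities are made-up numbers for illustration, not estimates from data.

# Bayes rule for the umbrella example (illustrative numbers)
p_rain = 0.3                     # prior: P(rain)
p_umbrella_given_rain = 0.9      # P(umbrella | rain)
p_umbrella_given_dry = 0.2       # P(umbrella | no rain)

# marginal probability of seeing an umbrella
p_umbrella = p_umbrella_given_rain * p_rain + p_umbrella_given_dry * (1 - p_rain)

# posterior: P(rain | umbrella)
p_rain_given_umbrella = p_umbrella_given_rain * p_rain / p_umbrella
print(round(p_rain_given_umbrella, 3))   # 0.659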

DEFINITION

  • How do we define “learning”?

Definition:

A computer program learns, with respect to a certain class of tasks \(T\) and a performance measure \(P\), if the quality of its solutions to these tasks (as measured by \(P\)) improves as it gains new experience.

  • The definition is very (too?) general.
  • What specific examples can be given?

Tasks and concepts of ML

Tasks and concepts of ML: Supervised Learning

  • Training sample: a set of examples, each consisting of input features (attributes) and the correct “answer”, the response variable
  • Learn a rule that maps input features to the response variable
  • Then this rule is applied to new examples (deployment)
  • The main goal is to train a model that explains not only the examples from the training set, but also new examples (it generalizes)
  • Otherwise we have overfitting
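
A minimal sketch of the supervised-learning workflow, assuming scikit-learn is available; the features, labels, and new example are invented purely for illustration.

# Supervised learning: fit a rule on a training sample, then apply it to new examples
from sklearn.linear_model import LogisticRegression

X_train = [[25, 1], [40, 0], [35, 1], [50, 0]]   # input features for each example
y_train = [0, 1, 0, 1]                           # correct "answers" (response variable)

model = LogisticRegression()
model.fit(X_train, y_train)      # learn the rule from the training sample

X_new = [[30, 1]]                # deployment: a new, unseen example
print(model.predict(X_new))      # predicted response for the new example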

Tasks and concepts of ML: unsupervised learning

There are no correct answers, only data, e.g. clustering:

  • We need to divide the data into previously unknown classes of somewhat similar items:
    • identify families of genes from nucleotide sequences
    • cluster users and personalize the application for them
    • segment a mass-spectrometry image into regions with different composition
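
A small clustering sketch, assuming scikit-learn: k-means splits the points into groups based only on similarity, with no response variable; the data are made up for illustration.

# Unsupervised learning: k-means clustering, no "correct answers" given
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster label assigned to each point
print(labels)                    # two groups, e.g. [0 0 0 1 1 1]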

Tasks and concepts of ML: unsupervised learning

  • Dimensionality reduction: the data have high dimension; we need to reduce it and select the most informative features so that all of the above algorithms can work
  • Matrix Completion: there is a sparse matrix, and we must predict the values in the missing positions.
  • Anomaly detection: find anomalies in the data, e.g. fraud detection.
  • Often the answers are given only for a small part of the data; then we call it semi-supervised learning.
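
A short dimensionality-reduction sketch, assuming scikit-learn: PCA projects high-dimensional data onto a few informative directions; the data here are random and purely illustrative.

# Dimensionality reduction: project 5-dimensional data onto 2 principal components
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 examples, 5 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # same 100 examples, now 2 features
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance kept by each component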

Tasks and concepts of ML: reinforcement learning

  • Multi-armed bandits: there is a set of actions, each of which leads to random results, and you need to collect as much reward as possible
  • Exploration vs. Exploitation: how and when to move from exploring new actions to exploiting what has already been learned
  • Credit Assignment: you get the reward only at the very end (you won the game), and we must somehow distribute this reward across all the moves that led to the victory.
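
A toy sketch of the exploration/exploitation trade-off: an epsilon-greedy strategy for a three-armed bandit, with made-up reward probabilities for illustration.

# Epsilon-greedy strategy for a 3-armed Bernoulli bandit (illustrative)
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.3, 0.5, 0.7]             # unknown reward probability of each arm
counts = np.zeros(3)                 # how many times each arm was pulled
values = np.zeros(3)                 # running estimate of each arm's reward
epsilon = 0.1                        # fraction of the time we explore

for t in range(1000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))    # explore: try a random arm
    else:
        arm = int(np.argmax(values))  # exploit: pull the best arm so far
    reward = float(rng.random() < true_p[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print(values.round(2), counts)       # estimates and pulls concentrate on the best arm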

Tasks and concepts of ML: active learning

  • Active Learning: how to choose the next (relatively expensive) test
  • Boosting: how to combine several weak classifiers so that the combination is a strong one
  • Model Selection: where to draw the line between models with many parameters and models with few.
  • Ranking: the list of responses is ordered (e.g. internet search)

Tasks and concepts of AI

Tasks and concepts of AI: Reasoning

  • Bayesian networks: given conditional probabilities, calculate the probability of the event
  • o1 by OpenAI: a family of AI models that are designed to perform complex reasoning tasks, such as math, coding, and science. o1 models placed among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME)
  • Gemini 2.0: model for the agentic era

Tasks and concepts of AI: Representation

  • Knowledge Graphs: a graph database that uses semantic relationships to represent knowledge
  • Embeddings: a way to represent data in a lower-dimensional space
  • Transformers: a deep learning model that uses self-attention to process sequential data
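
A small sketch of the embedding idea: items are represented as low-dimensional vectors, and geometric closeness (here cosine similarity) stands in for semantic similarity. The vectors below are made up for illustration; real embeddings are learned from data.

# Embeddings: represent items as vectors; similar items get similar vectors
import numpy as np

embeddings = {
    "cat":   np.array([0.9, 0.1, 0.0]),   # illustrative 3-dimensional vectors
    "tiger": np.array([0.8, 0.2, 0.1]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["cat"], embeddings["tiger"]))   # high: related concepts
print(cosine(embeddings["cat"], embeddings["car"]))     # low: unrelated concepts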

Tasks and concepts of AI: Generation

In shadows of data, uncertainty reigns,
Bayesian whispers, where knowledge remains.
With prior beliefs, we start our quest,
Updating with evidence, we strive for the best.

A dance of the models, predictions unfold,
Inferences drawn, from the new and the old.
Through probabilities, we find our way,
In the world of AI, it’s the Bayesian sway.

So gather your data, let prior thoughts flow,
In the realm of the unknown, let your insights grow.
For in this approach, with each little clue,
We weave understanding, both rich and true.

Music

Tasks and concepts of AI: Generation

# Generate an image with the OpenAI Images API (requires the openai package and a valid key)
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
response = client.images.generate(
    model="dall-e-3",
    prompt="a hockey player trying to understand the Bayes rule",
    size="1024x1024",
    quality="standard",
    n=1,
)

print(response.data[0].url)

Tasks and concepts of AI: Generation

A humorous and illustrative scene of a hockey player sitting on a bench in full gear, holding a hockey stick in one hand and a whiteboard marker in the other …

Chess and AI

Old AI: Deep Blue (1997) vs. Garry Kasparov

Kasparov vs IBM’s Deep Blue in 1997

AlphaGo Zero

  • Removes all human knowledge from the training process - uses only self-play.
  • Takes the raw board as input, and a neural network predicts the next move.
  • Uses Monte Carlo tree search to evaluate positions.
  • The algorithm was able to beat AlphaGo 100-0. It was then applied to chess and shogi and beat the strongest programs in those games as well.

AlphaGo vs Lee Sedol: Move 37 by AlphaGo in Game Two

Probability in machine learning

  • In all methods and approaches, it is useful not only to generate an answer, but also to evaluate how confident we are in that answer, how well the model describes the data, how these values will change in further experiments, etc.
  • Therefore, probability theory plays a central role in machine learning - and we will also use it actively.

Bayes Approach

Review of Basic Probability Concepts

Probability lets us talk efficiently about things that we are uncertain about.

  • What will Amazon’s sales be next quarter?
  • What will the return be on my stocks next year?
  • How often will users click on a particular Google ad?

All these involve estimating or predicting unknowns!!

Random Variables

Random Variables are numbers that we are not sure about. There’s a list of potential outcomes. We assign probabilities to each outcome.

Example: Suppose that we are about to toss two coins. Let \(X\) denote the number of heads. We call \(X\) the random variable that stands for the potential outcome.

Probability

Probability is a language designed to help us communicate about uncertainty. We assign a number between \(0\) and \(1\) measuring how likely an event is to occur. It’s immensely useful, and there are only a few basic rules.

  1. If an event \(A\) is certain to occur, it has probability \(1\), denoted \(P(A)=1\)
  2. Either an event \(A\) occurs or it does not. \[P(A) = 1 - P(\text{not }A)\]
  3. If two events are mutually exclusive (both cannot occur simultaneously) then \[P(A \text{ or } B) = P(A) + P(B)\]
  4. If two events are independent, the joint probability is \[P(A \text{ and } B) = P( A) P(B)\]

Probability Distribution

We describe the behavior of random variables with a Probability Distribution

Example: Suppose we are about to toss two coins. Let \(X\) denote the number of heads.

\[X = \left\{ \begin{array}{ll} 0 \text{ with prob. } 1/4\\ 1 \text{ with prob. } 1/2\\ 2 \text{ with prob. } 1/4 \end{array} \right.\]

\(X\) is called a Discrete Random Variable

Question: What is \(P(X=0)\)? How about \(P(X \geq 1)\)?
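
A quick simulation sketch to check the answers (\(P(X=0)=1/4\) and \(P(X \geq 1)=3/4\)), assuming numpy is available.

# Simulate tossing two fair coins many times; X = number of heads
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100_000, 2)).sum(axis=1)   # heads = 1, tails = 0

print((X == 0).mean())   # close to 1/4
print((X >= 1).mean())   # close to 3/4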

Example: Happiness Index

The joint distribution of salary and a “happiness index”:

Salary (\(X\))       Happiness \(Y=0\) (low)   \(Y=1\) (medium)   \(Y=2\) (high)
low (0)              0.03                      0.12               0.07
medium (1)           0.02                      0.13               0.11
high (2)             0.01                      0.13               0.14
very high (3)        0.01                      0.09               0.14

Is \(P(Y=2 \mid X=3) > P(Y=2)\)?
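
Working it out from the table: \[P(Y=2) = 0.07+0.11+0.14+0.14 = 0.46, \qquad P(Y=2 \mid X=3) = \frac{0.14}{0.01+0.09+0.14} \approx 0.58.\] So yes: knowing the salary is very high raises the probability of high happiness.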

Bayes Rule

The computation of \(P(x \mid y)\) from \(P(x)\) and \(P(y \mid x)\) is called Bayes theorem: \[ P(x \mid y) = \frac{P(y,x)}{P(y)} = \frac{P(y\mid x)P(x)}{P(y)} \]

This shows how the conditional distribution is related to the joint and marginal distributions.

You’ll be given all the quantities on the r.h.s.

Bayes Rule

Key fact: \(P(x \mid y)\) is generally different from \(P(y \mid x)\)!

Example: Most people would agree

\[\begin{align*} \Pr & \left ( \text{Practice hard} \mid \text{Play in NBA} \right ) \approx 1\\ \Pr & \left ( \text{Play in NBA} \mid \text{Practice hard} \right ) \approx 0 \end{align*}\]

The main reason for the difference is that \(P(\text{Play in NBA}) \approx 0\).

Independence

Two random variables \(X\) and \(Y\) are independent if \[ P(Y = y \mid X = x) = P (Y = y) \] for all possible \(x\) and \(y\) values. Knowing \(X=x\) tells you nothing about \(Y\)!

Example: Tossing a coin twice. What’s the probability of getting \(H\) in the second toss given we saw a \(T\) in the first one?

Sally Clark Case: Independence or Bayes?

Sally Clark was accused and convicted of killing her two children

They could have both died of SIDS.

  • The chance of a SIDS death in a family where the parents are non-smokers and over 25 is around 1 in 8,500.

  • The chance of a family which has already had a SIDS death having a second is around 1 in 100.

  • The chance of a mother killing her two children is around 1 in 1,000,000.

Bayes or Independence

  1. Under Bayes \[\begin{align*} P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) & = P \left( \mathrm{first} \; \mathrm{SIDS} \right) P \left( \mathrm{Second} \; \; \mathrm{SIDS} | \mathrm{first} \; \mathrm{SIDS} \right) \\ & = \frac{1}{8500} \cdot \frac{1}{100} = \frac{1}{850,000} \end{align*}\]

The \(\frac{1}{100}\) comes from taking into account genetics.

  2. Independence, as the court assumed, gives you

\[ P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) = (1/8500) (1/8500) \approx 1/73{,}000{,}000 \]

  3. By Bayes rule

\[ \frac{P(I \mid E)}{P(G \mid E)} = \frac{P( E \cap I)}{P( E \cap G)} \] where \(E\) is the evidence, \(I\) is innocence and \(G\) is guilt. \(P( E \cap I) = P(E \mid I )P(I)\) needs a discussion of \(P(I)\).

Random Variables: Expectation \(E(X)\)

The expected value of a random variable is simply a weighted average of the possible values X can assume.

The weights are the probabilities of occurrence of those values.

\[E(X) = \sum_x xP(X=x)\]

With \(n\) equally likely outcomes with values \(x_1, \ldots, x_n\), \(P(X = x_i) = 1/n\)

\[E(X) = \frac{x_1+x_2+\ldots+x_n}{n}\]

Roulette Expectation

  • European Odds: 36 numbers (red/black) + zero
  • You bet $1 on 11 Black (pays 35 to 1)
  • \(X\) is the return on this bet

\[E(X) = \frac{1}{37}\times 36 + \frac{36}{37}\times 0 = 0.97\]

  • If you bet $1 on Black (pays 1 to 1)

\[E(X) = \frac{18}{37}\times 2 + \frac{19}{37}\times 0 = 0.97\]

Casino is guaranteed to make money in the long run!

Standard Deviation \(sd(X)\) and Variance \(Var(X)\)

The variance is calculated as

\[Var(X) = E\left((X - E(X))^2\right)\]

A simpler calculation is \(Var(X) = E(X^2) - E(X)^2\).

The standard deviation is the square-root of variance.

\[sd(X) = \sqrt{Var(X)}\]

Roulette Variance

  • European Odds: 36 numbers (red/black) + zero
  • You bet $1 on 11 Black (pays 35 to 1)
  • \(X\) is the return on this bet

\[Var(X) = \frac{1}{37}\times (36 - 0.97)^2 + \frac{36}{37}\times (0 - 0.97)^2 = 34\]

  • If you bet $1 on Black (pays 1 to 1)

\[Var(X) = \frac{18}{37}\times (2 - 0.97)^2+ \frac{19}{37}\times (0- 0.97)^2 = 1\]

If your goal is to spend as much time as possible in the casino (free drinks): place small bets on black/red
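
A simulation sketch (numpy) to check these expectations and variances; the 18 “winning” pockets for the even-money bet are taken as 1-18 purely for convenience, since only the count matters.

# Simulate European roulette returns: $1 on a single number vs $1 on a color
import numpy as np

rng = np.random.default_rng(0)
spins = rng.integers(0, 37, size=1_000_000)   # pocket 0 plus numbers 1-36

single = np.where(spins == 11, 36, 0)                  # straight-up bet pays 35 to 1
color = np.where((spins >= 1) & (spins <= 18), 2, 0)   # even-money bet, 18 of 37 pockets win

print(single.mean(), single.var())   # close to 0.97 and 34
print(color.mean(), color.var())     # close to 0.97 and 1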

Example: \(E(X)\) and \(Var(X)\)

Tortoise and Hare are selling cars. Probability distributions, means and variances for \(X\), the number of cars sold:

Cars sold \(X\)     0      1      2      3      Mean \(E(X)\)   Variance \(Var(X)\)   sd \(\sqrt{Var(X)}\)
Tortoise            0      0.5    0.5    0      1.5             0.25                  0.5
Hare                0.5    0      0      0.5    1.5             2.25                  1.5

Expectation and Variance Calculations

Let’s do Tortoise expectations and variances

  • The Tortoise \[\begin{align*} E(T) &= (1/2)(1) + (1/2)(2) = 1.5 \\ Var(T) &= E(T^2) - E(T)^2 \\ &= (1/2)(1)^2 + (1/2)(2)^2 - (1.5)^2 = 0.25 \end{align*}\]

  • Now the Hare’s \[\begin{align*} E(H) &= (1/2)(0) + (1/2)(3) = 1.5 \\ Var(H) &= (1/2)(0)^2 + (1/2)(3)^2- (1.5)^2 = 2.25 \end{align*}\]

Expectation and Variance Interpretation

What do these tell us about the long run behavior?

  • Tortoise and Hare have the same expected number of cars sold.
  • Tortoise is more predictable than Hare: he has a smaller variance. The standard deviations \(\sqrt{Var(X)}\) are \(0.5\) and \(1.5\), respectively
  • Given two equal means, you always want to pick the lower variance.

Linear Combinations of Random Variables

Two key properties:

Let \(a, b\) be given constants

  • Expectations and Variances \[\begin{align*} E(aX + bY) &= a E(X) + b E(Y) \\ Var(aX + bY) &= a^2 Var(X) + b^2 Var(Y) + 2 ab Cov(X,Y) \end{align*}\]

where \(Cov(X,Y)\) is the covariance between random variables.

Tortoise and Hare Portfolio

What about a portfolio of Tortoise and Hare? We need to know \(Cov(\text{Tortoise, Hare})\). Let’s take \(Cov(T,H) = -1\) and see what happens.

Suppose \(a = \frac{1}{2}, b= \frac{1}{2}\). Expectation and variance:

\[\begin{align*} E\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{2} E(T) + \frac{1}{2} E(H) = \frac{1}{2} \times 1.5 + \frac{1}{2} \times 1.5 = 1.5 \\ Var\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{4} (0.25) + \frac{1}{4} (2.25) + 2 \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot (-1) = 0.625 - 0.5 = 0.125 \end{align*}\]

Much lower!

Bayesian Updating

“Personalization” \(=\) “Conditional Probability”

  • Conditional probability is how AI systems express judgments in a way that reflects their partial knowledge.
  • Personalization runs on conditional probabilities, all of which must be estimated from massive data sets in which you are the conditioning event.


Many Business Applications!! Suggestions vs Search….

Bayes’s Rule in Medical Diagnostics

Alice is a 40-year-old woman. What is the chance that she really has breast cancer when she gets a positive mammogram result, given the following conditions:

  1. The prevalence of breast cancer among people like Alice is 1%.
  2. The test has an 80% detection rate.
  3. The test has a 10% false-positive rate.

The posterior probability \(P(\text{cancer} \mid \text{positive mammogram})\)?

Medical Diagnostics - Visualization

Medical Screening

Of 1000 cases:

  • 108 positive mammograms. 8 are true positives. The remaining 100 are false positives.
  • 892 negative mammograms. 2 are false negatives. The other 890 are true negatives.
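
The same answer directly from Bayes’ rule: \[P(\text{cancer} \mid \text{positive}) = \frac{0.80 \times 0.01}{0.80 \times 0.01 + 0.10 \times 0.99} \approx 0.075,\] so a positive mammogram still leaves only about a 7-8% chance of cancer, consistent with the 8 true positives among roughly 108 positive results above.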

“Personalization” = “Conditional Probability”

Conditional probability is how AI systems express judgments in a way that reflects their partial knowledge.

Personalization runs on conditional probabilities, all of which must be estimated from massive data sets in which you are the conditioning event.

Many Business Applications!! Suggestions vs Search, ….

How does Netflix Give Recommendations?

Will a subscriber like Saving Private Ryan, given that he or she liked the HBO series Band of Brothers?

Both are epic dramas about the Normandy invasion and its aftermath.

Suppose there are 100 people in your database, and every one of them has seen both films.

Their viewing histories come in the form of a big “ratings matrix”.

                             Liked Band of Brothers   Didn’t like it
Liked Saving Private Ryan    56 subscribers           6 subscribers
Didn’t like it               14 subscribers           24 subscribers

\[P(\text{likes Saving Private Ryan} \mid \text{likes Band of Brothers})=\frac{56}{56+14}=80\%\]

How does Netflix Give Recommendations? - Complexity

But the real problem is much more complicated:

  1. Scale. Netflix has 100 million subscribers and ratings data on more than 10,000 shows. The ratings matrix has more than a trillion possible entries.
  2. “Missingness”. Most subscribers haven’t watched most films. Moreover, the missingness pattern is informative.
  3. Combinatorial explosion. In a database with 10,000 films, no one else’s history is exactly the same as yours.

The solution to all three issues is careful modeling.

How does Netflix Give Recommendations? - Fundamental Equation

The fundamental equation is: \[\text{Predicted Rating} =\text{Overall Average} + \text{Film Offset} + \text{User Offset} + \text{User-Film Interaction}\]

The first three terms provide a baseline for a given user/film pair:

  • The overall average rating across all films is 3.7.
  • Every film has its own offset. Popular movies have positive offsets.
  • Every user has an offset. Some users are more or less critical than average.

Netflix - Latent Features

  • The User-Film Interaction is calculated from the fact that a person’s ratings of similar films exhibit patterns, because those ratings are all associated with a latent feature of that person.
  • There’s not just one latent feature to describe Netflix subscribers, but dozens or even hundreds. There’s a “British murder mystery” feature, a “gritty character-driven crime drama” feature, a “cooking show” feature, a “hipster comedy films” feature, …
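
A schematic sketch of the fundamental equation, with the user-film interaction computed as a dot product of latent-feature vectors; all of the offsets and feature values below are invented for illustration and are not Netflix’s actual numbers.

# Predicted rating = overall average + film offset + user offset + user-film interaction
import numpy as np

overall_average = 3.7                       # average rating across all films
film_offset = 0.4                           # illustrative: a popular film
user_offset = -0.2                          # illustrative: a slightly critical user

# latent features (e.g. "British murder mystery", "gritty crime drama", ...)
user_features = np.array([0.9, 0.1, 0.3])   # how much this user likes each feature
film_features = np.array([0.8, 0.0, 0.1])   # how much this film expresses each feature

interaction = float(user_features @ film_features)
predicted_rating = overall_average + film_offset + user_offset + interaction
print(round(predicted_rating, 2))           # 4.65 for these made-up numbers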

The Hidden Features Tell the Story

  • These latent features are the magic elixir of the digital economy–a special brew of data, algorithms, and human insight that represents the most perfect tool ever conceived for targeted marketing.
  • Your precise combination of latent features–your tiny little corner of a giant multidimensional Euclidean space–makes you a demographic of one.
  • Netflix spent $130 million for 10 episodes on The Crown. Other network television: $400 million commissioning 113 pilots, of which 13 shows made it to a second season.