11  Pattern Matching

“Prediction is very difficult, especially about the future.” Niels Bohr, Danish physicist and Nobel laureate

The history of data analysis is closely intertwined with the development of pattern matching techniques. The ability to identify and understand patterns in data has been crucial for scientific discoveries, technological advancements, and decision-making. From the early days of astronomy to modern machine learning, pattern matching has played a pivotal role in advancing our understanding of the world around us. This chapter explores the key concepts of pattern matching, its historical development, and its impact on data analysis.

Data science involves two major steps: collecting and cleaning the data, and building a model or applying an algorithm. In this chapter we present the process of building predictive models. To illustrate the process, think of your data as being generated by a black box in which a set of input variables \(x\) goes through the box and generates an output variable \(y\).

11.1 Why Pattern Matching?

For Gauss, Laplace, and many other scientists, the central challenge was estimating parameters when the functional form of the relationship was already known, often linear (as in the Earth-shape example) or multiplicative (e.g., Newton’s \(F=ma\)). In many modern problems, however, the relationship itself is unknown and defies simple mathematical description; human behaviour and natural language are prominent examples (Halevy, Norvig, and Pereira 2009).

In such cases a pattern-matching approach can uncover the hidden relationships directly from data. Pattern matching means identifying recurring sequences, relationships, or structures within a dataset—much like finding a puzzle piece that completes a picture. Recognising these patterns yields insights, reveals trends, supports prediction, and ultimately improves decisions. Initial pattern-matching studies have often sparked major scientific advances, as the early history of mammography illustrates.

Example 11.1 (Mammography and Early Pattern Matching) The use of mammograms for breast cancer detection relied on simple pattern matching in its initial stages. Radiologists visually examined the X-ray images for specific patterns indicative of cancer, such as dense areas of tissue appearing different from surrounding breast tissue (masses) and small white spots of calcium deposits called microcalcifications. These patterns were associated with early-stage cancer and could be easily missed by visual inspection alone.

Radiologists relied on their expertise and experience to identify these patterns and distinguish them from normal variations in breast tissue. This process was subjective and prone to error, particularly with subtle abnormalities or in dense breasts. Despite these limitations, pattern matching played a crucial role in the early detection of breast cancer, saving countless lives, and it served as the foundation for mammography as a screening tool.

Albert Salomon, a German surgeon, played a pivotal role in the early development of mammography (Nicosia et al. 2023). His most significant contribution was his 1913 monograph, “Beiträge zur Pathologie und Klinik der Mammakarzinome” (Contributions to the Pathology and Clinic of Breast Cancers), in which he demonstrated the potential of X-ray imaging for studying breast disease. Pioneering the use of X-rays, he compared images of surgically removed breast tissue with the actual tissue and identified characteristic features of cancerous tumors, such as their size, shape, and borders. He was also one of the first to recognize the association between small calcifications appearing on X-rays and breast cancer.

The presence of calcium deposits is correlated with breast cancer and remains a prevailing imaging biomarker for its detection. Although the discovery of the deposit-cancer association spurred further scientific work, the molecular mechanisms that lead to the formation of these calcium deposits, as well as the significance of their presence in human tissue, are still not completely understood (Bonfiglio et al. 2021).

Richard Feynman on Pattern Matching and Chess

Richard Feynman, the renowned physicist, was a strong advocate for the importance of pattern matching and its role in learning and problem-solving. He argued that many scientific discoveries start with pattern matching. He emphasized that experts in any field, whether it’s chess, science, or art, develop a strong ability to identify and understand relevant patterns in their respective domains.

He often used the example of chess to illustrate this concept (Feynman n.d.). Feynman argued that a skilled chess player doesn’t consciously calculate every possible move. Instead, they recognize patterns on the board and understand the potential consequences of their actions. For example, a chess player might recognize that having a knight in a certain position is advantageous and will lead to a favorable outcome. This ability to identify and understand patterns allows them to make quick and accurate decisions during the game. Through playing and analyzing chess games, players develop mental models that represent their understanding of the game’s rules, strategies, and potential patterns. These mental models allow them to anticipate their opponent’s moves and formulate effective responses.

He emphasized that this skill could be transferred to other domains, such as scientific research, engineering, and even everyday problem-solving.

Here is a quote from his interview:

Richard Feynman

Let’s say a chess game. And you don’t know the rules of the game, but you’re allowed to look at the board from time to time, in a little corner, perhaps. And from these observations, you try to figure out what the rules are of the game, what [are] the rules of the pieces moving.

You might discover after a bit, for example, that when there’s only one bishop around on the board, that the bishop maintains its color. Later on you might discover the law for the bishop is that it moves on a diagonal, which would explain the law that you understood before, that it maintains its color. And that would be analogous to the way we discover one law and later find a deeper understanding of it.

Ah, then things can happen–everything’s going good, you’ve got all the laws, it looks very good–and then all of a sudden some strange phenomenon occurs in some corner, so you begin to investigate that, to look for it. It’s castling–something you didn’t expect…

After you’ve noticed that the bishops maintain their color and that they go along on the diagonals and so on, for such a long time, and everybody knows that that’s true; then you suddenly discover one day in some chess game that the bishop doesn’t maintain its color, it changes its color. Only later do you discover the new possibility that the bishop is captured and that a pawn went all the way down to the queen’s end to produce a new bishop. That could happen, but you didn’t know it.

In an interview on Artificial General Intelligence (AGI), he compares human and machine intelligence:

Richard Feynman

First of all, do they think like human beings? I would say no, and I’ll explain in a minute why I say no. Second, for whether they be more intelligent than human beings to be a question, intelligence must first be defined. If you were to ask me are they better chess players than any human being? Possibly can be, yes: “I’ll get you some day.” They’re better chess players than most human beings right now!

By the late 1990s, computers had become stronger than grandmasters; Deep Blue defeated world champion Garry Kasparov in 1997. Modern engines such as Stockfish 15, which incorporate neural-network evaluation, are far stronger still. A turning point in our understanding of AI algorithms came with AlphaZero and chess.

AlphaZero, which applied the AlphaGo approach of coupling deep neural networks with Monte Carlo tree search to chess, set a new gold standard. It showed that a neural network can learn by playing against itself. The networks pattern match and interpolate both the policy and the value function, which implicitly performs “feature selection”. Whereas humans rely on hand-crafted heuristics such as control of the center, king safety, and piece development, AlphaZero “learns” such features from experience. With the goal of maximizing the probability of winning, it develops a preference for initiative, speed, and space over minor material such as pawns, thus reviving the old romantic school of chess play.

Feynman discusses how machines show intelligence:

Richard Feynman

With regard to the question of whether we can make it to think like [human beings], my opinion is based on the following idea: That we try to make these things work as efficiently as we can with the materials that we have. Materials are different than nerves, and so on. If we would like to make something that runs rapidly over the ground, then we could watch a cheetah running, and we could try to make a machine that runs like a cheetah. But, it’s easier to make a machine with wheels. With fast wheels or something that flies just above the ground in the air. When we make a bird, the airplanes don’t fly like a bird, they fly but they don’t fly like a bird, okay? They don’t flap their wings exactly, they have in front, another gadget that goes around, or the more modern airplane has a tube that you heat the air and squirt it out the back, a jet propulsion, a jet engine, has internal rotating fans and so on, and uses gasoline. It’s different, right?

So, there’s no question that the later machines are not going to think like people think, in that sense. With regard to intelligence, I think it’s exactly the same way, for example they’re not going to do arithmetic the same way as we do arithmetic, but they’ll do it better.

11.2 Correlations

Arguably, the simplest form of pattern matching is correlation. Correlation is a statistical measure that quantifies the strength of the relationship between two variables, that is, how closely the two variables move in relation to each other. It is widely used to identify patterns in data and is a fundamental concept across science, engineering, finance, and business.

Let’s consider the correlation between returns on Google stock and the S&P 500 stock index. The correlation coefficient measures the strength and direction of the linear relationship between two variables and lies between -1 and 1.

Example 11.2 (Google Stock Returns) Figure 11.1 shows the scatterplot of Google and S&P 500 daily returns.

goog = read.csv("../data/GOOG2019.csv") 
rgoog = goog$Adj.Close[2:251]/goog$Adj.Close[1:250] - 1 
sp = read.csv("../data/SP2019.csv");   rsp = sp$Adj.Close[2:251]/sp$Adj.Close[1:250] - 1 
plot(rgoog, rsp, col="lightblue", pch=21, bg="grey", xlab="GOOG return", ylab="SP500 return") 
Figure 11.1: Scatterplot of Google and S&P 500 daily returns

Let’s calculate the covariance and correlation between the daily returns of the Google stock and S&P 500.

var_goog = mean((rgoog - mean(rgoog))^2) 
var_sp = mean((rsp - mean(rsp))^2) 
cov = mean((rgoog - mean(rgoog))*(rsp - mean(rsp))); print(cov) 
## [1] 8e-05
cor = cov/(sqrt(var_goog)*sqrt(var_sp)); print(cor)
## [1] 0.67

The example demonstrates how correlation serves as a fundamental form of pattern recognition. The correlation coefficient quantifies the strength and direction of the linear relationship between the two return series, ranging from -1 to 1, and in finance this kind of analysis is used to understand market relationships and portfolio diversification. The code shows the underlying statistical mechanics: the correlation is the covariance divided by the product of the standard deviations.
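As a quick check, the same quantities can be obtained with base R’s built-in functions (assuming the rgoog and rsp vectors computed above are still in memory). Note that cov() and cor() use the \(n-1\) (sample) denominator, so the covariance differs slightly from the mean-based version above, while the correlation is unchanged.

cov(rgoog, rsp)   # sample covariance; differs from the mean-based value by a factor of n/(n-1)
cor(rgoog, rsp)   # matches the manual correlation of about 0.67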

11.3 Unsupervised Learning

Prediction and forecasting are the most frequent problems in data analysis, and the most common approach to solving them is pattern matching. The two concepts are closely related and often used interchangeably. In business and engineering the main motivation for prediction and forecasting is to make better decisions; in science, the main motivation is to test and validate theories.

Prediction and forecasting help to identify trends and patterns in historical data that would otherwise remain hidden. This allows analysts to make informed decisions about the future based on what they know about the past. By using prediction models, analysts can identify potential risks and opportunities that may lie ahead. This information can then be used to develop proactive strategies to mitigate risks and capitalize on opportunities.

In many business applications the concern is improving efficiency of a system. For example to improve logistic chains and to optimally allocate resources, we need to forecast demand and supply and to predict the future prices of the resources. By predicting future sales, businesses can better plan their marketing and sales efforts. This can lead to increased sales and profitability. Prediction and forecasting can be used to identify and mitigate potential risks, such as financial losses, supply chain disruptions, and operational failures.

Below we provide an example of an unsupervised learning approach to forecasting. Formally, unsupervised learning is a type of machine learning where the model is not provided with labeled data. The goal is to find patterns in the data without any prior knowledge of the structure of the data. In the example below we use the 2012 US presidential election data to forecast the probability of Obama winning the election.

Example 11.3 (Obama Elections) This example demonstrates a Bayesian approach to election forecasting using polling data from the 2012 US presidential election. The goal is to predict the probability of Barack Obama winning the election by combining polling data across different states.

The data used includes polling data from various pollsters across all 50 states plus DC. Each state has polling percentages for Republican (GOP) and Democratic (Dem) candidates along with their electoral vote counts. The data is aggregated by state, taking the most recent polls available.

The techniques applied involve Bayesian simulation using a Dirichlet distribution to model uncertainty in polling percentages. Monte Carlo simulation runs 10,000 simulations of the election to estimate win probabilities. The analysis is conducted state-by-state, calculating Obama’s probability of winning each individual state. Electoral college modeling combines state probabilities with electoral vote counts to determine the overall election outcome. The simulation runs the entire election multiple times to account for uncertainty and determines the likelihood of Obama reaching the required 270 electoral votes to win. This approach demonstrates how pattern matching through statistical modeling can be used for prediction, showing how polling data can be transformed into probabilistic forecasts of election outcomes.

We start by loading the data and aggregating it by state.

library(plyr)
# Source: "http://www.electoral-vote.com/evp2012/Pres/pres_polls.csv"
election.2012 = read.csv("../data/pres_polls.csv")
# Remove a pollster: elect2012 <- election.2012[!grepl('Rasmussen', election.2012$Pollster),]
elect2012 <- election.2012
# Aggregate the data
elect2012 <- ddply(elect2012, .(state), subset, Day == max(Day))
elect2012 <- ddply(elect2012, .(state), summarise, R.pct = mean(GOP), O.pct = mean(Dem), EV = mean(EV))
knitr::kable(elect2012[1:25,], caption = "Election 2012 Data",longtable=TRUE)
knitr::kable(elect2012[26:51,], caption = "Election 2012 Data",longtable=TRUE)
Election 2012 Data
state R.pct O.pct EV
Alabama 61 38 9
Alaska 55 42 3
Arizona 54 44 11
Arkansas 61 37 6
California 38 59 55
Colorado 47 51 9
Connecticut 40 58 7
D.C. 7 91 3
Delaware 40 59 3
Florida 49 50 29
Georgia 53 45 16
Hawaii 28 71 4
Idaho 65 33 4
Illinois 41 57 20
Indiana 54 44 11
Iowa 47 52 6
Kansas 60 38 6
Kentucky 61 38 8
Louisiana 58 41 8
Maine 41 56 4
Maryland 37 62 10
Massachusetts 38 61 11
Michigan 45 54 16
Minnesota 45 53 10
Mississippi 56 44 6
Election 2012 Data
state R.pct O.pct EV
26 Missouri 54 44 10
27 Montana 55 41 3
28 Nebraska 61 38 5
29 Nevada 46 52 6
30 New Hampshire 46 52 4
31 New Jersey 41 58 14
32 New Mexico 43 53 5
33 New York 36 63 29
34 North Carolina 51 48 15
35 North Dakota 59 39 3
36 Ohio 48 50 18
37 Oklahoma 67 33 7
38 Oregon 43 54 7
39 Pennsylvania 47 52 20
40 Rhode Island 36 63 4
41 South Carolina 55 44 9
42 South Dakota 58 40 3
43 Tennessee 60 39 11
44 Texas 57 41 38
45 Utah 73 25 6
46 Vermont 31 67 3
47 Virginia 48 51 13
48 Washington 42 56 12
49 West Virginia 62 36 5
50 Wisconsin 46 53 10
51 Wyoming 69 28 3

We then run the simulation and plot probabilities by state.

library(MCMCpack)
prob.Obama <- function(mydata) {
    p <- rdirichlet(1000, 500 * c(mydata$R.pct, mydata$O.pct, 100 - mydata$R.pct - 
        mydata$O.pct)/100 + 1)
    mean(p[, 2] > p[, 1])
}
win.probs <- ddply(elect2012, .(state), prob.Obama)
win.probs$Romney <- 1 - win.probs$V1
names(win.probs)[2] <- "Obama"
win.probs$EV <- elect2012$EV
win.probs <- win.probs[order(win.probs$EV), ]
rownames(win.probs) <- win.probs$state

We then plot the probabilities of Obama winning by state.

library(usmap)
plot_usmap(data = win.probs, values = "Obama") + 
  scale_fill_continuous(low = "red", high = "blue", name = "Obama Win Probability", label = scales::comma) + theme(legend.position = "right")

Probabilities of Obama winning by state

We use those probabilities to simulate the probability of Obama winning the election. First, we calculate the probability of Obama receiving 270 electoral votes or more.

sim.election <- function(win.probs) {
    winner <- rbinom(51, 1, win.probs$Obama)
    sum(win.probs$EV * winner)
}

sim.EV <- replicate(10000, sim.election(win.probs))
oprob <- sum(sim.EV >= 270)/length(sim.EV)
oprob
## [1] 0.97
library(lattice)
# Lattice Graph
densityplot(sim.EV, plot.points = "rug", xlab = "Electoral Votes for Obama", 
    panel = function(x, ...) {
        panel.densityplot(x, ...)
        panel.abline(v = 270)
        panel.text(x = 285, y = 0.01, "270 EV to Win")
        panel.abline(v = 332)
        panel.text(x = 347, y = 0.01, "Actual Obama")
}, main = "Electoral College Results Probability")

We can repeat the analysis for the 2008 United States Presidential Election between Barack Obama and John McCain, using the results of recent state polls.

# Source: LearnBayes library
election.2008 = read.csv("../data/election2008.csv")
attach(election.2008)   # columns: M.pct (McCain %), O.pct (Obama %), EV (electoral votes)

##  Dirichlet simulation of Obama's win probability in state j

prob.Obama = function(j)
{
    p = rdirichlet(5000, 500 * c(M.pct[j], O.pct[j], 100 - M.pct[j] - O.pct[j])/100 + 1)
    mean(p[, 2] > p[, 1])
}

## Compute Obama's win probability for all 51 states

Obama.win.probs = sapply(1:51, prob.Obama)

## Simulate one election: draw state winners and total Obama's electoral votes

sim.election = function()
{
    winner = rbinom(51, 1, Obama.win.probs)
    sum(EV * winner)
}

sim.EV = replicate(1000, sim.election())

## Histogram of the simulated electoral vote totals (counts, so the label at y = 30 is visible)
hist(sim.EV, min(sim.EV):max(sim.EV), col = "blue")
abline(v = 365, lwd = 3)   # Obama received 365 electoral votes
text(375, 30, "Actual \n Obama \n total")

The analysis of the 2008 U.S. Presidential Election data reveals several key insights about the predictive power of state-level polling and the uncertainty inherent in electoral forecasting. The actual result of 365 electoral votes falls within the simulated range, demonstrating the model’s validity. The 270-vote threshold needed to win the presidency is clearly marked and serves as a critical reference point.

We used a relatively simple model to simulate the election outcome. The model uses Dirichlet distributions to capture uncertainty in state-level polling percentages. Obama’s win probabilities vary significantly across states, reflecting the competitive nature of the election. The simulation approach accounts for both sampling uncertainty and the discrete nature of electoral vote allocation. The histogram of simulated results shows the distribution of possible outcomes. The actual Obama total of 365 electoral votes is marked and falls within the reasonable range of simulated outcomes. This validates the probabilistic approach to election forecasting.

This analysis demonstrates how Bayesian methods can be effectively applied to complex prediction problems with multiple sources of uncertainty, providing both point estimates and measures of uncertainty around those estimates.

11.4 Supervised Learning

The problem of supervised learning is to learn patterns from observed data to make predictions on new, unseen data. The key idea is that we have input-output pairs \((x_i, y_i)\) where we know the correct output \(y_i\) for each input \(x_i\), and we use these examples to learn a function that maps inputs to outputs.

Supervised learning has become ubiquitous across modern engineering, business, and technology applications. In manufacturing, predictive maintenance systems use sensor data from industrial equipment to forecast potential failures, enabling proactive maintenance that reduces downtime and costs. Autonomous vehicles rely heavily on supervised learning for object detection, lane recognition, and decision-making systems that process real-time sensor data from cameras, LiDAR, and radar. In healthcare, supervised learning powers diagnostic imaging systems that can detect diseases from X-rays, MRIs, and CT scans with accuracy rivaling human radiologists.

Financial institutions employ supervised learning for fraud detection, credit scoring, and algorithmic trading systems that analyze vast amounts of transaction data. Smart cities utilize supervised learning for traffic flow optimization, energy consumption forecasting, and air quality monitoring. Many companies use prediction models for customer churn, helping identify early warning signs of dissatisfaction. Marketing teams leverage supervised learning for customer segmentation, campaign optimization, and lead scoring to improve conversion rates. Supply chain optimization uses supervised learning to forecast demand, optimize inventory levels, and predict delivery times. These applications demonstrate how supervised learning has evolved from simple prediction tasks to complex, real-time decision-making systems that operate across diverse domains.

A typical prediction problem involves building a rule that maps observed inputs \(x\) into the output \(y\). The inputs \(x\) are often called predictors, features, or independent variables, while the output \(y\) is often called the response or dependent variable. The goal is to find a predictive rule \[ y = f(x). \]

The map \(f\) can be viewed as a black box which describes how to find the output \(y\) from the input \(x\). One of the key requirements of \(f\) is that we should be able to efficiently find this function using an algorithm. In the simple case \(y\) and \(x\) are both univariate (scalars) and we can view the map as

Figure 11.2: The predictive rule as a black box: the input \(x\) is mapped through \(f\) to the output \(y\).

The goal of machine learning is to reconstruct this map from observed data. In a multivariate setting, \(x = (x_1,\ldots,x_p)\) is a list of \(p\) variables, which leads to a model of the form \(y = f(x_1,\ldots,x_p)\). There are several possible goals of analysis, such as estimation, inference, or prediction, the main one being prediction.

The prediction task is to calculate the response that corresponds to a new input feature vector. An example of inference is the task of establishing causation, where the goal is to extract information about the nature of the black-box association between the response variable and the input variables.

In either case, the goal is to use data to find a pattern that we can exploit. The pattern will be “statistical” in its nature. To uncover the pattern we use a training dataset, denoted by \[ D = (y_i,x_i)_{i=1}^n \]

where \(x_i\) is a set of \(p\) predictors and \(y_i\) is the response variable. The prediction problem is to use the training dataset \(D\) to design a rule that can be used to predict output values \(y\) for new observations \(x\).

Let \(f(x)\) be a predictor of \(y\); we will use the notation \[ \hat{y} = f(x). \]

To summarize, we will use the following notation.

\(y\) output variable (response/outcome)
\(x\) input variable (predictor/covariate/feature)
\(f(x)\) predictive rule
\(\hat y\) predicted output value

We distinguish several types of input or output variables. First, binary variables can take only two possible values, e.g., yes/no, left/right, 0/1, up/down. A generalization of a binary variable is a categorical variable, which can take a fixed number of possible values, for example marital status. Some categorical variables have a natural order, for example education level or salary range; these are called ordinal. Lastly, the most common type of variable is quantitative, which is described by a real number.

Depending on the type of the output variable, there are three types of prediction problems.

Types of output variables and corresponding prediction problems.
Output Variable Type Description Prediction Problem
Binary \(y\in \{0,1\}\) Classification
Categorical \(y\in \{0,\ldots,K\}\) for \(K\) possible categories Classification
Quantitative \(y \in \mathbb{R}\) (any real number) Regression
Ordinal \(y\) has natural ordering Ranking

Here are some examples of prediction problems:

Binary Classification: Predicting whether an email is spam or not spam involves input variables such as email content, sender information, presence of certain keywords, and email length. The output variable is \(y \in \{0,1\}\) where 0 = not spam, 1 = spam. The goal is to classify new emails as spam or legitimate.

Categorical Classification: Predicting the type of social media content based on text and image features uses input variables including text content, image features, user engagement metrics, posting time, and hashtags. The output variable is \(y \in \{0,1,2,3,4\}\) where 0 = news, 1 = entertainment, 2 = educational, 3 = promotional, 4 = personal. The goal is to automatically categorize social media posts for content moderation and recommendation systems.

Regression (Quantitative): Predicting house prices based on features uses input variables such as square footage, number of bedrooms, location, age of house, and lot size. The output variable is \(y \in \mathbb{R}\) (house price in dollars). The goal is to predict the selling price of a new house.

Ranking (Ordinal): Predicting customer satisfaction ratings involves input variables including product quality, customer service experience, delivery time, and price. The output variable is \(y \in \{1,2,3,4,5\}\) where 1 = very dissatisfied, 5 = very satisfied. The goal is to predict customer satisfaction level for new customers.

There are several simple predictive rules we can use to predict the output variable \(y\). For example, in the case of a regression problem, the simplest rule is to predict the average value of the output variable. This rule is called the mean rule and is defined as \[ f(x) = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i. \]

Notice that this model does not depend on the input variable \(x\) and will predict the same value for all observations. This rule is simple and easy to implement, but it is not very accurate. In the case of binary \(y\), we can apply thresholding to the mean rule to obtain a binary classifier. \[ f(x) = \begin{cases} 1 & \text{if } \bar{y} > 0.5, \\ 0 & \text{if } \bar{y} \leq 0.5. \end{cases} \]

A more sophisticated rule is the nearest neighbor rule. This rule predicts the output value \(y\) for a new observation \(x\) by finding the closest observation in the training dataset and using its output value. The nearest neighbor rule is defined as \[ f(x) = y_{i^*}, \] where \[i^* = \arg\min_{i=1,\ldots,n} \|x_i - x\|\] is the index of the closest observation in the training dataset. These two models represent two extreme cases of predictive rules: the mean rule is “stubborn” (it always predicts the same value) and the nearest neighbor rule is “flexible” (it can be very sensitive to small changes in the inputs). In the language of statistics, the mean rule has high bias and low variance, while the nearest neighbor rule has low bias and high variance. Although these two rules are simple, they sometimes lead to useful models in practice. Further, they represent a trade-off between accuracy and complexity (the bias-variance trade-off), which we discuss in more detail in a later section.
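To make the two rules concrete, here is a minimal sketch in R on simulated data; the dataset, variable names, and helper functions are illustrative and not part of the text.

# The two extreme rules on simulated data (illustrative only)
set.seed(1)
n = 100
x = runif(n)
y = sin(2 * pi * x) + rnorm(n, sd = 0.2)

# Mean rule: ignores x and always predicts the sample mean of y
predict_mean = function(xnew) rep(mean(y), length(xnew))

# Nearest neighbor rule: returns the y of the closest training point
predict_1nn = function(xnew) sapply(xnew, function(x0) y[which.min(abs(x - x0))])

xnew = c(0.1, 0.5, 0.9)
predict_mean(xnew)   # same value for every input: high bias, low variance
predict_1nn(xnew)    # follows the training data closely: low bias, high variance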

The mean model and the nearest neighbor model belong to the class of so-called non-parametric models. Non-parametric models do not make explicit assumptions about the form of the function \(f(x)\). In contrast, parametric models assume that the predictive rule \(f(x)\) is a specific function defined by a vector of parameters, which we denote by \(\theta\). A typical notation is then \[f(x) = f_{\theta}(x).\]

Traditional modeling culture employs statistical models characterized by single-layer transformations, where the relationship between input variables and output is modeled through direct, interpretable mathematical formulations. These approaches typically involve linear combinations, additive structures, or simple nonlinear transformations that maintain analytical tractability and statistical interpretability. The list of widely used models includes:

Model Formula Parameters
Linear Regression \(y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p\) \(\theta = (\beta_0, \beta_1, \ldots, \beta_p)\)
Generalized Linear Models (GLM) \(y = f^{-1}(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p)\) \(\theta = (\beta_0, \beta_1, \ldots, \beta_p)\)
Generalized Additive Models (GAM) \(y = \beta_0 + f_1(x_1) + \dots + f_k(x_k)\) \(\theta = (\beta_0, f_1, \ldots, f_k)\)
Principal Component Regression (PCR) \(y = \beta^T (W x),\quad W \in \mathbb{R}^{k \times p},\quad k < p\) \(\theta = (\beta, W)\)
Sliced Inverse Regression (SIR) \(y = f(\beta_1^T x,\, \beta_2^T x,\, \ldots,\, \beta_k^T x,\, \epsilon)\) \(\theta = (\beta_1, \beta_2, \ldots, \beta_k)\)

In contrast to these traditional single-layer approaches, Deep Learning employs sophisticated high-dimensional multi-layer neural network architectures that can capture complex, non-linear relationships in data through hierarchical feature learning. Each layer transforms the input data by applying an affine transformation followed by a non-linear activation function. The depth and complexity of these architectures allow deep learning models to automatically discover intricate patterns and representations from raw input data, making them particularly effective for tasks involving high-dimensional inputs such as image recognition, natural language processing, and complex time series analysis. Unlike traditional statistical models that rely on hand-crafted features, deep learning models learn hierarchical representations directly from the data, with early layers capturing simple features (like edges in images) and deeper layers combining these into more complex, abstract representations.

We wish to find a map \(f\) such that \[\begin{align*} y &= f ( x ) \\ y &= f ( x_1 , \ldots , x _p ) \end{align*}\]

Essentially, the goal is to perform pattern matching, also known as nonparametric regression: finding complex relationships in data without assuming a specific functional form. In deep learning, we use composite rather than additive functions. We write the superposition of functions as \[ f = f_1 \circ \ldots \circ f_L \; \; \text{versus} \; \; f_1 + \ldots + f_L, \] where the composition creates a hierarchical structure. Each function \(f_i\) combines a linear transformation with a non-linear activation function, \[ f_i(x) = \sigma(W_i x + b_i). \] The set of parameters that we need to find is then \(\theta = (W_1, b_1, \ldots, W_L, b_L)\).
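A minimal sketch of a two-layer composition in R, with random placeholder weights and a ReLU activation; all names and dimensions here are illustrative, not from the text.

# Two-layer composite rule f = f2(f1(x)) with a ReLU activation (illustrative sketch)
set.seed(1)
sigma = function(z) pmax(z, 0)                   # non-linear activation (ReLU)
W1 = matrix(rnorm(6), 3, 2); b1 = rnorm(3)       # layer 1 maps R^2 to R^3
W2 = matrix(rnorm(3), 1, 3); b2 = rnorm(1)       # layer 2 maps R^3 to R^1
f1 = function(x) sigma(W1 %*% x + b1)
f2 = function(h) W2 %*% h + b2
f  = function(x) f2(f1(x))                       # theta = (W1, b1, W2, b2)
f(c(0.5, -1.2))                                  # prediction for one input vector x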

11.5 Model Estimation

There are two main approaches to finding the set of parameters \(\theta\). The first is an optimization approach that minimizes a loss function, which measures how well the predictive rule \(f\) captures the relationship between the input and output variables; the most common loss function is the mean squared error (MSE). The second approach is full Bayesian inference, which calculates the distribution over the parameter \(\theta\) given the observed data.

Both approaches start by formulating the likelihood function. The likelihood tells us how probable the observed data is, given a particular value of the parameter in a statistical model. It is not the same as probability; instead, it is a function of the parameter, with the data fixed. Suppose you flip a biased coin 10 times and get 7 heads, and you want to estimate the probability of getting heads on a single toss. You try different values of \(\theta\) and ask: “How likely is it to get exactly 7 heads out of 10 flips if the true probability is \(\theta\)?” This leads to the likelihood function. Formally, given iid samples \(y_i \sim p(y\mid x_i, \theta)\) from a distribution with parameter \(\theta\), the likelihood function is defined as \[ L(\theta) = \prod_{i=1}^n p(y_i\mid x_i, \theta). \] It treats the data \(D = (y_i,x_i)_{i=1}^n\) as fixed and varies \(\theta\).
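For the coin example, the likelihood can be evaluated on a grid of candidate values of \(\theta\) with a few lines of base R (a small illustrative sketch):

# Likelihood of 7 heads in 10 flips as a function of theta
theta = seq(0.01, 0.99, by = 0.01)
lik = dbinom(7, size = 10, prob = theta)
theta[which.max(lik)]   # the likelihood is maximized at theta = 0.7, the MLE
plot(theta, lik, type = "l", xlab = expression(theta), ylab = "Likelihood")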

Likelihood connects our model to the data generating process by quantifying how likely it is to observe the actual data we have under different parameter values. For example, if we assume our data follows a normal distribution \(y \sim N(f_\theta(x), \sigma^2)\) with mean \(f_\theta(x)\) and variance \(\sigma^2\), the likelihood function would be:

\[ L(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right). \tag{11.1}\]

For a classification problem, we assume that \(y_i\) follows a Bernoulli distribution, \(y_i \sim \text{Bernoulli}(p_i)\). The likelihood function is defined as \[ L(\theta) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}. \] Here \(p_i\) is the probability of the response variable taking the value 1, given the input variables. A typical approach to calculating \(p_i\) is to use the logistic function \(\sigma(\cdot)\), \[\begin{align*} f_{\beta}(x_i) &= \beta^Tx_i\\ p_i &= \sigma(f_{\beta}(x_i)) = \frac{e^{f_{\beta}(x_i)}}{1+e^{f_{\beta}(x_i)}}. \end{align*}\] Notice that the logistic function \(\sigma(\cdot)\) outputs values in \((0,1)\).
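As an illustration, here is a minimal sketch of the Bernoulli negative log-likelihood for a logistic model, fitted by direct optimization on simulated data; the data and variable names are ours, and the built-in glm() fit is shown only as a cross-check.

# Bernoulli negative log-likelihood for a logistic model (illustrative sketch)
set.seed(1)
x = cbind(1, rnorm(100))                 # design matrix with an intercept column
beta_true = c(-0.5, 2)
p = 1 / (1 + exp(-x %*% beta_true))
y = rbinom(100, 1, p)                    # simulated binary responses

negloglik = function(beta) {
  eta = x %*% beta
  p = 1 / (1 + exp(-eta))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
optim(c(0, 0), negloglik)$par            # close to beta_true
# Cross-check: glm(y ~ x[, 2], family = binomial)$coef gives the same estimate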

The optimization-based approach is to find the set of parameters \(\theta\) that maximizes the likelihood function. \[ \theta^* = \arg\max_{\theta} L(\theta). \]

Most often, however, it is easier to optimize the log-likelihood function. We simply take the logarithm of the likelihood function \[ \log L(\theta) = l(\theta) = \sum_{i=1}^n \log p(y_i\mid x_i, \theta). \] Notice that the log-likelihood is a sum of log-likelihood contributions from each data point, which is a more convenient form for optimization algorithms.

Why does the solution not change? Since the logarithm is a monotonically increasing function, if \(L(\theta_1) > L(\theta_2)\), then \(\log L(\theta_1) > \log L(\theta_2)\). This means that the parameter value that maximizes the likelihood function will also maximize the log-likelihood function. The maximum point stays the same, just the function values are transformed.

The value of parameters \(\theta\) that maximizes the log-likelihood is called the maximum likelihood estimate (MLE).

Finally, in machine learning literature, it is more common to minimize the negative log-likelihood function (same as maximizing the log-likelihood function). \[ \theta^* = \arg\min_{\theta} -l(\theta). \] Then the negative log-likelihood function is called the loss function. Thus the problem of finding maximum likelihood estimate is equivalent to minimizing the loss function.

Let’s calculate the loss function that corresponds to the normal likelihood function given by Equation 11.1. Using the fact that the logarithm of a product is a sum of logarithms, we can write the log-likelihood as \[ l(\theta) = \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right). \] Inside the sum we have the logarithm of a product of two terms: the first term is a constant with respect to \(\theta\) and the second is a function of \(\theta\). We can therefore rewrite the log-likelihood as \[ l(\theta) = \sum_{i=1}^n \left[\log \frac{1}{\sqrt{2\pi\sigma^2}} + \log \exp\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right)\right]. \]

The first term \(\log \frac{1}{\sqrt{2\pi\sigma^2}}\) is a constant with respect to \(\theta\), so we can drop it from the optimization problem. The second term can be simplified using the fact that \(\log \exp(x) = x\):

\[ l(\theta) = \sum_{i=1}^n \left[-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right] + C, \]

where \(C\) is a constant that does not depend on \(\theta\). Since we are maximizing the log-likelihood, we can drop the constant term:

\[ l(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - f_\theta(x_i))^2. \]

When we minimize the negative log-likelihood (which is equivalent to maximizing the log-likelihood), we get:

\[ -\ell(\theta) = \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - f_\theta(x_i))^2. \]

This is the mean squared error (MSE) loss function, which is the most commonly used loss function for regression problems. The factor \(\frac{1}{2\sigma^2}\) is often absorbed into the learning rate or regularization parameter in optimization algorithms. Thus, another name for the estimator is the least squares estimator: it is the same as the maximum likelihood estimator, assuming that \(y_i\) is normally distributed with mean \(f_\theta(x_i)\).
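The equivalence is easy to verify numerically: minimizing the Gaussian negative log-likelihood (with the constants dropped) reproduces the ordinary least squares fit. The sketch below uses simulated data and a linear \(f_\theta\); all names are illustrative.

# Numerical check: minimizing the Gaussian negative log-likelihood
# (constants dropped) reproduces the least squares estimate
set.seed(1)
x = rnorm(100); y = 1 + 2 * x + rnorm(100)

negloglik = function(theta) sum((y - theta[1] - theta[2] * x)^2)
optim(c(0, 0), negloglik)$par   # approximately (1, 2)
coef(lm(y ~ x))                 # least squares gives the same answer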

Penalized Likelihood

While maximum likelihood estimation provides a principled approach to parameter estimation, we can often find better estimators using what is called a penalized likelihood. In fact, in certain cases a penalized estimator is universally better; in statistics, we say that the MLE is inadmissible, meaning there exists another estimator (James-Stein) that is strictly better in terms of risk. Later, in Chapter 18, we will discuss the theory of penalized estimators in more detail.

Penalized likelihood addresses overfitting by adding a regularization term to the likelihood function. Instead of maximizing just the likelihood, we maximize:

\[ L_{\text{penalized}}(\theta) = L(\theta) \cdot \exp(-\lambda \phi(\theta)) \]

Or equivalently, we minimize the negative log-likelihood plus a penalty: \[ l_{\text{penalized}}(\theta) = -\sum_{i=1}^n \log p(y_i\mid x_i, \theta) + \lambda \sum_{j=1}^p \phi(\theta_j), \] where \(\lambda > 0\) is the regularization parameter that controls the strength of regularization, and \(\phi(\theta)\) is the penalty function that measures model complexity. In machine learning, the technique of adding a penalty term to the loss function is called regularization.

Regularization can be viewed as a constraint on the model space. The techniques were originally applied to solve ill-posed problems, where a slight change in the initial data could significantly alter the solution. Regularization techniques were then proposed for parameter reconstruction in a physical system modeled by a linear operator implied by a set of observations. It had long been believed that ill-conditioned problems offered little practical value, until Tikhonov published his seminal paper on regularization (Tikhonov et al. 1943). Tikhonov (1963) proposed methods for solving regularized problems of the form \[ \min_\beta ||y- X\beta||^p_p + \lambda||\beta - \beta^{(0)}||^q_q. \] Here \(\lambda\) is the weight on the regularization penalty and the \(\ell_q\)-norm is defined by \(||\beta||_q^q = \sum_i |\beta_i|^q\). This optimization problem is the Lagrangian form of the constrained problem \[ \mbox{minimize}_{\beta}\quad||y- X\beta||^2_2\qquad\mbox{subject to }\sum_{i=1}^{p}\phi(\beta_i) \le s, \] with \(\phi(\beta_i) = |\beta_i - \beta_i^{(0)}|^q\).
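For the quadratic penalty (\(q = 2\), \(\beta^{(0)} = 0\)), known as ridge regression, the penalized least squares problem has a closed-form solution. Below is a small sketch on simulated data; the design matrix, true coefficients, and value of \(\lambda\) are illustrative.

# Ridge regression: closed-form solution of the l2-penalized least squares problem
set.seed(1)
n = 100; p = 5
X = matrix(rnorm(n * p), n, p)
beta_true = c(2, -1, 0, 0, 0.5)
y = X %*% beta_true + rnorm(n)

lambda = 1
beta_ridge = solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
beta_ols   = solve(t(X) %*% X, t(X) %*% y)
cbind(beta_ols, beta_ridge)   # the ridge penalty shrinks the coefficients toward zero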

Later, sparsity became a primary driving force behind new regularization methods (Candes and Wakin 2008). The idea is that the vector of parameters \(\beta\) is sparse, meaning that most of its elements are zero. This is a natural assumption for many models, such as the linear regression model. We will discuss sparsity in more detail later in the book.

There are multiple optimization algorithms that can be used to find the solution to the penalized likelihood problem. Later in the book we will discuss the Stochastic Gradient Descent (SGD) algorithm, which is a popular tool for training deep learning models.

Bayesian Approach

Similar to the likelihood maximization approach, the Bayesian approach to model estimation starts with the likelihood function. The difference is that we assume that the parameters \(\theta\) are random variables that follow some prior distribution. We then use Bayes’ rule to find the posterior distribution of the parameters \[ p(\theta \mid D) \propto L(\theta) \cdot p(\theta), \] where \(p(\theta)\) is the prior distribution and \(L(\theta)\) is the likelihood function. The posterior distribution is the distribution of the parameters given the data \(D = (y_i,x_i)_{i=1}^n\); it is a distribution over the parameters, not a single value.

Penalized likelihood has a natural Bayesian interpretation. The penalty term corresponds to a prior distribution on the parameters: \[ p(\theta) \propto \exp(-\lambda \phi(\theta)). \] Then the penalized likelihood is proportional to the posterior distribution: \[ p(\theta \mid D) \propto L(\theta) \cdot p(\theta) \propto L(\theta) \cdot \exp(-\lambda \phi(\theta)). \]

This means maximizing the penalized likelihood is equivalent to finding the maximum a posteriori (MAP) estimate, which is the mode of the posterior distribution.
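Returning to the coin-flip example, a Beta prior on \(\theta\) plays the role of the penalty, and the MAP estimate is the mode of the resulting posterior. A small sketch with an illustrative Beta(2, 2) prior:

# MAP estimate for the coin-flip example with a Beta prior (illustrative prior values)
heads = 7; flips = 10
a = 2; b = 2                                        # Beta(2, 2) prior, mildly favors theta = 0.5
theta_mle = heads / flips                           # 0.7
theta_map = (heads + a - 1) / (flips + a + b - 2)   # posterior mode: 8/12 = 0.667
c(MLE = theta_mle, MAP = theta_map)                 # the prior shrinks the estimate toward 0.5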

11.6 Prediction Accuracy

After we fit our model and find the optimal value of the parameter \(\theta\), denoted by \(\hat \theta\), we need to evaluate the accuracy of the predictive model, which involves comparing the model’s predictions to actual outcomes. We can simply use the value of the loss function from the training step to evaluate the model’s predictive power. However, this only tells us how well the model fits the training data; it does not tell us how well the model will perform on unseen data. To evaluate the model’s performance on unseen data, we need a different approach.

The most common approach is to split the data into training and test sets. The training set is used to train the model, while the test set is used to evaluate its performance. This approach is known as the train-test split. It is a simple and effective way to evaluate how well the model predicts unseen inputs.

Another approach is to use cross-validation, which involves splitting the data into smaller subsets and using them to train and test the model multiple times. When the sample size is small, this allows for a more robust estimate of the model’s performance than a single train-test split: with small datasets, the estimated predictive performance is sensitive to the choice of test samples and therefore unstable (high variance). Cross-validation reduces this variance by averaging performance across multiple folds.

Cross-validation involves several steps. The data is randomly divided into \(k\) equal-sized chunks (folds). For each fold, the model is trained on the remaining \(k-1\) folds and tested on that fold. This process is repeated \(k\) times, ensuring each fold is used for testing once. The performance of the model is evaluated on each fold using a chosen metric, such as accuracy, precision, recall, or F1 score. The average of the performance metrics across all \(k\) folds is reported as the final estimate of the model’s performance.

A common choice for \(k\) is 5 or 10. When \(k=n\), this is known as leave-one-out cross-validation. This method can be computationally expensive but is less likely to overfit the data. Stratified cross-validation ensures that each fold contains approximately the same proportion of each class as in the entire dataset. This is important for imbalanced datasets where one class is significantly larger than the others.

Notice that cross-validation requires re-training the model multiple times, which can be computationally expensive. Thus, for large datasets, we typically prefer a simple train-test split; for small datasets, cross-validation provides a more robust estimate of the model’s performance.
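A minimal sketch of both procedures on simulated data, using a linear model and MSE as the metric; the data, the 80/20 split, and the fold assignment are illustrative choices.

# Train-test split and k-fold cross-validation for a linear model (sketch)
set.seed(1)
n = 200
x = rnorm(n); y = 1 + 2 * x + rnorm(n)
d = data.frame(x = x, y = y)

# Simple train-test split (80% train, 20% test)
idx = sample(n, 0.8 * n)
fit = lm(y ~ x, data = d[idx, ])
pred = predict(fit, newdata = d[-idx, ])
mean((d$y[-idx] - pred)^2)            # test MSE

# k-fold cross-validation (k = 5)
k = 5
fold = sample(rep(1:k, length.out = n))
cv_mse = sapply(1:k, function(j) {
  fit = lm(y ~ x, data = d[fold != j, ])
  pred = predict(fit, newdata = d[fold == j, ])
  mean((d$y[fold == j] - pred)^2)
})
mean(cv_mse)                          # performance averaged across folds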

Either method is limited to evaluating the model’s performance on data that is available to the modeler. What if we start using our model on data that is different from the training and test sets? Unlike in physics, where a model represents a universal law, in data science we are dealing with data generated by a process that is not necessarily universal. For example, if we are building a model to predict the price of a house, we can train and test the model on data from a specific city. However, if we start using the model to predict the price of a house in a different city, the model might not perform as well, because the data from the new city might differ from the data used to train and test the model. This is known as the problem of generalization: the ability of a model to perform well on data that is different from the training and test sets.

Evaluation Metrics for Regression

There are several metrics that can be used to evaluate the performance of regression models. We can simply use the same function as we use for fitting the model, e.g. least squares \[ \text{MSE} = \dfrac{1}{m}\sum_{i=1}^m (y_i -\hat y_i)^2, \] where \(\hat y_i = f(x_i,\hat\theta)\) is the value predicted by the model for the \(i\)-th data point and \(m\) is the total number of data points used for the evaluation. This metric is called the Mean Squared Error (MSE): the average squared difference between the actual and predicted values. Lower MSE indicates better model performance, as it means the model’s predictions are closer to the actual values.

A slight variation of this metric is Root Mean Squared Error (RMSE). This is the square root of MSE and is also commonly used due to its units being the same as the target variable. \[ \text{RMSE} = \sqrt{\text{MSE}}. \] However, MSE is sensitive to outliers, as it squares the errors, giving more weight to large errors. This can lead to misleading results when the data contains outliers.

Mean Absolute Error (MAE) mitigates the sensitivity to outliers. It is the mean of the absolute errors, providing a more robust measure than MSE for skewed error distributions \[ \text{MAE} = \dfrac{1}{m}\sum_{i=1}^m |y_i -\hat y_i|. \] A variation of it is the Mean Absolute Percentage Error (MAPE), which is the mean of the absolute percentage errors \[ \text{MAPE} = \dfrac{1}{m}\sum_{i=1}^m \left | \dfrac{y_i -\hat y_i}{y_i} \right |. \]

An alternative way to measure predictive quality is the coefficient of determination, also known as the R-squared value, which measures the proportion of variance in the target variable that is explained by the model; a higher R-squared indicates a better fit. R-squared is defined as \[ R^2 = 1 - \dfrac{\sum_{i=1}^m (y_i -\hat y_i)^2}{\sum_{i=1}^m (y_i -\bar y)^2}, \] where \(\bar y\) is the mean of the target variable. R-squared is a relative measure of fit, so it can be used to compare different models; it is not an absolute measure of fit, so it cannot by itself determine whether a model is good or bad. It is also sensitive to the number of features in the model, so it can be misleading when comparing models with different numbers of features.
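All of these metrics are easy to compute directly from the actual and predicted values. The helper function below is an illustrative sketch (not from the text), applied here to a simulated linear fit.

# Regression evaluation metrics from actual (y) and predicted (y_hat) values
regression_metrics = function(y, y_hat) {
  mse = mean((y - y_hat)^2)
  c(MSE  = mse,
    RMSE = sqrt(mse),
    MAE  = mean(abs(y - y_hat)),
    MAPE = mean(abs((y - y_hat) / y)),
    R2   = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2))
}

# Example with simulated data and a simple linear fit
set.seed(1)
x = rnorm(100); y = 10 + 2 * x + rnorm(100)
y_hat = fitted(lm(y ~ x))
round(regression_metrics(y, y_hat), 3)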

Finally, we can use graphics to evaluate the model’s performance. For example, we can create a scatterplot of the actual and predicted values of the target variable to visually compare them. We can also plot the histogram or a boxplot of the residuals (errors) to see if they are normally distributed.

Evaluation Metrics for Classification

Accuracy is the most fundamental metric used to evaluate models. It is defined as the ratio of the number of correct predictions to the total number of predictions. The formula is given by \[\text{Accuracy} = \frac{\text{TP+TN}}{\text{TP+TN+FP+FN}},\] where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. However, it can be misleading for imbalanced datasets where one class is significantly larger than others. For example, if 95% of the data belongs to one class, a model that always predicts this class will be 95% accurate, even though it’s not very useful.

A more comprehensive understanding of model performance can be achieved by calculating precision and recall (sensitivity), as well as the confusion matrix discussed in Section 2.4. The confusion matrix is

Actual/Predicted Positive Negative
Positive TP FN
Negative FP TN

Precision measures the proportion of positive predictions that are actually positive. It is useful for evaluating how good the model is at identifying true positives. Recall measures the proportion of actual positives that are correctly identified by the model. It is useful for evaluating how good the model is at not missing true positives.

Then we can use these to calculate the F1 score, which is the harmonic mean of precision and recall, providing a balanced view of both metrics; a higher F1 score indicates better overall performance. If misclassifying certain instances is more costly than others, weighted metrics account for these different costs. For imbalanced datasets, metrics like the F1 score or balanced accuracy are important to avoid misleading interpretations.
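These quantities follow directly from the entries of the confusion matrix. The sketch below builds them from simulated labels; the simulated classifier and its 80% agreement rate are purely illustrative.

# Classification metrics from a confusion matrix (illustrative example)
set.seed(1)
actual    = rbinom(200, 1, 0.3)
predicted = ifelse(runif(200) < 0.8, actual, 1 - actual)  # a noisy, hypothetical classifier

TP = sum(predicted == 1 & actual == 1)
TN = sum(predicted == 0 & actual == 0)
FP = sum(predicted == 1 & actual == 0)
FN = sum(predicted == 0 & actual == 1)

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 3)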

Sometimes, we use multiple metrics to get a comprehensive assessment of the model’s performance. Additionally, consider comparing the model’s performance to a baseline model or other existing models for the same task. Sometimes it is hard to beat a “coin flip” classification model, where the class is predicted randomly with equal probability. In regression, a simple baseline model is \(f(x_i) = \bar y\), the mean of the target variable.