12  Pattern Matching

The history of data analysis is closely intertwined with the development of pattern matching techniques. The ability to identify and understand patterns in data has been crucial for scientific discoveries, technological advancements, and decision-making. From the early days of astronomy to modern machine learning, pattern matching has played a pivotal role in advancing our understanding of the world around us. This chapter explores the key concepts of pattern matching, its historical development, and its impact on data analysis.

Data science involves two major steps: the collection and cleaning of data, and the building of a model or application of an algorithm. In this chapter we present the process of building predictive models. To illustrate the process, think of your data as being generated by a black box: a set of input variables \(x\) goes into the box and generates an output variable \(y\).

12.1 Why Pattern Matching?

For Gauss, Laplace, and many other scientists, the main problem was estimating parameters, while the relationship between the variables was known and was usually linear, as in the shape-of-the-earth example, or multiplicative, e.g. Newton’s second law \(F = ma\). However, in many cases the relationship between the variables is unknown and cannot be described by a simple mathematical model. Halevy, Norvig, and Pereira (2009) discuss the problems of human behavior and natural languages; neither can be described by a simple mathematical model.

In this case, the pattern matching approach is a way to use data to find those relations. In data analysis, pattern matching is the process of identifying recurring sequences, relationships, or structures within a dataset. It is like looking for a specific puzzle piece within a larger picture. By recognizing these patterns, analysts can gain valuable insights into the data, uncover trends, make predictions, and ultimately improve decision-making. Sometimes an initial pattern matching analysis leads to a scientific discovery. Consider the case of mammography and early pattern matching.

Example 12.1 (Mammography and Early Pattern Matching) The use of mammograms for breast cancer detection relied on simple pattern matching in its initial stages. Radiologists visually examined the X-ray images for specific patterns indicative of cancer, such as dense areas of tissue that appear different from the surrounding breast tissue (known as masses) and small white spots of calcium deposits called microcalcifications. These patterns were associated with early-stage cancer and could easily be missed by visual inspection alone.

Radiologists relied on their expertise and experience to identify these patterns and distinguish them from normal variations in breast tissue. This process was subjective and prone to error: subtle abnormalities, especially in dense breasts, could easily be missed using visual assessment alone. Despite these limitations, pattern matching played a crucial role in the early detection of breast cancer, saving countless lives, and it served as the foundation for mammography as a screening tool.

Albert Salomon, a German surgeon, played a pivotal role in the early development of mammography (Nicosia et al. (2023)). His most significant contribution was his 1913 monograph, “Beiträge zur Pathologie und Klinik der Mammakarzinome” (Contributions to the Pathology and Clinic of Breast Cancers), in which he demonstrated the potential of X-ray imaging for studying breast disease. Pioneering the use of X-rays, he compared images of surgically removed breast tissue with the actual tissue and identified characteristic features of cancerous tumors, such as their size, shape, and borders. He was also one of the first to recognize the association between small calcifications appearing on X-rays and breast cancer.

The presence of calcium deposits is correlated with breast cancer and remains one of the prevailing imaging biomarkers for its detection. Although the discovery of the deposit-cancer association led to further scientific discoveries, the molecular mechanisms that lead to the formation of these calcium deposits, as well as the significance of their presence in human tissues, are not yet completely understood (Bonfiglio et al. (2021)).

12.1.1 Richard Feynman on Pattern Matching and Chess

Richard Feynman, the renowned physicist, was a strong advocate for the importance of pattern matching and its role in learning and problem-solving. He argued that many scientific discoveries start with pattern matching. He emphasized that experts in any field, whether it is chess, science, or art, develop a strong ability to identify and understand the relevant patterns in their domains.

He often used the example of chess to illustrate this concept (Feynman (n.d.)). Feynman argued that a skilled chess player doesn’t consciously calculate every possible move. Instead, they recognize patterns on the board and understand the potential consequences of their actions. For example, a chess player might recognize that having a knight in a certain position is advantageous and will lead to a favorable outcome. This ability to identify and understand patterns allows them to make quick and accurate decisions during the game. Through playing and analyzing chess games, players develop mental models that represent their understanding of the game’s rules, strategies, and potential patterns. These mental models allow them to anticipate their opponent’s moves and formulate effective responses.

He emphasized that this skill could be transferred to other domains, such as scientific research, engineering, and even everyday problem-solving.

Here is a quote from one of his interviews:

Richard Feynman

Let’s say a chess game. And you don’t know the rules of the game, but you’re allowed to look at the board from time to time, in a little corner, perhaps. And from these observations, you try to figure out what the rules are of the game, what [are] the rules of the pieces moving.

You might discover after a bit, for example, that when there’s only one bishop around on the board, that the bishop maintains its color. Later on you might discover the law for the bishop is that it moves on a diagonal, which would explain the law that you understood before, that it maintains its color. And that would be analogous to how we discover one law and later find a deeper understanding of it.

Ah, then things can happen–everything’s going good, you’ve got all the laws, it looks very good–and then all of a sudden some strange phenomenon occurs in some corner, so you begin to investigate that, to look for it. It’s castling–something you didn’t expect …..

After you’ve noticed that the bishops maintain their color and that they go along on the diagonals and so on, for such a long time, and everybody knows that that’s true; then you suddenly discover one day in some chess game that the bishop doesn’t maintain its color, it changes its color. Only later do you discover the new possibility that the bishop is captured and that a pawn went all the way down to the queen’s end to produce a new bishop. That could happen, but you didn’t know it.

In an interview on Artificial General Intelligence (AGI), he compares human and machine intelligence:

Richard Feynman

First of all, do they think like human beings? I would say no, and I’ll explain in a minute why I say no. Second, for “whether they be more intelligent than human beings” to be a question, intelligence must first be defined. If you were to ask me, are they better chess players than any human being? Possibly they can be, yes: “I’ll get you some day.” They’re better chess players than most human beings right now!

By 1997, when Deep Blue defeated world champion Garry Kasparov, computers had become stronger than human grandmasters. Modern engines such as Stockfish, especially since incorporating deep neural networks into their evaluation, are far stronger still. A turning point in our understanding of AI algorithms was AlphaZero and chess.

AlphaZero, coupling deep neural networks with Monte Carlo tree search, provided a new gold standard for chess. It showed that a neural network can learn by playing against itself. The networks are used to pattern match and interpolate both the policy and the value function, which implicitly performs “feature selection”. Whilst humans have heuristics for features in chess, such as control of the center, king safety, and piece development, AlphaZero “learns” them from experience. With the goal of maximizing the probability of winning, the networks show a preference for initiative, speed, momentum, and space over minor material such as pawns, thus reviving the old-school romantic style of chess play.

Feynman discusses how machines show intelligence:

Richard Feynman

With regard to the question of whether we can make it to think like [human beings], my opinion is based on the following idea: That we try to make these things work as efficiently as we can with the materials that we have. Materials are different than nerves, and so on. If we would like to make something that runs rapidly over the ground, then we could watch a cheetah running, and we could try to make a machine that runs like a cheetah. But, it’s easier to make a machine with wheels. With fast wheels or something that flies just above the ground in the air. When we make a bird, the airplanes don’t fly like a bird, they fly but they don’t fly like a bird, okay? They don’t flap their wings exactly, they have in front, another gadget that goes around, or the more modern airplane has a tube that you heat the air and squirt it out the back, a jet propulsion, a jet engine, has internal rotating fans and so on, and uses gasoline. It’s different, right?

So, there’s no question that the later machines are not going to think like people think, in that sense. With regard to intelligence, I think it’s exactly the same way, for example they’re not going to do arithmetic the same way as we do arithmetic, but they’ll do it better.

12.2 Correlations

Arguably, the simplest form of pattern matching is correlation. Correlation is a statistical measure that quantifies the strength of the relationship between two variables, that is, how closely the two variables move in relation to each other. Correlation is often used to identify patterns in data, and it is a fundamental statistical concept that is widely used in many fields, including science, engineering, finance, and business.

Let’s consider the correlation between the returns on Google stock and the S&P 500 stock index. The correlation coefficient is a measure of the strength and direction of the linear relationship between two variables; it is a number between -1 and 1.
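Concretely, for paired observations \((x_i, y_i)_{i=1}^n\) with sample means \(\bar x\) and \(\bar y\), the correlation coefficient is \[ r_{xy} = \frac{\mathrm{cov}(x,y)}{s_x \, s_y} = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2}}, \] which is exactly the quantity computed by hand in the code below.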

Example 12.2 (Google Stock Returns) Figure 12.1 shows the scatter plot of Google and S&P 500 daily returns.

# Read 2019 daily prices and compute simple daily returns
goog = read.csv("../data/GOOG2019.csv")
rgoog = goog$Adj.Close[2:251]/goog$Adj.Close[1:250] - 1
sp = read.csv("../data/SP2019.csv")
rsp = sp$Adj.Close[2:251]/sp$Adj.Close[1:250] - 1
# Scatter plot of the two return series
plot(rgoog, rsp, col="lightblue", pch=21, bg="grey", xlab="GOOG return", ylab="SP500 return")
Figure 12.1: Scatter plot of Google and S&P 500 daily returns

Let’s calculate the covariance and correlation between the daily returns of the Google stock and S&P 500.

# Variance, covariance, and correlation computed from their definitions
# (note: these use the 1/n divisor rather than 1/(n-1))
var_goog = mean((rgoog - mean(rgoog))^2)
var_sp = mean((rsp - mean(rsp))^2)
cov = mean((rgoog - mean(rgoog))*(rsp - mean(rsp))); print(cov)
## [1] 7.998625e-05
cor = cov/(sqrt(var_goog)*sqrt(var_sp)); print(cor)
## [1] 0.6673029
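The same quantities can be cross-checked with R’s built-in functions. Note that cov() uses the \(1/(n-1)\) divisor, so its value differs from the mean-based estimate above by a factor of \(n/(n-1)\), while the correlation is unaffected by the choice of divisor.

cov(rgoog, rsp)  # sample covariance, 1/(n-1) divisor
cor(rgoog, rsp)  # should match the correlation computed above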

12.3 Prediction and Forecasting

Prediction and forecasting are among the most frequent problems in data analysis, and the most common approach to solving them is pattern matching. Prediction and forecasting are two closely related concepts that are often used interchangeably. In business and engineering the main motivation for prediction and forecasting is to make better decisions; in science, the main motivation is to test and validate theories. Prediction and forecasting help to identify trends and patterns in historical data that would otherwise remain hidden. This allows analysts to make informed decisions about the future based on what they know about the past. By using prediction models, analysts can identify potential risks and opportunities that may lie ahead, and this information can then be used to develop proactive strategies to mitigate risks and capitalize on opportunities. In many business applications the concern is improving the efficiency of a system. For example, to improve logistics chains and to optimally allocate resources, we need to forecast demand and supply and to predict the future prices of those resources. By predicting future sales, businesses can better plan their marketing and sales efforts, which can lead to increased sales and profitability. Prediction and forecasting can also be used to identify and mitigate potential risks, such as financial losses, supply chain disruptions, and operational failures.

There are plenty of engineering applications. Given the availability of sensors, predictive maintenance has become a widespread approach: by predicting when equipment is likely to fail, engineers can take proactive steps to prevent failures and reduce downtime. Another example is traffic forecasting, which is used both by drivers planning routes and by engineers planning future road and bridge construction projects. Prediction and forecasting are also used in weather forecasting, resource management, and many other applications.

A typical prediction problem involves building a rule that maps observed inputs \(x\) into the output \(y\). The inputs \(x\) are often called predictors, features, or independent variables, while the output \(y\) is often called the response or dependent variable. The goal is to find a predictive rule \[ y = f(x). \]

The map \(f\) can be viewed as a black box which describes how to find the output \(y\) from the input \(x\). One of the key requirements on \(f\) is that we should be able to find this function efficiently using an algorithm. In the simplest case \(y\) and \(x\) are both univariate (scalars) and we can view the map as

flowchart LR
    x((x)) -->f
    f--> y((y))

The goal of machine learning is to reconstruct this map from observed data. In a multivariate setting \(x = (x_1,\ldots,x_p)\) is a list of \(p\) variables, which leads to a model of the form \(y = f(x_1,\ldots,x_p)\). There are a number of possible goals of the analysis, such as estimation, inference, or prediction, the main one being prediction.

The prediction task is to calculate a response that corresponds to a new input feature variable. An example of inference is the task of establishing causation, with the goal of extracting information about the nature of the black box association between the response variable and the input variables.

In either case, the goal is to use data to find a pattern that we can exploit. The pattern will be “statistical” in its nature. To uncover the pattern we use a training dataset, denoted by \[ D = (y_i,x_i)_{i=1}^n, \]

where \(x_i\) is a set of \(p\) predictors and \(y_i\) is the response variable. The prediction problem is to use a training dataset \(D\) to design a rule that can be used for predicting output values \(y\) for new observations \(x\).

Let \(f(x)\) be a predictor of \(y\); we will use the notation \[ \hat{y} = f(x). \]

To summarize, we will use the following notation.

\(y\): output variable (response/outcome)
\(x\): input variable (predictor/covariate/feature)
\(f(x)\): predictive rule
\(\hat y\): predicted output value

We distinguish several types of input and output variables. First, binary variables can take only two possible values, e.g. yes/no, left/right, 0/1, up/down. A generalization of a binary variable is a categorical variable, which can take a fixed number of possible values, for example marriage status. Some categorical variables have a natural order to them, for example education level or salary range; those variables are called ordinal. Lastly, the most common type of variable is a quantitative variable, which is described by a real number.
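In R these variable types map naturally onto factors and numeric vectors. The following is a minimal sketch; the variable names (employed, marital, education, salary) are made up purely for illustration.

# Binary variable: a factor with two levels
employed = factor(c("yes", "no", "yes"), levels = c("no", "yes"))
# Categorical (nominal) variable: a factor with unordered levels
marital = factor(c("single", "married", "divorced"))
# Ordinal variable: a factor with ordered levels
education = factor(c("BS", "HS", "PhD"), levels = c("HS", "BS", "MS", "PhD"), ordered = TRUE)
# Quantitative variable: a numeric vector
salary = c(52000, 61000, 87500)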

Depending on the type of the output variable, there are three types of prediction problems. When \(y\) is binary, \(y\in \{0,1\}\), or categorical, \(y\in \{1,\ldots,K\}\) for \(K\) possible categories, the prediction problem is called classification. When \(y\) is a real number, \(y \in \mathbb{R}\), the problem is known as regression. Finally, when \(y\) is ordinal, the problem is known as ranking.

There are several simple predictive rules we can use to predict the output variable \(y\). For example, in the case of a regression problem, the simplest rule is to predict the average value of the output variable. This rule is called the mean rule and is defined as \[ f(x) = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i. \]

Notice that this model does not depend on the input variable \(x\) and will predict the same value for all observations. This rule is simple and easy to implement, but it is not very accurate. A more sophisticated rule is the nearest neighbor rule, which predicts the output value \(y\) for a new observation \(x\) by finding the closest observation in the training dataset and using its output value. The nearest neighbor rule is defined as \[ f(x) = y_{i^*}, \]

where \(i^* = \arg\min_{i=1,\ldots,n} \|x_i - x\|\) is the index of the closest observation in the training dataset. These two models represent two extreme cases of predictive rules: the mean rule is “stubborn” (it always predicts the same value) and the nearest neighbor rule is “flexible” (it can be very sensitive to small changes in the inputs). In the language of statistics, the mean rule has high bias and low variance, while the nearest neighbor rule has low bias and high variance. Although these two rules are simple, they sometimes lead to useful models in practice. Further, the two models represent a trade-off between accuracy and complexity (a.k.a. the bias-variance trade-off), which we discuss in more detail in a later section.
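Both rules can be written in a few lines of R. The sketch below uses simulated data purely for illustration, and mean_rule and nn_rule are names introduced here, not standard functions.

set.seed(1)
# Simulated training data: scalar input x and noisy response y
x = runif(100)
y = sin(2*pi*x) + rnorm(100, sd = 0.2)
# Mean rule: ignores the input and always predicts the sample mean
mean_rule = function(x_new) mean(y)
# Nearest neighbor rule: returns the response of the closest training point
nn_rule = function(x_new) y[which.min(abs(x - x_new))]
mean_rule(0.25)  # the same prediction for any input
nn_rule(0.25)    # follows the closest observation, hence more flexible

Re-running nn_rule with a slightly perturbed input illustrates its sensitivity to small changes, which is the high-variance behavior discussed above.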

The mean model and the nearest neighbor model belong to the class of so-called non-parametric models. Non-parametric models do not make an explicit assumption about the form of the function \(f(x)\). In contrast, parametric models assume that the predictive rule \(f(x)\) belongs to a specific family of functions, where each function in the family is defined by setting the values of its parameters.