x = seq(-10,10,length.out = 100)
plot(x, 1/(1+exp(-x)), type='l', col="red", xlab="x", ylab="p")
14 Classification: Logistic Regression
Classification is a type of predictive modeling where the goal is to predict a categorical variable based on a set of input variables. The categorical variable is often called the response variable and the input variables are called the predictors. The response variable is typically binary, meaning it can take on only two values, such as 0 or 1, or it can be a multi-class variable, meaning it can take on more than two values.
We start by assuming a binomial (Bernoulli) likelihood function for the response variable. The likelihood is a function of the probability that the response variable takes the value 1, given the input variables, and is defined as follows \[ P(y_i \mid p_i) = p_i^{y_i} (1-p_i)^{1-y_i}, \] where \(p_i\) is the function of the inputs \(x_i\) and coefficients \(\beta\) that gives us the probability of the response variable taking on a value of 1, given the input variables. A typical approach to calculate \(p_i\) is to use the logistic function \[ p_i = f(x_i,\beta)= \frac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}}, \] where \(\beta\) is a vector of parameters. The logistic function is a sigmoid function that maps any real number to a number between zero and one.
Given the observed data set \(\{(x_i,y_i)\}_{i=1}^n\), where each \(y_i\) is either 0 or 1, the goal is to predict the probability that the next observation will be 1. Minimizing the negative binomial log-likelihood leads to the maximum likelihood estimator for the parameters \(\beta\) (a.k.a. the cross-entropy estimator). The maximum likelihood estimator is defined as follows \[ \underset{\beta}{\mathrm{minimize}}~ -\sum_{i=1}^n \left[ y_i \log \left ( f(x_i,\beta) \right ) + (1-y_i) \log \left ( 1-f(x_i,\beta) \right ) \right]. \]
In the unconditional case, when we do not observe any inputs \(x\), the cross-entropy estimator is again the sample mean; here the probability is the same constant \(\beta\) for every observation. If we take the derivative of the objective with respect to \(\beta\) and set it to zero, we get \[ \frac{d}{d\beta} \left( -\sum_{i=1}^n \left[ y_i \log \beta + (1-y_i) \log \left ( 1-\beta \right ) \right] \right) = -\sum_{i=1}^n \left[ \frac{y_i}{\beta} - \frac{1-y_i}{1-\beta} \right] = 0, \] which gives us the solution \[ \hat{\beta} = \frac{1}{n}\sum_{i=1}^n y_i, \] the sample mean.
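As a quick numerical check, here is a minimal sketch (with simulated 0/1 data, not data from the text) showing that minimizing the unconditional cross-entropy with `optimize` recovers the sample mean.

```r
# Minimal sketch with simulated (hypothetical) 0/1 data: the minimizer of the
# unconditional cross-entropy coincides with the sample mean of y
set.seed(1)
y = rbinom(100, size = 1, prob = 0.3)
cross_entropy = function(b) -sum(y*log(b) + (1-y)*log(1-b))
opt = optimize(cross_entropy, interval = c(1e-6, 1 - 1e-6))
c(optimum = opt$minimum, sample_mean = mean(y))  # the two values agree up to numerical tolerance
```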
Unlike the least squares estimator, there is no analytical solution to the problem of minimizing the cross-entropy. However, there are efficient numerical optimization algorithms that can be used to find the optimal solution. As we will show later, the cross-entropy estimator is equivalent to the maximum likelihood estimator, assuming that \(y\) is a Bernoulli random variable.
In the case when we have more than two classes, \(y \in \{1,\ldots,K\}\), we simply build \(K\) models \(f_1(x_i,\beta),\ldots, f_K(x_i,\beta)\), one for each class, and then use the softmax function to convert the output of each model into a number between zero and one. The softmax function is defined as follows \[ \mathrm{softmax}\left(f_j(x,\beta)\right) = \frac{\exp(f_j(x,\beta))}{\sum_{i=1}^K \exp(f_i(x,\beta))}. \] The softmax function is a generalization of the sigmoid function to the case of more than two classes. It is often used as the activation function in the output layer of neural networks for multi-class classification problems. It converts the outputs of the models into a probability distribution over the classes, making it suitable for multi-class classification with probabilistic outputs.
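For concreteness, here is a minimal sketch of the softmax calculation in R; the model outputs below are made-up numbers, not taken from the text.

```r
# Softmax: convert K arbitrary model outputs into probabilities that sum to one
softmax = function(f) {
  f = f - max(f)            # subtract the max for numerical stability
  exp(f) / sum(exp(f))
}
f = c(2.0, 0.5, -1.0)       # hypothetical outputs f_1(x), f_2(x), f_3(x)
softmax(f)                  # class probabilities
sum(softmax(f))             # equals 1
```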
In summary, choosing the right loss function for your predictive rule depends on several factors, including the type of prediction task (regression vs. classification) and the importance of sensitivity to outliers. In either case, we need to specify a loss function that measures the mismatch between observed and predicted values. The loss function is a measure of how well the model fits the data and is used to estimate the parameters of the model. The goal is to find the values of the parameters that minimize the loss function. In regression, the most popular choice is the squared error loss \[ L(\beta; ~D) = \sum_{i=1}^n (y_i - f(x_i))^2 \rightarrow \mathrm{minimize}_{\beta} \]
14.1 Logistic Regression
When the value \(y\) we are trying to predict is categorical (or qualitative) we have a classification problem. For a binary output we predict the probability that it occurs \[ p ( Y=1 \mid X = x ), \] where \(X = (x_1,\ldots,x_p)\) is our usual list of predictors.
Suppose that we have a binary response, \(y\) taking the value \(0\) or \(1\)
- Win or lose
- Sick or healthy
- Buy or not buy
- Pay or default
The goal is to predict the probability that \(y\) equals \(1\). You can then use this probability to categorize a new data point. Assessing credit risk from default data is a typical problem.
- \(y\): whether or not a customer defaults on their credit card (No or Yes).
- \(x\): the average balance that the customer has remaining on their credit card after making their monthly payment, plus as many other features you think might predict \(Y\).
A linear model is a powerful tool to find relations among different variables \[ y = \beta^Tx + \epsilon. \] It works assuming that the \(y\) variable is continuous and ranges in \((-\infty,+\infty)\). Another assumption is that the conditional distribution of \(y\) is normal, \(p(y\mid \beta^Tx) \sim N(\beta^Tx, \sigma^2)\).
What do we do when the assumption of a conditional normal distribution does not hold? For example, \(y\) can be a binary variable with values 0 and 1, such as
- Outcome of an election
- Result of spam filter
- Decision variable about loan approval
We model the \(\{0,1\}\) response using a continuous variable \(y \in [0,1]\), which is interpreted as the probability that the response equals 1, \[ p( y= 1 \mid x_1, \ldots , x_p ) = F \left ( \beta_1 x_1 + \ldots + \beta_p x_p \right ), \] where \(F\) is increasing and \(0< F(x)<1\).
It seems logical to find a transformation \(F\) so that \(F(\beta^Tx + \epsilon) \in [0,1]\). Then we can predict using \(F(\beta^Tx)\) and interpret the result as a probability, i.e. if \(F(\beta^Tx) = z\) then we interpret it as \(p(y=1) = z\). Such a function \(F\) is called a link function.
Do we know a function that maps any real number to a number in the \([0,1]\) interval? What about a cumulative distribution function \(F(x) = p(Z \le x)\)? If we choose the CDF \(\Phi(x)\) of \(N(0,1)\), then we have \[\begin{align*} \hat y = p(y=1) &= \Phi(\beta^Tx) \\ \Phi^{-1}(\hat y) = & \beta^Tx + \epsilon \end{align*}\] This is a linear model for \(\Phi^{-1}(\hat y)\), but not for \(y\)! You can think of this as a change of units for the variable \(y\). In this specific case, when we use the normal CDF, the resulting model is called probit, which stands for probability unit. The resulting link function is \(\Phi^{-1}\) and now \(\Phi^{-1}(Y)\) follows a normal distribution! This term was coined in the 1930s by biologists studying the dosage-cure rate link. We can fit a probit model using the glm function in R.
set.seed(92) # Kuzy
x = seq(-3,3,length.out = 100)
y = pnorm(x+rnorm(100))>0.5
probitModel = glm(y~x, family=binomial(link="probit"))
mc = as.double(coef(probitModel))
# we want to predict outcome for x = -1
xnew = -1
(yt = mc[1] + mc[2]*xnew)
-0.86
(pnorm(yt))
0.19
(pred = predict(probitModel, list(x = c(xnew)), type="response"))
1
0.19
nd = dnorm(mc[1] + mc[2]*x)
plot(x,nd, type='l', col="red", xlab="x", ylab = "P(y=1)")
polygon(c(-3,x[x< -1],-1),c(0,nd[x< -1],0), col="blue")
Our prediction is the blue area which is equal to 0.195.
plot(x,y, type='p', col="red", xlab="x", ylab = "P(y=1)")
pred_probit = predict(probitModel, list(x=x), type="response")
lines(x,pred_probit, type='l')
A couple of observations: (i) this fits the data much better than a linear estimate, and (ii) it always lies between 0 and 1. Instead of thinking of \(y\) as a probability and transforming the right-hand side of the linear model, we can think of transforming \(y\) so that the transformed variable lies in \((-\infty,+\infty)\). We can use the odds ratio that we talked about before \[ \dfrac{y}{1-y} \]
The odds ratio lies in the interval \((0,+\infty)\). Almost what we need, but not exactly. Can we do another transform that maps \((0,+\infty)\) to \((-\infty,+\infty)\)? \[ \log\left(\dfrac{y}{1-y}\right) \] will do the trick! This function is called the logit function and it is the inverse of the sigmoidal "logistic" function, or logistic transform. The logit of the probability is linear in the predictors \[ \log \left ( \frac{ p \left ( y=1\mid x \right ) }{ 1 - p \left ( y=1\mid x \right ) } \right ) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p. \] These models are easy to fit in R:
glm( y ~ x1 + x2, family="binomial")
- family="binomial" indicates that \(y\) is \(0\) or \(1\)
- glm has a bunch of other options.
Outside of specific fields, e.g. behavioral economics, the logistic function is a much more popular choice than the probit model. Besides the fact that the logit transform is more intuitive to work with, it also has several nice properties when we deal with multiple classes (more than 2). Also, it is computationally easier than working with normal distributions. The density function of the logit is very similar to the probit one.
logitModel = glm(y~x, family=binomial(link="logit"))
pred_logit = predict(logitModel, list(x = x), type="response")
plot(x,pred_probit, pch=20, col="red", cex=0.9, ylab="y")
lines(x,pred_logit, type='p', pch=20, cex=0.5, col="blue")
lines(x,y, type='p', pch=21, cex=0.5, bg="lightblue")
legend("bottomright",pch=20, legend=c("Logit", "Probit"), col=c("blue","red"),y.intersp = 2)
We can easily derive an inverse of the logit function to get back the original \(y\) \[ \log\left(\dfrac{y}{1-y}\right) = \beta^Tx;~~ y=\dfrac{e^{\beta^Tx}}{1+e^{\beta^Tx}} \]
Example 14.1 (Example: NBA point spread)
NBA = read.csv("../data/NBAspread.csv")
attach(NBA)
n = nrow(NBA)
hist(NBA$spread[favwin==1], col=5, main="", xlab="spread")
hist(NBA$spread[favwin==0], add=TRUE, col=6)
legend("topright", legend=c("favwin=1", "favwin=0"), fill=c(5,6), bty="n")
boxplot(NBA$spread ~ NBA$favwin, col=c(6,5), horizontal=TRUE, ylab="favwin", xlab="spread")
Does the Vegas point spread predict whether the favorite wins or not? Turquoise = favorite does win, purple = favorite does not win. In R, the output gives us
nbareg = glm(favwin~spread-1, family=binomial)
summary(nbareg)
Call:
glm(formula = favwin ~ spread - 1, family = binomial)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
spread 0.1560 0.0138 11.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 766.62 on 553 degrees of freedom
Residual deviance: 527.97 on 552 degrees of freedom
AIC: 530
Number of Fisher Scoring iterations: 5
s = seq(0,30,length=100)
fit = exp(s*nbareg$coef[1])/(1+exp(s*nbareg$coef[1]))
plot(s, fit, typ="l", col=4, lwd=2, ylim=c(0.5,1), xlab="spread", ylab="P(favwin)")
The \(\beta\) measures how our log-odds change! \(\beta = 0.156\)
Let’s do the NBA Point Spread Prediction. “Plug-in” the values for the new game into our logistic regression \[ { P \left ( \mathrm{ favwin} \mid \mathrm{ spread} \right ) = \frac{ e^{ \beta x } }{ 1 + e^{\beta x} } } \] Check that when \(\beta =0\) we have \(p= \frac{1}{2}\).
Given our new values spread\(=8\) or spread\(=4\), the win probabilities are \(77\)% and \(65\)%, respectively. Clearly, the bigger spread means a higher chance of winning.
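We can reproduce these numbers by plugging the new spreads into the fitted logistic curve; a quick sketch, assuming `nbareg` fitted above is still in the workspace:

```r
# Plug-in predictions for new spreads of 8 and 4 points
b = coef(nbareg)[1]
exp(b*c(8, 4)) / (1 + exp(b*c(8, 4)))   # roughly 0.77 and 0.65
# equivalently: predict(nbareg, newdata = data.frame(spread = c(8, 4)), type = "response")
```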
Example 14.2 (Logistic Regression for Tennis Classification) Data science plays a major role in tennis; you can learn about recent AI tools developed by IBM from this Yahoo article.
We will analyze the Tennis Major Tournament Match Statistics Data Set from the UCI ML repository. The data set has one row per match from the four major tennis tournaments in 2013 (Australian Open, French Open, US Open, and Wimbledon).
Let’s load the data and familiarize ourselves with it
= read.csv("./../data/tennis.csv")
d dim(d)
943 44
str(d[,1:5])
'data.frame': 943 obs. of 5 variables:
$ Player1: chr "Lukas Lacko" "Leonardo Mayer" "Marcos Baghdatis" "Dmitry Tu"..
$ Player2: chr "Novak Djokovic" "Albert Montanes" "Denis Istomin" "Michael "..
$ Round : int 1 1 1 1 1 1 1 1 1 1 ...
$ Result : int 0 1 0 1 0 0 0 1 0 1 ...
$ FNL1 : int 0 3 0 3 1 1 2 2 0 3 ...
Let's look at a few columns of five randomly selected rows of the data
d[sample(1:943,size = 5),c("Player1","Player2","Round","Result","gender","surf")]
|     | Player1 | Player2 | Round | Result | gender | surf |
|-----|---------|---------|-------|--------|--------|------|
| 532 | Florian Mayer | Juan Monaco | 1 | 1 | M | Hard |
| 816 | L.Kubot | J.Janowicz | 5 | 0 | M | Grass |
| 431 | Svetlana Kuznetsova | Ekaterina Makarova | 1 | 1 | W | Clay |
| 568 | Marcos Baghdatis | Go Soeda | 1 | 1 | M | Hard |
| 216 | Mandy Minella | Anastasia Pavlyuchenkova | 2 | 0 | W | Hard |
We have data for 943 matches, and for each match we have 44 columns, including the names of the players, their gender, the surface type, and match statistics. Let's look at the number of break points won by each player. We will plot BPW (break points won) by each player on a scatter plot and color each dot according to the outcome
n = dim(d)[1]
plot(d$BPW.1+rnorm(n),d$BPW.2+rnorm(n), pch=21, col=d$Result+2, cex=0.6, bg="yellow", lwd=0.8,
     xlab="BPW by Player 1", ylab="BPW by Player 2")
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')
We can clearly see that the number of break points won is a strong predictor of the match outcome. This is obvious and follows from the rules: to win a match, a player must win break points. Now we want to understand the impact of winning a break point on the overall match outcome. We do it by building a logistic regression model
which(is.na(d$BPW.1)) # there is one row with NA value for the BPW.1 value and we remove it
171
d = d[-171,]; n = dim(d)[1]
m = glm(Result ~ BPW.1 + BPW.2-1, data=d, family = "binomial" )
summary(m)
Call:
glm(formula = Result ~ BPW.1 + BPW.2 - 1, family = "binomial",
data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
BPW.1 0.4019 0.0264 15.2 <2e-16 ***
BPW.2 -0.4183 0.0277 -15.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1305.89 on 942 degrees of freedom
Residual deviance: 768.49 on 940 degrees of freedom
AIC: 772.5
Number of Fisher Scoring iterations: 5
The R output does not tell us how accurate our model is, but we can quickly check it by using the table function. We will use \(0.5\) as a threshold for our classification.
table(d$Result, as.integer(m$fitted.values>0.5))
0 1
0 416 61
1 65 400
Thus, our model got (416+400)/942 = 0.87, i.e. about 87% of the predictions correct!
Essentially, the logistic regression is trying to draw a line that separates the red observations from the green ones. In our case, we have two predictors \(x_1\) = BPW.1 and \(x_2\) = BPW.2 and our model is \[ \log\left(\dfrac{p}{1-p}\right) = \beta_1x_1 + \beta_2 x_2, \] where \(p\) is the probability of player 1 winning the match. We want to find the line along which the probability is 1/2, meaning that \(p/(1-p) = 1\) and the log-odds \(\log(p/(1-p)) = 0\); thus the equation for the line is \(\beta_1x_1 + \beta_2 x_2 = 0\), or \[ x_2 = \dfrac{-\beta_1}{\beta_2}x_1 \]
Let's see the line found by the glm function
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')
x = seq(0,30,length.out = 200)
y = -m$coefficients[1]*x/m$coefficients[2]
lines(x,y, lwd=2, col="red")
There are a couple of observations. First, the effect of a break point on the game outcome is significant and symmetric: the effect of losing a break point is the same as the effect of winning one. We can also interpret the effect of winning a break point in the following way. We keep BPW.2 = 0 and calculate what happens to the probability of winning when BPW.1 changes from 0 to 1. The odds ratio for player 1 winning when BPW.1 = 0 is exp(0), which is 1, meaning that the probability that P1 wins is 1/2. Now when BPW.1 = 1, the odds ratio is 1.5
exp(0.4019)
1.5
We can calculate the probability of winning from the regression equation \[
\dfrac{p}{1-p} = 1.5,~~~p = 1.5(1-p),~~~2.5p = 1.5,~~~p = 0.6
\] Thus the probability of winning goes from 50% to 60%. We can use the predict function to get this result
predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
1
0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(1), BPW.2 = c(0)), type="response")
1
0.6
What happens to the chances of winning when P1 wins three more break points than the opponent?
predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
1
0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(3), BPW.2 = c(0)), type="response")
1
0.77
Chances go up by 27 percentage points.
Tennis is arguably the sport in which men and women are treated most equally. Both men's and women's matches are shown during prime time on TV, and both have the same prize money. However, one comment you hear often is that women's matches are "less predictable", meaning that an upset (when the favorite loses) is more likely to happen in a women's match than in a men's match. We can test this statement by looking at the residuals. The larger the residual, the less accurate our prediction was.
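The residual column `res` used in the code below is not created in any of the chunks shown above. One hedged possibility, consistent with how the residuals are used later, is the absolute deviance residual from the fitted model `m`:

```r
# Assumption: define the per-match residual as the absolute deviance residual
# of the logistic regression fit m (the exact definition used in the text is not shown)
d$res = abs(residuals(m, type = "deviance"))
```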
outlind = which(d$res<2)
boxplot(d$res[outlind] ~ d$gender[outlind], col=c(2,3), xlab="Gender",ylab="Residual")
Let's do a formal t-test on the residuals for men's and women's matches
= d %>% filter(res<2, gender=="M") %>% pull(res)
men = d %>% filter(res<2, gender=="W") %>% pull(res)
women t.test(men, women, alternative = "two.sided")
Welch Two Sample t-test
data: men and women
t = -5, df = 811, p-value = 3e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.105 -0.043
sample estimates:
mean of x mean of y
1.2 1.3
It looks like the common wisdom that women's matches are less predictable is correct.
Example 14.3 (Credit Card Default) We have 10,000 observations
= read.csv("../data/CreditISLR.csv", stringsAsFactors = T)
Default head(Default)
| default | student | balance | income |
|---------|---------|---------|--------|
| No | No | 730 | 44362 |
| No | Yes | 817 | 12106 |
| No | No | 1074 | 31767 |
| No | No | 529 | 35704 |
| No | No | 786 | 38464 |
| No | Yes | 920 | 7492 |
Let’s build a logistic regression model
glm.fit = glm(default~balance,data=Default,family=binomial)
summary(glm.fit)
Call:
glm(formula = default ~ balance, family = binomial, data = Default)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.65133 0.36116 -29.5 <2e-16 ***
balance 0.00550 0.00022 24.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600
Number of Fisher Scoring iterations: 8
We use it now to predict default
predict.glm(glm.fit,newdata = list(balance=1000))
1
-5.2
-1.065e+01 + 5.499e-03*1000
-5.2
predict.glm(glm.fit,newdata = list(balance=1000), type="response")
1
0.0058
exp(-1.065e+01 + 5.499e-03*1000)/(1+exp(-1.065e+01 + 5.499e-03*1000))
0.0058
Predicting default
y = predict.glm(glm.fit,newdata = x, type="response") # x: a data frame with a grid of balance values (defined earlier, not shown)
plot(x$balance,y, pch=20, col="red", xlab = "balance", ylab="Default")
lines(Default$balance, as.integer(Default$default)-1, type='p',pch=20)
14.2 Confusion Matrix
We can use the accuracy rate: \[ \text{accuracy} = \dfrac{\text{\# of correct answers}}{n} \] or its dual, the error rate \[ \text{error rate} = 1 - \text{accuracy}. \] Remember, we have two types of errors. We can use the confusion matrix to quantify those
|             | Predicted: YES | Predicted: NO |
|-------------|----------------|---------------|
| Actual: YES | TPR            | FNR           |
| Actual: NO  | FPR            | TNR           |
The true positive rate (TPR) is the sensitivity and the true negative rate (TNR) is the specificity of our predictive model (so FPR = 1 - specificity).
Example: evaluate the previous model. Accuracy = 0.96
|             | Predicted: YES | Predicted: NO |
|-------------|----------------|---------------|
| Actual: YES | TPR=0.6        | FNR=0.4       |
| Actual: NO  | FPR=0.03       | TNR=0.97      |
I used \(p=0.2\) as a cut-off. What if I use smaller or larger \(p\), e.g. \(p=0\)?
The ROC curve shows what happens for different cut-off values. First, we define a function that calculates the ROC curve
<- function(p,y, ...){
roc <- factor(y)
y <- length(p)
n <- as.vector(p)
p <- p > matrix(rep(seq(0,1,length=100),n),ncol=100,byrow=TRUE)
Q <- colMeans(!Q[y==levels(y)[1],])
specificity <- colMeans(Q[y==levels(y)[2],])
sensitivity plot(1-specificity, sensitivity, type="l", ...)
abline(a=0,b=1,lty=2,col=8)
}
## roc curve and fitted distributions
pred = predict.glm(glm.fit,newdata = Default, type="response")
default = as.integer(Default$default=="Yes") # 0/1 indicator of default, used in the cutoff points below
roc(p=pred, y=Default$default, bty="n", main="in-sample")
# our 1/5 rule cutoff
points(x= 1-mean((pred<.2)[default==0]),
y=mean((pred>.2)[default==1]),
cex=1.5, pch=20, col='red')
## a standard `max prob' (p=.5) rule
points(x= 1-mean((pred<.5)[default==0]),
y=mean((pred>.5)[default==1]),
cex=1.5, pch=20, col='blue')
legend("bottomright",fill=c("red","blue"),
legend=c("p=1/5","p=1/2"),bty="n",title="cutoff")
Look at other predictors
glm.fit = glm(default~balance+income+student,data=Default,family=binomial)
summary(glm.fit)
Call:
glm(formula = default ~ balance + income + student, family = binomial,
data = Default)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.09e+01 4.92e-01 -22.08 <2e-16 ***
balance 5.74e-03 2.32e-04 24.74 <2e-16 ***
income 3.03e-06 8.20e-06 0.37 0.7115
studentYes -6.47e-01 2.36e-01 -2.74 0.0062 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1580
Number of Fisher Scoring iterations: 8
Student is significant!?
Student vs Balance
boxplot(balance~student,data=Default, col = Default$student, ylab = "balance")
Let’s adjust for balance
x1 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("Yes",100)), income=rep(40,100)) # student case (this line is reconstructed; it is not shown in the original output)
x2 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("No",100)), income=rep(40,100))
y1 = predict.glm(glm.fit,newdata = x1, type="response")
y2 = predict.glm(glm.fit,newdata = x2, type="response")
plot(x1$balance,y1, type='l', col="red", xlab="Balance", ylab = "P(Default)")
lines(x2$balance,y2, type='l', col="black")
legend("topleft",bty="n", legend=c("Not Student", "Student"), col=c("black","red"), lwd=2)
14.2.1 Estimating the Regression Coefficients
The coefficients \(\beta = (\beta_0, \beta_1)\) can be estimated using the maximum likelihood method we discussed when we talked about linear regression.
The model we derived above gives us the probability of \(y=1\) given \(x\) \[ p(y=1\mid x) = \dfrac{e^{\beta^Tx}}{1+e^{\beta^Tx}} \]
Now the problem is as follows \[ \underset{\beta}{maximize} \prod_{i:y_i = 1}p(x_i)\prod_{j:y_j = 0}(1-p(x_j)). \]
Maximum likelihood is a very general approach that is used to fit many non-linear models.
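To make the connection concrete, here is a minimal sketch that maximizes the Bernoulli log-likelihood numerically with `optim` and compares the result to `glm`; the data are simulated for illustration and are not from the text.

```r
# Fit logistic regression by maximizing the Bernoulli log-likelihood directly
set.seed(1)
x = rnorm(500)
p = 1/(1 + exp(-(1 + 2*x)))           # true intercept 1, slope 2 (hypothetical)
y = rbinom(500, size = 1, prob = p)
negloglik = function(beta) {
  eta = beta[1] + beta[2]*x
  -sum(y*eta - log(1 + exp(eta)))     # negative Bernoulli log-likelihood
}
fit = optim(c(0, 0), negloglik)
rbind(optim = fit$par, glm = as.numeric(coef(glm(y ~ x, family = binomial))))
```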
14.2.2 Choosing \(p\) and Evaluating Quality of Classifier
In logistic regression we use the logistic function to calculate the probability of \(y = 1\) \[ p(y=1\mid x) = \dfrac{e^{\beta^Tx}}{1+e^{\beta^Tx}}. \] Then, to predict a label, we use the rule: if \(p(y=1\mid x) < p\), predict 0, and predict 1 otherwise, where \(p\) is a cut-off value. Now we answer the question of how to choose the cut-off value \(p\). We show it through an example.
Example 14.4 (Loan Default) Assume a bank is using a logistic regression model to predict the probability of a loan default and would issue a loan if \(a = p(y=1) < p\). Here \(p\) is the level of risk the bank is willing to take. If the bank chooses \(p=1\) and gives loans to everyone, it is likely to lose a lot of money from defaulted accounts. If it chooses \(p = 0\), it will not issue a loan to anyone and won't make any money. In order to choose an appropriate \(p\), we need to know what the risks are. Assume the bank makes $0.25 on every $1 borrowed in interest and fees and loses the entire amount of $1 if the account defaults. This leads to the following pay-off matrix
|         | payer | defaulter |
|---------|-------|-----------|
| loan    | -0.25 | 1         |
| no loan | 0     | 0         |
Then, given \(a = p(y=1)\), the expected profit is \(0.25(1-a) - a\). To maintain a positive profit we need to choose \[ 0.25(1-a) - a >0 \iff -1.25a > -0.25 \iff a < 0.25 /1.25= 0.2 \] Thus, by choosing a cutoff of 0.2 or less, we expect to make a profit on our loans.
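A quick sketch of this break-even calculation in R, using the payoffs from the table above:

```r
# Expected profit per $1 lent when the probability of default is a
profit = function(a) 0.25*(1 - a) - 1*a
uniroot(profit, c(0, 1))$root   # break-even default probability: 0.2
```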
To evaluate a binary classification predictor, we will use the confusion matrix. It shows the numbers of correct predictions by the model (true positives and true negatives) and incorrect ones (false positives and false negatives). Say we have a model that predicts whether a person has a disease or not, and we evaluate this model using 200 samples (\(n=200\)), with 60 labeled as 0 (NO) and 140 labeled as 1 (YES), and the model predicted 130 of the YES-labeled observations and 50 of the NOs correctly.
|             | Predicted: YES | Predicted: NO |
|-------------|----------------|---------------|
| Actual: YES | TP = 130       | FN = 10       |
| Actual: NO  | FP = 10        | TN = 50       |
Sometimes it is convenient to use rates rather than absolute counts, and we compute
|             | Predicted: YES | Predicted: NO |
|-------------|----------------|---------------|
| Actual: YES | TPR = 130/140  | FNR = 10/140  |
| Actual: NO  | FPR = 10/60    | TNR = 50/60   |
The true positive rate (TPR) is nothing but the sensitivity, and the true negative rate (TNR) is the specificity of our predictive model (the false positive rate is FPR = 1 - specificity). Accuracy, which is the percent of correct predictions, is another metric that can be used to evaluate a classifier. \[ \mbox{Accuracy} = \dfrac{\mbox{TP + TN}}{n}. \] The error rate is the complement of accuracy \[ \mbox{Error rate} = 1- \mbox{Accuracy} \]
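Using the counts from the disease example above, a small sketch of these summaries in R:

```r
# Confusion-matrix summaries for the disease example (TP=130, FN=10, FP=10, TN=50)
TP = 130; FN = 10; FP = 10; TN = 50; n = TP + FN + FP + TN
c(TPR = TP/(TP + FN), FPR = FP/(FP + TN), TNR = TN/(FP + TN),
  accuracy = (TP + TN)/n, error_rate = 1 - (TP + TN)/n)
```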
For a logistic regression, the confusion matrix will be different for different choices of the cut-off value \(p\). If we would like to understand the performance of the model for different values of \(p\), we can plot an ROC curve, which plots pairs of FPR and TPR for different values of \(p\). Say we take a sequence of 11 values \(p \in \{0, 0.1, 0.2,\ldots,1\}\), evaluate TPR and FPR for each of those 11 values, and plot the resulting pairs on a 2D plot; then we get the ROC curve. A few facts about the ROC curve:
- If we set \(p=0\), then any model will always predict YES, which leads to FPR = 1 and TPR = 1
- If we set \(p=1\) and the model always predicts NO, then we get FPR = 0 and TPR = 0
- If we have an “ideal" model then for any \(0<p<1\) we will have FPR = 0 and TPR = 1.
- A naive model that uses coin flip to classify will have FPR = 1/2 and TPR = 1/2
- An ROC curve for any model will lie in-between the ideal curve and the naive curve. If your model is worse than naive, it is not a good model. And your model cannot be better than an ideal model.
Example 14.5 (Default) Let’s consider an example. We want to predict default given attributes of the loan applicant. We have 1000 observations of 9 variables
= read.csv("../data/credit.csv")
credit $history = factor(credit$history, levels=c("A30","A31","A32","A33","A34"))
creditlevels(credit$history) = c("good","good","poor","poor","terrible")
$foreign <- factor(credit$foreign, levels=c("A201","A202"), labels=c("foreign","german"))
credit$rent <- factor(credit$housing=="A151")
credit$purpose <- factor(credit$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))
creditlevels(credit$purpose) <- c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz")
<- credit[,c("Default", "duration", "amount",
credit "installment", "age", "history",
"purpose", "foreign", "rent")]
::kable(head(credit)) knitr
| Default | duration | amount | installment | age | history | purpose | foreign | rent |
|---------|----------|--------|-------------|-----|---------|---------|---------|------|
| 0 | 6 | 1169 | 4 | 67 | terrible | goods/repair | foreign | FALSE |
| 1 | 48 | 5951 | 2 | 22 | poor | goods/repair | foreign | FALSE |
| 0 | 12 | 2096 | 2 | 49 | terrible | edu | foreign | FALSE |
| 0 | 42 | 7882 | 2 | 45 | poor | goods/repair | foreign | FALSE |
| 1 | 24 | 4870 | 3 | 53 | poor | newcar | foreign | FALSE |
| 0 | 36 | 9055 | 2 | 35 | poor | edu | foreign | FALSE |
We build a logistic regression model using all of the 8 predictors and their interactions
credscore = glm(Default~.^2,data=credit,family=binomial)
Then we plot the ROC curve (FPR vs TPR) for different values of \(p\) and compare the curve with the naive model.
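A sketch of that comparison using the `roc` helper defined earlier, with in-sample fitted probabilities; the dashed diagonal drawn by `roc` plays the role of the naive benchmark.

```r
# In-sample ROC curve for the credit-scoring model; the dashed diagonal
# added by roc() corresponds to the naive (coin-flip) classifier
pred_credit = predict(credscore, type = "response")
roc(p = pred_credit, y = credit$Default, bty = "n", main = "credit scoring: in-sample")
```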
14.3 Multinomial logistic regression
Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: \(y_i \in \{0,1\}\) . We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle \(y_i \in \{1,\ldots ,K\}\) where \(K\) is the number of classes. Our model took the form: \[ f(x\mid \beta)=\dfrac{1}{1+\exp(-\beta^Tx)}~, \] and the model parameters \(\beta\) were trained to minimize the loss function (negative log-likelihood) \[ J(\beta) = -\left[ \sum_{i=1}^m y_i \log f(x\mid \beta) + (1-y_i) \log (1-f(x\mid \beta)) \right] \]
Given a test input \(x\), we want our model to estimate the probability \(p(y=k|x)\) for each value of \(k=1,\ldots ,K\). Thus, our model will output a \(K\)-dimensional vector (whose elements sum to 1) giving us our \(K\) estimated probabilities. Concretely, our model \(f(x\mid \beta)\) takes the form: \[ \begin{aligned} f(x\mid \beta) = \begin{bmatrix} p(y = 1 | x; \beta) \\ p(y = 2 | x; \beta) \\ \vdots \\ p(y = K | x; \beta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(\beta_j^T x) }} \begin{bmatrix} \exp(\beta_1^{T} x ) \\ \exp(\beta_2^{T} x ) \\ \vdots \\ \exp(\beta_K^T x ) \\ \end{bmatrix}\end{aligned} \]
Here \(\beta_i \in R^n, i=1,\ldots,K\) are the parameters of our model. Notice that the term \(1/ \sum_{j=1}^{K}{\exp(\beta_j^T x) }\) normalizes the distribution, so that it sums to one.
For convenience, we will also write \(\beta\) to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent \(\beta\) as an \(n\)-by-\(K\) matrix obtained by concatenating \(\beta_1,\beta_2,\ldots ,\beta_K\) into columns, so that \[ \beta = \left[\begin{array}{cccc}| & | & | & | \\ \beta_1 & \beta_2 & \cdots & \beta_K \\ | & | & | & | \end{array}\right]. \]
We now describe the cost function that we’ll use for softmax regression. In the equation below, \(1\{\cdot\}\) is the indicator function, so that \(1\)(a true statement)=1, and \(1\)(a false statement)=0. For example, 1(2+3 > 4) evaluates to 1, whereas 1(1+1 == 5) evaluates to 0. Our cost function will be: \[ \begin{aligned} J(\beta) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y_i = k\right\} \log \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i)}\right] \end{aligned} \]
Notice that this generalizes the logistic regression cost function, which could also have been written: \[ \begin{aligned} J(\beta) &= - \left[ \sum_{i=1}^m (1-y_i) \log (1-f(x_i\mid \beta)) + y_i \log f(x_i\mid \beta) \right] \\ &= - \left[ \sum_{i=1}^{m} \sum_{k=0}^{1} 1\left\{y_i = k\right\} \log p(y_i = k | x_i ; \beta) \right]\end{aligned} \] The softmax cost function is similar, except that we now sum over the \(K\) different possible values of the class label. Note also that in softmax regression, we have that \[ p(y_i = k | x_i ; \beta) = \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i) }. \]
Softmax regression has an unusual property: it has a redundant set of parameters. To explain what this means, suppose we take each of our parameter vectors \(\beta_j\) and subtract some fixed vector \(\psi\). Our model now estimates the class label probabilities as \[ \begin{aligned} p(y_i = k | x_i ; \beta) &= \frac{\exp((\beta_k-\psi)^T x_i)}{\sum_{j=1}^K \exp( (\beta_j-\psi)^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i) \exp(-\psi^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i) \exp(-\psi^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i)}.\end{aligned} \] In other words, subtracting \(\psi\) does not affect our model’s predictions at all! This shows that softmax regression’s parameters are redundant. More formally, we say that our softmax model is overparameterized, meaning that for any model we might fit to the data, there are multiple parameter settings that give rise to exactly the same model function \(f(x \mid \beta)\) mapping from inputs \(x\) to the predictions.
Further, if the cost function \(J(\beta)\) is minimized by some setting of the parameters \((\beta_1,\ldots,\beta_K)\), then it is also minimized by \((\beta_1-\psi,\ldots,\beta_K-\psi)\) for any value of \(\psi\). Thus, the minimizer of \(J(\beta)\) is not unique. Interestingly, \(J(\beta)\) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton’s method to run into numerical problems. We can just set \(\psi\) to \(\beta_i\) and remove \(\beta_i\).
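We can verify this shift-invariance numerically; a small sketch with made-up parameter values (not estimates from the text):

```r
# Subtracting the same vector psi from every beta_k leaves the softmax
# probabilities unchanged (made-up numbers for illustration)
softmax = function(v) exp(v - max(v)) / sum(exp(v - max(v)))
x    = c(1, -0.5, 2)
beta = list(b1 = c(0.3, 1.0, -0.2), b2 = c(-1.0, 0.5, 0.1), b3 = c(0.7, -0.3, 0.4))
psi  = c(5, -2, 1)
scores         = sapply(beta, function(b) sum(b * x))
scores_shifted = sapply(beta, function(b) sum((b - psi) * x))
rbind(original = softmax(scores), shifted = softmax(scores_shifted))  # identical rows
```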
Example 14.6 (LinkedIn Study) How to Become an Executive (Irwin 2016; Gan and Fritzler 2016)?
Logistic regression was used to analyze the career paths of about \(459,000\) LinkedIn members who worked at a top 10 consultancy between 1990 and 2010 and became a VP, CXO, or partner at a company with at least 200 employees. About \(64,000\) members reached this milestone, \(\hat{p} = 0.1394\), conditional on making it into the database. The goals of the analysis were the following
- Look at their profiles – educational background, gender, work experience, and career transitions.
- Build a predictive model of the probability of becoming an executive
- Provide a tool for the analysis of “what if” scenarios. For example, if you were to get a master’s degree, how would your job prospects change because of that.
Let’s build a logistic regression model with \(8\) key features (a.k.a. covariates): \[ \log\left ( \frac{p}{1-p} \right ) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_8x_8 \]
\(p\): Probability of “success” – reach VP/CXO/Partner seniority at a company with at least 200 employees.
Features to predict the “success” probability: \(x_i (i=1,2,\ldots,8)\).
| Variable | Description |
|----------|-------------|
| \(x_1\) | Metro region: whether a member has worked in one of the top 10 largest cities in the U.S. or globally |
| \(x_2\) | Gender: inferred from member names: ‘male’ or ‘female’ |
| \(x_3\) | Graduate education type: whether a member has an MBA from a top U.S. program / a non-top program / a top non-U.S. program / another advanced degree |
| \(x_4\) | Undergraduate education type: whether a member has attended a school from the U.S. News national university rankings / a top 10 liberal arts college / a top 10 non-U.S. school |
| \(x_5\) | Company count: # different companies in which a member has worked |
| \(x_6\) | Function count: # different job functions in which a member has worked |
| \(x_7\) | Industry sector count: # different industries in which a member has worked |
| \(x_8\) | Years of experience: # years of work experience, including years in consulting, for a member |
The following estimated \(\hat\beta\)s were obtained. With a sample size of about 459,000, they are measured rather accurately. Recall that each location/education choice in the “Choices and Impact” analysis below corresponds to a unit change in the corresponding feature.
- Location: Metro region: 0.28
- Personal: Gender(Male): 0.31
- Education: Graduate education type: 1.16, Undergraduate education type: 0.22
- Work Experience: Company count: 0.14, Function count: 0.26, Industry sector count: -0.22, Years of experience: 0.09
Here are three main findings
- Working across job functions, like marketing or finance, is good. Each additional job function provides a boost that, on average, is equal to three years of work experience. Switching industries has a slight negative impact. Learning curve? lost relationships?
- MBAs are worth the investment. But pedigree matters. Top five program equivalent to \(13\) years of work experience!!!
- Location matters. For example, NYC helps.
We can also personalize the prediction for possible future executives. For example, Person A (p=6%): Male in Tulsa, Oklahoma, Undergraduate degree, 1 job function for 3 companies in 3 industries, 15-year experience.
Person B (p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.
Person C (p=63%): Female in New York City, Top undergraduate program, Top MBA program, 4 different job functions for 4 companies in 1 industry, 15-year experience.
Let’s re-design Person B (currently p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.
- Work in one industry rather than two. Increase \(3\)%
- Undergrad from top \(10\) US program rather than top international school. \(3\)%
- Worked for \(4\) companies rather than \(2\). Another \(4\)%
- Move from London to NYC. \(4\)%
- Four job functions rather than two. \(8\)%. A \(1.5\)x effect.
- Worked for \(10\) more years. \(15\)%. A \(2\)X effect.
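These increments can be reproduced with simple log-odds arithmetic; a back-of-the-envelope sketch using the rounded coefficients listed above and Person B’s baseline of \(p = 0.15\) (the exact model and feature coding are not available, so treat this as an approximation):

```r
# Change in probability when adding delta to the log-odds of Person B's baseline
p0 = 0.15
bump = function(delta) plogis(qlogis(p0) + delta) - p0
round(c(one_industry       = bump(0.22),      # drop one industry sector (coefficient -0.22)
        four_companies     = bump(2*0.14),    # two more companies
        move_to_NYC        = bump(0.28),      # metro region
        four_job_functions = bump(2*0.26),    # two more job functions
        ten_more_years     = bump(10*0.09)),  # ten more years of experience
      2)                                      # roughly 0.03, 0.04, 0.04, 0.08, 0.15
```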
Choices and Impact (Person B) are shown below
14.4 Imbalanced Data
Often you have many more observations with one label than the other; such a sample is called imbalanced. You should avoid using accuracy as a metric to choose a model in this case. Say you have a binary classification problem with 95% of samples labeled as 1. Then a naive classifier that assigns label 1 to each input will be 95% accurate. An ROC curve and the area under the curve (AUC) metric derived from it should be used instead. Alternatively, you can use the F1 score, which combines precision and recall \[ F1 = 2\dfrac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]
A modeler should also consider synthetically generating more samples of the class with a small number of observations, e.g. by bootstrapping or using a generative model, or subsampling observations with the majority label if the data set is large enough.
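A small sketch of the F1 calculation from confusion-matrix counts; the numbers reuse the made-up disease example from earlier:

```r
# Precision, recall, and F1 from confusion-matrix counts
f1_score = function(TP, FP, FN) {
  precision = TP / (TP + FP)
  recall    = TP / (TP + FN)
  2 * precision * recall / (precision + recall)
}
f1_score(TP = 130, FP = 10, FN = 10)  # about 0.93 for the earlier disease example
```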
14.4.1 Kernel Trick
The kernel trick is a method of using a linear classifier to solve a non-linear problem. The idea is to map the data into a higher dimensional space, where it becomes linearly separable. The kernel trick is to use a kernel function \(K(x_i,x_j)\) to calculate the inner product of two vectors in the higher dimensional space without explicitly calculating the mapping \(\phi(x_i)\) and \(\phi(x_j)\). The kernel function is defined as \(K(x_i,x_j) = \phi(x_i)^T\phi(x_j)\). The most popular kernel functions are the polynomial kernel \(K(x_i,x_j) = (x_i^Tx_j)^d\) and the Gaussian kernel \(K(x_i,x_j) = \exp(-\gamma||x_i-x_j||^2)\). The kernel trick is used in Support Vector Machines (SVM) and Gaussian Processes (GP).
gencircledata = function(numSamples,radius,noise) {
  d = matrix(0,ncol = 3, nrow = numSamples); # matrix to store our generated data
  # Generate positive points inside the circle.
  for (i in 1:(numSamples/2) ) {
    r = runif(1,0, radius * 0.4);
    angle = runif(1,0, 2 * pi);
    x = r * sin(angle);
    y = r * cos(angle);
    noiseX = runif(1,-radius, radius) * noise;
    noiseY = runif(1,-radius, radius) * noise;
    d[i,] = c(0,x,y)
  }
  # Generate negative points outside the circle.
  for (i in (numSamples/2+1):numSamples ) {
    r = runif(1,radius * 0.8, radius);
    angle = runif(1,0, 2 * pi);
    x = r * sin(angle);
    y = r * cos(angle);
    noiseX = runif(1,-radius, radius) * noise;
    noiseY = runif(1,-radius, radius) * noise;
    d[i,] = c(1,x,y)
  }
  colnames(d) = c("label", "x1", "x2")
  return(d)
}
d = gencircledata(numSamples=200, radius=1, noise=0.001)
plot(d[,2],d[,3], col=d[,1]+1, pch=19, xlab="x", ylab="y")

The data on the left in Figure 14.1 is clearly not linearly separable. However, if we map it to a three-dimensional space using the transformation: \[ \begin{aligned} \phi: R^{2} & \longrightarrow R^{3} \\ \left(x_{1}, x_{2}\right) & \longmapsto\left(z_{1}, z_{2}, z_{3}\right)=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right), \end{aligned} \] and attempt to linearly separate the transformed data, the decision boundaries become hyperplanes in \(R^{3}\), expressed as \(\omega^{T} z + b = 0\). In terms of the original variables \(x\), these boundaries take the form: \[ \omega_{1} x_{1}^{2} + \omega_{2} \sqrt{2} x_{1} x_{2} + \omega_{3} x_{2}^{2} = 0, \] which corresponds to the equation of an ellipse. This demonstrates that we can apply a linear algorithm to transformed data to achieve a non-linear decision boundary with minimal effort.
Now, consider what the algorithm is actually doing. It relies solely on the Gram matrix \(K\) of the data. Once \(K\) is computed, the original data can be discarded: \[ \begin{aligned} K & = \left[\begin{array}{ccc} x_{1}^{T} x_{1} & x_{1}^{T} x_{2} & \cdots \\ x_{2}^{T} x_{1} & \ddots & \\ \vdots & & \end{array}\right]_{n \times n} = X X^{T}, \\ \text{where} \quad X & = \left[\begin{array}{c} x_{1}^{T} \\ \vdots \\ x_{n}^{T} \end{array}\right]_{n \times d}. \end{aligned} \] Here, \(X\), which contains all the data, is referred to as the design matrix.
When we map the data using \(\phi\), the Gram matrix becomes: \[ K = \left[\begin{array}{ccc} \phi\left(x_{1}\right)^{T} \phi\left(x_{1}\right) & \phi\left(x_{1}\right)^{T} \phi\left(x_{2}\right) & \cdots \\ \phi\left(x_{2}\right)^{T} \phi\left(x_{1}\right) & \ddots & \\ \vdots & & \end{array}\right]. \]
Let us compute these inner products explicitly. For vectors \(r\) and \(s\) in \(R^{3}\) corresponding to \(a\) and \(b\), respectively: \[ \begin{aligned} \langle r, s \rangle & = r_{1} s_{1} + r_{2} s_{2} + r_{3} s_{3} \\ & = a_{1}^{2} b_{1}^{2} + 2 a_{1} a_{2} b_{1} b_{2} + a_{2}^{2} b_{2}^{2} \\ & = \langle a, b \rangle^{2}. \end{aligned} \]
Thus, instead of explicitly mapping the data via \(\phi\) and then computing the inner product, we can compute it directly in one step, leaving the mapping \(\phi\) implicit. In fact, we do not even need to know \(\phi\) explicitly; all we require is the ability to compute the modified inner product. This modified inner product is called a kernel, denoted \(K(x, y)\). The matrix \(K\), which contains the kernel values for all pairs of data points, is also referred to as the kernel matrix.
Since the kernel itself is the primary object of interest, rather than the mapping \(\phi\), we aim to characterize kernels without explicitly relying on \(\phi\). Mercer’s Theorem provides the necessary framework for this characterization.
Let’s implement it
library("scatterplot3d")
<- function(x1, x2) {
phi <- x1^2
z1 <- sqrt(2) * x1 * x2
z2 <- x2^2
z3 return(cbind(z1, z2, z3))
}
# Generate sample 2D data (you can replace this with your actual data)
# Apply the transformation
<- phi(d[,2], d[,3])
transformed_data scatterplot3d(transformed_data, color = ifelse(d[,1] == 0, "red", "blue"), pch = 19,
xlab = "z1 (x1^2)", ylab = "z2 (sqrt(2) * x1 * x2)", zlab = "z3 (x2^2)",
main = "3D Scatter Plot of Transformed Data", angle=222, grid=FALSE, box=FALSE)
Example 14.7 (Formula One) As described in the Artificial Intelligence in Formula 1 article, Formula One teams are increasingly leveraging AI and machine learning to optimize race strategies. The article highlights how teams collect massive amounts of data during races - with 300 sensors per car generating millions of data points over 200-mile races. This data includes critical variables like fuel load, tire degradation, weight effects, and pit stop timing that must be optimized in real-time.
The key innovation is moving from pre-race strategy planning to in-race dynamic optimization using cloud computing platforms like AWS. Teams run Monte Carlo simulations of all cars and traffic situations to continuously update their strategy recommendations. This allows them to make optimal decisions about when to pit, which tires to use, and how to manage fuel consumption based on real-time race conditions rather than static pre-race plans.
The article emphasizes that the best strategies can vary dramatically from moment to moment during a race, making real-time AI-powered decision making crucial for competitive advantage. Teams are limited to 60 data scientists, so they must rely heavily on automated machine learning systems to process the vast amounts of sensor data and generate actionable insights during races.
The CNBC article highlights how Formula One championships are increasingly being determined by technological innovation rather than just driver skill. F1 success depends heavily on advanced data analytics, machine learning algorithms, and real-time computing capabilities. Key technological factors driving F1 success include real-time data processing where teams process millions of data points from hundreds of sensors per car during races. AI-powered strategy optimization uses machine learning algorithms to continuously analyze race conditions and recommend optimal pit stop timing, tire choices, and fuel management. Cloud computing infrastructure allows teams to rely on platforms like AWS to run complex simulations and data analysis during races. Predictive modeling employs advanced algorithms to predict tire degradation, fuel consumption, and competitor behavior. Simulation capabilities enable teams to run thousands of Monte Carlo simulations to optimize race strategies.
The technological arms race in Formula One has led to significant regulatory challenges. To maintain competitive balance and prevent larger teams from gaining insurmountable advantages through unlimited technological investment, F1 has implemented strict caps on the number of engineers and data scientists that teams can employ. Currently, teams are limited to 60 data scientists and engineers, which forces them to be highly strategic about their technological investments and resource allocation. This cap creates an interesting dynamic where teams must balance the need for sophisticated AI and machine learning capabilities with the constraint of limited human resources. As a result, teams are increasingly turning to automated systems and cloud-based solutions to maximize their technological capabilities within these constraints. The cap also levels the playing field somewhat, ensuring that success depends more on the efficiency and innovation of the technology rather than simply having more engineers and data scientists than competitors.