x = seq(-10,10,length.out = 100)
plot(x, 1/(1+exp(-x)), type='l', col="red", xlab="x", ylab="p")
1 Logistic Regression
Classification is a type of predictive modeling where the goal is to predict a categorical variable based on a set of input variables. A categorical variable is a variable that can take on only a limited number of values, such as 0 or 1, or it can be a multi-class variable, meaning it can take on more than two values. For example, in medical diagnosis, we might want to predict whether a patient has a disease (1) or not (0) based on symptoms and test results. A particularly important application is in self-driving cars, where computer vision systems must classify objects in real-time from camera feeds - distinguishing between pedestrians, other vehicles, traffic signs, and road obstacles to make safe driving decisions.
Given observed data \(\{(x_i,y_i)\}_{i=1}^n\), where each \(y_i\) is either 0 or 1, we start by assuming a binomial likelihood function for the response variable, defined as follows: \[ P(y_i \mid p_i) = p_i^{y_i} (1-p_i)^{1-y_i}, \] where \(p_i\) is the function of the inputs \(x_i\) and coefficients \(\beta\) that gives us the probability of the response variable taking on a value of 1, given the input variables. A typical approach to calculate \(p_i\) is to use the logistic function \[\begin{align*} f_{\beta}(x_i) = & \beta^Tx_i\\ p_i = & \sigma(f_{\beta}(x_i)) = \frac{e^{f_{\beta}(x_i)}}{1+e^{f_{\beta}(x_i)}}, \end{align*}\]
where \(\beta\) is a vector of parameters. The logistic function \(\sigma(\cdot)\) is a function that maps any real number to a number between zero and one.
1.1 Model Fitting
We then fit the model by minimising the negative Binomial log-likelihood. This leads us to the maximum likelihood estimator for the parameters \(\beta\) (a.k.a. the cross-entropy estimator), defined as \[ \hat \beta = \arg\min_{\beta}\mathcal{L}(\beta), \] where \[ \mathcal{L}(\beta) = -\sum_{i=1}^n \left[ y_i \log p_i + (1-y_i) \log \left ( 1-p_i \right ) \right]. \] Similar to the least squares estimator, the cross-entropy optimisation problem is convex, so it has a unique solution.
In the unconditional case, when we do not observe any inputs \(x\), the model is simply \(p_i = \beta_0\), and the cross-entropy estimator is, again, the sample mean. If we take the derivative of the above expression with respect to \(\beta_0\) and set it to zero, we get \[ - \frac{d}{d\beta_0}\sum_{i=1}^n \left[ y_i \log \left ( \beta_0 \right ) + (1-y_i) \log \left ( 1-\beta_0 \right ) \right] = -\sum_{i=1}^n \left[ \frac{y_i}{\beta_0} - \frac{1-y_i}{1-\beta_0} \right] = 0, \] which gives us the solution \[ \hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^n y_i, \] the sample mean.
Unlike in the least squares or unconditional cases, the system of equations \[ \nabla \mathcal{L}(\beta) = 0 \] is not linear and cannot be solved by inverting a matrix. However, there are efficient iterative numerical optimization algorithms that can be used to find the optimal solution. The most common one is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm, a quasi-Newton method that is particularly well-suited for optimizing the cross-entropy loss function in logistic regression.
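To make this concrete, here is a minimal sketch (not from the original text) that minimises the cross-entropy loss with R's optim() and the BFGS method; the simulated data, starting values, and variable names are illustrative assumptions.

# Sketch: logistic regression by direct minimisation of the cross-entropy with BFGS
set.seed(1)
n = 500
X = cbind(1, rnorm(n))                       # design matrix with an intercept column
beta_true = c(-0.5, 1.5)                     # assumed coefficients for the simulation
y = rbinom(n, 1, plogis(X %*% beta_true))    # simulated 0/1 responses

nll = function(beta) {                       # cross-entropy (negative log-likelihood)
  p = plogis(X %*% beta)
  -sum(y*log(p) + (1-y)*log(1-p))
}

fit = optim(c(0, 0), nll, method = "BFGS")   # quasi-Newton minimisation
fit$par                                      # close to coef(glm(y ~ X[,2], family=binomial))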
In the case when we have more than two classes \(y \in \{1,\ldots,K\}\), we simply build \(K\) models \(f_{\beta_1}(x),\ldots, f_{\beta_K}(x)\), one for each class and then use the softmax function to convert the output of each model into a number between zero and one. The softmax function is defined as follows \[ \mathrm{softmax}\left(f_{\beta_j}(x)\right) = \frac{\exp(f_{\beta_j}(x))}{\sum_{i=1}^K \exp(f_{\beta_i}(x))}. \] The softmax function is a generalization of the logistic function to the case of more than two classes. It is often used as the activation function in the output layer of neural networks for multi-class classification problems. It converts the output of each model into a probability distribution over the classes, making it suitable for multi-class classification with probabilistic outputs.
The vector of non-scaled outputs \((f_{\beta_1}(x),\ldots, f_{\beta_K}(x))\) is called the logits.
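As a small illustration (a sketch added here, not part of the original text), the softmax of a vector of logits can be computed directly; subtracting the maximum logit first is a standard numerical-stability trick that does not change the result.

# Softmax: map K logits to K probabilities that sum to one
softmax = function(f) {
  f = f - max(f)               # numerical stability only
  exp(f) / sum(exp(f))
}
softmax(c(2, 1, -1))           # roughly 0.71, 0.26, 0.04
sum(softmax(c(2, 1, -1)))      # 1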
The logistic function has a nice statistical interpretation. It is the CDF of the logistic distribution, which is a symmetric distribution with mean 0 and variance \(\pi^2/3\), thus \(p_i\) is simply a value of this CDF, evaluated at \(\beta^Tx_i\).
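In R this CDF is available as plogis(), so the two ways of computing \(p_i\) agree; the value of \(\beta^Tx_i\) below is an arbitrary illustrative number.

z = 1.3                  # an arbitrary value of beta^T x_i
exp(z)/(1 + exp(z))      # logistic function written out: 0.786
plogis(z)                # CDF of the logistic distribution: 0.786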
Further, logistic regression models the log-odds (logit) of the probability as a linear function of the predictors, which aligns with the maximum likelihood estimation framework and provides desirable statistical properties. Specifically, if we invert the logistic function, \[ p_i = \sigma(\beta^Tx_i) = \frac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}}, \] we get the log-odds \[ \log\left(\frac{p_i}{1-p_i}\right) = \beta^Tx_i. \] This means that \(\beta^Tx_i\) measures how the probability of \(y_i = 1\) changes with respect to a change in \(x_i\) on the log-odds scale. It allows us to interpret the model coefficients as log-odds ratios.
In some disciplines, such as econometrics, psychology, and the natural sciences, a normal CDF is used instead of the logistic CDF. This is done for historical reasons and because the normal CDF makes slightly different assumptions about the data, which may be more natural in some cases.
In the case of the normal CDF, the model is called probit, which stands for "probability unit", and the link function is called the probit link. The probit model is defined as \[ \Phi^{-1}(p_i) = \beta^Tx_i, \] where \(\Phi(\cdot)\) is the normal CDF.
The term probit was coined in the 1930's by biologists studying the dosage-cure rate link. We can fit a probit model using the glm function in R.
set.seed(92) # Kuzy
x = seq(-3,3,length.out = 100)
y = pnorm(x+rnorm(100))>0.5
probitModel = glm(y~x, family=binomial(link="probit"))
mc = as.double(coef(probitModel))
# we want to predict outcome for x = -1
xnew = -1
(yt = mc[1] + mc[2]*xnew)
## [1] -0.86
(pnorm(yt))
## [1] 0.19
(pred = predict(probitModel, list(x = c(xnew)), type="response"))
## 1
## 0.19
nd = dnorm(mc[1] + mc[2]*x)
plot(x,nd, type='l', col="red", xlab="x", ylab = "P(y=1)")
polygon(c(-3,x[x< -1],-1),c(0,nd[x< -1],0), col="blue")
Our prediction is the blue area which is equal to 0.195.
plot(x,y, type='p', col="red", xlab="x", ylab = "P(y=1)")
pred_probit = predict(probitModel, list(x=x), type="response")
lines(x,pred_probit, type='l')
Outside of specific fields, e.g. behavioral economics, the logistic function is a much more popular choice than the probit model. Besides the fact that the logit transform is more intuitive to work with, it also has several nice properties when we deal with multiple classes (more than 2). Also, it is computationally easier than working with normal distributions. The density function of the logistic distribution is very similar to the normal (probit) one.
logitModel = glm(y~x, family=binomial(link="logit"))
pred_logit = predict(logitModel, list(x = x), type="response")
plot(x,pred_probit, pch=20, col="red", cex=0.9, ylab="y")
lines(x,pred_logit, type='p', pch=20, cex=0.5, col="blue")
lines(x,y, type='p', pch=21, cex=0.5, bg="lightblue")
legend("bottomright",pch=20, legend=c("Logit", "Probit"), col=c("blue","red"),y.intersp = 2)
Example 1.1 (Example: NBA point spread) We will use the NBA point spread data to illustrate the logistic regression. The data is available in the NBAspread.csv
file. The data contains the point spread for each game in the NBA 2013–2014 season, as well as the outcome of the game: whether the favorite won or not. The point spread is the number of points by which the favorite is expected to win the game, as predicted by the bookmakers. We simply want to see how well the point spread predicts the outcome of the game.
We start by loading the data and visualizing it.
= read.csv("../data/NBAspread.csv")
NBA = nrow(NBA)
n head(NBA)
favwin | favscr | undscr | spread | favhome | fregion | uregion |
---|---|---|---|---|---|---|
1 | 72 | 61 | 7.0 | 0 | 3 | 4 |
1 | 82 | 74 | 7.0 | 1 | 3 | 1 |
1 | 87 | 57 | 17.0 | 1 | 3 | 3 |
0 | 69 | 70 | 9.0 | 1 | 3 | 3 |
0 | 77 | 79 | 2.5 | 0 | 2 | 3 |
1 | 91 | 65 | 9.0 | 0 | 3 | 4 |
hist(NBA$spread[NBA$favwin==1], col=5, main="", xlab="spread")
hist(NBA$spread[NBA$favwin==0], add=TRUE, col=6)
legend("topright", legend=c("favwin=1", "favwin=0"), fill=c(5,6), bty="n")
boxplot(NBA$spread ~ NBA$favwin, col=c(6,5), horizontal=TRUE, ylab="favwin", xlab="spread")
Does the Vegas point spread predict whether the favorite wins or not? The histogram shows the distribution of point spreads for games where the favorite won (turquoise) versus games where the favorite lost (purple). The boxplot provides another view of this relationship. Let’s fit a logistic regression model to quantify this relationship:
nbareg = glm(favwin~spread-1, family=binomial, data=NBA)
nbareg %>% tidy() %>% kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
spread | 0.16 | 0.01 | 11 | 0 |
s = seq(0,30,length=100)
fit = exp(s*nbareg$coef[1])/(1+exp(s*nbareg$coef[1]))
plot(s, fit, typ="l", col=4, lwd=2, ylim=c(0.5,1), xlab="spread", ylab="P(favwin)")
The coefficient \(\beta\) measures how the log-odds change. For this model, we have \(\hat\beta = 0.156\), meaning that for every one-point increase in the point spread, the log-odds of the favorite winning increase by 0.156.
Now, we can use the model to predict the probability of the favorite winning for a new game with a point spread of 8 or 4.
predict(nbareg, newdata = data.frame(spread = c(8,4)), type = "response")
## 1 2
## 0.78 0.65
The code above simply plugs the values for the new game into our logistic regression \[ P \left ( \mathrm{ favwin} \mid \mathrm{ spread} \right ) = \frac{ e^{ \beta x } }{ 1 + e^{\beta x} }. \] We can calculate it manually as well.
exp(0.156*8)/(1+exp(0.156*8))
## [1] 0.78
exp(0.156*4)/(1+exp(0.156*4))
## [1] 0.65
Check that when \(\beta =0\) we have \(p= \frac{1}{2}\).
Given our new values spread\(=8\) or spread\(=4\), the win probabilities are \(78\)% and \(65\)%, respectively. Clearly, the bigger spread means a higher chance of winning.
Notice that the predict function returns a numeric value between 0 and 1. However, if we want to make a decision (to bet or not to bet), we need a binary outcome. A simple method to move from the predicted probability to a binary value is thresholding, \[ \hat y_i = \begin{cases} 1 & \text{if } \hat p_i > \alpha \\ 0 & \text{if } \hat p_i \leq \alpha, \end{cases} \] where \(\alpha\) is a threshold value. A typical choice is \(\alpha = 0.5\).
Now let's calculate the number of correct predictions using the threshold \(\alpha = 0.5\). R has a convenient table function that can summarise the counts of the predicted and actual values in a table.
table(NBA$favwin, as.integer(predict(nbareg, type="response")>0.5), dnn=c("Actual", "Predicted"))
## Predicted
## Actual 1
## 0 131
## 1 422
Our model gets \(422/553 \approx 0.76\) of the predictions correct. This number is called the accuracy of the model.
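The accuracy can also be computed directly from the predictions, without building the table; the one-liner below is a sketch added for illustration and reuses the fitted nbareg model.

# Proportion of games where the thresholded prediction matches the actual outcome (~0.76)
mean(as.integer(predict(nbareg, type="response") > 0.5) == NBA$favwin)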
1.2 Confusion Matrix
We will analyse the tennis data set to show what the decision boundary for the logistic regression model looks like. The decision boundary is the line that separates the two classes; it is defined as the set of points where the predicted probability is 0.5. Then we will use the confusion matrix to evaluate the performance of the model.
Example 1.2 (Logistic Regression for Tennis Classification) Data science plays a major role in tennis; you can learn about recent AI tools developed by IBM from this Yahoo article.
We will analyze the Tennis Major Tournament Match Statistics Data Set from the UCI ML repository. The data set has one row per match from the four major tennis tournaments in 2013 (Australian Open, French Open, US Open, and Wimbledon).
Let’s load the data and familiarize ourselves with it
= read.csv("./../data/tennis.csv")
d dim(d)
## [1] 943 44
Let's look at a few columns of five randomly selected rows of the data
d[sample(1:943,size = 5),c("Player1","Player2","Round","Result","gender","surf")]
 | Player1 | Player2 | Round | Result | gender | surf |
---|---|---|---|---|---|---|
532 | Florian Mayer | Juan Monaco | 1 | 1 | M | Hard |
816 | L.Kubot | J.Janowicz | 5 | 0 | M | Grass |
431 | Svetlana Kuznetsova | Ekaterina Makarova | 1 | 1 | W | Clay |
568 | Marcos Baghdatis | Go Soeda | 1 | 1 | M | Hard |
216 | Mandy Minella | Anastasia Pavlyuchenkova | 2 | 0 | W | Hard |
We have data for 943 matches and for each match we have 44 columns, including names of the players, their gender, surface type and match statistics. Let’s look at the number of break points won by each player. We will plot BPW (break points won) by each player on the scatter plot and will colorize each dot according to the outcome
n = dim(d)[1]
plot(d$BPW.1+rnorm(n), d$BPW.2+rnorm(n), pch=21, col=d$Result+2, cex=0.6, bg="yellow", lwd=0.8,
     xlab="BPW by Player 1", ylab="BPW by Player 2")
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')
We can clearly see that the number of break points won is a strong predictor of the match outcome. This is obvious and follows from the rules: to win a match, a player must win break points. Now, we want to understand the impact of winning a break point on the overall match outcome. We do this by building a logistic regression model.
which(is.na(d$BPW.1)) # there is one row with NA value for the BPW.1 value and we remove it
## [1] 171
d = d[-171,]; n = dim(d)[1]
m = glm(Result ~ BPW.1 + BPW.2-1, data=d, family = "binomial" )
m %>% tidy() %>% kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
BPW.1 | 0.40 | 0.03 | 15 | 0 |
BPW.2 | -0.42 | 0.03 | -15 | 0 |
The predicted values are stored in the fitted.values
field of the model object. Those are the probabilities of player 1 winning the match. We need to convert them to binary predictions using \(0.5\) as a threshold for our classification.
table(d$Result, as.integer(m$fitted.values>0.5), dnn=c("Actual", "Predicted"))
## Predicted
## Actual 0 1
## 0 416 61
## 1 65 400
This table shows the number of correct and incorrect predictions for each class. The rows are the actual outcomes and the columns are the predicted outcomes. The first row shows the matches that player 1 lost: the model correctly predicted a loss in 416 of them and incorrectly predicted a win in 61. The second row shows the matches that player 1 won: the model correctly predicted a win in 400 of them and missed 65. Thus, our model got (416+400)/942 ≈ 0.87, i.e. about 87% of the predictions correct! The accuracy is the ratio of the number of correct predictions to the total number of predictions.
This table is called the confusion matrix. It is a table that shows the number of correct and incorrect predictions for each class: the rows are the actual outcomes and the columns are the predicted outcomes. Formally, it is defined as
 | Predicted: YES | Predicted: NO |
---|---|---|
Actual: YES | TPR | FNR |
Actual: NO | FPR | TNR |
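As an illustration (a sketch added here, reusing the fitted tennis model m and treating "player 1 wins" as the positive class), these four rates can be computed directly from the confusion matrix:

cm = table(d$Result, as.integer(m$fitted.values > 0.5))   # rows: actual, columns: predicted
TPR = cm["1","1"] / sum(cm["1",])   # true positive rate: wins predicted as wins
TNR = cm["0","0"] / sum(cm["0",])   # true negative rate: losses predicted as losses
FPR = 1 - TNR                       # false positive rate
FNR = 1 - TPR                       # false negative rate
c(TPR = TPR, TNR = TNR, FPR = FPR, FNR = FNR)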
Essentially, the logistic regression is trying to draw a line that separates the red observations from the green ones. In our case, we have two predictors \(x_1\) = BPW.1 and \(x_2\) = BPW.2 and our model is \[ \log\left(\dfrac{p}{1-p}\right) = \beta_1x_1 + \beta_2 x_2, \] where \(p\) is the probability of player 1 winning the match. We want to find the line along which the probability is 1/2, meaning that \(p/(1-p) = 1\) and the log-odds \(\log(p/(1-p)) = 0\); thus the equation for the line is \(\beta_1x_1 + \beta_2 x_2 = 0\), or \[ x_2 = \dfrac{-\beta_1}{\beta_2}x_1. \]
Let's see the line found by the glm function:

legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')
x = seq(0,30,length.out = 200)
y = -m$coefficients[1]*x/m$coefficients[2]
lines(x,y, lwd=2, col="red")
There are a couple of observations. First, the effect of a break point on the match outcome is significant and symmetric: the effect of losing a break point is about the same (in magnitude) as the effect of winning one. We can also interpret the effect of winning a break point in the following way. We will keep BPW.2 = 0 and calculate what happens to the probability of winning when BPW.1 changes from 0 to 1. The odds ratio for player 1 winning when BPW.1 = 0 is exp(0), which is 1, meaning that the probability that P1 wins is 1/2. Now when BPW.1 = 1, the odds ratio is 1.5
exp(0.4019)
## [1] 1.5
We can calculate the probability of winning from the regression equation \[
\dfrac{p}{1-p} = 1.5,~~~p = 1.5(1-p),~~~2.5p = 1.5,~~~p = 0.6.
\] Thus the probability of winning goes from 50% to 60%. We can use the predict function to get this result:
predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
## 1
## 0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(1), BPW.2 = c(0)), type="response")
## 1
## 0.6
What happens to the chances of winning when P1 wins three more break points than the opponent?
predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
## 1
## 0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(3), BPW.2 = c(0)), type="response")
## 1
## 0.77
The chances go up by 27 percentage points.
Tennis is arguably the sport in which men and women are treated most equally. Both men's and women's matches are shown during prime time on TV, and both have the same prize money. However, one comment you hear often is that women's matches are "less predictable", meaning that an upset (when the favorite loses) is more likely to happen in a women's match than in a men's match. We can test this statement by looking at the residuals. The larger the residual, the less accurate our prediction was.
outlind = which(d$res<2)
boxplot(d$res[outlind] ~ d$gender[outlind], col=c(2,3), xlab="Gender",ylab="Residual")
Let's do a formal t-test on the residuals for men's and women's matches.
= d %>% filter(res<2, gender=="M") %>% pull(res)
men = d %>% filter(res<2, gender=="W") %>% pull(res)
women t.test(men, women, alternative = "two.sided") %>% tidy() %>% kable()
estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
-0.07 | 1.2 | 1.3 | -4.7 | 0 | 811 | -0.11 | -0.04 | Welch Two Sample t-test | two.sided |
The difference of \(0.07\) between men and women and the t-statistic of \(-4.7\) mean that the crowd wisdom that women's matches are less predictable is correct. The difference is statistically significant!
1.3 ROC Curve and Confounding Variables
Using the Default data set, we will illustrate the concepts of the ROC curve and confounding variables.
Example 1.3 (Credit Card Default) We will use the Default
data set from the ISLR
package to illustrate the logistic regression. The data set contains information on credit card defaults. Credit risk assessment is a major application of logistic regression and is widely used in the financial industry. In fact, the banks with the best credit risk assessment models are the ones that are able to offer the best interest rates to their customers and are winning the market.
We will also use this example to illustrate the concept of the ROC (Receiver Operating Characteristic) curve, which helps us understand the trade-off between sensitivity and specificity as we vary the threshold \(\alpha\).
First, load the data set. We have 10,000 observations.
= read.csv("../data/CreditISLR.csv", stringsAsFactors = T)
Default head(Default)
default | student | balance | income |
---|---|---|---|
No | No | 730 | 44362 |
No | Yes | 817 | 12106 |
No | No | 1074 | 31767 |
No | No | 529 | 35704 |
No | No | 786 | 38464 |
No | Yes | 920 | 7492 |
There are three predictors in the data set: balance, income, and student. The balance variable represents the average credit card balance in dollars for each individual. This is the amount of money that a person owes on their credit card, which is a key predictor of whether they are likely to default on their credit card payments.
In the context of credit risk assessment, balance is one of the most important variables because it directly measures the amount of credit being utilized. Credit card companies and banks use this information, along with other factors like income and student status, to assess the likelihood that a customer will default on their payments. The logistic regression model we’re building will use this balance information to predict the probability of default, helping financial institutions make informed decisions about credit limits, interest rates, and risk management.
glm.fit = glm(default~balance,data=Default,family=binomial)
glm.fit %>% tidy() %>% kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.65 | 0.36 | -29 | 0 |
balance | 0.01 | 0.00 | 25 | 0 |
We use it now to predict default for a new individual with a balance of $1000.
predict.glm(glm.fit,newdata = list(balance=1000))
## 1
## -5.2
-1.065e+01 + 5.499e-03*1000
## [1] -5.2
Notice that, by default, the predict function returns the log-odds. We can convert them to probabilities using the logistic function.
predict.glm(glm.fit,newdata = list(balance=1000), type="response")
## 1
## 0.0058
exp(-1.065e+01 + 5.499e-03*1000)/(1+exp(-1.065e+01 + 5.499e-03*1000))
## [1] 0.0058
Now, let’s plot the predicted probability of default for a range of balances between 1000 and 3000.
x = list(balance=seq(1000,to = 3000,length.out = 100))
y = predict.glm(glm.fit,newdata = x, type="response")
plot(x$balance,y, pch=20, col="red", xlab = "balance", ylab="Default")
lines(Default$balance, as.integer(Default$default)-1, type='p',pch=20)
We can calculate the confusion matrix for the model to see how well it predicts the default.
y = predict.glm(glm.fit,newdata = Default, type="response")
confusion_matrix = table(Default$default, as.integer(y>0.2), dnn=c("Actual", "Predicted"))
confusion_matrix
## Predicted
## Actual 0 1
## No 9404 263
## Yes 134 199
Notice that instead of using the default threshold of 0.5, we used 0.2. This is a common practice in the financial industry. The lower the threshold, the more likely the model is to flag a default. In other words, we only approve a loan if the model predicts a default with probability 0.2 or lower.
Instead of using counts in the confusion matrix, we can use rates
# Convert to rates by dividing by row totals
confusion_matrix[1,] = confusion_matrix[1,] / sum(confusion_matrix[1,]) # Actual: NO row
confusion_matrix[2,] = confusion_matrix[2,] / sum(confusion_matrix[2,]) # Actual: YES row
confusion_matrix %>% kable(digits=3)
 | 0 | 1 |
---|---|---|
No | 0.973 | 0.027 |
Yes | 0.402 | 0.598 |
Using the rates (proportions) is a more typical way to present the confusion matrix.
For our predictions, we used the value of \(\alpha=0.2\) as a cut-off. What if we use a smaller or larger \(\alpha\), e.g. \(\alpha=0\)? We will use the ROC curve to answer this question. The ROC curve shows the relationship between sensitivity (true positive rate) and 1 - specificity (false positive rate) for different classification thresholds. Here, 1 - specificity is the proportion of negative cases that are incorrectly classified as positive. Thus, the ROC curve shows both types of errors for different values of \(\alpha\).
First, we define a function that calculates the ROC curve.
<- function(p,y, ...){
roc <- factor(y)
y <- length(p)
n <- as.vector(p)
p = seq(0,1,length.out = 100)
alpha <- p > matrix(rep(alpha,n),ncol=100,byrow=TRUE)
Q <- colMeans(!Q[y==levels(y)[1],])
specificity <- colMeans(Q[y==levels(y)[2],])
sensitivity return(list(specificity=specificity, sensitivity=sensitivity, alpha=alpha))
}
Now, let’s plot the ROC curve.
## roc curve and fitted distributions
pred = predict.glm(glm.fit,newdata = Default, type="response")
res = roc(p=pred, y=Default$default)
plot(1-res$specificity, res$sensitivity, type="l", bty="n", main="in-sample")
abline(a=0,b=1,lty=2,col=8)
def = Default$default
# our 1/5 rule cutoff
points(x= 1-mean((pred<.2)[def=="No"]),
       y=mean((pred>.2)[def=="Yes"]),
       cex=1.5, pch=20, col='red')
## a standard `max prob' (p=.5) rule
points(x= 1-mean((pred<.5)[def=="No"]),
       y=mean((pred>.5)[def=="Yes"]),
       cex=1.5, pch=20, col='blue')
legend("bottomright",fill=c("red","blue"),
       legend=c("alpha=1/5","alpha=1/2"),bty="n",title="cutoff")
ROC curve shows the trade-off between sensitivity and specificity for different values of \(\alpha\). The closer the curve is to the top-left corner, the better the model is. The diagonal line represents the random classifier (flip a coin for each prediction). The curve above the diagonal line is better than random, the curve below the diagonal line is worse than random.
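A common single-number summary of the ROC curve is the area under it (AUC); the sketch below (added for illustration) approximates it with the trapezoidal rule from the output of our roc() function.

# Approximate area under the ROC curve (AUC) by the trapezoidal rule
fpr = 1 - res$specificity
tpr = res$sensitivity
ord = order(fpr)
auc = sum(diff(fpr[ord]) * (head(tpr[ord], -1) + tail(tpr[ord], -1)) / 2)
auc     # 0.5 is random guessing, values near 1 indicate a good classifier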
Now, let’s look at other predictors. We will add income
and student
to the model.
glm.fit = glm(default~balance+income+student,data=Default,family=binomial)
glm.fit %>% tidy() %>% kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.87 | 0.49 | -22.08 | 0.00 |
balance | 0.01 | 0.00 | 24.74 | 0.00 |
income | 0.00 | 0.00 | 0.37 | 0.71 |
studentYes | -0.65 | 0.24 | -2.74 | 0.01 |
The p-values indicate that student is significant and income is not. However, the coefficient for student is negative, meaning that students have a lower probability of defaulting! This is counterintuitive, as one would expect students to have a higher probability of defaulting. The reason is that the student and balance variables are confounded. We can see this by plotting balance vs student.
boxplot(balance~student,data=Default, ylab = "balance", col = c(2,3))
We can see that students have higher balance on average. This is not surprising, as students are typically younger and have lower income and many have student loans.
Confounding occurs when the effect of one variable on the outcome is mixed with the effect of another variable, making it difficult to separate their individual contributions. In our case, student
and balance
are confounded.
Let’s adjust for balance. We will plot the predicted probability of default for a range of balances between 1000 and 2500 for students and non-students.
x1 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("Yes",100)), income=rep(40,100))
x2 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("No",100)), income=rep(40,100))
y1 = predict.glm(glm.fit,newdata = x1, type="response")
y2 = predict.glm(glm.fit,newdata = x2, type="response")
plot(x1$balance,y1, type='l', col="red", xlab="Balance", ylab = "P(Default)")
lines(x2$balance,y2, type='l', col="black")
legend("topleft",bty="n", legend=c("Not Student", "Student"), col=c("black","red"), lwd=2)
We can see that for a given balance, students are actually less likely to default than non-students. The earlier, counterintuitive marginal effect appears because students carry higher balances on average.
To summarise, what we’ve learned in this example, thus far. ROC curves visualize the trade-off between sensitivity (true positive rate) and specificity (1 - false positive rate) across different classification thresholds. Curves closer to the top-left corner indicate better model performance, while the diagonal line represents random classification (AUC = 0.5). Curves above the diagonal are better than random, below are worse. ROC curves help choose optimal classification thresholds based on business requirements rather than default 0.5 cutoff.
Confounding occurs when the effect of one variable on the outcome is mixed with the effect of another variable. In the Default dataset, student status and balance are confounded. Students have higher average balances due to student loans and lower income. Without controlling for balance, students appear more likely to default. After adjusting for balance, students are actually less likely to default. The solution is to include confounding variables in the model to isolate individual effects. Always consider variable relationships when interpreting coefficients.
Now, a natural question is how to choose the cut-off value \(\alpha\). Assume a bank is using our logistic regression model to predict the probability of a loan default and would issue a loan if \(p(y=1) < \alpha\). Here \(\alpha\) is the level of risk the bank is willing to take. If the bank chooses \(\alpha=1\) and gives loans to everyone, it is likely to lose a lot of money from defaulted accounts. If it chooses \(\alpha = 0\), it will not issue a loan to anyone and won't make any money. In order to choose an appropriate \(\alpha\), we need to know what the risks are. Assume the bank makes $0.25 on every $1 borrowed in interest and fees, and loses the entire amount of $1 if the account defaults. This leads to the following pay-off matrix
 | payer | defaulter |
---|---|---|
loan | 0.25 | -1 |
no loan | -0.25 | 0 |
We make 25 cents on the dollar on a good loan and lose everything on a default!
Given this pay-off matrix, we can calculate the expected profit for the bank.
pred = predict.glm(glm.fit,newdata = Default, type="response")
res = roc(p=pred, y=Default$default)
alpha = res$alpha
specificity = res$specificity # correctly detect default
sensitivity = res$sensitivity # correctly detect non-default
# We lose money when we make a mistake: if we deny a loan to a paying customer we lose $0.25,
# and if we issue a loan to a defaulting customer we lose $1.
profit = 0*specificity - 1*(1-specificity) + 0.25*sensitivity - 0.25*(1-sensitivity)
plot(alpha, profit, type='l', xlab = "alpha", ylab = "profit")
abline(h=0, lty=2)
Let’s find the values of \(\alpha\) for which profit is positive.
min(alpha[profit>0])
## [1] 0.02
max(alpha[profit>0])
## [1] 0.25
We have to be prudent and choose \(\alpha\) in this range. Now let’s find \(\alpha\) that maximises the profit.
# Find index of max value in profit
alpha[which.max(profit)]
## [1] 0.051
Thus, to maximise the profit, we only approve the loan if we are 95% sure that the customer will not default.
1.4 Multinomial logistic regression
Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: \(y_i \in \{0,1\}\). We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle \(y_i \in \{1,\ldots ,K\}\), where \(K\) is the number of classes. Our model took the form: \[ f(x\mid \beta)=\dfrac{1}{1+\exp(-\beta^Tx)}~, \] and the model parameters \(\beta\) were trained to minimize the loss function (negative log-likelihood) \[ J(\beta) = -\left[ \sum_{i=1}^m y_i \log f(x_i\mid \beta) + (1-y_i) \log (1-f(x_i\mid \beta)) \right]. \]
Given a test input \(x\), we want our model to estimate the probability \(p(y=k|x)\) for each value of \(k=1,\ldots ,K\). Thus, our model will output a \(K\)-dimensional vector (whose elements sum to 1) giving us our \(K\) estimated probabilities. Concretely, our model \(f(x\mid \beta)\) takes the form: \[ \begin{aligned} f(x\mid \beta) = \begin{bmatrix} p(y = 1 | x; \beta) \\ p(y = 2 | x; \beta) \\ \vdots \\ p(y = K | x; \beta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(\beta_j^T x) }} \begin{bmatrix} \exp(\beta_1^{T} x ) \\ \exp(\beta_2^{T} x ) \\ \vdots \\ \exp(\beta_K^T x ) \\ \end{bmatrix}\end{aligned} \]
Here \(\beta_i \in R^n, i=1,\ldots,K\) are the parameters of our model. Notice that the term \(1/ \sum_{j=1}^{K}{\exp(\beta_j^T x) }\) normalizes the distribution, so that it sums to one.
For convenience, we will also write \(\beta\) to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent \(\beta\) as an \(n\)-by-\(K\) matrix obtained by concatenating \(\beta_1,\beta_2,\ldots ,\beta_K\) into columns, so that \[ \beta = \left[\begin{array}{cccc}| & | & | & | \\ \beta_1 & \beta_2 & \cdots & \beta_K \\ | & | & | & | \end{array}\right]. \]
We now describe the cost function that we'll use for softmax regression. In the equation below, \(1\{\cdot\}\) is the indicator function, so that \(1\)(a true statement) = 1 and \(1\)(a false statement) = 0. For example, 1(2+3 > 4) evaluates to 1, whereas 1(1+1 == 5) evaluates to 0. Our cost function will be: \[ \begin{aligned} J(\beta) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y_i = k\right\} \log \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i)}\right] \end{aligned} \]
Notice that this generalizes the logistic regression cost function, which could also have been written: \[ \begin{aligned} J(\beta) &= - \left[ \sum_{i=1}^m (1-y_i) \log (1-f(x_i\mid \beta)) + y_i \log f(x_i\mid \beta) \right] \\ &= - \left[ \sum_{i=1}^{m} \sum_{k=0}^{1} 1\left\{y_i = k\right\} \log p(y_i = k | x_i ; \beta) \right]\end{aligned} \] The softmax cost function is similar, except that we now sum over the \(K\) different possible values of the class label. Note also that in softmax regression, we have \[ p(y_i = k | x_i ; \beta) = \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i) }. \]
Softmax regression has an unusual property: it has a redundant set of parameters. To explain what this means, suppose we take each of our parameter vectors \(\beta_j\) and subtract some fixed vector \(\psi\). Our model now estimates the class label probabilities as \[ \begin{aligned} p(y_i = k | x_i ; \beta) &= \frac{\exp((\beta_k-\psi)^T x_i)}{\sum_{j=1}^K \exp( (\beta_j-\psi)^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i) \exp(-\psi^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i) \exp(-\psi^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i)}.\end{aligned} \] In other words, subtracting \(\psi\) does not affect our model's predictions at all! This shows that softmax regression's parameters are redundant. More formally, we say that our softmax model is overparameterized, meaning that for any model we might fit to the data, there are multiple parameter settings that give rise to exactly the same model function \(f(x \mid \beta)\) mapping from inputs \(x\) to the predictions.
Further, if the cost function \(J(\beta)\) is minimized by some setting of the parameters \((\beta_1,\ldots,\beta_K)\), then it is also minimized by \((\beta_1-\psi,\ldots,\beta_K-\psi)\) for any value of \(\psi\). Thus, the minimizer of \(J(\beta)\) is not unique. Interestingly, \(J(\beta)\) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems. A simple remedy is to set \(\psi\) equal to one of the \(\beta_k\), which fixes that parameter vector at zero and removes the redundancy.
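The redundancy is easy to verify numerically. The sketch below (illustrative, with randomly generated parameters) checks that subtracting the same vector \(\psi\) from every \(\beta_k\) leaves the class probabilities unchanged.

set.seed(2)
x = rnorm(3)                          # one input with 3 features
B = matrix(rnorm(3*4), nrow = 3)      # columns are beta_1, ..., beta_K for K = 4 classes
softmax = function(f) exp(f - max(f)) / sum(exp(f - max(f)))

p1 = softmax(t(B) %*% x)              # class probabilities
psi = B[, 1]                          # shift every beta_k by psi = beta_1
p2 = softmax(t(B - psi) %*% x)        # psi is subtracted from each column of B
max(abs(p1 - p2))                     # essentially zero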
Example 1.4 (LinkedIn Study) How to Become an Executive? (Irwin 2016; Gan and Fritzler 2016)
Logistic regression was used to analyze the career paths of about \(459,000\) LinkedIn members who worked at a top 10 consultancy between 1990 and 2010. The outcome of interest was whether a member became a VP, CXO, or partner at a company with at least 200 employees. About \(64,000\) members reached this milestone, \(\hat{p} = 0.1394\), conditional on making it into the database. The goals of the analysis were the following
- Look at their profiles – educational background, gender, work experience, and career transitions.
- Build a predictive model of the probability of becoming an executive
- Provide a tool for the analysis of “what if” scenarios. For example, if you were to get a master’s degree, how would your job prospects change because of that?
Let’s build a logistic regression model with \(8\) key features (a.k.a. covariates): \[ \log\left ( \frac{p}{1-p} \right ) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_8x_8 \]
\(p\): Probability of “success” – reach VP/CXO/Partner seniority at a company with at least 200 employees.
Features to predict the “success” probability: \(x_i (i=1,2,\ldots,8)\).
Variable | Parameters |
---|---|
\(x_1\) | Metro region: whether a member has worked in one of the top 10 largest cities in the U.S. or globally. |
\(x_2\) | Gender: inferred from member names: ‘male’ or ‘female’. |
\(x_3\) | Graduate education type: whether a member has an MBA from a top U.S. program / a non-top program / a top non-U.S. program / another advanced degree. |
\(x_4\) | Undergraduate education type: whether a member has attended a school from the U.S. News national university rankings / a top 10 liberal arts college / a top 10 non-U.S. school. |
\(x_5\) | Company count: # different companies in which a member has worked. |
\(x_6\) | Function count: # different job functions in which a member has worked. |
\(x_7\) | Industry sector count: # different industries in which a member has worked. |
\(x_8\) | Years of experience: # years of work experience, including years in consulting, for a member. |
The following estimated \(\hat\beta\)s of the features were obtained. With such a large sample size, they are measured rather accurately. Recall that each location/education choice in the “Choices and Impact” analysis below corresponds to a unit change in the corresponding feature.
- Location: Metro region: 0.28
- Personal: Gender(Male): 0.31
- Education: Graduate education type: 1.16, Undergraduate education type: 0.22
- Work Experience: Company count: 0.14, Function count: 0.26, Industry sector count: -0.22, Years of experience: 0.09
Here are three main findings
- Working across job functions, like marketing or finance, is good. Each additional job function provides a boost that, on average, is equal to three years of work experience. Switching industries has a slight negative impact. Learning curve? lost relationships?
- MBAs are worth the investment. But pedigree matters. Top five program equivalent to \(13\) years of work experience!!!
- Location matters. For example, NYC helps.
We can also personalize the prediction for possible future executives. For example, Person A (p=6%): Male in Tulsa, Oklahoma, Undergraduate degree, 1 job function for 3 companies in 3 industries, 15-year experience.
Person B (p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.
Person C (p=63%): Female in New York City, Top undergraduate program, Top MBA program, 4 different job functions for 4 companies in 1 industry, 15-year experience.
Let’s re-design Person B.
Person B (p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.
- Work in one industry rather than two. Increase \(3\)%
- Undergrad from top \(10\) US program rather than top international school. \(3\)%
- Worked for \(4\) companies rather than \(2\). Another \(4\)%
- Move from London to NYC. \(4\)%
- Four job functions rather than two. \(8\)%. A \(1.5\)x effect.
- Worked for \(10\) more years. \(15\)%. A \(2\)X effect.
Choices and Impact (Person B) are shown below
1.5 Imbalanced Data
Often you have many more observations with one label than with the other; such a sample is called imbalanced. This is a common problem in real-world classification tasks where one class significantly outnumbers the other(s). For example, in fraud detection, legitimate transactions vastly outnumber fraudulent ones; in medical diagnosis, healthy patients often outnumber those with rare diseases; and in manufacturing, defective products are typically much rarer than non-defective ones.
When dealing with imbalanced data, you should avoid using accuracy as a metric to choose a model. Consider a binary classification problem with 95% of samples labeled as class 1. A naive classifier that simply assigns label 1 to every input will achieve 95% accuracy, making it appear deceptively good while being completely useless for practical purposes.
Instead, more appropriate evaluation metrics should be used. The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various classification thresholds. The Area Under the Curve (AUC) provides a single scalar value that measures the model’s ability to distinguish between classes, regardless of the chosen threshold. An AUC of 0.5 indicates random guessing, while 1.0 represents perfect classification.
The F1 score combines precision and recall into a single score, providing a balanced measure that penalizes models that are either too conservative or too aggressive: \[ F1 = 2\dfrac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \] where precision measures the proportion of true positives among predicted positives, and recall measures the proportion of true positives that were correctly identified.
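As a concrete sketch (with made-up counts, not from a data set in this chapter), precision, recall, and F1 can be computed directly from confusion-matrix counts:

# Hypothetical counts for an imbalanced problem (1000 cases, 40 true positives)
TP = 30; FP = 20; FN = 10; TN = 940

precision = TP / (TP + FP)     # 0.60: share of predicted positives that are real
recall    = TP / (TP + FN)     # 0.75: share of real positives that were caught
F1 = 2 * precision * recall / (precision + recall)      # about 0.67
c(precision = precision, recall = recall, F1 = F1)
# Note: accuracy is (TP + TN)/1000 = 0.97, which looks good despite the modest precision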
The precision-recall curve is particularly useful for imbalanced datasets, as it plots precision against recall at various thresholds, focusing on the performance of the positive class. Cohen’s Kappa measures agreement between predicted and actual classifications while accounting for agreement by chance, making it more robust to class imbalance than accuracy.
To address imbalanced data, several strategies can be employed. Data-level approaches include oversampling, where you synthetically generate more samples of the minority class using techniques like bootstrap sampling with replacement, SMOTE (Synthetic Minority Over-sampling Technique) which creates synthetic examples by interpolating between existing minority class samples, or generative models like GANs or variational autoencoders to create realistic synthetic data. Undersampling reduces the majority class samples, which is particularly effective when the dataset is large enough. Hybrid approaches combine both oversampling and undersampling techniques.
Algorithm-level approaches include cost-sensitive learning, where you assign different misclassification costs to different classes, ensemble methods using techniques like bagging or boosting that can naturally handle imbalanced data, and threshold adjustment to modify the classification threshold to optimize for specific metrics like F1-score.
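As a rough sketch of two of these ideas, class weighting via glm's weights argument and threshold adjustment, on simulated imbalanced data; the data, weights, and cut-off are illustrative assumptions rather than recommendations.

set.seed(3)
n = 5000
x = rnorm(n)
y = rbinom(n, 1, plogis(-4 + 1.5*x))     # rare positive class (roughly 5%)

# Cost-sensitive fit: up-weight the minority class so its errors cost more
w = ifelse(y == 1, round(sum(y == 0)/sum(y == 1)), 1)
fit_w = glm(y ~ x, family = binomial, weights = w)

# Threshold adjustment on the unweighted fit: lower the cut-off to catch more positives
fit = glm(y ~ x, family = binomial)
p = predict(fit, type = "response")
table(Actual = y, Predicted = as.integer(p > 0.1))   # alpha = 0.1 instead of 0.5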
The choice of approach depends on the specific problem, available data, and computational resources. It’s often beneficial to experiment with multiple techniques and evaluate their performance using appropriate metrics rather than relying solely on accuracy.
1.6 Kernel Trick
The kernel trick is particularly valuable when dealing with complex, non-linear patterns in data that cannot be separated by simple linear boundaries. Many real-world classification problems exhibit such non-linear relationships, making the kernel trick an essential tool in machine learning.
The key insight is that while many problems appear intractable in their original feature space, they often become linearly separable when mapped to higher-dimensional spaces using appropriate kernel functions. This transformation allows us to apply powerful linear classification methods to solve complex non-linear problems efficiently.
The kernel trick is a method of using a linear classifier to solve a non-linear problem. The idea is to map the data into a higher-dimensional space where it becomes linearly separable. The trick is to use a kernel function \(K(x_i,x_j)\) to calculate the inner product of two vectors in the higher-dimensional space without explicitly calculating the mappings \(\phi(x_i)\) and \(\phi(x_j)\). The kernel function is defined as \(K(x_i,x_j) = \phi(x_i)^T\phi(x_j)\). The most popular kernel functions are the polynomial kernel \(K(x_i,x_j) = (x_i^Tx_j)^d\) and the Gaussian kernel \(K(x_i,x_j) = \exp(-\gamma||x_i-x_j||^2)\). The kernel trick is used in Support Vector Machines (SVM) and Gaussian Processes (GP).
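A minimal sketch of these two kernels in R (the vectors, degree d, and \(\gamma\) below are arbitrary illustrative choices):

# Polynomial and Gaussian (RBF) kernels for a pair of vectors
poly_kernel  = function(a, b, d = 2)     (sum(a*b))^d
gauss_kernel = function(a, b, gamma = 1) exp(-gamma * sum((a - b)^2))

a = c(1, 2); b = c(0.5, -1)
poly_kernel(a, b)      # (a^T b)^2 = (-1.5)^2 = 2.25
gauss_kernel(a, b)     # exp(-||a - b||^2) = exp(-9.25)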
gencircledata = function(numSamples,radius,noise) {
  d = matrix(0,ncol = 3, nrow = numSamples); # matrix to store our generated data
  # Generate positive points inside the circle.
  for (i in 1:(numSamples/2) ) {
    r = runif(1,0, radius * 0.4);
    angle = runif(1,0, 2 * pi);
    x = r * sin(angle);
    y = r * cos(angle);
    noiseX = runif(1,-radius, radius) * noise;
    noiseY = runif(1,-radius, radius) * noise;
    d[i,] = c(0,x,y)
  }
  # Generate negative points outside the circle.
  for (i in (numSamples/2+1):numSamples ) {
    r = runif(1,radius * 0.8, radius);
    angle = runif(1,0, 2 * pi);
    x = r * sin(angle);
    y = r * cos(angle);
    noiseX = runif(1,-radius, radius) * noise;
    noiseY = runif(1,-radius, radius) * noise;
    d[i,] = c(1,x,y)
  }
  colnames(d) = c("label", "x1", "x2")
  return(d)
}
d = gencircledata(numSamples=200, radius=1, noise=0.001)
plot(d[,2],d[,3], col=d[,1]+1, pch=19, xlab="x", ylab="y")

The data on the left in Figure 1.1 is clearly not linearly separable. However, if we map it to a three-dimensional space using the transformation: \[ \begin{aligned} \phi: R^{2} & \longrightarrow R^{3} \\ \left(x_{1}, x_{2}\right) & \longmapsto\left(z_{1}, z_{2}, z_{3}\right)=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right), \end{aligned} \] and attempt to linearly separate the transformed data, the decision boundaries become hyperplanes in \(R^{3}\), expressed as \(\omega^{T} z + b = 0\). In terms of the original variables \(x\), these boundaries take the form: \[ \omega_{1} x_{1}^{2} + \omega_{2} \sqrt{2} x_{1} x_{2} + \omega_{3} x_{2}^{2} = 0, \] which corresponds to the equation of an ellipse. This demonstrates that we can apply a linear algorithm to transformed data to achieve a non-linear decision boundary with minimal effort.
Now, consider what the algorithm is actually doing. It relies solely on the Gram matrix \(K\) of the data. Once \(K\) is computed, the original data can be discarded: \[ \begin{aligned} K & = \left[\begin{array}{ccc} x_{1}^{T} x_{1} & x_{1}^{T} x_{2} & \cdots \\ x_{2}^{T} x_{1} & \ddots & \\ \vdots & & \end{array}\right]_{n \times n} = X X^{T}, \\ \text{where} \quad X & = \left[\begin{array}{c} x_{1}^{T} \\ \vdots \\ x_{n}^{T} \end{array}\right]_{n \times d}. \end{aligned} \] Here, \(X\), which contains all the data, is referred to as the design matrix.
When we map the data using \(\phi\), the Gram matrix becomes: \[ K = \left[\begin{array}{ccc} \phi\left(x_{1}\right)^{T} \phi\left(x_{1}\right) & \phi\left(x_{1}\right)^{T} \phi\left(x_{2}\right) & \cdots \\ \phi\left(x_{2}\right)^{T} \phi\left(x_{1}\right) & \ddots & \\ \vdots & & \end{array}\right]. \]
Let us compute these inner products explicitly. For vectors \(r\) and \(s\) in \(R^{3}\) corresponding to \(a\) and \(b\), respectively: \[ \begin{aligned} \langle r, s \rangle & = r_{1} s_{1} + r_{2} s_{2} + r_{3} s_{3} \\ & = a_{1}^{2} b_{1}^{2} + 2 a_{1} a_{2} b_{1} b_{2} + a_{2}^{2} b_{2}^{2} \\ & = \langle a, b \rangle^{2}. \end{aligned} \]
Thus, instead of explicitly mapping the data via \(\phi\) and then computing the inner product, we can compute it directly in one step, leaving the mapping \(\phi\) implicit. In fact, we do not even need to know \(\phi\) explicitly; all we require is the ability to compute the modified inner product. This modified inner product is called a kernel, denoted \(K(x, y)\). The matrix \(K\), which contains the kernel values for all pairs of data points, is also referred to as the kernel matrix.
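We can verify this identity numerically (a small check added for illustration; phi_map below is the explicit feature map written as a function, distinct from the phi() implemented later):

phi_map = function(a) c(a[1]^2, sqrt(2)*a[1]*a[2], a[2]^2)   # explicit feature map into R^3
a = c(1, 2); b = c(3, -1)
sum(phi_map(a) * phi_map(b))   # inner product in feature space: 1
(sum(a*b))^2                   # kernel K(a,b) = <a,b>^2: also 1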
Since the kernel itself is the primary object of interest, rather than the mapping \(\phi\), we aim to characterize kernels without explicitly relying on \(\phi\). Mercer’s Theorem provides the necessary framework for this characterization.
Let’s implement it
library("scatterplot3d")
<- function(x1, x2) {
phi <- x1^2
z1 <- sqrt(2) * x1 * x2
z2 <- x2^2
z3 return(cbind(z1, z2, z3))
}
# Generate sample 2D data (you can replace this with your actual data)
# Apply the transformation
transformed_data <- phi(d[,2], d[,3])
scatterplot3d(transformed_data, color = ifelse(d[,1] == 0, "red", "blue"), pch = 19,
              xlab = "z1 (x1^2)", ylab = "z2 (sqrt(2) * x1 * x2)", zlab = "z3 (x2^2)",
              main = "3D Scatter Plot of Transformed Data", angle=222, grid=FALSE, box=FALSE)