1  Logistic Regression

Classification is a type of predictive modeling where the goal is to predict a categorical variable based on a set of input variables. A categorical variable can take on only a limited number of values: it can be binary, taking values such as 0 or 1, or multi-class, taking more than two values. For example, in medical diagnosis, we might want to predict whether a patient has a disease (1) or not (0) based on symptoms and test results. A particularly important application is in self-driving cars, where computer vision systems must classify objects in real time from camera feeds, distinguishing between pedestrians, other vehicles, traffic signs, and road obstacles to make safe driving decisions.

Given observed data \(\{(x_i,y_i)\}_{i=1}^n\), where each \(y_i\) is either 0 or 1, we start by assuming a Bernoulli (binomial) likelihood for the response variable, defined as follows: \[ P(y_i \mid p_i) = p_i^{y_i} (1-p_i)^{1-y_i}, \] where \(p_i\) is a function of the inputs \(x_i\) and coefficients \(\beta\) that gives the probability of the response taking the value 1, given the input variables. A typical choice for \(p_i\) is the logistic function \[\begin{align*} f_{\beta}(x_i) = & \beta^Tx_i\\ p_i = & \sigma(f_{\beta}(x_i)) = \frac{e^{f_{\beta}(x_i)}}{1+e^{f_{\beta}(x_i)}}, \end{align*}\]

where \(\beta\) is a vector of parameters. The logistic function \(\sigma(\cdot)\) is a function that maps any real number to a number between zero and one.

x = seq(-10,10,length.out = 100)
plot(x, 1/(1+exp(-x)), type='l', col="red", xlab="x", ylab="p")

1.1 Model Fitting

Then we fit the model by minimising the negative Binomial log-likelihood. This leads to the maximum likelihood estimator for the parameters \(\beta\) (a.k.a. the cross-entropy estimator), defined as \[ \hat \beta = \arg\min_{\beta}\mathcal{L}(\beta), \] where \[ \mathcal{L}(\beta) = -\sum_{i=1}^n \left[ y_i \log p_i + (1-y_i) \log \left ( 1-p_i \right ) \right]. \] Similar to the least squares estimator, the cross-entropy optimisation problem is convex, so it has a unique solution.

In the unconditional case, when we do not observe any inputs \(x\), the cross-entropy estimator is again the sample mean. Writing \(p_i = \beta_0\), taking the derivative of the above expression with respect to \(\beta_0\) and setting it to zero, we get \[ - \frac{d}{d\beta_0}\sum_{i=1}^n \left[ y_i \log \left ( \beta_0 \right ) + (1-y_i) \log \left ( 1-\beta_0 \right ) \right] = -\sum_{i=1}^n \left[ \frac{y_i}{\beta_0} - \frac{1-y_i}{1-\beta_0} \right] = 0, \] which gives us the solution \[ \hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^n y_i, \] the sample mean.

Unlike the least squares estimator or the unconditional case, the system of equations \[ \nabla \mathcal{L}(\beta) = 0 \] is not linear and cannot be solved by inverting a matrix. However, there are efficient iterative numerical optimization algorithms that can be used to find the optimal solution. The most common one is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm, a quasi-Newton method that is particularly well-suited for optimizing the cross-entropy loss function in logistic regression.
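To make this concrete, here is a minimal sketch of minimising the cross-entropy with R’s optim and BFGS; the simulated data, starting values, and variable names are illustrative assumptions rather than part of any example used later in this chapter.

set.seed(1)
n = 200
X = cbind(1, rnorm(n))                   # design matrix with an intercept column
beta_true = c(-0.5, 2)
y = rbinom(n, 1, 1/(1 + exp(-X %*% beta_true)))
# negative Bernoulli log-likelihood (the cross-entropy loss above)
nll = function(beta) {
  eta = X %*% beta
  sum(log(1 + exp(eta)) - y * eta)
}
fit = optim(c(0, 0), nll, method = "BFGS")
fit$par                                  # close to coef(glm(y ~ X[,2], family = binomial))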

When we have more than two classes, \(y \in \{1,\ldots,K\}\), we simply build \(K\) models \(f_{\beta_1}(x),\ldots, f_{\beta_K}(x)\), one for each class, and then use the softmax function to convert the output of each model into a number between zero and one. The softmax function is defined as \[ \mathrm{softmax}\left(f_{\beta_j}(x)\right) = \frac{\exp(f_{\beta_j}(x))}{\sum_{i=1}^K \exp(f_{\beta_i}(x))}. \] It is a generalization of the logistic function to more than two classes: it converts the model outputs into a probability distribution over the classes, which is why it is often used as the activation function in the output layer of neural networks for multi-class classification problems.

The vector of unscaled outputs \((f_{\beta_1}(x),\ldots, f_{\beta_K}(x))\) is called the vector of logits.
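As a small illustration, here is a sketch of a softmax function in R that maps a vector of logits to probabilities; the numeric logits are made up for the example.

softmax = function(z) {
  z = z - max(z)            # subtract the maximum for numerical stability
  exp(z) / sum(exp(z))
}
softmax(c(2, 1, 0.5))       # returns probabilities that sum to one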

The logistic function has a nice statistical interpretation. It is the CDF of the logistic distribution, which is a symmetric distribution with mean 0 and variance \(\pi^2/3\), thus \(p_i\) is simply a value of this CDF, evaluated at \(\beta^Tx_i\).

Further, logistic regression models the log-odds (logit) of the probability as a linear function of the predictors, which aligns with the maximum likelihood estimation framework and provides desirable statistical properties. Specifically, if we invert the logistic function, \[ p_i = \sigma(\beta^Tx_i) = \frac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}}, \] we get the log-odds \[ \log\left(\frac{p_i}{1-p_i}\right) = \beta^Tx_i. \] This means that \(\beta^Tx_i\) measures how the probability of \(y_i = 1\) changes with \(x_i\) on the log-odds scale, and it allows us to interpret the model coefficients as log-odds ratios.

In some disciplines, such as econometrics, psychology and the natural sciences, a normal CDF is used instead of the logistic CDF. This is done partly for historical reasons and partly because the normal CDF carries slightly different assumptions about the data that might be more natural in some cases.

In the case of the normal CDF, the model is called probit, which stands for probability unit, and the link function is called the probit link. The probit model is defined as \[ \Phi^{-1}(p_i) = \beta^Tx_i, \] where \(\Phi(\cdot)\) is the normal CDF.

The term probit was coined in the 1930s by biologists studying the relationship between dosage and cure rate. We can fit a probit model using the glm function in R.

set.seed(92) # Kuzy
x = seq(-3,3,length.out = 100)
y = pnorm(x+rnorm(100))>0.5
probitModel = glm(y~x, family=binomial(link="probit"))
mc = as.double(coef(probitModel))
# we want to predict outcome for x = -1
xnew = -1
(yt = mc[1] + mc[2]*xnew)
## [1] -0.86
(pnorm(yt))
## [1] 0.19
(pred = predict(probitModel, list(x = c(xnew)), type="response"))
##    1 
## 0.19
nd = dnorm(mc[1] + mc[2]*x)
plot(x,nd, type='l', col="red", xlab="x", ylab = "P(y=1)")
polygon(c(-3,x[x< -1],-1),c(0,nd[x< -1],0), col="blue")

Our prediction is the blue area, which is approximately 0.195.

plot(x,y, type='p', col="red", xlab="x", ylab = "P(y=1)")
pred_probit = predict(probitModel, list(x=x), type="response")
lines(x,pred_probit, type='l')

Outside of specific fields, e.g. behavioral economics, the logistic function is a much more popular choice than the probit model. Besides the fact that the logit transform is more intuitive to work with, it also has several nice properties when we deal with multiple classes (more than two). It is also computationally easier than working with normal distributions. The density function of the logit is very similar to the probit one.

logitModel  = glm(y~x, family=binomial(link="logit"))
pred_logit = predict(logitModel, list(x = x), type="response")
plot(x,pred_probit, pch=20, col="red", cex=0.9, ylab="y")
lines(x,pred_logit, type='p', pch=20, cex=0.5, col="blue")
lines(x,y, type='p', pch=21, cex=0.5, bg="lightblue")
legend("bottomright",pch=20, legend=c("Logit", "Probit"), col=c("blue","red"),y.intersp = 2)

Example 1.1 (Example: NBA point spread) We will use the NBA point spread data to illustrate the logistic regression. The data is available in the NBAspread.csv file. The data contains the point spread for each game in the NBA from 2013 to 2014 season. The data also contains the outcome of the game, whether the favorite won or not. The point spread is the number of points by which the favorite is expected to win the game and is predicted by the bookmakers. We simply want to see how well the point spread predicts the outcome of the game.

We start by loading the data and visualizing it.

NBA = read.csv("../data/NBAspread.csv")
n = nrow(NBA)
head(NBA)
favwin favscr undscr spread favhome fregion uregion
1 72 61 7.0 0 3 4
1 82 74 7.0 1 3 1
1 87 57 17.0 1 3 3
0 69 70 9.0 1 3 3
0 77 79 2.5 0 2 3
1 91 65 9.0 0 3 4
hist(NBA$spread[NBA$favwin==1], col=5, main="", xlab="spread")
hist(NBA$spread[NBA$favwin==0], add=TRUE, col=6)
legend("topright", legend=c("favwin=1", "favwin=0"), fill=c(5,6), bty="n")
boxplot(NBA$spread ~ NBA$favwin, col=c(6,5), horizontal=TRUE, ylab="favwin", xlab="spread")

Does the Vegas point spread predict whether the favorite wins or not? The histogram shows the distribution of point spreads for games where the favorite won (turquoise) versus games where the favorite lost (purple). The boxplot provides another view of this relationship. Let’s fit a logistic regression model to quantify this relationship:

nbareg = glm(favwin~spread-1, family=binomial, data=NBA)
nbareg %>% tidy() %>% kable()
term estimate std.error statistic p.value
spread 0.16 0.01 11 0
s = seq(0,30,length=100)
fit = exp(s*nbareg$coef[1])/(1+exp(s*nbareg$coef[1]))
plot(s, fit, typ="l", col=4, lwd=2, ylim=c(0.5,1), xlab="spread", ylab="P(favwin)")

The \(\beta\) measures how our log-odds change. For this model, we have \(\beta = 0.156\), meaning that for every one point increase in the point spread, the log-odds of the favorite winning increase by 0.156.

Now, we can use the model to predict the probability of the favorite winning for a new game with a point spread of 8 or 4.

predict(nbareg, newdata = data.frame(spread = c(8,4)), type = "response")
##    1    2 
## 0.78 0.65

The code above simply plugs the values for the new game into our logistic regression \[ { P \left ( \mathrm{ favwin} \mid \mathrm{ spread} \right ) = \frac{ e^{ \beta x } }{ 1 + e^{\beta x} } } \] We can calculate it manually as well.

exp(0.156*8)/(1+exp(0.156*8))
## [1] 0.78
exp(0.156*4)/(1+exp(0.156*4))
## [1] 0.65

Check that when \(\beta =0\) we have \(p= \frac{1}{2}\).

Given our new values spread\(=8\) or spread\(=4\), the win probabilities are \(78\)% and \(65\)%, respectively. Clearly, the bigger spread means a higher chance of winning.

Notice that the predict function returns a numeric value between 0 and 1. However, if we want to make a decision (to bet or not to bet), we need a binary outcome. A simple method to move from a predicted probability to a binary value is thresholding, \[ \hat y_i = \begin{cases} 1 & \text{if } \hat p_i > \alpha \\ 0 & \text{if } \hat p_i \leq \alpha \end{cases} \] where \(\alpha\) is a threshold value. A typical choice is \(\alpha = 0.5\).

Now let’s calculate the number of correct predictions using threshold \(\alpha = 0.5\). R has a convenient table function that can summarise the counts of the predicted and actual values in a table.

table(NBA$favwin, as.integer(predict(nbareg, type="response")>0.5), dnn=c("Actual", "Predicted"))
##       Predicted
## Actual   1
##      0 131
##      1 422

Our model gets 422/553 = 0.763, or about 76%, of the predictions correct. This number is called the accuracy of the model.
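For reference, the same accuracy can be computed in one line from the fitted model; this sketch reuses the nbareg model and the 0.5 threshold from above.

mean((predict(nbareg, type = "response") > 0.5) == (NBA$favwin == 1))
## [1] 0.7631103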

1.2 Confusion Matrix

We will analyse the tennis data set to show what the decision boundary of a logistic regression model is. The decision boundary is the line that separates the two classes; it is defined as the set of points where the predicted probability equals 0.5. Then we will use the confusion matrix to evaluate the performance of the model.

Example 1.2 (Logistic Regression for Tennis Classification) Data science plays a major role in tennis; you can learn about recent AI tools developed by IBM from this Yahoo article.

We will analyze the Tennis Major Tournament Match Statistics Data Set from the UCI ML repository. The data set has one row per match from the four major tennis tournaments in 2013 (Australian Open, French Open, US Open, and Wimbledon).

Let’s load the data and familiarize ourselves with it

d = read.csv("./../data/tennis.csv")
dim(d)
## [1] 943  44

Let’s look at a few columns of five randomly selected rows of the data

d[sample(1:943,size = 5),c("Player1","Player2","Round","Result","gender","surf")]
Player1 Player2 Round Result gender surf
532 Florian Mayer Juan Monaco 1 1 M Hard
816 L.Kubot J.Janowicz 5 0 M Grass
431 Svetlana Kuznetsova Ekaterina Makarova 1 1 W Clay
568 Marcos Baghdatis Go Soeda 1 1 M Hard
216 Mandy Minella Anastasia Pavlyuchenkova 2 0 W Hard

We have data for 943 matches, and for each match we have 44 columns, including the names of the players, their gender, the surface type and match statistics. Let’s look at the number of break points won by each player. We will plot BPW (break points won) by each player on a scatter plot and color each dot according to the outcome

n = dim(d)[1]
plot(d$BPW.1+rnorm(n),d$BPW.2+rnorm(n), pch=21, col=d$Result+2, cex=0.6, bg="yellow", lwd=0.8,
     xlab="BPW by Player 1", ylab="BPW by Player 2")
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')

We can clearly see that the number of break points won is a strong predictor of the match outcome, which is not surprising: to win a match, a player typically must win break points. Now we want to understand the impact of winning a break point on the overall match outcome. We do this by building a logistic regression model

which(is.na(d$BPW.1)) # there is one row with NA value for the BPW.1 value and we remove it
## [1] 171
d = d[-171,]; n = dim(d)[1]
m = glm(Result ~ BPW.1 + BPW.2-1, data=d, family = "binomial" )
m %>% tidy() %>% kable()
term estimate std.error statistic p.value
BPW.1 0.40 0.03 15 0
BPW.2 -0.42 0.03 -15 0

The predicted values are stored in the fitted.values field of the model object. Those are the probabilities of player 1 winning the match. We need to convert them to binary predictions using \(0.5\) as a threshold for our classification.

table(d$Result, as.integer(m$fitted.values>0.5), dnn=c("Actual", "Predicted"))
##       Predicted
## Actual   0   1
##      0 416  61
##      1  65 400

This table shows the number of correct and incorrect predictions for each class. The rows are the actual outcomes and the columns are the predicted outcomes. The first row shows the matches that player 1 lost: the model correctly predicted a loss in 416 of them and incorrectly predicted a win in 61. The second row shows the matches that player 1 won: the model was correct in 400 and wrong in 65. Thus, our model got (416+400)/942 = 0.866, or about 87%, of the predictions correct! The accuracy is the ratio of the number of correct predictions to the total number of predictions.

This table is called the confusion matrix: it summarises the correct and incorrect predictions for each class, with the actual outcomes in the rows and the predicted outcomes in the columns. Formally, it is defined as

Confusion Matrix. TPR - True Positive Rate, FPR - False Positive Rate, TNR - True Negative Rate, FNR - False Negative Rate.
Predicted: YES Predicted: NO
Actual: YES TPR FNR
Actual: NO FPR TNR
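As a sketch of how these rates can be read off in R, the code below recomputes the tennis confusion matrix from the model m above and converts its cells into the four rates (treating “player 1 wins” as the positive class).

cm = table(d$Result, as.integer(m$fitted.values > 0.5))
TPR = cm["1","1"] / sum(cm["1",])   # true positive rate (sensitivity)
FNR = cm["1","0"] / sum(cm["1",])   # false negative rate
FPR = cm["0","1"] / sum(cm["0",])   # false positive rate
TNR = cm["0","0"] / sum(cm["0",])   # true negative rate (specificity)
round(c(TPR = TPR, FNR = FNR, FPR = FPR, TNR = TNR), 3)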

Essentially, the logistic regression is trying to draw a line that separates the red observations from the green ones. In our case, we have two predictors \(x_1\) = BPW.1 and \(x_2\) = BPW.2 and our model is \[ \log\left(\dfrac{p}{1-p}\right) = \beta_1x_1 + \beta_2 x_2, \] where \(p\) is the probability of player 1 winning the match. We want to find the line along which the probability is 1/2, meaning that \(p/(1-p) = 1\) and the log-odds \(\log(p/(1-p)) = 0\); thus the equation for the line is \(\beta_1x_1 + \beta_2 x_2 = 0\), or \[ x_2 = \dfrac{-\beta_1}{\beta_2}x_1 \]

Let’s see the line found by the glm function

legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')

x = seq(0,30,length.out = 200)
y  =  -m$coefficients[1]*x/m$coefficients[2]
lines(x,y, lwd=2, col="red") 

There are a couple of observations. First, the effect of a break point on the game outcome is significant and symmetric: the effect of losing a break point is the same as the effect of winning one. We can also interpret the effect of winning a break point in the following way. We keep BPW.2 = 0 and calculate what happens to the probability of winning when BPW.1 changes from 0 to 1. The odds of player 1 winning when BPW.1 = 0 are exp(0) = 1, meaning that the probability that P1 wins is 1/2. When BPW.1 = 1, the odds are 1.5

exp(0.4019)
## [1] 1.5

We can calculate the probability of winning from the regression equation \[ \dfrac{p}{1-p} = 1.5,~~~p = 1.5(1-p),~~~2.5p = 1.5,~~~p = 0.6 \] Thus the probability of winning goes from 50% to 60%. We can use the predict function to get this result

predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
##   1 
## 0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(1), BPW.2 = c(0)), type="response")
##   1 
## 0.6

What happens to the chances of winning when P1 wins three more break points than the opponent?

predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
##   1 
## 0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(3), BPW.2 = c(0)), type="response")
##    1 
## 0.77

The chances go up by 27 percentage points.

Tennis is arguably the sport in which men and women are treated most equally. Both men’s and women’s matches are shown during prime time on TV, and both have the same prize money. However, one comment you hear often is that women’s matches are “less predictable”, meaning that an upset (when the favorite loses) is more likely to happen in a women’s match than in a men’s match. We can test this statement by looking at the residuals. The larger the residual, the less accurate our prediction was.
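The residual column res used below is not created in the code shown in this section; a plausible definition, assuming we work with the absolute deviance residuals of the fitted model m, is sketched here.

# Assumed definition (not shown above): absolute deviance residuals of the model m
d$res = abs(residuals(m, type = "deviance"))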

outlind = which(d$res<2)
boxplot(d$res[outlind] ~ d$gender[outlind], col=c(2,3), xlab="Gender",ylab="Residual")

Let’s do a formal t-test on the residuals for men’s and women’s matches

men = d %>% filter(res<2, gender=="M") %>% pull(res)
women = d %>% filter(res<2, gender=="W") %>% pull(res)
t.test(men, women, alternative = "two.sided") %>% tidy() %>% kable()
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
-0.07 1.2 1.3 -4.7 0 811 -0.11 -0.04 Welch Two Sample t-test two.sided

The difference of \(0.07\) between men and women and the t-statistic of \(-4.7\) mean that the crowd wisdom that women’s matches are less predictable is correct. The difference is statistically significant!

1.3 ROC Curve and Confounding Variables

Using the Default data set, we will illustrate the concepts of the ROC curve and confounding variables.

Example 1.3 (Credit Card Default) We will use the Default data set from the ISLR package to illustrate the logistic regression. The data set contains information on credit card defaults. Credit risk assessment is a major application of logistic regression and is widely used in the financial industry. In fact, the banks with the best credit risk assessment models are the ones that are able to offer the best interest rates to their customers and are winning the market.

We will also use this example to illustrate the concept of the ROC (Receiver Operating Characteristic) curve, which helps us understand the trade-off between sensitivity and specificity as we vary the threshold \(\alpha\).

First, load the data set. We have 10,000 observations.

Default = read.csv("../data/CreditISLR.csv", stringsAsFactors = T)
head(Default)
default student balance income
No No 730 44362
No Yes 817 12106
No No 1074 31767
No No 529 35704
No No 786 38464
No Yes 920 7492

There are three predictors in the data set: balance, income and student. The balance variable represents the average credit card balance in dollars for each individual. This is the amount of money that a person owes on their credit card, which is a key predictor of whether they are likely to default on their credit card payments.

In the context of credit risk assessment, balance is one of the most important variables because it directly measures the amount of credit being utilized. Credit card companies and banks use this information, along with other factors like income and student status, to assess the likelihood that a customer will default on their payments. The logistic regression model we’re building will use this balance information to predict the probability of default, helping financial institutions make informed decisions about credit limits, interest rates, and risk management.

glm.fit=glm(default~balance,data=Default,family=binomial)
glm.fit %>% tidy() %>% kable()
term estimate std.error statistic p.value
(Intercept) -10.65 0.36 -29 0
balance 0.01 0.00 25 0

We use it now to predict default for a new individual with a balance of $1000.

predict.glm(glm.fit,newdata = list(balance=1000))
##    1 
## -5.2
-1.065e+01 + 5.499e-03*1000
## [1] -5.2

Notice, that by default the predict function returns the log-odds. We can convert it to the probability using the logistic function.

predict.glm(glm.fit,newdata = list(balance=1000), type="response")
##      1 
## 0.0058
exp(-1.065e+01 + 5.499e-03*1000)/(1+exp(-1.065e+01 + 5.499e-03*1000))
## [1] 0.0058

Now, let’s plot the predicted probability of default for a range of balances between 1000 and 3000.

x = list(balance=seq(1000,to = 3000,length.out = 100))
y = predict.glm(glm.fit,newdata = x, type="response")
plot(x$balance,y, pch=20, col="red", xlab = "balance", ylab="Default")
lines(Default$balance, as.integer(Default$default)-1, type='p',pch=20)

We can calculate the confusion matrix for the model to see how well it predicts the default.

y = predict.glm(glm.fit,newdata = Default, type="response")
confusion_matrix = table(Default$default, as.integer(y>0.2), dnn=c("Actual", "Predicted"))
confusion_matrix
##       Predicted
## Actual    0    1
##    No  9404  263
##    Yes  134  199

Notice that instead of using the default threshold of 0.5, we used 0.2. This is a common practice in the financial industry. The lower the threshold, the more likely the model is to flag a default. In other words, we only approve a loan if the model predicts a default with probability 0.2 or lower.

Instead of using counts in the confusion matrix, we can use rates

# Convert to rates by dividing by row totals
confusion_matrix[1,] = confusion_matrix[1,] / sum(confusion_matrix[1,])  # Actual: NO row
confusion_matrix[2,] = confusion_matrix[2,] / sum(confusion_matrix[2,])  # Actual: YES row

confusion_matrix %>% kable(digits=3)
0 1
No 0.97 0.027
Yes 0.40 0.598

Using the rates (proportions) is a more typical way to present the confusion matrix.

For our predictions, we used the value of \(\alpha=0.2\) as a cut-off. What if we use a smaller or larger \(\alpha\), e.g. \(\alpha=0\)? We will use the ROC curve to answer this question. The ROC curve shows the relationship between sensitivity (true positive rate) and 1 - specificity (false positive rate) for different classification thresholds. Here, 1 - specificity is the proportion of negative cases that are incorrectly classified as positive. Thus, the ROC curve shows both types of errors for different values of \(\alpha\).

First, we define a function, that calculates the ROC curve.

roc <- function(p,y, ...){
  y <- factor(y)
  n <- length(p)
  p <- as.vector(p)
  alpha = seq(0,1,length.out = 100)
  Q <- p > matrix(rep(alpha,n),ncol=100,byrow=TRUE)
  specificity <- colMeans(!Q[y==levels(y)[1],])
  sensitivity <- colMeans(Q[y==levels(y)[2],])
  return(list(specificity=specificity, sensitivity=sensitivity, alpha=alpha))
}

Now, let’s plot the ROC curve.

## roc curve and fitted distributions
pred = predict.glm(glm.fit,newdata = Default, type="response")
res = roc(p=pred, y=Default$default)
plot(1-res$specificity, res$sensitivity, type="l", bty="n", main="in-sample")
abline(a=0,b=1,lty=2,col=8)
def = Default$default
# our 1/5 rule cutoff
points(x= 1-mean((pred<.2)[def=="No"]), 
       y=mean((pred>.2)[def=="Yes"]), 
       cex=1.5, pch=20, col='red') 
## a standard `max prob' (p=.5) rule
points(x= 1-mean((pred<.5)[def=="No"]), 
       y=mean((pred>.5)[def=="Yes"]), 
       cex=1.5, pch=20, col='blue') 
legend("bottomright",fill=c("red","blue"),
       legend=c("alpha=1/5","alpha=1/2"),bty="n",title="cutoff")

ROC curve shows the trade-off between sensitivity and specificity for different values of \(\alpha\). The closer the curve is to the top-left corner, the better the model is. The diagonal line represents the random classifier (flip a coin for each prediction). The curve above the diagonal line is better than random, the curve below the diagonal line is worse than random.
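A common single-number summary of the ROC curve is the area under it (AUC); the sketch below approximates it with the trapezoidal rule from the roc() output computed above.

ord = order(1 - res$specificity)                  # sort the curve by false positive rate
fpr = (1 - res$specificity)[ord]
tpr = res$sensitivity[ord]
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)   # trapezoidal approximation of the AUC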

Now, let’s look at other predictors. We will add income and student to the model.

glm.fit=glm(default~balance+income+student,data=Default,family=binomial)
glm.fit %>% tidy() %>% kable()
term estimate std.error statistic p.value
(Intercept) -10.87 0.49 -22.08 0.00
balance 0.01 0.00 24.74 0.00
income 0.00 0.00 0.37 0.71
studentYes -0.65 0.24 -2.74 0.01

The p-values indicate that student is significant and income is not. However, the coefficient for student is negative, meaning that students have a lower probability of defaulting! This is counterintuitive, as one would expect students to have a higher probability of defaulting. The reason is that the student variable and balance are confounded. We can see this by plotting balance against student status.

boxplot(balance~student,data=Default, ylab = "balance", col = c(2,3))

We can see that students have higher balance on average. This is not surprising, as students are typically younger and have lower income and many have student loans.

Confounding occurs when the effect of one variable on the outcome is mixed with the effect of another variable, making it difficult to separate their individual contributions. In our case, student and balance are confounded.

Let’s adjust for balance. We will plot the predicted probability of default for a range of balances between 1000 and 2500 for students and non-students.

x1 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("Yes",100)), income=rep(40,100))
x2 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("No",100)), income=rep(40,100))
y1 = predict.glm(glm.fit,newdata = x1, type="response")
y2 = predict.glm(glm.fit,newdata = x2, type="response")
plot(x1$balance,y1, type='l', col="red", xlab="Balance", ylab = "P(Default)")
lines(x2$balance,y2, type='l', col="black")
legend("topleft",bty="n", legend=c("Not Student", "Student"), col=c("black","red"), lwd=2)

We can see that, for a given balance, students are actually less likely to default than non-students. The apparent contradiction arises because students carry higher balances on average.

To summarise, what we’ve learned in this example, thus far. ROC curves visualize the trade-off between sensitivity (true positive rate) and specificity (1 - false positive rate) across different classification thresholds. Curves closer to the top-left corner indicate better model performance, while the diagonal line represents random classification (AUC = 0.5). Curves above the diagonal are better than random, below are worse. ROC curves help choose optimal classification thresholds based on business requirements rather than default 0.5 cutoff.

Confounding occurs when the effect of one variable on the outcome is mixed with the effect of another variable. In the Default dataset, student status and balance are confounded. Students have higher average balances due to student loans and lower income. Without controlling for balance, students appear more likely to default. After adjusting for balance, students are actually less likely to default. The solution is to include confounding variables in the model to isolate individual effects. Always consider variable relationships when interpreting coefficients.

Now, a natural question is how to choose the cut-off value \(\alpha\). Assume a bank is using our logistic regression model to predict the probability of a loan default and would issue a loan if \(p(y=1) < \alpha\). Here \(\alpha\) is the level of risk the bank is willing to take. If the bank chooses \(\alpha=1\) and gives loans to everyone, it is likely to lose a lot of money from defaulted accounts. If it chooses \(\alpha = 0\), it will not issue a loan to anyone and won't make any money. In order to choose an appropriate \(\alpha\), we need to know the risks. Assume the bank makes $0.25 in interest and fees on every $1 borrowed and loses the entire $1 if the account defaults. This leads to the following pay-off matrix

Pay-off matrix for a loan
payer defaulter
loan 0.25 -1
no loan -0.25 0

We make 25 cents on a dollar on a good loan and lose everything on a default!

Given this pay-off matrix, we can calculate the expected profit for the bank.

pred = predict.glm(glm.fit,newdata = Default, type="response")
res = roc(p=pred, y=Default$default)
alpha = res$alpha
specificity = res$specificity # fraction of non-defaulters with p <= alpha (correctly approved)
sensitivity = res$sensitivity # fraction of defaulters with p > alpha (correctly denied)
# Expected pay-off at each threshold alpha, based on the pay-off matrix above
profit = 0*specificity - 1*(1-specificity) + 0.25*sensitivity - 0.25*(1-sensitivity)
plot(alpha, profit, type='l', xlab = "alpha", ylab = "profit")
abline(h=0, lty=2)

Let’s find the values of \(\alpha\) for which profit is positive.

min(alpha[profit>0])
## [1] 0.02
max(alpha[profit>0])
## [1] 0.25

We have to be prudent and choose \(\alpha\) in this range. Now let’s find \(\alpha\) that maximises the profit.

# Find index of max value in profit
alpha[which.max(profit)]
## [1] 0.051

Thus, to maximise the profit, we only approve the loan if we are 95% sure that the customer will not default.

1.4 Multinomial logistic regression

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: \(y_i \in \{0,1\}\). Such a classifier can be used, for example, to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle \(y_i \in \{1,\ldots ,K\}\), where \(K\) is the number of classes. Our model took the form: \[ f(x\mid \beta)=\dfrac{1}{1+\exp(-\beta^Tx)}~, \] and the model parameters \(\beta\) were trained to minimize the loss function (negative log-likelihood) \[ J(\beta) = -\left[ \sum_{i=1}^m y_i \log f(x_i\mid \beta) + (1-y_i) \log (1-f(x_i\mid \beta)) \right] \]

Given a test input \(x\), we want our model to estimate the probability \(p(y=k|x)\) for each value of \(k=1,\ldots ,K\). Thus, our model will output a \(K\)-dimensional vector (whose elements sum to 1) giving us our \(K\) estimated probabilities. Concretely, our model \(f(x\mid \beta)\) takes the form: \[ \begin{aligned} f(x\mid \beta) = \begin{bmatrix} p(y = 1 | x; \beta) \\ p(y = 2 | x; \beta) \\ \vdots \\ p(y = K | x; \beta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(\beta_j^T x) }} \begin{bmatrix} \exp(\beta_1^{T} x ) \\ \exp(\beta_2^{T} x ) \\ \vdots \\ \exp(\beta_K^T x ) \\ \end{bmatrix}\end{aligned} \]

Here \(\beta_i \in R^n, i=1,\ldots,K\) are the parameters of our model. Notice that the term \(1/ \sum_{j=1}^{K}{\exp(\beta_j^T x) }\) normalizes the distribution, so that it sums to one.

For convenience, we will also write \(\beta\) to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent \(\beta\) as an \(n\)-by-\(K\) matrix obtained by concatenating \(\beta_1,\beta_2,\ldots ,\beta_K\) into columns, so that \[ \beta = \left[\begin{array}{cccc}| & | & | & | \\ \beta_1 & \beta_2 & \cdots & \beta_K \\ | & | & | & | \end{array}\right]. \]

We now describe the cost function that we’ll use for softmax regression. In the equation below, \(1\{\cdot\}\) is the indicator function, so that \(1\)(a true statement)=1, and \(1\)(a false statement)=0. For example, 1(2+3 > 4) evaluates to 1, whereas 1(1+1 == 5) evaluates to 0. Our cost function will be: \[ \begin{aligned} J(\beta) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y_i = k\right\} \log \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i)}\right] \end{aligned} \]

Notice that this generalizes the logistic regression cost function, which could also have been written: \[ \begin{aligned} J(\beta) &= - \left[ \sum_{i=1}^m (1-y_i) \log (1-f(x_i\mid \beta)) + y_i \log f(x_i\mid \beta) \right] \\ &= - \left[ \sum_{i=1}^{m} \sum_{k=0}^{1} 1\left\{y_i = k\right\} \log p(y_i = k | x_i ; \beta) \right]\end{aligned} \] The softmax cost function is similar, except that we now sum over the \(K\) different possible values of the class label. Note also that in softmax regression, we have \[ p(y_i = k | x_i ; \beta) = \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i) }. \]

Softmax regression has an unusual property that it has a redundant set of parameters. To explain what this means, suppose we take each of our parameter vectors \(\beta_j\), and subtract some fixed vector \(\psi\). Our model now estimates the class label probabilities as \[ \begin{aligned} p(y_i = k | x_i ; \beta) &= \frac{\exp((\beta_k-\psi)^T x_i)}{\sum_{j=1}^K \exp( (\beta_j-\psi)^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i) \exp(-\psi^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i) \exp(-\psi^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_j^T x_i)}.\end{aligned} \] In other words, subtracting \(\psi\) does not affect our model’s predictions at all! This shows that softmax regression’s parameters are redundant. More formally, we say that our softmax model is overparameterized, meaning that for any model we might fit to the data, there are multiple parameter settings that give rise to exactly the same model function \(f(x \mid \beta)\) mapping from inputs \(x\) to the predictions.

Further, if the cost function \(J(\beta)\) is minimized by some setting of the parameters \((\beta_1,\ldots,\beta_K)\), then it is also minimized by \((\beta_1-\psi,\ldots,\beta_K-\psi)\) for any value of \(\psi\). Thus, the minimizer of \(J(\beta)\) is not unique. Interestingly, \(J(\beta)\) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton’s method to run into numerical problems. A simple remedy is to set \(\psi = \beta_K\), which makes \(\beta_K - \psi = 0\) and effectively removes one of the parameter vectors.
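In R, a multinomial logistic regression can be fit, for example, with nnet::multinom; the sketch below uses the built-in iris data (three classes) purely for illustration.

library(nnet)
fit = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
head(predict(fit, type = "probs"))   # each row is a probability vector over the K = 3 classes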

Example 1.4 (LinkedIn Study) How to Become an Executive (Irwin 2016; Gan and Fritzler 2016)?

Logistic regression was used to analyze the career paths of about \(459,000\) LinkedIn members who worked at a top 10 consultancy between 1990 and 2010 and became a VP, CXO, or partner at a company with at least 200 employees. About \(64,000\) members reached this milestone, \(\hat{p} = 0.1394\), conditional on making it into the database. The goals of the analysis were the following

  1. Look at their profiles – educational background, gender, work experience, and career transitions.
  2. Build a predictive model of the probability of becoming an executive
  3. Provide a tool for the analysis of “what if” scenarios. For example, if you were to get a master’s degree, how would your job prospects change?

Let’s build a logistic regression model with \(8\) key features (a.k.a. covariates): \[ \log\left ( \frac{p}{1-p} \right ) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_8x_8 \]

  1. \(p\): Probability of “success” – reach VP/CXO/Partner seniority at a company with at least 200 employees.

  2. Features to predict the “success” probability: \(x_i (i=1,2,\ldots,8)\).

    Variable Description
    \(x_1\) Metro region: whether a member has worked in one of the top 10 largest cities in the U.S. or globally.
    \(x_2\) Gender: Inferred from member names: ‘male’, or ‘female’
    \(x_3\) Graduate education type: whether a member has an MBA from a top U.S. program / a non-top program / a top non-U.S. program / another advanced degree
    \(x_4\) Undergraduate education type: whether a member has attended a school from the U.S. News national university rankings / a top 10 liberal arts college /a top 10 non-U.S. school
    \(x_5\) Company count: # different companies in which a member has worked
    \(x_6\) Function count: # different job functions in which a member has worked
    \(x_7\) Industry sector count: # different industries in which a member has worked
    \(x_8\) Years of experience: # years of work experience, including years in consulting, for a member.

The following estimated \(\hat\beta\)s for the features were obtained. With a sample size of this magnitude, they are measured rather accurately. Recall that each location/education choice in the “Choices and Impact” analysis below corresponds to a unit change in the corresponding feature.

  1. Location: Metro region: 0.28
  2. Personal: Gender(Male): 0.31
  3. Education: Graduate education type: 1.16, Undergraduate education type: 0.22
  4. Work Experience: Company count: 0.14, Function count: 0.26, Industry sector count: -0.22, Years of experience: 0.09

Here are three main findings

  1. Working across job functions, like marketing or finance, is good. Each additional job function provides a boost that, on average, is equal to three years of work experience. Switching industries has a slight negative impact, perhaps because of the learning curve or lost relationships.
  2. MBAs are worth the investment. But pedigree matters. Top five program equivalent to \(13\) years of work experience!!!
  3. Location matters. For example, NYC helps.

We can also personalize the prediction for possible future executives. For example, Person A (p=6%): Male in Tulsa, Oklahoma, Undergraduate degree, 1 job function for 3 companies in 3 industries, 15-year experience.

Person B (p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.

Person C (p=63%): Female in New York City, Top undergraduate program, Top MBA program, 4 different job functions for 4 companies in 1 industry, 15-year experience.

Let’s re-design Person B.

Person B (p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.

  1. Work in one industry rather than two. Increase \(3\)%
  2. Undergrad from top \(10\) US program rather than top international school. \(3\)%
  3. Worked for \(4\) companies rather than \(2\). Another \(4\)%
  4. Move from London to NYC. \(4\)%
  5. Four job functions rather than two. \(8\)%. A \(1.5\)x effect.
  6. Worked for \(10\) more years. \(15\)%. A \(2\)X effect.
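The arithmetic behind these “what if” increments is just a log-odds update in the model equation; here is a sketch for item 4 (the move from London to NYC), using the 15% baseline for Person B and the 0.28 metro-region coefficient reported above.

p0 = 0.15                              # Person B's baseline probability
logit0 = log(p0 / (1 - p0))            # baseline log-odds
p1 = 1 / (1 + exp(-(logit0 + 0.28)))   # add the metro-region coefficient to the log-odds
round(p1 - p0, 2)                      # roughly a 4 percentage point increase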

Choices and Impact (Person B) are shown below.

1.5 Imbalanced Data

Often, you have many more observations with one label than with the other; such a sample is called imbalanced. This is a common problem in real-world classification tasks where one class significantly outnumbers the other(s). For example, in fraud detection, legitimate transactions vastly outnumber fraudulent ones; in medical diagnosis, healthy patients often outnumber those with rare diseases; and in manufacturing, defective products are typically much rarer than non-defective ones.

When dealing with imbalanced data, you should avoid using accuracy as a metric to choose a model. Consider a binary classification problem with 95% of samples labeled as class 1. A naive classifier that simply assigns label 1 to every input will achieve 95% accuracy, making it appear deceptively good while being completely useless for practical purposes.

Instead, more appropriate evaluation metrics should be used. The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various classification thresholds. The Area Under the Curve (AUC) provides a single scalar value that measures the model’s ability to distinguish between classes, regardless of the chosen threshold. An AUC of 0.5 indicates random guessing, while 1.0 represents perfect classification.

The F1 score combines precision and recall into a single score, providing a balanced measure that penalizes models that are either too conservative or too aggressive: \[ F1 = 2\dfrac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \] where precision measures the proportion of true positives among predicted positives, and recall measures the proportion of true positives that were correctly identified.
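As a small sketch, precision, recall, and the F1 score can be computed directly from a confusion matrix; the counts below are synthetic and used only for illustration.

# rows = actual, columns = predicted, positive class = "1"; the counts are made up
cm = matrix(c(90, 10, 5, 15), nrow = 2, byrow = TRUE,
            dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))
precision = cm["1", "1"] / sum(cm[, "1"])   # true positives / predicted positives
recall    = cm["1", "1"] / sum(cm["1", ])   # true positives / actual positives
F1 = 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1 = F1)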

The precision-recall curve is particularly useful for imbalanced datasets, as it plots precision against recall at various thresholds, focusing on the performance of the positive class. Cohen’s Kappa measures agreement between predicted and actual classifications while accounting for agreement by chance, making it more robust to class imbalance than accuracy.

To address imbalanced data, several strategies can be employed. Data-level approaches include oversampling, where you synthetically generate more samples of the minority class using techniques like bootstrap sampling with replacement, SMOTE (Synthetic Minority Over-sampling Technique) which creates synthetic examples by interpolating between existing minority class samples, or generative models like GANs or variational autoencoders to create realistic synthetic data. Undersampling reduces the majority class samples, which is particularly effective when the dataset is large enough. Hybrid approaches combine both oversampling and undersampling techniques.
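A minimal sketch of naive random oversampling, assuming we reuse the Default data from the previous example and simply resample minority-class rows with replacement (SMOTE and generative models are more refined alternatives):

set.seed(1)
minority = which(Default$default == "Yes")
extra = sample(minority, size = 4 * length(minority), replace = TRUE)  # 4x is an arbitrary choice
balancedDefault = Default[c(1:nrow(Default), extra), ]
table(balancedDefault$default)   # the minority class is now five times its original count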

Algorithm-level approaches include cost-sensitive learning, where you assign different misclassification costs to different classes, ensemble methods using techniques like bagging or boosting that can naturally handle imbalanced data, and threshold adjustment to modify the classification threshold to optimize for specific metrics like F1-score.
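For cost-sensitive learning, one simple option in R is to give minority-class observations larger prior weights when fitting the logistic regression; the weight of 5 below is an illustrative assumption rather than a recommendation, and the sketch again reuses the Default data.

w = ifelse(Default$default == "Yes", 5, 1)   # up-weight the rare default class
glm.weighted = glm(default ~ balance + income + student,
                   data = Default, family = binomial, weights = w)
coef(glm.weighted)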

The choice of approach depends on the specific problem, available data, and computational resources. It’s often beneficial to experiment with multiple techniques and evaluate their performance using appropriate metrics rather than relying solely on accuracy.

1.6 Kernel Trick

The kernel trick is particularly valuable when dealing with complex, non-linear patterns in data that cannot be separated by simple linear boundaries. Many real-world classification problems exhibit such non-linear relationships, making the kernel trick an essential tool in machine learning.

The key insight is that while many problems appear intractable in their original feature space, they often become linearly separable when mapped to higher-dimensional spaces using appropriate kernel functions. This transformation allows us to apply powerful linear classification methods to solve complex non-linear problems efficiently.

The kernel trick is a method of using a linear classifier to solve a non-linear problem. The idea is to map the data into a higher-dimensional space, where it becomes linearly separable. The trick is to use a kernel function \(K(x_i,x_j)\) to calculate the inner product of two vectors in the higher-dimensional space without explicitly calculating the mappings \(\phi(x_i)\) and \(\phi(x_j)\). The kernel function is defined as \(K(x_i,x_j) = \phi(x_i)^T\phi(x_j)\). The most popular kernel functions are the polynomial kernel \(K(x_i,x_j) = (x_i^Tx_j)^d\) and the Gaussian kernel \(K(x_i,x_j) = \exp(-\gamma||x_i-x_j||^2)\). The kernel trick is used in Support Vector Machines (SVM) and Gaussian Processes (GP).
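As a sketch, here is how a Gaussian (RBF) kernel matrix can be computed in R; the data matrix and the value of gamma are made up for illustration.

rbf_kernel = function(X, gamma = 1) {
  D2 = as.matrix(dist(X))^2    # squared Euclidean distances between the rows of X
  exp(-gamma * D2)             # K[i,j] = exp(-gamma * ||x_i - x_j||^2)
}
X = matrix(rnorm(20), ncol = 2)  # 10 points in R^2
K = rbf_kernel(X, gamma = 0.5)
dim(K)                           # a 10 x 10 kernel (Gram) matrix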

gencircledata = function(numSamples, radius, noise) {
    d = matrix(0, ncol = 3, nrow = numSamples)  # matrix to store our generated data
    # Generate positive points inside the circle.
    for (i in 1:(numSamples/2)) {
        r = runif(1, 0, radius * 0.4)
        angle = runif(1, 0, 2 * pi)
        x = r * sin(angle)
        y = r * cos(angle)
        noiseX = runif(1, -radius, radius) * noise
        noiseY = runif(1, -radius, radius) * noise
        d[i,] = c(0, x + noiseX, y + noiseY)
    }
    # Generate negative points outside the circle.
    for (i in (numSamples/2 + 1):numSamples) {
        r = runif(1, radius * 0.8, radius)
        angle = runif(1, 0, 2 * pi)
        x = r * sin(angle)
        y = r * cos(angle)
        noiseX = runif(1, -radius, radius) * noise
        noiseY = runif(1, -radius, radius) * noise
        d[i,] = c(1, x + noiseX, y + noiseY)
    }
    colnames(d) = c("label", "x1", "x2")
    return(d)
}
d = gencircledata(numSamples=200, radius=1, noise=0.001)
plot(d[,2],d[,3], col=d[,1]+1, pch=19, xlab="x", ylab="y")
Figure 1.1

The data in Figure 1.1 is clearly not linearly separable. However, if we map it to a three-dimensional space using the transformation: \[ \begin{aligned} \phi: R^{2} & \longrightarrow R^{3} \\ \left(x_{1}, x_{2}\right) & \longmapsto\left(z_{1}, z_{2}, z_{3}\right)=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right), \end{aligned} \] and attempt to linearly separate the transformed data, the decision boundaries become hyperplanes in \(R^{3}\), expressed as \(\omega^{T} z + b = 0\). In terms of the original variables \(x\), these boundaries take the form: \[ \omega_{1} x_{1}^{2} + \omega_{2} \sqrt{2} x_{1} x_{2} + \omega_{3} x_{2}^{2} = 0, \] which corresponds to the equation of an ellipse. This demonstrates that we can apply a linear algorithm to transformed data to achieve a non-linear decision boundary with minimal effort.

Now, consider what the algorithm is actually doing. It relies solely on the Gram matrix \(K\) of the data. Once \(K\) is computed, the original data can be discarded: \[ \begin{aligned} K & = \left[\begin{array}{ccc} x_{1}^{T} x_{1} & x_{1}^{T} x_{2} & \cdots \\ x_{2}^{T} x_{1} & \ddots & \\ \vdots & & \end{array}\right]_{n \times n} = X X^{T}, \\ \text{where} \quad X & = \left[\begin{array}{c} x_{1}^{T} \\ \vdots \\ x_{n}^{T} \end{array}\right]_{n \times d}. \end{aligned} \] Here, \(X\), which contains all the data, is referred to as the design matrix.

When we map the data using \(\phi\), the Gram matrix becomes: \[ K = \left[\begin{array}{ccc} \phi\left(x_{1}\right)^{T} \phi\left(x_{1}\right) & \phi\left(x_{1}\right)^{T} \phi\left(x_{2}\right) & \cdots \\ \phi\left(x_{2}\right)^{T} \phi\left(x_{1}\right) & \ddots & \\ \vdots & & \end{array}\right]. \]

Let us compute these inner products explicitly. For vectors \(r\) and \(s\) in \(R^{3}\) corresponding to \(a\) and \(b\), respectively: \[ \begin{aligned} \langle r, s \rangle & = r_{1} s_{1} + r_{2} s_{2} + r_{3} s_{3} \\ & = a_{1}^{2} b_{1}^{2} + 2 a_{1} a_{2} b_{1} b_{2} + a_{2}^{2} b_{2}^{2} \\ & = \langle a, b \rangle^{2}. \end{aligned} \]
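A quick numerical check of this identity (the vectors a and b are arbitrary):

phi_map = function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
a = c(1, 2); b = c(3, -1)
sum(phi_map(a) * phi_map(b))   # inner product after the explicit mapping
sum(a * b)^2                   # the kernel (a'b)^2 computed directly; both equal 1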

Thus, instead of explicitly mapping the data via \(\phi\) and then computing the inner product, we can compute it directly in one step, leaving the mapping \(\phi\) implicit. In fact, we do not even need to know \(\phi\) explicitly; all we require is the ability to compute the modified inner product. This modified inner product is called a kernel, denoted \(K(x, y)\). The matrix \(K\), which contains the kernel values for all pairs of data points, is also referred to as the kernel matrix.

Since the kernel itself is the primary object of interest, rather than the mapping \(\phi\), we aim to characterize kernels without explicitly relying on \(\phi\). Mercer’s Theorem provides the necessary framework for this characterization.

Let’s implement it

library("scatterplot3d")
phi <- function(x1, x2) {
    z1 <- x1^2
    z2 <- sqrt(2) * x1 * x2
    z3 <- x2^2
    return(cbind(z1, z2, z3))
}

# Apply the feature map phi to the circle data d generated above
transformed_data <- phi(d[,2], d[,3])
scatterplot3d(transformed_data, color = ifelse(d[,1] == 0, "red", "blue"), pch = 19,
                xlab = "z1 (x1^2)", ylab = "z2 (sqrt(2) * x1 * x2)", zlab = "z3 (x2^2)",
                main = "3D Scatter Plot of Transformed Data", angle=222, grid=FALSE, box=FALSE)