Chapter 5 Logistic Regression

When the value \(y\) we are trying to predict is categorical (or qualitative) we have a classification problem. For a binary output we predict the probability its going to happen \[ p ( Y=1 | X = x ), \] where \(X = (x_1,\ldots,x_p)\) is our usual list of predictors.

Suppose that we have a binary response, \(y\) taking the value \(0\) or \(1\)

  1. Win or lose

  2. Sick or healthy

  3. Buy or not buy

  4. Pay or default

The goal is to predict the probability that \(y\) equals \(1\). You can then do and categorize a new data point. Assessing credit risk and default data is a typical problem.

  • \(y\): whether or not a customer defaults on their credit card (No or Yes).

  • \(x\): The average balance that customer has remaining on their credit card after making their monthly payment, plus as many other features you think might predict \(Y\).

A linear model is a powerful tool to find relations among different variables \[y = x^T\beta + \epsilon\]

When relations are non-linear we can do several things: - Add squared terms - Add interactions - Tran form \(y\) or some of the x variables

It works assuming that \(y\) variable is contentious and ranges in \((-\infty,+\infty)\). Another assumption is that conditional distribution of \(y\) is normal \(p(y\mid x^T\beta) \sim N(x^T\beta, \sigma^2)\)

What do we do when assumptions about conditional normal distributions do not hold. For example \(y\) can be a binary variable with values 0 and 1. For example \(y\) is

  1. Outcome of an election

  2. Result of spam filter

  3. Decision variable about loan approval

We model response \(\{0,1\}\), using a continuous variable \(y \in [0,1]\) which is interpreted as the probability that response equals to 1. \[\bcol{p( y= 1 | x_1, \ldots , x_p ) = F \left ( \beta_1 x_1 + \ldots + b_p x_p \right ) }\] where \(f\) is increasing and \(0< f(x)<1\).

It seems logical to find a transformation \(F\) so that \(F(x^T\beta + \epsilon) \in [0,1]\). Then we can predict using \(F(x^T\beta)\) and intercepting interpret the result as a probability, i.e if \(F(x^T\beta) = z\) then we interpret it as \(p(y=1) = z\). Such function \(F\) is called a link function.

Do we know a function that maps any real number to a number in \([0,1]\) interval? What about commulative distribution function \(F(x) = p(Z \le x)\)? If we choose CDF \(\Phi(x)\) for \(N(0,1)\) then we have \[y = \Phi(x^T\beta + \epsilon)\] \[\Phi^{-1}(y) = x^T\beta + \epsilon\] \[y' = x^T\beta + \epsilon\]

You can thing of this as a change of units for variable \(y\). In this specific case, when we use normal CDF, the resulting model is called probit, it stands for probability unit. The resulting link function is \(\Phi^{-1}\) and now \(\Phi^{-1}(Y)\) follows a normal distribution!

This term was coined in the 1930’s by biologists studying the dosage-cure rate link.

# We can fit a probit model using glm function in R
probitModel = glm(y~x, family=binomial(link="probit"))
(mc = as.doubl\E{coef(probitModel)))
# we want to predict outcome for x = -1
xnew = -1
(yt = mc[1] + mc[2]*xnew)
(pred = predict(probitModel, list(x = c(xnew)), type="response"))
Probit model

Figure 5.1: Probit model

Our prediciton is the blue area which is eqal to 0.0219.

# lets look how well probit fits the data
pred_probit = predict(probitModel, list(x = x), type="response")
plot(x,pred_probit, pch=16, col="red", cex=0.5)
lines(x,y, type='p', pch=16, cex=0.2)
abline(h =pnorm(yt) )
Probit model

Figure 5.2: Probit model

A couple of observations: (i) this fits the data much better than the linear estimation, and (i) it always lies between 0 and 1.

Instead of thinking of \(y\) as a probability and transforming right hand side of the linear model we can think of transforming \(y\) so that transformed variable lies in \((-\infty,+\infty)\).

We can use odds ratio, that we talked about before \[\dfrac{y}{1-y}\]

Odds ration lies in the interval \((0,+\infty)\). Almost what we need, but not exactly. Can we do another transform that maps \((0,+\infty)\) to \((-\infty,+\infty)\)? \[\log\left(\dfrac{y}{1-y}\right)\] will do the trick!

This function is called a logit function and it is This function is called a logit function and it is the inverse of the sigmoidal "logistic" function or logistic transform. The is linear in \[\log \left ( \frac{ p \left ( y=1|x \right ) }{ 1 - p \left ( Y=1|x \right ) } \right ) = \beta_0 + \beta_1 x_1 + \ldots + x_p\] These model are easy to fit in R:

glm( y ~ x1 + x2,  family="binomial")
  • "" is for indicates \(y=0\) or \(1\)

  • "" has a bunch of other options.

Outside of specific field, i.e. behavioral economics, the logistic function is much more popular of a choice compared to probit model. Besides that fact that is more intuitive to work with logit transform, it also has several nice properties when we deal with multiple classes (more then 2). Also, it is comutationally easier then working with normal distributions.

The density function of the logit is very similar to the probit one.

logitModel  = glm(y~x, family=binomial(link="logit"))
pred_logit = predict(logitModel, list(x = x), type="response")
plot(x,pred_probit, pch=20, col="red", cex=0.9, ylab="y")
lines(x,pred_logit, type='p', pch=20, cex=0.5, col="blue")
lines(x,y, type='p', pch=21, cex=0.5, bg="lightblue")
legend("bottomright",pch=20, legend=c("Logit", "Probit"), col=c("blue","red"),y.intersp = 2)

We can easily derive an inverse of the logit function to get back the original \(y\) \[\log\left(\dfrac{y}{1-y}\right) = x^T\beta;~~ y=\dfrac{e^{x^T\beta}}{1+e^{x^T\beta}}\]

Example 5.1 () Example: Example: NBA point spread

NBA = read.csv("data/NBAspread.csv")
n = nrow(NBA)

hist(NBA$spread[favwin==1], col=5, main="", xlab="spread")
hist(NBA$spread[favwin==0], add=TRUE, col=6)
legend("topright", legend=c("favwin=1", "favwin=0"), fill=c(5,6), bty="n")
boxplot(NBA$spread ~ NBA$favwin, col=c(6,5), horizontal=TRUE, ylab="favwin", xlab="spread")

Does the Vegas point spread predict whether the favorite wins or not?

Turquoise = Favorites does win, Purple = Favorite does not win

R: Logistic Regression

In R: the output gives us ...

nbareg = glm(favwin~spread-1, family=binomial)
#> Call:
#> glm(formula = favwin ~ spread - 1, family = binomial)
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -2.574   0.159   0.462   0.814   1.112  
#> Coefficients:
#>        Estimate Std. Error z value Pr(>|z|)    
#> spread   0.1560     0.0138    11.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Dispersion parameter for binomial family taken to be 1)
#>     Null deviance: 766.62  on 553  degrees of freedom
#> Residual deviance: 527.97  on 552  degrees of freedom
#> AIC: 530
#> Number of Fisher Scoring iterations: 5
s = seq(0,30,length=100)
fit = exp(s*nbareg$coef[1])/(1+exp(s*nbareg$coef[1]))
plot(s, fit, typ="l", col=4, lwd=2, ylim=c(0.5,1), xlab="spread", ylab="P(favwin)")

The \(\beta\) measures how our log-odds change! \(\beta = 0.156\)

Logistic Regresion for Tennis Classification

Data science plays a major role in tennis, you can learn about recent AI tools developed by IBM from this This Yahoo Article.

We will analyze the Tennis Major Tournament Match Statistics Data Set from the UCI ML repository. The data set has one per each game from four major Tennis tournaments in 2013 (Australia Open, French Open, US Open, and Wimbledon).

Let’s load the data and familiarize ourselves with it

#> [1] 943  44
#> 'data.frame':    943 obs. of  5 variables:
#>  $ Player1: chr  "Lukas Lacko" "Leonardo Mayer" "Marcos Baghdatis" "Dmitry Tu"..
#>  $ Player2: chr  "Novak Djokovic" "Albert Montanes" "Denis Istomin" "Michael "..
#>  $ Round  : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Result : int  0 1 0 1 0 0 0 1 0 1 ...
#>  $ FNL1   : int  0 3 0 3 1 1 2 2 0 3 ...

Let’s look at the few coluns of the randomly selected five rows of the data

d[sample(1:943,size = 5),c("Player1","Player2","Round","Result","gender","surf")]
#>             Player1          Player2 Round Result gender  surf
#> 44  Stephane Robert     Aljaz Bedene     1      1      M  Hard
#> 296    Lukasz Kubot  Maxime Teixeira     1      1      M  Clay
#> 708        B.Becker         A.Murray     1      0      M Grass
#> 149     Shahar Peer Monica Niculescu     1      0      W  Hard
#> 66  Dmitry Tursunov    Denis Istomin     2      0      M  Hard

We have data for 943 matches and for each match we have 44 columns, including names of the players, their gender, surface type and match statistics.

Let’s look at the number of break points won by each player. We will plot BPW (break points won) by each player on the scatter plot and will colorize each dot according to the outcome

n = dim(d)[1]
plot(d$BPW.1+rnorm(n),d$BPW.2+rnorm(n), pch=21, col=d$Result+2, cex=0.6, bg="yellow", lwd=0.8,
     xlab="BPW by Player 1", ylab="BPW by Player 2")
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')

We can clearly see that number of the break points won is a clear predictor of the match outcome. Which is obvious and follows from the rules, to win a match, a player must win break points. Now, we want to understand the impact of a winning a break point on the overall match outcome. We do it by building a logistic regression model

which($BPW.1)) # there is one row with NA value for the BPW.1 value and we remove it
#> [1] 171
d = d[-171,]; n = dim(d)[1]
m = glm(Result ~ BPW.1 + BPW.2-1, data=d, family = "binomial" )
#> Call:
#> glm(formula = Result ~ BPW.1 + BPW.2 - 1, family = "binomial", 
#>     data = d)
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -3.425  -0.668  -0.055   0.636   3.085  
#> Coefficients:
#>       Estimate Std. Error z value Pr(>|z|)    
#> BPW.1   0.4019     0.0264    15.2   <2e-16 ***
#> BPW.2  -0.4183     0.0277   -15.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Dispersion parameter for binomial family taken to be 1)
#>     Null deviance: 1305.89  on 942  degrees of freedom
#> Residual deviance:  768.49  on 940  degrees of freedom
#> AIC: 772.5
#> Number of Fisher Scoring iterations: 5

R output does not tell us how accurate our model is but we can quickly check it by using the table function. We will use \(0.5\) as a threshold for our classification.

table(d$Result, as.integer(m$fitted.values>0.5))
#>       0   1
#>   0 416  61
#>   1  65 400

Thus, our model got (416+416)/942 = 0.88% of the predictions correctly!

Essentially, the logistic regression is trying to draw a line that separates the red observations from the green one. In out case, we have two predictors \(x_1\) = BPW.1 and \(x_2\) = BPW.2 and our model is \[ \log\left(\dfrac{p}{1-p}\right) = \beta_1x_1 + \beta_2 x_2, \] where \(p\) is the probability of player 1 winning the match. We want to find the line along which the probability is 1/2, meaning that \(p/(1-p) = 1\) and log-odds \(\log(p/(1-p)) = 0\), thus the equation for the line is \(\beta_1x_1 + \beta_2 x_2 = 0\) or \[ x_2 = \dfrac{-\beta_1}{\beta_2}x_1 \]

Let’s see the line found by the glm function

plot(d$BPW.1+rnorm(n),d$BPW.2+rnorm(n), pch=21, col=d$Result+2, cex=0.6, bg="yellow", lwd=0.8,
     xlab="BPW by Player 1", ylab="BPW by Player 2")
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, bg="yellow", bty='n')

x = seq(0,30,length.out = 200)
y  =  -m$coefficients[1]*x/m$coefficients[2]
lines(x,y, lwd=2, col="red") 

There are a couple of observations. First, effect of a break point on the game outcome is significant and symmetric, effect of loosing break point is the same as the effect of winning one. We also can interpret the effect of winning a break point in the following way. We will keep BPW.2 = 0 and will calculate what happens to the probability of winning when BPW.1 changes from 0 to 1. The odds ration for player 1 winning when BPW.1 = 0 is exp(0) which is 1, meaning that the probability that P1 wins is 1/2. Now when BPW.1 = 1, the odds ratio is 1.5

#> [1] 1.5

We can calculate probability of winning from the regression equation \[ \dfrac{p}{1-p} = 1.5,~~~p = 1.5(1-p),~~~2.5p = 1.5,~~~p = 0.6 \] Thus probability of winning goes from 50% to 60%, we can use predict function to get this result

predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
#>   1 
#> 0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(1), BPW.2 = c(0)), type="response")
#>   1 
#> 0.6

What happens to the chances of winning when P1 wins three more break points compared to the opponent

predict.glm(m,newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
#>   1 
#> 0.5
predict.glm(m,newdata = data.frame(BPW.1 = c(3), BPW.2 = c(0)), type="response")
#>    1 
#> 0.77

Chances go up by 27%.

Tennis is arguably the sport in which mean and women are treated equally. Both man and women matches are shown during the prime-time on TV, they both have the same prize money. However, one of the comments you hear often is that Women’s matches are “less predictable,” meaning that an upset (when the favorite looses) is more likely to happen in a women’s match compared to man Matches. We can test thus statement by looking at the residuals. The large the residual the less accurate our prediction was.

d$res = abs(m$residuals)
outlind = which(d$res<2)
boxplot(d$res[outlind] ~ d$gender[outlind], col=c(2,3), xlab="Gender",ylab="Residual")

Let’s do a formal T-test on the residuals foe men’s and women’s matches

men = d %>% filter(res<2, gender=="M") %>% pull(res)
women = d %>% filter(res<2, gender=="W") %>% pull(res)
t.test(men, women, alternative = "two.sided")
#>  Welch Two Sample t-test
#> data:  men and women
#> t = -5, df = 811, p-value = 3e-06
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.105 -0.043
#> sample estimates:
#> mean of x mean of y 
#>       1.2       1.3

Looks like the crowd wisdom that Women’s matches are less predictable is correct.

NBA Point Spread Prediction

“Plug-in” the values for the new game into our logistic regression \[{ P \left ( {\rm favwin} \mid {\rm spread} \right ) = \frac{ e^{ \beta x } }{ 1 + e^{\beta x} } }\] Check that when \(\beta =0\) we have \(p= \frac{1}{2}\).

  • Given our new values spread\(=8\) or spread\(=4\),
    The win probabilities are \(77\)% and \(65\)%, respectively. Clearly, the bigger spread means a higher chance of winning.

\BeginKnitrBlock{example}\iffalse{-91-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-20"><strong>(\#exm:unnamed-chunk-20)  \iffalse () \fi{} </strong></span>Example: **Credit Card Default**

10,000 observations

Default = read.csv("data/CreditISLR.csv", stringsAsFactors = T)
#>   default student balance income
#> 1      No      No     730  44362
#> 2      No     Yes     817  12106
#> 3      No      No    1074  31767
#> 4      No      No     529  35704
#> 5      No      No     786  38463
#> 6      No     Yes     920   7492

Let’s build a logistic regression model,data=Default,family=binomial)
#> Call:
#> glm(formula = default ~ balance, family = binomial, data = Default)
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -2.270  -0.146  -0.059  -0.022   3.759  
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -10.65133    0.36116   -29.5   <2e-16 ***
#> balance       0.00550    0.00022    24.9   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Dispersion parameter for binomial family taken to be 1)
#>     Null deviance: 2920.6  on 9999  degrees of freedom
#> Residual deviance: 1596.5  on 9998  degrees of freedom
#> AIC: 1600
#> Number of Fisher Scoring iterations: 8

Predicting default

predict.glm(,newdata = list(balance=1000))
#>    1 
#> -5.2
-1.065e+01 + 5.499e-03*1000
#> [1] -5.2
predict.glm(,newdata = list(balance=1000), type="response")
#>      1 
#> 0.0058
exp(-1.065e+01 + 5.499e-03*1000)/(1+exp(-1.065e+01 + 5.499e-03*1000))
#> [1] 0.0058

Predicting default

x = list(balance=seq(1000,to = 3000,length.out = 100))
y = predict.glm(,newdata = x, type="response")
plot(x$balance,y, pch=20, col="red", xlab = "balance", ylab="Default")
lines(Default$balance, as.integer(Default$default)-1, type='p',pch=20)

5.1 Confusion Matrix

We can use accuracy rate: \[\text{accuracy} = \dfrac{\text{\# of Correct answers}}{n}\] or its dual, error rate \[\text{error rate} = 1 - \text{accuracy}\] You remember, we haw two types of errors. We can use confusion matrix to quantify those

Predicted: YES Predicted: NO
Actual: NO FPR TNR

True positive rate (TPR) is the sensetivity and false positive rate (FPR) is the specificity of our predictive model

Example: \[Evalute the previous model\] Accuracy = 0.96

Predicted: YES Predicted: NO
Actual: YES TPR=0.6 FNR=0.4
Actual: NO FPR=0.03 TNR=0.97

I used \(p=0.2\) as a cut-off. What if I use smaller or larger \(p\), e.g. \(p=0\)?

ROC Curve Shows what happens for different cut-off values

par(mar = c(4, 4, 0.1, 0.1), bty="n")
## roc curve and fitted distributions
par(mai=c(.9,.9,.2,.5), mfrow=c(1,1))
pred = predict.glm(,newdata = Default, type="response")
default = y
roc(p=pred, y=Default$default, bty="n", main="in-sample")
# our 1/5 rule cutoff
points(x= 1-mean((pred<.2)[default==0]), 
       cex=1.5, pch=20, col='red') 
## a standard `max prob' (p=.5) rule
points(x= 1-mean((pred<.5)[default==0]), 
       cex=1.5, pch=20, col='blue') 

Look at other predictors,data=Default,family=binomial)
#> Call:
#> glm(formula = default ~ balance + income + student, family = binomial, 
#>     data = Default)
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -2.469  -0.142  -0.056  -0.020   3.738  
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -1.09e+01   4.92e-01  -22.08   <2e-16 ***
#> balance      5.74e-03   2.32e-04   24.74   <2e-16 ***
#> income       3.03e-06   8.20e-06    0.37   0.7115    
#> studentYes  -6.47e-01   2.36e-01   -2.74   0.0062 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Dispersion parameter for binomial family taken to be 1)
#>     Null deviance: 2920.6  on 9999  degrees of freedom
#> Residual deviance: 1571.5  on 9996  degrees of freedom
#> AIC: 1580
#> Number of Fisher Scoring iterations: 8

Student is significant!?

Student vs Balance

boxplot(balance~student,data=Default, col = Default$student, ylab = "balance")

Let’s adjust for balance

x1 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("Yes",100)), income=rep(40,100))
x2 = data.frame(balance = seq(1000,2500,length.out = 100), student = as.factor(rep("No",100)), income=rep(40,100))
y1 = predict.glm(,newdata = x1, type="response")
y2 = predict.glm(,newdata = x2, type="response")
plot(x1$balance,y1, type='l', col="red", xlab="Balance", ylab = "P(Default)")
lines(x2$balance,y2, type='l', col="black")
legend("topleft",bty="n", legend=c("Not Student", "Student"), col=c("black","red"), lwd=2)

5.2 Estimating the Regression Coefficients

The coefficients \(\beta = (\beta_0, \beta_1)\) can be estimated using maximul maxinmum leakelihiood method we dsicussed when talked about linear regression.

The model we derived above, gives us probability of \(y\), givcen \(x\) \[p(y\mid x) = \dfrac{e^{x^T\beta}}{1+e^{x^T\beta}}\]

Now the problem is as follows \[\maxi_{\beta} \prod_{i:y_i = 1}p(x_i)\prod_{j:y_j = 0}(1-p(x_j)).\]

Maximum likelihood is a very general approach that is used to fit many of the non-linear models.

5.3 Choosing \(p\) and Evaluating Quality of Classifier

In logistic regrssion we use the logistic function to calculate a probability of \(y = 1\) \[p(y=1\mid x) = \dfrac{e^{x^T\beta}}{1+e^{x^T\beta}}.\] Then, to predict a lablel we use a rule if \(y<p\), predict 0, and predict 1 otherwise. Now we anwer the question of how to choose the cut-off value \(p\). We show it throug an example.

Example 5.2 () Example: Assume a bank is using a logistic regresson model to predict probabiliy of a loan default and would issue a loan if \(a = p(y=1) < p\). Here \(p\) is the level of risk bank is willing to take. If bank chooses \(p=1\) and gives loans to everyone it is likely to loose a lot of money from defaulted accounts. If it chooses \(p = 0\) it will not issue loan to anyone and wont make any money. In order to choose an appropriate \(p\), we need to know what are the risks. Assume, bank makes $0.25 on every $1 borrowed in interest in fees and loose the entire amount of $1 if account defaults. This leads to the following pay-off matrix

Pay-off matrix for a loan
payer defaulter
loan -0.25 1
no load 0 0


Then, given \(a = p(y=1)\), the expected profit is profit = \(0.25(1-a) - a\) to maintain a positive profit we need to choose \[0.25(1-a) - a >0 \iff -1.25a > -0.25 \iff a < 0.25 /1.25= 0.2\] Thus, by choosing cutoff to be 0.2 or less, we guarantee to make profit on our loans.

To evaluate a binary classification predictor, we will use confuion matrix. It is shows numbers of correct predictions by the model (true positives and true negatives) and incorrect ones (false positive and false negatives). Say, we have a model that predicts weather person has a desiese or not and we evaluate this mmodel using 200 samples (\(n=200\)) with 60 being labeled as 0 (NO) and 140 labeled as 1 (YES) and model predicted correclty 130 YES labeled observations and 50 NOs.

Predicted: YES Predicted: NO
Actual: YES TP = 130 FN = 10
Actual: NO FP = 10 TN = 50

Sometimes, it is convinient to used rates rather than absolute counts and we compute

Predicted: YES Predicted: NO
Actual: YES TPR = 130/140 FNR = 10/140
Actual: NO FPR = 10/60 TNR = 50/60

True positive rate (TPR) is nothing but the sensetivity and false positive rate (FPR) is the specificity of our predictive model. Accuracy, which is the percent of correct predicitons is another metric can be used to evaluate a classifier. \[\mbox{Accuracy} = \dfrac{\mbox{TP + TN}}{n}.\] The error rate is opposite to accurace \[\mbox{Error rate} = 1- \mbox{Accuracy}\]

For a logistic regression, the confusion matrix will be different for different choices of the cut-off values \(p\). If we would like to understand the performance of the model for different values of \(p\) we can plit an ROC curve, which plots pairs of TPR and FPR for different values of \(p\). Saw we take a sequence of 11 values \(p \in {0, 0.1, 0.2,\ldots,1}\) and we evaluate TPR and FPR for those 10 values and plot those pairs on a 2D plot then we will get the ROC curve. A few facts about the ROC curve:

  1. If we set \(p=0\), then any model will always predict NO, this leads to FPR=0 and TPR=0

  2. If we set \(p=1\) and model always predicts YES, then we get FPR = 1 and TPR = 1

  3. If we have an "ideal" model then for any \(0<p<1\) we will have FPR = 0 and TPR = 1.

  4. A naive model that uses coin flip to classify will have FPR = 1/2 and TPR = 1/2

  5. An ROC curve for an model will lie in-between the ideal curve and naive curve. If your model is worse then naive, it is not a good model. And your model cannot be better than an ideal model.

Example 5.3 () Example: Let’s consider an example. We want to predict default given attributes of the loan applicant. We have 1000 observations of 9 variables

  Default       duration   amount    installment   age    history      purpose      foreign   rent

    $0$           $6$      $1,169$       $4$       $67$   terrible   goods/repair   foreign   FALSE
    $1$           $48$     $5,951$       $2$       $22$     poor     goods/repair   foreign   FALSE
    $0$           $12$     $2,096$       $2$       $49$   terrible       edu        foreign   FALSE
    $0$           $42$     $7,882$       $2$       $45$     poor     goods/repair   foreign   FALSE
    $1$           $24$     $4,870$       $3$       $53$     poor        newcar      foreign   FALSE
    $0$           $36$     $9,055$       $2$       $35$     poor         edu        foreign   FALSE
We build a logistic regression model using all of the 8 predictors and their interactions

credscore = glm(Default~.^2,data=credit,family=binomial)

Then we plot ROC curve (FPR vs TPR) for different values of $p$ and
compare the curve with the naive

<div class="figure" style="text-align: center">
<img src="fig/roc-example.pdf" alt="ROC curve for logistic regression model for repearting defualts" width="100%" />
<p class="caption">(\#fig:roc)ROC curve for logistic regression model for repearting defualts</p>

5.4 Multinomial logistic regression

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: \(y_i \in \{0,1\}\) . We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle \(y_i \in \{1,\ldots ,K\}\) where \(K\) is the number of classes.

Our model took the form: \[f(x\mid \beta)=\dfrac{1}{1+\exp(-\beta^Tx)}~,\] and the model parameters \(\beta\) were trained to minimize the loss function (negative log-likelihood)

\[J(\beta) = -\left[ \sum_{i=1}^m y_i \log f(x\mid \beta) + (1-y_i) \log (1-f(x\mid \beta)) \right]\]

Given a test input \(x\), we want our model to estimate the probability that \(p(y=k|x)\) for each value of \(k=1,\ldots ,K\) Thus, our model will output a \(K\)-dimensional vector (whose elements sum to 1) giving us our K estimated probabilities. Concretely, our model \(f(x\mid \beta)\) takes the form: \[\begin{aligned} f(x\mid \beta) = \begin{bmatrix} p(y = 1 | x; \beta) \\ p(y = 2 | x; \beta) \\ \vdots \\ p(y = K | x; \beta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(\beta_k^T x) }} \begin{bmatrix} \exp(\beta_1^{T} x ) \\ \exp(\beta_2^{T} x ) \\ \vdots \\ \exp(\beta_k^T x ) \\ \end{bmatrix}\end{aligned}\]

Here \(\beta_i \in R^n, i=1,\ldots,K\) are the parameters of our model. Notice that the term \(1/ \sum_{j=1}^{K}{\exp(\beta_k^T x) }\) normalizes the distribution, so that it sums to one.

For convenience, we will also write \(\beta\) to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent \(\beta\) as an \(n\)-by-\(K\) matrix obtained by concatenating \(\beta_1,\beta_2,\ldots ,\beta_K\) into columns, so that \[\beta = \left[\begin{array}{cccc}| & | & | & | \\ \beta_1 & \beta_2 & \cdots & \beta_K \\ | & | & | & | \end{array}\right].\]

We now describe the cost function that we’ll use for softmax regression. In the equation below, \(1\) is the indicator function, so that \(1\)(a true statement)=1, and \(1\)(a false statement)=0. For example, 1(2+3 > 4) evaluates to 1; whereas 1(1+1 == 5) evaluates to 0. Our cost function will be: \[\begin{aligned} J(\beta) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y_i = k\right\} \log \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_k^T x_i)}\right]\end{aligned}\]

Notice that this generalizes the logistic regression cost function, which could also have been written: \[\begin{aligned} J(\beta) &= - \left[ \sum_{i=1}^m (1-y_i) \log (1-f(x\mid \beta)) + y_i \log f(x\mid \beta) \right] \\ &= - \left[ \sum_{i=1}^{m} \sum_{k=0}^{1} 1\left\{y_i = k\right\} \log p(y_i = k | x_i ; \beta) \right]\end{aligned}\] The softmax cost function is similar, except that we now sum over the \(K\) different possible values of the class label. Note also that in softmax regression, we have that \[p(y_i = k | x_i ; \beta) = \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_k^T x_i) }\]

Softmax regression has an unusual property that it has a redundant set of parameters. To explain what this means, suppose we take each of our parameter vectors \(\beta_j\), and subtract some fixed vector \(\psi\). Our model now estimates the class label probabilities as

\[\begin{aligned} p(y_i = k | x_i ; \beta) &= \frac{\exp((\beta_k-\psi)^T x_i)}{\sum_{j=1}^K \exp( (\beta_j-\psi)^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i) \exp(-\psi^T x_i)}{\sum_{j=1}^K \exp(\beta_k^T x_i) \exp(-\psi^T x_i)} \\ &= \frac{\exp(\beta_k^T x_i)}{\sum_{j=1}^K \exp(\beta_k^T x_i)}.\end{aligned}\] In other words, subtracting \(\psi\) does not affect our model’ predictions at all! This shows that softmax regression’s parameters are redundant. More formally, we say that our softmax model is overparameterized, meaning that for any model we might fit to the data, there are multiple parameter settings that give rise to exactly the same model function \(f(x \mid \beta)\) mapping from inputs \(x\) to the predictions.

Further, if the cost function \(J(\beta)\) is minimized by some setting of the parameters \((\beta_1,\ldots,\beta_K)\), then it is also minimized by \((\beta_1-\psi,\ldots,\beta_K-\psi)\) for any value of \(\psi\). Thus, the minimizer of \(J(\beta)\) is not unique. Interestingly, \(J(\beta)\) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton’s method to run into numerical problems. We can just set \(\psi\) to \(\beta_i\) and remove \(\beta_i\).

Example 5.4 () Example: LinkedIn Study: How to Become an Executive(Irwin 2016; Gan and Fritzler 2016)?

Logistic regression was used to analyze the career paths of about \(459,000\) LinkedIn members who worked at a top 10 consultancy between 1990 and 2010 and became a VP, CXO, or partner at a company with at least 200 employees. About \(64,000\) members reached this milestone, \(\hat{p} = 0.1394\), conditional on making it into the database. The goals of the analysis were the following

  1. Look at their profiles – educational background, gender, work experience, and career transitions.

  2. Build a predictive model of the probability of becoming an executive

  3. Provide a tool for analysis of “what if” scenarios. For example, if you are to get a master’s degree, how your jobs perspectives change because of that.

Let’s build a logistic regression model with \(8\) key features (a.k.a. covariates): \[\log\left ( \frac{p}{1-p} \right ) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_8x_8\]

  1. \(p\): Probability of “success” – reach VP/CXO/Partner seniority at a company with at least 200 employees.

  2. Features to predict the “success” probability: \(x_i (i=1,2,\ldots,8)\).

    Variable Parameters
    \(x_1\) Metro region: whether a member has worked in one of the top 10 largest cities in the U.S. or globally.
    \(x_2\) Gender: Inferred from member names: ‘male,’ or ‘female’
    \(x_3\) Graduate education type: whether a member has an MBA from a top U.S. program / a non-top program / a top non-U.S. program / another advanced degree
    \(x_4\) Undergraduate education type: whether a member has attended a school from the U.S. News national university rankings / a top 10 liberal arts college /a top 10 non-U.S. school
    \(x_5\) Company count: # different companies in which a member has worked
    \(x_6\) Function count: # different job functions in which a member has worked
    \(x_7\) Industry sector count: # different industries in which a member has worked
    \(x_8\) Years of experience: # years of work experience, including years in consulting, for a member.

The following estimated \(\hat\beta's\) of features were obtained. With a sample size of 456,000 thy are measured rather accurately. Recall, given each location/education choice in the “Choice and Impact” is a unit change in the feature.

  1. Location: Metro region: 0.28

  2. Personal: Gender(Male): 0.31

  3. Education:

    Graduate education type: 1.16, Undergraduate education type: 0.22

  4. Work Experience: Company count: 0.14, Function count: 0.26, Industry sector count: -0.22, Years of experience: 0.09

Here are three main findings

  1. Working across job functions, like marketing or finance, is good. Each additional job function provides a boost that, on average, is equal to three years of work experience. Switching industries has a slight negative impact. Learning curve? lost relationships?

  2. MBAs are worth the investment. But pedigree matters. Top five program equivalent to \(13\) years of work experience!!!

  3. Location matters. For example, NYC helps.

We can also personalize the prediction for predict future possible future executives. For example,

Person A (p=6%): Male in Tulsa, Oklahoma, Undergraduate degree, 1 job function for 3 companies in 3 industries, 15-year experience.

Person B (p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.

Person C (p=63%): Female in New York City, Top undergraduate program, Top MBA program, 4 different job functions for 4 companies in 1 industry, 15-year experience.

Let’s re-design Person B.

Person B (p=15%): Male in London, Undergraduate degree from top international school, Non-MBA Master, 2 different job functions for 2 companies in 2 industries, 15-year experience.

  1. Work in one industry rather than two. Increase \(3\)%

  2. Undergrad from top \(10\) US program rather than top international school. \(3\)%

  3. Worked for \(4\) companies rather than \(2\). Another \(4\)%

  4. Move from London to NYC. \(4\)%

  5. Four job functions rather than two. \(8\)%. A \(1.5\)x effect.

  6. Worked for \(10\) more years. \(15\)%. A \(2\)X effect.

Choices and Impact (Person B) are shown in Figure\[fig:linkedin\].

5.5 Imbalanced Data

Often, you have much more observations with a specific label, such a sample is called imbalanced. You should avoid using accuracy as a metric to choose a model. Say, you have a binary classification problem with 95% of samples labeled as 1. Then a naive classifier that assigns label 1 for each input will be 95% accurate. An ROC curve and area under the curve (AUC) metric derived from it should be used. Alternatively, you can use F1 sore, which combines precision and recall \[F1 = 2\dfrac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}\]

A modeler should consider synthetically generating more samples of class with small number of observation, e.g. bootstrap or using a generative model or subsampling observations with major label if data set is large enough.

5.6 Model Selection

A parametric model that we choose to fit to data is chosen from a family of functions. Then, we use optimization to find the best model from that family. To find the best model we either minimize empirical loss or maximize the likelihood. We also established, that when \(y \sim N(f(x\mid \beta),\sigma^2)\) then meas squared error loss and negative log-likelihood are the same function. \[\operatorname{E}\left(y | x\right) = f(\beta^Tx)\]

For a regression model, an empirical loss measures a distance between fitted values and measurements and the goal is to minimize it. A typical choice of loss function for regression is \[L (y,\hat y) = \dfrac{1}{n}\sum_{i=1}^n |y_i - f(\beta^Tx_i)|^p\] When \(p=1\) we have MAE (mean absolute error), \(p=2\) we have MSE (mean squared error).

Finding an appropriate family of functions is a major problem and is called model selection problem. For example, the choice of input variable to be included in the model is part of the model choice process. In practice we can find several models for the same data set that perform nearly identically. To summarize, the properties of a good model are

  • Good model is not the one that fits data very well

  • By including enough parameters we can make fit as close as we need

  • Can have perfect fit when number of observations = number of parameters

  • The goal not only to get a good fit but also to reduce complexity

  • Model selection: do not include parameters we do not need

  • Usually select a model from a relevant class

When we select a model for our analysis, we need to keep the following goals in mind

  • When we have many predictors (with many possible interactions), it can be difficult to find a good model.

  • Which input variables do we include?

  • Which interactions do we include?

  • Model selection tries to “simplify” this task.

The model selection task is sometimes one of the most consuming parts of the data analysis. Unfortunately, there is no single rule to find the best model. One way to think about the model choice problem as yet another optimization problem, with the goal to find best family of functions that describe the data. With a small number of predictors we can do brute force (check all possible models). For example, with \(p\) predictors there are \(2^p\) possible models with no interactions. Thus, the number of potential family functions is huge even for modest values of \(p\). One cannot consider all transformations and interactions.

5.7 Out of Sample Performance

Our goal is to build a model that predicts well for out-of-sample data, e.g. the data that was not used for training. Eventually, we are interested in using our models for prediction and thus, the out of sample performance is the most important metric and should be used to choose the final model. In-sample performance is of little interest when predictive model need to be chosen, as one of the winners of Netflix prize put it, “It’s like predicting how much someone will like a movie, having them watch it and tell you how much they really liked it.” The out-of-sample performance is the final judge of the quality of our model. The goal is to use data to find a pattern that we can exploit. The pattern will be "statistical" in its nature. To uncover the pattern we start with a training dataset, denoted by \[D = (y_i,x_i)_{i=1}^n\] and to test the validity of our mode we use out-of-sample testing dataset \[D^* = (y_j^*, x_j^*)_{j=1}^m,\] where \(x_i\) is a set of \(p\) predictors ans \(y_i\) is response variable.

A good predictor will “generalize” well and provide low MSE out-of-sample. These are a number of methods/objective functions that we will use to find, \(\hat f\). In a parameter-based style we will find a black box. There are a number of ways to build our black box model. Our goal is to find the map \(f\) that approximates the process that generated the data. For example data could be representing some physical observations and our goal is recover the “laws of nature" that led to those observations. One of the pitfalls is to find a map \(f\) that does not generalize. Generalization means that our model actually did learn the”laws of nature" and not just identified patterns presented in training. The lack of generalization of the model is called over-fitting. It can be demonstrated in one dimension by remembering the fact from calculus that any set of \(n\) points can be approximated by a polynomial of degree \(n\), e.g we can alway draw a line that connects two points. Thus, in one dimension we can always find a function with zero empirical risk. However, such a function is unlikely to generalize to the observations that were not in our training data. In other words, the empirical risk measure for \(D^*\) is likely to be very high. Let us illustrate that in-sample fit can be deceiving.

Example 5.5 () Example: Say we want to approximate the following function \[f(x) = \dfrac{1}{1+25x^2}.\] This function is simply a ratio of two polynomial functions and we will try to build a liner model to reconstruct this function

Approximation of ration of two polynomials with linear models with different number of features

Figure 5.3: Approximation of ration of two polynomials with linear models with different number of features

Figure\[fig:rungekutta\] shows the function itself (black line) on the interval \([-3,3]\). We used observations of \(x\) from the interval \([-2,2]\) to train the data (solid line) and from \([-3,-2) \cup (2,3]\) (dotted line) to test the model and measure the out-of-sample performance. We tried four different linear functions to capture the relations. We see that linear model \(\hat y = \beta_0 + \beta_1 x\) is not a good model. However, as we increas the degree of the polynomial to 20, the resulting model \(\hat y = \beta_0 + \beta_1x + \beta_2 x^2 +\ldots+\beta_{20}x^{20}\) does fit the training data set quite well, but does very poor job on the test data set. Thus, while in-sample performance is good, the out-of sample performance is unsatisfactory. We should not use the degree 20 polynomial function as a predictive model.

In practice in-sample out-of-simple loss or classification rates provide us with a metric for providing horse race between different predictors. It is worth mentioning here there should be a penalty for overly complex rules which fits extremely well in sample but perform poorly on out-of-sample data. As Einstein famous said "model should be simple, but not simpler."

Bias-Variance Trade-off

For any predictive model we seek to achieve best possible results, i.e.
smallest MSE or misclassification rate. However, a model performance can
be different as data used in one training/validation split may produce
results dissimilar to another random split. In addition, a model that
performed well on the test set may not produce good results given
additional data. Sometimes we observe a situation, when a small change
in the data leads to large change in the final estimated model, e.g.
parameters of the model. These results exemplify the bias/variance
tradeoff, where increasing model bias produces large variance in the
final results. Similarly, low bias results in low variance, but can also
produce an oversimplification of the final model. While Bias/variance
concept is depicted below.

<img src="fig/bias-variance.png" width="60%" style="display: block; margin: auto;" />

\BeginKnitrBlock{example}\iffalse{-91-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-34"><strong>(\#exm:unnamed-chunk-34)  \iffalse () \fi{} </strong></span>Example: We demonstrate bias-variance concept using Boston housing
example. We fit a model $\mathrm{medv} = f(\mathrm{lstat})$. We use
polynomial functions to approximate this relation. We fitted twelve
polynomial functions with degree $1,\ldots,12$ ten time. Each time we
randomly selected 20% of sample for testing and the rest for training.
We estimated in-of-sample performance (bias) and out-of-sample
performance by calculating MSE on training and testing sets
correspondingly. For each polynomial $f$ we averaged MSE from each of
the ten models.

reference="fig:boston-bias-variance"}(a) shows bias and variance for our
twelve different models. As expected, bias increases while variance
increases as model complexity grows. On the other hand out-of-sample MSE
is a U-shaped curve. The optimal model is the one that has smallest
out-of-sample MSE. In our case it is polynomial of degree 5!
<div class="figure" style="text-align: center">
<img src="fig/boston-bias-variance.pdf" alt="boston-bias-variance" width="50%" />
<p class="caption">(\#fig:unnamed-chunk-35)boston-bias-variance</p>
<div class="figure" style="text-align: center">
<img src="fig/boston-bias-variance.pdf" alt="boston-bias-variance" width="50%" />
<p class="caption">(\#fig:unnamed-chunk-36)boston-bias-variance</p>

Let’s take another, a more formal, look at bias-variance trade-off for a linear regression problem. We are interested in the decomposition of the error \(\operatorname{E}\left((y-\hat y)^2\right)\) as a function of bias \(\operatorname{E}\left(y-\hat y\right)\) and variance \(\operatorname{Var}\left(\hat y\right)\).

Here \(\hat y = \hat f_{\beta}(x)\) prediction from the model, and \(y = f(x) + \epsilon\) is the true value, which is measured with noise \(\operatorname{Var}\left(\epsilon\right) = \sigma^2\), \(f(x)\) is the true unknown function. The expectation above measures squared error of our model on a random sample \(x\). \[\begin{aligned} \operatorname{E}\left((y - \hat{y})^2\right) & = \operatorname{E}\left(y^2 + \hat{y}^2 - 2 y\hat{y}\right) \\ & = \operatorname{E}\left(y^2\right) + \operatorname{E}\left(\hat{y}^2\right) - \operatorname{E}\left(2y\hat{y}\right) \\ & = \operatorname{Var}\left(y\right) + \operatorname{E}\left(y\right)^2 + \operatorname{Var}\left(\hat{y}\right) + \operatorname{E}\left(\hat{y}\right)^2 - 2f\operatorname{E}\left(\hat{y}\right) \\ & = \operatorname{Var}\left(y\right) + \operatorname{Var}\left(\hat{y}\right) + (f^2 - 2f\operatorname{E}\left(\hat{y}\right) + \operatorname{E}\left(\hat{y}\right)^2) \\ & = \operatorname{Var}\left(y\right) + \operatorname{Var}\left(\hat{y}\right) + (f - \operatorname{E}\left(\hat{y}\right))^2 \\ & = \sigma^2 + \operatorname{Var}\left(\hat{y}\right) + \mathrm{Bias}(\hat{y})^2\end{aligned}\] Here we used the following identity: \(\operatorname{Var}\left(X\right) = \operatorname{E}\left(X^2\right) - \operatorname{E}\left(X\right)^2\) and the fact that \(f\) is deterministic and \(\operatorname{E}\left(\epsilon\right) = 0\), thus \(\operatorname{E}\left(y\right) = \operatorname{E}\left(f(x)+\epsilon\right) = f + \operatorname{E}\left(\epsilon\right) = f\).

5.8 Cross-Validation

If the data set at-hand is small and we cannot dedicate large enough sample size for testing, simply measuring error on test data set can lead to wrong conclusions. When size of the testing set \(D^*\) is small, the estimated out-of-sample performance is of high variance, depending on precisely which observations are included in the test set. On the other hand, when training set \(D^*\) is a large fraction of the entire sample available, estimated out-of-sample performance will be underestimated. Why?

A trivial solution is to perform the training/testing split randomly several times and then use average out-of-sample errors. This procedure has two parameters, the fraction of samples to be selected for testing \(p\) and number of estimates to be performed \(K\). The resulting algorithm is as follows

      fsz = as.integer(p*n)
      error = rep(0,K)
      for (k in 1:K)
          test_ind = sampl\E{1:n,size = fsz)
          training = d[-test_ind,]
          testing  = d[test_ind,]
          m = lm(y~x, data=training)
          yhat = predict(m,newdata = testing)
          error[k] = mean((yhat-testing$y)^2)
      res = mean(error)

Figure\[fig:cv\](a) shows the process of splitting data set randomly five times.

Cross validation modifies the random splitting approach uses more “disciplined” way to split data set for training and testing. Instead of randomly selecting training data points, CV chooses consecutive observations and thus, each data point is used once for testing. As the random approach, CV helps addressing the high variance issue of out-of-sample performance estimation when data set available is small. Figure\[fig:cv\](b) shows the process of splitting data set five times using cross-validation approach.

Training set (red) and testing set (green) from (a) cross-validation and (b) bootstrap.Training set (red) and testing set (green) from (a) cross-validation and (b) bootstrap.

Figure 5.5: Training set (red) and testing set (green) from (a) cross-validation and (b) bootstrap.

Example 5.6 () Example: We use simulated data set to demonstrate difference between estimated out-of-sample performance using random 20/80 split, 5-fold cross-validation and random split. We used \(x=-2,-1.99,-1.98,\ldots,2\) and \(y = 2+3x + \epsilon, ~ \epsilon \sim N(0,\sqrt{3})\). We simulated 35 datasets of size 100. For each of the simulated data sets, we fitted a linear model and estimated out-of-sample performance using three different approaches. Figure\[fig:test-error20\] compares empirical distribution of errors estimated from 35 samples.

Empirical comparison of simple split, cross-validation, and bootstrap approaches to estimate out-of sample performance.

Figure 5.6: Empirical comparison of simple split, cross-validation, and bootstrap approaches to estimate out-of sample performance.

As we can see the estimated out-of-sample performance by a training set approach is of high variance. While, both cross-validation and bootstrap approaches lead to better estimates, they require model to be fitted 5 times, which can be computationally costly for a complex model. On the other hand, estimate from cross-validation is of lower variance and less bias compared to the bootstrap estimate. Thus, we should prefer cross-validation.

5.9 Small Sample Size

When sample size is small and it is not feasible to divide your data into training and validation data sets, an information criterion could be used to assess a model. We can think of information criterion as a metric that “approximates” out-os-sample performance of the model. Akaike’s Information Criterion (AIC) takes the form \[\mathrm{AIC} = log(\sigma_k^2) + \dfrac{n+2k}{n}\] \[\hat{\sigma}_k^2 = \dfrac{SSE_k}{n}\] Here \(k\) = number of coefficients in regression model, \(SSE_k\) = residual sum of square, \(\hat{\sigma}_k^2\) = MLE estimator for variance. We do not need to proceed sequentially, each model individually evaluated

AIC is derived using the Kullback-Leibler information number. It is a ruler to measure the similarity between the statistical model and the true distribution. \[I(g ; f) = E_g\left(\log \left\{\dfrac{g(y)}{f(y)}\right\}\right) = \int_{-\infty}^{\infty}\log \left\{\dfrac{g(y)}{f(y)}\right\}g(y)dy.\] Here

  • \(I(g ; f) > 0\)

  • \(I(g ; f) = 0 \iff g(u) = f(y)\)

  • \(f \rightarrow g\) as \(I(g ; f) \rightarrow 0\)

To estimate \(I(g ; f)\), we write \[I(g ; f) = E_g\left(\log \left\{\dfrac{g(y)}{f(y)}\right\}\right) = E_g (\log g(y)) - E_g(\log f(y))\] Only the second term is important in evaluating the statistical model \(f(y)\). Thus we need to estimate \(E_g(\log f(y))\). Given sample \(z_1,...,z_n\), and estimated parameters \(\hat{\theta}\) a naive estimate is \[\hat{E}_g(\log f(y)) = \dfrac{1}{n} \sum_{i=1}^n \log f(z_i) = \dfrac{\ell(\hat{\theta})}{n}\] where \(\ell(\hat{\theta})\) is the log-likelihood function for model under test.

  • this estimate is very biased

  • data used used twice: to get the MLE and second to estimate the integral

  • it will favor those model that overfit

Akaike showed that the bias is approximately \(k/n\) where \(k\) is the number of parameters \(\theta\). Therefore we use \[\hat{E}_g(\log f(y)) = \dfrac{\ell(\hat{\theta})}{n} - \dfrac{k}{n}\] Which leads to AIC \[AIC = 2n \hat{E}_g(\log f(y)) = 2 \ell(\hat{\theta}) - 2k\]

Akaike’s Information Criterion (AIC) \[\mathrm{AIC} = \log(\sigma_k^2) + \dfrac{n+2k}{n}\] Controls for alance between model complexity (\(k\)) and minimizing variance. The model selection process involve trying different \(k\), chose model with smallest AIC.

A slightly modified version designed for small samples is the bias corrected AIC (AICc). \[\mathrm{AICc} = \log(\hat{\sigma}_k^2) + \dfrac{n+k}{n-k-2}\] This criterion should be used for regression models with small samples

Yet, another variation designed for larger datasets is the Bayesian Information Criterion (BIC). \[\mathrm{BIC} = \log(\hat{\sigma}_k^2) + \dfrac{k \log(n)}{n}\] Is is the same as AIC but harsher penalty, this chooses simpler models. It works better for large samples when compared to AICc. The motivation fo BIC is from the posterior distribution over model space. Bayes rule lets you calculate the joint probability of parameter and models as \[p(\theta,M\mid D) = \dfrac{p(D\mid \theta,M(p(M,\theta)}{p(D)},~~ p(M\mid D) = \int p(\theta,M\mid D)d\theta \approx n^{p/2}p(D\mid \hat \theta M)p(M).\]

Consider a problem of predicting mortality rates given pollution and temperature measurements. Let’s plot the data.

Comparison of (a) ideal scenario of an experiment being repeated many times and (b) bootstrap scenario when we simulate repeated experiments by re-sampling from existing sampleComparison of (a) ideal scenario of an experiment being repeated many times and (b) bootstrap scenario when we simulate repeated experiments by re-sampling from existing sample

(#fig:Pollution, Temperature and Mortality)Comparison of (a) ideal scenario of an experiment being repeated many times and (b) bootstrap scenario when we simulate repeated experiments by re-sampling from existing sample

Regression Model 1, which just uses the trend: \(M_t = \beta_1 + \beta_2 t + w_t\). We fit by calling lm(formula = cmort ~ trend) to get the following coefficients

                Estimate    Std. Error t value  
    (Intercept) 3297.6062   276.3132   11.93
    trend         -1.6249     0.1399  -11.61

Regression Model 2 regresses to time (trend) and temperature: \(M_t = \beta_1 + \beta_2 t + \beta_t(T_t - T)+ w_t\). The R call is lm(formula = cmort ~ trend + temp)

                Estimate    Std. Error t value 
    (Intercept) 3125.75988  245.48233   12.73 
    trend         -1.53785    0.12430  -12.37 
    temp          -0.45792    0.03893  -11.76 

Regression Model 3, uses trend, temperature and mortality: \(M_t = \beta_1 + \beta_2 t + \beta_3(T_t - T)+ \beta_4(T_t - T)^2 + w_t\). The R call is lm(formula = cmort ~ trend + temp + I(temp^2)

                Estimate    Std. Error t value 
    (Intercept)  3.038e+03  2.322e+02  13.083 
    trend       -1.494e+00  1.176e-01 -12.710 
    temp        -4.808e-01  3.689e-02 -13.031 
    temp2        2.583e-02  3.287e-03   7.858 

Regression Model 4 adds temperature squared: \(M_t = \beta_1 + \beta_2 t + \beta_3(T_t - T)+ \beta_4(T_t - T)^2 + \beta_5 P_t+ w_t\). The R call is lm(formula = cmort ~ trend + temp + I(temp^2) + part)

                Estimate    Std. Error t value 
    (Intercept)  2.831e+03  1.996e+02   14.19 
    trend       -1.396e+00  1.010e-01  -13.82 
    temp        -4.725e-01  3.162e-02  -14.94 
    temp2        2.259e-02  2.827e-03    7.99 
    part         2.554e-01  1.886e-02   13.54 

To choose the model, we look at the information criterion

Model \(k\) SSE df MSE \(R^2\) AIC BIC
1 2 40,020 506 79.0 .21 5.38 5.40
2 3 31,413 505 62.2 .38 5.14 5.17
3 4 27,985 504 55.5 .45 5.03 5.07
4 5 20,508 503 40.8 .60 4.72 4.77

\(R^2\) always decreases with number of covariates (that is what MLE does). Thus, cannot be used as a selection criteria. \(R^2\) for out-of-sample data is useful!

The message to take home on model selection

  • \(R^2\) is NOT a good metric for model selection

  • Value of likelihood function is NOT a god metric

  • are intuitive and work very well in practice (you should use those)

  • AIC is good for big \(n/df\), so it overfits in high dimensions

  • Should prefer AICc over AIC

  • BIC underfits for large \(n\)

  • Cross-validation is important, we will go over it later


Gan, Link, and Alan Fritzler. 2016. “How to Become an Executive.”
Irwin, Neil. 2016. “How to Become a c.e.o.? The Quickest Path Is a Winding One.”