Logistic Regression for Tennis Classification

Data science plays a major role in tennis; you can learn about recent AI tools developed by IBM from this Yahoo article.

We will analyze the Tennis Major Tournament Match Statistics Data Set from the UCI ML repository. The data set has one row for each match from the four major tennis tournaments in 2013 (Australian Open, French Open, US Open, and Wimbledon).

Let’s load the data and familiarize ourselves with it

d = read.csv("data/tennis.csv")
dim(d)
## [1] 943  44
str(d[,1:5])
## 'data.frame':    943 obs. of  5 variables:
##  $ Player1: Factor w/ 478 levels "A Barty","A Cornet",..: 268 262 299 109..
##  $ Player2: Factor w/ 472 levels "A Dulgheru","A Kerber",..: 332 34 110 3..
##  $ Round  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Result : int  0 1 0 1 0 0 0 1 0 1 ...
##  $ FNL1   : int  0 3 0 3 1 1 2 2 0 3 ...

Let’s look at a few columns of five randomly selected rows of the data

d[sample(1:943, size = 5), c("Player1","Player2","Round","Result","gender","surf")]

We have data for 943 matches, and for each match we have 44 columns, including the names of the players, their gender, the surface type, and match statistics.
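Only five of the 44 columns are printed above; to see the full list of column names we can run

names(d)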

Let’s look at the number of break points won by each player. We will plot BPW (break points won) by each player on a scatter plot and color each dot according to the match outcome

n = dim(d)[1]
# add rnorm(n) jitter so that overlapping integer counts are visible
plot(d$BPW.1+rnorm(n), d$BPW.2+rnorm(n), pch=21, col=d$Result+2, cex=0.6, bg="yellow", lwd=0.8,
     xlab="BPW by Player 1", ylab="BPW by Player 2")
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, pt.bg="yellow", bty='n')

We can clearly see that the number of break points won is a strong predictor of the match outcome. This is expected: to win a match, a player almost always needs to win break points. Now we want to quantify the impact of winning a break point on the overall match outcome. We do it by building a logistic regression model

which(is.na(d$BPW.1)) # there is one row with an NA value for BPW.1; we remove it
## [1] 171
d = d[-171,]; n = dim(d)[1]
m = glm(Result ~ BPW.1 + BPW.2 - 1, data=d, family = "binomial") # -1 fits the model without an intercept
summary(m)
## 
## Call:
## glm(formula = Result ~ BPW.1 + BPW.2 - 1, family = "binomial", 
##     data = d)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.425  -0.668  -0.055   0.636   3.085  
## 
## Coefficients:
##       Estimate Std. Error z value Pr(>|z|)    
## BPW.1   0.4019     0.0264    15.2   <2e-16 ***
## BPW.2  -0.4183     0.0277   -15.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1305.89  on 942  degrees of freedom
## Residual deviance:  768.49  on 940  degrees of freedom
## AIC: 772.5
## 
## Number of Fisher Scoring iterations: 5

The R output does not tell us how accurate our model is, but we can quickly check it using the table function. We will use \(0.5\) as the threshold for our classification.

table(d$Result, as.integer(m$fitted.values>0.5))
##    
##       0   1
##   0 416  61
##   1  65 400

Thus, our model got (416+400)/942 ≈ 87% of the predictions correct!
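The same accuracy can be computed directly from the fitted values, using the \(0.5\) threshold from above:

# fraction of matches where the thresholded prediction agrees with the result
mean(as.integer(m$fitted.values > 0.5) == d$Result)
## [1] 0.87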

Essentially, the logistic regression is trying to draw a line that separates the red observations from the green ones. In our case, we have two predictors, \(x_1 = \) BPW.1 and \(x_2 = \) BPW.2, and our model is \[ \log\left(\dfrac{p}{1-p}\right) = \beta_1 x_1 + \beta_2 x_2, \] where \(p\) is the probability of player 1 winning the match. We want to find the line along which the probability is 1/2, meaning that \(p/(1-p) = 1\) and the log-odds \(\log(p/(1-p)) = 0\); thus the equation for the line is \(\beta_1 x_1 + \beta_2 x_2 = 0\), or \[ x_2 = \dfrac{-\beta_1}{\beta_2}x_1. \]

Let’s see the line found by the glm function

plot(d$BPW.1+rnorm(n), d$BPW.2+rnorm(n), pch=21, col=d$Result+2, cex=0.6, bg="yellow", lwd=0.8,
     xlab="BPW by Player 1", ylab="BPW by Player 2")
legend("bottomright", c("P1 won", "P2 won"), col=c(3,2), pch=21, pt.bg="yellow", bty='n')

# decision boundary x2 = -(beta1/beta2)*x1, along which p = 1/2
x = seq(0,30,length.out = 200)
y = -m$coefficients[1]*x/m$coefficients[2]
lines(x, y, lwd=2, col="red")
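Plugging in the fitted coefficients, the slope of this boundary is \(-0.4019/(-0.4183) \approx 0.96\), so the separating line runs close to the diagonal BPW.1 = BPW.2: whichever player wins more break points is predicted to win the match.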

There are a couple of observations. First, the effect of a break point on the match outcome is significant and symmetric: the effect of losing a break point is about the same as the effect of winning one (the two coefficients, 0.4019 and -0.4183, are nearly equal in magnitude). We can also interpret the effect of winning a break point in the following way. We will keep BPW.2 = 0 and calculate what happens to the probability of winning when BPW.1 changes from 0 to 1. The odds ratio for player 1 winning when BPW.1 = 0 is exp(0), which is 1, meaning that the probability that P1 wins is 1/2. Now when BPW.1 = 1, the odds ratio is 1.5

exp(0.4019)
## [1] 1.5

We can calculate the probability of winning from the regression equation \[ \dfrac{p}{1-p} = 1.5,~~~p = 1.5(1-p),~~~2.5p = 1.5,~~~p = 0.6 \] Thus the probability of winning goes from 50% to 60%; we can use the predict function to get this result

predict.glm(m, newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
##   1 
## 0.5
predict.glm(m, newdata = data.frame(BPW.1 = c(1), BPW.2 = c(0)), type="response")
##   1 
## 0.6

What happens to the chances of winning when P1 wins three more break points than the opponent?

predict.glm(m, newdata = data.frame(BPW.1 = c(0), BPW.2 = c(0)), type="response")
##   1 
## 0.5
predict.glm(m, newdata = data.frame(BPW.1 = c(3), BPW.2 = c(0)), type="response")
##    1 
## 0.77

The chances of winning go up by 27 percentage points.
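We can double-check this number by hand: the plogis function computes the inverse logit, so with BPW.2 = 0 the predicted probability at BPW.1 = 3 is

plogis(0.4019*3) # inverse logit of beta1*3, using the BPW.1 coefficient from the summary
## [1] 0.77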

Tennis is arguably the sport in which men and women are treated most equally. Both men’s and women’s matches are shown during prime time on TV, and both have the same prize money. However, one of the comments you hear often is that women’s matches are “less predictable”, meaning that an upset (when the favorite loses) is more likely to happen in a women’s match than in a men’s match. We can test this statement by looking at the residuals. The larger the residual, the less accurate our prediction was.

d$res = abs(m$residuals)
outlind = which(d$res < 2) # keep the rows that are not extreme outliers
boxplot(d$res[outlind] ~ d$gender[outlind], col=c(2,3), xlab="Gender", ylab="Residual")

Let’s do a formal t-test on the residuals for men’s and women’s matches

library(dplyr) # for filter and pull
men = d %>% filter(res < 2, gender == "M") %>% pull(res)
women = d %>% filter(res < 2, gender == "W") %>% pull(res)
t.test(men, women, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  men and women
## t = -5, df = 811, p-value = 3e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.105 -0.043
## sample estimates:
## mean of x mean of y 
##       1.2       1.3

It looks like the crowd’s wisdom that women’s matches are less predictable is correct: the residuals for women’s matches are significantly larger.