# Linear Regression

Regression analysis is the most widely used statistical tool for understanding relationships among variables. It provides a conceptually simple method for investigating functional relationships between one or more factors and an outcome of interest. This relationship is expressed in the form of an equation, which we call the model, connecting the response or dependent variable to one or more explanatory or predictor variables.

Example 4.1 (Henrietta Leavitt) Let’s start with a famous analysis by Henrietta Leavitt, who used linear regression to estimate distances to pulsating stars. In her analysis of 1777 variable (pulsating) stars she found that the brighter stars have longer oscillation periods.

She used data for 25 cepheid variables, which are pulsating stars whose varying diameter and temperature lead to periodic changes in brightness with a specific frequency and amplitude. Below you see the scatter plot of period (on the logarithmic scale) vs brightness for those 25 stars.


Figure 4.1: Henrietta Leavitt’s 1912 data on 25 pulsating stars.

Then she drew a straight line through this cloud of points, which allowed her to predict brightness from the period. Periods can be measured accurately, and the predicted brightness allows astronomers to measure cosmic distances over previously unimaginable scales by comparing the predicted brightness with the apparent brightness.

Though we now have many more sophisticated models for describing relations in a dataset, the simplicity and interpretability of the linear regression model make it a viable option for many practical applications. A regression model can be used both for making predictions and for explaining relations between quantities. For example, a prediction question is: for how much will my house sell? An example of an explanation question is: what is the impact of my education on my income?

To demonstrate linear regression we develop a model that predicts the price of a house given its square footage. This is a simplified version of a regression model used by Zillow, a house pricing site. First, we look at property sales data where we know the price and some observed characteristics. Let’s look at the scatter plot of living area, measured in square feet, against the price, measured in thousands of dollars.

Second, we build a decision model that predicts price as a function of the observed characteristics. Now, we need to decide which characteristics to use. There are many factors or variables that affect the price of a house, for example location. Some other basic ones include size, number of rooms, and parking.

Let’s run a simple linear regression on size, measured in square feet. The value that we seek to predict is called the dependent variable, and we denote it by \(y\); here it is the price of the house measured in thousands of dollars. The variable that we use to construct our prediction is the predictor variable, and we denote it by \(x\). What does the data look like?

d = read.csv("data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000  # rescale to $1000s and 1000s of sq. ft.
plot(d$livingArea, d$price, 'p', pch=20, col="blue")

We use the lm function to fit a line through this point cloud:

l = lm(price ~ livingArea, data=d)
plot(d$livingArea, d$price, 'p', pch=20, col="blue", xlab="Living Area", ylab="Price") 
abline(l, col="red", lwd=3)

We can use the coef function to find the slope and the intercept of this line.

coef(l)
#> (Intercept)  livingArea 
#>          13         113

The intercept is in units of \(y\) (thousands of dollars). The slope is in units of \(y\) per unit of \(x\), i.e. thousands of dollars per thousand square feet of living area.

In general, adding statistical noise, the simple linear regression model is given by the following equation \[ y = \beta_0 + \beta_1 x + \epsilon \; \; {\rm where} \; \epsilon \sim N(0, \sigma^2) \] Here the error term \(\epsilon\) indicates that the relation between \(x\) and \(y\) is not exactly a line, and for a given \(x\) the value of \(y\) can deviate from the line. We will write this as \(y \mid x \sim N\left(\beta_0 + \beta_1 x, \sigma^2 \right)\)

In our housing example, we find that \(\beta_0 = 13.44\) and \(\beta_1 = 113.12\). Thus, every additional \(1000\) sq. ft. of living area increases the price of the house by \(113.12\) thousand dollars, while the intercept \(13.44\) is the predicted price (in thousands of dollars) of a house with no living area. The magnitude of \(\beta_1\) shows the economic significance of the house’s square footage.


Example (Making Predictions) We can now predict the price of a house when we only know its size: take the value off the regression line. For example, for a house of size \(x = 2.2\) (i.e. 2,200 sq. ft.), the predicted price is
$$
\hat{y} = 13.44 + 113.12 \times 2.2 \approx 262.3
$$
thousand dollars.


Example (Mean Squared Error) Now we can use our training data set to compare the prices predicted by the model with the actual prices. We do this by calculating the average squared difference, also called the mean squared error,
$$
MSE = \dfrac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2
$$
When the MSE is computed on the same data that was used to fit the model it is called the in-sample MSE; when it is computed on new data that was not used for fitting, it is the out-of-sample MSE, which gives a more honest picture of predictive performance.

Or in code

```r
yhat = coef(l)[1] + coef(l)[2]*d$livingArea
y = d$price
mse = sum((yhat-y)^2)/length(y)
mse
#> [1] 4770
```

Sometimes it is more convenient to use the root mean squared error, RMSE \(= \sqrt{4769.91} = 69.06\), which is measured in the same units as the variable \(y\), price in our case.


This means that, on average, our price predictions miss the actual sale price by roughly \(69\) thousand dollars.
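As a quick check, the RMSE can be computed directly from the MSE calculated in the chunk above (a minimal sketch reusing the `mse` object):

```r
# root mean squared error, in the same units as price (thousands of dollars)
rmse = sqrt(mse)
rmse   # approximately 69.06, the value quoted above
```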

## Least Squares Principle
Following Gauss, we will use the mean squared error as the criterion for picking the best model that describes the relations in our data. The least squares principle simply says that we choose \(\beta_0\) and \(\beta_1\) to minimize the value of the MSE.

If the MSE were zero, the line would fit the data perfectly. We can check that the `lm` function indeed finds the best values of \(\beta_0\) and \(\beta_1\) by looking at the MSE associated with perturbed coefficients.

```r
set.seed(1)
plot(d$livingArea, d$price, 'p', pch=20, col="blue") 
abline(l, col="red", lwd=3)
for(i in 1:5) {
  beta0 = rnorm(1,1,0.2)*coef(l)[1]
  beta1 = rnorm(1,1,0.2)*coef(l)[2]
  yhat = beta0 + beta1*d$livingArea
  abline(beta0,beta1, lwd=2, lty=2, col="green")
  print(sum((yhat-y)^2)/length(y))
}

#> [1] 4808
#> [1] 9003
#> [1] 5907
#> [1] 5815
#> [1] 4900
```

Mathematically, least squares chooses \(\beta_0\) and \(\beta_1\) to minimize \[ (\hat{\beta}_0,\hat{\beta}_1) = \arg\min_{\beta_0,\beta_1}\quad\sum_{i=1}^n L_i \]

where \[ \begin{aligned} \sum_{i=1}^n L_i = & L_1 + \ldots + L_n = (y_1 - \hat{y}_1 )^2 + \ldots + ( y_n - \hat{y}_n )^2 = \\ & (y_1 - \beta_0 - \beta_1x_1)^2 + \ldots + (y_n - \beta_0 - \beta_1x_n)^2\end{aligned} \] This optimization problem can be solved analytically. The equation for the intercept guarantees that \[\bar{y} = \beta_0 + \beta_1 \bar{x}.\] Thus, the point \(( \bar{x} , \bar{y} )\) is always on the regression line.

The slope equation takes the form \[ \begin{aligned} \hat \beta_1 & = \Cor{x,y} \times \frac{s_y}{s_x} = \frac{ \sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y} ) }{ \sum_{i=1}^n ( x_i - \bar{x})^2 }\\ & = \frac{ \Cov{x,y} }{ \Var{x} }.\end{aligned} \] Here \(\hat\beta_1\) is the correlation \(r\) multiplied by the ratio \(s_y/s_x\), which ensures the proper units. This entails estimating parameters using their sample counterparts. Once we have the parameters of the model estimated, we are ready to generate predictions.
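We can verify these formulas on the housing data: the slope computed as \(\Cov{x,y}/\Var{x}\) and the intercept computed from the sample means should reproduce the `lm` estimates (a minimal sketch reusing `d` and `l` from above):

```r
x = d$livingArea; y = d$price
b1 = cov(x, y) / var(x)        # slope: sample covariance over sample variance
b0 = mean(y) - b1 * mean(x)    # intercept: the line passes through (x-bar, y-bar)
c(b0, b1)
coef(l)                        # should match the lm() estimates
```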

### Statistical Properties of Regression

It is convenient to introduce the regression model using the language of probability and uncertainty. We define the random variable of interest, \(y\). The regression model then assumes that the mean of \(y\) depends linearly on the predictor \(x\): \[y = \beta_0 + \beta_1 x + \epsilon,~ \text{where}~\epsilon \sim N(0, \sigma^2).\] The line coefficients \(\beta_0\) and \(\beta_1\) have the same interpretation as in the deterministic approach. Here, we have a new parameter \(\sigma^2\) that models the dispersion of \(y_i\) around the mean \(\beta_0 + \beta_1 x\). Let’s see an example.

Figure: Data with small and large measurement errors.

In the figure above, the black solid line represents the mean of the estimated relation, i.e. \(\beta_0 + \beta_1 x\).

To estimate the parameters \(\beta_0, \beta_1, \sigma^2\) that govern this relationship, we use historical data on the characteristics and the dependent variable, \((x_1,y_1),\ldots,(x_n,y_n)\). We can apply the least squares principle to find the \(\beta\)’s and then estimate \(\sigma^2\) empirically by calculating \[\hat \sigma^2 = s_e^2 = \Var{e},\] the sample variance of the residuals \(e_i\).

We compute the values of \(\beta_0\) and \(\beta_1\) using sample data. Thus, the estimated values will not be equal to the true ones.

Figure: Comparison of the true and estimated regression lines.

To get a feeling for the amount of variability, imagine that we have two sample data sets. For example, we have housing data from two different realtor firms. Do you think the estimated value of \(\beta_1\) will be the same for both of those? The answer is no. Let’s demonstrate with an example: we simulate eight data sets from the same distribution and estimate eight different linear models.

The value of \(\beta_1\) will differ from sample to sample. In other words, it will be a random variable. The sampling distribution of \(\beta_1\) describes how it varies over different samples with the \(x\) values fixed. Statistical view of linear regression allows us to calculate confidence and prediction intervals for estimated parameters. It turns out that when least squares principle is used, the estimated \(\hat\beta_1\) is normally distributed: \(\hat\beta_1 \sim N( \beta_1 , s_1^2 )\). Again, empirically if we plot coefficients for 10000 different models, we can see that the empirical distribution resembles a normal distribution.
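A minimal sketch of such a simulation (the "true" parameter values below are made up purely for illustration): we keep the \(x\) values fixed, repeatedly generate new noise, refit the model, and collect the estimated slopes.

```r
set.seed(42)
n = 100; beta0 = 1; beta1 = 2; sigma = 0.5      # assumed true parameters (illustrative)
x = runif(n)                                    # x values held fixed across samples
b1hat = replicate(10000, {
  y = beta0 + beta1*x + rnorm(n, 0, sigma)      # a fresh sample from the same model
  coef(lm(y ~ x))[2]                            # estimated slope for this sample
})
hist(b1hat, breaks = 50, col = "lightblue",
     xlab = expression(hat(beta)[1]), main = "")
```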

Figure: Histogram of slope estimates from repeated samples drawn from the same distribution.

Further, \(\hat\beta_1\) is unbiased, i.e. \[\E{\hat\beta_1} = \beta_1,\] the expectation of our estimate equals the true value of the unobserved parameter \(\beta_1\). The quantity \(s_1\) is the standard error of \(\hat\beta_1\), and the corresponding t-statistic is \(t = \hat\beta_1 / s_1\). Three factors impact the size of the standard error of \(\hat\beta_1\): the sample size \(n\), the error variance \(s^2\), and the spread of the \(x\)’s, \(s_x\): \[s_1^2 = \frac{s^2}{\sum_{i=1}^n ( x_i - \bar{x} )^2 } = \frac{ s^2 }{ (n-1) s_x^2 }\]
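We can check this formula against the standard error reported by `summary` for the housing regression (a minimal sketch; here \(s\) is the residual standard error):

```r
n  = nrow(d)
s  = summary(l)$sigma                                  # residual standard error
s1 = sqrt(s^2 / ((n - 1) * var(d$livingArea)))         # formula above
s1
summary(l)$coefficients["livingArea", "Std. Error"]    # should agree
```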

Intuitively, we also model the regression-to-the-mean effect. This is one of the most interesting statistical effects you’ll see in daily life. Here are a few examples:

  • If your parents are on average taller than the average, you’ll regress back toward the average.

  • Airline pilot training: good landings are followed by average ones.

  • A negative correlation between first and second round scores in golf.

Further, we always have to account for random noise in day-to-day decisions; this is why we have the error term \(\epsilon\) in our model.

The underlying assumptions of the statistical linear regression model are:

  1. Given the value of \(x\), the \(y\) values are normally distributed

  2. The means of these normal distributions of \(y\) values all lie on the straight line of regression

  3. The standard deviations of these normal distributions are equal

  4. The \(y\) values are statistically independent

Consider another example that demonstrates these effects.

Again, consider the house example. Say our data contain 10 houses with the same square footage, say 2000 sq. ft. The first point states that the prices of those 2000 sq. ft. houses should follow a normal distribution.

The second point merely states that the mean of \(y\) is a linear function of \(x\). The third point states that if we compare the prices of 2000 sq. ft. houses and 2500 sq. ft. houses, the two groups will have the same standard deviation. The fourth point means that the price of one house does not depend on the price of another house.

All of the assumptions of the regression model can be written using probabilistic notation. The model looks like: \[y \mid x \overset{iid}{\sim} N(\beta_0 +\beta_1 x, \sigma^2).\] Here, \(\beta_1\) measures the effect on \(y\) of increasing \(x\) by one unit and \(\beta_0\) measures the mean of \(y\) when \(x=0\). Sometimes, especially in finance applications, we use the notation \(\alpha = \beta_0\) and \(\beta = \beta_1\).

Another advantage of adopting a statistical view of the linear regression model is the ability to quantify information about potential outliers. Outliers are points that are extreme relative to our model predictions. Recall that the residual is \(e_i = y_i- \hat y_i\). Since the observed value \(y_i\) and the fitted value \(\hat y_i\) are both normally distributed, so is their difference, the residual. It is easy to see that \[e_i \sim N(0, s_e^2),\] where \(s_e^2\) is an empirical estimate of the error’s variance.

Consider the relation between the fitted values \(\hat y_i\) and residuals \(e_i\). Our predictions are given by the line. The residual \(e_i\) and predicted value \(\hat y_i\) for the \(i\)th observation are related via \[y_i = \hat{y}_i + ( y_i - \hat{y}_i ) = \hat{y}_i + e_i\].

Residuals allow us to define outliers: they are simply the observations with large residuals. We re-scale the residuals by their standard error, which lets us define \[r_i = \frac{ e_i }{ s_{ e} } =\frac{y_i - \hat{y}_i }{ s_{ e } }\] Since residuals follow a normal distribution, \(e \sim N(0,\sigma^2)\), about 95% of the time we expect the standardized residuals to satisfy \(- 2 < r_i < 2\). Any observation with \(|r_i| > 3\) is an extreme outlier: it is three sigmas away from the mean.
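In R, a minimal sketch of flagging outliers for the housing regression; note that `rstandard` uses a slightly refined scaling of the residuals, but the idea is the same:

```r
r = rstandard(l)        # standardized residuals
sum(abs(r) > 2)         # about 5% of observations are expected here by chance
which(abs(r) > 3)       # extreme outliers: more than three sigmas from the mean
```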

Another type of observation we are interested in is the influential point. Influential points are observations that affect the magnitude of our estimates \(\hat{\beta}\). They are important to find as they typically have economic consequences. We will use Cook’s distance to assess the significance of an influential point. The Cook’s distance associated with sample \(i\) measures the change in the estimated model parameters \(\hat \beta\) when sample \(i\) is removed from the training data set. In R, we can use

  datanew = data[-i,]  # Deletes ith row

to remove \(i\)th sample.
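In practice R computes Cook’s distance for every observation without us having to refit the model \(n\) times; a minimal sketch for the housing regression:

```r
cd = cooks.distance(l)              # influence of each observation on the estimates
head(sort(cd, decreasing = TRUE))   # the most influential observations
plot(l, which = 4)                  # built-in Cook's distance plot
```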

### Predictive Uncertainty

Another advantage of the statistical view is that we can assess how much error there could be in our best prediction. Suppose we would like to predict the value of the output variable for a new observation \(x_f\); then we calculate \[ \hat{y}_f = \hat \beta_0 + \hat \beta_1 x_f + e_f, \; {\rm where} \; e_f \sim N ( 0 , s^2 )\]

Recall that, besides \(\epsilon\) being random, our estimates \(\hat \beta_0\) and \(\hat \beta_1\) are also random. Thus, all three terms in the linear equation carry errors.

After we account for all the uncertainty,

\[\Var{y_f} = s^2 \left ( 1 + \frac{1}{n} + \frac{ (x_f - \bar{x} )^2 }{ (n-1) s_x^2 } \right )\]

The value of \(\Var{y_f}\) can be used to calculate a prediction interval. For example, the 95% prediction interval is given by \[\hat \beta_0 + \hat \beta_1 x_f \pm 1.96 \sqrt{\Var{y_f} }\] Prediction intervals tell us where we can expect to see the next data point sampled.

Similarly, a confidence interval tells us how well we have determined the mean, i.e. it gives a range for \(\E{y_f\mid x_f}\); note that the prediction interval gives a range for \(y_f\) itself. The 95% confidence interval is given by \[\hat \beta_0 + \hat \beta_1 x_f \pm 1.96\, s \sqrt{ \frac{1}{n} + \frac{ (x_f - \bar{x} )^2 }{ (n-1) s_x^2 } }.\]
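Rather than plugging into these formulas by hand, we can ask `predict` for both intervals, for example for a 2,200 sq. ft. house (a minimal sketch reusing the model `l`):

```r
newhouse = data.frame(livingArea = 2.2)   # 2,200 sq. ft.; recall x is in thousands
predict(l, newhouse, interval = "prediction", level = 0.95)   # range for the next sale
predict(l, newhouse, interval = "confidence", level = 0.95)   # range for the mean price
```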

A large predictive error variance (high uncertainty) comes from four factors

  1. Large \(s\) (i.e. large errors, \(\epsilon\)’s)

  2. Small \(n\) (not enough data)

  3. Small \(s_x\) (not enough spread in the covariates)

  4. Large difference between \(x_f\) and \(\bar{x}\) (predicting extremes)

### Input Transformations

A linear model assumes that the output variable is proportional to the input variable, plus an offset. However, this is rarely the case in raw data. Often, we need to transform the input variables, for example by combining multiple inputs into a single predictor (say, a ratio) or by putting inputs on a different scale, e.g. the log scale. In machine learning, this process is called feature engineering.

#### Factor Regression

One of the classic examples of feature engineering is the Fama-French three-factor model, which is used in asset pricing and portfolio management. The model states that asset returns depend on (1) market risk, (2) the outperformance of small versus big companies, and (3) the outperformance of high book-to-market versus low book-to-market companies; mathematically \[r = R_f + \beta(R_m - R_f) + b_s\cdot SMB + b_v\cdot HML + \alpha\] Here \(R_f\) is the risk-free return, \(R_m\) is the return of the market, \(SMB\) stands for "Small market capitalization Minus Big" and \(HML\) for "High book-to-market ratio Minus Low"; they measure the historic excess returns of small caps over big caps and of value stocks over growth stocks. These factors are calculated from combinations of portfolios of ranked stocks (book-to-market ranking, capitalization ranking) and available historical market data.
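In R such a factor model is just a multiple regression of excess returns on the factor series. A minimal sketch, assuming a hypothetical data frame `ff` with illustrative columns `ret` (asset return), `rf` (risk-free rate), `mkt` (market return), `smb` and `hml`:

```r
# ff is an assumed data frame of monthly returns; the column names are illustrative
fit = lm(I(ret - rf) ~ I(mkt - rf) + smb + hml, data = ff)
coef(fit)   # intercept = alpha; slopes = market beta, b_s and b_v
```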

Consider an example where the relations are multiplicative. The figure produced by the code below shows Johnson & Johnson’s opening share price over time, on both the original and the log scale.

jj = read.csv("data/JNJ.csv")
library(dplyr)   # for the %>% pipe and arrange()
jj$Date = lubridate::ymd(jj$Date)
jj = jj %>% arrange(Date)
jj$Date0 = 12*(lubridate::year(jj$Date)-lubridate::year(jj$Date[1])) + 
  (lubridate::month(jj$Date)-lubridate::month(jj$Date[1]))
model = lm(jj$Open ~ jj$Date0)
modellog = lm(log(jj$Open) ~ jj$Date0)
plot(jj$Date0,jj$Open, type='l', col="red", ylab="JNJ Open", xlab="Month")
abline(model, col="red")
par(new=T)
plot(log(jj$Open), type='l', xlab="", ylab="", xaxt='n', yaxt='n')
abline(modellog)
axis(side = 4)
legend("topleft", legend=c("Original", "log-Scale"), lty=1, col=c("red", "black"), lwd=3, bty="n", y.intersp = 1,cex=1)

summary(model)
#> 
#> Call:
#> lm(formula = jj$Open ~ jj$Date0)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -24.06 -11.12  -2.04   8.35  51.99 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -23.01349    1.20367   -19.1   <2e-16 ***
#> jj$Date0      0.19313    0.00361    53.5   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14 on 576 degrees of freedom
#> Multiple R-squared:  0.832,  Adjusted R-squared:  0.832 
#> F-statistic: 2.86e+03 on 1 and 576 DF,  p-value: <2e-16
summary(modellog)
#> 
#> Call:
#> lm(formula = log(jj$Open) ~ jj$Date0)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.9037 -0.3101 -0.0773  0.3307  0.7830 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -9.31e-02   3.24e-02   -2.87   0.0042 ** 
#> jj$Date0     9.14e-03   9.72e-05   94.10   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.39 on 576 degrees of freedom
#> Multiple R-squared:  0.939,  Adjusted R-squared:  0.939 
#> F-statistic: 8.86e+03 on 1 and 576 DF,  p-value: <2e-16

Just by visual inspection we can see that a straight line is not a good fit for this data. The relation between time and price is multiplicative. There are many business and natural science examples where a multiplicative relation holds. To model multiplicative or power relations we can still use the apparatus of linear models; however, we need to apply a log-transformation to the input and/or output variables. Recall that \(\log\) is the natural logarithm with base \(e=2.718\ldots\), and that \(\log ( a b) = \log a + \log b\) and \(\log ( a^b ) = b \log a\).

Consider a multiplicative model relating input and output, \[y = A x^{\beta_1},\] where \(A = e^{\beta_0}\). The log-log transform (taking logs of both sides) leads to \[\log y = \log A + \log x^{\beta_1} = \beta_0 + \beta_1 \log x\]

The slope \(\beta_1\) can be interpreted as an elasticity, that is, the % change in \(y\) for a 1% change in \(x\). Variables are related on a multiplicative, or percentage, scale.

We can estimate this model by regressing \(\log y\) on \(\log x\). In R: model = lm(log(y) ~ log(x))

Another case is the exponential model. Any process that grows exponentially leads to such a model, for example the growth of a population over time or compounding interest in a savings account.

Suppose that we have the equation \[y = A e^{\beta_1 x},\] where \(A = e^{\beta_0}\). This is equivalent to \(\log(y) = \beta_0 + \beta_1 x\).

Taking logs of the original equation gives \[\begin{aligned} \log y &= \log A + \beta_1 x\\ \log y & = \beta_0 + \beta_1 x\end{aligned}\]

Therefore, we can run a regression of \(\log y\) on \(x\). However, there is a caveat: not all variables can be logged. There are some constraints

  1. \(y\) needs to be positive (\(y>0\)).

  2. Dummy variables \(x=0\) or \(1\) can’t be logged.

  3. Counting variables are usually left alone as well.

Example 4.2 (World’s Smartest Mammal) First of all, read in and attach our data ...

mammals = read.csv("data/mammals.csv")
attach(mammals)
head(mammals)
#>                      Mammal  Brain    Body
#> 1          African_elephant 5712.0 6654.00
#> 2 African_giant_pouched_rat    6.6    1.00
#> 3                Arctic_Fox   44.5    3.38
#> 4    Arctic_ground_squirrel    5.7    0.92
#> 5            Asian_elephant 4603.0 2547.00
#> 6                    Baboon  179.5   10.55
tail(mammals)
#>                   Mammal Brain Body
#> 57                Tenrec   2.6  0.9
#> 58            Tree_hyrax  12.3  2.0
#> 59            Tree_shrew   2.5  0.1
#> 60                Vervet  58.0  4.2
#> 61         Water_opossum   3.9  3.5
#> 62 Yellow-bellied_marmot  17.0  4.0
### Linear model
model = lm(Brain~Body)

Residual Diagnostics

The residuals show that you need a transformation...

layout(matrix(1:2, ncol = 2))
plot(Body,Brain, pch=21, bg="lightblue", cex=2)
abline(model, col="red", lwd=3)
plot(Brain,residuals(model), bg="lightblue", pch=21, cex=2)

Residual Plots

par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, cex=2)

log-log model

### Log-Log linear model
model = lm(log(Brain)~log(Body))
summary(model)
#> 
#> Call:
#> lm(formula = log(Brain) ~ log(Body))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.7534 -0.5391 -0.0924  0.4207  2.6115 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.1833     0.1068    20.4   <2e-16 ***
#> log(Body)     0.7432     0.0317    23.5   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.77 on 60 degrees of freedom
#> Multiple R-squared:  0.902,  Adjusted R-squared:   0.9 
#> F-statistic:  551 on 1 and 60 DF,  p-value: <2e-16
par(mar=c(4.5,4.5,3,3))
layout(matrix(1:2, ncol = 2))
plot(log(Brain)~log(Body), pch=21, bg="lightblue", cex=2)
abline(model, col="red", lwd=3)
plot(log(Brain),residuals(model), bg="lightblue", pch=21, cex=2)

That’s better!

\(4\) in \(1\) Residuals: log-log model

par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, cex=2)

log-log Model

\[ {\rm log(Brain)} = 2.18 + 0.74 \; {\rm log(Body)} \; . \] The coefficients are highly significant and \(R^2 = 90\)%.

Outliers

res = rstudent(model)
outliers = order(res,decreasing = T)[1:10]
cbind(mammals[outliers,],
      residual = res[outliers], 
      Fit = exp(model$fitted.values[outliers]))
#>             Mammal Brain  Body residual   Fit
#> 11      Chinchilla    64  0.42     3.78   4.7
#> 34             Man  1320 62.00     2.67 190.7
#> 50   Rhesus_monkey   179  6.80     2.12  36.9
#> 6           Baboon   180 10.55     1.67  51.1
#> 42      Owl_monkey    16  0.48     1.46   5.1
#> 10      Chimpanzee   440 52.16     1.27 167.7
#> 27 Ground_squirrel     4  0.10     1.20   1.6
#> 43    Patas_monkey   115 10.00     1.11  49.1
#> 60          Vervet    58  4.19     1.06  25.7
#> 3       Arctic_Fox    44  3.38     0.92  22.0

There is a residual value of 3.78, an extreme outlier.

It corresponds to the Chinchilla.

This suggests that the Chinchilla is a master race of supreme intelligence!

Inference

NO!!! I checked and there was a data entry error.

  • The brain weight is given as 64 grams and should only be 6.4 grams.

  • The next largest residual corresponds to mankind

In this example the log-log transformation used seems to achieve two important goals, namely linearity and constant variance.

## Multiple Regression

Many problems involve more than one independent (explanatory) variable or factor which affects the dependent or response variable.

  • Multi-factor asset pricing models (APT). Stock returns, book-to-market ratios, Interest rates

  • Demand for a product given prices of competing brands, advertising, household attributes (to formulate pricing strategies)

  • Internet analytics: What do I like? Suggestions instead of search! Alexa: “book my Xmas vacation,” “buy my best friend a birthday present.”

### Regression Model

Let \(Y\) be the response or outcome variable and \(X_1, \ldots , X_p\) the explanatory or input variables.

The general relationship is given by \[Y = f( X_1, \ldots , X_p ) + \epsilon\] And a linear relationship is written \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon\]

MLR Assumptions

The Multiple Linear Regression (MLR) model \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon\] assumptions follow those of simple linear regression:

  1. The conditional mean of \(Y\) is linear in the \(X_j\) variables

  2. The errors are normal \(N( 0 , \sigma^2 )\). We write \[Y \mid X_1 , \ldots , X_p \sim N \left ( \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p , \sigma^2 \right )\]

Statistical versus Economic Significance

When looking at the \(\beta\) coefficients there are two issues

  1. Statistical significance: the \(t\)-ratios of the \(\beta\)’s.

  2. Economic significance: the magnitudes of the \(\beta\)’s. If \(X_i\) increases by one unit, holding the other \(X\)’s constant, then \(Y\) will react by \(\beta_i\) units. These are called marginal effects.

At the end of the day use your judgment!

Model Diagnostics

plot(model) provides diagnostics before model building!
There are many possible caveats

  1. Running simple regressions gives you the wrong answer!

  2. Multiple regression takes into account the correlations among the explanatory variables. It does all the work for you.

  3. A variable might be insignificant once we have incorporated a more important predictor variable.

A common sense approach usually works well. If a variable never seems to be significant it typically isn’t.

Model Prediction is the great equalizer!!

Example 4.3 (Newfood) Goal of Experiment

  • A six month market test has been performed on the Newfood product.

    A breakfast cereal.

  • Build a multiple regression model that gives us good sales forecasts.

  • This dataset is the outcome of a controlled experiment in which the values of the independent variables which affect sales are chosen by the analyst.

We analyze the factors that contribute to sales of a new breakfast cereal and quantify the effects of business decisions such as the choice of advertising level, location in the store, and pricing.

| variable | description |
|----------|-------------|
| sales | new cereal sales |
| price | price |
| adv | low or high advertising (\(0\) or \(1\)) |
| locat | bread or breakfast section (\(0\) or \(1\)) |
| income | neighborhood income |
| svol | size of store |

  1. What happens when you regress sales on price, adv, locat?

  2. Run the “kitchen-sink” regression. Perform Diagnostic checks.

  3. Which variables should we transform?

  4. Run the new model. Perform diagnostics and variable selection.

  5. What’s the largest Cook’s distance?

  6. Provide a summary of coefficients and statistical significance

  7. Predict sales when \({\tt price} = 30, {\tt adv = 1} , {\tt income = 8}\) and \({\tt svol} = 34\). What happens when you predict at the median values of the characteristics?

First we examine the correlation matrix:

newfood = read.csv("data/newfood.csv")
attach(newfood)

names(newfood)
#> [1] "sales"  "price"  "adv"    "locat"  "income" "svol"   "city"   "indx"

head(newfood)
#>   sales price adv locat income svol city indx
#> 1   225    24   0     0    7.3   34    3    1
#> 2   190    24   0     0    7.3   34    3    2
#> 3   205    24   0     0    7.3   34    3    3
#> 4   323    24   0     0    8.3   41    4    1
#> 5   210    24   0     0    8.3   41    4    2
#> 6   241    24   0     0    8.3   41    4    3
tail(newfood)
#>    sales price adv locat income svol city indx
#> 67   180    34   1     1    6.6   29    1    1
#> 68   100    34   1     1    6.6   29    1    2
#> 69    75    34   1     1    6.6   29    1    3
#> 70   150    34   1     1    6.1   24    2    1
#> 71   122    34   1     1    6.1   24    2    2
#> 72   101    34   1     1    6.1   24    2    3

# correlation matrix
cm = cor(cbind(sales,price,adv,locat,income,svol))
cm[upper.tri(cm, diag = TRUE)] = NA
print(as.table(round(cm, 3)))
#>         sales  price    adv  locat income svol
#> sales                                         
#> price  -0.658                                 
#> adv     0.001  0.000                          
#> locat  -0.001  0.000  0.000                   
#> income  0.163 -0.131 -0.746  0.000            
#> svol    0.375 -0.179 -0.742 -0.040  0.809

Remember: correlations are not \(\beta\)’s!!

Total sales volume is negatively correlated to advertising.

Income is negatively correlated with advertising as well.

How is the negative correlation apt to affect the advertising effects?

print(as.table(round(cm[2:4, 1:3], 3)))
#>        sales  price    adv
#> price -0.658              
#> adv    0.001  0.000       
#> locat -0.001  0.000  0.000

There’s no correlation in the \(X\)’s by design!

Let’s start by only including price, adv, locat

model = lm(sales~price+adv+locat)
summary(model)
#> 
#> Call:
#> lm(formula = sales ~ price + adv + locat)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -97.81 -44.14  -9.11  23.62 237.19 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  562.312     53.139   10.58  5.1e-16 ***
#> price        -12.812      1.780   -7.20  6.3e-10 ***
#> adv            0.222     14.535    0.02     0.99    
#> locat         -0.222     14.535   -0.02     0.99    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 62 on 68 degrees of freedom
#> Multiple R-squared:  0.432,  Adjusted R-squared:  0.407 
#> F-statistic: 17.3 on 3 and 68 DF,  p-value: 1.94e-08
  • Why is the marketer likely to be upset by this regression?!

  • Why is the economist happy?

Let’s add income and svol to the regression!

### Transformation

Power model: transform with log-log

model = lm(log(sales)~log(price)+adv+locat+log(income)+log(svol))
summary(model)
#> 
#> Call:
#> lm(formula = log(sales) ~ log(price) + adv + locat + log(income) + 
#>     log(svol))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.5757 -0.1611 -0.0362  0.1928  0.5379 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   8.4070     1.3873    6.06  7.3e-08 ***
#> log(price)   -1.7430     0.2207   -7.90  4.0e-11 ***
#> adv           0.1496     0.1005    1.49  0.14131    
#> locat         0.0010     0.0609    0.02  0.98688    
#> log(income)  -0.5241     0.4958   -1.06  0.29439    
#> log(svol)     1.0308     0.2553    4.04  0.00014 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.26 on 66 degrees of freedom
#> Multiple R-squared:  0.608,  Adjusted R-squared:  0.578 
#> F-statistic: 20.5 on 5 and 66 DF,  p-value: 2.75e-12

Why no logs for adv and locat variables?

The log(svol) coefficient is close to one!

\(R^2= 60\)%

On the transformed scale, \[\log \text{sales}=8.41 - 1.74 \log \text{price} + 0.150\, {\text{adv}} + 0.001\, {\text{locat}} - 0.524 \log \text{inc} + 1.03 \log \text{svol}\] On the un-transformed scale, \[\text{sales} = e^{8.41} ( \text{price} )^{-1.74} e^{ 0.15\, \text{adv} } e^{ 0.001\, \text{locat}} ( \text{inc} )^{-0.524} ( \text{svol} )^{1.03}\] The relation between sales and price, income, and svol is a power law; the relation between sales and adv and locat is exponential.

### Interpretation

Interpret your regression model as follows

  • Price elasticity is \(\hat{\beta}_{\text{price}} = - 1.74\). A \(1\)% increase in price will drop sales \(1.74\)%

  • \({\tt adv}=1\) increases sales by a factor of \(e^{0.15} = 1.16\). That’s a \(16\)% improvement

Variable selection: delete locat as it is statistically insignificant.

### Prediction

predict.lm provides a \(\hat{Y}\)-prediction given a new \(X_f\)

modelnew = lm(log(sales)~log(price)+adv+log(income)+log(svol))
summary(modelnew)
#> 
#> Call:
#> lm(formula = log(sales) ~ log(price) + adv + log(income) + log(svol))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.5752 -0.1608 -0.0367  0.1923  0.5374 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   8.4086     1.3734    6.12  5.4e-08 ***
#> log(price)   -1.7431     0.2190   -7.96  2.8e-11 ***
#> adv           0.1495     0.0996    1.50  0.13811    
#> log(income)  -0.5236     0.4913   -1.07  0.29037    
#> log(svol)     1.0303     0.2515    4.10  0.00012 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.25 on 67 degrees of freedom
#> Multiple R-squared:  0.608,  Adjusted R-squared:  0.585 
#> F-statistic:   26 on 4 and 67 DF,  p-value: 5.01e-13
newdata=data.frame(price=30,adv=1,income=8,svol=34)
predict.lm(modelnew,newdata,se.fit=T,interval="confidence",level=0.99)
#> $fit
#>   fit lwr upr
#> 1 5.2 4.9 5.5
#> 
#> $se.fit
#> [1] 0.1
#> 
#> $df
#> [1] 67
#> 
#> $residual.scale
#> [1] 0.25

Exponentiate back to find \(\text{sales} = e^{5.1739} = 176.60\).
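Since the model is fit on the log scale, we can exponentiate the whole interval returned by `predict.lm`, not just the point forecast (a minimal sketch reusing `modelnew` and `newdata`):

```r
pred = predict.lm(modelnew, newdata, interval = "confidence", level = 0.99)
exp(pred)   # point forecast and interval back on the sales scale
```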

## Interactions

  • Does gender change the effect of education on wages?

  • Do patients recover faster when taking drug A?

  • How does advertisement affect price sensitivity?

  • Interactions are useful. Particularly with dummy variables.

  • We build a kitchen-sink model with all possible dummies (day of the week, gender,...)

### Models with Interactions

In many situations, \(X_1\) and \(X_2\) interact when predicting \(Y\)

Interaction Model: run the regression \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon\]

In R:

model = lm(y ~ x1*x2)

gives \(X_1+X_2+X_1X_2\), and

model = lm(y ~ x1:x2)

gives only \(X_1 X_2\)

The coefficients \(\beta_1\) and \(\beta_2\) are marginal effects.

If \(\beta_3\) is significant there’s an interaction effect.

We leave \(\beta_1\) and \(\beta_2\) in the model whether they are significant or not.
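A minimal sketch on simulated data showing how the two formulas differ in the terms they include:

```r
set.seed(1)
x1 = rnorm(100); x2 = rnorm(100)
y  = 1 + 2*x1 - x2 + 3*x1*x2 + rnorm(100)
coef(lm(y ~ x1*x2))   # intercept, x1, x2 and the x1:x2 interaction
coef(lm(y ~ x1:x2))   # intercept and the x1:x2 interaction only
```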

### Models with Interactions and Dummies

\(X_1\) and \(D\) dummy

  • \(X_2 = D\) is a dummy variable with values of zero or one.

  • Model: typically we run a regression of the form \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 \star D + \epsilon\]

  • The coefficient \(\beta_1 + \beta_2\) is the effect of \(X_1\) when \(D=1\). The coefficient \(\beta_1\) is the effect when \(D=0\).


Example (Orange Juice)

```r
oj = read.csv("./data/oj.csv")
head(oj)
#>   store     brand week logmove feat price AGE60 EDUC ETHNIC INCOME HHLARGE
#> 1     2 tropicana   40     9.0    0   3.9  0.23 0.25   0.11     11     0.1
#> 2     2 tropicana   46     8.7    0   3.9  0.23 0.25   0.11     11     0.1
#> 3     2 tropicana   47     8.3    0   3.9  0.23 0.25   0.11     11     0.1
#> 4     2 tropicana   48     9.0    0   3.9  0.23 0.25   0.11     11     0.1
#> 5     2 tropicana   50     9.1    0   3.9  0.23 0.25   0.11     11     0.1
#> 6     2 tropicana   51     8.9    0   3.9  0.23 0.25   0.11     11     0.1
#>   WORKWOM HVAL150 SSTRDIST SSTRVOL CPDIST5 CPWVOL5
#> 1     0.3    0.46      2.1     1.1     1.9    0.38
#> 2     0.3    0.46      2.1     1.1     1.9    0.38
#> 3     0.3    0.46      2.1     1.1     1.9    0.38
#> 4     0.3    0.46      2.1     1.1     1.9    0.38
#> 5     0.3    0.46      2.1     1.1     1.9    0.38
#> 6     0.3    0.46      2.1     1.1     1.9    0.38
model = lm(logmove ~ log(price)*feat, data=oj)
summary(model)
#> 
#> Call:
#> lm(formula = logmove ~ log(price) * feat, data = oj)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.299 -0.471  0.020  0.495  3.077 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       9.6593     0.0164   588.3   <2e-16 ***
#> log(price)       -0.9582     0.0187   -51.3   <2e-16 ***
#> feat              1.7144     0.0304    56.4   <2e-16 ***
#> log(price):feat  -0.9773     0.0419   -23.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.79 on 28943 degrees of freedom
#> Multiple R-squared:  0.398,  Adjusted R-squared:  0.398 
#> F-statistic: 6.37e+03 on 3 and 28943 DF,  p-value: <2e-16
```

model = lm(log(price)~ brand-1, data = oj)
summary(model)
#> 
#> Call:
#> lm(formula = log(price) ~ brand - 1, data = oj)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.1809 -0.1026 -0.0022  0.1306  0.4626 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> branddominicks    0.52697    0.00207     254   <2e-16 ***
#> brandminute.maid  0.79069    0.00207     382   <2e-16 ***
#> brandtropicana    1.03460    0.00207     500   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2 on 28944 degrees of freedom
#> Multiple R-squared:  0.941,  Adjusted R-squared:  0.941 
#> F-statistic: 1.53e+05 on 3 and 28944 DF,  p-value: <2e-16
levels(oj$brand)
#> NULL
oj$brand = factor(oj$brand)  # brand is read as character; convert to a factor so it can index colors
brandcol <- c("green","red","gold")
par(mfrow=c(1,2))
boxplot(log(price) ~ brand, data=oj, col=brandcol)
plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand])

  • 83 Chicagoland Stores (Demographic info for each)

  • Price, sales (log units moved), and whether advertised (feat)

Orange Juice: Price vs Sales

plot(exp(logmove) ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="move")

Orange Juice: Price vs log(Sales)

plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")

Orange Juice: Price vs log(Sales)

l1 <- loess(logmove ~ price, data=oj, span=2)
smoothed1 <- predict(l1) 
ind = order(oj$price)
plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")
lines(smoothed1[ind], x=oj$price[ind], col="blue", lwd=2)

Orange Juice: log(Price) vs log(Sales)

plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")
l2 <- lm(logmove ~ log(price), data=oj)
smoothed2 <- predict(l2) 
ind = order(oj$price)
lines(smoothed2[ind], x=log(oj$price[ind]), col="blue", lwd=2)

Why? Multiplicative (rather than additive) change.

Now we are interested in the question: how does advertisement affect price sensitivity?

Original model \[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_2 \mathrm{feat}\]

If we feature the brand (in-store display promo or flyer ad), does it affect price sensitivity \(\beta_1\)? If we assume it does \[\beta_1 = \beta_3 + \beta_4\mathrm{feat}\] The new model is \[\log(\mathrm{sales}) = \beta_0 + (\beta_3 + \beta_4\mathrm{feat})\log(\mathrm{price}) + \beta_2 \mathrm{feat}\] After expanding \[\log(\mathrm{sales}) = \beta_0 + \beta_3\log(\mathrm{price}) + \beta_4\mathrm{feat}*\log(\mathrm{price}) + \beta_2 \mathrm{feat}\]

## and finally, consider 3-way interactions
ojreg <- lm(logmove ~ log(price)*feat, data=oj)
summary(ojreg)
#> 
#> Call:
#> lm(formula = logmove ~ log(price) * feat, data = oj)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.299 -0.471  0.020  0.495  3.077 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)       9.6593     0.0164   588.3   <2e-16 ***
#> log(price)       -0.9582     0.0187   -51.3   <2e-16 ***
#> feat              1.7144     0.0304    56.4   <2e-16 ***
#> log(price):feat  -0.9773     0.0419   -23.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.79 on 28943 degrees of freedom
#> Multiple R-squared:  0.398,  Adjusted R-squared:  0.398 
#> F-statistic: 6.37e+03 on 3 and 28943 DF,  p-value: <2e-16

print(lm(logmove ~ log(price), data=oj))
#> 
#> Call:
#> lm(formula = logmove ~ log(price), data = oj)
#> 
#> Coefficients:
#> (Intercept)   log(price)  
#>        10.4         -1.6
print(lm(logmove ~ log(price)+feat + brand, data=oj))
#> 
#> Call:
#> lm(formula = logmove ~ log(price) + feat + brand, data = oj)
#> 
#> Coefficients:
#>      (Intercept)        log(price)              feat  brandminute.maid  
#>           10.279            -2.530             0.891             0.682  
#>   brandtropicana  
#>            1.302
print(lm(logmove ~ log(price)*feat, data=oj))
#> 
#> Call:
#> lm(formula = logmove ~ log(price) * feat, data = oj)
#> 
#> Coefficients:
#>     (Intercept)       log(price)             feat  log(price):feat  
#>           9.659           -0.958            1.714           -0.977
print(lm(logmove ~ brand-1, data=oj))
#> 
#> Call:
#> lm(formula = logmove ~ brand - 1, data = oj)
#> 
#> Coefficients:
#>   branddominicks  brandminute.maid    brandtropicana  
#>             9.17              9.22              9.11

Advertisement increases price sensitivity (in absolute value) from \(-0.96\) to \(-0.96 - 0.98 = -1.94\)!

Why?

One of the reasons is that the price was lowered during the Ad campaign!

library(dplyr)
doj = oj %>% filter(brand=="dominicks")

par(mfrow=c(1,3), mar=c(4.2,4.6,2,1))
boxplot(price ~  feat, data = oj[oj$brand=="dominicks",], col=c(2,3), main="dominicks", ylab="Price ($)")
boxplot(price ~  feat, data = oj[oj$brand=="minute.maid",], col=c(2,3), main="minute.maid")
boxplot(price ~  feat, data = oj[oj$brand=="tropicana",], col=c(2,3), main="tropicana")

0 = not featured, 1 = featured

## Dummies

We want to understand effect of the brand on the sales \[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \xcancel{\beta_2\mathrm{brand}}\]

But brand is not a number!

How can you use it in your regression equation?

We introduce dummy variables

| Brand | Intercept | brandminute.maid | brandtropicana |
|-------------|---|---|---|
| minute.maid | 1 | 1 | 0 |
| tropicana   | 1 | 0 | 1 |
| dominicks   | 1 | 0 | 0 |

\[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_{21}\mathrm{brandminute.maid} + \beta_{22}\mathrm{brandtropicana}\]

R will automatically do it for you:

print(lm(logmove ~ log(price)+brand, data=oj))
#> 
#> Call:
#> lm(formula = logmove ~ log(price) + brand, data = oj)
#> 
#> Coefficients:
#>      (Intercept)        log(price)  brandminute.maid    brandtropicana  
#>            10.83             -3.14              0.87              1.53

\[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_3\mathrm{brandminute.maid} + \beta_4\mathrm{brandtropicana}\]

\(\beta_3\) and \(\beta_4\) are "change relative to reference" (dominicks here).
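We can inspect the dummy columns that R creates behind the scenes by looking at the design matrix (a minimal sketch for the orange juice data):

```r
# dominicks is the reference level, so it gets no column of its own
head(model.matrix(~ log(price) + brand, data = oj))
```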

How does brand affect price sensitivity?

Interactions: logmove ~ log(price) * brand

No Interactions: logmove ~ log(price) + brand

| Parameter | Interactions | No Interactions |
|-----------|--------------|-----------------|
| (Intercept) | 10.95 | 10.8288 |
| log(price) | -3.37 | -3.1387 |
| brandminute.maid | 0.89 | 0.8702 |
| brandtropicana | 0.96239 | 1.5299 |
| log(price):brandminute.maid | 0.057 | |
| log(price):brandtropicana | 0.67 | |

Example 4.4 (Golf Performance Data) Dave Pelz has written two best-selling books for golfers, Dave Pelz’s Short Game Bible, and Dave Pelz’s Putting Bible.

  • Dave Pelz was formerly a “rocket scientist” (literally). Data analytics helped him refine his analysis: it’s the short game that matters!

  • The optimal speed for a putt: the best chance to make a putt is with a speed that would leave the ball \(17\) inches past the hole if it misses.

Golf Data

Year-end performance data on 195 players from the 2000 PGA Tour.

  1. nevents, the number of official PGA events included in the statistics

  2. money, the official dollar winnings of the player

  3. drivedist, the average number of yards driven on par 4 and par 5 holes

  4. gir, greens in regulation, measured as the percentage of time that the first (tee) shot on a par 3 hole ends up on the green, or the second shot on a par 4 hole ends up on the green, or the third shot on a par 5 hole ends up on the green

  5. avgputts, which is the average number of putts per round.

Analyze these data to see which of nevents, drivedist, gir, avgputts is most important for winning money.

Regression of Money on all explanatory variables:

source("src/utils.R")
library(tidyverse)
d00 = read_csv("data/pga-2000.csv")
d18 = read_csv("data/pga-2018.csv")
model18 = lm(money ~ nevents + drivedist + gir + avgputts, data=d18)
model00 = lm(money ~ nevents + drivedist + gir + avgputts, data=d00)
summary(model00)
#> 
#> Call:
#> lm(formula = money ~ nevents + drivedist + gir + avgputts, data = d00)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1158735  -364118   -62913   285081  5604080 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  14856638    4206466    3.53  0.00052 ***
#> nevents        -30066      11183   -2.69  0.00781 ** 
#> drivedist       21310       6913    3.08  0.00236 ** 
#> gir            120855      17429    6.93  6.2e-11 ***
#> avgputts    -15203045    2000905   -7.60  1.3e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 672000 on 190 degrees of freedom
#> Multiple R-squared:  0.493,  Adjusted R-squared:  0.483 
#> F-statistic: 46.3 on 4 and 190 DF,  p-value: <2e-16

\(R^2 = 50\)%

Residuals

hist(rstandard(model00), breaks=20, col="lightblue", xlab = "Standardized Residual", main="")
arrows(x0 = 7.5,y0 = 20,x1 = 8.5,y1 = 2,length = 0.1)
text(x = 7,y = 22,labels = "Tiger Woods", cex=1.5)

Regression

Transform with log(Money) as it has much better residual diagnostic plots.

lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, data = d00)
#> 
#> Call:
#> lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, 
#>     data = d00)
#> 
#> Coefficients:
#> (Intercept)      nevents    drivedist          gir     avgputts  
#>    36.14923     -0.00899      0.01409      0.16567    -21.12875

\(R^2 = 67\)%. There’s still \(33\)% of variation to go

Residuals for log(Money)

par(mar = c(4,4.5,0,0),mfrow=c(1,1))
model00log = lm(log(money) ~ nevents + drivedist + gir + avgputts, data=d00)
summary(model00log)
#> 
#> Call:
#> lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, 
#>     data = d00)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.5438 -0.3655 -0.0058  0.3968  1.9262 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  36.14923    3.57763   10.10   <2e-16 ***
#> nevents      -0.00899    0.00951   -0.94    0.346    
#> drivedist     0.01409    0.00588    2.40    0.018 *  
#> gir           0.16567    0.01482   11.18   <2e-16 ***
#> avgputts    -21.12875    1.70178  -12.42   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.57 on 190 degrees of freedom
#> Multiple R-squared:  0.671,  Adjusted R-squared:  0.664 
#> F-statistic: 96.9 on 4 and 190 DF,  p-value: <2e-16
hist(rstandard(model00log), breaks=20, col="lightblue", xlab = "Standardized Residual", main="")
arrows(x0 = 3,y0 = 20,x1 = 3.2,y1 = 2,length = 0.1)
text(x = 3,y = 22,labels = "Tiger Woods", cex=1.5)

Regression: Variable selection

The \(t\)-statistic for nevents is less than \(1.5\) in absolute value, so we drop it.

lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00)
#> 
#> Call:
#> lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00)
#> 
#> Coefficients:
#> (Intercept)    drivedist          gir     avgputts  
#>     36.1737       0.0146       0.1658     -21.3684

The fewer the putts the better golfer you are. Duh!

avgputts per round is hard to decrease by one!

Evaluating the Coefficients

  1. Greens in Regulation (GIR) has \(\hat{\beta} = 0.17\). If I can increase my GIR by one, my earnings go up by a factor of \(e^{0.17} = 1.18\), an extra \(18\)%.

  2. drivedist has \(\hat{\beta} = 0.014\). With a \(10\) yard improvement, my earnings go up by a factor of \(e^{0.014 \times 10} = e^{0.14} = 1.15\), an extra \(15\)%.

Caveat: Everyone has gotten better since 2000!

Main Findings

Tiger was \(9\) standard deviations better than the model.

  • Taking logs of Money helps the residuals!

  • An exponential model seems to fit well. The residual diagnostics look good

  • The \(t\)-ratios for nevents are under \(1.5\).

Over-Performers

Outliers: biggest over and under-performers in terms of money winnings, compared with the performance statistics.

Woods, Mickelson, and Els won major championships by playing well when big money prizes were available.

Over-Performers

Table 4.1: Over-Performers

|     | name | money | Predicted | Error |
|-----|------|-------|-----------|-------|
| 1   | Tiger Woods | 9188321 | 3584241 | 5604080 |
| 2   | Phil Mickelson | 4746457 | 2302171 | 2444286 |
| 3   | Ernie Els | 3469405 | 1633468 | 1835937 |
| 4   | Hal Sutton | 3061444 | 1445904 | 1615540 |
| 20  | Notah Begay III | 1819323 | 426061 | 1393262 |
| 182 | Steve Hart | 107949 | -1186685 | 1294634 |

Under-Performers

Underperformers are given by large negative residuals Glasson and Stankowski should win more money.

Table 4.2: Under-Performers

|     | name | money | Predicted | Error |
|-----|------|-------|-----------|-------|
| 47  | Fred Couples | 990215 | 1978477 | -988262 |
| 52  | Kenny Perry | 889381 | 1965740 | -1076359 |
| 70  | Paul Stankowski | 669709 | 1808690 | -1138981 |
| 85  | Bill Glasson | 552795 | 1711530 | -1158735 |
| 142 | Jim McGovern | 266647 | 1397818 | -1131171 |

Let’s look at the 2018 data. The highest earners are:

Table 4.3: Highest earners 2018

| name | nevents | money | drivedist | gir | avgputts |
|------|---------|-------|-----------|-----|----------|
| Justin Thomas | 23 | 8694821 | 312 | 69 | 1.7 |
| Dustin Johnson | 20 | 8457352 | 314 | 71 | 1.7 |
| Justin Rose | 18 | 8130678 | 304 | 70 | 1.7 |
| Bryson DeChambeau | 26 | 8094489 | 306 | 70 | 1.8 |
| Brooks Koepka | 17 | 7094047 | 313 | 68 | 1.8 |
| Bubba Watson | 24 | 5793748 | 313 | 68 | 1.8 |

Overperformers

Table 4.4: Overperformers 2018

|    | name | money | Predicted | Error |
|----|------|-------|-----------|-------|
| 1  | Justin Thomas | 8694821 | 5026220 | 3668601 |
| 2  | Dustin Johnson | 8457352 | 6126775 | 2330577 |
| 3  | Justin Rose | 8130678 | 4392812 | 3737866 |
| 4  | Bryson DeChambeau | 8094489 | 3250898 | 4843591 |
| 5  | Brooks Koepka | 7094047 | 4219781 | 2874266 |
| 6  | Bubba Watson | 5793748 | 3018004 | 2775744 |
| 9  | Webb Simpson | 5376417 | 2766988 | 2609429 |
| 11 | Francesco Molinari | 5065842 | 2634466 | 2431376 |
| 12 | Patrick Reed | 5006267 | 2038455 | 2967812 |
| 84 | Satoshi Kodaira | 1471462 | -1141085 | 2612547 |

Underperformers

Table 4.5: Underperformers 2018

|     | name | money | Predicted | Error |
|-----|------|-------|-----------|-------|
| 102 | Trey Mullinax | 1184245 | 3250089 | -2065844 |
| 120 | J.T. Poston | 940661 | 3241369 | -2300708 |
| 135 | Tom Lovelady | 700783 | 2755854 | -2055071 |
| 148 | Michael Thompson | 563972 | 2512330 | -1948358 |
| 150 | Matt Jones | 538681 | 2487139 | -1948458 |
| 158 | Hunter Mahan | 457337 | 2855898 | -2398561 |
| 168 | Cameron Percy | 387612 | 3021278 | -2633666 |
| 173 | Ricky Barnes | 340591 | 3053262 | -2712671 |
| 176 | Brett Stegmaier | 305607 | 2432494 | -2126887 |

Findings

Here are a few interesting effects:

  • Tiger Woods is \(8\) standard deviations better!

  • Increasing driving distance by \(10\) yards makes you \(15\)% more money

  • Increasing GIR by one makes you \(18\)% more money.

  • Detect Under- and Over-Performers

Go Play!!

Regression

  1. Input and Plot Data In R: plot and summary commands

  2. “Kitchen-Sink” Regression lm command with all variables

  3. Residual diagnostics with plot(model): fitted values and standardized residuals, outliers and influence

  4. Transformation?

    Correct the 4-in-1 plots and assumptions.

### Regression Strategy

  1. Variable Selection: \(t\)-stats and \(p\)-values from summary(model)

  2. Final Regression Re-run the model. Interpret the coefficients summary(model). Economic and Statistical Significance

  3. Prediction predict.lm. Out-of-sample forecasting A model is only as good as its predictions!!

### Predictive Analytics: General Introduction

Predictive Analytics is the most widely used tool for high dimensional input-output analysis \[{Y = F(X) \; \; {\rm where} \; \; X=(X_1, \ldots , X_p )}\]

  • Consumer Demand (Amazon, Airbnb, ... )

  • Maps (Bing, Uber)

  • Pricing

  • Healthcare

The applications are endless...

Example 4.5 (Target) Target and other retailers use predictive analytics to study consumer purchasing behaviour to see what type of coupons or promotions you might like

Here’s a famous story about a father and his daughter. Target predicted that his daughter was pregnant from her purchasing behaviour long before the family was buying diapers.

Here’s the original link ...

Target and Pregnancy

Getting a customer to a routine is the key

  • M.I.T. experiment: a T-shaped maze with chocolate at the end, behind a barrier that opens after a loud click

  • While each animal wandered through the maze, its brain was working furiously

  • As the scientists repeated the experiment, again and again, the rats eventually stopped sniffing corners and making wrong turns and began to zip through the maze with more and more speed

  • As each rat learned how to complete the maze more quickly, its mental activity decreased

Learning routines from data is the basis for modern marketing

  • A habit is a three-step loop: a cue, i.e. a trigger that puts you into automatic mode, then the routine, then the reward

  • Febreze: the original ads targeted the wrong routine (kill the smell) and there were no sales. Then the ad said: use Febreze after cleaning each room. Now it is one of the most successful products.

  • Target used the fact that customers who are going through a major life event change their habits (routines). Target can identify due dates from the baby registry.

Example 4.6 (Walmart) Walmart began using predictive analytics in 2004, mining trillions of bytes’ worth of sales data from recent hurricanes to determine what customers most want to purchase leading up to a storm.

Strawberry Pop-Tarts are one of the most purchased food items, especially after storms, as they require no heating and can be eaten at any meal

Walmart and Hurricanes

Example 4.7 (Germany’s Otto) Otto sells other brands and does not stock those goods itself, so it is hard to avoid one of two evils: shipping delays until all the orders are ready for fulfilment, or lots of boxes arriving at different times.

  • They analyze around 3 billion past transactions and 200 variables (past sales, searches on Otto’s site, and weather information) to predict what customers will buy a week before they order. This system has proved so reliable, predicting with 90% accuracy what will be sold within 30 days, that Otto allows it to automatically purchase around \(200,000\) items a month from third-party brands with no human intervention.

[Economist]

[Germany’s Otto]

Example 4.8 (Stitch) Stitch Fix CEO Says AI Is ‘So Real’ and Incredibly Valuable

Stitch Fix asks customers for insights and feedback alongside their size and color preference for items, even the ones customers didn’t like or buy, in exchange for a clear value proposition.

The breadth and depth of their data are valuable.

Their model relies on a combination of data science – machine learning, AI and natural language processing – and human stylists; on top of complex customer profiles built by data, stylists can layer the nuances of buying and wearing clothes.

Example 4.9 (Uber Pool)

Bayes predicts where you’re going to be dropped off.

Uber constructs prior probabilities for riders, Uber cars, and popular places.

Combine to construct a joint probability table

Then calculate the posterior probability of destination for each person and pool travellers together

[Uber Pool]

Example 4.10 (Kaggle: Predictive Culture)


Example (Airbnb)
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries.

Accurately predict where a new user will book their first travel
experience

Airbnb can then personalize content, decrease the average time to first booking, and better forecast demand.

There are $12$ classes: the major destinations, plus a “did not book” category.

  [Economist]: https://www.economist.com/news/business/21720675-firm-using-algorithm-designed-cern-laboratory-how-germanys-otto-uses
  [Germany's Otto]: https://www.youtube.com/watch?v=3XXqKxpr364
  [Stitch Fix CEO Says AI Is 'So Real' and Incredibly Valuable]: https://www.bloomberg.com/news/videos/2018-10-25/stitch-fix-ceo-says-ai-is-so-real-and-incredibly-valuable-video
  [Uber Pool]: https://newsroom.uber.com/inferring-uber-rider-destinations/
  [Airbnb New User Bookings Prediction Competition]: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
  
  
The data include lists of users, demographics, web session records, and content data.
<img src="fig/airbnb_arch-3dplace.png" width="70%" style="display: block; margin: auto;" />

Winner has the best out-of-sample prediction!!

Example 4.11 (Hacking OkCupid)

Sorted daters into seven clusters, like "Diverse" and "Mindful," each with distinct characteristics.

Wired article

NOVA Video


Example (NFL Dynamic Pricing) Predict price demand for any given Lions game for any given seat in the stadium.
<img src="https://fast.wistia.com/embed/iframe/l85ltyxqbh?autoplay=0" width="100%" style="display: block; margin: auto;" />

<https://grahamschool.uchicago.edu/academic-programs/masters-degrees/analytics/nfl-capstone>

We submitted our report in June 2016 suggesting that some areas of the stadium were priced efficiently and some were underpriced or overpriced.

In February 2017, Detroit Jock City wrote:

"Detroit Lions tickets will cost a little more on average for 2017, but
some areas of the stadium will decrease or hold steady."

<img src="fig/lions.png" width="100%" style="display: block; margin: auto;" />

<https://detroitjockcity.com/2017/02/10/detroit-lions-2017-ticket-prices/>

Example 4.12 (Latour 1982 Price History)


Château Latour: grand vin


Example (Global Warming)


Shifting Distribution of Northern Hemisphere Summer Temperature
Anomalies, 1951-2011

[NASA article with animation]

Climate statistics and public policy

The change in global mean temperature is not one of the most sensitive indicators:

  • Changes in radiation spectrum (atmosphere is less transparent in the infrared)

  • Sea surface temperature and Land surface temperature

  • Sea level rise (thermal expansion and ice melt): Greenland and West Antarctic are melting + glacial melt

  • Ocean acidification: CO\(_2\) gets absorbed by water, producing carbonic acid

  • Seasonal changes: the winter–summer temperature difference has been decreasing since 1954; shifts toward earlier seasons lead to ecological effects

  • Hurricanes: increase in maximum wind velocity = sea surface temperature + the difference between sea surface temperature and the average air temperature in the outflow of the hurricane

Guttorp paper

2018 was the fourth-warmest year on record.

NYT article

How Much Hotter Is Your Hometown Than When You Were Born?

NYT article

Ice cores are an important source of data.

Ice core. Cylinder of ice drilled out of an ice sheet or glacier. Most ice core records come from Antarctica and Greenland.

The oldest continuous ice core records to date extend 123,000 years in Greenland and 800,000 years in Antarctica.

Ice Core Datasets

Ice Core Basics

  • Ice core drilling has been around since the 1950s

  • Mostly from Greenland and Antarctica

  • Bubbles in the ice core preserve actual samples of the world’s ancient atmosphere

The World Data Center (WDC) for Paleoclimatology maintains archives of ice core data from polar and low-latitude mountain glaciers and ice caps throughout the world. Proxy climate indicators include oxygen isotopes, methane concentrations, dust content, as well as many other parameters.

https://www.ncdc.noaa.gov/data-access/paleoclimatology-data/datasets/ice-core

CO2 was stable over the last millennium

In the early 19th century CO2 concentration started to rise, and its concentration is now nearly 40% higher than it was before the industrial revolution

Things we learned from ice cores

  • Ice cores contain information about past temperature, and about many other aspects of the environment.

  • Atmospheric carbon dioxide levels are now 40% higher than before the industrial revolution. This increase is due to fossil fuel usage and deforestation.

  • The magnitude and rate of the recent increase are almost certainly unprecedented over the last 800,000 years.

  • Methane also shows a huge and unprecedented increase in concentration over the last two centuries.

BAS article, The Verge Article

Gates thinks we can use more renewables and nuclear

Need better storage + generation (wind/sun) technology

Source: https://www.gatesnotes.com/Energy/A-critical-step-to-reduce-climate-change