# Chapter 4 Linear Regression

Regression analysis is the most widely used statistical tool for understanding relationships among variables. It provides a conceptually simple method for investigating functional relationships between one or more factors and an outcome of interest. This relationship is expressed in the form of an equation, which we call the model, connecting the response or dependent variable and one or more explanatory or predictor variable.

Example 4.1 (Henrietta Leavitt) Let’s start with a famous analysis of Henrietta Leavitt who used linear regression to estimate distances to pulsating stars. In her analysis of 1777 variable (pulsating) starts she found out that the brighter star have the longer oscilation period.

She used data for 25 cepheid variables, which are pulsating stars with varying diameter and temperature that leads to periodic changes in brightness of specific frequency and amplitude. Below you see the scatter plot of period (on logarithmic scale) vs brightness of those 25 stars Figure 4.1: Henrietta Leavitt’s 1912 data on 25 pulsating stars.

Then she drew a straight a line through this cloud of points which allowed her to predict brightness from the period. Periods can be accurately measured and predicted value for brightness allows astronomers to measure cosmic distances over previously unimaginable scales by comparing the predicted brightness with the apparent brightness.

Though we have though we have many more sophisticated models to model the relations in a dataset, the simplicity and interoperability of the linear regression model makes it a viable option for many practical applications. The regression model can be used for both making a prediction or explaining relations between quantities. For example, we might be interested in predicting for how much will my house sell? An example of explanation question would be what is the impact of my education on my income?

To demonstrate linear regression we develop a model that predicts the price of a house, given its square footage. This is a simplified version of a regression model used by Zillow, a house pricing site. First, we look at property sales data where we know the price and some observed characteristics. Let’s look at the scatter plot of living areas measure in squared feet and the price measured in thousands of dollars.

Second, we build a decision model that predicts price as a function of the observed characteristics. Now, we need to decide on what characteristics do we use. There are many factors or variables that affect the price of a house, for example location. Some other basics ones include size, number of rooms, and parking.

Let’s run a simple linear regression on size, measured in square foots. The value that we seek to predict is called the dependent variable, and we denote this by $$y$$ which is the price of house measured in thousand of dollars. The variable that we use to construct our prediction is the predictor variable, and we denote it with $$x$$. What’s does the data look like?

d = read.csv("data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000
plot(d$livingArea,d$price, 'p', pch=20, col="blue") We use lm function to draw a line

l = lm(price ~ livingArea, data=d)
plot(d$livingArea, d$price, 'p', pch=20, col="blue", xlab="Living Area", ylab="Price")
abline(l, col="red", lwd=3) We can use coef funciton to find out the slope and the intercept of this line.

coef(l)
#> (Intercept)  livingArea
#>          13         113

The intercept value is in units of $$y$$ ($1000). The slope is in units of $$y$$ per units of $$x$$ ($1000/1 sq. feet).

In general, adding statistical noise, the simple linear regression model is then given by the follwing equation $y = \beta_0 + \beta_1 x + \epsilon \; \; {\rm where} \; \epsilon \sim N(0, \sigma^2)$ Here error $$\epsilon$$ term indicates that the relation between $$x$$ and $$y$$ is not excatly a line and for a given $$x$$, the value of $$y$$ can be deviating from the line. We will write this as $$y \mid x \sim N\left(\beta_0 + \beta_1 x, \sigma^2 \right)$$

In our housing example, we find that $$\beta_0 = 13.44$$ and $$\beta_1 =113.12$$. Thus, every $$1000$$ sqft increase the price of the hous by $1.34^{4} and the price of a lot of land without a house is 113.12. The magnitude of $$\beta_1$$ shows the economic significance of house’s square footage.  \BeginKnitrBlock{example}\iffalse{-91-77-97-107-105-110-103-32-80-114-101-100-105-99-116-105-111-110-115-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-7"><strong>(\#exm:unnamed-chunk-7) \iffalse (Making Predictions) \fi{} </strong></span> We can now predict the price of a house when we only know that size: take the value off the regression line. For example, given a house size of$x = 2.2$. Predicted Price: $$\hat{y} = r coef(l) + r coef(l) \times 2.2 = r coef(l) + coef(l)*2.2$$</div>\EndKnitrBlock{example} \BeginKnitrBlock{example}\iffalse{-91-77-101-97-110-32-83-113-117-97-114-101-100-32-69-114-114-111-114-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-8"><strong>(\#exm:unnamed-chunk-8) \iffalse (Mean Squared Error) \fi{} </strong></span> Now we can use our training data set to compare prices predicted by the model and actual prices. We will do it by clculating an average squared difference, also called a mean squared error $$MSE = \dfrac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2$$ **Nick: Describe in-sample and out-of-sample MSE** Or in code</div>\EndKnitrBlock{example} r yhat = coef(l) + coef(l)*d$livingArea
y = d$price mse = sum((yhat-y)^2)/length(y) mse #>  4770 Sometimes it is more convinient to use root mean squared error (RMSE) = $$\sqrt{4769.91} = 69.06$$ which is measured in the same units as variable $$y$$, price in our case.  **Nick: This means that ---** ## Least Squares Principle Due to Gauss, we will use the mean squared error as the criteria to pick the best model that describe the relations in our data. Least squares principle simply says that we find such$\beta_0$and$\beta_1$that minimize the value of MSE. If MSE is zero we would have a perfect line. We can check and see that lm function indeed finds the best values of$\beta_0$and$\beta_1$by looking at thse MSE associated with perturbed coefficients r set.seed(1) plot(d$livingArea, d$price, 'p', pch=20, col="blue") abline(l, col="red", lwd=3) cat("This produces output,") #> This produces output, for(i in 1:5) { beta0 = rnorm(1,1,0.2)*coef(l) beta1 = rnorm(1,1,0.2)*coef(l) yhat = beta0 + beta1*d$livingArea
abline(beta0,beta1, lwd=2, lty=2, col="green")
print(sum((yhat-y)^2)/length(y))
} #>  4808
#>  9003
#>  5907
#>  5815
#>  4900

Mathematically, least squares chooses $$\beta_0$$ and $$\beta_1$$ to minimize $(\hat{\beta}_0,\hat{\beta}_1) = \arg\min_{\beta_0,\beta_1}\quad\sum_{i=1}^n L_i$

where \begin{aligned} \sum_{i=1}^n L_i = & L_1 + \ldots + L_n = (y_1 - \hat{y}_1 )^2 + \ldots + ( y_n - \hat{y}_n )^2 = \\ & (y_1 - \beta_0 - \beta_1x_1)^2 - \ldots - (y_n - \beta_0 - \beta_1x_n)^2\end{aligned} This optimization problem can be solved analytically. The equation for the intercept guarantees that $\bar{y} = \beta_0 + \beta_1 \bar{x}.$ Thus, the point $$( \bar{x} , \bar{y} )$$ is always on the regression line.

The slope equation takes the form \begin{aligned} \hat \beta_1 & = \Cor{x,y} \times \frac{s_y}{s_x} = \frac{ \sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y} ) }{ \sum_{i=1}^n ( x_i - \bar{x})^2 }\\ & = \frac{ \Cov{x,y} }{ \Var{x} }.\end{aligned} Here $$\beta_1$$ estimates the correlation $$r$$ times a that ensures the proper units. This entails estimating parameters using their sample counterparts. Once we have parameters of the model estimated, we are now ready to generate.

### 4.0.1 Statistical Properties of Regression

It is convenient to introduce a regression model using the language of probability and uncertainty. We define the random variable, , of interest. Then regression model assumes that mean of $$y$$ depends linearly on predictor $$x$$ $y = \beta_0 + \beta_1 x + \epsilon,~ \text{where}~\epsilon \sim N(0, \sigma^2).$ Line coefficients $$\beta_0$$ and $$\beta_1$$ have same interpretation as in the deterministic approach. Here, we have a new parameter $$\sigma^2$$ that models dispersion of $$y_i$$ around the mean $$\beta_0 + \beta_1 x$$, let’s see an example

# Data with small and large measurements errors
# lm-sigma

In Figure TODO: how to insert references to R-generated plots the black solid line represents the mean of the estimated relation, i.e. $$\beta_0 + \beta_1 x$$.

To estimate parameters $$\beta_0, \beta_1,\sigma^2$$ that govern this relationship, we use historical data on characteristics and dependent variable $$(x_1,y_1),\ldots (x_n,y_n)$$. We can apply least squares principle to find $$\beta$$’s and then estimate $$\sigma^2$$ empirically by calculating $\hat \sigma^2 = s_e^2 = \Var{e_1}.$

We compute the values of $$\beta_0$$ and $$\beta_1$$ using a sample data. Thus, the estimated values will not be equal to the true ones.

# Comparison of true and estimated lines
# lm-beta-estimated

To get a feeling for the amount of variability in our exmerimentation. Imagine that we have two sample data sets. For example, we have housing data from two different realtor firms. Do you think the estimated value of $$\beta_1$$ will be the same for both of those? The answer is no. Let’s demonstrate with an example, we simulate eight data sets from the same distribution and estimate eight different linear models.

The value of $$\beta_1$$ will differ from sample to sample. In other words, it will be a random variable. The sampling distribution of $$\beta_1$$ describes how it varies over different samples with the $$x$$ values fixed. Statistical view of linear regression allows us to calculate confidence and prediction intervals for estimated parameters. It turns out that when least squares principle is used, the estimated $$\hat\beta_1$$ is normally distributed: $$\hat\beta_1 \sim N( \beta_1 , s_1^2 )$$. Again, empirically if we plot coefficients for 10000 different models, we can see that the empirical distribution resembles a normal distribution.

# Histogram of parameters estimated from different samples from the same distribution
# lm-random-bets

Further, $$\hat\beta_1$$ is unbiased, i.e. $\E{\hat\beta_1} = \beta_1,$ the expectation of our estimated value is equal to the true value of the unobserved parameter $$\beta$$. Further, $$s_1$$ is the . The t-statistic is . The three factors: sample size ($$n$$), error variance ($$s^2$$), and $$x$$-spread, $$s_x$$ impact the size of standard error for $$\beta_1$$ $\rcol{s_1^2 = \frac{s^2}{\sum_{i=1}^n ( x_i - \bar{x} )^2 } = \frac{ s^2 }{ (n-1) s_x^2 } }$

Intuitively, we model regression-back-to-the-mean effect. This is one of the most interesting statistical effects you’ll see in daily life. Here are a few examples:

• If your parents are on average higher than the average, you’ll regress back to the average.

• Airline pilot training. Good landings are followed by average ones.

• A negative correlation between first and second round scores at Golf.

Further, We always have to account for the random noise in day-to-day decisions, thus we have the error term $$\epsilon$$ in our model.

The underlying assumptions of the statistical linear regression model are:

1. Given the value of $$x$$, the $$y$$ values are normally distributed

2. The means of these normal distributions of $$y$$ values all lie on the straight line of regression

3. The standard deviations of these normal distributions are equal

4. The $$y$$ values are statistically independent

Consider another example that demonstrates these effects.

Again, consider a house example. Say in our data we have 10 houses with the same squire footage, say 2000. Now the first point states, that the prices of those houses should follow a normal distribution. In other words, prices of all of the houses of size 2000 follow a normal distribution.

The second point merely states that mean of $$y$$ is a linear function of $$x$$. The third point states that if we are compare prices of 2000 sqft houses and 2500 sqft houses, they will have the same standard deviation. The fourth point means that price of one house does not depend on the price of another house.

All of the assumptions in the regression model can be written using probabilist notations The looks like: $y \mid x \overset{iid}{\sim} N(\beta_0 +\beta_1 x, \sigma^2).$ Here, $$\beta_1$$ measures the effect on $$y$$ of increasing $$x$$ by one and $$\beta_0$$ measures the effect on $$y$$ when $$x=0$$. Sometimes, especially in finance applications, we use the notation $$\alpha = \beta_0$$ and $$\beta = \beta_1$$

Another advantage of adopting a statistical view of the linear regression model is ability to quantify information about potential outliers. Outliers are points that are extreme relative to our model predictions. Recall, that residual is $$e_i = y_i- \hat y_i$$. Since our predicted value $$\hat y_i$$ follows a normal distribution, the residual also follows a normal distribution, since it is a difference of normal random variable $$\hat y_i$$ and a constant $$y_i$$. It easy to see that $e_i \sim N(0, s_e^2),$ where $$s_e^2$$ is an empirical estimate of the error’s variance.

Consider the relation between the fitted values $$\hat y_i$$ and residuals $$e_i$$. Our predictions are given by the line. The residual $$e_i$$ and predicted value $$\hat y_i$$ for the $$i$$th observation are related via $y_i = \hat{y}_i + ( y_i - \hat{y}_i ) = \hat{y}_i + e_i$.

Residuals allow us to define outliers. They simply have large residuals. We re-scale the residuals by their standard errors. This lets us define $r_i = \frac{ e_i }{ s_{ e} } =\frac{y_i - \hat{y}_i }{ s_{ e } }$ Since residuals follow normal distribution $$e \sim N(0,\sigma^2)$$, in 95% of the time we expect the standardized residuals to satisfy $$- 2 < r_i < 2$$. Any observation with is an extreme outlier, it is three sigmas away from the mean.

Another types of observations we are interested are the influential points. Those are are observations that affect the magnitude of our estimates $$\hat{\beta}$$’s. They are important to find as they typically have economic consequences. We will use to assess the significance of an influential point. Cook’s distance associated with sample $$i$$ measure the change in estimated model parameters $$\hat \beta$$ when sample $$i$$ is removed from the training data set. In R code, we can use

  datanew = data[-i,]  # Deletes ith row

to remove $$i$$th sample.

### 4.0.2 Predictive Uncertainty

Another advantage of using a statistical view is that we can assess how much error that could be in our best prediction. Suppose we would like to predict value of output variable for a new observation $$x_f$$, then we calculate $\hat{y}_f = \hat \beta_0 + \hat \beta x_f + e_f, \; {\rm where} \; e_f \sim N ( 0 , s^2 )$

Recall, that besides $$\epsilon$$ being random, our estimates $$\hat \beta_0$$ and $$\hat \beta_1$$ are also random. Thus, all three turns in the liner equation have errors.

After we account for all the uncertainty,

$\Var{y_f} = s^2 \left ( 1 + \frac{1}{n} + \frac{ (x_f - \bar{x} )^2 }{ (n-1) s_x^2 } \right )$

The value of $$\Var{y_f}$$ can be used to calculate a prediction interval. For example we know that 95% prediction interval is given by $\hat \beta_0 + \hat \beta_1 x_f \pm 1.96 \sqrt{\Var{y_f} }$ Prediction intervals tell us where we can expect to see the next data point sampled.

Similarly, confidence intervals tells us about how well we have determined the mean, i.e. it gives the range for $$\E{y_f\mid x_f}$$, note prediction interval gives the range for $$y_f$$ itself. Thus the 95% confidence interval is given by $\hat \beta_0 + \hat \beta_1 x_f \pm s.$

A large predictive error variance (high uncertainty) comes from four factors

1. Large $$s$$ (i.e. large errors, $$\epsilon$$’s)

2. Small $$n$$ (not enough data)

3. Small $$s_x$$ (not enough spread in the covariates)

4. Large difference between $$x_f$$ and $$\bar{x}$$ (predicting extremes)

### 4.0.3 Input Transformations

A linear model assumes that output variable is proportional to the input variable plus an offset. However, this is rarely the case. Often, we need to transform input variables by combining multiple inputs into a single predictor, for example by taking a ratio or putting inputs on a different scale, e.g. log-scale. In machine learning, this process is called feature engineering

#### 4.0.3.1 Factor Regression

One of the classic examples of feature engineering is Fama-French three-factor model which is used in asset prising and portfolio management. The model states that asset returns depend on (1) market risk, (2) the outperformance of small versus big companies, and (3) the outperformance of high book/market versus small book/market companies, mathematically $r = R_f + \beta(R_m - R_f) + b_s\cdot SMB + b_v\cdot HML + \alpha$ Here $$R_f$$ is risk-free return, $$R_m$$ is the return of market, $$SMB$$ stands for "Small market capitalization Minus Big" and $$HML$$ for "High book-to-market ratio Minus Low"; they measure the historic excess returns of small caps over big caps and of value stocks over growth stocks. These factors are calculated with combinations of portfolios composed by ranked stocks (BtM ranking, Cap ranking) and available historical market data.

Consider an example, when the relations are multiplicative. Figure TODO shows Johnson and Johnson’s quarterly earnings

jj = read.csv("data/JNJ.csv")
jj$Date = lubridate::ymd(jj$Date)
jj = jj %>%arrange(Date)
jj$Date0 = 12*(lubridate::year(jj$Date)-lubridate::year(jj$Date)) + (lubridate::month(jj$Date)-lubridate::month(jj$Date)) model = lm(jj$Open ~ jj$Date0) modellog = lm(log(jj$Open) ~ jj$Date0) plot(jj$Date0,jj$Open, type='l', col="red", ylab="JNJ Open", xlab="Month") abline(model, col="red") par(new=T) plot(log(jj$Open), type='l', xlab="", ylab="", xaxt='n', yaxt='n')
abline(modellog)
axis(side = 4)
legend("topleft", legend=c("Original", "log-Scale"), lty=1, col=c("red", "black"), lwd=3, bty="n", y.intersp = 1,cex=1) summary(model)
#>
#> Call:
#> lm(formula = jj$Open ~ jj$Date0)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -24.06 -11.12  -2.04   8.35  51.99
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -23.01349    1.20367   -19.1   <2e-16 ***
#> jj$Date0 0.19313 0.00361 53.5 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 14 on 576 degrees of freedom #> Multiple R-squared: 0.832, Adjusted R-squared: 0.832 #> F-statistic: 2.86e+03 on 1 and 576 DF, p-value: <2e-16 summary(modellog) #> #> Call: #> lm(formula = log(jj$Open) ~ jj$Date0) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.9037 -0.3101 -0.0773 0.3307 0.7830 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -9.31e-02 3.24e-02 -2.87 0.0042 ** #> jj$Date0     9.14e-03   9.72e-05   94.10   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.39 on 576 degrees of freedom
#> Multiple R-squared:  0.939,  Adjusted R-squared:  0.939
#> F-statistic: 8.86e+03 on 1 and 576 DF,  p-value: <2e-16

Just by visual inspection we can see that a straight line will not be a good fit for this data. The relation between the time and earnings is multiplicative. There are many business and natural science examples where multiplicative relation holds. To model multiplicative or power relations we can still use the apparatus of linear models. However, we need to apply log-transformation to both input and output variables. $$\log$$ is the natural $$\log_e$$ with base $$e=2.718\ldots$$ and that $$\log ( a b) = \log a + \log b$$ and $$log ( a^b ) = b \log a$$.

Consider a multiplicative model relating input-output, where $$A = e^a$$ and $y = A x^{\beta_1},$ The Log-Log transform (taking logs of both sides) leads to $\log y = \log A + \log x^b = a + \beta_1 \log x$

The slope, $$\beta_1$$ can be interpreted as an , that is % change in $$y$$ versus % change in $$x$$. Variables are related on a multiplicative, or , scale.

We can estimate this model by regressing $$\log y$$ on $$\log x$$. In R: model = lm(log(y) ~ log(x)

Another case is the exponential model. Any process that is exponentially growing will lead to such a model. For example growth of population with time or compounding interest in savings account.

Suppose that we have an equation: $y = A e^{\beta_1 x},$ where $$A = e^{\beta_0}$$. This is equivalent to $$\log(y) = a + \beta_1 x$$.

Taking logs of the original equation gives \begin{aligned} \log y &= \log A + \beta_1 x\\ \log y & = \beta_0 + \beta_1 x\end{aligned}

Therefore, we can run a regression of $$\log y$$ on $$x$$. However, there is a caveat: not all variables can be logged. There are some constraints

1. $$y>0$$ needs to be positive.

2. Dummy variables $$x=0$$ or $$1$$ can’t be logged.

3. Counting variables are usually left alone as well.

Example 4.2 (World’s Smartest Mammal) First of all, read in and attach our data ...

mammals = read.csv("data/mammals.csv")
attach(mammals)
#>                      Mammal  Brain    Body
#> 1          African_elephant 5712.0 6654.00
#> 2 African_giant_pouched_rat    6.6    1.00
#> 3                Arctic_Fox   44.5    3.38
#> 4    Arctic_ground_squirrel    5.7    0.92
#> 5            Asian_elephant 4603.0 2547.00
#> 6                    Baboon  179.5   10.55
tail(mammals)
#>                   Mammal Brain Body
#> 57                Tenrec   2.6  0.9
#> 58            Tree_hyrax  12.3  2.0
#> 59            Tree_shrew   2.5  0.1
#> 60                Vervet  58.0  4.2
#> 61         Water_opossum   3.9  3.5
#> 62 Yellow-bellied_marmot  17.0  4.0
### Linear model
model = lm(Brain~Body)

Residual Diagnostics

The residuals show that you need a transformation .…

layout(matrix(1:2, ncol = 2))
plot(Body,Brain, pch=21, bg="lightblue", cex=2)
abline(model, col="red", lwd=3)
plot(Brain,residuals(model), bg="lightblue", pch=21, cex=2) Residual Plots

par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, cex=2) log-log model

### Log-Log linear model
model = lm(log(Brain)~log(Body))
summary(model)
#>
#> Call:
#> lm(formula = log(Brain) ~ log(Body))
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -1.7534 -0.5391 -0.0924  0.4207  2.6115
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   2.1833     0.1068    20.4   <2e-16 ***
#> log(Body)     0.7432     0.0317    23.5   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.77 on 60 degrees of freedom
#> Multiple R-squared:  0.902,  Adjusted R-squared:   0.9
#> F-statistic:  551 on 1 and 60 DF,  p-value: <2e-16
par(mar=c(4.5,4.5,3,3))
layout(matrix(1:2, ncol = 2))
plot(log(Brain)~log(Body), pch=21, bg="lightblue", cex=2)
abline(model, col="red", lwd=3)
plot(log(Brain),residuals(model), bg="lightblue", pch=21, cex=2) That’s better!

$$4$$ in $$1$$ Residuals: log-log model

par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, cex=2) log-log Model

${ {\rm log(Body)} = 2.18 + 0.74 \; {\rm log(Brain)} \; . }$ The coefficients are highly significant $$R^2 = 90$$%

Outliers

res = rstudent(model)
outliers = order(res,decreasing = T)[1:10]
cbind(mammals[outliers,],
residual = res[outliers],
Fit = exp(model$fitted.values[outliers])) #> Mammal Brain Body residual Fit #> 11 Chinchilla 64 0.42 3.78 4.7 #> 34 Man 1320 62.00 2.67 190.7 #> 50 Rhesus_monkey 179 6.80 2.12 36.9 #> 6 Baboon 180 10.55 1.67 51.1 #> 42 Owl_monkey 16 0.48 1.46 5.1 #> 10 Chimpanzee 440 52.16 1.27 167.7 #> 27 Ground_squirrel 4 0.10 1.20 1.6 #> 43 Patas_monkey 115 10.00 1.11 49.1 #> 60 Vervet 58 4.19 1.06 25.7 #> 3 Arctic_Fox 44 3.38 0.92 22.0 There is a residual value of 3.78 extreme outlier. It corresponds to the Chinchilla. This suggests that the Chinchilla is a master race of supreme intelligence! Inference NO!!! I checked and there was a data entry error. • The brain weight is given as 64 grams and should only be 6.4 grams. • The next largest residual corresponds to mankind In this example the log-log transformation used seems to achieve two important goals, namely linearity and constant variance. ## 4.1 Multiple Regression Many problems involve more than one independent (explanatory) variable or factor which affects the dependent or response variable. • Multi-factor asset pricing models (APT). Stock returns, book-to-market ratios, Interest rates • Demand for a product given prices of competing brands, advertising, household attributes (to formulate pricing strategies) • Internet Analytics What do I like? Suggestions instead of Search! Alexa “book my Xmas vacation,” “buy my best friend a birthday present” ### 4.1.1 Regression Model $$Y$$ = response or outcome variable $$X_1, \ldots , X_p$$ = explanatory or input variable The general relationship is given by $Y = f( X_1, \ldots , X_p ) + \epsilon$ And a linear relationship is written $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon$ MLR Assumptions The Multiple Linear Regression (MLR) model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon$ assumptions follow those of simple linear regression: 1. The conditional mean of $$Y$$ is linear in the $$X_j$$ variables 2. The errors are normal $$N( 0 , \sigma^2 )$$. We write $Y \mid X_1 , \ldots , X_p \sim N \left ( \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p , \sigma^2 \right )$ Statistical versus Economic Significance When looking at the $$\beta$$ coefficients there are two issues 1. Statistical Significance: The $$t$$-ratios of the $$\beta$$’s 2. Economic Significance: The magnitudes of the $$\beta$$’s If $$X_i$$ increases by one unit holding the other $$X$$’s constant Then $$Y$$ will react by $$\beta_i$$ units. They are called marginal effects At the end of the day use your judgment! Model Diagnostics plot(model) provides diagnostics before model building! There are many possible caveats 1. Running simple regressions gives you the wrong answer! 2. Multiple regression takes into account the correlation between the factors and the independent variable. It does all the work for you. 3. A variable might be insignificant once we have incorporated a more important predictor variable. A common sense approach usually works well. If a variable never seems to be significant it typically isn’t. Model Prediction is the great equalizer!! Example 4.3 (Newfood) Goal of Experiment • A six month market test has been performed on the Newfood product. A breakfast cereal. • Build a multiple regression model that gives us good sales forecasts. • This dataset is the outcome of a controlled experiment in which the values of the independent variables which affect sales are chosen by the analyst. Analyses the factors which contribute to sales of a new breakfast cereal. Quantify the effects of business decisions such as choice of advertising level, location in store and pricing. variable description sales new cereal sales price price adv low or high advertising ($$0$$ or $$1$$) locat bread or breakfast section ($$0$$ or $$1$$) inc neighborhood income svol size of store 1. What happens when you regress sales on price, adv, locat? 2. Run the “kitchen-sink” regression. Perform Diagnostic checks. 3. Which variables should we transform? 4. Run the new model. Perform diagnostics and variable selection. 5. What’s the largest cooks distance? 6. Provide a summary of coefficients and statistical significance 7. Predict sales when $${\tt price} = 30, {\tt adv = 1} , {\tt income = 8}$$ and $${\tt svol} = 34$$. What happens when you predict at the median values of the characteristics? First we examine the correlation matrix: newfood = read.csv("data/newfood.csv") attach(newfood) names(newfood) #>  "sales" "price" "adv" "locat" "income" "svol" "city" "indx" head(newfood) #> sales price adv locat income svol city indx #> 1 225 24 0 0 7.3 34 3 1 #> 2 190 24 0 0 7.3 34 3 2 #> 3 205 24 0 0 7.3 34 3 3 #> 4 323 24 0 0 8.3 41 4 1 #> 5 210 24 0 0 8.3 41 4 2 #> 6 241 24 0 0 8.3 41 4 3 tail(newfood) #> sales price adv locat income svol city indx #> 67 180 34 1 1 6.6 29 1 1 #> 68 100 34 1 1 6.6 29 1 2 #> 69 75 34 1 1 6.6 29 1 3 #> 70 150 34 1 1 6.1 24 2 1 #> 71 122 34 1 1 6.1 24 2 2 #> 72 101 34 1 1 6.1 24 2 3 # correlation matrix cm = cor(cbind(sales,price,adv,locat,income,svol)) cm[upper.tri(cm, diag = TRUE)] = NA print(as.table(round(cm, 3))) #> sales price adv locat income svol #> sales #> price -0.658 #> adv 0.001 0.000 #> locat -0.001 0.000 0.000 #> income 0.163 -0.131 -0.746 0.000 #> svol 0.375 -0.179 -0.742 -0.040 0.809 Remember: correlations are not $$\beta$$’s!! Total sales volume is negatively correlated to advertising. Income is negatively correlated with advertising as well. How is the negative correlation apt to affect the advertising effects? print(as.table(round(cm[2:4, 1:3], 3))) #> sales price adv #> price -0.658 #> adv 0.001 0.000 #> locat -0.001 0.000 0.000 There’s no correlation in the $$X$$’s by design! Let’s start by only including price, adv, locat model = lm(sales~price+adv+locat) summary(model) #> #> Call: #> lm(formula = sales ~ price + adv + locat) #> #> Residuals: #> Min 1Q Median 3Q Max #> -97.81 -44.14 -9.11 23.62 237.19 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 562.312 53.139 10.58 5.1e-16 *** #> price -12.812 1.780 -7.20 6.3e-10 *** #> adv 0.222 14.535 0.02 0.99 #> locat -0.222 14.535 -0.02 0.99 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 62 on 68 degrees of freedom #> Multiple R-squared: 0.432, Adjusted R-squared: 0.407 #> F-statistic: 17.3 on 3 and 68 DF, p-value: 1.94e-08 • Why is the marketer likely to be upset by this regression?! • Why is the economist happy? Let’s add income and svol to the regression! ### 4.1.2 Transformation Power model: transform with log-log model = lm(log(sales)~log(price)+adv+locat+log(income)+log(svol)) summary(model) #> #> Call: #> lm(formula = log(sales) ~ log(price) + adv + locat + log(income) + #> log(svol)) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.5757 -0.1611 -0.0362 0.1928 0.5379 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 8.4070 1.3873 6.06 7.3e-08 *** #> log(price) -1.7430 0.2207 -7.90 4.0e-11 *** #> adv 0.1496 0.1005 1.49 0.14131 #> locat 0.0010 0.0609 0.02 0.98688 #> log(income) -0.5241 0.4958 -1.06 0.29439 #> log(svol) 1.0308 0.2553 4.04 0.00014 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.26 on 66 degrees of freedom #> Multiple R-squared: 0.608, Adjusted R-squared: 0.578 #> F-statistic: 20.5 on 5 and 66 DF, p-value: 2.75e-12 Why no logs for adv and locat variables? The log(svol) coefficient is close to one! $$R^2= 60$$% On the transformed scale, $\log sales=8.41 - 1.74 \log price + 0.150 {\text{adv}} + 0.001 {\text{locat}} - 0.524 \log inc + 1.03 \log svol$ On the un-transformed scale, $\text{sales} = e^{8.41} ( \text{price} )^{-1.74} e^{ 0. 15 \text{adv} } e^{ 0.001 \text{locat}} ( \text{inc} )^{-0.524} ( \text{svol} )^{1.03}$ sales/price,income and svol are a power sales/adv, locat are exponential ### 4.1.3 Interpretation Interpret your regression model as follows • Price elasticity is $$\hat{\beta}_{\text{price}} = - 1.74$$. A $$1$$% increase in price will drop sales $$1.74$$% • $${\tt adv}=1$$ increases sales by a factor of $$e^{0.15} = 1.16$$. That’s a $$16$$% improvement Variable Selection: delete locat as its statistically insignificant. ### 4.1.4 Prediction predict.lm provides a $$\hat{Y}$$-prediction given a new $$X_f$$ modelnew = lm(log(sales)~log(price)+adv+log(income)+log(svol)) summary(modelnew) #> #> Call: #> lm(formula = log(sales) ~ log(price) + adv + log(income) + log(svol)) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.5752 -0.1608 -0.0367 0.1923 0.5374 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 8.4086 1.3734 6.12 5.4e-08 *** #> log(price) -1.7431 0.2190 -7.96 2.8e-11 *** #> adv 0.1495 0.0996 1.50 0.13811 #> log(income) -0.5236 0.4913 -1.07 0.29037 #> log(svol) 1.0303 0.2515 4.10 0.00012 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.25 on 67 degrees of freedom #> Multiple R-squared: 0.608, Adjusted R-squared: 0.585 #> F-statistic: 26 on 4 and 67 DF, p-value: 5.01e-13 newdata=data.frame(price=30,adv=1,income=8,svol=34) predict.lm(modelnew,newdata,se.fit=T,interval="confidence",level=0.99) #>$fit
#>   fit lwr upr
#> 1 5.2 4.9 5.5
#>
#> $se.fit #>  0.1 #> #>$df
#>  67
#>
#> $residual.scale #>  0.25 Exponentiate-back to find $$\text{sales} = e^{5.1739} = 176.60$$. ## 4.2 Interactions • Does gender change the effect of education on wages? • Do patients recover faster when taking drug A? • How does advertisement affect price sensitivity? • Interactions are useful. Particularly with dummy variables. • We build a kitchen-sink model with all possible dummies (day of the week, gender,...) ### 4.2.1 Models with Interactions In many situations, $$X_1$$ and $$X_2$$ interact when predicting $$Y$$ Interaction Model: run the regression $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$ In R: model = lm(y = x1 * x2) gives $$X_1+X_2+X_1X_2$$, and model = lm(y = x1:x2) gives only $$X_1 X_2$$ The coefficients $$\beta_1$$ and $$\beta_2$$ are marginal effects. If $$\beta_3$$ is significant there’s an interaction effect. We leave $$\beta_1$$ and $$\beta_2$$ in the model whether they are significant or not. ### 4.2.2 Models with Interactions and Dummies $$X_1$$ and $$D$$ dummy • $$X_2 = D$$ is a dummy variable with values of zero or one. • Model: typically we run a regression of the form $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 \star D + \epsilon$ • The coefficient $$\beta_1 + \beta_2$$ is the effect of $$X_1$$ when $$D=1$$. The coefficient $$\beta_1$$ is the effect when $$D=0$$.  \BeginKnitrBlock{example}\iffalse{-91-79-114-97-110-103-101-32-74-117-105-99-101-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-34"><strong>(\#exm:unnamed-chunk-34) \iffalse (Orange Juice) \fi{} </strong></span></div>\EndKnitrBlock{example} r oj = read.csv("./data/oj.csv") head(oj) #> store brand week logmove feat price AGE60 EDUC ETHNIC INCOME HHLARGE #> 1 2 tropicana 40 9.0 0 3.9 0.23 0.25 0.11 11 0.1 #> 2 2 tropicana 46 8.7 0 3.9 0.23 0.25 0.11 11 0.1 #> 3 2 tropicana 47 8.3 0 3.9 0.23 0.25 0.11 11 0.1 #> 4 2 tropicana 48 9.0 0 3.9 0.23 0.25 0.11 11 0.1 #> 5 2 tropicana 50 9.1 0 3.9 0.23 0.25 0.11 11 0.1 #> 6 2 tropicana 51 8.9 0 3.9 0.23 0.25 0.11 11 0.1 #> WORKWOM HVAL150 SSTRDIST SSTRVOL CPDIST5 CPWVOL5 #> 1 0.3 0.46 2.1 1.1 1.9 0.38 #> 2 0.3 0.46 2.1 1.1 1.9 0.38 #> 3 0.3 0.46 2.1 1.1 1.9 0.38 #> 4 0.3 0.46 2.1 1.1 1.9 0.38 #> 5 0.3 0.46 2.1 1.1 1.9 0.38 #> 6 0.3 0.46 2.1 1.1 1.9 0.38 model = lm(logmove ~ log(price)*feat, data=oj) summary(model) #> #> Call: #> lm(formula = logmove ~ log(price) * feat, data = oj) #> #> Residuals: #> Min 1Q Median 3Q Max #> -5.299 -0.471 0.020 0.495 3.077 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 9.6593 0.0164 588.3 <2e-16 *** #> log(price) -0.9582 0.0187 -51.3 <2e-16 *** #> feat 1.7144 0.0304 56.4 <2e-16 *** #> log(price):feat -0.9773 0.0419 -23.3 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.79 on 28943 degrees of freedom #> Multiple R-squared: 0.398, Adjusted R-squared: 0.398 #> F-statistic: 6.37e+03 on 3 and 28943 DF, p-value: <2e-16 model = lm(log(price)~ brand-1, data = oj) summary(model) #> #> Call: #> lm(formula = log(price) ~ brand - 1, data = oj) #> #> Residuals: #> Min 1Q Median 3Q Max #> -1.1809 -0.1026 -0.0022 0.1306 0.4626 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> branddominicks 0.52697 0.00207 254 <2e-16 *** #> brandminute.maid 0.79069 0.00207 382 <2e-16 *** #> brandtropicana 1.03460 0.00207 500 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.2 on 28944 degrees of freedom #> Multiple R-squared: 0.941, Adjusted R-squared: 0.941 #> F-statistic: 1.53e+05 on 3 and 28944 DF, p-value: <2e-16 levels(oj$brand)
#> NULL
brandcol <- c("green","red","gold")
par(mfrow=c(1,2))
boxplot(log(price) ~ brand, data=oj, col=brandcol)
plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand]) • 83 Chicagoland Stores (Demographic info for each) • Price, sales (log units moved), and whether advertised (feat) Orange Juice: Price vs Sales plot(exp(logmove) ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="move") Orange Juice: Price vs log(Sales)

plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)") Orange Juice: Price vs log(Sales) l1 <- loess(logmove ~ price, data=oj, span=2) smoothed1 <- predict(l1) ind = order(oj$price)
plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)") lines(smoothed1[ind], x=oj$price[ind], col="blue", lwd=2) Orange Juice: log(Price) vs log(Sales)

plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)") l2 <- lm(logmove ~ log(price), data=oj) smoothed2 <- predict(l2) ind = order(oj$price)
lines(smoothed2[ind], x=log(oj$price[ind]), col="blue", lwd=2) Why? Multiplicative (rather than additive) change. Now we are interested in how does advertisement affect price sensitivity? Original model $\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_2 \mathrm{feat}$ If we feature the brand (in-store display promo or flyer ad), does it affect price sensitivity $$\beta_1$$? If we assume it does $\beta_1 = \beta_3 + \beta_4\mathrm{feat}$ The new model is $\log(\mathrm{sales}) = \beta_0 + (\beta_3 + \beta_4\mathrm{feat})\log(\mathrm{price}) + \beta_2 \mathrm{feat}$ After expanding $\log(\mathrm{sales}) = \beta_0 + \beta_3\log(\mathrm{price}) + \beta_4\mathrm{feat}*\log(\mathrm{price}) + \beta_2 \mathrm{feat}$ ## and finally, consider 3-way interactions ojreg <- lm(logmove ~ log(price)*feat, data=oj) summary(ojreg) #> #> Call: #> lm(formula = logmove ~ log(price) * feat, data = oj) #> #> Residuals: #> Min 1Q Median 3Q Max #> -5.299 -0.471 0.020 0.495 3.077 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 9.6593 0.0164 588.3 <2e-16 *** #> log(price) -0.9582 0.0187 -51.3 <2e-16 *** #> feat 1.7144 0.0304 56.4 <2e-16 *** #> log(price):feat -0.9773 0.0419 -23.3 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.79 on 28943 degrees of freedom #> Multiple R-squared: 0.398, Adjusted R-squared: 0.398 #> F-statistic: 6.37e+03 on 3 and 28943 DF, p-value: <2e-16 print(lm(logmove ~ log(price), data=oj)) #> #> Call: #> lm(formula = logmove ~ log(price), data = oj) #> #> Coefficients: #> (Intercept) log(price) #> 10.4 -1.6 print(lm(logmove ~ log(price)+feat + brand, data=oj)) #> #> Call: #> lm(formula = logmove ~ log(price) + feat + brand, data = oj) #> #> Coefficients: #> (Intercept) log(price) feat brandminute.maid #> 10.279 -2.530 0.891 0.682 #> brandtropicana #> 1.302 print(lm(logmove ~ log(price)*feat, data=oj)) #> #> Call: #> lm(formula = logmove ~ log(price) * feat, data = oj) #> #> Coefficients: #> (Intercept) log(price) feat log(price):feat #> 9.659 -0.958 1.714 -0.977 print(lm(logmove ~ brand-1, data=oj)) #> #> Call: #> lm(formula = logmove ~ brand - 1, data = oj) #> #> Coefficients: #> branddominicks brandminute.maid brandtropicana #> 9.17 9.22 9.11 Advertisement increases price sensitivity from -0.96 to -0.958 - 0.98 = -1.94! Why? One of the reasons is that the price was lowered during the Ad campaign! library(dplyr) doj = oj %>% filter(brand=="dominicks") par(mfrow=c(1,3), mar=c(4.2,4.6,2,1)) boxplot(price ~ feat, data = oj[oj$brand=="dominicks",], col=c(2,3), main="dominicks", ylab="Price ($)") boxplot(price ~ feat, data = oj[oj$brand=="minute.maid",], col=c(2,3), main="minute.maid")
boxplot(price ~  feat, data = oj[oj$brand=="tropicana",], col=c(2,3), main="tropicana") 0 = not featured, 1 = featured ## 4.3 Dummies We want to understand effect of the brand on the sales $\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \xcancel{\beta_2\mathrm{brand}}$ But brand is not a number! How can you use it in your regression equation? We introduce dummy variables Brand Intercept brandminute.maid brandtropicana minute.maid 1 1 0 tropicana 1 0 1 dominicks 1 0 0 $\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_{21}\mathrm{brandminute.maid} + \beta_{22}\mathrm{brandtropicana}$ R will automatically do it it for you print(lm(logmove ~ log(price)+brand, data=oj)) #> #> Call: #> lm(formula = logmove ~ log(price) + brand, data = oj) #> #> Coefficients: #> (Intercept) log(price) brandminute.maid brandtropicana #> 10.83 -3.14 0.87 1.53 $\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_3\mathrm{brandminute.maid} + \beta_4\mathrm{brandtropicana}$ $$\beta_3$$ and $$\beta_4$$ are "change relative to reference" (dominicks here). How does brand affect price sensitivity? Interactions: logmove ~ log(price) * brand No Interactions: logmove ~ log(price) + brand Parameter Interactions No Interactions (Intercept) 10.95 10.8288 log(price) -3.37 -3.1387 brandminute.maid 0.89 0.8702 brandtropicana 0.96239 1.5299 log(price):brandminute.maid 0.057 log(price):brandtropicana 0.67 Example 4.4 (Golf Performance Data) Dave Pelz has written two best-selling books for golfers, Dave Pelz’s Short Game Bible, and Dave Pelz’s Putting Bible. • Dave Pelz was formerly a “rocket scientist” (literally) Data analytics helped him refine his analysis It’s the short-game that matters! • The optimal speed for a putt Best chance to make the putt is one that will leave the ball $$17$$ inches past the hole, if it misses. Golf Data Year-end performance data on 195 players from the 2000 PGA Tour. 1. nevents, the number of official PGA events included in the statistics 2. money, the official dollar winnings of the player 3. drivedist, the average number of yards driven on par 4 and par 5 holes 4. gir, greens in regulation, measured as the percentage of time that the first (tee) shot on a par 3 hole ends up on the green, or the second shot on a par 4 hole ends up on the green, or the third shot on a par 5 hole ends up on the green 5. avgputts, which is the average number of putts per round. Analyze these data to see which of nevents, rivedist, gir, avgputts is most important for winning money. Regression of Money on all explanatory variables: source("src/utils.R") library(tidyverse) d00 = read_csv("data/pga-2000.csv") d18 = read_csv("data/pga-2018.csv") model18 = lm(money ~ nevents + drivedist + gir + avgputts, data=d18) model00 = lm(money ~ nevents + drivedist + gir + avgputts, data=d00) summary(model00) #> #> Call: #> lm(formula = money ~ nevents + drivedist + gir + avgputts, data = d00) #> #> Residuals: #> Min 1Q Median 3Q Max #> -1158735 -364118 -62913 285081 5604080 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 14856638 4206466 3.53 0.00052 *** #> nevents -30066 11183 -2.69 0.00781 ** #> drivedist 21310 6913 3.08 0.00236 ** #> gir 120855 17429 6.93 6.2e-11 *** #> avgputts -15203045 2000905 -7.60 1.3e-12 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 672000 on 190 degrees of freedom #> Multiple R-squared: 0.493, Adjusted R-squared: 0.483 #> F-statistic: 46.3 on 4 and 190 DF, p-value: <2e-16 $$R^2 = 50$$% Residuals hist(rstandard(model00), breaks=20, col="lightblue", xlab = "Standartized Residual", main="") arrows(x0 = 7.5,y0 = 20,x1 = 8.5,y1 = 2,length = 0.1) text(x = 7,y = 22,labels = "Tiger Woods", cex=1.5) Regression Transform with log(Money) as it has much better residual diagnostic plots. lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, data = d00) #> #> Call: #> lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, #> data = d00) #> #> Coefficients: #> (Intercept) nevents drivedist gir avgputts #> 36.14923 -0.00899 0.01409 0.16567 -21.12875 $$R^2 = 67$$%. There’s still $$33$$% of variation to go Residuals for log(Money) par(mar = c(4,4.5,0,0),mfrow=c(1,1)) model00log = lm(log(money) ~ nevents + drivedist + gir + avgputts, data=d00) summary(model00log) #> #> Call: #> lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, #> data = d00) #> #> Residuals: #> Min 1Q Median 3Q Max #> -1.5438 -0.3655 -0.0058 0.3968 1.9262 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 36.14923 3.57763 10.10 <2e-16 *** #> nevents -0.00899 0.00951 -0.94 0.346 #> drivedist 0.01409 0.00588 2.40 0.018 * #> gir 0.16567 0.01482 11.18 <2e-16 *** #> avgputts -21.12875 1.70178 -12.42 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.57 on 190 degrees of freedom #> Multiple R-squared: 0.671, Adjusted R-squared: 0.664 #> F-statistic: 96.9 on 4 and 190 DF, p-value: <2e-16 hist(rstandard(model00log), breaks=20, col="lightblue", xlab = "Standartized Residual", main="") arrows(x0 = 3,y0 = 20,x1 = 3.2,y1 = 2,length = 0.1) text(x = 3,y = 22,labels = "Tiger Woods", cex=1.5) Regression: Variable selection $$t$$-stats for nevents is $$<1.5$$. lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00) #> #> Call: #> lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00) #> #> Coefficients: #> (Intercept) drivedist gir avgputts #> 36.1737 0.0146 0.1658 -21.3684 The fewer the putts the better golfer you are. Duh! avgputts per round is hard to decrease by one! Evaluating the Coefficients 1. Greens in Regulation (GIR) has a $$\hat{\beta} = 0.17$$. If I can increase my GIR by one, I’ll earn $$e^{0.17} = 1.18$$% An extra $$18$$% 2. DriveDis has a $$\hat{\beta} = 0.014$$. A $$10$$ yard improvement, I’ll earn $$e^{0.014 \times 10} =e^{0.14} = 1.15$$% An extra $$15$$% Caveat: Everyone has gotten better since 2000! Main Findings Tiger was $$9$$ standard deviations better than the model. • Taking logs of Money helps the residuals! • An exponential model seems to fit well. The residual diagnostics look good • The $$t$$-ratios for nevents are under $$1.5$$. Over-Performers Outliers: biggest over and under-performers in terms of money winnings, compared with the performance statistics. Woods, Mickelson, and Els won major championships by playing well when big money prizes were available. Over-Performers Table 4.1: Over-Performers name money Predicted Error 1 Tiger Woods 9188321 3584241 5604080 2 Phil Mickelson 4746457 2302171 2444286 3 Ernie Els 3469405 1633468 1835937 4 Hal Sutton 3061444 1445904 1615540 20 Notah Begay III 1819323 426061 1393262 182 Steve Hart 107949 -1186685 1294634 Under-Performers Underperformers are given by large negative residuals Glasson and Stankowski should win more money. Table 4.2: Under-Performers name money Predicted Error 47 Fred Couples 990215 1978477 -988262 52 Kenny Perry 889381 1965740 -1076359 70 Paul Stankowski 669709 1808690 -1138981 85 Bill Glasson 552795 1711530 -1158735 142 Jim McGovern 266647 1397818 -1131171 Lets look at 2018 data Highest earners are Table 4.3: Highest earners 2018 name nevents money drivedist gir avgputts Justin Thomas 23 8694821 312 69 1.7 Dustin Johnson 20 8457352 314 71 1.7 Justin Rose 18 8130678 304 70 1.7 Bryson DeChambeau 26 8094489 306 70 1.8 Brooks Koepka 17 7094047 313 68 1.8 Bubba Watson 24 5793748 313 68 1.8 Overperformers Table 4.4: Overperformers 2018 name money Predicted Error 1 Justin Thomas 8694821 5026220 3668601 2 Dustin Johnson 8457352 6126775 2330577 3 Justin Rose 8130678 4392812 3737866 4 Bryson DeChambeau 8094489 3250898 4843591 5 Brooks Koepka 7094047 4219781 2874266 6 Bubba Watson 5793748 3018004 2775744 9 Webb Simpson 5376417 2766988 2609429 11 Francesco Molinari 5065842 2634466 2431376 12 Patrick Reed 5006267 2038455 2967812 84 Satoshi Kodaira 1471462 -1141085 2612547 Underperformers Table 4.5: Underperformers 2018 name money Predicted Error 102 Trey Mullinax 1184245 3250089 -2065844 120 J.T. Poston 940661 3241369 -2300708 135 Tom Lovelady 700783 2755854 -2055071 148 Michael Thompson 563972 2512330 -1948358 150 Matt Jones 538681 2487139 -1948458 158 Hunter Mahan 457337 2855898 -2398561 168 Cameron Percy 387612 3021278 -2633666 173 Ricky Barnes 340591 3053262 -2712671 176 Brett Stegmaier 305607 2432494 -2126887 Findings Here’s three interesting effects: • Tiger Woods is $$8$$ standard deviations better! • Increasing driving distance by $$10$$ yards makes you $$15$$% more money • Increasing GIR by one makes you $$18$$% more money. • Detect Under- and Over-Performers Go Play!! Regression 1. Input and Plot Data In R: plot and summary commands 2. “Kitchen-Sink” Regression lm command with all variables 3. Residual Diagnostics and plot(model) Fitted values and Standardised residuals. Outliers and Influence 4. Transformation? Correct the 4-in-1 plots and assumptions. ### 4.3.1 Regression Strategy 1. Variable Selection $$t$$-state and $$p$$-values from summary(model) 2. Final Regression Re-run the model. Interpret the coefficients summary(model). Economic and Statistical Significance 3. Prediction predict.lm. Out-of-sample forecasting A model is only as good as its predictions!! ### 4.3.2 Predictive Analytics: General Introduction Predictive Analytics is the most widely used tool for high dimensional input-output analysis ${Y = F(X) \; \; {\rm where} \; \; X=(X_1, \ldots , X_p )}$ • Consumer Demand (Amazon, Airbnb, ... ) • Maps (Bing, Uber) • Pricing • Healthcare The applications are endless .… Example 4.5 (Target) Target and other retailers use predictive analytics to study consumer purchasing behaviour to see what type of coupons or promotions you might like Here’s a famous story about a father and his daughter. Target predicted that his daugther was pregnant from her purchasing behaviour long before they were buying diapers Here’s the original link ... Target and Pregnancy Getting a customer to a routine is the key • M.I.T experiment: t-shaped maze with chocolate at the end and behind the barrier that opens after a loud click • While each animal wandered through the maze, its brain was working furiously • As the scientists repeated the experiment, again and again, the rats eventually stopped sniffing corners and making wrong turns and began to zip through the maze with more and more speed • As each rat learned how to complete the maze more quickly, its mental activity decreased Leaerning routines from data is the basis for modern marketing • Habits is a three-step loop: cue, a trigger (go into automatic mode), then the routine • Febreze: original adds were targeting a wrong routine (kill the smell), no sails. They the ad said: use Febreze after cleaning each room. Now it is one of the most successful products. • Target used the fact that customers who going through a major life event change their habits (routines). They can identify due dates from registry. Example 4.6 (Walmart) Walmart began using predictive analytics in 2004. Mining trillions of bytes’ worth of sales data from recent hurricanes Determine what customers most want to purchase leading up to a storm. Strawberry Pop-Tarts are one of the most purchased food items, especially after storms, as they require no heating and can be eaten at any meal Walmart and Hurricances Example 4.7 (Germany’s Otto) Otto sells other brands, does not stock those goods itself, hard to avoid one of the two evils: shipping delays until all the orders are ready for fulfilment, or lots of boxes arriving at different times. • Analyze around 3 billion past transactions and 200 variables–past sales, searches on Otto’s site and weather information. They predict what customers will buy a week before they order. This system has proved so reliable, predicting with 90% accuracy what will be sold within 30 days, that Otto allows it automatically to purchase around $$200,000$$ items a month from third-party brands with no human intervention. [Economist] [Germany’s Otto] Example 4.8 (Stitch) Stitch Fix CEO Says AI Is ‘So Real’ and Incredibly Valuable Stitch Fix asks customers for insights and feedback alongside their size and color preference for items, even the ones customers didn’t like or buy, in exchange for a clear value proposition. The breadth and depth of their data are valuable. Their model relies on a combination of data science – machine learning, AI and natural language processing – and human stylists; on top of complex customer profiles built by data, stylists can layer the nuances of buying and wearing clothes. Example 4.9 (Uber Pool) Bayes predicts where you’re going to be dropped off. Uber constructs prior probabilities for riders, Uber cars, and popular places. Combine to construct a joint probability table Then calculate the posterior probability of destination for each person and pool travellers together [Uber Pool] Example 4.10 () Example: $Kaggle: Predictive Culture$ \BeginKnitrBlock{example}\iffalse{-91-65-105-114-98-110-98-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-62"><strong>(\#exm:unnamed-chunk-62) \iffalse (Airbnb) \fi{} </strong></span> <!-- [Airbnb New User Bookings Prediction Competition] --> New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. Accurately predict where a new user will book their first travel experience Airbnb can then personalized content, decrease the average time to first booking, and better forecast demand.$12\$ classes--major destinations, and a did not book category

[Stitch Fix CEO Says AI Is 'So Real' and Incredibly Valuable]: https://www.bloomberg.com/news/videos/2018-10-25/stitch-fix-ceo-says-ai-is-so-real-and-incredibly-valuable-video
[Uber Pool]: https://newsroom.uber.com/inferring-uber-rider-destinations/
[Airbnb New User Bookings Prediction Competition]: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

List of users, demographics, web session records, and content data
</div>\EndKnitrBlock{example}
<img src="fig/airbnb_arch-3dplace.png" width="70%" style="display: block; margin: auto;" />

Winner has the best out-of-sample prediction!!


Example 4.11 () Example: $Hacking OkCupid$ Sorted daters into seven clusters, like "Diverse" and "Mindful," each with distinct characteristics.

Wired article

NOVA Video


\BeginKnitrBlock{example}\iffalse{-91-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-66"><strong>(\#exm:unnamed-chunk-66)  \iffalse () \fi{} </strong></span>Example: $NFL Dynamic Pricing$

Predict price demand for any given Lions game for any given seat in the
</div>\EndKnitrBlock{example}
<img src="https://fast.wistia.com/embed/iframe/l85ltyxqbh?autoplay=0" width="100%" style="display: block; margin: auto;" />

We submitted our report on June 2016 suggesting that some areas of the
stadium were priced efficiently and some were underpriced or overpriced.

On Fed 2017, Detroit Jock City wrote

"Detroit Lions tickets will cost a little more on average for 2017, but

<img src="fig/lions.png" width="100%" style="display: block; margin: auto;" />

<https://detroitjockcity.com/2017/02/10/detroit-lions-2017-ticket-prices/>


Example 4.12 () Example: $Latour 1982 Price History$ wininvestment

Ch^ateau Latour: grand vin \BeginKnitrBlock{example}\iffalse{-91-93-}\fi{}<div class="example"><span class="example" id="exm:unnamed-chunk-72"><strong>(\#exm:unnamed-chunk-72)  \iffalse () \fi{} </strong></span>
Example: Global Warming
</div>\EndKnitrBlock{example}

Shifting Distribution of Northern Hemisphere Summer Temperature
Anomalies, 1951-2011

[NASA article with animation]`

Climate statistics and public policy

Change in global mean temperature is not one of the most sensitive indicator

• Changes in radiation spectrum (atmosphere is less transparent in the infrared)

• Sea surface temperature and Land surface temperature

• Sea level rise (thermal expansion and ice melt): Greenland and West Antarctic are melting + glacial melt

• Ocean acidification: CO$$_2$$ gets absorbed by water, it produces carbolic acid

• Seasonal changes; winter - summer temperature has been decreasing since 1954. Shift changes (earlier seasons) lead to ecological effects

• Hurricanes: increase in maximum wind velocity = sea surface temperature + the difference between sea surface temperature and the average air temperature in the outflow of the hurricane

Guttorp paper

2018 was the fourth-warmest year on record.  NYT article

How Much Hotter Is Your Hometown Than When You Were Born? NYT article

Ice cores is an important source of data

Ice core. Cylinder of ice drilled out of an ice sheet or glacier. Most ice core records come from Antarctica and Greenland.

The oldest continuous ice core records to date extend 123,000 years in Greenland and 800,000 years in Antarctica. Ice Core Datasets

Ice Core Basics

• has been around since the 1950s

• Mostly from Greenland and Antarctica

• bubbles in the ice core preserve actual samples of the world’s ancient atmosphere

The World Data Center (WDC) for Paleoclimatology maintains archives of ice core data from polar and low-latitude mountain glaciers and ice caps throughout the world. Proxy climate indicators include oxygen isotopes, methane concentrations, dust content, as well as many other parameters.

https://www.ncdc.noaa.gov/data-access/paleoclimatology-data/datasets/ice-core

CO2 was stable over the last millennium In the early 19th century CO2 concentration started to rise, and its concentration is now nearly 40% higher than it was before the industrial revolution

Things we learned from ice core

• Ice cores contain information about past temperature, and about many other aspects of the environment.

• Atmospheric carbon dioxide levels are now 40% higher than before the industrial revolution. This increase is due to fossil fuel usage and deforestation.

• The magnitude and rate of the recent increase are almost certainly unprecedented over the last 800,000 years.

• Methane also shows a huge and unprecedented increase in concentration over the last two centuries.

Gates thinks we can use more renewables and nuclear

Need better storage + generation (wind/sun) technology