Chapter 4 Linear Regression
Regression analysis is the most widely used statistical tool for understanding relationships among variables. It provides a conceptually simple method for investigating functional relationships between one or more factors and an outcome of interest. This relationship is expressed in the form of an equation, which we call the model, connecting the response or dependent variable and one or more explanatory or predictor variables.
Example 4.1 (Henrietta Leavitt) Let’s start with a famous analysis by Henrietta Leavitt, who used linear regression to estimate distances to pulsating stars. In her analysis of 1777 variable (pulsating) stars she found that the brighter stars have longer oscillation periods.
She used data for 25 Cepheid variables, which are pulsating stars whose varying diameter and temperature lead to periodic changes in brightness with a specific frequency and amplitude. Below you see the scatter plot of period (on a logarithmic scale) vs brightness of those 25 stars.

Figure 4.1: Henrietta Leavitt’s 1912 data on 25 pulsating stars.
Then she drew a straight line through this cloud of points, which allowed her to predict brightness from the period. Periods can be accurately measured, and the predicted brightness allows astronomers to measure cosmic distances over previously unimaginable scales by comparing the predicted brightness with the apparent brightness.
Though we now have many more sophisticated models for describing relations in a dataset, the simplicity and interpretability of the linear regression model make it a viable option for many practical applications. The regression model can be used both for making a prediction and for explaining relations between quantities. A prediction question is, for example, for how much will my house sell? An example of an explanation question is, what is the impact of my education on my income?
To demonstrate linear regression we develop a model that predicts the price of a house given its square footage. This is a simplified version of a regression model used by Zillow, a house pricing site. First, we look at property sales data where we know the price and some observed characteristics. Let’s look at the scatter plot of living area measured in square feet and the price measured in thousands of dollars.
Second, we build a decision model that predicts price as a function of the observed characteristics. Now, we need to decide which characteristics to use. There are many factors or variables that affect the price of a house, for example location. Some other basic ones include size, number of rooms, and parking.
Let’s run a simple linear regression on size, measured in square feet. The value that we seek to predict is called the dependent variable, and we denote it by \(y\); here it is the price of the house measured in thousands of dollars. The variable that we use to construct our prediction is the predictor variable, and we denote it by \(x\). What does the data look like?
= read.csv("data/SaratogaHouses.csv")
d $price = d$price/1000; d$livingArea = d$livingArea/1000
dplot(d$livingArea,d$price, 'p', pch=20, col="blue")
We use the `lm` function to fit a line:
l = lm(price ~ livingArea, data=d)
plot(d$livingArea, d$price, 'p', pch=20, col="blue", xlab="Living Area", ylab="Price")
abline(l, col="red", lwd=3)
We can use the `coef` function to find the slope and the intercept of this line.
coef(l)
#> (Intercept) livingArea
#> 13 113
The intercept value is in units of \(y\) ($1000). The slope is in units of \(y\) per unit of \(x\): since we rescaled living area, that is $1000 per 1000 sq. ft.
In general, adding statistical noise, the simple linear regression model is given by the following equation \[ y = \beta_0 + \beta_1 x + \epsilon \; \; {\rm where} \; \epsilon \sim N(0, \sigma^2) \] Here the error term \(\epsilon\) indicates that the relation between \(x\) and \(y\) is not exactly a line, and for a given \(x\) the value of \(y\) can deviate from the line. We will write this as \(y \mid x \sim N\left(\beta_0 + \beta_1 x, \sigma^2 \right)\).
In our housing example, we find that \(\beta_0 = 13.44\) and \(\beta_1 = 113.12\). Thus, every additional 1000 sq. ft. increases the price of the house by about $113,120, and the predicted price of a lot of land without a house (\(x = 0\)) is about $13,440. The magnitude of \(\beta_1\) shows the economic significance of the house’s square footage.
Example (Making Predictions) We can now predict the price of a house when we only know its size: take the value off the regression line. For example, given a house size of $x = 2.2$ (thousand sq. ft.), the predicted price is
$$
\hat{y} = 13.44 + 113.12 \times 2.2 \approx 262.3,
$$
that is, about $262,300.
Example (Mean Squared Error) Now we can use our training data set to compare the prices predicted by the model with the actual prices. We will do it by calculating the average squared difference, also called the mean squared error
$$
MSE = \dfrac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2
$$
The MSE computed on the same data used to fit the model is called the in-sample MSE; the MSE computed on new data that was not used for fitting is the out-of-sample MSE, which is a more honest measure of predictive performance.
Or in code
```r
yhat = coef(l)[1] + coef(l)[2]*d$livingArea
y = d$price
mse = sum((yhat-y)^2)/length(y)
mse
#> [1] 4770
```
Sometimes it is more convenient to use the root mean squared error (RMSE) = \(\sqrt{4769.91} = 69.06\), which is measured in the same units as the variable \(y\), price in our case.
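In code, a minimal sketch continuing from the mse object computed above:

```r
# RMSE is the square root of the MSE, in the same units as price ($1000s)
rmse = sqrt(mse)
rmse
```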
## Least Squares Principle
Following Gauss, we will use the mean squared error as the criterion to pick the model that best describes the relations in our data. The least squares principle simply says that we find the $\beta_0$ and $\beta_1$ that minimize the value of the MSE.
If the MSE were zero we would have a perfect line. We can check that the `lm` function indeed finds the best values of $\beta_0$ and $\beta_1$ by looking at the MSE associated with perturbed coefficients
```r
set.seed(1)
plot(d$livingArea, d$price, 'p', pch=20, col="blue")
abline(l, col="red", lwd=3)
cat("This produces output,")
#> This produces output,
for(i in 1:5) {
beta0 = rnorm(1,1,0.2)*coef(l)[1]
beta1 = rnorm(1,1,0.2)*coef(l)[2]
yhat = beta0 + beta1*d$livingArea
abline(beta0,beta1, lwd=2, lty=2, col="green")
print(sum((yhat-y)^2)/length(y))
}
#> [1] 4808
#> [1] 9003
#> [1] 5907
#> [1] 5815
#> [1] 4900
```
Mathematically, least squares chooses \(\beta_0\) and \(\beta_1\) to minimize \[ (\hat{\beta}_0,\hat{\beta}_1) = \arg\min_{\beta_0,\beta_1}\quad\sum_{i=1}^n L_i \]
where \[ \begin{aligned} \sum_{i=1}^n L_i = & L_1 + \ldots + L_n = (y_1 - \hat{y}_1 )^2 + \ldots + ( y_n - \hat{y}_n )^2 = \\ & (y_1 - \beta_0 - \beta_1x_1)^2 + \ldots + (y_n - \beta_0 - \beta_1x_n)^2\end{aligned} \] This optimization problem can be solved analytically. The equation for the intercept guarantees that \[\bar{y} = \hat\beta_0 + \hat\beta_1 \bar{x}.\] Thus, the point \(( \bar{x} , \bar{y} )\) is always on the regression line.
The slope equation takes the form \[ \begin{aligned} \hat \beta_1 & = \Cor{x,y} \times \frac{s_y}{s_x} = \frac{ \sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y} ) }{ \sum_{i=1}^n ( x_i - \bar{x})^2 }\\ & = \frac{ \Cov{x,y} }{ \Var{x} }.\end{aligned} \] Here \(\hat\beta_1\) is the correlation \(r\) times a factor \(s_y/s_x\) that ensures the proper units. This amounts to estimating the parameters using their sample counterparts. Once we have estimated the parameters of the model, we are ready to generate predictions.
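As a quick sanity check, here is a minimal R sketch (reusing the data frame d and the fit l from the housing example above) that computes the slope and intercept from the sample covariance and variance and compares them with coef(l):

```r
# Closed-form least squares estimates
b1 = cov(d$price, d$livingArea) / var(d$livingArea)  # slope = Cov(x,y)/Var(x)
b0 = mean(d$price) - b1 * mean(d$livingArea)         # line goes through (x-bar, y-bar)
c(b0, b1)
coef(l)  # should match the values above
```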
4.0.1 Statistical Properties of Regression
It is convenient to introduce a regression model using the language of probability and uncertainty. We define the random variable of interest, \(y\). Then the regression model assumes that the mean of \(y\) depends linearly on the predictor \(x\) \[y = \beta_0 + \beta_1 x + \epsilon,~ \text{where}~\epsilon \sim N(0, \sigma^2).\] The line coefficients \(\beta_0\) and \(\beta_1\) have the same interpretation as in the deterministic approach. Here, we have a new parameter \(\sigma^2\) that models the dispersion of \(y_i\) around the mean \(\beta_0 + \beta_1 x\). Let’s see an example.
Figure: data with small and large measurement errors.
In that figure, the black solid line represents the mean of the estimated relation, i.e. \(\beta_0 + \beta_1 x\).
To estimate the parameters \(\beta_0, \beta_1,\sigma^2\) that govern this relationship, we use historical data on the characteristics and the dependent variable, \((x_1,y_1),\ldots, (x_n,y_n)\). We can apply the least squares principle to find the \(\beta\)’s and then estimate \(\sigma^2\) empirically from the residuals by calculating \[\hat \sigma^2 = s_e^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2.\]
We compute the values of \(\beta_0\) and \(\beta_1\) using sample data. Thus, the estimated values will not be equal to the true ones.
Figure: comparison of the true and estimated lines.
To get a feeling for the amount of sampling variability, imagine that we have two sample data sets. For example, we have housing data from two different realtor firms. Do you think the estimated value of \(\beta_1\) will be the same for both? The answer is no. Let’s demonstrate with an example (see the sketch below): we simulate eight data sets from the same distribution and estimate eight different linear models.
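Here is a minimal simulation sketch of this idea; the true intercept, slope, and noise level below are made-up values for illustration only:

```r
set.seed(42)
beta0 = 10; beta1 = 100; sigma = 70        # assumed "true" parameters
x = runif(100, 1, 4)                        # living area in 1000 sq. ft.
slopes = rep(NA, 8)
for (k in 1:8) {
  y = beta0 + beta1*x + rnorm(length(x), 0, sigma)  # new sample, same x's
  slopes[k] = coef(lm(y ~ x))[2]
}
round(slopes)  # eight different estimates of the same slope
```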
The value of \(\beta_1\) will differ from sample to sample. In other words, it will be a random variable. The sampling distribution of \(\beta_1\) describes how it varies over different samples with the \(x\) values fixed. Statistical view of linear regression allows us to calculate confidence and prediction intervals for estimated parameters. It turns out that when least squares principle is used, the estimated \(\hat\beta_1\) is normally distributed: \(\hat\beta_1 \sim N( \beta_1 , s_1^2 )\). Again, empirically if we plot coefficients for 10000 different models, we can see that the empirical distribution resembles a normal distribution.
Figure: histogram of parameters estimated from different samples from the same distribution.
Further, \(\hat\beta_1\) is unbiased, i.e. \[\E{\hat\beta_1} = \beta_1,\] the expectation of our estimate equals the true value of the unobserved parameter \(\beta_1\). Further, \(s_1\) is the standard error of \(\hat\beta_1\), and the t-statistic is \(t = \hat\beta_1 / s_1\). Three factors, the sample size (\(n\)), the error variance (\(s^2\)), and the \(x\)-spread (\(s_x\)), determine the size of the standard error of \(\beta_1\): \[s_1^2 = \frac{s^2}{\sum_{i=1}^n ( x_i - \bar{x} )^2 } = \frac{ s^2 }{ (n-1) s_x^2 } \]
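In R these quantities are part of the summary output; a short sketch for the house-price fit l from above:

```r
# Coefficient table: estimates, standard errors, t-statistics, p-values
summary(l)$coefficients
# Standard error of the slope alone
summary(l)$coefficients["livingArea", "Std. Error"]
```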
Intuitively, the model captures the regression-to-the-mean effect. This is one of the most interesting statistical effects you’ll see in daily life. Here are a few examples:
If your parents are taller than average, you will tend to regress back toward the average.
Airline pilot training. Good landings are followed by average ones.
A negative correlation between first and second round scores in golf.
Further, we always have to account for random noise in day-to-day decisions; this is why we have the error term \(\epsilon\) in our model.
The underlying assumptions of the statistical linear regression model are:
Given the value of \(x\), the \(y\) values are normally distributed
The means of these normal distributions of \(y\) values all lie on the straight line of regression
The standard deviations of these normal distributions are equal
The \(y\) values are statistically independent
Consider another example that demonstrates these effects.
Again, consider the house example. Say in our data we have 10 houses with the same square footage, say 2000. The first point states that the prices of those houses should follow a normal distribution. In other words, prices of all of the houses of size 2000 follow a normal distribution.
The second point merely states that the mean of \(y\) is a linear function of \(x\). The third point states that if we compare prices of 2000 sqft houses and 2500 sqft houses, they will have the same standard deviation. The fourth point means that the price of one house does not depend on the price of another house.
All of the assumptions in the regression model can be written using probabilistic notation as follows: \[y \mid x \overset{iid}{\sim} N(\beta_0 +\beta_1 x, \sigma^2).\] Here, \(\beta_1\) measures the effect on \(y\) of increasing \(x\) by one unit and \(\beta_0\) measures the level of \(y\) when \(x=0\). Sometimes, especially in finance applications, we use the notation \(\alpha = \beta_0\) and \(\beta = \beta_1\).
Another advantage of adopting a statistical view of the linear regression model is the ability to quantify information about potential outliers. Outliers are points that are extreme relative to our model predictions. Recall that the residual is \(e_i = y_i- \hat y_i\). Since the model errors are normally distributed, the residuals are approximately normal as well. It is easy to see that \[e_i \sim N(0, s_e^2),\] where \(s_e^2\) is the empirical estimate of the error variance.
Consider the relation between the fitted values \(\hat y_i\) and residuals \(e_i\). Our predictions are given by the line. The residual \(e_i\) and predicted value \(\hat y_i\) for the \(i\)th observation are related via \[y_i = \hat{y}_i + ( y_i - \hat{y}_i ) = \hat{y}_i + e_i\].
Residuals allow us to define outliers: outliers simply have large residuals. We re-scale the residuals by their standard error, which lets us define \[r_i = \frac{ e_i }{ s_{ e} } =\frac{y_i - \hat{y}_i }{ s_{ e } }\] Since the residuals follow a normal distribution \(e \sim N(0,\sigma^2)\), about 95% of the time we expect the standardized residuals to satisfy \(- 2 < r_i < 2\). Any observation with \(|r_i| > 3\) is an extreme outlier: it is three sigmas away from the mean.
Another type of observation we are interested in is the influential point. Influential points are observations that affect the magnitude of our estimated \(\hat{\beta}\)’s. They are important to find as they typically have economic consequences. We will use Cook’s distance to assess how influential a point is. The Cook’s distance associated with sample \(i\) measures the change in the estimated model parameters \(\hat \beta\) when sample \(i\) is removed from the training data set. In R code, we can use
datanew = data[-i,] # Deletes ith row
to remove the \(i\)th sample.
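R also has built-in helpers for both diagnostics; a minimal sketch using the house-price model l fitted earlier:

```r
# Standardized residuals: flag observations more than 3 sd from zero
r = rstandard(l)
which(abs(r) > 3)

# Cook's distance: influence of each observation on the fitted coefficients
cd = cooks.distance(l)
head(sort(cd, decreasing = TRUE))  # most influential houses
```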
4.0.2 Predictive Uncertainty
Another advantage of using a statistical view is that we can assess how much error there could be in our best prediction. Suppose we would like to predict the value of the output variable for a new observation \(x_f\); then we calculate \[ \hat{y}_f = \hat \beta_0 + \hat \beta_1 x_f + e_f, \; {\rm where} \; e_f \sim N ( 0 , s^2 )\]
Recall that besides \(\epsilon\) being random, our estimates \(\hat \beta_0\) and \(\hat \beta_1\) are also random. Thus, all three terms in the linear equation have errors.
After we account for all the uncertainty,
\[\Var{y_f} = s^2 \left ( 1 + \frac{1}{n} + \frac{ (x_f - \bar{x} )^2 }{ (n-1) s_x^2 } \right )\]
The value of \(\Var{y_f}\) can be used to calculate a prediction interval. For example we know that 95% prediction interval is given by \[\hat \beta_0 + \hat \beta_1 x_f \pm 1.96 \sqrt{\Var{y_f} }\] Prediction intervals tell us where we can expect to see the next data point sampled.
Similarly, a confidence interval tells us how well we have determined the mean, i.e. it gives a range for \(\E{y_f\mid x_f}\); note that the prediction interval gives a range for \(y_f\) itself. The 95% confidence interval is \[\hat \beta_0 + \hat \beta_1 x_f \pm 1.96\, s \sqrt{ \frac{1}{n} + \frac{ (x_f - \bar{x} )^2 }{ (n-1) s_x^2 } }.\]
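Both intervals are available from predict in R; a sketch for a hypothetical new house of 2.2 thousand sq. ft., again using the fit l from above:

```r
newx = data.frame(livingArea = 2.2)
predict(l, newx, interval = "prediction", level = 0.95)  # where the next y may fall
predict(l, newx, interval = "confidence", level = 0.95)  # range for the mean E[y | x]
```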
A large predictive error variance (high uncertainty) comes from four factors
Large \(s\) (i.e. large errors, \(\epsilon\)’s)
Small \(n\) (not enough data)
Small \(s_x\) (not enough spread in the covariates)
Large difference between \(x_f\) and \(\bar{x}\) (predicting extremes)
4.0.3 Input Transformations
A linear model assumes that the output variable is proportional to the input variable plus an offset. However, this is rarely the case. Often, we need to transform the input variables, by combining multiple inputs into a single predictor (for example by taking a ratio) or by putting inputs on a different scale, e.g. a log-scale. In machine learning, this process is called feature engineering.
4.0.3.1 Factor Regression
One of the classic examples of feature engineering is the Fama-French three-factor model, which is used in asset pricing and portfolio management. The model states that asset returns depend on (1) market risk, (2) the outperformance of small versus big companies, and (3) the outperformance of high book/market versus low book/market companies; mathematically \[r = R_f + \beta(R_m - R_f) + b_s\cdot SMB + b_v\cdot HML + \alpha\] Here \(R_f\) is the risk-free return, \(R_m\) is the return of the market, \(SMB\) stands for "Small market capitalization Minus Big" and \(HML\) for "High book-to-market ratio Minus Low"; they measure the historic excess returns of small caps over big caps and of value stocks over growth stocks. These factors are calculated from combinations of portfolios composed of ranked stocks (BtM ranking, Cap ranking) and available historical market data.
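Estimating the model is then just a multiple regression. A sketch, assuming a data frame ff with hypothetical columns exret for the asset's excess return \(r - R_f\), mktrf for the market excess return, and smb and hml for the two factors:

```r
# Fama-French three-factor regression (data frame and column names are assumed)
fit = lm(exret ~ mktrf + smb + hml, data = ff)
coef(fit)  # intercept = alpha; slopes = market beta, b_s, b_v
```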
Consider an example where the relation is multiplicative. The figure below shows Johnson & Johnson’s (JNJ) opening stock price over time.
= read.csv("data/JNJ.csv")
jj $Date = lubridate::ymd(jj$Date)
jj= jj %>%arrange(Date)
jj $Date0 = 12*(lubridate::year(jj$Date)-lubridate::year(jj$Date[1])) +
jj::month(jj$Date)-lubridate::month(jj$Date[1]))
(lubridate= lm(jj$Open ~ jj$Date0)
model = lm(log(jj$Open) ~ jj$Date0)
modellog plot(jj$Date0,jj$Open, type='l', col="red", ylab="JNJ Open", xlab="Month")
abline(model, col="red")
par(new=T)
plot(log(jj$Open), type='l', xlab="", ylab="", xaxt='n', yaxt='n')
abline(modellog)
axis(side = 4)
legend("topleft", legend=c("Original", "log-Scale"), lty=1, col=c("red", "black"), lwd=3, bty="n", y.intersp = 1,cex=1)
summary(model)
#>
#> Call:
#> lm(formula = jj$Open ~ jj$Date0)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -24.06 -11.12 -2.04 8.35 51.99
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -23.01349 1.20367 -19.1 <2e-16 ***
#> jj$Date0 0.19313 0.00361 53.5 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 14 on 576 degrees of freedom
#> Multiple R-squared: 0.832, Adjusted R-squared: 0.832
#> F-statistic: 2.86e+03 on 1 and 576 DF, p-value: <2e-16
summary(modellog)
#>
#> Call:
#> lm(formula = log(jj$Open) ~ jj$Date0)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.9037 -0.3101 -0.0773 0.3307 0.7830
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -9.31e-02 3.24e-02 -2.87 0.0042 **
#> jj$Date0 9.14e-03 9.72e-05 94.10 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.39 on 576 degrees of freedom
#> Multiple R-squared: 0.939, Adjusted R-squared: 0.939
#> F-statistic: 8.86e+03 on 1 and 576 DF, p-value: <2e-16
Just by visual inspection we can see that a straight line will not be a good fit for this data. The relation between time and price is multiplicative. There are many business and natural science examples where a multiplicative relation holds. To model multiplicative or power relations we can still use the apparatus of linear models. However, we need to apply a log-transformation to the output and, in some cases, the input variables. Here \(\log\) is the natural \(\log_e\) with base \(e=2.718\ldots\), and recall that \(\log ( a b) = \log a + \log b\) and \(\log ( a^b ) = b \log a\).
Consider a multiplicative model relating input and output, \[y = A x^{\beta_1},\] where \(A = e^{\beta_0}\). The log-log transform (taking logs of both sides) leads to \[\log y = \log A + \beta_1 \log x = \beta_0 + \beta_1 \log x\]
The slope \(\beta_1\) can be interpreted as an elasticity, that is, the % change in \(y\) for a % change in \(x\). The variables are related on a multiplicative, or percentage, scale.
We can estimate this model by regressing \(\log y\) on \(\log x\). In R: model = lm(log(y) ~ log(x))
Another case is the exponential model. Any process that grows exponentially will lead to such a model, for example the growth of a population over time or compounding interest in a savings account.
Suppose that we have the equation \[y = A e^{\beta_1 x},\] where \(A = e^{\beta_0}\). This is equivalent to \(\log(y) = \beta_0 + \beta_1 x\).
Taking logs of the original equation gives \[\begin{aligned} \log y &= \log A + \beta_1 x\\ \log y & = \beta_0 + \beta_1 x\end{aligned}\]
Therefore, we can run a regression of \(\log y\) on \(x\); a short simulation sketch follows the list below. However, there is a caveat: not all variables can be logged. There are some constraints:
\(y>0\) needs to be positive.
Dummy variables \(x=0\) or \(1\) can’t be logged.
Counting variables are usually left alone as well.
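Here is the simulation sketch promised above; the values of \(A\) and \(\beta_1\) are made up, and the point is only that lm(log(y) ~ x) recovers them:

```r
set.seed(1)
x = seq(0, 10, length.out = 200)
y = 2 * exp(0.3 * x) * exp(rnorm(length(x), 0, 0.1))  # A = 2, beta1 = 0.3, multiplicative noise
fit = lm(log(y) ~ x)
coef(fit)  # intercept close to log(2) = 0.69, slope close to 0.3
```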
Example 4.2 (World’s Smartest Mammal) First of all, read in and attach our data ...
= read.csv("data/mammals.csv")
mammals attach(mammals)
head(mammals)
#> Mammal Brain Body
#> 1 African_elephant 5712.0 6654.00
#> 2 African_giant_pouched_rat 6.6 1.00
#> 3 Arctic_Fox 44.5 3.38
#> 4 Arctic_ground_squirrel 5.7 0.92
#> 5 Asian_elephant 4603.0 2547.00
#> 6 Baboon 179.5 10.55
tail(mammals)
#> Mammal Brain Body
#> 57 Tenrec 2.6 0.9
#> 58 Tree_hyrax 12.3 2.0
#> 59 Tree_shrew 2.5 0.1
#> 60 Vervet 58.0 4.2
#> 61 Water_opossum 3.9 3.5
#> 62 Yellow-bellied_marmot 17.0 4.0
### Linear model
model = lm(Brain~Body)
Residual Diagnostics
The residuals show that you need a transformation…
layout(matrix(1:2, ncol = 2))
plot(Body,Brain, pch=21, bg="lightblue", cex=2)
abline(model, col="red", lwd=3)
plot(Brain,residuals(model), bg="lightblue", pch=21, cex=2)
Residual Plots
par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, cex=2)
### Log-Log linear model
model = lm(log(Brain)~log(Body))
summary(model)
#>
#> Call:
#> lm(formula = log(Brain) ~ log(Body))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.7534 -0.5391 -0.0924 0.4207 2.6115
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.1833 0.1068 20.4 <2e-16 ***
#> log(Body) 0.7432 0.0317 23.5 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.77 on 60 degrees of freedom
#> Multiple R-squared: 0.902, Adjusted R-squared: 0.9
#> F-statistic: 551 on 1 and 60 DF, p-value: <2e-16
par(mar=c(4.5,4.5,3,3))
layout(matrix(1:2, ncol = 2))
plot(log(Brain)~log(Body), pch=21, bg="lightblue", cex=2)
abline(model, col="red", lwd=3)
plot(log(Brain),residuals(model), bg="lightblue", pch=21, cex=2)
That’s better!
\(4\) in \(1\) Residuals: log-log model
par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, cex=2)
log-log Model
\[ {\rm log(Brain)} = 2.18 + 0.74 \; {\rm log(Body)} \; . \] The coefficients are highly significant, with \(R^2 = 90\)%.
Outliers
res = rstudent(model)
outliers = order(res, decreasing = T)[1:10]
cbind(mammals[outliers,],
      residual = res[outliers],
      Fit = exp(model$fitted.values[outliers]))
#> Mammal Brain Body residual Fit
#> 11 Chinchilla 64 0.42 3.78 4.7
#> 34 Man 1320 62.00 2.67 190.7
#> 50 Rhesus_monkey 179 6.80 2.12 36.9
#> 6 Baboon 180 10.55 1.67 51.1
#> 42 Owl_monkey 16 0.48 1.46 5.1
#> 10 Chimpanzee 440 52.16 1.27 167.7
#> 27 Ground_squirrel 4 0.10 1.20 1.6
#> 43 Patas_monkey 115 10.00 1.11 49.1
#> 60 Vervet 58 4.19 1.06 25.7
#> 3 Arctic_Fox 44 3.38 0.92 22.0
There is an extreme outlier with a residual value of 3.78.
It corresponds to the Chinchilla.
This suggests that the Chinchilla is a master race of supreme intelligence!
Inference
NO!!! I checked and there was a data entry error.
The brain weight is given as 64 grams and should only be 6.4 grams.
The next largest residual corresponds to mankind
In this example the log-log transformation used seems to achieve two important goals, namely linearity and constant variance.
4.1 Multiple Regression
Many problems involve more than one independent (explanatory) variable or factor which affects the dependent or response variable.
Multi-factor asset pricing models (APT): stock returns, book-to-market ratios, interest rates.
Demand for a product given prices of competing brands, advertising, and household attributes (to formulate pricing strategies).
Internet analytics: What do I like? Suggestions instead of search! Alexa: “book my Xmas vacation,” “buy my best friend a birthday present.”
4.1.1 Regression Model
\(Y\) = response or outcome variable; \(X_1, \ldots , X_p\) = explanatory or input variables.
The general relationship is given by \[Y = f( X_1, \ldots , X_p ) + \epsilon\] And a linear relationship is written \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon\]
MLR Assumptions
The Multiple Linear Regression (MLR) model \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon\] assumptions follow those of simple linear regression:
The conditional mean of \(Y\) is linear in the \(X_j\) variables
The errors are normal \(N( 0 , \sigma^2 )\). We write \[Y \mid X_1 , \ldots , X_p \sim N \left ( \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p , \sigma^2 \right )\]
Statistical versus Economic Significance
When looking at the \(\beta\) coefficients there are two issues
Statistical Significance: The \(t\)-ratios of the \(\beta\)’s
Economic Significance: the magnitudes of the \(\beta\)’s. If \(X_i\) increases by one unit, holding the other \(X\)’s constant, then \(Y\) changes by \(\beta_i\) units. These are called marginal effects.
At the end of the day use your judgment!
Model Diagnostics
plot(model) provides diagnostics before model building!
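A minimal sketch of the standard diagnostic plots for any fitted lm object, shown in a 2-by-2 layout:

```r
# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
```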
There are many possible caveats
Running simple regressions gives you the wrong answer!
Multiple regression takes into account the correlations among the factors. It does all the work for you.
A variable might be insignificant once we have incorporated a more important predictor variable.
A common sense approach usually works well. If a variable never seems to be significant it typically isn’t.
Model Prediction is the great equalizer!!
Example 4.3 (Newfood) Goal of Experiment
A six month market test has been performed on the Newfood product.
A breakfast cereal.
Build a multiple regression model that gives us good sales forecasts.
This dataset is the outcome of a controlled experiment in which the values of the independent variables which affect sales are chosen by the analyst.
We analyze the factors which contribute to sales of a new breakfast cereal, and quantify the effects of business decisions such as the choice of advertising level, location in store, and pricing.
variable | description |
---|---|
sales | new cereal sales |
price | price |
adv | low or high advertising (\(0\) or \(1\)) |
locat | bread or breakfast section (\(0\) or \(1\)) |
inc | neighborhood income |
svol | size of store |
What happens when you regress sales on price, adv, locat? Run the “kitchen-sink” regression. Perform diagnostic checks.
Which variables should we transform?
Run the new model. Perform diagnostics and variable selection.
What’s the largest Cook’s distance?
Provide a summary of coefficients and statistical significance.
Predict sales when \({\tt price} = 30, {\tt adv = 1} , {\tt income = 8}\) and \({\tt svol} = 34\). What happens when you predict at the median values of the characteristics?
First we examine the correlation matrix:
= read.csv("data/newfood.csv")
newfood attach(newfood)
names(newfood)
#> [1] "sales" "price" "adv" "locat" "income" "svol" "city" "indx"
head(newfood)
#> sales price adv locat income svol city indx
#> 1 225 24 0 0 7.3 34 3 1
#> 2 190 24 0 0 7.3 34 3 2
#> 3 205 24 0 0 7.3 34 3 3
#> 4 323 24 0 0 8.3 41 4 1
#> 5 210 24 0 0 8.3 41 4 2
#> 6 241 24 0 0 8.3 41 4 3
tail(newfood)
#> sales price adv locat income svol city indx
#> 67 180 34 1 1 6.6 29 1 1
#> 68 100 34 1 1 6.6 29 1 2
#> 69 75 34 1 1 6.6 29 1 3
#> 70 150 34 1 1 6.1 24 2 1
#> 71 122 34 1 1 6.1 24 2 2
#> 72 101 34 1 1 6.1 24 2 3
# correlation matrix
cm = cor(cbind(sales,price,adv,locat,income,svol))
cm[upper.tri(cm, diag = TRUE)] = NA
print(as.table(round(cm, 3)))
#> sales price adv locat income svol
#> sales
#> price -0.658
#> adv 0.001 0.000
#> locat -0.001 0.000 0.000
#> income 0.163 -0.131 -0.746 0.000
#> svol 0.375 -0.179 -0.742 -0.040 0.809
Remember: correlations are not \(\beta\)’s!!
Total sales volume is negatively correlated to advertising.
Income is negatively correlated with advertising as well.
How is the negative correlation apt to affect the advertising effects?
print(as.table(round(cm[2:4, 1:3], 3)))
#> sales price adv
#> price -0.658
#> adv 0.001 0.000
#> locat -0.001 0.000 0.000
There’s no correlation in the \(X\)’s by design!
Let’s start by only including price, adv, locat
model = lm(sales~price+adv+locat)
summary(model)
#>
#> Call:
#> lm(formula = sales ~ price + adv + locat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -97.81 -44.14 -9.11 23.62 237.19
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 562.312 53.139 10.58 5.1e-16 ***
#> price -12.812 1.780 -7.20 6.3e-10 ***
#> adv 0.222 14.535 0.02 0.99
#> locat -0.222 14.535 -0.02 0.99
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 62 on 68 degrees of freedom
#> Multiple R-squared: 0.432, Adjusted R-squared: 0.407
#> F-statistic: 17.3 on 3 and 68 DF, p-value: 1.94e-08
Why is the marketer likely to be upset by this regression?!
Why is the economist happy?
Let’s add income and svol to the regression!
4.1.2 Transformation
Power model: transform with log-log
model = lm(log(sales)~log(price)+adv+locat+log(income)+log(svol))
summary(model)
#>
#> Call:
#> lm(formula = log(sales) ~ log(price) + adv + locat + log(income) +
#> log(svol))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.5757 -0.1611 -0.0362 0.1928 0.5379
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.4070 1.3873 6.06 7.3e-08 ***
#> log(price) -1.7430 0.2207 -7.90 4.0e-11 ***
#> adv 0.1496 0.1005 1.49 0.14131
#> locat 0.0010 0.0609 0.02 0.98688
#> log(income) -0.5241 0.4958 -1.06 0.29439
#> log(svol) 1.0308 0.2553 4.04 0.00014 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.26 on 66 degrees of freedom
#> Multiple R-squared: 0.608, Adjusted R-squared: 0.578
#> F-statistic: 20.5 on 5 and 66 DF, p-value: 2.75e-12
Why no logs for the adv and locat variables?
The log(svol) coefficient is close to one!
\(R^2= 60\)%
On the transformed scale,
\[\log sales=8.41 - 1.74 \log price + 0.150 {\text{adv}} + 0.001 {\text{locat}} - 0.524 \log inc + 1.03 \log svol\]
On the un-transformed scale,
\[\text{sales} = e^{8.41} ( \text{price} )^{-1.74} e^{ 0. 15 \text{adv} } e^{ 0.001 \text{locat}} ( \text{inc} )^{-0.524} ( \text{svol} )^{1.03}\]
The relations of sales to price, income, and svol are power laws; the relations of sales to adv and locat are exponential.
4.1.3 Interpretation
Interpret your regression model as follows
Price elasticity is \(\hat{\beta}_{\text{price}} = - 1.74\). A \(1\)% increase in price will drop sales \(1.74\)%
\({\tt adv}=1\) increases sales by a factor of \(e^{0.15} = 1.16\). That’s a \(16\)% improvement.
Variable selection: delete locat as it is statistically insignificant.
4.1.4 Prediction
predict.lm provides a \(\hat{Y}\)-prediction given a new \(X_f\)
modelnew = lm(log(sales)~log(price)+adv+log(income)+log(svol))
summary(modelnew)
#>
#> Call:
#> lm(formula = log(sales) ~ log(price) + adv + log(income) + log(svol))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.5752 -0.1608 -0.0367 0.1923 0.5374
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.4086 1.3734 6.12 5.4e-08 ***
#> log(price) -1.7431 0.2190 -7.96 2.8e-11 ***
#> adv 0.1495 0.0996 1.50 0.13811
#> log(income) -0.5236 0.4913 -1.07 0.29037
#> log(svol) 1.0303 0.2515 4.10 0.00012 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.25 on 67 degrees of freedom
#> Multiple R-squared: 0.608, Adjusted R-squared: 0.585
#> F-statistic: 26 on 4 and 67 DF, p-value: 5.01e-13
newdata = data.frame(price=30, adv=1, income=8, svol=34)
predict.lm(modelnew, newdata, se.fit=T, interval="confidence", level=0.99)
#> $fit
#> fit lwr upr
#> 1 5.2 4.9 5.5
#>
#> $se.fit
#> [1] 0.1
#>
#> $df
#> [1] 67
#>
#> $residual.scale
#> [1] 0.25
Exponentiate back to find \(\text{sales} = e^{5.1739} = 176.60\).
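In code, a sketch: exponentiate the fit (and, if desired, the interval endpoints) to return to the sales scale:

```r
# Back-transform the log-scale prediction to the original sales scale
pred = predict(modelnew, newdata, interval = "confidence", level = 0.99)
exp(pred)  # fitted sales of roughly 177, with lower and upper bounds
```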
4.2 Interactions
Does gender change the effect of education on wages?
Do patients recover faster when taking drug A?
How does advertisement affect price sensitivity?
Interactions are useful. Particularly with dummy variables.
We build a kitchen-sink model with all possible dummies (day of the week, gender,...)
4.2.1 Models with Interactions
In many situations, \(X_1\) and \(X_2\) interact when predicting \(Y\)
Interaction Model: run the regression \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon\]
In R:
model = lm(y ~ x1 * x2)
gives \(X_1+X_2+X_1X_2\), and
model = lm(y ~ x1:x2)
gives only \(X_1 X_2\).
The coefficients \(\beta_1\) and \(\beta_2\) are marginal effects.
If \(\beta_3\) is significant there’s an interaction effect.
We leave \(\beta_1\) and \(\beta_2\) in the model whether they are significant or not.
4.2.2 Models with Interactions and Dummies
\(X_1\) and \(D\) dummy
\(X_2 = D\) is a dummy variable with values of zero or one.
Model: typically we run a regression of the form \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 \star D + \epsilon\]
The coefficient \(\beta_1 + \beta_2\) is the effect of \(X_1\) when \(D=1\). The coefficient \(\beta_1\) is the effect when \(D=0\).
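A sketch in R with hypothetical variables y, x1, and a 0/1 dummy d in a data frame df:

```r
# Slope on x1 is beta1 when d = 0 and beta1 + beta2 when d = 1
model = lm(y ~ x1 + x1:d, data = df)
coef(model)
```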
Example (Orange Juice)
```r
oj = read.csv("./data/oj.csv")
head(oj)
#> store brand week logmove feat price AGE60 EDUC ETHNIC INCOME HHLARGE
#> 1 2 tropicana 40 9.0 0 3.9 0.23 0.25 0.11 11 0.1
#> 2 2 tropicana 46 8.7 0 3.9 0.23 0.25 0.11 11 0.1
#> 3 2 tropicana 47 8.3 0 3.9 0.23 0.25 0.11 11 0.1
#> 4 2 tropicana 48 9.0 0 3.9 0.23 0.25 0.11 11 0.1
#> 5 2 tropicana 50 9.1 0 3.9 0.23 0.25 0.11 11 0.1
#> 6 2 tropicana 51 8.9 0 3.9 0.23 0.25 0.11 11 0.1
#> WORKWOM HVAL150 SSTRDIST SSTRVOL CPDIST5 CPWVOL5
#> 1 0.3 0.46 2.1 1.1 1.9 0.38
#> 2 0.3 0.46 2.1 1.1 1.9 0.38
#> 3 0.3 0.46 2.1 1.1 1.9 0.38
#> 4 0.3 0.46 2.1 1.1 1.9 0.38
#> 5 0.3 0.46 2.1 1.1 1.9 0.38
#> 6 0.3 0.46 2.1 1.1 1.9 0.38
```
model = lm(logmove ~ log(price)*feat, data=oj)
summary(model)
#>
#> Call:
#> lm(formula = logmove ~ log(price) * feat, data = oj)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.299 -0.471 0.020 0.495 3.077
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.6593 0.0164 588.3 <2e-16 ***
#> log(price) -0.9582 0.0187 -51.3 <2e-16 ***
#> feat 1.7144 0.0304 56.4 <2e-16 ***
#> log(price):feat -0.9773 0.0419 -23.3 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.79 on 28943 degrees of freedom
#> Multiple R-squared: 0.398, Adjusted R-squared: 0.398
#> F-statistic: 6.37e+03 on 3 and 28943 DF, p-value: <2e-16
model = lm(log(price)~ brand-1, data = oj)
summary(model)
#>
#> Call:
#> lm(formula = log(price) ~ brand - 1, data = oj)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.1809 -0.1026 -0.0022 0.1306 0.4626
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> branddominicks 0.52697 0.00207 254 <2e-16 ***
#> brandminute.maid 0.79069 0.00207 382 <2e-16 ***
#> brandtropicana 1.03460 0.00207 500 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2 on 28944 degrees of freedom
#> Multiple R-squared: 0.941, Adjusted R-squared: 0.941
#> F-statistic: 1.53e+05 on 3 and 28944 DF, p-value: <2e-16
levels(oj$brand)
#> NULL
brandcol <- c("green","red","gold")
par(mfrow=c(1,2))
boxplot(log(price) ~ brand, data=oj, col=brandcol)
plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand])
83 Chicagoland Stores (Demographic info for each)
Price, sales (log units moved), and whether advertised (feat)
Orange Juice: Price vs Sales
plot(exp(logmove) ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="move")
Orange Juice: Price vs log(Sales)
plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")
Orange Juice: Price vs log(Sales)
l1 <- loess(logmove ~ price, data=oj, span=2)
smoothed1 <- predict(l1)
ind = order(oj$price)
plot(logmove ~ price, data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")
lines(smoothed1[ind], x=oj$price[ind], col="blue", lwd=2)
Orange Juice: log(Price) vs log(Sales)
plot(logmove ~ log(price), data=oj, col=brandcol[oj$brand], pch=16, cex=0.5, ylab="log(move)")
l2 <- lm(logmove ~ log(price), data=oj)
smoothed2 <- predict(l2)
ind = order(oj$price)
lines(smoothed2[ind], x=log(oj$price[ind]), col="blue", lwd=2)
Why? Multiplicative (rather than additive) change.
Now we are interested in the question: how does advertisement affect price sensitivity?
Original model \[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_2 \mathrm{feat}\]
If we feature the brand (in-store display promo or flyer ad), does it affect price sensitivity \(\beta_1\)? If we assume it does \[\beta_1 = \beta_3 + \beta_4\mathrm{feat}\] The new model is \[\log(\mathrm{sales}) = \beta_0 + (\beta_3 + \beta_4\mathrm{feat})\log(\mathrm{price}) + \beta_2 \mathrm{feat}\] After expanding \[\log(\mathrm{sales}) = \beta_0 + \beta_3\log(\mathrm{price}) + \beta_4\mathrm{feat}*\log(\mathrm{price}) + \beta_2 \mathrm{feat}\]
## and finally, consider 3-way interactions
ojreg <- lm(logmove ~ log(price)*feat, data=oj)
summary(ojreg)
#>
#> Call:
#> lm(formula = logmove ~ log(price) * feat, data = oj)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.299 -0.471 0.020 0.495 3.077
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.6593 0.0164 588.3 <2e-16 ***
#> log(price) -0.9582 0.0187 -51.3 <2e-16 ***
#> feat 1.7144 0.0304 56.4 <2e-16 ***
#> log(price):feat -0.9773 0.0419 -23.3 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.79 on 28943 degrees of freedom
#> Multiple R-squared: 0.398, Adjusted R-squared: 0.398
#> F-statistic: 6.37e+03 on 3 and 28943 DF, p-value: <2e-16
print(lm(logmove ~ log(price), data=oj))
#>
#> Call:
#> lm(formula = logmove ~ log(price), data = oj)
#>
#> Coefficients:
#> (Intercept) log(price)
#> 10.4 -1.6
print(lm(logmove ~ log(price)+feat + brand, data=oj))
#>
#> Call:
#> lm(formula = logmove ~ log(price) + feat + brand, data = oj)
#>
#> Coefficients:
#> (Intercept) log(price) feat brandminute.maid
#> 10.279 -2.530 0.891 0.682
#> brandtropicana
#> 1.302
print(lm(logmove ~ log(price)*feat, data=oj))
#>
#> Call:
#> lm(formula = logmove ~ log(price) * feat, data = oj)
#>
#> Coefficients:
#> (Intercept) log(price) feat log(price):feat
#> 9.659 -0.958 1.714 -0.977
print(lm(logmove ~ brand-1, data=oj))
#>
#> Call:
#> lm(formula = logmove ~ brand - 1, data = oj)
#>
#> Coefficients:
#> branddominicks brandminute.maid brandtropicana
#> 9.17 9.22 9.11
Advertisement increases price sensitivity (in absolute value) from \(-0.96\) to \(-0.96 - 0.98 = -1.94\)!
Why?
One of the reasons is that the price was lowered during the Ad campaign!
library(dplyr)
doj = oj %>% filter(brand=="dominicks")
par(mfrow=c(1,3), mar=c(4.2,4.6,2,1))
boxplot(price ~ feat, data = oj[oj$brand=="dominicks",], col=c(2,3), main="dominicks", ylab="Price ($)")
boxplot(price ~ feat, data = oj[oj$brand=="minute.maid",], col=c(2,3), main="minute.maid")
boxplot(price ~ feat, data = oj[oj$brand=="tropicana",], col=c(2,3), main="tropicana")
0 = not featured, 1 = featured
4.3 Dummies
We want to understand the effect of the brand on sales \[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \xcancel{\beta_2\mathrm{brand}}\]
But brand is not a number!
How can you use it in your regression equation?
We introduce dummy variables
Brand | Intercept | brandminute.maid | brandtropicana |
---|---|---|---|
minute.maid | 1 | 1 | 0 |
tropicana | 1 | 0 | 1 |
dominicks | 1 | 0 | 0 |
\[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_{21}\mathrm{brandminute.maid} + \beta_{22}\mathrm{brandtropicana}\]
R will automatically do it for you
print(lm(logmove ~ log(price)+brand, data=oj))
#>
#> Call:
#> lm(formula = logmove ~ log(price) + brand, data = oj)
#>
#> Coefficients:
#> (Intercept) log(price) brandminute.maid brandtropicana
#> 10.83 -3.14 0.87 1.53
\[\log(\mathrm{sales}) = \beta_0 + \beta_1\log(\mathrm{price}) + \beta_3\mathrm{brandminute.maid} + \beta_4\mathrm{brandtropicana}\]
\(\beta_3\) and \(\beta_4\) are "change relative to reference" (dominicks here).
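You can inspect the dummy coding that R builds behind the scenes with model.matrix; a short sketch:

```r
# First rows of the design matrix: dominicks is absorbed into the intercept
head(model.matrix(~ log(price) + brand, data = oj))
```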
How does brand affect price sensitivity?
Interactions: logmove ~ log(price) * brand
No Interactions: logmove ~ log(price) + brand
Parameter | Interactions | No Interactions |
---|---|---|
(Intercept) | 10.95 | 10.8288 |
log(price) | -3.37 | -3.1387 |
brandminute.maid | 0.89 | 0.8702 |
brandtropicana | 0.96239 | 1.5299 |
log(price):brandminute.maid | 0.057 | |
log(price):brandtropicana | 0.67 | |
Example 4.4 (Golf Performance Data) Dave Pelz has written two best-selling books for golfers, Dave Pelz’s Short Game Bible, and Dave Pelz’s Putting Bible.
Dave Pelz was formerly a “rocket scientist” (literally). Data analytics helped him refine his analysis. It’s the short game that matters!
The optimal speed for a putt: the best chance to make the putt is a speed that would leave the ball \(17\) inches past the hole if it misses.
Golf Data
Year-end performance data on 195 players from the 2000 PGA Tour.
nevents, the number of official PGA events included in the statistics
money, the official dollar winnings of the player
drivedist, the average number of yards driven on par 4 and par 5 holes
gir, greens in regulation, measured as the percentage of time that the first (tee) shot on a par 3 hole ends up on the green, or the second shot on a par 4 hole ends up on the green, or the third shot on a par 5 hole ends up on the green
avgputts, the average number of putts per round.
Analyze these data to see which of nevents, drivedist, gir, avgputts is most important for winning money.
Regression of Money on all explanatory variables:
source("src/utils.R")
library(tidyverse)
= read_csv("data/pga-2000.csv")
d00 = read_csv("data/pga-2018.csv") d18
= lm(money ~ nevents + drivedist + gir + avgputts, data=d18)
model18 = lm(money ~ nevents + drivedist + gir + avgputts, data=d00)
model00 summary(model00)
#>
#> Call:
#> lm(formula = money ~ nevents + drivedist + gir + avgputts, data = d00)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1158735 -364118 -62913 285081 5604080
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 14856638 4206466 3.53 0.00052 ***
#> nevents -30066 11183 -2.69 0.00781 **
#> drivedist 21310 6913 3.08 0.00236 **
#> gir 120855 17429 6.93 6.2e-11 ***
#> avgputts -15203045 2000905 -7.60 1.3e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 672000 on 190 degrees of freedom
#> Multiple R-squared: 0.493, Adjusted R-squared: 0.483
#> F-statistic: 46.3 on 4 and 190 DF, p-value: <2e-16
\(R^2 = 50\)%
Residuals
hist(rstandard(model00), breaks=20, col="lightblue", xlab = "Standardized Residual", main="")
arrows(x0 = 7.5,y0 = 20,x1 = 8.5,y1 = 2,length = 0.1)
text(x = 7,y = 22,labels = "Tiger Woods", cex=1.5)
Regression
Transform with log(Money), as it has much better residual diagnostic plots.
lm(formula = log(money) ~ nevents + drivedist + gir + avgputts, data = d00)
#>
#> Call:
#> lm(formula = log(money) ~ nevents + drivedist + gir + avgputts,
#> data = d00)
#>
#> Coefficients:
#> (Intercept) nevents drivedist gir avgputts
#> 36.14923 -0.00899 0.01409 0.16567 -21.12875
\(R^2 = 67\)%. There’s still \(33\)% of variation to go
Residuals for log(Money)
par(mar = c(4,4.5,0,0),mfrow=c(1,1))
model00log = lm(log(money) ~ nevents + drivedist + gir + avgputts, data=d00)
summary(model00log)
#>
#> Call:
#> lm(formula = log(money) ~ nevents + drivedist + gir + avgputts,
#> data = d00)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.5438 -0.3655 -0.0058 0.3968 1.9262
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 36.14923 3.57763 10.10 <2e-16 ***
#> nevents -0.00899 0.00951 -0.94 0.346
#> drivedist 0.01409 0.00588 2.40 0.018 *
#> gir 0.16567 0.01482 11.18 <2e-16 ***
#> avgputts -21.12875 1.70178 -12.42 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.57 on 190 degrees of freedom
#> Multiple R-squared: 0.671, Adjusted R-squared: 0.664
#> F-statistic: 96.9 on 4 and 190 DF, p-value: <2e-16
hist(rstandard(model00log), breaks=20, col="lightblue", xlab = "Standardized Residual", main="")
arrows(x0 = 3,y0 = 20,x1 = 3.2,y1 = 2,length = 0.1)
text(x = 3,y = 22,labels = "Tiger Woods", cex=1.5)
Regression: Variable selection
The \(t\)-stat for nevents is \(<1.5\), so we drop it.
lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00)
#>
#> Call:
#> lm(formula = log(money) ~ drivedist + gir + avgputts, data = d00)
#>
#> Coefficients:
#> (Intercept) drivedist gir avgputts
#> 36.1737 0.0146 0.1658 -21.3684
The fewer the putts the better golfer you are. Duh!
avgputts per round is hard to decrease by one!
Evaluating the Coefficients
Greens in Regulation (GIR) has \(\hat{\beta} = 0.17\). If I can increase my GIR by one, my winnings multiply by \(e^{0.17} = 1.18\): an extra \(18\)%.
drivedist has \(\hat{\beta} = 0.014\). With a \(10\) yard improvement, my winnings multiply by \(e^{0.014 \times 10} =e^{0.14} = 1.15\): an extra \(15\)%.
Caveat: Everyone has gotten better since 2000!
Main Findings
Tiger was \(9\) standard deviations better than the model.
Taking logs of Money helps the residuals! An exponential model seems to fit well. The residual diagnostics look good.
The \(t\)-ratio for nevents is under \(1.5\).
Over-Performers
Outliers: biggest over and under-performers in terms of money winnings, compared with the performance statistics.
Woods, Mickelson, and Els won major championships by playing well when big money prizes were available.
 | name | money | Predicted | Error |
---|---|---|---|---|
1 | Tiger Woods | 9188321 | 3584241 | 5604080 |
2 | Phil Mickelson | 4746457 | 2302171 | 2444286 |
3 | Ernie Els | 3469405 | 1633468 | 1835937 |
4 | Hal Sutton | 3061444 | 1445904 | 1615540 |
20 | Notah Begay III | 1819323 | 426061 | 1393262 |
182 | Steve Hart | 107949 | -1186685 | 1294634 |
Under-Performers
Underperformers are given by large negative residuals. Glasson and Stankowski should win more money.
 | name | money | Predicted | Error |
---|---|---|---|---|
47 | Fred Couples | 990215 | 1978477 | -988262 |
52 | Kenny Perry | 889381 | 1965740 | -1076359 |
70 | Paul Stankowski | 669709 | 1808690 | -1138981 |
85 | Bill Glasson | 552795 | 1711530 | -1158735 |
142 | Jim McGovern | 266647 | 1397818 | -1131171 |
Let’s look at the 2018 data. The highest earners are:
name | nevents | money | drivedist | gir | avgputts |
---|---|---|---|---|---|
Justin Thomas | 23 | 8694821 | 312 | 69 | 1.7 |
Dustin Johnson | 20 | 8457352 | 314 | 71 | 1.7 |
Justin Rose | 18 | 8130678 | 304 | 70 | 1.7 |
Bryson DeChambeau | 26 | 8094489 | 306 | 70 | 1.8 |
Brooks Koepka | 17 | 7094047 | 313 | 68 | 1.8 |
Bubba Watson | 24 | 5793748 | 313 | 68 | 1.8 |
Overperformers
 | name | money | Predicted | Error |
---|---|---|---|---|
1 | Justin Thomas | 8694821 | 5026220 | 3668601 |
2 | Dustin Johnson | 8457352 | 6126775 | 2330577 |
3 | Justin Rose | 8130678 | 4392812 | 3737866 |
4 | Bryson DeChambeau | 8094489 | 3250898 | 4843591 |
5 | Brooks Koepka | 7094047 | 4219781 | 2874266 |
6 | Bubba Watson | 5793748 | 3018004 | 2775744 |
9 | Webb Simpson | 5376417 | 2766988 | 2609429 |
11 | Francesco Molinari | 5065842 | 2634466 | 2431376 |
12 | Patrick Reed | 5006267 | 2038455 | 2967812 |
84 | Satoshi Kodaira | 1471462 | -1141085 | 2612547 |
Underperformers
 | name | money | Predicted | Error |
---|---|---|---|---|
102 | Trey Mullinax | 1184245 | 3250089 | -2065844 |
120 | J.T. Poston | 940661 | 3241369 | -2300708 |
135 | Tom Lovelady | 700783 | 2755854 | -2055071 |
148 | Michael Thompson | 563972 | 2512330 | -1948358 |
150 | Matt Jones | 538681 | 2487139 | -1948458 |
158 | Hunter Mahan | 457337 | 2855898 | -2398561 |
168 | Cameron Percy | 387612 | 3021278 | -2633666 |
173 | Ricky Barnes | 340591 | 3053262 | -2712671 |
176 | Brett Stegmaier | 305607 | 2432494 | -2126887 |
Findings
Here are three interesting effects:
Tiger Woods is \(8\) standard deviations better!
Increasing driving distance by \(10\) yards makes you \(15\)% more money
Increasing GIR by one makes you \(18\)% more money.
Detect Under- and Over-Performers
Go Play!!
Regression
Input and Plot Data: in R, the plot and summary commands.
“Kitchen-Sink” Regression: the lm command with all variables.
Residual Diagnostics and plot(model): fitted values and standardized residuals; outliers and influence.
Transformation? Correct the 4-in-1 plots and assumptions.
4.3.1 Regression Strategy
Variable Selection: \(t\)-stats and \(p\)-values from summary(model).
Final Regression: re-run the model and interpret the coefficients from summary(model). Economic and statistical significance.
Prediction: predict.lm. Out-of-sample forecasting. A model is only as good as its predictions!!
4.3.2 Predictive Analytics: General Introduction
Predictive Analytics is the most widely used tool for high dimensional input-output analysis \[{Y = F(X) \; \; {\rm where} \; \; X=(X_1, \ldots , X_p )}\]
Consumer Demand (Amazon, Airbnb, ... )
Maps (Bing, Uber)
Pricing
Healthcare
The applications are endless…
Example 4.5 (Target) Target and other retailers use predictive analytics to study consumer purchasing behaviour to see what type of coupons or promotions you might like
Here’s a famous story about a father and his daughter: Target predicted that the daughter was pregnant from her purchasing behaviour, long before the family was buying diapers.
Here’s the original link ...
Getting a customer to a routine is the key
MIT experiment: a T-shaped maze with chocolate at the end, behind a barrier that opens after a loud click.
While each animal wandered through the maze, its brain was working furiously
As the scientists repeated the experiment, again and again, the rats eventually stopped sniffing corners and making wrong turns and began to zip through the maze with more and more speed
As each rat learned how to complete the maze more quickly, its mental activity decreased
Learning routines from data is the basis for modern marketing.
A habit is a three-step loop: a cue or trigger (go into automatic mode), then the routine, then the reward.
Febreze: the original ads were targeting the wrong routine (kill the smell) and produced no sales. Then the ad said: use Febreze after cleaning each room. Now it is one of the most successful products.
Target used the fact that customers who are going through a major life event change their habits (routines). Due dates can be identified from the baby registry.
Example 4.6 (Walmart) Walmart began using predictive analytics in 2004. Mining trillions of bytes’ worth of sales data from recent hurricanes
Determine what customers most want to purchase leading up to a storm.
Strawberry Pop-Tarts are one of the most purchased food items, especially after storms, as they require no heating and can be eaten at any meal
Example 4.7 (Germany’s Otto) Otto sells other brands, does not stock those goods itself, hard to avoid one of the two evils: shipping delays until all the orders are ready for fulfilment, or lots of boxes arriving at different times.
- Analyze around 3 billion past transactions and 200 variables–past sales, searches on Otto’s site and weather information. They predict what customers will buy a week before they order. This system has proved so reliable, predicting with 90% accuracy what will be sold within 30 days, that Otto allows it automatically to purchase around \(200,000\) items a month from third-party brands with no human intervention.
[Economist]
[Germany’s Otto]
Example 4.8 (Stitch) Stitch Fix CEO Says AI Is ‘So Real’ and Incredibly Valuable
Stitch Fix asks customers for insights and feedback alongside their size and color preference for items, even the ones customers didn’t like or buy, in exchange for a clear value proposition.
The breadth and depth of their data are valuable.
Their model relies on a combination of data science – machine learning, AI and natural language processing – and human stylists; on top of complex customer profiles built by data, stylists can layer the nuances of buying and wearing clothes.
Example 4.9 (Uber Pool)
Bayes predicts where you’re going to be dropped off.
Uber constructs prior probabilities for riders, Uber cars, and popular places.
Combine to construct a joint probability table
Then calculate the posterior probability of destination for each person and pool travellers together
[Uber Pool]
Example 4.10 (Kaggle: Predictive Culture)
Example (Airbnb)
[Airbnb New User Bookings Prediction Competition]
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries.
The task is to accurately predict where a new user will book their first travel experience.
Airbnb can then personalize content, decrease the average time to first booking, and better forecast demand.
There are $12$ classes: major destinations, plus a "did not book" category.
[Economist]: https://www.economist.com/news/business/21720675-firm-using-algorithm-designed-cern-laboratory-how-germanys-otto-uses
[Germany's Otto]: https://www.youtube.com/watch?v=3XXqKxpr364
[Stitch Fix CEO Says AI Is 'So Real' and Incredibly Valuable]: https://www.bloomberg.com/news/videos/2018-10-25/stitch-fix-ceo-says-ai-is-so-real-and-incredibly-valuable-video
[Uber Pool]: https://newsroom.uber.com/inferring-uber-rider-destinations/
[Airbnb New User Bookings Prediction Competition]: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
List of users, demographics, web session records, and content data
<img src="fig/airbnb_arch-3dplace.png" width="70%" style="display: block; margin: auto;" />
Winner has the best out-of-sample prediction!!
Example 4.11 (Hacking OkCupid)
Sorted daters into seven clusters, like "Diverse" and "Mindful," each with distinct characteristics.
Example (NFL Dynamic Pricing)
Predict price demand for any given Lions game, for any given seat in the stadium.
<img src="https://fast.wistia.com/embed/iframe/l85ltyxqbh?autoplay=0" width="100%" style="display: block; margin: auto;" />
<https://grahamschool.uchicago.edu/academic-programs/masters-degrees/analytics/nfl-capstone>
We submitted our report in June 2016, suggesting that some areas of the stadium were priced efficiently and some were underpriced or overpriced.
In February 2017, Detroit Jock City wrote:
"Detroit Lions tickets will cost a little more on average for 2017, but some areas of the stadium will decrease or hold steady."
<img src="fig/lions.png" width="100%" style="display: block; margin: auto;" />
<https://detroitjockcity.com/2017/02/10/detroit-lions-2017-ticket-prices/>
Example 4.12 (Latour 1982 Price History)
Château Latour: grand vin
Example (Global Warming)
Shifting Distribution of Northern Hemisphere Summer Temperature Anomalies, 1951-2011
[NASA article with animation]
Climate statistics and public policy
Change in global mean temperature is not one of the most sensitive indicators.
Changes in radiation spectrum (atmosphere is less transparent in the infrared)
Sea surface temperature and Land surface temperature
Sea level rise (thermal expansion and ice melt): Greenland and West Antarctic are melting + glacial melt
Ocean acidification: CO\(_2\) gets absorbed by water, producing carbonic acid.
Seasonal changes: the winter-summer temperature difference has been decreasing since 1954. Seasonal shifts (earlier seasons) lead to ecological effects.
Hurricanes: the maximum wind velocity increases with the sea surface temperature and with the difference between the sea surface temperature and the average air temperature in the outflow of the hurricane.
2018 was the fourth-warmest year on record.
How Much Hotter Is Your Hometown Than When You Were Born?
Ice cores are an important source of data.
Ice core. Cylinder of ice drilled out of an ice sheet or glacier. Most ice core records come from Antarctica and Greenland.
The oldest continuous ice core records to date extend 123,000 years in Greenland and 800,000 years in Antarctica.
Ice Core Datasets
Ice core drilling has been around since the 1950s.
Mostly from Greenland and Antarctica
bubbles in the ice core preserve actual samples of the world’s ancient atmosphere
The World Data Center (WDC) for Paleoclimatology maintains archives of ice core data from polar and low-latitude mountain glaciers and ice caps throughout the world. Proxy climate indicators include oxygen isotopes, methane concentrations, dust content, as well as many other parameters.
https://www.ncdc.noaa.gov/data-access/paleoclimatology-data/datasets/ice-core
CO2 was stable over the last millennium
In the early 19th century CO2 concentration started to rise, and its concentration is now nearly 40% higher than it was before the industrial revolution
Things we learned from ice core
Ice cores contain information about past temperature, and about many other aspects of the environment.
Atmospheric carbon dioxide levels are now 40% higher than before the industrial revolution. This increase is due to fossil fuel usage and deforestation.
The magnitude and rate of the recent increase are almost certainly unprecedented over the last 800,000 years.
Methane also shows a huge and unprecedented increase in concentration over the last two centuries.
BAS article, The Verge Article
Gates thinks we can use more renewables and nuclear
Need better storage + generation (wind/sun) technology
Source: https://www.gatesnotes.com/Energy/A-critical-step-to-reduce-climate-change