Exercise 23.1 (Joint Distributions) A credit card company collects data on \(10,000\) users. The data contain two variables. The first is an indicator of the customer's status: whether they are in default (def = 1) or current with their payments (def = 0). The second is a measure of their loan balance relative to income, with three categories: low (bal = 1), medium (bal = 2) and high (bal = 3). The data are given in the following table:
| bal | def = 0 | def = 1 |
|---|---|---|
| 1 | 8,940 | 64 |
| 2 | 651 | 136 |
| 3 | 76 | 133 |
Compute the marginal distribution of customer status.
What is the conditional distribution of bal given def = 1?
Make a prediction for the status of a customer with a high balance.
Exercise 23.2 (Marginal) The table below is taken from the Hoff text and shows the joint distribution of occupations taken from a 1983 study of social mobility by Logan (1983). Each cell is P(father’s occupation, son’s occupation).
library(magrittr)  # provides the %>% pipe used below
d = data.frame(
  farm         = c(0.018, 0.002, 0.001, 0.001, 0.001),
  operatives   = c(0.035, 0.112, 0.066, 0.018, 0.029),
  craftsman    = c(0.031, 0.064, 0.094, 0.019, 0.032),
  sales        = c(0.008, 0.032, 0.032, 0.010, 0.043),
  professional = c(0.018, 0.069, 0.084, 0.051, 0.130),
  row.names    = c("farm", "operative", "craftsman", "sales", "professional")
)
d %>% knitr::kable()
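| Father \ Son | farm | operatives | craftsman | sales | professional |
|---|---|---|---|---|---|
| farm | 0.018 | 0.035 | 0.031 | 0.008 | 0.018 |
| operative | 0.002 | 0.112 | 0.064 | 0.032 | 0.069 |
| craftsman | 0.001 | 0.066 | 0.094 | 0.032 | 0.084 |
| sales | 0.001 | 0.018 | 0.019 | 0.010 | 0.051 |
| professional | 0.001 | 0.029 | 0.032 | 0.043 | 0.130 |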
Find the marginal distribution of fathers’ occupations.
Find the marginal distribution of sons’ occupations.
Find the conditional distribution of the son’s occupation given that the father is a farmer.
Find the conditional distribution of the father’s occupation given that the son is a farmer.
Comment on these results. What do they say about changes in farming in the population from which these data are drawn?
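As a hedged illustration of how marginals and conditionals are read off a joint table, the sketch below re-enters the table as a matrix, assuming rows are fathers' occupations and columns are sons'. The object names (`joint`, `father_marginal`, and so on) are illustrative and not part of the original exercise.

```r
# Joint distribution P(father's occupation, son's occupation).
# Assumption: rows index the father, columns index the son.
occ <- c("farm", "operative", "craftsman", "sales", "professional")
joint <- matrix(
  c(0.018, 0.035, 0.031, 0.008, 0.018,
    0.002, 0.112, 0.064, 0.032, 0.069,
    0.001, 0.066, 0.094, 0.032, 0.084,
    0.001, 0.018, 0.019, 0.010, 0.051,
    0.001, 0.029, 0.032, 0.043, 0.130),
  nrow = 5, byrow = TRUE,
  dimnames = list(father = occ, son = occ)
)

# Marginals: sum the joint over the other variable.
father_marginal <- rowSums(joint)
son_marginal    <- colSums(joint)

# Conditionals: divide a row (or column) of the joint by its marginal.
son_given_father_farm <- joint["farm", ] / father_marginal[["farm"]]
father_given_son_farm <- joint[, "farm"] / son_marginal[["farm"]]

print(round(father_marginal, 3))
print(round(son_given_father_farm, 3))
```

Each conditional distribution should sum to one, which is a quick sanity check on the arithmetic.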
Exercise 23.3 (Conditional) Netflix surveyed the general population about the number of hours per week that people use their service. The following table provides the proportions in each category according to whether the respondent is a teenager or an adult.
| Hours | Teenager | Adult |
|---|---|---|
| \(<4\) | 0.18 | 0.20 |
| \(4\) to \(6\) | 0.12 | 0.32 |
| \(>6\) | 0.04 | 0.14 |
Calculate the following probabilities:
Given that you spend \(4\) to \(6\) hours a week watching movies, what’s the probability that you are a teenager?
What is the marginal distribution of hours spent watching movies?
Are hours spent watching Netflix movies independent of age?
Exercise 23.4 (Joint and Conditional) The following probability table relates \(Y\) the number of TV shows watched by the typical student in an evening to the number of drinks \(X\) consumed.
| X \ Y | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.07 | 0.09 | 0.06 | 0.01 |
| 1 | 0.07 | 0.06 | 0.07 | 0.01 |
| 2 | 0.06 | 0.07 | 0.14 | 0.03 |
| 3 | 0.02 | 0.04 | 0.16 | 0.04 |
What is the probability that a student has more than two drinks in an evening?
What is the probability that a student drinks more than the number of TV shows they watch?
What’s the conditional distribution of the number of TV shows watched given they consume \(3\) drinks?
What’s the expected number of drinks given they do not watch TV?
Are drinking and watching TV independent?
Exercise 23.5 (Conditional probability) Shipments from an online retailer take between 1 and 7 days to arrive, depending on where they ship from, when they were ordered, the size of the item, etc. Suppose the distribution of delivery times has the following distribution function:
| \(x\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| \(\mbox{P}(X = x)\) | | | | | | | |
| \(\mbox{P}(X \leq x)\) | 0.10 | 0.20 | 0.70 | 0.75 | 0.80 | 0.90 | 1 |
Fill in the above probability table.
What is the conditional probability of a delivery arriving on day four given that it did not arrive in the first three days? (Hint: find \(P(X = 4 \mid X \geq 4)\))
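The link between the two rows of the table is \(P(X = x) = F(x) - F(x-1)\). A minimal sketch of that step, and of the conditional probability in the hint, might look as follows; the variable names are illustrative only.

```r
# Cumulative distribution function from the table, P(X <= x) for x = 1..7.
x   <- 1:7
cdf <- c(0.10, 0.20, 0.70, 0.75, 0.80, 0.90, 1)

# Probability mass function: successive differences of the CDF,
# using F(0) = 0 as the starting value.
pmf <- diff(c(0, cdf))

# P(X = 4 | X >= 4) = P(X = 4) / P(X >= 4).
p4_given_late <- pmf[x == 4] / sum(pmf[x >= 4])

print(rbind(x = x, pmf = pmf, cdf = cdf))
print(p4_given_late)
```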
Exercise 23.6 (Joint and Conditional) A cable television company has \(10000\) subscribers in a suburban community. The company offers two premium channels, HBO and Showtime. Suppose \(2750\) subscribers receive HBO and \(2050\) receive Showtime and \(6200\) do not receive any premium channel.
What is the probability that a randomly selected subscriber receives both HBO and Showtime?
What is the probability that a randomly selected subscriber receives HBO but not Showtime?
You now obtain a new dataset, categorized by gender, on the proportions of people who watch HBO and Showtime, given below:
| Cable | Female | Male |
|---|---|---|
| HBO | 0.14 | 0.48 |
| Showtime | 0.17 | 0.21 |
Conditional on being female, what’s the probability you receive HBO?
Conditional on being female, what’s the probability you receive Showtime?
Exercise 23.7 (Conditionals and Expectations) The following probability table describes the daily sales volume, \(X\), in thousands of dollars for a salesperson and the number of years \(Y\) of sales experience at a particular company.
| X \ Y | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 10 | 0.14 | 0.03 | 0.03 | 0 |
| 20 | 0.05 | 0.10 | 0.12 | 0.07 |
| 30 | 0.10 | 0.06 | 0.25 | 0.05 |
Verify that this is a legal probability table.
What is the probability of at least two years of experience?
Calculate the mean daily sales volume.
Given a salesperson has three years experience, calculate the mean daily sales volume.
A salesperson is paid $1000 per week plus 2% of total sales. What is the expected compensation for a salesperson?
Exercise 23.8 (Expectation) \(E(X+Y) = E(X) + E(Y)\) only if the random variables \(X\) and \(Y\) are independent.
Exercise 23.9 (Conditional Probability) A supermarket carried out a survey and found the following probabilities for people who buy generic products depending on whether they visit the store frequently or not.
| Visit \ Purchase Generic | Often | Sometimes | Never |
|---|---|---|---|
| Frequent | 0.10 | 0.50 | 0.17 |
| Infrequent | 0.03 | 0.05 | 0.15 |
What is the probability that a customer who never buys generics visits the store?
What is the probability that a customer often purchases generic?
Are buying generics and visiting the store independent decisions?
What is the conditional distribution of purchasing generics given that you frequently visit the store?
Exercise 23.10 (Conditional Probability) Cooper Realty is a small real estate company located in Albany, New York, specializing primarily in residential listings. They have recently become interested in determining the likelihood of one of their listings being sold within a certain number of days. An analysis of recent company sales of 800 homes produced the following table:
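Days Listed until Sold

| Initial Asking Price | Under 20 | 31-90 | Over 90 | Total |
|---|---|---|---|---|
| Under $50K | 50 | 40 | 10 | 100 |
| $50-$100K | 20 | 150 | 80 | 250 |
| $100-$150K | 20 | 280 | 100 | 400 |
| Over $150K | 10 | 30 | 10 | 50 |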
Estimate the probability that a home is listed for over 90 days before being sold.
Estimate the probability that the initial asking price is under $50K.
What is the probability of both of the above happening? Are these two events independent?
Assuming that a contract has just been signed to list a home that has an initial asking price of less than $100K, what is the probability that the home will take Cooper Realty more than 90 days to sell?
Exercise 23.11 (Probability and Combinations.) In 2006, the St. Louis Cardinals and the Detroit Tigers played for the World Series. The two teams play up to seven games, and the first team to win four games wins the World Series.
The Cardinals were leading the series 3 – 1. Given that each game is independent of another and that the probability of the Cardinals winning any single game is 0.55, what’s the probability that they would go on to win the World Series?
In 2012, the St. Louis Cardinals found themselves in a similar situation against the San Francisco Giants in the National League Championships. Now suppose that the probability of the Cardinals winning any single game is 0.45.
How does the probability that they get to the World Series differ from before?
Exercise 23.12 (Probability and Lotteries) The Powerball lottery is open to participants across several states. When entering the Powerball lottery, a participant selects five numbers from 1-59 and then selects a Powerball number from 1-35. In addition, there’s a $1 million payoff for anybody selecting the first five numbers correctly.
Show that the odds of winning the Powerball Jackpot are 1 in 175,223,510.
Show that the odds of winning the $1 million are 1 in 5,153,632.
On February 18, 2006 the Jackpot reached $365 million. Assuming that you will either win the Jackpot or the $1 million prize, what’s your expected value of winning?
Mega Millions is a similar lottery where you pick 5 balls out of 56 and a powerball from 46. Show that the odds of winning Mega Millions are higher than the Powerball lottery. On March 30, 2012 the Jackpot reached $656 million. Is your expected value higher or lower than that calculated for the Powerball lottery?
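The counting behind these odds can be checked with R's `choose` function. The sketch below assumes the five numbers are drawn without replacement and order does not matter, and that the $1 million prize requires matching the five numbers while missing the Powerball; the variable names are illustrative.

```r
# Number of distinct Powerball tickets: choose 5 of 59, times 35 Powerball values.
powerball_tickets <- choose(59, 5) * 35

# $1 million prize (assumption: match all five numbers, miss the Powerball).
# There are 34 losing Powerball values, so the odds are 1 in tickets / 34.
second_prize_odds <- powerball_tickets / 34

# Mega Millions: choose 5 of 56, times 46 values for the extra ball.
mega_tickets <- choose(56, 5) * 46

print(format(powerball_tickets, big.mark = ","))
print(format(floor(second_prize_odds), big.mark = ","))
print(format(mega_tickets, big.mark = ","))
```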
Exercise 23.13 (Joint Probability) A market research survey finds that in a particular week \(28\%\) of all adults watch a financial news television program; \(17\%\) read a financial publication and \(13\%\) do both.
Fill in the blanks in the following joint probability table
| | Watches TV | Doesn’t Watch | Total |
|---|---|---|---|
| Reads | .13 | | .17 |
| Doesn’t Read | | | |
| Total | .28 | | 1.00 |
What is the probability that someone who watches a financial TV program reads a publication oriented towards finance?
What is the probability that someone who reads a finance publication watches a financial TV program?
Why aren’t the answers to the above questions equal?
Exercise 23.14 (Conditional Probability.) A local bank is reviewing its credit card policy. In the past 5% of card holders have defaulted. The bank further found that the chance of missing one or more monthly payments is 0.20 for customers who do not default. Of course, the probability of missing one or more payments for those who default is 1.
Given that a customer has missed a monthly payment, compute the probability that the customer will default.
The bank would like to recall its card if the probability that a customer will default is greater than 0.20. Should the bank recall its card if the customer misses a monthly payment? Why or why not?
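The first question is a direct application of Bayes' rule, \(P(D \mid M) = P(M \mid D)P(D) / P(M)\), with \(P(M)\) obtained from the law of total probability. A minimal sketch using the numbers stated in the exercise (variable names are illustrative):

```r
# Inputs stated in the exercise.
p_default            <- 0.05  # prior probability of default
p_miss_given_default <- 1     # defaulters always miss a payment
p_miss_given_ok      <- 0.20  # non-defaulters miss with probability 0.20

# Law of total probability: P(miss).
p_miss <- p_miss_given_default * p_default +
  p_miss_given_ok * (1 - p_default)

# Bayes' rule: P(default | miss).
p_default_given_miss <- p_miss_given_default * p_default / p_miss

print(p_default_given_miss)
print(p_default_given_miss > 0.20)  # compare with the bank's recall threshold
```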
Exercise 23.15 (Correlation) The following table shows the descriptive statistics from \(1000\) days of returns on IBM and Exxon’s stock prices.
| | N | Mean | StDev | SE Mean |
|---|---|---|---|---|
| IBM | 1000 | 0.0009 | 0.0157 | 0.00049 |
| Exxon | 1000 | 0.0018 | 0.0224 | 0.00071 |

Here is the covariance table:

| | IBM | Exxon |
|---|---|---|
| IBM | 0.000247 | |
| Exxon | 0.000068 | 0.00050 |
What is the variance of IBM returns?
What is the correlation between IBM and Exxon’s returns?
Consider a portfolio that invests \(50\)% in IBM and \(50\)% in Exxon. What are the mean and variance of the portfolio? Do you prefer this portfolio to just investing in IBM on its own?
Exercise 23.16 (Normal Distribution) After Facebook’s earnings announcement we have the following distribution of returns. First, the stock beats earnings expectations \(75\)% of the time, and the other \(25\)% of the time earnings are in line or disappoint. Second, when the stock beats earnings, the percent change is normally distributed with a mean of \(10\)% and a standard deviation of \(5\)%; when the stock misses earnings, it is normally distributed with a mean of \(-5\)% and a standard deviation of \(8\)%.
Ahead of the earnings announcement, what is the probability that Facebook stock will have a return greater than \(5\)%?
Do you get the same answer for the probability that it drops at least \(5\)%?
Use simulation to provide empirical answers with a sample of size \(N = 10,000\), and check how close you get to the theoretical answers you’ve found to the questions posed above. Provide histograms of the distributions you simulate.
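One possible way to set up the simulation is to draw the beat/miss indicator first and then draw the return from the matching normal distribution. The sketch below does this for the first question only; the seed and variable names are arbitrary choices, not part of the exercise.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 10000

# Step 1: does the stock beat expectations? (probability 0.75)
beat <- rbinom(n, size = 1, prob = 0.75) == 1

# Step 2: draw the percent return from the corresponding normal distribution.
ret <- ifelse(beat,
              rnorm(n, mean = 10, sd = 5),
              rnorm(n, mean = -5, sd = 8))

# Theoretical answer from the law of total probability.
p_up_theory <- 0.75 * pnorm(5, mean = 10, sd = 5, lower.tail = FALSE) +
  0.25 * pnorm(5, mean = -5, sd = 8, lower.tail = FALSE)

# Empirical answer from the simulated sample.
p_up_sim <- mean(ret > 5)

print(c(theory = p_up_theory, simulation = p_up_sim))
hist(ret, breaks = 50, main = "Simulated returns (%)", xlab = "return")
```

The same pattern, with `mean(ret < -5)` and `lower.tail = TRUE`, would address the second question.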
Exercise 23.17 (Probability) Answer the following statements TRUE or FALSE, providing a succinct explanation of your reasoning.
If the odds in favor of \(A\) are 3:5 then \(\mbox{P}(A) = 0.4\).
You roll two fair three-sided dice. The probability the two dice show the same number is 1/4.
If events \(A\) and \(B\) are independent and \(\mbox{P}(A) > 0\) and \(\mbox{P}(B)>0\), then \(\mbox{P}(A \mbox{ and } B) > 0\).
If \(\mbox{P}(A \; \text{ and} \; B) \geq 0.5\) then \(P(A) \leq 0.5\).
If two random variables have non-zero correlation, then they must be dependent.
If two random variables have zero correlation, then they must be independent.
If two random variables are independent, then the correlation between them must be zero.
If \(P(A \text{ and } B) \leq 0.2\), then \(P(A) \leq 0.2\).
Exercise 23.18 (Binomial Distribution) The Downhill Manufacturing company produces snowboards. The average life of their product is \(10\) years. A snowboard is considered defective if its life is less than \(5\) years. The distribution is approximately normal with a standard deviation for the life of a board of \(3\) years.
What’s the probability of a snowboard being defective?
In a shipment of \(120\) snowboards, what is the probability that the number of defective boards is greater than \(10\)?
Use simulation to provide empirical answers with a sample of size \(N = 10,000\), and check how close you get to the theoretical answers you’ve found to the questions posed above. Provide histograms of the distributions you simulate.
You can use R and simulation with rbinom and rnorm as an alternative.
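A hedged sketch of how `pnorm`, `pbinom`, `rnorm` and `rbinom` fit together here, assuming boards in a shipment are defective independently of one another (the seed and variable names are arbitrary):

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Theoretical probability that a single board is defective: P(life < 5).
p_def <- pnorm(5, mean = 10, sd = 3)

# Theoretical probability of more than 10 defectives among 120 boards,
# assuming independent boards so the count is Binomial(120, p_def).
p_more_than_10 <- pbinom(10, size = 120, prob = p_def, lower.tail = FALSE)

# Simulation counterparts with 10,000 draws each.
lives      <- rnorm(10000, mean = 10, sd = 3)
p_def_sim  <- mean(lives < 5)
counts     <- rbinom(10000, size = 120, prob = p_def)
p_more_sim <- mean(counts > 10)

print(c(p_def = p_def, p_def_sim = p_def_sim))
print(c(p_more = p_more_than_10, p_more_sim = p_more_sim))
hist(lives, breaks = 50, main = "Simulated board lifetimes", xlab = "years")
hist(counts, main = "Simulated defectives per shipment", xlab = "count")
```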
Exercise 23.19 (Chinese Stock Market) On August 24th, 2015, Chinese equities ended down \(- 8.5\)% (Black Monday). Over the last \(25\) years, the average return is \(0.09\)% with a volatility of \(2.6\)%, and \(56\)% of the time the market closes within one standard deviation. For the SP500, the average is \(0.03\)% with a volatility of \(1.1\)%, and \(74\)% of the time it closes within one standard deviation.
Exercise 23.20 (Body Weight) Suppose that your model for weight \(X\) is a Normal distribution with mean \(190\) lbs and variance \(100\) lbs\(^2\). What proportion of people have weights over 200 lbs?
Exercise 23.21 (Google Returns) We estimated the sample mean and sample variance of daily returns of Google stock: \(\bar x = 0.025\) and \(s^2 = 1.1\). To calculate the probability that I lose \(3\)% in a day, I need to assume a probabilistic model of the return and then calculate \(p(r < -3)\). Say we assume that returns are normally distributed, \(r \sim N( \mu , \sigma^2 )\). Estimate the parameters of the distribution from the observed data and calculate \(p(r<-3) = 0.003\).
Exercise 23.22 (Portfolio Means, Standard Deviations and Correlation) Suppose you have a portfolio that is invested with a weight of 75% in the U.S. and 25% in HK. You take a sample of 10 years, or 120 months of historical means, standard deviations and correlations for U.S. and Hong Kong stock market returns. Given this information compute the mean and standard deviation of the returns on your portfolio.
| | N | MEAN | STDEV |
|---|---|---|---|
| Hong Kong | 120 | 0.0170 | 0.0751 |
| US | 120 | 0.0115 | 0.0330 |

Correlation = 0.3
Hint: you will find the following formulas useful. Let \(R_p\) denote the return on your portfolio which is a weighted combination \(R_p = pX + (1 - p)Y\) . Then \[
E(R_p) = p\mu_X + (1 - p)\mu_Y
\]\[
Var(R_p) = p^2\sigma_X^2 + (1 - p)^2\sigma_Y^2 + 2p(1-p)\rho \sigma_X \sigma_Y
\] where \(\mu_X\), \(\mu_Y\) and \(\sigma_X\), \(\sigma_Y\) are the underlying means and standard deviations for \(X\) and \(Y\).
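The two formulas in the hint translate almost line for line into R. The sketch below plugs in the table values, taking \(X\) to be the U.S. return (weight \(p = 0.75\)) and \(Y\) the Hong Kong return; that assignment and the variable names are assumptions made for illustration.

```r
# Inputs from the table; X = US, Y = Hong Kong (an illustrative assignment).
p     <- 0.75
mu_x  <- 0.0115; sd_x <- 0.0330   # US
mu_y  <- 0.0170; sd_y <- 0.0751   # Hong Kong
rho   <- 0.3

# E(R_p) = p * mu_X + (1 - p) * mu_Y
mean_p <- p * mu_x + (1 - p) * mu_y

# Var(R_p) = p^2 sd_X^2 + (1 - p)^2 sd_Y^2 + 2 p (1 - p) rho sd_X sd_Y
var_p <- p^2 * sd_x^2 + (1 - p)^2 * sd_y^2 +
  2 * p * (1 - p) * rho * sd_x * sd_y
sd_p <- sqrt(var_p)

print(c(mean = mean_p, sd = sd_p))
```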
Exercise 23.23 (Binomial) In the game Chuck-a-Luck you pick a number from 1 to 6. You roll three dice. If your number doesn’t appear on any dice, you lose $1. If your number appears exactly once, you win $1. If your number appears on exactly two dice, you win $2. If your number appears on all three dice, you win $3.
Hence every outcome has how much you win or lose on the game, namely \(-1, 1, 2\) or \(3\).
Fill in the blanks in the pdf and cdf values
| \(X\) | -1 | 1 | 2 | 3 |
|---|---|---|---|---|
| \(P(X)\) | | | | |
| \(F(X)\) | | | | |
Explain your reasoning carefully.
Compute the expected value of the game, \(E(X)\).
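One way to check a hand calculation is to note that, for fair and independent dice, the number of dice showing your number is Binomial with \(n = 3\) and \(p = 1/6\). A minimal sketch under that assumption (variable names are illustrative):

```r
# Number of dice (out of three) that show the chosen number.
matches <- 0:3

# Payoff attached to 0, 1, 2 and 3 matches, as described in the exercise.
payoff <- c(-1, 1, 2, 3)

# Assumption: fair, independent dice, so matches ~ Binomial(3, 1/6).
pmf <- dbinom(matches, size = 3, prob = 1 / 6)
cdf <- cumsum(pmf)

# Expected value of the game: sum of payoff times probability.
expected_value <- sum(payoff * pmf)

print(rbind(X = payoff, pmf = pmf, cdf = cdf))
print(expected_value)
```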
Exercise 23.24 (Binomial Distribution) A real estate firm in Florida offers a free trip to Florida for potential customers. Experience has shown that of the people who accept the free trip, 5% decide to buy a property. If the firm brings \(1000\) people, what is the probability that at least \(125\) will decide to buy a property?
Exercise 23.25 (Expectation and Strategy) An oil company wants to drill in a new location. A preliminary geological study suggests that there is a \(20\)% chance of finding a small amount of oil, a \(50\)% chance of a moderate amount and a \(30\)% chance of a large amount of oil. The company has a choice of either a standard drill that simply burrows deep into the earth or a more sophisticated drill that is capable of horizontal drilling and can therefore extract more but is far more expensive. The following table provides the payoff table in millions of dollars under different states of the world and drilling conditions
| Oil | small | moderate | large |
|---|---|---|---|
| Standard Drilling | 20 | 30 | 40 |
| Horizontal Drilling | -20 | 40 | 80 |
Find the following
The mean and variance of the payoffs for the two different strategies
The strategy that maximizes their expected payoff
Briefly discuss how the variance of the payoffs would affect your decision if you were risk averse
How much are you willing to pay for a geological evaluation that would tell you with certainty the quantity of oil at the site prior to drilling?
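The quantities asked for can be organised as matrix arithmetic on the payoff table. The sketch below is one possible layout; it interprets the value of the geological evaluation as the expected value of perfect information, which is an assumption about what the last question intends.

```r
# State probabilities and payoff table (millions of dollars) from the exercise.
prob   <- c(small = 0.2, moderate = 0.5, large = 0.3)
payoff <- rbind(standard   = c(20, 30, 40),
                horizontal = c(-20, 40, 80))

# Mean and variance of each strategy: E(X) and E(X^2) - E(X)^2.
ev       <- drop(payoff %*% prob)
variance <- drop(payoff^2 %*% prob) - ev^2

# With perfect information we would pick the best strategy in each state.
ev_perfect <- sum(prob * apply(payoff, 2, max))

# Expected value of perfect information (assumed meaning of the last question).
evpi <- ev_perfect - max(ev)

print(rbind(mean = ev, variance = variance))
print(evpi)
```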
Exercise 23.26 (Google Survey) Visitors to your website are asked to answer a single Google survey question before they get access to the content on the page. Among all of the users, there are two categories:
Random Clicker (RC)
Truthful Clicker (TC)
There are two possible answers to the survey: yes and no.
Random clickers would click either one with equal probability. You are also given the information that the expected fraction of random clickers is \(0.3\).
After a trial period, you get the following survey results: \(65\)% said Yes and \(35\)% said No.
What fraction of the people who are truthful clickers answered yes?
23.1.1 Computing
Exercise 23.27 (Portfolio Means, Standard Deviations and Correlation) You want to build a portfolio of exchange traded funds (ETFs) for your retirement strategy. You’re thinking of whether to invest in growth or value stocks, or maybe a combination of both. Vanguard has two ETFs, one for growth (VUG) and one for value (VTV).
Plot the historical price series for VUG vs VTV.
Calculate the means and standard deviations of both ETFs.
Calculate their covariance.
Suppose you decide on a portfolio that is a 50 / 50 split. Calculate the new mean and variance of your portfolio.
Which portfolio best suits you?
What’s the probability that growth (VUG) will beat value (VTV) in the future?
You will find the following formulas useful. Let \(P\) denote the return on your portfolio which is a weighted combination \(P = aX + bY\). Then \[
E(P) = aE(X) + bE(Y )
\]\[
Var(P ) = a^2Var(X) + b^2Var(Y ) + 2abCov(X, Y ),
\] where \(Cov(X, Y )\) is the covariance for \(X\) and \(Y\).
Hint: You can use the following code to get the data
library(quantmod)
getSymbols(c("VUG", "VTV"), from = "2015-01-01", to = "2024-01-01")
VUG = VUG$VUG.Adjusted
VTV = VTV$VTV.Adjusted
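The portfolio formulas apply to returns rather than prices, so the adjusted prices would first need converting, for example with `diff(log(VUG))`. To keep the sketch below runnable without a network connection it uses simulated stand-in returns; the vectors `x` and `y` and their parameters are purely hypothetical and are not the VUG or VTV data.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Hypothetical stand-ins for daily returns of the two ETFs (NOT real data).
x <- rnorm(500, mean = 0.0005, sd = 0.012)
y <- 0.6 * x + rnorm(500, mean = 0.0003, sd = 0.008)

# 50 / 50 portfolio weights.
a <- 0.5
b <- 0.5

# Formulas from the text: E(P) and Var(P) built from the pieces.
mean_formula <- a * mean(x) + b * mean(y)
var_formula  <- a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)

# Direct computation on the portfolio return series for comparison.
port <- a * x + b * y

print(c(formula = mean_formula, direct = mean(port)))
print(c(formula = var_formula,  direct = var(port)))

# Empirical share of days on which x beat y.
print(mean(x > y))
```

The formula and direct results agree exactly for sample moments, which makes this a convenient check before applying the same steps to the downloaded series.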
Exercise 23.28 (Descriptive Statistics in R) Use the superbowl1.txt and derby2016.csv datasets. The Superbowl contains data on the outcome of all previous Superbowls. The outcome is defined as the difference in scores of the favorite minus the underdog. The spread is the bookmakers’ prediction of the outcome before the game begins. The Derby data consists of all of the results on the Kentucky Derby which is run on the first Saturday in May every year at Churchill Downs racetrack. Answer the following questions
For the Superbowl data.
Plot the spread and outcome variables. Calculate means, standard deviations, covariances, correlations.
What is the mean and the standard deviation of the winning margin (outcome)?
Use a boxplot to compare the favorites’ score versus the underdog.
Does this data look normally distributed?
For the Derby data.
Plot a histogram of the winning speeds and times of the horses. Why is there a long right-hand tail to the distribution of times?
Can you identify the outlying horse with the best winning time?
Exercise 23.29 (Berkshire Hathaway: Yahoo Finance Data) Download daily return data in Warren Buffett’s firm Berkshire Hathaway (ticker symbol: BRK-A) from 1990 to the present. Analyze this data in the following way:
Plot the Historical Price Performance of the stock.
Calculate the Daily returns. Plot a histogram of the returns. Comment on the distribution that you obtain.
Use the summary command to provide statistical data summaries.
Interpret your findings.
Exercise 23.30 (Confidence Intervals) A sample of weights of 40 rainbow trout revealed that the sample mean is 402.7 grams and the sample standard deviation is 8.8 grams.
What is the estimated mean weight of the population?
What is the 99% confidence interval for the mean?
Exercise 23.31 (Confidence Intervals) A research firm conducted a survey to determine the mean amount steady smokers spend on cigarettes during a week.
A sample of 49 steady smokers revealed that \(\bar{X} = 20\) and \(s = 5\) dollars.
What is the point estimate? Explain what it indicates.
Using a 95% confidence interval, determine the confidence interval for \(\mu\). Explain what it indicates.
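A large-sample interval has the form \(\bar{X} \pm z_{0.975}\, s/\sqrt{n}\). The sketch below computes it for the smokers example and shows the \(t\)-based variant alongside; which of the two the exercise intends is not stated, so both are given as options.

```r
# Summary statistics stated in the exercise.
xbar <- 20
s    <- 5
n    <- 49

# Standard error of the sample mean.
se <- s / sqrt(n)

# 95% interval using the normal quantile.
ci_z <- xbar + c(-1, 1) * qnorm(0.975) * se

# 95% interval using the t quantile with n - 1 degrees of freedom.
ci_t <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se

print(round(ci_z, 2))
print(round(ci_t, 2))
```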
Exercise 23.32 (Back cast US Presidential Elections) Use data from presidential polls to predict the winner of the elections. We will be using data from http://www.electoral-vote.com/. The goal is to use simulations to predict the winning percentage for each of the candidates. Use election.Rmd script as the starter.
Report prediction as a 50% confidence interval for each of the candidates.
Exercise 23.33 (Russian Parliament Election Fraud (5 pts)) On September 28, 2016 the United Russia party won a super majority of seats, which will allow them to change the Constitution without any votes from other parties. Throughout the day there were reports of voting fraud, including video purporting to show officials stuffing ballot boxes. Additionally, results in many regions show that United Russia got anomalously close results at many polling stations, for example, 62.2% at more than a hundred polling stations in Saratov Region.
Using the assumption that United Russia’s range in Saratov was [57.5%, 67.5%] and that results for each polling station are rounded to one decimal place (when measured in percent), calculate the probability that in 100 polling stations out of 1800 in Saratov Region the majority party got exactly 62.2%.
Do you think it can happen by chance?
Exercise 23.34 (A/B Testing) Use the dataset from ab_browser_test.csv. Here is the definition of the columns:
userID: unique user ID
browser: browser which was used by userID
slot: status of the user (exp = saw modified page, control = saw unmodified page)
n_clicks: total number of clicks the user made as a result of n_queries
n_queries: number of queries made by userID, who used browser browser
n_nonclk_queries: number of queries that did not result in any clicks
Note that not everyone uses a single browser, so there might be multiple rows with the same userID. In this data set, the combination of userID and browser is the unique row identifier.
Count how many users are in each group. How much larger (in percent) is the exp group compared to the control group?
Using the bootstrap, construct a 95% confidence interval for the mean and median number of clicks in the exp group and the control group. Are the mean and median significantly different?
Using the bootstrap, check whether the mean of each group has a normal distribution. Generate \(B = 1000\) bootstrap samples, calculate the mean of each, and plot a Q-Q plot.
Use the z-ratio for the means to perform hypothesis testing, with \(H_0\): there is no difference in the average number of clicks between the 2 groups.
For each browser type and each of the 2 groups (control and exp), compute the percent of queries that did not result in any clicks. You can do this by dividing the sum of n_nonclk_queries by the sum of n_queries. Comment on your results.
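The bootstrap steps above can be sketched on a small stand-in sample before running them on the real file. In the sketch below `clicks` is simulated Poisson data, a purely hypothetical substitute for the `n_clicks` column of one group, and the percentile method is one of several ways to form the interval.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Hypothetical stand-in for n_clicks in one group (NOT the real data).
clicks <- rpois(2000, lambda = 11)

# B bootstrap resamples: draw with replacement, recompute the statistic.
B <- 1000
boot_mean   <- replicate(B, mean(sample(clicks, replace = TRUE)))
boot_median <- replicate(B, median(sample(clicks, replace = TRUE)))

# Percentile 95% confidence intervals.
ci_mean   <- quantile(boot_mean,   c(0.025, 0.975))
ci_median <- quantile(boot_median, c(0.025, 0.975))

print(ci_mean)
print(ci_median)

# Q-Q plot to judge whether the bootstrap means look normal.
qqnorm(boot_mean)
qqline(boot_mean)
```

With the real data, the same code would be run separately on the exp and control subsets of `n_clicks`.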
Exercise 23.35 (Chicago Crime Data Analysis) On January 24, 2017 Donald Trump tweeted about the "horrible" murder rate in Chicago.
Our goal is to analyze the data and check how statistically significant such a statement is. I downloaded Chicago’s crime data from the data portal: data.cityofchicago.org. This data contains reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data set has 6.3 million records. Each crime incident is categorized using one of the 35 primary crime types: NARCOTICS, THEFT, CRIMINAL TRESPASS, etc. I filtered incidents of type HOMICIDE into a separate data set stored in chi_homicide.rds. Use chi_crime.R as a starting script for this problem.
Create a heat map for the homicide incidents. In which areas of the city do you think houses are very affordable, and in which are they not?
Create a map by plotting a dot for each of the homicide incidents. You will see a picture similar to the one you saw with the heat map. Look at the Hyde Park area on the south side of Chicago. There is an "island" with no homicide incidents! Can you explain why? Hint: You might want to open Google Maps in your browser and zoom in on this area.
Though the president’s tweet is consistent with the data (goo.gl/VTPzFw), observing 52 homicides in January is not that unusual. Calculate the total number of homicides for each January. Use the bootstrap to estimate a 95% confidence interval for the mean \(\mu\) of January homicides. Is \(52\) within the interval? Calculate the confidence interval using the \(t\)-ratio. Do you think the results from the \(t\)-ratio based calculations are reliable?
The history of the 2001-present data is rather short. The Chicago Tribune provided the total number of homicides in Chicago for each month of the 1957-2014 period. Use this data set and calculate the confidence interval for \(\mu\) using the bootstrap and the \(t\)-ratio. Further, answer the following questions: (i) Assuming the monthly homicide rate follows a Normal distribution, what is the probability that we observe 52 homicides or more? (ii) Do you think the Normality assumption is valid? (iii) Assuming the monthly homicide rate follows a Poisson distribution, what is the probability that we observe 52 homicides or more?
There is a hypothesis that crime rates are related to temperatures (goo.gl/nPpHwv). Check this hypothesis using simple regression. Use a linear model to regress the homicide rate on the average maximum temperature. Does this relation appear significant? Perform residual diagnostics and find outliers and leverage points.
There is another hypothesis that the rise in murders is related to the pullback in proactive policing that started in November of 2015 as a result of the release of the Laquan McDonald video (https://goo.gl/7cm1CC, https://goo.gl/WcH2uB). I calculated the total number of homicides for each day and split the data into two parts: before and after the video release. Using the \(t\)-ratio, check the hypothesis \(H_0\): the homicide rate did not change after the video release.
Exercise 23.36 (Gibbs Sampler) Suppose we model data using the following model \[\begin{aligned}
y_i \sim &N(\mu,\tau^{-1})\\
\mu \sim & N(0,1)\\
\tau \sim & \mathrm{Gamma}(2,1).\end{aligned}\]
The goal is to implement a Gibbs sampler for the posterior \(\mu,\tau \mid y\), where \(y = (y_1,\ldots,y_n)\) is the observed data. The Gibbs sampler algorithm iterates between two steps:
Sample \(\mu_i\) from \(\mu \mid \tau_{i-1}, y\)
Sample \(\tau_i\) from \(\tau \mid \mu_i, y\)
Show that those full conditional distributions are given by \[
\begin{aligned}
\mu \mid \tau, y \sim & N\left(\dfrac{\tau n\bar y}{1+n\tau},\dfrac{1}{1+n\tau}\right)\\
\tau \mid \mu,y \sim & \mathrm{Gamma}\left(2+\dfrac{n}{2}, 1+\dfrac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\end{aligned}
\]
Use formulas for full conditional distributions and implement the Gibbs sampler. The data \(y\) is in the file MCMCexampleData.txt.
Plot samples from the joint distribution over \((\mu,\tau)\) on a scatter plot. Plot histograms for marginal \(\mu\) and \(\tau\) (marginal distributions).
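A minimal sketch of the sampler, written directly from the two full conditional distributions stated above. It assumes the second Gamma parameter is a rate, and it uses simulated data in place of MCMCexampleData.txt so that it runs on its own; the iteration count, burn-in length and starting values are arbitrary choices.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Simulated stand-in for the contents of MCMCexampleData.txt (NOT the real data).
y    <- rnorm(100, mean = 1, sd = 0.5)
n    <- length(y)
ybar <- mean(y)

n_iter <- 5000
mu  <- numeric(n_iter)
tau <- numeric(n_iter)
mu_cur  <- 0   # arbitrary starting values
tau_cur <- 1

for (i in seq_len(n_iter)) {
  # mu | tau, y ~ N(tau * n * ybar / (1 + n * tau), 1 / (1 + n * tau))
  prec   <- 1 + n * tau_cur
  mu_cur <- rnorm(1, mean = tau_cur * n * ybar / prec, sd = sqrt(1 / prec))

  # tau | mu, y ~ Gamma(2 + n / 2, 1 + sum((y - mu)^2) / 2), rate parametrisation
  tau_cur <- rgamma(1, shape = 2 + n / 2,
                    rate = 1 + 0.5 * sum((y - mu_cur)^2))

  mu[i]  <- mu_cur
  tau[i] <- tau_cur
}

# Discard an initial burn-in before summarising (500 is an arbitrary choice).
keep <- 501:n_iter
plot(mu[keep], tau[keep], pch = ".", xlab = "mu", ylab = "tau")
hist(mu[keep], breaks = 40, main = "Marginal posterior of mu", xlab = "mu")
hist(tau[keep], breaks = 40, main = "Marginal posterior of tau", xlab = "tau")
print(c(mu = mean(mu[keep]), tau = mean(tau[keep])))
```

To use the real data, `y` would be replaced by the values read from the file, for example with `scan`.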
Exercise 23.37 (AAPL vs GOOG) Download AAPL and GOOG return data from 2018 to 2024. Plot box-plot and histogram. Calculate summary statistics using summary function. Describe clearly what you learn from the summary and the plots.
Exercise 23.38 (Berkshire Realty) Berkshire Realty is interested in determining how long a property stays on the housing market. For a sample of \(800\) homes they find the following probability table for length of stay on the market before being sold as a function of the asking price
Days until Sold

| Asking Price | Under 20 | 20-40 | Over 40 |
|---|---|---|---|
| Under $250K | 50 | 40 | 10 |
| $250-500K | 20 | 150 | 80 |
| $500K-1M | 20 | 280 | 100 |
| Over $1M | 10 | 30 | 10 |
What is the probability of a randomly selected house that is listed over \(40\) days before being sold?
What is the probability that a randomly selected initial asking price is under \(250\)K?
What is the joint probability of both of the above events happening?
Assuming that a contract has just been signed to list a home for under $500K, what is the probability that Berkshire Realty will sell the home in under \(40\) days?
Exercise 23.39 (True/False)
If \(\mathbb{P} \left ( A \; \mathrm{ and} \; B \right ) \leq 0.2\) then \(\mathbb{P} (A) \leq 0.2\).
If \(P( A | B ) = 0.5\) and \(P(B ) = 0.5\), then the events \(A\) and \(B\) are necessarily independent.
A box has three drawers; one contains two gold coins, one contains two silver coins, and one contains one gold and one silver coin. Assume that one drawer is selected randomly and that a randomly selected coin from that drawer turns out to be gold. Then the probability that the chosen drawer contains two gold coins is \(50\)%.
Suppose that \(P(A) = 0.4 , P(B)=0.5\) and \(P( A \text{ or }B ) = 0.7\) then \(P( A \text{ and }B ) = 0.3\).
If \(P( A \text{ or }B ) = 0.5\) and \(P(A \text{ and }B ) = 0.5\), then \(P(A) = P( B)\).
The following data on age and marital status of \(140\) customers of a Bondi beach night club were taken

| Age | Single | Not Single |
|---|---|---|
| Under 30 | 77 | 14 |
| Over 30 | 28 | 21 |

Given this data, age and marital status are independent.
If \(P( A \text{ and }B ) = 0.5\) and \(P(A) = 0.1\), then \(P(B|A) = 0.1\).
In a group of students, \(45\)% play golf, \(55\)% play tennis and \(70\)% play at least one of these sports. Then the probability that a student plays golf but not tennis is \(15\)%.
The following probability table relates age to marital status

| Age | Single | Not Single |
|---|---|---|
| Under 30 | 0.55 | 0.10 |
| Over 30 | 0.20 | 0.15 |

Given these probabilities, age and marital status are independent.
Thirty six different kinds of ice cream can be found at Ben and Jerry’s. There are \(58,905\) different combinations of four choices of ice cream.
Suppose that for a certain Caribbean island the probability of a hurricane is \(0.25\), the probability of a tornado is \(0.44\) and the probability of both occurring is \(0.22\). Then the probability of a hurricane or a tornado occurring is \(0.05\).
If \(P ( A \text{ and }B ) \geq 0.10\) then \(P(A) \geq 0.10\).
If \(A\) and \(B\) are mutually exclusive events, then \(P(A|B) = 0\).
True. By definition, if \(A\) and \(B\) are mutually exclusive events then \(P( A \text{ and }B)=0\) and so \(P(A|B) = P(A \text{ and }B)/P(B) = 0\).
Exercise 23.40 (Marginal and Joint) Let \(X\) and \(Y\) be independent with a joint distribution given by \[f_{X,Y}(x,y) = \frac{1}{2 \pi} \sqrt{ \frac{1}{xy} } \exp \left( - \frac{x}{2} - \frac{y}{2} \right) \text{ where } x, y > 0 .\] Identify the following distributions
The marginal distribution of \(X\)
Compute the joint distribution of \(U = X\) and \(V = X+Y\)
Compute the marginal distribution of \(V\).
Exercise 23.41 (Conditional)
Let \(X\) and \(Y\) be independent standard \(N(0,1)\) random variables. Then \(X^2 + Y^2\) is an exponential distribution.
Let \(X\) and \(Y\) be independent Poisson random variables with rates \(\lambda\) and \(\mu\), respectively. Show that the conditional distribution of \(X | (X+Y)\) is Binomial with \(n = X+Y\) and \(p = \lambda / ( \lambda + \mu)\).
Exercise 23.42 (Joint and marginal) Let \(X\) and \(Y\) be independent exponential random variables with means \(\lambda\) and \(\mu\), respectively.
Find the joint distribution of \(U = X+Y\) and \(V = X / ( X+Y)\).
Find the marginal distributions for \(U\) and \(V\).
Exercise 23.43 (Toll road) You are designing a toll road that carries trucks and cars. Each week you see an average of 19,000 vehicles pass by. The current toll for cars is 50 cents and you wish to set the toll for trucks so that the revenue reaches $11,500 per week. You observe the following data: three of every four trucks on the road are followed by a car, while only one of every five cars is followed by a truck.
What is the equilibrium distribution of trucks and cars on the road?
What should you charge trucks so as to reach your goal of $11,500 in revenues per week?
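One way to set up the toll road exercise is to treat the sequence of vehicles as a two-state Markov chain, which is an assumption about how to read "followed by". Under that assumption, the sketch below solves for the stationary distribution and then backs out the truck toll. The object names (`P`, `stationary`, `truck_toll`) are illustrative.

```r
# Transition matrix of the vehicle sequence (rows: current, columns: next).
# Assumption: successive vehicles form a two-state Markov chain.
P <- matrix(c(1/4, 3/4,   # truck -> truck, truck -> car
              1/5, 4/5),  # car   -> truck, car   -> car
            nrow = 2, byrow = TRUE,
            dimnames = list(c("truck", "car"), c("truck", "car")))

# The stationary distribution solves pi %*% P = pi with sum(pi) = 1.
# One balance equation is redundant, so replace it by the normalisation.
A <- t(P) - diag(2)
A[2, ] <- 1
stationary <- solve(A, c(0, 1))
names(stationary) <- colnames(P)

weekly_vehicles <- 19000
car_toll <- 0.50
target_revenue <- 11500

# Truck toll needed to close the gap left by the car revenue.
car_revenue <- car_toll * weekly_vehicles * stationary["car"]
truck_toll <- (target_revenue - car_revenue) /
  (weekly_vehicles * stationary["truck"])

print(stationary)
print(unname(truck_toll))
```

Replacing a redundant equation with the normalisation constraint keeps the system square, so `solve()` can be used directly instead of an eigen-decomposition.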
Exercise 23.44 Let \(X, Y\) have a bivariate normal distribution with density \[f_{X,Y} ( x, y) = \frac{1}{ 2 \pi \sqrt{ 1 - \rho^2 } } \exp \left ( - \frac{1}{2 ( 1 - \rho^2 )} \left ( x^2 - 2 \rho x y + y^2 \right ) \right )\] Consider the transformation \(W=X\) and \(Z = \frac{ Y - \rho X}{ \sqrt{ 1 - \rho^2 } }\). Show that \(W , Z\) are independent and identify their distributions.
23.2 Bayes Rule
Exercise 23.45 (Manchester) While watching a game of Champions League football in a cafe, you observe someone who is clearly supporting Manchester United in the game. What is the probability that they were actually born within 20 miles of Manchester? Assume that you have the following base rate probabilities
The probability that a randomly selected person in a typical local bar environment is born within 20 miles of Manchester is 1/20
The chance that a person born within 20 miles of Manchester actually supports United is 7/10
The probability that a person not born within 20 miles of Manchester supports United is 1/10
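The three base rates above plug directly into Bayes rule for two hypotheses. The following R sketch is one way to organise the calculation; `bayes_posterior` is a hypothetical helper name, not something defined elsewhere in the text.

```r
# Posterior probability of a hypothesis H after observing evidence E,
# with the law of total probability in the denominator.
bayes_posterior <- function(prior, p_e_given_h, p_e_given_not_h) {
  numerator <- prior * p_e_given_h
  numerator / (numerator + (1 - prior) * p_e_given_not_h)
}

# H: born within 20 miles of Manchester; E: supports United.
p_local_given_fan <- bayes_posterior(prior = 1/20,
                                     p_e_given_h = 7/10,
                                     p_e_given_not_h = 1/10)
print(p_local_given_fan)
```

The same helper applies to any two-hypothesis problem in this section: take the prior to be the prevalence, `p_e_given_h` the sensitivity, and `p_e_given_not_h` one minus the specificity.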
Exercise 23.46 (Lung Cancer) According to the Centers for Disease Control and Prevention (CDC), “compared to nonsmokers, men who smoke are about \(23\) times more likely to develop lung cancer and women who smoke are about \(13\) times more likely.” You are also given the information that \(17.9\)% of women smoked in 2016.
If you learn that a woman has been diagnosed with lung cancer, and you know nothing else, what’s the probability she is a smoker?
Exercise 23.47 (Tesla Chip) Tesla purchases a particular chip called the HW 5.0 auto chip from three suppliers Matsushita Electric, Philips Electronics, and Hitachi. From historical experience we know that 30% of the chips are purchased from Matsushita; 20% from Philips and the remaining 50% from Hitachi. The manufacturer has extensive histories on the reliability of the chips. We know that 3% of the Matsushita chips are defective; 5% of the Philips and 4% of the Hitachi chips are defective.
A chip is later found to be defective; what is the probability it was manufactured by each of the manufacturers?
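With more than two hypotheses, Bayes rule becomes a normalisation of prior times likelihood. A minimal R sketch using the supplier shares and defect rates stated in the exercise (vector names are illustrative):

```r
# Prior: share of chips bought from each supplier.
prior <- c(Matsushita = 0.30, Philips = 0.20, Hitachi = 0.50)

# Likelihood: probability that a chip from each supplier is defective.
p_defect <- c(Matsushita = 0.03, Philips = 0.05, Hitachi = 0.04)

# Joint probability of (supplier, defective), then normalise.
joint <- prior * p_defect
p_defective <- sum(joint)        # law of total probability
posterior <- joint / p_defective # P(supplier | defective)

print(p_defective)
print(round(posterior, 4))
```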
Exercise 23.48 (Light Aircraft) Seventy percent of the light aircraft that disappear while in flight in a certain country are subsequently discovered. Of the aircraft that are discovered, 60% have an emergency locator, whereas 90% of the aircraft not discovered do not have such a locator. Suppose that a light aircraft has disappeared. If it has an emergency locator, what is the probability that it will be discovered?
Exercise 23.49 (Floyd Landis) Floyd Landis was disqualified after winning the 2006 Tour de France. This was due to a urine sample that the French national anti-doping laboratory flagged after Landis had won stage \(17\) because it showed a high ratio of testosterone to epitestosterone. Because he was among the leaders he provided \(8\) pairs of urine samples – so there were \(8\) opportunities for a true positive and \(8\) opportunities for a false positive.
Assume that the test has a specificity of \(95\)%. What’s the probability of all \(8\) samples being labeled “negative”?
Now assume the specificity is \(99\)%. What’s the false positive rate for the test?
Based on this data, explain how you would assess the probability of guilt of Landis in a court hearing.
Exercise 23.50 (BRCA1) Approximately \(1\)% of women aged \(40-50\) have breast cancer. A woman with breast cancer has a \(90\)% chance of a positive mammogram test and a \(10\)% chance of a false positive result. Given that someone has a positive test, what’s the posterior probability that they have breast cancer?
If you have the BRCA1 gene mutation, you have a \(90\)% chance of developing breast cancer. The prevalence of mutations at BRCA1 has been estimated to be 0.04%-0.20% in the general population. The genetic test for a mutation of this gene has a \(99.9\)% chance of finding the mutation. The false positive rate is unknown, but you are willing to assume it’s still \(10\)%.
Given that someone tests positive for the BRCA1 mutation, what’s the posterior probability that they have breast cancer?
Exercise 23.51 (Another Cancer Test) Bayes is particularly useful when predicting outcomes that depend strongly on prior knowledge.
Suppose that a woman in her forties takes a mammogram test and receives the bad news of a positive outcome. Since not every positive outcome is real, you assess the following probabilities. The base rate for a woman in her forties to have breast cancer is \(1.4\)%. The probability of a positive test given breast cancer is \(75\)% and the probability of a false positive is \(10\)%.
Given the positive test, what’s the probability that she has breast cancer?
Exercise 23.52 (Bayes Rule: Hit and Run Taxi) A certain town has two taxi companies: Blue Birds, whose cabs are blue, and Uber, whose cabs are black. Blue Birds has 15 taxis in its fleet, and Uber has 75. Late one night, there is a hit-and-run accident involving a taxi.
The town’s taxis were all on the streets at the time of the accident. A witness saw the accident and claims that a blue taxi was involved. The witness undergoes a vision test under conditions similar to those on the night in question. Presented repeatedly with a blue taxi and a black taxi, in random order, they successfully identify the colour of the taxi 4 times out of 5.
Which company is more likely to have been involved in the accident?
Exercise 23.53 (Gold and Silver Coins) A chest has two drawers. It is known that one drawer has \(3\) gold coins and no silver coins. The other drawer is known to contain \(1\) gold coin and \(2\) silver coins.
You don’t know which drawer is which. You randomly select a drawer and without looking inside you pull out a coin. It is gold. Show that the probability that the remaining two coins in the drawer are gold is \(75\)%.
Exercise 23.54 (The Monty Hall Problem.) This problem is named after the host of the long-running TV show, Let’s Make a Deal. A contestant is given a choice of 3 doors. There is a prize (a car, say) behind one of the doors and something worthless behind the other two doors (say two goats).
After the contestant chooses a door Monty opens one of the other two doors, revealing a goat. The contestant has the choice of switching doors. Is it advantageous to switch doors or not?
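A simulation is a quick way to check an answer to the Monty Hall problem. The sketch below assumes that Monty knows where the prize is, always opens a goat door, and picks uniformly at random when two goat doors are available. These assumptions are not spelled out in the exercise, and the function names are illustrative.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Draw one element of x; avoids sample()'s special case for length-one input.
pick_one <- function(x) x[sample.int(length(x), 1)]

play_round <- function() {
  doors <- 1:3
  prize <- pick_one(doors)
  choice <- pick_one(doors)
  # Monty opens a door that is neither the prize nor the contestant's choice.
  opened <- pick_one(setdiff(doors, c(prize, choice)))
  # Switching means taking the one remaining closed door.
  switched <- setdiff(doors, c(choice, opened))
  c(stay = choice == prize, switch = switched == prize)
}

n_rounds <- 10000
results <- replicate(n_rounds, play_round())  # 2 x n_rounds logical matrix
win_rates <- rowMeans(results)
print(win_rates)
```

Changing how `opened` is chosen (for example, letting Monty open a door at random and discarding rounds where the car is revealed) shows how strongly the answer depends on the host's behaviour.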
Exercise 23.55 (Medical Testing for HIV) A controversial issue in recent years has been the possible implementation of random drug and/or disease testing (e.g. testing medical workers for the HIV virus, which causes AIDS). In the case of HIV testing, the standard test is the Wellcome Elisa test.
The test’s effectiveness is summarized by the following two attributes:
The sensitivity is about 0.993. That is, if someone has HIV, there is a probability of 0.993 that they will test positive.
The specificity is about 0.9999. This means that if someone doesn’t have HIV, there is probability of 0.9999 that they will test negative.
In the general population, incidence of HIV is reasonably rare. It is estimated that the chance that a randomly chosen person has HIV is \(0.000025\).
To investigate the possibility of implementing a random HIV-testing policy with the Elisa test, calculate the following:
The probability that someone will test positive and have HIV.
The probability that someone will test positive and not have HIV.
The probability that someone will test positive.
Suppose someone tests positive. What is the probability that they have HIV?
In light of the last calculation, do you envision any problems in implementing a random testing policy?
Exercise 23.56 (The Three Prisoners) Of three prisoners, A, B and C, an unknown two will be shot and the other freed. Prisoner A asks the warder for the name of one other than himself who will be shot, explaining that as there must be at least one, the warder won’t really be giving anything away. The warder agrees, and says that B will be shot. This cheers A up a little: his judgmental probability for being shot is now 1/2 instead of 2/3.
Show (via Bayes theorem) that
A is mistaken - assuming that he thinks the warder is as likely to say "C" as "B" when he can honestly say either; but that
A would be right, on the hypothesis that the warder will say "B" whenever he honestly can.
Exercise 23.57 (The Two Children) You meet Max walking with a boy whom he proudly introduces as his son.
What is your probability that his other child is also a boy, if you regard him as equally likely to have taken either child for a walk?
What would the answer be if you regarded him as sure to walk with the boy rather than the girl, if he has one of each?
What would the answer be if you regarded him as sure to walk with the girl rather than the boy, if he has one of each?
Exercise 23.58 (Medical Exam) During a medical examination, one of the tests indicated a serious illness in a person. This test is highly accurate: the probability of a positive response in the presence of the disease is 99%, and the probability of a negative response in the absence of the disease is also 99%. However, the disease is quite rare and occurs in only one person per 10,000. Calculate the probability that the person being examined actually has the disease.
Exercise 23.59 (The Jury) Assume that the probability is 0.95 that a jury selected to try a criminal case will arrive at the correct verdict, whether innocent or guilty. Further, suppose that 80% of people brought to trial are in fact guilty.
Given that the jury finds a defendant innocent what’s the probability that they are in fact innocent?
Given that the jury finds a defendant guilty what’s the probability that they are in fact guilty?
Do these probabilities sum to one?
Exercise 23.60 (Oil company) An oil company has purchased an option on land in Alaska. Preliminary geologic studies have assigned the following probabilities of finding oil \[
P ( \text{ high \; quality \; oil} ) = 0.50 \; \;
P ( \text{ medium \; quality \; oil} ) = 0.20 \; \;
P ( \text{ no \; oil} ) = 0.30 \; \;
\] After 200 feet of drilling on the first well, a soil test is taken. The probabilities of finding the particular type of soil identified by the test are as follows: \[
P ( \text{ soil} \; | \; \text{ high \; quality \; oil} ) = 0.20 \; \;
P ( \text{ soil} \; | \; \text{ medium \; quality \; oil} ) = 0.80 \; \;
P ( \text{ soil} \; | \; \text{ no \; oil} ) = 0.20 \; \;
\]
What are the revised probabilities of finding the three different types of oil?
How should the firm interpret the soil test?
Exercise 23.61 A screening test for high blood pressure, corresponding to a diastolic blood pressure of \(90\)mm Hg or higher, produced the following probability table

| Test | Hypertension present | Hypertension absent |
|---|---|---|
| +ve | 0.09 | 0.03 |
| -ve | 0.03 | 0.85 |

What’s the probability that a random person has hypertension?
What’s the probability that someone tests positive on the test?
Given a person who tests positive, what is the probability that they have hypertension?
What would happen to your probability of having hypertension given you tested positive if you initially thought you had a \(50\)% chance of having hypertension?
Exercise 23.62 (Steroids) Suppose that a hypothetical baseball player (call him “Rafael”) tests positive for steroids. The test has the following sensitivity and specificity
If a player is on Steroids, there’s a \(95\)% chance of a positive result.
If a player is clean, there’s a \(10\)% chance of a positive result.
A respected baseball authority (call him “Bud”) claims that \(1\)% of all baseball players use Steroids. Another player (call him “Jose”) thinks that \(30\)% of all baseball players use Steroids.
What’s Bud’s probability that Rafael uses Steroids?
What’s Jose’s probability that Rafael uses Steroids?
Explain any probability rules that you use.
Exercise 23.63 A Breathalyzer test is calibrated so that if it is used on a driver whose blood alcohol concentration exceeds the legal limit, it will read positive \(99\)% of the time, while if the driver is below the limit it will read negative \(90\)% of the time. Suppose that based on prior experience, you have a prior probability that the driver is above the legal limit of \(10\)%.
If a driver tests positive, what is the posterior probability that they are above the legal limit?
At Christmas \(20\)% of the drivers on the road are above the legal limit. If all drivers were tested, what proportion of those testing positive would actually be above the limit?
How does your answer to part \(1\) change? Explain.
Exercise 23.64 (Chicago bearcats) The Chicago Bearcats baseball team plays \(60\)% of its games at night and \(40\)% in the daytime. They win \(55\)% of their night games and only \(35\)% of their day games. You found out the next day that they won their last game.
What is the probability that the game was played at night?
What is the marginal probability that they will win their next game?
Explain clearly any rules of probability that you use.
Exercise 23.65 (Spam Filter) Several spam filters use Bayes rule. Suppose that you empirically find the following probability table for classifying emails with the phrase “buy now” in their title as either “spam” or “not spam”.

| | Spam | Not Spam |
|---|---|---|
| “buy now” | 0.02 | 0.08 |
| not “buy now” | 0.18 | 0.72 |

What is the probability that an email you receive is spam?
Suppose that you are given a new email with the phrase “buy now” in its title. What is the probability that this new email is spam?
Explain clearly any rules of probability that you use.
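When the joint probability table is given, marginal and conditional distributions come from row and column sums. A minimal R sketch using the spam table above (the `dimnames` labels are illustrative):

```r
# Joint distribution of (title phrase, email class) from the exercise.
joint <- matrix(c(0.02, 0.08,
                  0.18, 0.72),
                nrow = 2, byrow = TRUE,
                dimnames = list(title = c("buy now", "not buy now"),
                                class = c("spam", "not spam")))

# Marginal distributions: sum the joint table over the other variable.
p_class <- colSums(joint)
p_title <- rowSums(joint)

# Conditional distribution of class given title: divide each row by its total.
p_class_given_title <- prop.table(joint, margin = 1)

print(p_class)
print(p_class_given_title["buy now", ])
```

Comparing `p_class` with the rows of `p_class_given_title` also shows at a glance whether the phrase carries any information about the class.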
Exercise 23.66 (Chicago Cubs) The Chicago Cubs are having a great season. So far they’ve won \(72\) out of the \(100\) games played. You also have the expert opinion of Bob the sports analyst. He tells you that he thinks the Cubs will win. Historically his predictions have a \(60\)% chance of coming true.
Calculate the probability that the Cubs will win given Bob’s prediction.
Suppose you now learn that it’s a home game and that the Cubs win \(60\)% of their games at Wrigley Field. What’s your updated probability that the Cubs will win their game?
Exercise 23.67 (Student-Grade Causality) Consider the following probabilistic model. The student does poorly in a class (\(c = 1\)) or well (\(c = 0\)) depending on the presence or absence of depression (\(d = 1\) or \(d = 0\)) and whether he or she partied last night (\(v = 1\) or \(v = 0\)). Partying can also give the student a headache (\(h = 1\)). As a result of the student’s poor performance, the teacher gets upset (\(t = 1\)). The probabilities are given by:

| \(p(c=1 \mid d,v)\) | v | d |
|---|---|---|
| 0.999 | 1 | 1 |
| 0.9 | 1 | 0 |
| 0.9 | 0 | 1 |
| 0.01 | 0 | 0 |

| \(p(h=1 \mid v)\) | v |
|---|---|
| 0.9 | 1 |
| 0.1 | 0 |

| \(p(t=1 \mid c)\) | c |
|---|---|
| 0.95 | 1 |
| 0.05 | 0 |

\(p(v=1)=0.2\), and \(p(d=1) = 0.4\).
Draw the causal relationships in the model. Calculate \(p(v=1|h=1)\), \(p(v=1|t=1)\), \(p(v=1|t=1,h=1)\).
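For a network this small, any conditional probability can be obtained by brute-force enumeration of the joint distribution. The R sketch below assumes the factorisation \(p(v)\,p(d)\,p(c \mid d,v)\,p(h \mid v)\,p(t \mid c)\) implied by the description. The helper names `bern`, `p_c1` and `cond_prob` are illustrative.

```r
# All 2^5 configurations of (d, v, c, h, t).
joint <- expand.grid(d = 0:1, v = 0:1, c = 0:1, h = 0:1, t = 0:1)

# Probability of a binary outcome x when P(x = 1) = p.
bern <- function(p, x) ifelse(x == 1, p, 1 - p)

# p(c = 1 | d, v) from the first table.
p_c1 <- function(d, v) {
  ifelse(v == 1 & d == 1, 0.999, ifelse(v == 1 | d == 1, 0.9, 0.01))
}

# Joint probability of each configuration under the assumed factorisation.
joint$p <- with(joint,
  bern(0.2, v) * bern(0.4, d) *
  bern(p_c1(d, v), c) *
  bern(ifelse(v == 1, 0.9, 0.1), h) *
  bern(ifelse(c == 1, 0.95, 0.05), t))

# P(event | given) by summing joint probabilities over matching rows.
cond_prob <- function(event, given) {
  sum(joint$p[event & given]) / sum(joint$p[given])
}

p_v_given_h  <- cond_prob(joint$v == 1, joint$h == 1)
p_v_given_t  <- cond_prob(joint$v == 1, joint$t == 1)
p_v_given_th <- cond_prob(joint$v == 1, joint$t == 1 & joint$h == 1)

print(c(h = p_v_given_h, t = p_v_given_t, th = p_v_given_th))
```

Enumeration scales exponentially in the number of variables, so it is only a check on hand calculations for small models like this one.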
Exercise 23.68 (Prisoner) Of three prisoners, A, B and C, an unknown two will be shot and the other freed. Prisoner A asks the warder for the name of one other than himself who will be shot, explaining that as there must be at least one, the warder won’t really be giving anything away. The warder agrees, and says that B will be shot. This cheers A up a little: his judgmental probability for being shot is now 1/2 instead of 2/3. Show (via Bayes theorem) that
A is mistaken - assuming that he thinks the warder is as likely to say "C" as "B" when he can honestly say either; but that
A would be right, on the hypothesis that the warder will say "B" whenever he honestly can.
Exercise 23.69 (True/False)
In a sample of \(100,000\) emails you found that \(550\) are spam. Your next email contains the word “bigger”. From historical experience, you know that half of all spam email contains the word “bigger” and only \(2\)% of non-spam emails contain it. The probability that this new email is spam is approximately \(12\)%.
Suppose that there’s a \(5\)% chance that it snows tomorrow and an \(80\)% chance that the Chicago Bears play their football game tomorrow given that it snows. The probability that they play tomorrow is then \(80\)%.
Bayes’ rule states that \(p(A|B) =p(B|A)\).
If \(P( A \text{ and }B ) = 0.4\) and \(P( B) = 0.8\), then \(P( A|B ) = 0.5\).
23.3 Utility and Decisions
Exercise 23.70 (Two Gambles) In an experiment, subjects were given the choice between two gambles:
Experiment 1

| Gamble \({\cal G}_A\): Win | Chance | Gamble \({\cal G}_B\): Win | Chance |
|---|---|---|---|
| $2500 | 0.33 | $2400 | 1 |
| $2400 | 0.66 | | |
| $0 | 0.01 | | |

Suppose that a person is an expected utility maximizer. Set the utility scale so that u($0) = 0 and u($2500) = 1. Whether a utility maximizing person would choose Option A or Option B depends on the person’s utility for $2400. For what values of u($2400) would a rational person choose Option A? For what values would a rational person choose Option B?
Experiment 2

| Gamble \({\cal G}_C\): Win | Chance | Gamble \({\cal G}_D\): Win | Chance |
|---|---|---|---|
| $2500 | 0.33 | $2400 | 0.34 |
| $0 | 0.67 | $0 | 0.66 |

For what values of u($2400) would a person choose Option C? For what values would a person choose Option D? Explain why no expected utility maximizer would prefer B and C.
This problem is a version of the famous Allais paradox, named after the prominent critic of subjective expected utility theory who first presented it. Kahneman and Tversky found that 82% of subjects preferred B over A, and 83% preferred C over D. Explain why no expected utility maximizer would prefer both B in Gamble 1 and C in Gamble 2. (A utility maximizer might prefer B in Gamble 1. A different utility maximizer might prefer C in Gamble 2. But the same utility maximizer would not prefer both B in Gamble 1 and C in Gamble 2.)
Discuss these results. Why do you think many people prefer B in Gamble 1 and C in Gamble 2? Do you think this is reasonable even if it does not conform to expected utility theory?
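The expected utilities in both experiments depend on the single unknown u($2400), so the preference regions can be explored numerically. The following R sketch scans a hypothetical grid of candidate values. The grid resolution and the function name `expected_utility` are assumptions made for illustration.

```r
# Expected utility of a gamble given outcome utilities and probabilities.
expected_utility <- function(utilities, probs) sum(utilities * probs)

# Hypothetical grid of candidate values for u($2400);
# the scale is fixed by u($0) = 0 and u($2500) = 1.
u2400 <- seq(0, 1, by = 0.001)

eu_A <- sapply(u2400, function(u) expected_utility(c(1, u, 0), c(0.33, 0.66, 0.01)))
eu_B <- u2400  # $2400 for certain
eu_C <- rep(expected_utility(c(1, 0), c(0.33, 0.67)), length(u2400))
eu_D <- sapply(u2400, function(u) expected_utility(c(u, 0), c(0.34, 0.66)))

prefers_B <- eu_B > eu_A
prefers_C <- eu_C > eu_D

# Range of grid values at which each option is strictly preferred,
# and the number of values at which both B and C are preferred.
print(range(u2400[prefers_B]))
print(range(u2400[prefers_C]))
print(sum(prefers_B & prefers_C))
```

A grid search only suggests where the boundary lies. The algebraic argument asked for in the exercise is what shows that the two preference regions cannot overlap.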
Exercise 23.71 (Decisions) You are sponsoring a fund raising dinner for your favorite political candidate. There is uncertainty about the number of people who will attend (the random variable \(X\)), but based on past dinners, you think that the probability function looks like this:

| \(x\) | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| \(P_X(x)\) | 0.1 | 0.2 | 0.3 | 0.2 | 0.2 |

Calculate \(E(X)\), the expected number of people who will attend.
The owner of the venue is going to charge you $1500 for rental and other miscellaneous costs. You know that you will make a profit (after per person costs) of $40 for each person attending. Calculate the expected profit after the rental cost.
The owner of the venue proposes an alternative pricing scheme. Instead of charging $1500, she will charge you either $5 per person or $2100, whichever is smaller. So if 100 people come, you only pay $500. If 500 come, you pay $2100. Calculate the expected profit under this scheme (still assuming $40 per plate profit before you pay the owner).
Let \(Y_1\) be your profit under the first scheme and \(Y_2\) be your profit under the second. If you do the calculations, it turns out that the standard deviations of these profits are: \[\sigma_{Y_1} = 4996 \qquad \sigma_{Y_2} = 4488\] Using the expected values calculated above explain which of the two scenarios you prefer.
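The expectations and standard deviations in this exercise are weighted sums over the five attendance values. A minimal R sketch, assuming profit is $40 per person minus the venue charge under each scheme (helper names are illustrative):

```r
x <- c(100, 200, 300, 400, 500)   # possible attendance
p <- c(0.1, 0.2, 0.3, 0.2, 0.2)   # P_X(x)

# Mean and standard deviation of a discrete random variable.
discrete_mean <- function(values, probs) sum(values * probs)
discrete_sd <- function(values, probs) {
  m <- discrete_mean(values, probs)
  sqrt(sum((values - m)^2 * probs))
}

# Profit under each pricing scheme as a function of attendance.
profit_1 <- 40 * x - 1500               # flat $1500 charge
profit_2 <- 40 * x - pmin(5 * x, 2100)  # $5 per person, capped at $2100

summary_table <- data.frame(
  scheme = c("flat fee", "per-person fee with cap"),
  mean = c(discrete_mean(profit_1, p), discrete_mean(profit_2, p)),
  sd = c(discrete_sd(profit_1, p), discrete_sd(profit_2, p))
)

print(discrete_mean(x, p))
print(summary_table)
```

The `sd` column can be compared with the values of \(\sigma_{Y_1}\) and \(\sigma_{Y_2}\) quoted in the last part.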
Exercise 23.72 (Marjorie Visit) Marjorie is worried about whether it is safe to visit a vulnerable relative during a pandemic. She is considering whether to take an at-home test for the virus before visiting her relative. Assume the test has sensitivity 85% and specificity 92%. That is, the probability that the test will be positive is about 85% if an individual is infected with the virus, and the probability that the test will be negative is about 92% if an individual is not infected.
Further, assume the following losses for Marjorie

| Event | Loss |
|---|---|
| Visit relative, not infected | 0 |
| Visit relative, infected | 100 |
| Do not visit relative, not infected | 1 |
| Do not visit relative, infected | 5 |

Assume that about 2 in every 1,000 persons in the population is currently infected. What is the posterior probability that an individual with a positive test has the disease?
Suppose case counts have decreased substantially to about 15 in 100,000. What is the posterior probability that an individual with a positive test has the disease?
Suppose Marjorie is deciding whether to visit her relative and if so whether to test for the disease before visiting. If the prior probability that Marjorie has the disease is 200 in 100,000, find the policy that minimizes expected loss. That is, given each of the possible test results, should Marjorie visit her relative? Find the EVSI. Repeat for a prior probability of 15 in 100,000. Discuss.
For the decision of whether Marjorie should visit her relative, find the range of prior probabilities for which taking the at-home test results in lower expected loss than ignoring or not taking the test (assuming the test is free). Discuss your results.
Exercise 23.73 (True/False Variance)
If the sample covariance between two variables is one, then there must be a strong linear relationship between the variables.
If the sample covariance between two variables is zero, then the variables are independent.
If \(X\) and \(Y\) are independent random variables, then \(Var(2X-Y)= 2 Var(X)-Var(Y)\).
The sample variance is unaffected by outlying observations.
Suppose that a random variable \(X\) can take the values \(\{0,1,2\}\) all with equal probability. Then the expected value and variance of \(X\) are both \(1\).
The maximum correlation is \(1\) and the minimum is \(0\).
For independent random variables \(X\) and \(Y\), we have \(var(X-Y)=var(X)-var(Y)\).
If the correlation between \(X\) and \(Y\) is zero then the standard deviation of \(X+Y\) is the square root of the sum of the standard deviations of \(X\) and \(Y\).
It is always true that the standard deviation is less than the variance.
If the correlation between \(X\) and \(Y\) is \(r = - 0.81\) and if the standard deviations are \(s_X = 20\) and \(s_Y = 25\), respectively, then the covariance is \(Cov (X, Y) = - 401\).
If we drop the largest observation from a sample, then the sample mean and variance will both be reduced.
Suppose \(X\) and \(Y\) are independent random variables and \(Var(X) = 6\) and \(Var(Y) = 6\). Then \(Var(X+Y) = Var(2X)\).
Let investment \(X\) have mean return 5% and a standard deviation of 5% and investment \(Y\) have a mean return of 10% with a standard deviation of 6%. Suppose that the correlation between returns is zero. Then I can find a portfolio with higher mean and lower variance than \(X\).
Exercise 23.74 (True/False Expectation)
LeBron James makes \(85\)% of his free throw attempts and \(50\)% of his regular shots from the field (field goals). Suppose that each shot is independent of the others. He takes \(20\) field goals and \(10\) free throws in a typical game. He gets one point for each free throw and two points for each field goal assuming no 3-point shots. The number of points he expects to score in a game is 28.5.
Suppose that you have a one in a hundred chance of hitting the jackpot on a slot machine. If you play the machine \(100\) times then you are certain to win.
The expected value of the sample mean is the population mean, that is \(E \left ( \bar{X} \right ) = \mu\).
The expectation of \(X\) minus \(2Y\) is just the expectation of \(X\) minus twice the expectation of \(Y\), that is \(E (X-2Y)= E(X) - 2E (Y)\).
A firm believes it has a 50-50 chance of winning an $80,000 contract if it spends $5,000 on a proposal. If the firm spends twice this amount, it feels its chances of winning improve to 60%. If the firm wants to maximize its expected value then it should spend $10,000 to try and gain the contract.
\(E(X+Y)=E(X)+E(Y)\) only if the random variables \(X\) and \(Y\) are independent.
23.4 Bayesian Parameter Learning
Exercise 23.75 (Beta-Binomial for Allais gambles) We’ve collected data on people’s preferences in the two Allais gambles. For this problem, we will assume that responses are independent and identically distributed, and the probability is \(\theta\) that a person chooses both B in the first gamble and C in the second gamble.
Assume that the prior distribution for \(\theta\) is Beta(1, 3). Find the prior mean and standard deviation for \(\theta\). Find a 95% symmetric tail area credible interval for the prior probability that a person would choose B and C. Do you think this is a reasonable prior distribution to use for this problem? Why or why not?
In 2009, 19 out of 47 respondents chose B and C. Find the posterior distribution for the probability \(\theta\) that a person in this population would choose B and C.
Find the posterior mean and standard deviation. Find a 95% symmetric tail area credible interval for \(\theta\).
Make a triplot of the prior distribution, normalized likelihood, and posterior distribution.
Comment on your results.
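Beta-Binomial conjugacy makes the update in this exercise a matter of adding counts to the prior parameters. The R sketch below summarises prior and posterior and draws a basic triplot. It assumes the standard result that a Beta(\(a\), \(b\)) prior with \(k\) successes in \(n\) trials gives a Beta(\(a+k\), \(b+n-k\)) posterior, and that the normalised likelihood is a Beta(\(k+1\), \(n-k+1\)) density. The helper name `beta_summary` is illustrative.

```r
a_prior <- 1
b_prior <- 3
chose_bc <- 19   # respondents choosing both B and C
n <- 47          # total respondents

# Mean, standard deviation and 95% equal-tail interval of a Beta(a, b).
beta_summary <- function(a, b) {
  c(mean = a / (a + b),
    sd = sqrt(a * b / ((a + b)^2 * (a + b + 1))),
    lower = qbeta(0.025, a, b),
    upper = qbeta(0.975, a, b))
}

# Conjugate update: add successes to a and failures to b.
a_post <- a_prior + chose_bc
b_post <- b_prior + (n - chose_bc)

print(rbind(prior = beta_summary(a_prior, b_prior),
            posterior = beta_summary(a_post, b_post)))

# Triplot: posterior, normalised likelihood and prior on a common axis.
theta <- seq(0, 1, length.out = 501)
plot(theta, dbeta(theta, a_post, b_post), type = "l",
     xlab = "theta", ylab = "density")
lines(theta, dbeta(theta, chose_bc + 1, n - chose_bc + 1), lty = 2)
lines(theta, dbeta(theta, a_prior, b_prior), lty = 3)
legend("topright", lty = 1:3,
       legend = c("posterior", "normalised likelihood", "prior"))
```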
Exercise 23.76 (Poisson for Car Counts) Times were recorded at which vehicles passed a fixed point on the M1 motorway in Bedfordshire, England on March 23, 1985. The total time was broken into 21 intervals of length 15 seconds. The number of cars passing in each interval was counted. The result was:
This can be summarized in the following table, that shows 3 intervals with zero cars, 5 intervals with 1 car, 7 intervals with 2 cars, 3 intervals with 3 cars and 3 intervals with 4 cars.
table(cnt)
## cnt
## 0 1 2 3 4
## 3 5 7 3 3
Do you think a Poisson distribution provides a good model for the count data? Justify your answer.
Assume that \(\Lambda\), the rate parameter of the Poisson distribution for counts (and the inverse of the mean of the exponential distribution for inter arrival times), has a discrete uniform prior distribution on 20 equally spaced values between \((0.2, 0.4,\ldots 3.8, 4.0)\) cars per 15-second interval. Find the posterior distribution of \(\Lambda\).
Find the posterior mean and standard deviation of \(\Lambda\).
Discuss what your results mean in terms of traffic on this motorway.
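With a discrete prior, the posterior is the prior times the likelihood evaluated on the grid, renormalised to sum to one. The R sketch below rebuilds a count vector from the frequency table (the original ordering of the 21 intervals is not available here, and does not matter for the likelihood) and applies the uniform prior on the 20 grid values. Variable names other than `cnt` are illustrative.

```r
# Rebuild the counts from the frequency table: 3 zeros, 5 ones, 7 twos, ...
cnt <- rep(0:4, times = c(3, 5, 7, 3, 3))

# Discrete uniform prior on 20 equally spaced rates.
lambda <- seq(0.2, 4.0, by = 0.2)
prior <- rep(1 / length(lambda), length(lambda))

# Poisson log-likelihood of the whole sample at each grid value.
log_lik <- sapply(lambda, function(l) sum(dpois(cnt, lambda = l, log = TRUE)))

# Posterior is proportional to prior times likelihood; subtracting the
# maximum log-likelihood before exponentiating avoids underflow.
unnormalised <- prior * exp(log_lik - max(log_lik))
posterior <- unnormalised / sum(unnormalised)

post_mean <- sum(lambda * posterior)
post_sd <- sqrt(sum((lambda - post_mean)^2 * posterior))

print(round(data.frame(lambda, posterior), 4))
print(c(mean = post_mean, sd = post_sd))
```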
Exercise 23.77 (Car Count Part 2) This problem continues analysis of the automobile traffic data. As before, assume that counts of cars per 15-second interval are independent and identically distributed Poisson random variables with unknown mean \(\Lambda\).
Assume that \(\Lambda\), the rate parameter of the Poisson distribution for counts, has a continuous gamma prior distribution with shape 1 and scale 10e6. (The gamma distribution with shape 1 tends to a uniform distribution as the scale tends to \(\infty\), so this prior distribution is “almost” uniform.) Find the posterior distribution of \(\Lambda\). State the distribution type and hyperparameters.
Find the posterior mean and standard deviation of \(\Lambda\). Compare your results to Part I. Discuss.
Find a 95% symmetric tail area posterior credible interval for \(\Lambda\). Find a 95% symmetric tail area posterior credible interval for \(\theta\), the mean time between vehicle arrivals.
Find the predictive distribution for the number of cars passing in the next minute. Name the family of distributions and the parameters of the predictive distribution. Find the mean and standard deviation of the predictive distribution. Find the probability that more than 10 cars will pass in the next minute. (Hint: one minute is four 15-second time intervals.)
Exercise 23.78 (Lung disease) Chronic obstructive pulmonary disease (COPD) is a common lung disease characterized by difficulty in breathing. A substantial proportion of COPD patients admitted to emergency medical facilities are released as outpatients. A randomized, double-blind, placebo-controlled study examined the incidence of relapse in COPD patients released as outpatients as a function of whether the patients received treatment with corticosteroids. A total of 147 patients were enrolled in the study and were randomly assigned to treatment or placebo group on discharge from an emergency facility. Seven patients were lost from the study prior to follow-up. For the remaining 140 patients, the table below summarizes the primary outcome of the study, relapse within 30 days of discharge.

| | Relapse | No Relapse | Total |
|---|---|---|---|
| Treatment | 19 | 51 | 70 |
| Placebo | 30 | 40 | 70 |
| Total | 49 | 91 | 140 |

Let \(Y_1\) and \(Y_2\) be the number of patients who relapse in the treatment and placebo groups, respectively. Assume \(Y_1\) and \(Y_2\) are independent Binomial(70, \(\theta_i\)) random variables, for \(i=1,2\). Assume \(\theta_1\) and \(\theta_2\) have independent Beta prior distributions with shape parameters 1/2 and 1/2 (this is the Jeffreys prior distribution). Find the joint posterior distribution for \(\theta_1\) and \(\theta_2\). Name the distribution type and its hyperparameters.
Generate 5000 random pairs \((\theta_{1k}, \theta_{2k})\), \(k=1,\ldots,5000\), from the joint posterior distribution. Use this random sample to estimate the posterior probability that the rate of relapse is lower for treatment than for placebo. Discuss your results.
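Because the two groups are independent and each Beta prior is conjugate to its Binomial likelihood, the joint posterior can be sampled by drawing each \(\theta_i\) separately. A minimal R sketch of the Monte Carlo step, assuming the Beta(1/2 + relapses, 1/2 + non-relapses) posteriors that follow from conjugacy (the seed and variable names are arbitrary):

```r
set.seed(2024)  # arbitrary seed, for reproducibility only
n_draws <- 5000

relapse <- c(treatment = 19, placebo = 30)
no_relapse <- c(treatment = 51, placebo = 40)

# Jeffreys Beta(1/2, 1/2) prior updated with the observed counts.
a_post <- 0.5 + relapse
b_post <- 0.5 + no_relapse

# Independent posteriors, so sample each rate separately.
theta_1 <- rbeta(n_draws, a_post[["treatment"]], b_post[["treatment"]])
theta_2 <- rbeta(n_draws, a_post[["placebo"]], b_post[["placebo"]])

# Monte Carlo estimate of P(theta_1 < theta_2 | data).
p_lower <- mean(theta_1 < theta_2)
print(p_lower)

# Posterior summary of the difference in relapse rates.
print(quantile(theta_1 - theta_2, c(0.025, 0.5, 0.975)))
```

With 5000 draws the Monte Carlo standard error of `p_lower` is at most about 0.007, so the estimate is stable to roughly two decimal places.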
Exercise 23.79 (Normal-Normal) Concentrations of the pollutants aldrin and hexachlorobenzene (HCB) in nanograms per liter were measured in ten surface water samples, ten mid-depth water samples, and ten bottom samples from the Wolf River in Tennessee. The samples were taken downstream from an abandoned dump site previously used by the pesticide industry. The full data set can be found at http://www.biostat.umn.edu/~lynn/iid/wolf.river.dat. For this problem, we consider only HCB measurements taken at the bottom and the surface. The question of interest is whether the distribution of HCB concentration depends on the depth at which the measurement was taken. The data for this problem are given below.

| Surface | Bottom |
|---|---|
| 3.74 | 5.44 |
| 4.61 | 6.88 |
| 4.00 | 5.37 |
| 4.67 | 5.44 |
| 4.87 | 5.03 |
| 5.12 | 6.48 |
| 4.52 | 3.89 |
| 5.29 | 5.85 |
| 5.74 | 6.85 |
| 5.48 | 7.16 |

Assume the observations are independent normal random variables with unknown depth-specific means \(\theta_s\) and \(\theta_b\) and precisions \(\rho_s = 1/\sigma^2_s\) and \(\rho_b = 1/\sigma_b^2\). Assume independent improper reference priors for the surface and bottom parameters: \[
g(\theta_s,\theta_b ,\rho_s,\rho_b ) = g(\theta_s,\rho_s)g(\theta_b ,\rho_b) \propto \rho_s^{-1}\rho_b^{-1}.
\]
This prior can be treated as the product of two normal-gamma priors with \(\mu_s = \mu_b = 0\), \(\sigma_s \rightarrow 0\) and \(\sigma_b \rightarrow 0\), \(a_s = a_b = -1/2\), and \(b_s = b_b \rightarrow 0\). (These are not valid normal-gamma distributions, but you can use the usual Bayesian conjugate updating rule to find the posterior distribution.) Find the joint posterior distribution for the parameters \((\theta_s,\theta_b,\rho_s,\rho_b)\). Find 90% posterior credible intervals for \((\theta_s,\theta_b,\rho_s,\rho_b)\). Comment on your results.
Use direct Monte Carlo to sample 10,000 observations from the joint posterior distribution of \((\theta_s,\theta_b,\rho_s,\rho_b)\). Use your Monte Carlo samples to estimate 90% posterior credible intervals for all four parameters. Compare with the result of part a.
Use your direct Monte Carlo sample to estimate the probability that the mean bottom concentration \(\theta_b\) is higher than the mean surface concentration \(\theta_s\) and to estimate the probability that the standard deviation \(\sigma_b\) of the bottom concentrations is higher than the standard deviation \(\sigma_s\) of the surface concentrations.
Comment on your analysis. What are your conclusions about the distributions of surface and bottom concentrations? Is the assumption of normality reasonable? Are the means different for surface and bottom? The standard deviations?
Find the predictive distribution for the sample mean of a future sample of size 10 from the surface and a future sample of size 10 from the bottom. Find 95% credible intervals on the sample mean of each future sample. Repeat for future samples of size 40. Compare your results and discuss.
Use direct Monte Carlo to estimate the predictive distribution for the difference in the two sample means for 10 future surface and bottom samples. Plot a kernel density estimator for the density function for the difference in means. Find a 95% credible interval for the difference in the two sample means. Repeat for future samples of 40 surface and 40 bottom observations. Comment on your results.
Repeat part e, but use a model in which the standard deviation is known and equal to the sample standard deviation, and the depth-specific means \(\theta_s\) and \(\theta_b\) have a uniform prior distribution. Compare the 95% credible intervals for the future sample means for the known and unknown standard deviation models. Discuss.
Assume that experts have provided the following prior information based on previous studies.
The unknown means \(\theta_s\) and \(\theta_b\) are independent and normally distributed with mean \(\mu\) and standard deviation \(\tau\). The unknown precisions \(\rho_s\) and \(\rho_b\) are independent of \(\theta_s\) and \(\theta_b\) and have gamma distributions with shape \(a\) and scale \(b\).
Experts specified a 95% prior credible interval of [3, 9] for \(\theta_s\) and \(\theta_b\). A good fit to this credible interval is obtained by setting the prior mean to \(\mu =6\) and the prior standard deviation to \(\tau=1.5\).
A 95% prior credible interval of [0.75, 2.0] is given for the unknown standard deviations \(\Sigma_s\) and \(\Sigma_b\). This translates to a credible interval of [0.25, 1.8] for \(\rho_s = \Sigma_s^{-2}\) and \(\rho_b = \Sigma_b^{-2}\). A good fit to this credible interval is obtained by setting the prior shape to \(a = 4.5\) and the prior scale to \(b = 0.19\).
Find the following conditional distributions: \(p(\theta_s \mid D,\theta_b,\rho_s,\rho_b)\), \(p(\theta_b \mid D,\theta_s,\rho_s,\rho_b)\), \(p(\rho_s \mid D,\theta_s,\theta_b,\rho_b)\), \(p(\rho_b \mid D, \theta_s,\theta_b,\rho_s)\).
Using the distributions you found, draw 10,000 Gibbs samples of \((\theta_s,\theta_b,\rho_s,\rho_b)\). Estimate 90% credible intervals for \((\theta_s,\theta_b,\rho_s^{-1/2},\rho_b^{-1/2})\) and \(\theta_b-\theta_s\).
Do a traceplot of \(\theta_b-\theta_s\). Find the autocorrelation function of \(\theta_b-\theta_s\) and the effective sample size for your Monte Carlo sample for \(\theta_b-\theta_s\).
Comment on your results. Compare with parts a, b, and c.
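For the direct Monte Carlo parts of Exercise 23.79, the sampling step can be sketched as follows. This is a minimal sketch, not a full solution. It assumes the data table alternates Surface and Bottom values row by row, and it uses the standard reference-prior result that \(\rho \mid D \sim \text{Gamma}((n-1)/2,\ \text{rate} = \sum_i (x_i - \bar{x})^2/2)\) and \(\theta \mid \rho, D \sim N(\bar{x}, 1/(n\rho))\). The helper name `sample_posterior` is hypothetical.

```r
# HCB concentrations (ng/L), assuming the table alternates Surface / Bottom
surface <- c(3.74, 4.61, 4.00, 4.67, 4.87, 5.12, 4.52, 5.29, 5.74, 5.48)
bottom  <- c(5.44, 6.88, 5.37, 5.44, 5.03, 6.48, 3.89, 5.85, 6.85, 7.16)

# Hypothetical helper: direct draws from the posterior of (theta, rho)
# under the improper reference prior g(theta, rho) proportional to 1/rho
sample_posterior <- function(x, n_draws) {
  n <- length(x)
  xbar <- mean(x)
  ss <- sum((x - xbar)^2)
  # Marginal posterior of the precision
  rho <- rgamma(n_draws, shape = (n - 1) / 2, rate = ss / 2)
  # Conditional posterior of the mean given the precision
  theta <- rnorm(n_draws, mean = xbar, sd = 1 / sqrt(n * rho))
  data.frame(theta = theta, rho = rho)
}

set.seed(1)
post_s <- sample_posterior(surface, 10000)
post_b <- sample_posterior(bottom, 10000)

# 90% credible interval for the surface mean
ci_theta_s <- quantile(post_s$theta, c(0.05, 0.95))
# Monte Carlo estimate of P(theta_b > theta_s | D)
p_bottom_higher <- mean(post_b$theta > post_s$theta)
```

Because the surface and bottom parameters are independent a posteriori under this prior, the two sets of draws can be generated separately and compared draw by draw.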
Exercise 23.80 (Gibbs: Bird feeders) A biologist counts the number of sparrows visiting six bird feeders placed on a given day.
| Feeder | Number of Birds |
|---|---|
| 1 | 11 |
| 2 | 22 |
| 3 | 13 |
| 4 | 24 |
| 5 | 19 |
| 6 | 16 |
Assume that the bird counts are independent Poisson random variables with feeder-dependent means \(\lambda_i\), for \(i=1,\ldots,6\).
Assume that the means \(\lambda_i\) are independent and identically distributed gamma random variables with shape a and scale b (or equivalently, shape a and mean m = ab).
The mean m = ab of the gamma distribution is uniformly distributed on a grid of 200 equally spaced values starting at 5 and ending at 40.
The shape a is independent of the mean m and has a distribution that takes values on a grid of 200 equally spaced points starting at 1 and ending at 50, with prior probabilities proportional to a gamma density with shape 1 and scale 5.
Use Gibbs sampling to draw 10000 samples from the joint posterior distribution of the mean m, the shape parameter a, and the six mean parameters \(\lambda_i\), \(i=1,\ldots,6\), conditional on the observed bird counts. Using your sample, calculate 95% credible intervals for the mean m, the shape a, and the six mean parameters \(\lambda_i\), \(i=1,\ldots,6\).
Find the effective sample size for the Monte Carlo samples of the mean m, the shape parameter a, and the six mean parameters \(\lambda_i\), \(i=1,\ldots,6\).
Do traceplots for the mean m, the shape parameter a, and the six rate parameters \(\lambda_i\), \(i=1,\ldots,6\).
The fourth feeder had the highest bird count and the first feeder had the lowest bird count. Use your Monte Carlo sample to estimate the posterior probability that the first feeder has a smaller mean bird count than the fourth feeder. Explain how you obtained your estimate.
Discuss your results.
Exercise 23.81 (Poisson Distribution for EPL) We will analyze EPL data. Use epl.R as a starting script. This script uses the football-data.org API to download the results for two EPL teams: Manchester United and Chelsea. Model the GoalsHome and GoalsAway for each of the teams using a Poisson distribution. Given these distributions, calculate the following:
What’s the probability of a nil-nil (0 - 0) draw?
What’s the probability that MU wins the match?
Discuss how you could improve your model based on four Poisson distributions.
Use R and random number simulation to provide empirical answers. Using a random sample of size N = 10,000, check and see how close you get to the theoretical answers you’ve found to the questions posed above. Provide histograms of the distributions you simulate.
Hint: The difference of two Poisson random variables follows a Skellam distribution, implemented in the skellam package. You can use dskellam to calculate the probability of a draw. You can use rpois to simulate draws from a Poisson distribution.
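The comparison between theoretical and simulated answers could be sketched as below. The two scoring rates are hypothetical placeholders, not values estimated from the football-data.org results, and the sketch simplifies the exercise to one Poisson rate per team with independent scores. It uses only base R, so the draw probability is computed by summing over equal scores rather than through the skellam package.

```r
# Hypothetical scoring rates (assumptions, NOT estimated from real data)
lambda_mu <- 1.6   # Manchester United goals per game
lambda_ch <- 1.2   # Chelsea goals per game

# Theoretical answers under independent Poisson goal counts
p_nil_nil <- dpois(0, lambda_mu) * dpois(0, lambda_ch)
k <- 0:50  # truncation point; the Poisson tail beyond 50 is negligible here
p_draw <- sum(dpois(k, lambda_mu) * dpois(k, lambda_ch))
# P(MU wins) = sum over Chelsea scores j of P(CH = j) * P(MU > j)
p_mu_win <- sum(dpois(k, lambda_ch) * ppois(k, lambda_mu, lower.tail = FALSE))

# Simulation check with N = 10,000 matches
set.seed(42)
N <- 10000
mu_goals <- rpois(N, lambda_mu)
ch_goals <- rpois(N, lambda_ch)
sim_nil_nil <- mean(mu_goals == 0 & ch_goals == 0)
sim_mu_win <- mean(mu_goals > ch_goals)

# Histogram of the simulated goal difference
hist(mu_goals - ch_goals, breaks = seq(-10.5, 15.5, by = 1),
     main = "Simulated goal difference (MU - Chelsea)", xlab = "Goal difference")
```

With rates estimated from the downloaded results in place of the placeholders, the simulated proportions should land close to the theoretical values.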
Exercise 23.82 (Homicide Rate (Poisson Distribution)) Suppose that there are \(1.5\) homicides a week. This is the rate, so \(\lambda=1.5\). The Poisson distribution tells us that there is still a \(1.4\)% chance of seeing \(5\) homicides in a week: \[
p( X= 5 ) = \frac{e^{ - 1.5 } ( 1.5 )^5 }{5!} = 0.014
\] On average this will happen once every \(71\) weeks, or roughly once every 16 months.
What’s the chance of having zero homicides in a week?
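The arithmetic quoted in the exercise statement can be checked with a few lines of R. This is a small sketch that reproduces only the \(P(X = 5)\) calculation given above; the variable names are illustrative.

```r
lambda <- 1.5

# P(X = 5) from the built-in Poisson pmf
p5 <- dpois(5, lambda)
# The same probability written out from the formula
p5_manual <- exp(-lambda) * lambda^5 / factorial(5)

# Expected number of weeks between weeks with exactly 5 homicides
weeks_between <- 1 / p5
```

The same `dpois` call with a different first argument gives the probability of any other weekly count.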
The binomial distribution is a discrete probability distribution.
Assuming Joe DiMaggio’s batting average is \(0.325\) per at-bat and his hits are independent, he has a probability of about \(12\)% of getting more than \(2\) hits in \(4\) at-bats.
Suppose that you toss a fair coin with probability \(0.5\) of a head. The probability of getting five heads in a row is less than three percent.
Suppose that you toss a biased coin with probability \(0.25\) of getting a head. The probability of getting five heads out of ten tosses is less than thirty percent.
Suppose that you toss a coin \(5\) times. Then there are \(10\) ways of getting \(3\) heads.
The probability of observing three heads out of five tosses of a fair coin is \(0.6\).
A mortgage bank knows from experience that \(2\)% of residential loans will go into default. Suppose it makes \(10\) such loans, then the probability that at least one goes into default is \(95\)%.
Jessica Simpson is not a professional bowler and \(40\)% of her bowling swings are gutter balls. She is planning to take \(90\) bowling swings. The mean and standard deviation of the number of gutter balls are \(\mu = 36\) and \(\sigma = 3.65\).
The probability of at least one head when tossing a fair coin \(4\) times is \(0.9375\).
The Red Sox are to play the Yankees in a seven game series. Assume that the Red Sox have a 50% chance of winning each game, with the results being independent of each other. Then the probability of the series ending 4-3 in favor of the Red Sox is \(0.5^{7}=0.0078\).
Suppose that \(X\) is Binomially distributed with \(E(X)=5\) and \(Var(X)=2\), then \(n=10\) and \(p=0.5\).
If \(X\) is a Bernoulli random variable with probability of success, \(p\), then its variance is \(V(X)=p(1-p)\).
Historically 15% of chips manufactured by a computer company are defective. The probability of a random sample of 10 chips containing exactly one defect is 0.15.
If \(X \sim Poi (2)\) and \(Y \sim Poi (3)\), then \(X+Y \sim Poi (6)\).
Arsenal are playing Burnley at home in an English Premier League (EPL) game this weekend. They are favorites to win. They have a Poisson distribution for the number of goals they will score with a mean rate of \(2.5\) per game. Given this, the odds of Arsenal scoring at least two goals is greater than \(50\)%.
Arsenal are playing Swansea tomorrow in an English Premier League (EPL) game. They are favorites to win. The number of goals they expect to score is Poisson with a mean rate of \(2.2\). Given this, the odds of Arsenal scoring at least one goal is greater than \(60\%\).
Arsenal are playing Liverpool at home in an EPL game this weekend. You think that the number of goals to be scored by both teams follow a Poisson distribution with rates \(2.2\) and \(1.6\) respectively. Given this, the odds of a scoreless \(0-0\) draw are \(45-1\).
Suppose your website gets on average \(2\) hits per hour. Then the probability of at least one hit in the next hour is \(0.135\).
The soccer team Manchester United scores on average two goals per game. Given that the distribution of goals is Poisson, the chance that they score two or fewer goals is \(87\)%.
Exercise 23.85 (True/False (Normal Distribution))
The returns for Google stock on the day of earnings are normally distributed with a mean of \(5\)% and a standard deviation of \(5\)%. The probability that you will make money on the day of earnings is approximately \(60\)%.
For any normal random variable, \(X\), we have \(\mathbb{P} \left ( \mu - \sigma < X < \mu + \sigma \right ) = 0.64\). Hint: You may use \(pnorm(1) = 0.841\)
Suppose that the annual returns for Facebook stock are normally distributed with a mean of \(15\)% and a standard deviation of \(20\)%. The probability that Facebook has returns greater than \(10\)% for next year is \(60\)%.
Consider the standard normal random variable \(Z \sim N ( 0 , 1 )\). Then the random variable \(-Z\) is also standard normal.
A local bank experiences a \(2\)% default rate on residential loans made in a certain city. Suppose that the bank makes \(2000\) loans. Then the probability of more than \(50\) defaults is \(25\) percent.
The Binomial distribution can be approximated by a normal distribution when the number of trials is large.
Let \(X \sim N(5, 10)\). Then \(P \left (X>5 \right ) = \frac{1}{2}\).
Shaquille O’Neal has a \(55\)% chance of making a free throw in basketball. Suppose he has \(900\) free throws this year. Then the chance he makes more than \(500\) free throws is \(45\)%.
Suppose that the random variable \(X \sim N ( -2 , 4 )\) then \(- 2 X \sim N ( 4 , 16 )\).
Mortimer’s steak house advertises that it is the home of the \(16\) ounce steak. They claim that the weight of their steaks is normally distributed with mean \(16\) and standard deviation \(2\). If this is so, then the probability that a steak weighs less than \(14\) ounces is \(16\)%.
Advertising costs for a \(30\)-second commercial are assumed to be normally distributed with a mean of \(10,000\) and a standard deviation of \(1000\). Then the probability that a given commercial costs between \(9000\) and \(10,000\) is \(50\)%.
In a sample of \(120\) Zagat’s ratings of Chicago restaurants, the average restaurant had a rating of \(19.6\) with a standard deviation of \(2.5\). If you randomly pick a restaurant, the chance that you pick one with a rating over \(25\) is less than \(1\)%.
A hospital finds that \(20\)% of its bills are at least one month in arrears. A random sample of \(50\) bills was taken. Then the probability that less than \(10\) bills in the sample were at least one month in arrears is \(50\)%.
A Chicago radio station believes \(30\)% of its listeners are younger than \(30\). Out of a sample of \(500\) they find that \(250\) are younger than \(30\). This data supports their claim at the \(1\)% level.
The probability that a standard normal random variable is more than \(1.96\) standard deviations from the mean is \(0.05\).
Suppose that the amount of money spent at Disney World is normally distributed with a mean of $60 and a standard deviation of $15. Then approximately 45% of people spend more than $70 per visit.
A Normal distribution with mean 4 and standard deviation 3.6 will provide a good approximation to a Binomial random variable with parameters \(n=40\) and \(p= 0.10\).
If \(X\) is normally distributed with mean \(3\) and variance \(9\) then the probability that \(X\) is greater than \(1\) is \(0.254\).
Exercise 23.86 (Tarone Study) Tarone (1982) reports data from 71 studies on tumor incidence in rats.
In one of the studies, 2 out of 13 rats had tumors. Assume there are 20 possible tumor probabilities: \(0.025, 0.075,\ldots, 0.975\). Assume that the tumor probability is uniformly distributed. Find the posterior distribution for the tumor probability given the data for this study.
Repeat Part a for a second study in which 1 in 18 rats had a tumor.
Parts a and b assumed that each study had a different tumor probability, and that these tumor probabilities were uniformly distributed a priori. Now, assume the tumor probabilities are the same for the two studies, and that this probability has a uniform prior distribution. Find the posterior distribution for the common tumor probability given the combined results from the two studies.
Compare the three distributions for Parts a, b, and c. Comment on your results.
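The discrete-grid Bayesian update described in Exercise 23.86 can be sketched in a few lines. The sketch below covers only the first study (2 tumors out of 13 rats) and assumes a binomial likelihood for the tumor count; the variable names are illustrative.

```r
# Grid of 20 possible tumor probabilities: 0.025, 0.075, ..., 0.975
theta <- seq(0.025, 0.975, by = 0.05)

# Uniform prior over the grid
prior <- rep(1 / length(theta), length(theta))

# Binomial likelihood of observing 2 tumors in 13 rats at each grid value
lik <- dbinom(2, size = 13, prob = theta)

# Posterior is prior times likelihood, renormalized to sum to one
posterior <- prior * lik / sum(prior * lik)

barplot(posterior, names.arg = theta, las = 2,
        xlab = "tumor probability", ylab = "posterior probability")
```

The other parts of the exercise reuse the same pattern with different counts in the `dbinom` call.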
Exercise 23.87 Let \(X\) and \(Y\) be independent, each with an \(Exp(1)\) distribution. Their joint distribution is given by \[f_{X,Y}(x,y) = e^{-(x+y)}, \quad x, y \ge 0.\]
Use the convolution formula to find the distribution of \(X + \frac{1}{2} Y\). Check your answer by also using a moment generating function approach.
Guess what happens if you consider \(X + \frac{1}{2} Y + \frac{1}{3} Z\) where \(Z\) is also \(Exp(1)\)?
Exercise 23.88 (Poisson MLE vs Bayes) You are developing tools for monitoring the number of advertisement clicks on a website. You have observed the following data:
y =c(4,1,3,4,3,2,7,3,4,6,5,5,3,2,4,5,4,7,5,2)
which represents the number of clicks every minute over the last 20 minutes. You assume that the number of clicks per minute follows a Poisson distribution with parameter \(\lambda\).
Plot likelihood function for \(\lambda\).
Estimate \(\lambda\) using Maximum Likelihood Estimation (MLE). MLE is the value of \(\lambda\) that maximizes the likelihood function or log-likelihood function. Maximizing likelihood is equivalent to maximizing the log-likelihood function (log is a monotonically increasing function).
Using barplot, plot the predicted vs observed probabilities for the number of clicks from 1 to 7. Is the model a good fit?
Assume that you know that, historically, the average number of clicks per minute is 4 and the variance is also 4. Those numbers were calculated over a long period of time. You can use this information as a prior. Assume that the prior distribution is \(Gamma(\alpha,\beta)\). What would be appropriate values for \(\alpha\) and \(\beta\) that would represent this prior information?
Find the posterior distribution for \(\lambda\) and calculate the Bayesian estimate for \(\lambda\) as the expectation over the posterior.
After collecting data for a few days, you realized that about 20% of the observations are zero. How would this information change your prior distribution? This is an open-ended question.
Hint: For part c, you can use the following code to calculate the predicted probabilities for the number of clicks from 0 to 5.
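One way to compute predicted and observed probabilities for the click counts is sketched below. This is an illustrative sketch rather than the original hint code: it plugs the sample mean in as the Poisson rate and tabulates counts from 0 to 7. The variable names are assumptions.

```r
y <- c(4, 1, 3, 4, 3, 2, 7, 3, 4, 6, 5, 5, 3, 2, 4, 5, 4, 7, 5, 2)

# For the Poisson model the MLE of lambda is the sample mean
lambda_hat <- mean(y)

# Predicted probabilities for 0, 1, ..., 7 clicks under Poisson(lambda_hat)
counts <- 0:7
predicted <- dpois(counts, lambda_hat)

# Observed relative frequencies over the same range of counts
observed <- as.numeric(table(factor(y, levels = counts))) / length(y)

# Side-by-side bars: observed vs predicted
barplot(rbind(observed, predicted), beside = TRUE, names.arg = counts,
        legend.text = c("observed", "predicted"),
        xlab = "clicks per minute", ylab = "probability")
```

Restricting `counts` to a narrower range changes only which bars are drawn, not the fitted rate.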
Exercise 23.89 (Exponential Distribution) Let \(x_1, x_2,\ldots, x_N\) be an independent sample from the exponential distribution with density \(p (x | \lambda) = \lambda\exp (-\lambda x)\), \(x \ge 0\), \(\lambda> 0\). Find the maximum likelihood estimate \(\lambda_{\text{ML}}\). Choose the conjugate prior distribution \(p (\lambda)\), and find the posterior distribution \(p (\lambda | x_1,\ldots, x_N)\) and calculate the Bayesian estimate for \(\lambda\) as the expectation over the posterior.
Exercise 23.90 (Bernoulli Bayes) We have \(N\) Bernoulli trials, each with success probability \(q\), and we observe \(k\) successes. Find the conjugate distribution \(p (q)\). Find the posterior distribution \(p (q | k, N)\) and its expectation.
Exercise 23.91 (Exponential Family (Gamma)) Write the density function of the Gamma distribution in standard exponential family form. Find \(Ex\) and \(E\log x\) by differentiating the normalizing constant.
Exercise 23.92 (Exponential Family (Binomial)) Write the density function of the Binomial distribution in standard exponential family form. Find \(Ex\) and \(Var x\) by differentiating the normalizing constant.
Exercise 23.93 (Normal) Let \(X_1\) and \(X_2\) be independent \(N(0,1)\) random variables. Let \[Y_1 = X_1^2 + X_2^2, \quad Y_2 = \frac{X_1}{\sqrt{Y_1}}\]
Find the joint distribution of \(Y_1\) and \(Y_2\).
Are \(Y_1\) and \(Y_2\) independent or not?
Can you interpret the result geometrically?
The density of \(N(\mu, \sigma^2)\) is \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
23.5 AB Testing
Exercise 23.94 (A/B Testing) During a recent outbreak of the flu, 850 out of 6,224 people diagnosed with the virus presented severe symptoms. During the same flu season, an experimental anti-virus drug was being tested. The drug was given to 238 people with the flu and only 6 of them developed severe symptoms. Based on this information, can you conclude, for sure, that the drug is a success?
Exercise 23.95 (Tesla Supplier) Tesla purchases Lithium as a raw material for their batteries from either of two suppliers and is concerned about the amounts of impurity the material contains. The percentage impurity levels in consignments of Lithium closely follow a normal distribution with the means and standard deviations given in the table below. The company is particularly anxious that the impurity level not exceed \(5\)% and wants to purchase from the supplier who is more likely to meet that specification.
| | Mean | Standard Deviation |
|---|---|---|
| Supplier A | 4.4 | 0.4 |
| Supplier B | 4.2 | 0.6 |
Which supplier should be chosen?
What if Supplier B implements some quality control which has no effect on the standard deviation but raises their mean to \(4.6\)?
Exercise 23.96 (Red Sox) On September 24, 2003, Pete Thamel in the New York Times reported that the Boston Red Sox had been accused of cheating by another American League Team. The claim was that the Red Sox had a much better winning record at home games than at games played in other cities.
The following table provides the wins and losses for home and away games for the Red Sox in the 2003 season.
Record
| Team | Home Wins | Home Losses | Away Wins | Away Losses |
|---|---|---|---|---|
| Boston Red Sox | 53 | 28 | 42 | 39 |
Is there any evidence that the proportion of wins differs significantly between home and away games?
Discuss any other issues that are relevant.
Hint: a 95% confidence interval for a difference in proportions \(p_1 - p_2\) is given by \[
( \hat{p}_1 - \hat{p}_2 ) \pm 1.96
\sqrt{ \frac{ \hat{p}_1 ( 1 - \hat{p}_1 ) }{ n_1 } +
\frac{ \hat{p}_2 ( 1 - \hat{p}_2 ) }{ n_2 } }
\]
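The interval in the hint can be evaluated with a small helper. The function name `diff_prop_ci` and its arguments are hypothetical; the sketch simply applies the formula above to the home and away records from the table (81 home games and 81 away games).

```r
# Hypothetical helper: normal-approximation CI for p1 - p2
diff_prop_ci <- function(x1, n1, x2, n2, z = 1.96) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(lower = (p1 - p2) - z * se, upper = (p1 - p2) + z * se)
}

# Home: 53 wins in 53 + 28 games; Away: 42 wins in 42 + 39 games
ci <- diff_prop_ci(53, 53 + 28, 42, 42 + 39)
ci
```

Changing `z` (for example to 2.58) gives the corresponding interval at another confidence level.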
Exercise 23.97 (Myocardial Infarction) In a five year study of the effects of aspirin on Myocardial Infarction (MI), or heart attack, you have the following dataset on the reduction of the probability of getting MI from taking aspirin versus a Placebo, or control.
| Treatment | with MI | without MI |
|---|---|---|
| Placebo | 198 | 10845 |
| Aspirin | 104 | 10933 |
Find a \(95\)% confidence interval for the difference in proportions \(p_1-p_2\).
Perform a hypothesis test of the null \(H_0 : p_1 = p_2\) at a \(1\)% significance level.
Exercise 23.98 (San Francisco Giants) In October 1992, the ownership of the San Francisco Giants considered a sale of the franchise that would have resulted in a move to Florida. A survey from the San Francisco Chronicle found that in a random sample of \(625\) people, 50.7% would be disappointed by the move. Find a 95% confidence interval of the population proportion.
Exercise 23.99 (Pfizer) Pfizer introduced Viagra in early 1998, and during \(1998\), of the \(6\) million Viagra users, \(77\) died from coronary problems such as heart attacks. Pfizer claimed that this rate is no more than that in the general population.
You find from a clinical study of \(1,500,000\) men who were not on Viagra that \(11\) of them died of coronary problems over the same length of time as the \(77\) Viagra users who died in \(1998\).
Do you agree with Pfizer’s claim that the proportion of Viagra users dying from coronary problems is no more than that of other comparable men?
Exercise 23.100 (Voter: CI) Given a random sample of \(1000\) voters, \(400\) say they will vote for Donald Trump if he wins the Republican nomination for the 2016 US Presidential Election. Given that he gets the nomination, a \(95\%\) confidence interval for the true proportion of voters that will vote for him includes \(45\%\).
Exercise 23.101 (Survey AB testing) In a pre-election survey of \(435\) voters, \(10\) indicated that they planned to vote for Ralph Nader. In a separate survey of \(500\) people, \(250\) said they planned to vote for George Bush.
Perform a hypothesis test at the 5% level for the hypothesis that Nader will get 3% or less of the vote.
Find a 95% confidence interval for the difference in the proportion of people that will vote for George Bush versus Ralph Nader.
Exercise 23.102 (Significance Tests.) Some defendants in southern states have challenged verdicts made during the ’50s and ’60s, because of possible unfair jury selections. Juries are supposed to be selected at random from the population. In one specific case, only 12 jurors in an 80 person panel were African American. In the state in question, 50% of the population was African American. Could a jury with 12 African Americans out of 80 people be the result of pure chance?
Using \(n = 80\), \(X = 12\) find a 99% confidence interval for the proportion \(p\). Is that significantly different from \(p = 0.5\)?
Exercise 23.103 A marketing firm is studying the effects of background music on people’s buying behavior. A random sample of \(150\) people had classical music playing while shopping and \(200\) had pop music playing. The group that listened to classical music spent on average $74 with a standard deviation of $18 while the pop music group spent $78.4 with a standard deviation of $12.
Test whether there is any significant difference in purchasing habits between the two groups. Describe clearly your null and alternative hypotheses and any test statistics that you use.
Is there a difference between using a \(5\)% and \(1\)% significance level?
Exercise 23.104 (Sensitivity and Specificity) Nvidia’s graphics chips are of high quality: the probability that a randomly chosen chip is defective is only \(0.1\)%. You have invented a new technology for testing whether a given chip is defective or not. This test will always identify a defective chip as defective and will only “falsely” identify a good chip as defective with probability \(1\)%.
What are the sensitivity and specificity of your testing device?
Given that the test identifies a chip as defective, what’s the posterior probability that it is actually defective?
What percentage of the chips will the new technology identify as being defective?
Should you advise Nvidia to go ahead and implement your testing device? Explain.
Exercise 23.105 (CI for Google) Google is test marketing a new website design to see if it increases the number of click-throughs on banner ads. In a small study of a million page views they find the following table of responses:
| | Total Viewers | Click-Throughs |
|---|---|---|
| new design | 700,000 | 10,000 |
| old design | 300,000 | 2,000 |
Find a \(99\)% confidence interval for the increase in the proportion of people who click-through on banner ads using the new web design.
Hint: a 99% confidence interval for a difference in proportions \(p_1 - p_2\) is given by \(( \hat{p}_1 - \hat{p}_2 ) \pm 2.58\sqrt{ \frac{ \hat{p}_1 ( 1 - \hat{p}_1 ) }{ n_1 } + \frac{ \hat{p}_2 ( 1 - \hat{p}_2 ) }{ n_2 } }\)
Exercise 23.106 (Amazon) Amazon is test marketing a new package delivery system. It wants to see if same-day service is feasible for packages bought with Amazon Prime. In a small study of a hundred thousand deliveries they find the following times for delivery:
| | Deliveries | Mean-Time (Hours) | Standard Deviation-Time |
|---|---|---|---|
| new system | 80,000 | 4.5 | 2.1 |
| old system | 20,000 | 5.6 | 2.5 |
Find a \(95\)% confidence interval for the decrease in delivery time.
If they switch to the new system, what proportion of deliveries will be under \(5\) hours, which is required to guarantee same-day service?
Hint: a 95% confidence interval for a difference in means \(\mu_1 - \mu_2\) is given by \(( \bar{x}_1 - \bar{x}_2 ) \pm 1.96 \sqrt{ \frac{s_1^2 }{ n_1 } + \frac{ s_2^2 }{ n_2 } }\)
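The difference-in-means interval from the hint can be computed from the summary statistics in the table. The helper name `diff_mean_ci` is hypothetical, and the sketch takes the difference as old minus new so that a positive value represents a decrease in delivery time.

```r
# Hypothetical helper: large-sample CI for mu1 - mu2 from summary statistics
diff_mean_ci <- function(xbar1, s1, n1, xbar2, s2, n2, z = 1.96) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  d <- xbar1 - xbar2
  c(lower = d - z * se, upper = d + z * se)
}

# Old system (mean 5.6, sd 2.5, n 20,000) minus new system (4.5, 2.1, 80,000)
ci <- diff_mean_ci(5.6, 2.5, 20000, 4.5, 2.1, 80000)
ci
```

With samples this large the normal critical value 1.96 is an appropriate choice; for small samples a \(t\) quantile would be used instead.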
Exercise 23.107 (Vitamin C) In the very famous study of the benefits of Vitamin C, \(279\) people were randomly assigned to a dose of vitamin C or a placebo (control of nothing). The objective was to study whether vitamin C reduces the incidence of a common cold. The following table provides the responses from the experiment:
| Group | Colds | Total |
|---|---|---|
| Vitamin C | 17 | 139 |
| Placebo | 31 | 140 |
Is there a significant difference in the proportion of colds between the vitamin C and placebo groups?
Find a \(99\)% confidence interval for the difference. Would you recommend the use of vitamin C to prevent a cold?
Hint: a 95% confidence interval for a difference in proportions \(p_1 - p_2\) is given by \[
( \hat{p}_1 - \hat{p}_2 ) \pm 1.96 \sqrt{ \frac{ \hat{p}_1 (1- \hat{p}_1) }{ n_1 } + \frac{ \hat{p}_2 (1- \hat{p}_2) }{ n_2 } }
\]
Exercise 23.108 (Facebook vs Reading) In a recent article it was claimed that “\(96\)% of Americans under the age of \(50\)” spent more than three hours a day on Facebook.
To test this hypothesis, a survey of \(418\) people under the age of \(50\) was taken and it was found that \(401\) used Facebook for more than three hours a day.
Test the hypothesis at the \(5\)% level that the claim of \(96\)% is correct.
Exercise 23.109 (Paired T-Test) The following table shows the outcome of eight years of a ten year bet that Warren Buffett placed with Protege Partners, a New York hedge fund. Buffett claimed that a simple index fund would beat a portfolio strategy (fund-of-funds) picked by Protege over a ten year time frame. At Buffett’s shareholder meeting, he provided an update of the current state of the bet. The bundle of hedge funds picked by Protege had returned \(21.9\)% in the eight years through \(2015\) and the S&P500 index fund had soared \(65.7\)%.
| Year | SP Index | Hedge Funds |
|---|---|---|
| 2008 | -37.0% | -23.9% |
| 2009 | 26.6% | 15.9% |
| 2010 | 15.1% | 8.5% |
| 2011 | 2.1% | -1.9% |
| 2012 | 16.0% | 6.5% |
| 2013 | 32.3% | 11.8% |
| 2014 | 13.6% | 5.6% |
| 2015 | 1.4% | 1.7% |
| cumulative | 65.7% | 21.9% |
Use a paired \(t\)-test to assess the statistical significance between the two return strategies.
How likely is Buffett to win his bet in two years?
Exercise 23.110 (Shaquille O’Neal) Shaquille O’Neal (nicknamed Shaq) is an American former professional basketball player. He was notoriously bad at free throws (an uncontested shot given to a player when they are fouled). The following table compares the first three years of Shaq’s career to his last three years.
| Group | Free Throws Made | Free Throws Attempted |
|---|---|---|
| Early Years | 1352 | 2425 |
| Later Years | 1121 | 2132 |
Did Shaq get worse at free throws over his career?
Exercise 23.111 (Furniture Website) Furniture.com wishes to estimate the average annual expenditure on furniture among its subscribers. A sample of 1000 subscribers is taken. The sample standard deviation is found to be $670 and the sample mean is $3790.
Give a 90% confidence interval for the average expenditure.
Test the hypothesis that the average expenditure is less than $2,500 at the 1% level.
Exercise 23.112 (Grocery AB Testing) A grocery delivery service is studying the impact of weather on customer ordering behavior. To do so, they identified households which made orders both in October and in February on the same day of the week and approximately the same time of day. They found 112 such matched orders. The mean purchase amount in October was $121.45, with a standard deviation of $32.78, while the mean purchase amount in February was $135.99 with a standard deviation of $24.81. The standard deviation of the difference between the two orders was $38.28. To quantify the evidence in the data regarding a difference in the average purchase amount between the two months, compute the \(p\)-value of the data.
Exercise 23.113 (SimCity AB Testing) SimCity 5 is one of Electronic Arts (EA’s) most popular video games. As EA prepared to release the new version, they released a promotional offer to drive more pre-orders. The offer was displayed on their webpage as a banner across the top of the pre-order page. They decided to test some other options to see what design or layout would drive more revenue.
The control removed the promotional offer from the page altogether. The test led to some very surprising results. With a sample size of \(1000\) visitors, of the \(500\) who got the promotional offer they found \(143\) people wanted to purchase the game, and of the half that got the control they found that \(199\) wanted to buy the new version of SimCity.
Test at the \(1\)% level whether EA should provide a promotional offer or not.
Exercise 23.114 (True/False)
At the Apple conference it was claimed that “\(97\)% of people love the iWatch”. From a market survey, you found empirically that \(1175\) out of a sample of \(1400\) people love the new iWatch. Statistically speaking, you can reject Apple’s claim at the \(1\)% significance level. Hint: You may use \(pnorm(2.58) = 0.995\)
Given a random sample of \(2000\) voters, \(800\) say they will vote for Hillary Clinton in the 2016 US Presidential Election. At the \(95\)% level, I can reject the null hypothesis that Hillary has an evens chance of winning the election.
The average movie in Netflix’s database has an average customer rating of \(3.1\) with a standard deviation of \(1\). The last episode of Breaking Bad had a rating of \(4.7\) with a standard deviation of \(0.5\). The \(p\)-value for testing whether Breaking Bad’s rating is statistically different from the average is a lot less than \(1\)%.
A chip manufacturer needs to add the right amount of chemicals to make the chips resistant to heat. On average the population of chips needs to be able to withstand heat of 300 degrees. Suppose you have a random sample of \(30\) chips with a mean of \(305\) and a standard deviation of \(8\). Then you can reject the null hypothesis \(H_0: \mu= 300\) versus \(H_a: \mu> 300\) at the \(5\)% level.
Zagats rates restaurants on food quality. In a random sample of \(100\) restaurants you observe a mean of \(20\) with a standard deviation of \(2.5\). Your favorite restaurant has a score of \(25\). This is statistically different from the population mean at the \(5\)% level.
The \(t\)-score is used to test whether a null hypothesis can be rejected.
An oil company introduces a new fuel that they claim has on average no more than \(100\) milligrams of toxic matter per liter. They take a sample of \(100\) liters and find that \(\bar{X} = 105\) with a given \(\sigma_X = 20\). Then there is evidence at the \(5\)% level that their claim is wrong.
A wine producer claims that the proportion of customers who cannot distinguish his product from grape juice is at most 5%. For a sample of \(100\) people he finds that \(10\) fail the taste test. He should reject his null hypothesis \(H_{0}:p=0.05\) at the \(5\)% level.
As the \(t\)-ratio increases the \(p\)-value of a hypothesis test decreases.
Exercise 23.115 (True/False: CI)
There is much discussion of the effects of second-hand smoke. In a survey of \(500\) children who live in families where someone smokes, it was found that \(10\) children were in poor health. A \(95\)% confidence interval for the probability of a child living in a smoking family being in poor health is then \(2\)% to \(4\)%.
You are finding a confidence interval for a population mean. Holding everything else constant, an interval based on an unknown standard deviation will be wider than one based on a known standard deviation no matter what the sample size is.
There is a \(95\)% probability that a normal random variable lies between \(\mu \pm \sigma\).
A recent CNN poll found that \(49\)% of \(10000\) voters said they would vote for Obama versus Romney if that was the election in November 2012. A \(95\)% confidence interval for the true proportion of voters that would vote for Obama is then \(0.49 \pm 0.03\).
A mileage test for a new electric car model called the “Pizzazz” is conducted. With a sample size of \(n=30\) the mean mileage for the sample is \(36.8\) miles with a sample standard deviation of \(4.5\). A \(95\)% confidence interval for the population mean is \((32.3,41.3)\) miles.
In a random sample of \(100\) NCAA basketball games, the team leading after one quarter won the game \(72\) times. Then a 95% confidence interval for the proportion of teams leading after the first quarter that go on to win is approximately \((0.6,0.84)\).
For the same sample, a 95% prediction interval for a particular team winning is also \((0.6,0.84)\).
In playing poker in Vegas, from \(100\) hours of play, you make an average of $50 per hour with a standard deviation of $10. A 95% confidence interval for your mean gain per hour is approximately \((48, 52)\) dollars.
If 27 out of 100 respondents to a survey state that they drink Pepsi then a 95% confidence interval for the proportion \(p\) of the population that drinks Pepsi is \(( 0.26 , 0.28 )\).
The \(p\)-value is the probability that the Null hypothesis is true.
The Central Limit Theorem states that the distribution of a sample mean is approximately Normal.
The Central Limit Theorem states that the distribution of the sample mean \(\bar{X}\) is Normally distributed for large samples.
The Central Limit Theorem guarantees that the distribution of \(\bar{X}\) is constant.
The sample mean, \(\bar{x}\), approximates the population mean for large random samples.
The trimmed mean of a dataset is more sensitive to outliers than the mean.
The sample mean of a dataset must be larger than its standard deviation.
Selection bias is not a problem when you are estimating a population mean.
The kurtosis of a distribution is not sensitive to outliers.
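Several of the confidence-interval statements in this exercise reduce to the same arithmetic. The sketch below assumes the usual large-sample (normal approximation) intervals; `prop_ci` and `mean_ci` are hypothetical helper names introduced here, not functions from any package.

```r
z <- qnorm(0.975)   # two-sided 95% normal quantile

# Large-sample interval for a proportion: p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)
prop_ci <- function(x, n) {
  p_hat <- x / n
  p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
}

# Large-sample interval for a mean: xbar +/- z * s / sqrt(n)
mean_ci <- function(xbar, s, n) {
  xbar + c(-1, 1) * z * s / sqrt(n)
}

prop_ci(10, 500)       # children in poor health
prop_ci(72, 100)       # teams leading after one quarter
prop_ci(27, 100)       # Pepsi drinkers
mean_ci(50, 10, 100)   # poker gain per hour
```

For small samples such as the \(n=30\) mileage test, replacing `z` with `qt(0.975, df = n - 1)` gives the corresponding \(t\) interval.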
23.6 Field vs Observational
Exercise 23.117 What is the difference between a randomized trial and an observational study?
Exercise 23.118 Consider a complex A/B experiment with 6 alternatives: 5 variations of your page plus the original.
Use Bonferroni correction for multiple comparisons. Calculate the significance level for each of the 5 tests and find the number of samples needed to achieve a power of 0.95. Assume that the significance level for each test is .05.
Implement Thompson sampling (TS) for the same experiment. Assume an original arm with a 4% conversion rate, and an optimal arm with a 5% conversion rate. The other 4 arms include one suboptimal arm that beats the original with a conversion rate of 4.5%, and three inferior arms with rates of 3%, 2%, and 3.5%. Plot the savings from a six-armed experiment, relative to a Bonferroni-adjusted power calculation for a classical experiment. The first plot should show the number of days required to end the experiment, with a vertical line showing the time required by the classical power calculation. The second plot should show the number of conversions that were saved by the bandit. What is the overall cost savings due to ending the experiment more quickly, and due to the experiment being less wasteful while it is running?
Run your simulator 500 times and show the history of the serving weights for all the arms in the first of the 500 simulation runs. Comment on the results.
Plot the daily cost of running the multi-armed bandit relative to an “oracle” strategy of always playing arm 2, the optimal arm.
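One possible starting point for this exercise is sketched below. It rests on several assumptions that are not part of the exercise: the overall significance level of 0.05 is split evenly over the 5 comparisons, the sample size comes from a two-sided two-proportion test (4% versus 5%), each arm has a Beta(1, 1) prior, traffic is a hypothetical 100 visits per day, and serving weights are set to a Monte Carlo estimate of the posterior probability that each arm is best. It does not implement a stopping rule or the requested plots.

```r
set.seed(1)

# Conversion rates from the exercise: arm 1 = original, arm 2 = optimal arm
rates <- c(0.04, 0.05, 0.045, 0.03, 0.02, 0.035)
k <- length(rates)

# Bonferroni: 5 comparisons against the original (assumed overall level 0.05)
alpha_each <- 0.05 / 5
pw <- power.prop.test(p1 = 0.04, p2 = 0.05, sig.level = alpha_each, power = 0.95)
n_per_arm <- ceiling(pw$n)          # classical sample size per arm

# Thompson sampling (randomized probability matching) with Beta(1, 1) priors
visits_per_day <- 100               # hypothetical daily traffic
n_days  <- 50                       # hypothetical horizon for this sketch
n_draws <- 1000                     # posterior draws used to estimate P(best)
succ <- rep(0, k)
fail <- rep(0, k)
weights <- matrix(NA_real_, nrow = n_days, ncol = k)

for (day in seq_len(n_days)) {
  # Column j holds posterior draws of the conversion rate of arm j
  draws <- matrix(rbeta(k * n_draws,
                        rep(succ + 1, each = n_draws),
                        rep(fail + 1, each = n_draws)), ncol = k)
  # Serving weight = share of draws in which the arm has the highest rate
  w <- tabulate(max.col(draws), nbins = k) / n_draws
  weights[day, ] <- w
  # Allocate the day's visits according to the weights and observe conversions
  n_arm <- as.vector(rmultinom(1, visits_per_day, w))
  conv  <- rbinom(k, n_arm, rates)
  succ  <- succ + conv
  fail  <- fail + (n_arm - conv)
}
```

The rows of `weights` give the history of serving weights for one run; wrapping the loop in a function and calling it 500 times would produce the replicated simulation the exercise asks for.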
23.7 Added After Course Started
Exercise 23.119 (Emily, Car, Stock Market, Sweepstakes, Vacation and Bayes.)
Emily is taking a Bayesian Analysis course. She believes she will get an A with probability 0.6, a B with probability 0.3, and a C or less with probability 0.1. At the end of the semester she will get a car as a present from her (very) rich uncle depending on her class performance. For getting an A in the course Emily will get a car with probability 0.8, for a B with probability 0.5, and for anything less than a B, she will get a car with probability 0.2. These are the probabilities if the market is bullish. If the market is bearish, the uncle is less likely to make expensive presents, and the above probabilities are 0.5, 0.3, and 0.1, respectively. The probabilities of a bullish and a bearish market are equal, 0.5 each. If Emily gets a car, she will travel to Redington Shores with probability 0.7, or stay on campus with probability 0.3. If she does not get a car, these two probabilities are 0.2 and 0.8, respectively. Independently, Emily may be a lucky winner of a sweepstakes lottery for a free air ticket and vacation in hotel Sol at Redington Shores. The chance to win the sweepstakes is 0.001, but if Emily wins, she will go on vacation with probability 0.99, irrespective of what happened with the car.
After the semester was over you learned that Emily is at Redington Shores.
What is the probability that she got a car?
What is the probability that she won the sweepstakes?
What is the probability that she got a B in the course?
What is the probability that the market was bearish?
Hint: You can solve this problem in either of two ways: (i) direct simulation using R or Python, or (ii) exact calculation. Use just one of the two. The exact solution, although straightforward, may be quite messy.
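The simulation route from the hint can be sketched as follows. This is one reading of the problem, not a definitive solution: it assumes that a sweepstakes win overrides the car-based travel decision, so a winner goes to Redington Shores with probability 0.99 and a non-winner with probability 0.7 or 0.2 depending on the car. The variable names and the number of simulated semesters are arbitrary choices.

```r
set.seed(42)
n <- 500000   # number of simulated semesters (arbitrary)

grade <- sample(c("A", "B", "C"), n, replace = TRUE, prob = c(0.6, 0.3, 0.1))
bull  <- runif(n) < 0.5

# Probability of getting a car given grade, by market state
p_car_bull <- c(A = 0.8, B = 0.5, C = 0.2)
p_car_bear <- c(A = 0.5, B = 0.3, C = 0.1)
p_car <- ifelse(bull, p_car_bull[grade], p_car_bear[grade])
car   <- runif(n) < p_car

# Sweepstakes is independent of everything else
win <- runif(n) < 0.001

# Assumed travel rule: a win overrides the car-based probabilities
p_go  <- ifelse(win, 0.99, ifelse(car, 0.7, 0.2))
at_rs <- runif(n) < p_go

# Condition on the evidence: keep only runs where Emily is at Redington Shores
post <- c(car         = mean(car[at_rs]),
          sweepstakes = mean(win[at_rs]),
          grade_B     = mean(grade[at_rs] == "B"),
          bearish     = mean(!bull[at_rs]))
round(post, 3)
```

Conditioning by subsetting on `at_rs` is simple rejection sampling; the estimates can be compared against an exact calculation with Bayes' rule.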
Exercise 23.120 (Poisson: Website visits) A web designer is analyzing traffic on a web site. Assume the number of visitors arriving at the site at a given time of day is modeled as a Poisson random variable with a rate of \(\lambda\) visitors per minute. Based on prior experience with similar web sites, the following estimates are given:
There is a 90% probability that the rate is greater than 5 visitors per minute.
The rate is equally likely to be greater than or less than 14 visitors per minute.
There is a 90% probability that the rate is less than 27 visitors per minute.
Find a Gamma prior distribution for the arrival rate that fits these judgments as well as possible. Comment on your results.
Hint: there is no “right” answer to this problem. You can use trial and error to find a distribution that fits as well as possible. You can also use an optimization method such as Excel Solver to minimize a measure of how far apart the given quantiles are from the ones in the target distribution.
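The optimization approach in the hint can also be carried out in R. The sketch below treats the three judgments as the 10%, 50%, and 90% quantiles of the prior and minimizes the sum of squared differences; that loss function, the starting values, and the variable names are assumptions, and other measures of fit are equally defensible.

```r
probs  <- c(0.10, 0.50, 0.90)   # P(rate < 5) = 0.1, P(rate < 14) = 0.5, P(rate < 27) = 0.9
target <- c(5, 14, 27)

# Squared distance between Gamma quantiles and the elicited values.
# Parameters are optimized on the log scale so shape and rate stay positive.
loss <- function(par) {
  q <- qgamma(probs, shape = exp(par[1]), rate = exp(par[2]))
  sum((q - target)^2)
}

fit <- optim(c(log(2), log(0.1)), loss)   # Nelder-Mead from an arbitrary start
shape_hat <- exp(fit$par[1])
rate_hat  <- exp(fit$par[2])

fitted_q <- qgamma(probs, shape = shape_hat, rate = rate_hat)
rbind(target = target, fitted = round(fitted_q, 2))
```

With three judgments and only two parameters the fit cannot be exact in general, which is part of what the exercise asks you to comment on.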
Exercise 23.121 (Normal Likelihood) We observe samples of a normal random variable \(Y_i\mid \mu,\sigma \sim N(\mu,\sigma^2)\) with known \(\sigma\). Specify and plot the likelihood for \(\mu\) in each of the following cases:
y = (-4.3,0.7,-19.4), \(\sigma=10\)
y = (-12,12,-4.5,0.6), \(\sigma=6\)
y = (-4.3,0.7,-19.4), \(\sigma=2\)
y = (12.4,12.1), \(\sigma=5\)
Hint: Remember that the normal likelihood is the product of the normal density function evaluated at each observation \[
L(\mu\mid y,\sigma) = \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi}\sigma}e^{-(y_i-\mu)^2/2\sigma^2} = \dfrac{1}{(2\pi\sigma^2)^{n/2}}e^{-\sum_{i=1}^n(y_i-\mu)^2/2\sigma^2}
\] The expression in the exponent can be simplified to \[
\sum_{i=1}^n(y_i-\mu)^2 = \sum_{i=1}^n (\mu^2 - 2\mu y_i + y_i^2) = n\mu^2 - 2\mu\sum_{i=1}^n y_i + \sum_{i=1}^n y_i^2 = n\mu^2 - 2\mu n\bar{y} + \sum_{i=1}^n y_i^2 = n\left(\mu^2 - 2\mu\bar{y} + \dfrac{1}{n}\sum_{i=1}^n y_i^2\right)
\] Now \[
\mu^2 - 2\mu\bar{y} + \dfrac{1}{n}\sum_{i=1}^n y_i^2 = \mu^2 - 2\mu\bar{y} + \bar y^2 - \bar y^2 + \dfrac{1}{n}\sum_{i=1}^n y_i^2 = (\mu - \bar{y})^2 + \dfrac{1}{n}\sum_{i=1}^n(y_i-\bar{y})^2
\] The last summand does not depend on \(\mu\) and can be ignored. Thus, the likelihood is proportional to \[
\exp\left(-\dfrac{(\mu - \bar{y})^2}{2\sigma^2/n}\right)
\]
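The proportional form at the end of the hint is easy to evaluate on a grid. The following sketch plots it for the four data sets of the exercise; `norm_lik`, the grid range, and the panel layout are choices made here for illustration.

```r
# Likelihood for mu up to a constant: exp(-(mu - ybar)^2 / (2 sigma^2 / n))
norm_lik <- function(mu, y, sigma) {
  exp(-(mu - mean(y))^2 / (2 * sigma^2 / length(y)))
}

cases <- list(
  list(y = c(-4.3, 0.7, -19.4),   sigma = 10),
  list(y = c(-12, 12, -4.5, 0.6), sigma = 6),
  list(y = c(-4.3, 0.7, -19.4),   sigma = 2),
  list(y = c(12.4, 12.1),         sigma = 5)
)

mu_grid <- seq(-30, 30, length.out = 601)   # grid step of 0.1

op <- par(mfrow = c(2, 2))
for (cs in cases) {
  plot(mu_grid, norm_lik(mu_grid, cs$y, cs$sigma), type = "l",
       xlab = expression(mu), ylab = "likelihood (unnormalized)",
       main = paste0("ybar = ", round(mean(cs$y), 2), ", sigma = ", cs$sigma))
}
par(op)
```

Each curve peaks at \(\bar{y}\) and has width governed by \(\sigma/\sqrt{n}\), so the first and third cases share a peak but differ sharply in spread.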
Exercise 23.122 (Normal-Normal for Lock5Data) The Lock5Data package (Lock et al. 2016) includes results for a cross-sectional study of hippocampal volumes among 75 subjects (Singh et al. 2014): 25 collegiate football players with a history of concussions (FBConcuss), 25 collegiate football players that do not have a history of concussions (FBNoConcuss), and 25 control subjects. For our analysis, we’ll focus on the subjects with a history of concussions:
Assume that the hippocampal volumes of the subjects with a history of concussions are independent normal random variables with unknown mean \(\mu\) and known standard deviation \(\sigma\). Estimate \(\sigma\) using the sample standard deviation of the hippocampal volumes (FBConcuss group). Calculate the likelihood for \(\mu\).
Assume that \(\mu\) has a normal prior distribution with mean \(\mu_0\) and standard deviation \(\sigma_0\). Estimate those parameters using the sample mean and standard error of the hippocampal volumes across all subjects.
Find the posterior distribution for \(\mu\).
Find a 95% posterior credible interval for \(\mu\).
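As a reminder, the standard conjugate update for the Normal-Normal model with known \(\sigma\) is stated below in the notation of this exercise, where \(\bar{y}\) denotes the sample mean of the \(n = 25\) FBConcuss volumes. The symbols \(\mu_n\) and \(\sigma_n\) are introduced here for the posterior mean and standard deviation. \[
\mu \mid y \sim N\left(\mu_n, \sigma_n^2\right), \qquad \sigma_n^2 = \left(\dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}\right)^{-1}, \qquad \mu_n = \sigma_n^2\left(\dfrac{\mu_0}{\sigma_0^2} + \dfrac{n\bar{y}}{\sigma^2}\right)
\] A 95% equal-tailed credible interval is then approximately \(\mu_n \pm 1.96\,\sigma_n\).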
Exercise 23.123 (Normal-Normal chest measurements) We have chest measurements of 10 000 men. Now, based on memories of my experience as an assistant in a gentlemen’s outfitters in my university vacations, I would suggest a prior \[
\mu \sim N(38, 9).
\] Of course, it is open to question whether these men form a random sample from the whole population, but unless I am given information to the contrary I would stick to the prior I have just quoted, except that I might be inclined to increase the variance. My data shows that the mean turned out to be 39.8 with a standard deviation of 2.0 for the sample of 10 000.
Calculate the posterior mean for the chest measurements of men in this population.
Calculate the predictive distribution for the next observation \(x_{n+1}\).
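The calculation can be sketched in R as follows. It assumes that the second argument of \(N(38, 9)\) is a variance and that the sample standard deviation of 2.0 can be treated as the known sampling standard deviation; `normal_normal` is a hypothetical helper written for this sketch.

```r
# Conjugate Normal-Normal update with known sampling standard deviation sigma
normal_normal <- function(ybar, n, sigma, mu0, tau0_sq) {
  post_prec <- 1 / tau0_sq + n / sigma^2                     # posterior precision
  post_mean <- (mu0 / tau0_sq + n * ybar / sigma^2) / post_prec
  list(mean = post_mean, var = 1 / post_prec)
}

# Prior N(38, 9) with 9 read as a variance (assumption); data: n = 10000, mean 39.8, sd 2
post <- normal_normal(ybar = 39.8, n = 10000, sigma = 2, mu0 = 38, tau0_sq = 9)

# Predictive distribution for the next observation: posterior variance plus sigma^2
pred <- list(mean = post$mean, var = post$var + 2^2)

str(list(posterior = post, predictive = pred))
```

With 10 000 observations the data dominate, so the predictive variance is almost entirely the sampling variance.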
Exercise 23.124 (Normal-Normal Hypothesis Test) Assume the Normal-Normal model with known variance, \(X\mid \theta \sim N(\theta,\sigma^2)\), \(\theta \sim N(\mu ,\tau^2)\), and \(\theta\mid X \sim N(\mu^*,\rho^2)\). Given the losses \(L_0 = L(d_1\mid H_0)\) and \(L_1 = L(d_0\mid H_1)\), show that for testing \(H_0 : \theta \le \theta_0\) the “rejection region” is \(\mu^* > \theta_0 + z\rho\), where \(z = \Phi^{-1}\left(L_0/(L_0+L_1)\right)\). Remember that in the classical \(\alpha\)-level test, the rejection region is \(X > \theta_0 + z_{1-\alpha}\sigma\).
Exercise 23.125 (Normal-Normal Hypothesis Test) Assume \(X\mid \theta \sim N(\theta,\sigma^2)\) and \(\theta \sim p(\theta)=1\). Consider testing \(H_0 : \theta \le \theta_0\) versus \(H_1 : \theta > \theta_0\). Show that \(p_0 = P(\theta \le \theta_0 \mid X)\) is equal to the classical \(p\)-value.
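Before attempting the proof, the claim can be checked numerically. The sketch below uses hypothetical values for \(X\), \(\theta_0\), and \(\sigma\), and relies on the fact that under the flat prior the posterior is \(\theta \mid X \sim N(X, \sigma^2)\).

```r
# Hypothetical observation, null boundary, and known standard deviation
x <- 1.3
theta0 <- 0.5
sigma <- 2

# Posterior probability of H0 under the flat prior: theta | X ~ N(x, sigma^2)
p0 <- pnorm(theta0, mean = x, sd = sigma)

# Classical one-sided p-value: P(X >= x) computed at theta = theta0
p_value <- pnorm(x, mean = theta0, sd = sigma, lower.tail = FALSE)

all.equal(p0, p_value)
```

The agreement holds for any choice of the three values, which follows from the symmetry of the normal density.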
Logan, John A. 1983. “A Multivariate Model for Mobility Tables.” American Journal of Sociology 89 (2): 324–49.
Tarone, Robert E. 1982. “The Use of Historical Control Information in Testing for a Trend in Proportions.” Biometrics, 215–20.