Exercises

Probability

Exercise 1 (Joint Distributions) A credit card company collects data on \(10,000\) users. The data contain two variables. The first is an indicator of the customer’s status: whether they are in default (def = 1) or current with their payments (def = 0). The second is a measure of their loan balance relative to income, with three categories: low (bal = 1), medium (bal = 2) and high (bal = 3). The data are given in the following table:

| bal | def = 0 | def = 1 |
|-----|---------|---------|
| 1   | 8,940   | 64      |
| 2   | 651     | 136     |
| 3   | 76      | 133     |

  1. Compute the marginal distribution of customer status.
  2. What is the conditional distribution of bal given def = 1?
  3. Make a prediction for the status of a customer with a high balance.

Exercise 2 (Marginal) The table below is taken from the Hoff text and shows the joint distribution of occupations taken from a 1983 study of social mobility by @logan1983multivariate. Each cell is P(father’s occupation, son’s occupation).

d = data.frame(
  farm         = c(0.018, 0.002, 0.001, 0.001, 0.001),
  operatives   = c(0.035, 0.112, 0.066, 0.018, 0.029),
  craftsman    = c(0.031, 0.064, 0.094, 0.019, 0.032),
  sales        = c(0.008, 0.032, 0.032, 0.010, 0.043),
  professional = c(0.018, 0.069, 0.084, 0.051, 0.130),
  row.names    = c("farm", "operative", "craftsman", "sales", "professional")
)
# Call kable directly so the snippet does not depend on the pipe operator being loaded
knitr::kable(d)

|              | farm | operatives | craftsman | sales | professional |
|--------------|------|------------|-----------|-------|--------------|
| farm         | 0.02 | 0.04       | 0.03      | 0.01  | 0.02         |
| operative    | 0.00 | 0.11       | 0.06      | 0.03  | 0.07         |
| craftsman    | 0.00 | 0.07       | 0.09      | 0.03  | 0.08         |
| sales        | 0.00 | 0.02       | 0.02      | 0.01  | 0.05         |
| professional | 0.00 | 0.03       | 0.03      | 0.04  | 0.13         |

  1. Find the marginal distribution of fathers’ occupations.
  2. Find the marginal distribution of sons’ occupations.
  3. Find the conditional distribution of the son’s occupation given that the father is a farmer.
  4. Find the conditional distribution of the father’s occupation given that the son is a farmer.
  5. Comment on these results. What do they say about changes in farming in the population from which these data are drawn?
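
As a hedged illustration of the mechanics only (not a solution), the sketch below builds a small hypothetical joint table `joint` and reads off marginal and conditional distributions. The numbers and the names `joint`, `row_marginal`, `col_marginal` and `cond_given_a` are assumptions made for the example; the same base R calls should carry over to a table such as `d`.

```r
# Hypothetical 2x2 joint distribution P(row variable, column variable)
joint <- matrix(c(0.10, 0.20,
                  0.30, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(row = c("a", "b"), col = c("u", "v")))

# Marginals: sum the joint probabilities over the other variable
row_marginal <- rowSums(joint)   # P(row)
col_marginal <- colSums(joint)   # P(col)

# Conditional distribution of the column variable given row == "a":
# divide that row of the joint table by its marginal probability
cond_given_a <- joint["a", ] / row_marginal["a"]
```

A conditional distribution obtained this way always sums to one, which is a quick sanity check on the arithmetic.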

Exercise 3 (Conditional) Netflix surveyed the general population about the number of hours per week that people use its service. The following table gives the proportion in each category according to whether the respondent is a teenager or an adult.

| Hours          | Teenager | Adult |
|----------------|----------|-------|
| \(<4\)         | 0.18     | 0.20  |
| \(4\) to \(6\) | 0.12     | 0.32  |
| \(>6\)         | 0.04     | 0.14  |

Calculate the following probabilities:

  1. Given that you spend \(4\) to \(6\) hours a week watching movies, what’s the probability that you are a teenager?
  2. What is the marginal distribution of hours spent watching movies?
  3. Are hours spent watching Netflix movies independent of age?

Exercise 4 (Joint and Conditional) The following probability table relates \(Y\) the number of TV shows watched by the typical student in an evening to the number of drinks \(X\) consumed.

| \(X\) | \(Y=0\) | \(Y=1\) | \(Y=2\) | \(Y=3\) |
|-------|---------|---------|---------|---------|
| 0     | 0.07    | 0.09    | 0.06    | 0.01    |
| 1     | 0.07    | 0.06    | 0.07    | 0.01    |
| 2     | 0.06    | 0.07    | 0.14    | 0.03    |
| 3     | 0.02    | 0.04    | 0.16    | 0.04    |

  1. What is the probability that a student has more than two drinks in an evening?
  2. What is the probability that a student has more drinks than the number of TV shows they watch?
  3. What’s the conditional distribution of the number of TV shows watched given they consume \(3\) drinks?
  4. What’s the expected number of drinks given they do not watch TV?
  5. Are drinking and watching TV independent?

Exercise 5 (Conditional probability) Shipments from an online retailer take between 1 and 7 days to arrive, depending on where they ship from, when they were ordered, the size of the item, etc. Suppose the distribution of delivery times has the following distribution function:

| \(x\)                  | 1    | 2    | 3    | 4    | 5    | 6    | 7 |
|------------------------|------|------|------|------|------|------|---|
| \(\mbox{P}(X = x)\)    |      |      |      |      |      |      |   |
| \(\mbox{P}(X \leq x)\) | 0.10 | 0.20 | 0.70 | 0.75 | 0.80 | 0.90 | 1 |

  1. Fill in the above probability table.
  2. What is the conditional probability of a delivery arriving on day four given that it did not arrive in the first three days? (Hint: find \(P(X = 4 \mid X \geq 4)\))
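
The step from a distribution function to point probabilities can be sketched in R as follows. The CDF values here are hypothetical and deliberately different from the table above, and the names `cdf`, `pmf` and `p_cond` are assumptions for illustration.

```r
# Hypothetical CDF values P(X <= x) for x = 1, 2, 3, 4
cdf <- c(0.2, 0.5, 0.9, 1.0)

# P(X = x) = F(x) - F(x - 1), taking F(0) = 0
pmf <- diff(c(0, cdf))

# Conditional probability P(X = 3 | X >= 3) = P(X = 3) / P(X >= 3)
p_cond <- pmf[3] / sum(pmf[3:4])
```

With these made-up numbers the point probabilities are 0.2, 0.3, 0.4 and 0.1, so the conditional probability works out to 0.4 / 0.5.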

Exercise 6 (Joint and Conditional) A cable television company has \(10000\) subscribers in a suburban community. The company offers two premium channels, HBO and Showtime. Suppose \(2750\) subscribers receive HBO and \(2050\) receive Showtime and \(6200\) do not receive any premium channel.

  1. What is the probability that a randomly selected subscriber receives both HBO and Showtime?
  2. What is the probability that a randomly selected subscriber receives HBO but not Showtime?

You now obtain a new dataset, categorized by gender, on the proportions of people who watch HBO and Showtime, given below.

| Cable    | Female | Male |
|----------|--------|------|
| HBO      | 0.14   | 0.48 |
| Showtime | 0.17   | 0.21 |

  1. Conditional on being female, what’s the probability you receive HBO?
  2. Conditional on being female, what’s the probability you receive Showtime?

Exercise 7 (Conditionals and Expectations) The following probability table describes the daily sales volume, \(X\), in thousands of dollars for a salesperson and the number of years \(Y\) of sales experience at a particular company.

| \(X\) | \(Y=1\) | \(Y=2\) | \(Y=3\) | \(Y=4\) |
|-------|---------|---------|---------|---------|
| 10    | 0.14    | 0.03    | 0.03    | 0       |
| 20    | 0.05    | 0.10    | 0.12    | 0.07    |
| 30    | 0.10    | 0.06    | 0.25    | 0.05    |

  1. Verify that this is a legal probability table.
  2. What is the probability of at least two years experience?
  3. Calculate the mean daily sales volume.
  4. Given a salesperson has three years experience, calculate the mean daily sales volume.
  5. A salesperson is paid $1000 per week plus 2% of total sales. What is the expected compensation for a salesperson?

Exercise 8 (Expectation) True or false: \(E(X+Y) = E(X) + E(Y)\) only if the random variables \(X\) and \(Y\) are independent.

Exercise 9 (Conditional Probability) A supermarket carried out a survey and found the following probabilities for people who buy generic products depending on whether they visit the store frequently or not.

| Visit      | Generic: Often | Generic: Sometimes | Generic: Never |
|------------|----------------|--------------------|----------------|
| Frequent   | 0.10           | 0.50               | 0.17           |
| Infrequent | 0.03           | 0.05               | 0.15           |

  1. What is the probability that a customer who never buys generics visits the store?
  2. What is the probability that a customer often purchases generics?
  3. Are buying generics and visiting the store independent decisions?
  4. What is the conditional distribution of purchasing generics given that you frequently visit the store?

Exercise 10 (Conditional Probability) Cooper Realty is a small real estate company located in Albany, New York, specializing primarily in residential listings. They have recently become interested in determining the likelihood of one of their listings being sold within a certain number of days. An analysis of recent company sales of 800 homes produced the following table:

| Asking price | Under 20 days | 31-90 days | Over 90 days | Total |
|--------------|---------------|------------|--------------|-------|
| Under $50K   | 50            | 40         | 10           | 100   |
| $50-$100K    | 20            | 150        | 80           | 250   |
| $100-$150K   | 20            | 280        | 100          | 400   |
| Over $150K   | 10            | 30         | 10           | 50    |

  1. Estimate the probability that a home is listed for over 90 days before being sold.
  2. Estimate the probability that the initial asking price is under $50K.
  3. What is the probability of both of the above happening? Are these two events independent?
  4. Assuming that a contract has just been signed to list a home that has an initial asking price of less than $100K, what is the probability that the home will take Cooper Realty more than 90 days to sell?

Exercise 11 (Probability and Combinations.) In 2006, the St. Louis Cardinals and the Detroit Tigers played in the World Series. The two teams play up to seven games, and the first team to win four games wins the World Series.

The Cardinals were leading the series 3 – 1. Given that each game is independent of another and that the probability of the Cardinals winning any single game is 0.55, what’s the probability that they would go on to win the World Series?

In 2012, the St. Louis Cardinals found themselves in a similar situation against the San Francisco Giants in the National League Championships. Now suppose that the probability of the Cardinals winning any single game is 0.45.

How does the probability that they get to the World Series differ from before?

Exercise 12 (Probability and Lotteries) The Powerball lottery is open to participants across several states. When entering the Powerball lottery, a participant selects five numbers from 1-59 and then selects a Powerball number from 1-35. In addition, there’s a $1 million payoff for anybody selecting the first five numbers correctly.

  1. Show that the odds of winning the Powerball Jackpot are 1 in 175,223,510.
  2. Show that the odds of winning the $1 million are 1 in 5,153,632.
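
The counting behind these two figures can be checked numerically with base R's `choose()`. This is a sketch under one stated assumption: the $1 million prize is taken to mean matching the five numbers while missing the Powerball, which is the reading consistent with the figure quoted above. The variable names are illustrative.

```r
# Number of ways to choose 5 distinct numbers out of 59 (order does not matter)
n_five <- choose(59, 5)

# Jackpot: the five numbers AND one specific Powerball out of 35
n_jackpot <- n_five * 35

# Assumed reading of the $1 million prize: five numbers correct, Powerball wrong.
# 34 of the 35 Powerball outcomes qualify, so the odds are 1 in n_jackpot / 34.
odds_five_only <- n_jackpot / 34
```

The same two-line pattern, `choose(n, k)` times the number of bonus-ball choices, applies to any lottery of this shape.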

On February 18, 2006 the Jackpot reached $365 million. Assuming that you will either win the Jackpot or the $1 million prize, what’s your expected value of winning?

Mega Millions is a similar lottery where you pick 5 balls out of 56 and a powerball from 46. Show that the odds against winning Mega Millions are higher than for the Powerball lottery. On March 30, 2012 the Jackpot reached $656 million. Is your expected value higher or lower than that calculated for the Powerball lottery?

Exercise 13 (Joint Probability) A market research survey finds that in a particular week \(28\%\) of all adults watch a financial news television program; \(17\%\) read a financial publication and \(13\%\) do both.

  1. Fill in the blanks in the following joint probability table

|              | Watches TV | Doesn’t Watch | Total |
|--------------|------------|---------------|-------|
| Reads        | 0.13       |               | 0.17  |
| Doesn’t Read |            |               |       |
| Total        | 0.28       |               | 1.00  |

  2. What is the probability that someone who watches a financial TV program reads a publication oriented towards finance?
  3. What is the probability that someone who reads a finance publication watches a financial TV program?
  4. Why aren’t the answers to the above questions equal?

Exercise 14 (Conditional Probability.) A local bank is reviewing its credit card policy. In the past 5% of card holders have defaulted. The bank further found that the chance of missing one or more monthly payments is 0.20 for customers who do not default. Of course, the probability of missing one or more payments for those who default is 1.

  1. Given that a customer has missed a monthly payment, compute the probability that the customer will default.
  2. The bank would like to recall its card if the probability that a customer will default is greater than 0.20. Should the bank recall its card if the customer misses a monthly payment? Why or why not?

Exercise 15 (Correlation) The following table shows the descriptive statistics from \(1000\) days of returns on IBM and Exxon’s stock prices.

            N   Mean    StDev    SE Mean
IBM      1000   0.0009   0.0157  0.00049 
Exxon    1000   0.0018   0.0224  0.00071

Here is the covariance table

          IBM     Exxon
IBM     0.000247
Exxon   0.000068  0.00050
  1. What is the variance of IBM returns?
  2. What is the correlation between IBM and Exxon’s returns?
  3. Consider a portfolio that invests \(50\)% in IBM and \(50\)% in Exxon. What are the mean and variance of the portfolio? Do you prefer this portfolio to just investing in IBM on its own?

Exercise 16 (Normal Distribution) After Facebook’s earnings announcement we have the following distribution of returns. First, the stock beats earnings expectations \(75\)% of the time, and the other \(25\)% of the time earnings are in line or disappoint. Second, when the stock beats earnings, the percent change is normally distributed with a mean of \(10\)% and a standard deviation of \(5\)%; when the stock misses earnings, it is normally distributed with a mean of \(-5\)% and a standard deviation of \(8\)%.

  1. Ahead of the earnings announcement, what is the probability that Facebook stock will have a return greater than \(5\)%?
  2. Do you get the same answer for the probability that it drops at least \(5\)%?
  3. Use simulation with a sample of size N = 10,000 to provide empirical answers, and check how close you get to the theoretical answers to the questions posed above. Provide histograms of the distributions you simulate.

Exercise 17 (Probability) Answer the following statements TRUE or FALSE, providing a succinct explanation of your reasoning.

  1. If the odds in favor of \(A\) are 3:5 then \(\mbox{P}(A) = 0.4\).
  2. You roll two fair three-sided dice. The probability the two dice show the same number is 1/4.
  3. If events \(A\) and \(B\) are independent and \(\mbox{P}(A) > 0\) and \(\mbox{P}(B)>0\), then \(\mbox{P}(A \mbox{ and } B) > 0\).
  4. If \(\mbox{P}(A \; \text{ and} \; B) \geq 0.5\) then \(P(A) \leq 0.5\).
  5. If two random variables have non-zero correlation, then they must be dependent.
  6. If two random variables have zero correlation, then they must be independent.
  7. If two random variables are independent, then the correlation between them must be zero.
  8. If \(P(A \text{ and } B) \leq 0.2\), then \(P(A) \leq 0.2\).

Exercise 18 (Binomial Distribution) The Downhill Manufacturing company produces snowboards. The average life of their product is \(10\) years. A snowboard is considered defective if its life is less than \(5\) years. The distribution is approximately normal with a standard deviation for the life of a board of \(3\) years.

  1. What’s the probability of a snowboard being defective?
  2. In a shipment of \(120\) snowboards, what is the probability that the number of defective boards is greater than \(10\)?
  3. Use simulation with a sample of size N = 10,000 to provide empirical answers, and check how close you get to the theoretical answers to the questions posed above. Provide histograms of the distributions you simulate.

You can use R and simulation with rbinom, rnorm as an alternative
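
A minimal sketch of how `rnorm` and `rbinom` can be used to check a theoretical probability by simulation is given below. All numbers (mean 8, standard deviation 2, threshold 6, batch size 50) are hypothetical and chosen to differ from the exercise; the variable names are assumptions for illustration.

```r
set.seed(42)
n_sim <- 10000

# Hypothetical lifetime model: Normal(mean = 8, sd = 2); "short-lived" means below 6
life <- rnorm(n_sim, mean = 8, sd = 2)
p_hat <- mean(life < 6)                   # empirical proportion
p_theory <- pnorm(6, mean = 8, sd = 2)    # exact normal probability

# Number of short-lived items in hypothetical batches of 50
counts <- rbinom(n_sim, size = 50, prob = p_theory)
tail_hat <- mean(counts > 10)                             # empirical tail probability
tail_theory <- 1 - pbinom(10, size = 50, prob = p_theory) # exact binomial tail

# hist(life); hist(counts)   # histograms of the simulated distributions
```

With 10,000 draws the empirical and theoretical values should typically agree to about two decimal places.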

Exercise 19 (Chinese Stock Market) On August 24th, 2015, Chinese equities ended down \(-8.5\)% (Black Monday). Over the last \(25\) years, the average return is \(0.09\)% with a volatility of \(2.6\)%, and \(56\)% of the time the market closes within one standard deviation. For the S&P 500, the average is \(0.03\)% with a volatility of \(1.1\)%, and \(74\)% of the time it closes within one standard deviation.

Economist article, August 2015.

Exercise 20 (Body Weight) Suppose that your model for weight \(X\) is a normal distribution with mean \(190\) lbs and variance \(100\) lbs\(^2\). What proportion of people have weights over 200 lbs?

Exercise 21 (Google Returns) We estimated the sample mean and sample variance of daily returns of Google stock as \(\bar x = 0.025\) and \(s^2 = 1.1\). If I want to calculate the probability that I lose \(3\)% in a day, I need to assume a probabilistic model of the return and then calculate \(p(r < -3)\). Say we assume that returns are normally distributed, \(r \sim N( \mu , \sigma^2 )\). Estimate the parameters of the distribution from the observed data and calculate \(p(r<-3) = 0.003\).

Exercise 22 (Portfolio Means, Standard Deviations and Correlation) Suppose you have a portfolio that is invested with a weight of 75% in the U.S. and 25% in HK. You take a sample of 10 years, or 120 months of historical means, standard deviations and correlations for U.S. and Hong Kong stock market returns. Given this information compute the mean and standard deviation of the returns on your portfolio.

|           | N   | Mean   | StDev  |
|-----------|-----|--------|--------|
| Hong Kong | 120 | 0.0170 | 0.0751 |
| US        | 120 | 0.0115 | 0.0330 |

Correlation = 0.3

Hint: you will find the following formulas useful. Let \(R_p\) denote the return on your portfolio which is a weighted combination \(R_p = pX + (1 - p)Y\) . Then \[ E(R_p) = p\mu_X + (1 - p)\mu_Y \] \[ Var(R_p) = p^2\sigma_X^2 + (1 - p)^2\sigma_Y^2 + 2p(1-p)\rho \sigma_X \sigma_Y \] where \(\mu_X\), \(\mu_Y\) and \(\sigma_X\), \(\sigma_Y\) are the underlying means and standard deviations for \(X\) and \(Y\).
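
The two formulas in the hint translate directly into a small R helper. The function name `portfolio_stats` and the input values below are hypothetical, chosen for illustration rather than taken from the table.

```r
# Mean and standard deviation of R_p = p*X + (1 - p)*Y,
# given means, standard deviations and the correlation rho of X and Y
portfolio_stats <- function(p, mu_x, mu_y, sd_x, sd_y, rho) {
  mean_p <- p * mu_x + (1 - p) * mu_y
  var_p <- p^2 * sd_x^2 + (1 - p)^2 * sd_y^2 +
    2 * p * (1 - p) * rho * sd_x * sd_y
  c(mean = mean_p, sd = sqrt(var_p))
}

# Hypothetical inputs: equal weights, uncorrelated assets
res <- portfolio_stats(p = 0.5, mu_x = 0.01, mu_y = 0.02,
                       sd_x = 0.10, sd_y = 0.20, rho = 0)
```

Note that the formula returns a variance, so the square root is needed to report a standard deviation.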

Exercise 23 (Binomial) In the game Chuck-a-Luck you pick a number from 1 to 6. You roll three dice. If your number doesn’t appear on any dice, you lose $1. If your number appears exactly once, you win $1. If your number appears on exactly two dice, you win $2. If your number appears on all three dice, you win $3.

Hence every outcome determines how much you win or lose on the game, namely \(-1, 1, 2\) or \(3\).

  1. Fill in the blanks in the pdf and cdf values

| X    | -1 | 1 | 2 | 3 |
|------|----|---|---|---|
| P(X) |    |   |   |   |
| F(X) |    |   |   |   |

Explain your reasoning carefully.

  2. Compute the expected value of the game, \(E(X)\).

Exercise 24 (Binomial Distribution) A real estate firm in Florida offers a free trip to Florida for potential customers. Experience has shown that of the people who accept the free trip, 5% decide to buy a property. If the firm brings \(1000\) people, what is the probability that at least \(125\) will decide to buy a property?

Exercise 25 (Expectation and Strategy) An oil company wants to drill in a new location. A preliminary geological study suggests that there is a \(20\)% chance of finding a small amount of oil, a \(50\)% chance of a moderate amount and a \(30\)% chance of a large amount of oil. The company has a choice of either a standard drill that simply burrows deep into the earth or a more sophisticated drill that is capable of horizontal drilling and can therefore extract more but is far more expensive. The following table provides the payoff table in millions of dollars under different states of the world and drilling conditions

| Oil                 | small | moderate | large |
|---------------------|-------|----------|-------|
| Standard Drilling   | 20    | 30       | 40    |
| Horizontal Drilling | -20   | 40       | 80    |

Find the following

  1. The mean and variance of the payoffs for the two different strategies
  2. The strategy that maximizes their expected payoff
  3. Briefly discuss how the variance of the payoffs would affect your decision if you were risk averse
  4. How much are you willing to pay for a geological evaluation that would tell you with certainty the quantity of oil at the site prior to drilling?

Exercise 26 (Google Survey) Visitors to your website are asked to answer a single Google survey question before they get access to the content on the page. Among all of the users, there are two categories:

  1. Random Clicker (RC)
  2. Truthful Clicker (TC)

There are two possible answers to the survey: yes and no.

Random clickers click either answer with equal probability. You are also given the information that the expected fraction of random clickers is \(0.3\).

After a trial period, you get the following survey results. \(65\)% said Yes and \(35\)% said No.

What fraction of the truthful clickers answered yes?

Computing

Exercise 27 (Portfolio Means, Standard Deviations and Correlation) You want to build a portfolio of exchange traded funds (ETFs) for your retirement strategy. You’re thinking of whether to invest in growth or value stocks, or maybe a combination of both. Vanguard has two ETFs, one for growth (VUG) and one for value (VTV).

  1. Plot the historical price series for VUG vs VTV.
  2. Calculate the means and standard deviations of both ETFs.
  3. Calculate their covariance.
  4. Suppose you decide on a portfolio that is a 50 / 50 split. Calculate the new mean and variance of your portfolio.
  5. Which portfolio best suits you?
  6. What’s the probability that growth (VUG) will beat value (VTV) in the future?

You will find the following formulas useful. Let \(P\) denote the return on your portfolio which is a weighted combination \(P = aX + bY\). Then \[ E(P) = aE(X) + bE(Y ) \] \[ Var(P ) = a^2Var(X) + b^2Var(Y ) + 2abCov(X, Y ), \] where \(Cov(X, Y )\) is the covariance for \(X\) and \(Y\).

Hint: You can use the following code to get the data

library(quantmod)
getSymbols(c("VUG","VTV"), from = "2015-01-01", to = "2024-01-01")
VUG = VUG$VUG.Adjusted
VTV = VTV$VTV.Adjusted
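
Once adjusted prices are available, returns and the portfolio formulas above can be computed with base R. The sketch below uses short hypothetical price vectors `px` and `py` instead of downloaded data, so it runs without a network connection; the helper `simple_returns` is an assumption for illustration and not a quantmod function.

```r
# Hypothetical adjusted closing prices for two assets
px <- c(100, 102, 101, 105, 107)
py <- c(50, 50.5, 51, 50, 52)

# Simple period-over-period returns: (p_t - p_{t-1}) / p_{t-1}
simple_returns <- function(p) diff(p) / head(p, -1)
rx <- simple_returns(px)
ry <- simple_returns(py)

# Portfolio P = a*X + b*Y with a 50 / 50 split
a <- 0.5
b <- 0.5
port_mean <- a * mean(rx) + b * mean(ry)
port_var <- a^2 * var(rx) + b^2 * var(ry) + 2 * a * b * cov(rx, ry)

# Cross-check against the returns of the combined portfolio
rp <- a * rx + b * ry
```

The cross-check works because the sample mean and sample variance satisfy the same linear-combination identities as their population counterparts.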

Exercise 28 (Descriptive Statistics in R) Use the superbowl1.txt and derby2016.csv datasets. The Superbowl dataset contains the outcomes of all previous Superbowls. The outcome is defined as the difference in scores of the favorite minus the underdog. The spread is the bookmakers’ prediction of the outcome before the game begins. The Derby data consists of all of the results of the Kentucky Derby, which is run on the first Saturday in May every year at Churchill Downs racetrack. Answer the following questions.

For the Superbowl data.

  1. Plot the spread and outcome variables. Calculate means, standard deviations, covariances, correlations.
  2. What is the mean and the standard deviation of the winning margin (outcome)?
  3. Use a boxplot to compare the favorites’ score versus the underdog.
  4. Does this data look normally distributed?

For the Derby data.

  1. Plot a histogram of the winning speeds and times of the horses. Why is there a long right-hand tail to the distribution of times?
  2. Can you identify the outlying horse with the best winning time?

Exercise 29 (Berkshire Hathaway: Yahoo Finance Data) Download daily return data for Warren Buffett’s firm Berkshire Hathaway (ticker symbol: BRK-A) from 1990 to the present. Analyze this data in the following way:

  1. Plot the Historical Price Performance of the stock.
  2. Calculate the Daily returns. Plot a histogram of the returns. Comment on the distribution that you obtain.
  3. Use the summary command to provide statistical data summaries.
  4. Interpret your findings.

Exercise 30 (Confidence Intervals) A sample of weights of 40 rainbow trout revealed that the sample mean is 402.7 grams and the sample standard deviation is 8.8 grams.

  1. What is the estimated mean weight of the population?
  2. What is the 99% confidence interval for the mean?

Exercise 31 (Confidence Intervals) A research firm conducted a survey to determine the mean amount steady smokers spend on cigarettes during a week.

A sample of 49 steady smokers revealed that \(\bar{X} = 20\) and \(s = 5\) dollars.

  1. What is the point estimate? Explain what it indicates.
  2. Using a 95% confidence interval, determine the confidence interval for \(\mu\). Explain what it indicates.
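
A confidence interval for a mean from summary statistics can be sketched as below. This version uses the \(t\) distribution with \(n - 1\) degrees of freedom, which is one common choice when the population standard deviation is unknown; a normal-quantile version would replace `qt` with `qnorm`. The function name `t_interval` and the inputs are hypothetical.

```r
# Two-sided confidence interval for a mean, from summary statistics
t_interval <- function(xbar, s, n, level = 0.95) {
  alpha <- 1 - level
  half_width <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
  c(lower = xbar - half_width, upper = xbar + half_width)
}

# Hypothetical sample: n = 25, mean 10, standard deviation 2
ci <- t_interval(xbar = 10, s = 2, n = 25)
```

Raising `level` widens the interval, since a larger quantile multiplies the same standard error.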

Exercise 32 (Back cast US Presidential Elections) Use data from presidential polls to predict the winner of the elections. We will be using data from http://www.electoral-vote.com/. The goal is to use simulations to predict the winning percentage for each of the candidates. Use election.Rmd script as the starter.

Report prediction as a 50% confidence interval for each of the candidates.

Exercise 33 (Russian Parliament Election Fraud (5 pts)) On September 18, 2016 the United Russia party won a supermajority of seats, which allows them to change the Constitution without any votes from other parties. Throughout the day there were reports of voting fraud, including video purporting to show officials stuffing ballot boxes. Additionally, results in many regions show that United Russia got anomalously close results at many poll stations, for example, 62.2% in more than a hundred poll stations in Saratov Region.

Assuming that United Russia’s range in Saratov was [57.5%, 67.5%] and that results for each poll station are rounded to one decimal place (when measured in percent), calculate the probability that in 100 poll stations out of 1800 in Saratov Region the majority party got exactly 62.2%.

Do you think it can happen by chance?

Exercise 34 (A/B Testing) Use dataset from ab_browser_test.csv. Here is the definition of the columns:

  • userID: unique user ID
  • browser: browser which was used by userID
  • slot: status of the user (exp = saw modified page, control = saw unmodified page)
  • n_clicks: total number of clicks the user made as a result of n_queries
  • n_queries: number of queries made by userID, who used browser browser
  • n_nonclk_queries: number of queries that did not result in any clicks

Note that not everyone uses a single browser, so there might be multiple rows with the same userID. In this data set the combination of userID and browser is the unique row identifier.

  1. Count how many users are in each group. How much larger (in percent) is the exp group compared to the control group?
  2. Using the bootstrap, construct 95% confidence intervals for the mean and median number of clicks in group exp and group control. Are the mean and median significantly different?
  3. Using the bootstrap, check whether the mean of each group has a normal distribution. Generate \(B = 1000\) bootstrap samples, calculate the mean of each and plot a Q-Q plot.
  4. Use the z-ratio for the means to perform hypothesis testing, with \(H_0\): there is no difference in the average number of clicks between the 2 groups.
  5. Mann-Whitney (http://www.statmethods.net/stats/nonparametric.html) is another test for comparing means that does not require a normality assumption. Use this test to check the hypothesis that the means are equal.
  6. For each browser type and each of the 2 groups (control and exp), compute the percent of queries that did not result in any clicks. You can do this by dividing the sum of n_nonclk_queries by the sum of n_queries. Comment on your results.
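
The bootstrap steps mentioned in the list can be sketched on synthetic data as follows. The vector `clicks` is simulated from a Poisson distribution purely for illustration (it is not the contents of ab_browser_test.csv), and the interval shown is the simple percentile bootstrap, one of several possible variants.

```r
set.seed(123)

# Synthetic stand-in for one group's click counts (hypothetical data)
clicks <- rpois(500, lambda = 3)

# Resample with replacement B times and record the mean of each resample
B <- 1000
boot_means <- replicate(B, mean(sample(clicks, replace = TRUE)))

# Percentile bootstrap 95% confidence interval for the mean
ci_mean <- quantile(boot_means, c(0.025, 0.975))

# Normality check of the bootstrap means:
# qqnorm(boot_means); qqline(boot_means)
```

Replacing `mean` with `median` inside `replicate` gives the corresponding interval for the median.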

Exercise 35 (Chicago Crime Data Analysis) On January 24, 2017 Donald Trump tweeted about the "horrible" murder rate in Chicago.

Trump Tweets

Our goal is to analyze the data and check how statistically significant such a statement is. I downloaded Chicago’s crime data from the data portal: data.cityofchicago.org. This data contains reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data set has 6.3 million records. Each crime incident is categorized using one of the 35 primary crime types: NARCOTICS, THEFT, CRIMINAL TRESPASS, etc. I filtered incidents of type HOMICIDE into a separate data set stored in chi_homicide.rds. Use chi_crime.R as a starting script for this problem.

  1. Create a heat map for the homicide incidents. In which areas of the city do you think houses are very affordable and in which are they not?
  2. Create a map by plotting a dot for each of the homicide incidents. You will see a similar picture to the one you saw with the heat map. Look at the Hyde Park area on the south side of Chicago. There is an "island" with no homicide incidents! Can you explain why? Hint: You might want to open Google Maps in your browser and zoom in on this area.
  3. Though the president’s tweet is consistent with the data (goo.gl/VTPzFw), observing 52 homicides in January is not that unusual. Calculate the total number of homicides for each January. Use the bootstrap to estimate a 95% confidence interval for the mean \(\mu\) over January homicides. Is \(52\) within the interval? Calculate the confidence interval using the \(t\)-ratio. Do you think results from \(t\)-ratio based calculations are reliable?
  4. The history of 2001-present data is rather short. The Chicago Tribune provided the total number of homicides for Chicago for each month of the 1957-2014 period. Use this data set and calculate the confidence interval for \(\mu\) using the bootstrap and the \(t\)-ratio. Further answer the following questions: (i) Assuming the monthly homicide rate follows a Normal distribution, what is the probability that we observe 52 homicides or more? (ii) Do you think the Normality assumption is valid? (iii) Assuming the monthly homicide rate follows a Poisson distribution, what is the probability that we observe 52 homicides or more?
  5. There is a hypothesis that crime rates are related to temperatures (goo.gl/nPpHwv). Check this hypothesis using simple regression. Use a linear model to regress the homicide rate on the average maximum temperature. Does this relation appear significant? Perform residual diagnostics and find outliers and leverage points.
  6. There is another hypothesis that the rise in murders is related to the pullback in proactive policing that started in November of 2015 as a result of the Laquan McDonald video release (https://goo.gl/7cm1CC, https://goo.gl/WcH2uB). I calculated the total number of homicides for each day and split the data into two parts: before and after the video release. Using the \(t\)-ratio, check the hypothesis \(H_0\): the homicide rate did not change after the video release.

Exercise 36 (Gibbs Sampler) Suppose we model the data using the following model \[\begin{aligned} y_i \sim &N(\mu,\tau^{-1})\\ \mu \sim & N(0,1)\\ \tau \sim & \mathrm{Gamma}(2,1).\end{aligned}\]

The goal is to implement a Gibbs sampler for the posterior \(\mu,\tau \mid y\), where \(y = (y_1,\ldots,y_n)\) is the observed data. The Gibbs sampler iterates between two steps:

  1. Sample \(\mu_i\) from \(\mu \mid \tau_{i-1}, y\)
  2. Sample \(\tau_i\) from \(\tau \mid \mu_i, y\)

Show that those full conditional distributions are given by \[ \begin{aligned} \mu \mid \tau, y \sim & N\left(\dfrac{\tau n\bar y}{1+n\tau},\dfrac{1}{1+n\tau}\right)\\ \tau \mid \mu,y \sim & \mathrm{Gamma}\left(2+\dfrac{n}{2}, 1+\dfrac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right)\end{aligned} \]

Use formulas for full conditional distributions and implement the Gibbs sampler. The data \(y\) is in the file MCMCexampleData.txt.

Plot samples from the joint distribution over \((\mu,\tau)\) on a scatter plot. Plot histograms for marginal \(\mu\) and \(\tau\) (marginal distributions).
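
A minimal sketch of the sampler, assuming the full conditionals stated above and treating the second Gamma parameter as a rate, might look as follows. The data `y` are simulated here for illustration because the layout of MCMCexampleData.txt is not specified; the function name `gibbs`, the starting value `tau0` and the number of iterations are likewise assumptions.

```r
set.seed(1)

# Synthetic data standing in for the contents of MCMCexampleData.txt
y <- rnorm(50, mean = 1, sd = 0.5)

gibbs <- function(y, n_iter = 5000, tau0 = 1) {
  n <- length(y)
  ybar <- mean(y)
  mu <- numeric(n_iter)
  tau <- numeric(n_iter)
  tau_prev <- tau0
  for (i in seq_len(n_iter)) {
    # Step 1: mu | tau, y ~ N(tau*n*ybar / (1 + n*tau), 1 / (1 + n*tau))
    mu[i] <- rnorm(1,
                   mean = tau_prev * n * ybar / (1 + n * tau_prev),
                   sd = sqrt(1 / (1 + n * tau_prev)))
    # Step 2: tau | mu, y ~ Gamma(shape = 2 + n/2, rate = 1 + sum((y - mu)^2) / 2)
    tau[i] <- rgamma(1, shape = 2 + n / 2,
                     rate = 1 + 0.5 * sum((y - mu[i])^2))
    tau_prev <- tau[i]
  }
  data.frame(mu = mu, tau = tau)
}

draws <- gibbs(y)

# plot(draws$mu, draws$tau)            # joint samples
# hist(draws$mu); hist(draws$tau)      # marginal distributions
```

In practice one would usually discard an initial burn-in portion of `draws` before summarizing the posterior.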

Exercise 37 (AAPL vs GOOG) Download AAPL and GOOG return data from 2018 to 2024. Plot box-plot and histogram. Calculate summary statistics using summary function. Describe clearly what you learn from the summary and the plots.

Exercise 38 (Berkshire Realty) Berkshire Realty is interested in determining how long a property stays on the housing market. For a sample of \(800\) homes they find the following table of counts for length of stay on the market before being sold as a function of the asking price.

| Asking price | Under 20 days | 20-40 days | Over 40 days |
|--------------|---------------|------------|--------------|
| Under $250K  | 50            | 40         | 10           |
| $250-500K    | 20            | 150        | 80           |
| $500-1M      | 20            | 280        | 100          |
| Over $1M     | 10            | 30         | 10           |

  1. What is the probability that a randomly selected house is listed for over \(40\) days before being sold?
  2. What is the probability that a randomly selected initial asking price is under \(250\)K?
  3. What is the joint probability of both of the above events happening?
  4. Assuming that a contract has just been signed to list a home for under $500K, what is the probability that Berkshire Realty will sell the home in under \(40\) days?

Exercise 39 (True/False)

  1. If \(\mathbb{P} \left ( A \; \mathrm{ and} \; B \right ) \leq 0.2\) then \(\mathbb{P} (A) \leq 0.2\).

  2. If \(P( A | B ) = 0.5\) and \(P(B ) = 0.5\), then the events \(A\) and \(B\) are necessarily independent.

  3. A box has three drawers; one contains two gold coins, one contains two silver coins, and one contains one gold and one silver coin. Assume that one drawer is selected randomly and that a randomly selected coin from that drawer turns out to be gold. Then the probability that the chosen drawer contains two gold coins is \(50\)%.

  4. Suppose that \(P(A) = 0.4 , P(B)=0.5\) and \(P( A \cup B ) = 0.7\) then \(P( A \cap B ) = 0.3\).

  5. If \(P( A \cup B ) = 0.5\) and \(P(A \cap B ) = 0.5\), then \(P(A) = P( B)\).

  6. The following data on age and marital status of \(140\) customers of a Bondi beach night club were taken

    | Age      | Single | Not Single |
    |----------|--------|------------|
    | Under 30 | 77     | 14         |
    | Over 30  | 28     | 21         |

    Given this data, age and marital status are independent.

  7. If \(P( A \cap B ) = 0.5\) and \(P(A) = 0.1\), then \(P(B|A) = 0.1\).

  8. In a group of students, \(45\)% play golf, \(55\)% play tennis and \(70\)% play at least one of these sports. Then the probability that a student plays golf but not tennis is \(15\)%.

  9. The following probability table relates age to marital status

    | Age      | Single | Not Single |
    |----------|--------|------------|
    | Under 30 | 0.55   | 0.10       |
    | Over 30  | 0.20   | 0.15       |

    Given these probabilities, age and marital status are independent.

  10. Thirty six different kinds of ice cream can be found at Ben and Jerry’s. There are \(58,905\) different combinations of four choices of ice cream.

  11. Suppose that for a certain Caribbean island the probability of a hurricane is \(0.25\), the probability of a tornado is \(0.44\) and the probability of both occurring is \(0.22\). Then the probability of a hurricane or a tornado occurring is \(0.05\).

  12. If \(P ( A \cap B ) \geq 0.10\) then \(P(A) \geq 0.10\).

  13. If \(A\) and \(B\) are mutually exclusive events, then \(P(A|B) = 0\).

    Solution to statement 13: True. By definition, if \(A\) and \(B\) are mutually exclusive events then \(P( A \cap B)=0\), and so \(P(A|B) = P(A \cap B)/P(B) = 0\) whenever \(P(B) > 0\).

Exercise 40 (Marginal and Joint) Let \(X\) and \(Y\) be independent with a joint distribution given by \[f_{X,Y}(x,y) = \frac{1}{2 \pi} \sqrt{ \frac{1}{xy} } \exp \left( - \frac{x}{2} - \frac{y}{2} \right) \text{ where } x, y > 0 .\] Identify the following distributions

  1. The marginal distribution of \(X\)
  2. Compute the joint distribution of \(U = X\) and \(V = X+Y\)
  3. Compute the marginal distribution of \(V\).

Exercise 41 (Conditional)  

  1. Let \(X\) and \(Y\) be independent standard \(N(0,1)\) random variables. Show that \(X^2 + Y^2\) has an exponential distribution.
  2. Let \(X\) and \(Y\) be independent Poisson random variables with rates \(\lambda\) and \(\mu\), respectively. Show that the conditional distribution of \(X | (X+Y)\) is Binomial with \(n = X+Y\) and \(p = \lambda / ( \lambda + \mu)\).

Exercise 42 (Joint and marginal) Let \(X\) and \(Y\) be independent exponential random variables with means \(\lambda\) and \(\mu\), respectively.

  1. Find the joint distribution of \(U = X+Y\) and \(V = X / ( X+Y)\).
  2. Find the marginal distributions for \(U\) and \(V\).

Exercise 43 (toll road) You are designing a toll road that carries trucks and cars. Each week you see an average of 19,000 vehicles pass by. The current toll for cars is 50 cents and you wish to set the toll for trucks so that the revenue reaches $11,500 per week. You observe the following data: three of every four trucks on the road are followed by a car, while only one of every five cars is followed by a truck.

  1. What is the equilibrium distribution of trucks and cars on the road?
  2. What should you charge trucks so as to reach your goal of $11,500 in revenues per week?
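
One way to set up this exercise is as a two-state Markov chain over vehicle types. The sketch below assumes that the "followed by" proportions can be read as transition probabilities between consecutive vehicles (an interpretation, not something the exercise states outright). The names `P`, `stationary`, and `truck_toll` are illustrative.

```r
# Transition matrix between consecutive vehicles (rows: current, columns: next).
# Assumption: the "followed by" proportions are Markov transition probabilities.
P <- matrix(c(1/4, 3/4,    # truck -> truck, truck -> car
              1/5, 4/5),   # car   -> truck, car   -> car
            nrow = 2, byrow = TRUE,
            dimnames = list(c("truck", "car"), c("truck", "car")))

# The stationary distribution is the left eigenvector of P with eigenvalue 1.
ev <- eigen(t(P))
v <- Re(ev$vectors[, which.min(abs(ev$values - 1))])
stationary <- v / sum(v)
names(stationary) <- rownames(P)

# Weekly counts implied by 19,000 vehicles, and the truck toll that closes the
# gap between car revenue (50 cents each) and the $11,500 target.
vehicles <- 19000 * stationary
truck_toll <- unname((11500 - 0.50 * vehicles["car"]) / vehicles["truck"])

print(round(stationary, 4))
print(truck_toll)
```

Solving the balance equation by hand (trucks leaving the truck state must equal cars entering it) should give the same proportions and is a useful check on the eigenvector approach.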

Exercise 44 Let \(X, Y\) have a bivariate normal distribution with density \[f_{X,Y} ( x, y) = \frac{1}{ 2 \pi \sqrt{ 1 - \rho^2 } } \exp \left ( - \frac{1}{2 ( 1 - \rho^2 )} \left ( x^2 - 2 \rho x y + y^2 \right ) \right )\] Consider the transformation \(W=X\) and \(Z = \frac{ Y - \rho X}{ \sqrt{ 1 - \rho^2 } }\). Show that \(W , Z\) are independent and identify their distributions.

Bayes Rule

Exercise 45 (Manchester) While watching a game of Champions League football in a cafe, you observe someone who is clearly supporting Manchester United in the game. What is the probability that they were actually born within 20 miles of Manchester? Assume that you have the following base rate probabilities

  1. The probability that a randomly selected person in a typical local bar environment was born within 20 miles of Manchester is 1/20.
  2. The chance that a person born within 20 miles of Manchester actually supports United is 7/10.
  3. The probability that a person not born within 20 miles of Manchester supports United is 1/10.

Exercise 46 (Lung Cancer) According to the Centers for Disease Control and Prevention (CDC), “compared to nonsmokers, men who smoke are about \(23\) times more likely to develop lung cancer and women who smoke are about \(13\) times more likely.” You are also given the information that \(17.9\)% of women smoked in 2016.

If you learn that a woman has been diagnosed with lung cancer, and you know nothing else, what’s the probability she is a smoker?

Exercise 47 (Tesla Chip) Tesla purchases a particular chip called the HW 5.0 auto chip from three suppliers: Matsushita Electric, Philips Electronics, and Hitachi. From historical experience we know that 30% of the chips are purchased from Matsushita, 20% from Philips, and the remaining 50% from Hitachi. The manufacturer has extensive histories on the reliability of the chips. We know that 3% of the Matsushita chips, 5% of the Philips chips, and 4% of the Hitachi chips are defective.

A chip is later found to be defective. What is the probability that it came from each of the three suppliers?
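
Problems of this kind reduce to multiplying a prior by a likelihood and renormalizing. A minimal sketch in R, using the purchase shares and defect rates given in the exercise (the helper name `bayes_posterior` is illustrative, not from the text):

```r
# Posterior over hypotheses from a prior and likelihoods (Bayes' rule).
bayes_posterior <- function(prior, likelihood) {
  joint <- prior * likelihood   # P(H_i) * P(E | H_i)
  joint / sum(joint)            # divide by P(E), the total probability
}

prior  <- c(Matsushita = 0.30, Philips = 0.20, Hitachi = 0.50)  # purchase shares
defect <- c(Matsushita = 0.03, Philips = 0.05, Hitachi = 0.04)  # P(defective | supplier)

posterior <- bayes_posterior(prior, defect)
print(round(posterior, 3))
```

The same helper applies to any exercise in this section with a finite set of hypotheses, for example by passing a two-element prior for "diseased" and "healthy".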

Exercise 48 (Light Aircraft) Seventy percent of the light aircraft that disappear while in flight in a certain country are subsequently discovered. Of the aircraft that are discovered, 60% have an emergency locator, whereas 90% of the aircraft not discovered do not have such a locator. Suppose that a light aircraft has disappeared. If it has an emergency locator, what is the probability that it will be discovered?

Exercise 49 (Floyd Landis) Floyd Landis was disqualified after winning the 2006 Tour de France. This was due to a urine sample that the French national anti-doping laboratory flagged after Landis had won stage \(17\) because it showed a high ratio of testosterone to epitestosterone. Because he was among the leaders he provided \(8\) pairs of urine samples – so there were \(8\) opportunities for a true positive and \(8\) opportunities for a false positive.

  1. Assume that the test has a specificity of \(95\)%. What’s the probability of all \(8\) samples being labeled “negative”?
  2. Now assume the specificity is \(99\)%. What’s the false positive rate for the test?
  3. Based on this data, explain how you would assess the probability of guilt of Landis in a court hearing.

Exercise 50 (BRCA1) Approximately \(1\)% of women aged \(40-50\) have breast cancer. A woman with breast cancer has a \(90\)% chance of a positive mammogram test, and a woman without breast cancer has a \(10\)% chance of a false positive result. Given that someone has a positive test, what’s the posterior probability that they have breast cancer?

If you have the BRCA1 gene mutation, you have a \(90\)% chance of developing breast cancer. The prevalence of mutations at BRCA1 has been estimated to be 0.04%-0.20% in the general population. The genetic test for a mutation of this gene has a \(99.9\)% chance of finding the mutation. The false positive rate is unknown, but you are willing to assume it is still \(10\)%.

Given that someone tests positive for the BRCA1 mutation, what’s the posterior probability that they have breast cancer?

Exercise 51 (Another Cancer Test) Bayes is particularly useful when predicting outcomes that depend strongly on prior knowledge.

Suppose that a woman in her forties takes a mammogram test and receives the bad news of a positive outcome. Since not every positive outcome is real, you assess the following probabilities. The base rate for a woman in her forties to have breast cancer is \(1.4\)%. The probability of a positive test given breast cancer is \(75\)% and the probability of a false positive is \(10\)%.

Given the positive test, what’s the probability that she has breast cancer?

Exercise 52 (Bayes Rule: Hit and Run Taxi) A certain town has two taxi companies: Blue Birds, whose cabs are blue, and Uber, whose cabs are black. Blue Birds has 15 taxis in its fleet, and Uber has 75. Late one night, there is a hit-and-run accident involving a taxi.

The town’s taxis were all on the streets at the time of the accident. A witness saw the accident and claims that a blue taxi was involved. The witness undergoes a vision test under conditions similar to those on the night in question. Presented repeatedly with a blue taxi and a black taxi, in random order, they successfully identify the colour of the taxi 4 times out of 5.

Which company is more likely to have been involved in the accident?

Exercise 53 (Gold and Silver Coins) A chest has two drawers. It is known that one drawer has \(3\) gold coins and no silver coins. The other drawer is known to contain \(1\) gold coin and \(2\) silver coins.

You don’t know which drawer is which. You randomly select a drawer and without looking inside you pull out a coin. It is gold. Show that the probability that the remaining two coins in the drawer are gold is \(75\)%.

Exercise 54 (The Monty Hall Problem.) This problem is named after the host of the long-running TV show, Let’s Make a Deal. A contestant is given a choice of 3 doors. There is a prize (a car, say) behind one of the doors and something worthless behind the other two doors (say two goats).

After the contestant chooses a door Monty opens one of the other two doors, revealing a goat. The contestant has the choice of switching doors. Is it advantageous to switch doors or not?
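
A simulation can be used to check an analytic answer. The sketch below assumes the standard rules, which the exercise leaves implicit: Monty knows where the prize is, always opens a goat door other than the contestant's pick, and chooses at random when two such doors are available. The seed and number of games are arbitrary.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
n <- 10000                                 # number of simulated games (illustrative)
prize <- sample(1:3, n, replace = TRUE)    # door hiding the car
pick  <- sample(1:3, n, replace = TRUE)    # contestant's first choice

# Monty opens a door that is neither the contestant's pick nor the prize.
opened <- sapply(seq_len(n), function(i) {
  goats <- setdiff(1:3, c(pick[i], prize[i]))
  # sample(x, 1) on a single number would sample from 1:x, so handle that case
  if (length(goats) == 1) goats else sample(goats, 1)
})

switch_to  <- 6 - pick - opened            # the one remaining door (labels sum to 6)
win_stay   <- mean(pick == prize)
win_switch <- mean(switch_to == prize)
print(c(stay = win_stay, switch = win_switch))
```

Changing the rule inside `sapply` (for example, letting Monty open a door without knowing where the prize is) is a quick way to see how much the answer depends on these assumptions.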

Exercise 55 (Medical Testing for HIV) A controversial issue in recent years has been the possible implementation of random drug and/or disease testing (e.g. testing medical workers for HIV, the virus that causes AIDS). In the case of HIV testing, the standard test is the Wellcome Elisa test.

The test’s effectiveness is summarized by the following two attributes:

  • The sensitivity is about 0.993. That is, if someone has HIV, there is a probability of 0.993 that they will test positive.
  • The specificity is about 0.9999. This means that if someone doesn’t have HIV, there is probability of 0.9999 that they will test negative.

In the general population, incidence of HIV is reasonably rare. It is estimated that the chance that a randomly chosen person has HIV is \(0.000025\).

To investigate the possibility of implementing a random HIV-testing policy with the Elisa test, calculate the following:

  1. The probability that someone will test positive and have HIV.
  2. The probability that someone will test positive and not have HIV.
  3. The probability that someone will test positive.
  4. Suppose someone tests positive. What is the probability that they have HIV?

In light of the last calculation, do you envision any problems in implementing a random testing policy?
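
The four quantities follow from the multiplication rule, the law of total probability, and Bayes' rule. A short sketch using the sensitivity, specificity, and prevalence quoted above (variable names are illustrative):

```r
sens <- 0.993       # P(test positive | HIV)
spec <- 0.9999      # P(test negative | no HIV)
prev <- 0.000025    # P(HIV) in the general population

p_pos_and_hiv    <- prev * sens                        # multiplication rule
p_pos_and_no_hiv <- (1 - prev) * (1 - spec)            # false positives
p_pos            <- p_pos_and_hiv + p_pos_and_no_hiv   # law of total probability
p_hiv_given_pos  <- p_pos_and_hiv / p_pos              # Bayes' rule

print(signif(c(pos_and_hiv = p_pos_and_hiv,
               pos_and_no_hiv = p_pos_and_no_hiv,
               pos = p_pos,
               hiv_given_pos = p_hiv_given_pos), 4))
```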

Exercise 56 (The Three Prisoners) Of three prisoners, A, B, and C, an unknown two will be shot and the other freed. Prisoner A asks the warder for the name of one other than himself who will be shot, explaining that as there must be at least one, the warder won’t really be giving anything away. The warder agrees, and says that B will be shot. This cheers A up a little: his judgmental probability for being shot is now 1/2 instead of 2/3.

Show (via Bayes theorem) that

  1. A is mistaken - assuming that he thinks the warder is as likely to say "C" as "B" when he can honestly say either; but that
  2. A would be right, on the hypothesis that the warder will say "B" whenever he honestly can.

Exercise 57 (The Two Children) You meet Max walking with a boy whom he proudly introduces as his son.

  1. What is your probability that his other child is also a boy, if you regard him as equally likely to have taken either child for a walk?
  2. What would the answer be if you regarded him as sure to walk with the boy rather than the girl, if he has one of each?
  3. What would the answer be if you regarded him as sure to walk with the girl rather than the boy, if he has one of each?

Exercise 58 (Medical Exam) During a medical examination, one of the tests indicated a serious illness in a person. The test is 99% accurate (the probability of a positive response in the presence of the disease is 99%, and the probability of a negative response in the absence of the disease is also 99%). However, the disease is quite rare and occurs in only one person per 10,000. Calculate the probability that the person being examined actually has the disease.

Exercise 59 (The Jury) Assume that the probability is 0.95 that a jury selected to try a criminal case will arrive at the correct verdict, whether the defendant is innocent or guilty. Further, suppose that 80% of people brought to trial are in fact guilty.

  1. Given that the jury finds a defendant innocent what’s the probability that they are in fact innocent?
  2. Given that the jury finds a defendant guilty what’s the probability that they are in fact guilty?
  3. Do these probabilities sum to one?

Exercise 60 (Oil company) An oil company has purchased an option on land in Alaska. Preliminary geologic studies have assigned the following probabilities of finding oil: \[ P ( \text{high quality oil} ) = 0.50 , \quad P ( \text{medium quality oil} ) = 0.20 , \quad P ( \text{no oil} ) = 0.30 \] After 200 feet of drilling on the first well, a soil test is taken. The probabilities of finding the particular type of soil identified by the test are as follows: \[ P ( \text{soil} \mid \text{high quality oil} ) = 0.20 , \quad P ( \text{soil} \mid \text{medium quality oil} ) = 0.80 , \quad P ( \text{soil} \mid \text{no oil} ) = 0.20 \]

  1. What are the revised probabilities of finding the three different types of oil?
  2. How should the firm interpret the soil test?

Exercise 61 A screening test for high blood pressure, corresponding to a diastolic blood pressure of \(90\)mm Hg or higher, produced the following probability table

Hypertension
Test Present Absent
+ve 0.09 0.03
-ve 0.03 0.85
  1. What’s the probability that a random person has hypertension?
  2. What’s the probability that someone tests positive on the test?
  3. Given a person who tests positive, what is the probability that they have hypertension?
  4. What would happen to your probability of having hypertension, given that you tested positive, if you initially thought you had a \(50\)% chance of having hypertension?

Exercise 62 (Steroids) Suppose that a hypothetical baseball player (call him “Rafael”) tests positive for steroids. The test has the following sensitivity and specificity

  1. If a player is on Steroids, there’s a \(95\)% chance of a positive result.
  2. If a player is clean, there’s a \(10\)% chance of a positive result.

A respected baseball authority (call him “Bud”) claims that \(1\)% of all baseball players use steroids. Another player (call him “Jose”) thinks that \(30\)% of all baseball players use steroids.

  1. What’s Bud’s probability that Rafael uses Steroids?
  2. What’s Jose’s probability that Rafael uses Steroids?

Explain any probability rules that you use.

Exercise 63 A Breathalyzer test is calibrated so that if it is used on a driver whose blood alcohol concentration exceeds the legal limit, it will read positive \(99\)% of the time, while if the driver is below the limit it will read negative \(90\)% of the time. Suppose that based on prior experience, you have a prior probability that the driver is above the legal limit of \(10\)%.

  1. If a driver tests positive, what is the posterior probability that they are above the legal limit?
  2. At Christmas \(20\)% of the drivers on the road are above the legal limit. If all drivers were tested, what proportion of those testing positive would actually be above the limit?
  3. How does your answer to part \(1\) change? Explain.

Exercise 64 (Chicago Bearcats) The Chicago Bearcats baseball team plays \(60\)% of its games at night and \(40\)% in the daytime. They win \(55\)% of their night games and only \(35\)% of their day games. You found out the next day that they won their last game.

  1. What is the probability that the game was played at night?
  2. What is the marginal probability that they will win their next game?

Explain clearly any rules of probability that you use.

Exercise 65 (Spam Filter) Several spam filters use Bayes rule. Suppose that you empirically find the following probability table for classifying emails with the phrase “buy now” in their title as either “spam” or “not spam”.

Spam Not Spam
“buy now” 0.02 0.08
not “buy now” 0.18 0.72
  1. What is the probability that you will receive an email with spam?
  2. Suppose that you are given a new email with the phrase “buy now” in its title. What is the probability that this new email is spam?
  3. Explain clearly any rules of probability that you use.

Exercise 66 (Chicago Cubs) The Chicago Cubs are having a great season. They’ve won \(72\) out of the \(100\) games played so far. You also have the expert opinion of Bob the sports analyst. He tells you that he thinks the Cubs will win. Historically his predictions have a \(60\)% chance of coming true.

  1. Calculate the probability that the Cubs will win given Bob’s prediction.
  2. Suppose you now learn that it’s a home game and that the Cubs win \(60\)% of their games at Wrigley Field. What’s your updated probability that the Cubs will win their game?

Exercise 67 (Student-Grade Causality) Consider the following probabilistic model. A student does poorly in a class (\(c = 1\)) or well (\(c = 0\)) depending on the presence or absence of depression (\(d = 1\) or \(d = 0\)) and on whether he or she partied last night (\(v = 1\) or \(v = 0\)). Partying can also give the student a headache (\(h = 1\)). As a result of the student’s poor performance, the teacher gets upset (\(t = 1\)). The probabilities are given by:

\(p(c=1|d,v)\) v d
0.999 1 1
0.9 1 0
0.9 0 1
0.01 0 0

\(p(h=1|v)\) v
0.9 1
0.1 0

\(p(t=1|c)\) c
0.95 1
0.05 0

\(p(v=1)=0.2\), and \(p(d=1) = 0.4\).

Draw the causal relationships in the model. Calculate \(p(v=1|h=1)\), \(p(v=1|t=1)\), \(p(v=1|t=1,h=1)\).
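
With only five binary variables, the conditional probabilities can be checked by brute-force enumeration of the joint distribution. The sketch below assumes the factorization \(p(d)\,p(v)\,p(c|d,v)\,p(h|v)\,p(t|c)\) implied by the description above; the helper names `bern` and `cond_prob` are illustrative.

```r
# All 2^5 configurations of (d, v, c, h, t)
g <- expand.grid(d = 0:1, v = 0:1, c = 0:1, h = 0:1, t = 0:1)

# P(x) for a binary x with success probability p
bern <- function(x, p) ifelse(x == 1, p, 1 - p)

# p(c = 1 | d, v) from the first table
pc1 <- ifelse(g$v == 1 & g$d == 1, 0.999,
       ifelse(g$v == 1 | g$d == 1, 0.9, 0.01))

# Joint probability, assuming p(d) p(v) p(c | d, v) p(h | v) p(t | c)
g$p <- bern(g$d, 0.4) * bern(g$v, 0.2) * bern(g$c, pc1) *
       bern(g$h, ifelse(g$v == 1, 0.9, 0.1)) *
       bern(g$t, ifelse(g$c == 1, 0.95, 0.05))

# P(event | given) by summing joint probabilities over matching rows
cond_prob <- function(event, given) sum(g$p[event & given]) / sum(g$p[given])

print(c(v_given_h  = cond_prob(g$v == 1, g$h == 1),
        v_given_t  = cond_prob(g$v == 1, g$t == 1),
        v_given_th = cond_prob(g$v == 1, g$t == 1 & g$h == 1)))
```

Enumeration does not scale to large networks, but for a model this small it gives a direct check on answers obtained by hand.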

Exercise 68 (Prisoner) Of three prisoners, A, B, and C, an unknown two will be shot and the other freed. Prisoner A asks the warder for the name of one other than himself who will be shot, explaining that as there must be at least one, the warder won’t really be giving anything away. The warder agrees, and says that B will be shot. This cheers A up a little: his judgmental probability for being shot is now 1/2 instead of 2/3. Show (via Bayes theorem) that

  1. A is mistaken - assuming that he thinks the warder is as likely to say "C" as "B" when he can honestly say either; but that
  2. A would be right, on the hypothesis that the warder will say "B" whenever he honestly can.

Exercise 69 (True/False)  

  1. In a sample of \(100,000\) emails you found that \(550\) are spam. Your next email contains the word “bigger”. From historical experience, you know that half of all spam email contains the word “bigger” and only \(2\)% of non-spam emails contain it. The probability that this new email is spam is approximately \(12\)%.
  2. Suppose that there’s a \(5\)% chance that it snows tomorrow and an \(80\)% chance that the Chicago Bears play their football game tomorrow given that it snows. The probability that they play tomorrow is then \(80\)%.
  3. Bayes’ rule states that \(p(A|B) =p(B|A)\).
  4. If \(P( A \cap B ) = 0.4\) and \(P( B) = 0.8\), then \(P( A|B ) = 0.5\).

Utility and Decisions

Exercise 70 (Two Gambles) In an experiment, subjects were given the choice between two gambles:

Experiment 1
Gamble \({\cal G}_A\) Gamble \({\cal G}_B\)
Win Chance Win Chance
$2500 0.33 $2400 1
$2400 0.66
$0 0.01

Suppose that a person is an expected utility maximizer. Set the utility scale so that u($0) = 0 and u($2500) = 1. Whether a utility maximizing person would choose Option A or Option B depends on the person’s utility for $2400. For what values of u($2400) would a rational person choose Option A? For what values would a rational person choose Option B?

Experiment 2
Gamble \({\cal G}_C\) Gamble \({\cal G}_D\)
Win Chance Win Chance
$2500 0.33 $2400 0.34
$0 0.67 $0 0.66

For what values of u($2400) would a person choose Option C? For what values would a person choose Option D? Explain why no expected utility maximizer would prefer both B and C.

This problem is a version of the famous Allais paradox, named after the prominent critic of subjective expected utility theory who first presented it. Kahneman and Tversky found that 82% of subjects preferred B over A, and 83% preferred C over D. Explain why no expected utility maximizer would prefer both B in Experiment 1 and C in Experiment 2. (A utility maximizer might prefer B in Experiment 1. A different utility maximizer might prefer C in Experiment 2. But the same utility maximizer would not prefer both B in Experiment 1 and C in Experiment 2.) Discuss these results. Why do you think many people prefer B in Experiment 1 and C in Experiment 2? Do you think this is reasonable even if it does not conform to expected utility theory?

Exercise 71 (Decisions) You are sponsoring a fundraising dinner for your favorite political candidate. There is uncertainty about the number of people who will attend (the random variable \(X\)), but based on past dinners, you think that the probability function looks like this:

\(x\) 100 200 300 400 500
\(P_X(x)\) 0.1 0.2 0.3 0.2 0.2
  1. Calculate \(E(X)\), the expected number of people who will attend.
  2. The owner of the venue is going to charge you $1500 for rental and other miscellaneous costs. You know that you will make a profit (after per person costs) of $40 for each person attending. Calculate the expected profit after the rental cost.
  3. The owner of the venue proposes an alternative pricing scheme. Instead of charging $1500, she will charge you either $5 per person or $2100, whichever is smaller. So if 100 people come, you only pay $500. If 500 come, you pay $2100. Calculate the expected profit under this scheme (still assuming $40 per plate profit before you pay the owner).
  4. Let \(Y_1\) be your profit under the first scheme and \(Y_2\) be your profit under the second. If you do the calculations, it turns out that the standard deviations of these profits are: \[\sigma_{Y_1} = 4996 \qquad \sigma_{Y_2} = 4488\] Using the expected values calculated above explain which of the two scenarios you prefer.
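
The expectations and standard deviations in this exercise are weighted sums over the five attendance values. A small sketch, with the helper names `expect` and `std_dev` chosen for illustration:

```r
x  <- c(100, 200, 300, 400, 500)   # possible attendance
px <- c(0.1, 0.2, 0.3, 0.2, 0.2)   # P_X(x)

# Expectation and standard deviation of a discrete random variable
expect  <- function(values, probs) sum(values * probs)
std_dev <- function(values, probs) {
  sqrt(expect(values^2, probs) - expect(values, probs)^2)
}

profit1 <- 40 * x - 1500                # flat $1500 charge
profit2 <- 40 * x - pmin(5 * x, 2100)   # $5 per person, capped at $2100

print(c(EX = expect(x, px),
        E1 = expect(profit1, px), E2 = expect(profit2, px)))
print(c(sd1 = std_dev(profit1, px), sd2 = std_dev(profit2, px)))
```

The two standard deviations computed this way can be compared with the values of \(\sigma_{Y_1}\) and \(\sigma_{Y_2}\) quoted in part 4.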

Exercise 72 (Marjorie Visit) Marjorie is worried about whether it is safe to visit a vulnerable relative during a pandemic. She is considering whether to take an at-home test for the virus before visiting her relative. Assume the test has sensitivity 85% and specificity 92%. That is, the probability that the test will be positive is about 85% if an individual is infected with the virus, and the probability that the test will be negative is about 92% if an individual is not infected.

Further, assume the following losses for Marjorie

Event Loss
Visit relative, not infected 0
Visit relative, infected 100
Do not visit relative, not infected 1
Do not visit relative, infected 5
  1. Assume that about 2 in every 1,000 persons in the population is currently infected. What is the posterior probability that an individual with a positive test has the disease?
  2. Suppose case counts have decreased substantially to about 15 in 100,000. What is the posterior probability that an individual with a positive test has the disease?
  3. Suppose Marjorie is deciding whether to visit her relative and if so whether to test for the disease before visiting. If the prior probability that Marjorie has the disease is 200 in 100,000, find the policy that minimizes expected loss. That is, given each of the possible test results, should Marjorie visit her relative? Find the EVSI. Repeat for a prior probability of 15 in 100,000. Discuss.
  4. For the decision of whether Marjorie should visit her relative, find the range of prior probabilities for which taking the at-home test results in lower expected loss than ignoring or not taking the test (assuming the test is free). Discuss your results.

Exercise 73 (True/False Variance)  

  1. If the sample covariance between two variables is one, then there must be a strong linear relationship between the variables
  2. If the sample covariance between two variables is zero, then the variables are independent.
  3. If \(X\) and \(Y\) are independent random variables, then \(Var(2X-Y)= 2 Var(X)-Var(Y)\).
  4. The sample variance is unaffected by outlying observations.
  5. Suppose that a random variable \(X\) can take the values \(\{0,1,2\}\) all with equal probability. Then the expected value and variance of \(X\) are both \(1\).
  6. The maximum correlation is \(1\) and the minimum is \(0\).
  7. For independent random variables \(X\) and \(Y\), we have \(var(X-Y)=var(X)-var(Y)\).
  8. If the correlation between \(X\) and \(Y\) is zero then the standard deviation of \(X+Y\) is the square root of the sum of the standard deviations of \(X\) and \(Y\).
  9. It is always true that the standard deviation is less than the variance
  10. If the correlation between \(X\) and \(Y\) is \(r = - 0.81\) and if the standard deviations are \(s_X = 20\) and \(s_Y = 25\), respectively, then the covariance is \(Cov (X, Y) = - 401\).
  11. If we drop the largest observation from a sample, then the sample mean and variance will both be reduced.
  12. Suppose \(X\) and \(Y\) are independent random variables and \(Var(X) = 6\) and \(Var(Y) = 6\). Then \(Var(X+Y) = Var(2X)\).
  13. Let investment \(X\) have mean return 5% and a standard deviation of 5% and investment \(Y\) have a mean return of 10% with a standard deviation of 6%. Suppose that the correlation between returns is zero. Then I can find a portfolio with higher mean and lower variance than \(X\).

Exercise 74 (True/False Expectation)  

  1. LeBron James makes \(85\)% of his free throw attempts and \(50\)% of his regular shots from the field (field goals). Suppose that each shot is independent of the others. He takes \(20\) field goals and \(10\) free throws in a typical game. He gets one point for each free throw and two points for each field goal assuming no 3-point shots. The number of points he expects to score in a game is 28.5.
  2. Suppose that you have a one in a hundred chance of hitting the jackpot on a slot machine. If you play the machine \(100\) times then you are certain to win.
  3. The expected value of the sample mean is the population mean, that is \(E \left ( \bar{X} \right ) = \mu\).
  4. The expectation of \(X\) minus \(2Y\) is just the expectation of \(X\) minus twice the expectation of \(Y\), that is \(E (X-2Y)= E(X) - 2E (Y)\).
  5. A firm believes it has a 50-50 chance of winning an $80,000 contract if it spends $5,000 on a proposal. If the firm spends twice this amount, it feels its chances of winning improve to 60%. If the firm wants to maximize its expected value then it should spend $10,000 to try and gain the contract.
  6. \(E(X+Y)=E(X)+E(Y)\) only if the random variables \(X\) and \(Y\) are independent.

Bayesian Parameter Learning

Exercise 75 (Beta-Binomial for Allais gambles) We’ve collected data on people’s preferences in the two Allais gambles. For this problem, we will assume that responses are independent and identically distributed, and the probability is \(\theta\) that a person chooses both B in the first gamble and C in the second gamble.

  1. Assume that the prior distribution for \(\theta\) is Beta(1, 3). Find the prior mean and standard deviation for \(\theta\). Find a 95% symmetric tail area credible interval for the prior probability that a person would choose B and C. Do you think this is a reasonable prior distribution to use for this problem? Why or why not?
  2. In 2009, 19 out of 47 respondents chose B and C. Find the posterior distribution for the probability \(\theta\) that a person in this population would choose B and C.
  3. Find the posterior mean and standard deviation. Find a 95% symmetric tail area credible interval for \(\theta\).
  4. Make a triplot of the prior distribution, normalized likelihood, and posterior distribution.
  5. Comment on your results.
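
The Beta prior is conjugate to the Binomial likelihood, so the update only adds counts to the two shape parameters. A sketch of the calculation and the triplot, using the Beta(1, 3) prior and the 19 out of 47 responses from the exercise (the helper `beta_summary` and the plotting choices are illustrative):

```r
a0 <- 1; b0 <- 3      # Beta(1, 3) prior
y <- 19; n <- 47      # respondents choosing both B and C, out of n

# Mean, standard deviation, and 95% equal-tailed interval of a Beta(a, b)
beta_summary <- function(a, b) {
  c(mean  = a / (a + b),
    sd    = sqrt(a * b / ((a + b)^2 * (a + b + 1))),
    lower = qbeta(0.025, a, b),
    upper = qbeta(0.975, a, b))
}

a1 <- a0 + y          # conjugate update: add successes
b1 <- b0 + (n - y)    # and failures

print(round(rbind(prior = beta_summary(a0, b0),
                  posterior = beta_summary(a1, b1)), 3))

# Triplot: the normalized likelihood is a Beta(y + 1, n - y + 1) density.
theta <- seq(0, 1, length.out = 501)
plot(theta, dbeta(theta, a1, b1), type = "l",
     xlab = expression(theta), ylab = "density")
lines(theta, dbeta(theta, y + 1, n - y + 1), lty = 2)
lines(theta, dbeta(theta, a0, b0), lty = 3)
legend("topright", c("posterior", "normalized likelihood", "prior"), lty = 1:3)
```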

Exercise 76 (Poisson for Car Counts) Times were recorded at which vehicles passed a fixed point on the M1 motorway in Bedfordshire, England on March 23, 1985. The total time was broken into 21 intervals of length 15 seconds. The number of cars passing in each interval was counted. The result was:

cnt = c(2, 2, 1, 1, 0, 4, 3, 0, 2, 1, 1, 1, 4, 0, 2, 2, 3, 2, 4, 3, 2)

This can be summarized in the following table, that shows 3 intervals with zero cars, 5 intervals with 1 car, 7 intervals with 2 cars, 3 intervals with 3 cars and 3 intervals with 4 cars.

table(cnt)
cnt
0 1 2 3 4 
3 5 7 3 3 
  1. Do you think a Poisson distribution provides a good model for the count data? Justify your answer.
  2. Assume that \(\Lambda\), the rate parameter of the Poisson distribution for counts (and the inverse of the mean of the exponential distribution for interarrival times), has a discrete uniform prior distribution on the 20 equally spaced values (0.2, 0.4, …, 3.8, 4.0) cars per 15-second interval. Find the posterior distribution of \(\Lambda\).
  3. Find the posterior mean and standard deviation of \(\Lambda\).
  4. Discuss what your results mean in terms of traffic on this motorway.
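
With a discrete prior, Bayes' rule becomes a weighted renormalization over the grid of candidate rates. A minimal sketch, assuming the counts are independent Poisson draws as the exercise suggests (variable names are illustrative):

```r
cnt <- c(2, 2, 1, 1, 0, 4, 3, 0, 2, 1, 1, 1, 4, 0, 2, 2, 3, 2, 4, 3, 2)

lambda <- seq(0.2, 4.0, by = 0.2)                  # 20 prior support points
prior  <- rep(1 / length(lambda), length(lambda))  # discrete uniform prior

# Log-likelihood of all 21 counts at each candidate rate
loglik <- sapply(lambda, function(l) sum(dpois(cnt, l, log = TRUE)))

# Bayes' rule on the grid; subtracting max(loglik) guards against underflow
post <- prior * exp(loglik - max(loglik))
post <- post / sum(post)

post_mean <- sum(lambda * post)
post_sd   <- sqrt(sum(lambda^2 * post) - post_mean^2)
print(round(c(mean = post_mean, sd = post_sd), 3))

barplot(post, names.arg = lambda, xlab = "lambda", ylab = "posterior probability")
```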

Exercise 77 (Car Count Part 2) This problem continues analysis of the automobile traffic data. As before, assume that counts of cars per 15-second interval are independent and identically distributed Poisson random variables with unknown mean \(\Lambda\).

  1. Assume that \(\Lambda\), the rate parameter of the Poisson distribution for counts, has a continuous gamma prior distribution with shape 1 and scale 10e6. (The gamma distribution with shape 1 tends to a uniform distribution as the scale tends to \(\infty\), so this prior distribution is “almost” uniform.) Find the posterior distribution of \(\Lambda\). State the distribution type and hyperparameters.
  2. Find the posterior mean and standard deviation of \(\Lambda\). Compare your results to Part I. Discuss.
  3. Find a 95% symmetric tail area posterior credible interval for \(\Lambda\). Find a 95% symmetric tail area posterior credible interval for \(1/\Lambda\), the mean time between vehicle arrivals.
  4. Find the predictive distribution for the number of cars passing in the next minute. Name the family of distributions and the parameters of the predictive distribution. Find the mean and standard deviation of the predictive distribution. Find the probability that more than 10 cars will pass in the next minute. (Hint: one minute is four 15-second time intervals.)
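
The gamma prior is conjugate to the Poisson likelihood, and the resulting predictive distribution for a future count is negative binomial. The sketch below uses the shape-rate form of the gamma distribution and takes the mean time between arrivals to be \(1/\Lambda\) in units of 15-second intervals (an assumption based on the description in Exercise 76). Variable names are illustrative.

```r
cnt <- c(2, 2, 1, 1, 0, 4, 3, 0, 2, 1, 1, 1, 4, 0, 2, 2, 3, 2, 4, 3, 2)

shape0 <- 1; scale0 <- 10e6             # prior hyperparameters from part 1

# Conjugate update: add total count to the shape, number of intervals to the rate
shape1 <- shape0 + sum(cnt)
rate1  <- 1 / scale0 + length(cnt)

post_mean <- shape1 / rate1
post_sd   <- sqrt(shape1) / rate1
ci_lambda <- qgamma(c(0.025, 0.975), shape = shape1, rate = rate1)
ci_gap    <- rev(1 / ci_lambda)         # interval for 1 / Lambda (assumed target)

# Predictive count over the next 4 intervals (one minute): negative binomial
# with size = shape1 and prob = rate1 / (rate1 + 4) in R's parameterization.
n_int <- 4
prob1 <- rate1 / (rate1 + n_int)
pred_mean <- shape1 * (1 - prob1) / prob1
pred_sd   <- sqrt(shape1 * (1 - prob1)) / prob1
p_more_than_10 <- pnbinom(10, size = shape1, prob = prob1, lower.tail = FALSE)

print(round(c(post_mean = post_mean, post_sd = post_sd,
              pred_mean = pred_mean, pred_sd = pred_sd,
              p_more_than_10 = p_more_than_10), 3))
```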

Exercise 78 (Lung disease) Chronic obstructive pulmonary disease (COPD) is a common lung disease characterized by difficulty in breathing. A substantial proportion of COPD patients admitted to emergency medical facilities are released as outpatients. A randomized, double-blind, placebo-controlled study examined the incidence of relapse in COPD patients released as outpatients as a function of whether the patients received treatment with corticosteroids. A total of 147 patients were enrolled in the study and were randomly assigned to treatment or placebo group on discharge from an emergency facility. Seven patients were lost from the study prior to follow-up. For the remaining 140 patients, the table below summarizes the primary outcome of the study, relapse within 30 days of discharge.

Relapse No Relapse Total
Treatment 19 51 70
Placebo 30 40 70
Total 49 91 140
  1. Let \(Y_1\) and \(Y_2\) be the number of patients who relapse in the treatment and placebo groups, respectively. Assume \(Y_1\) and \(Y_2\) are independent Binomial(70, \(\theta_i\)) random variables, for \(i=1,2\). Assume \(\theta_1\) and \(\theta_2\) have independent Beta prior distributions with shape parameters 1⁄2 and 1⁄2 (this is the Jeffreys prior distribution). Find the joint posterior distribution for \(\theta_1\) and \(\theta_2\). Name the distribution type and its hyperparameters.
  2. Generate 5000 random pairs \((\theta_{1k}, \theta_{2k})\), \(k=1,\ldots,5000\), from the joint posterior distribution. Use this random sample to estimate the posterior probability that the rate of relapse is lower for treatment than for placebo. Discuss your results.
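
Because the two groups are independent and each Beta prior is conjugate to its Binomial likelihood, the joint posterior can be sampled by drawing each parameter separately. A minimal sketch using the counts in the table (the seed is arbitrary):

```r
set.seed(1)            # arbitrary seed, for reproducibility only
n_draws <- 5000

# Beta(1/2, 1/2) prior plus observed relapses and non-relapses
theta1 <- rbeta(n_draws, 0.5 + 19, 0.5 + 51)   # treatment: 19 relapse, 51 do not
theta2 <- rbeta(n_draws, 0.5 + 30, 0.5 + 40)   # placebo:   30 relapse, 40 do not

# Monte Carlo estimate of P(theta1 < theta2 | data)
p_lower <- mean(theta1 < theta2)
print(p_lower)

# Posterior summary of the difference in relapse rates
print(round(quantile(theta2 - theta1, c(0.025, 0.5, 0.975)), 3))
```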

Exercise 79 (Normal-Normal) Concentrations of the pollutants aldrin and hexachlorobenzene (HCB) in nanograms per liter were measured in ten surface water samples, ten mid-depth water samples, and ten bottom samples from the Wolf River in Tennessee. The samples were taken downstream from an abandoned dump site previously used by the pesticide industry. The full data set can be found at http://www.biostat.umn.edu/~lynn/iid/wolf.river.dat. For this problem, we consider only HCB measurements taken at the bottom and the surface. The question of interest is whether the distribution of HCB concentration depends on the depth at which the measurement was taken. The data for this problem are given below.

Surface Bottom
3.74 5.44
4.61 6.88
4.00 5.37
4.67 5.44
4.87 5.03
5.12 6.48
4.52 3.89
5.29 5.85
5.74 6.85
5.48 7.16

Assume the observations are independent normal random variables with unknown depth-specific means \(\Theta_s\) and \(\Theta_b\) and precisions \(P_s\) and \(P_b\). Assume independent improper reference priors for the surface and bottom parameters: \[ g(\theta_s,\theta_b ,\rho_s,\rho_b ) = g(\theta_s,\rho_s)g(\theta_b ,\rho_b) \propto \rho_s^{(-1)}\rho_b^{(-1)} \]

  1. This prior can be treated as the product of two normal-gamma priors with \(\mu_s = \mu_b = 0\), \(k_s = k_b =0\), \(a_s = a_b = -1/2\), and \(b_s = b_b =\infty\). (These are not valid normal-gamma distributions, but you can use the usual Bayesian conjugate updating rule to find the posterior distribution.) Find the joint posterior distribution for the parameters \((\theta_s,\theta_b,\rho_s,\rho_b)\). Find 90% posterior credible intervals for \((\theta_s,\theta_b,\rho_s,\rho_b)\). Comment on your results.
  2. Use direct Monte Carlo to sample 10,000 observations from the joint posterior distribution of \((\theta_s,\theta_b,\rho_s,\rho_b)\). Use your Monte Carlo samples to estimate 90% posterior credible intervals for all four parameters. Compare with the result of part a.
  3. Use your direct Monte Carlo sample to estimate the probability that the mean bottom concentration \(\theta_b\) is higher than the mean surface concentration \(\theta_s\) and to estimate the probability that the standard deviation \(\sigma_b\) of the bottom concentrations is higher than the standard deviation \(\sigma_s\) of the surface concentrations.
  4. Comment on your analysis. What are your conclusions about the distributions of surface and bottom concentrations? Is the assumption of normality reasonable? Are the means different for surface and bottom? The standard deviations?
  5. Find the predictive distribution for the sample mean of a future sample of size 10 from the surface and a future sample of size 10 from the bottom. Find 95% credible intervals on the sample mean of each future sample. Repeat for future samples of size 40. Compare your results and discuss.
  6. Use direct Monte Carlo to estimate the predictive distribution for the difference in the two sample means for 10 future surface and bottom samples. Plot a kernel density estimator for the density function for the difference in means. Find a 95% credible interval for the difference in the two sample means. Repeat for future samples of 40 surface and 40 bottom observations. Comment on your results.
  7. Repeat part e, but use a model in which the standard deviation is known and equal to the sample standard deviation, and the depth-specific means \(\theta_s\) and \(\theta_b\) have a uniform prior distribution. Compare the 95% credible intervals for the future sample means for the known and unknown standard deviation models. Discuss.
  8. Assume that experts have provided the following prior information based on previous studies.
  • The unknown means \(\theta_s\) and \(\theta_b\) are independent and normally distributed with mean \(\mu\) and standard deviation \(\tau\). The unknown precisions \(\rho_s\) and \(\rho_b\) are independent of \(\theta_s\) and \(\theta_b\) and have gamma distributions with shape \(a\) and scale \(b\).
  • Experts specified a 95% prior credible interval of [3, 9] for \(\theta_s\) and \(\theta_b\). A good fit to this credible interval is obtained by setting the prior mean to \(\mu =6\) and the prior standard deviation to \(\tau=1.5\).
  • A 95% prior credible interval of [0.75, 2.0] is given for the unknown standard deviations \(\Sigma_s\) and \(\Sigma_b\). This translates to a credible interval of [0.25, 1.8] for \(\rho_s = \Sigma_s^{-2}\) and \(\rho_b = \Sigma_b^{-2}\). A good fit to this credible interval is obtained by setting the prior shape to \(a = 4.5\) and the prior scale to \(b = 0.19\).
  Find the following conditional distributions: \(p(\theta_s \mid D,\theta_b,\rho_s,\rho_b)\), \(p(\theta_b \mid D,\theta_s,\rho_s,\rho_b)\), \(p(\rho_s \mid D,\theta_s,\theta_b,\rho_b)\), \(p(\rho_b \mid D, \theta_s,\theta_b,\rho_s)\).
  1. Using the distributions you found, draw 10,000 Gibbs samples of \((\theta_s,\theta_b,\rho_s,\rho_b)\). Estimate 90% credible intervals for \((\theta_s,\theta_b,\rho_s^{-1/2},\rho_b^{-1/2})\) and \(\theta_b-\theta_s\).
  2. Do a traceplot of \(\theta_b-\theta_s\). Find the autocorrelation function of \(\theta_b-\theta_s\) and the effective sample size for your Monte Carlo sample for \(\theta_b-\theta_s\).
  3. Comment on your results. Compare with parts a, b, and c.
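
The direct Monte Carlo step in part 2 can be sketched as follows for the surface samples. This is a minimal illustration, assuming the usual conjugate update under the reference prior: the precision has a Gamma posterior with shape \((n-1)/2\) and rate \(\frac{1}{2}\sum_i (x_i - \bar{x})^2\), and, given the precision \(\rho\), the mean is normal with mean \(\bar{x}\) and precision \(n\rho\). The variable names are illustrative.

```r
set.seed(2024)
surface <- c(3.74, 4.61, 4.00, 4.67, 4.87, 5.12, 4.52, 5.29, 5.74, 5.48)

n    <- length(surface)
xbar <- mean(surface)
ss   <- sum((surface - xbar)^2)   # sum of squared deviations from the mean

n_draws <- 10000
# Step 1: draw the precision from its marginal Gamma posterior
rho   <- rgamma(n_draws, shape = (n - 1) / 2, rate = ss / 2)
# Step 2: draw the mean from its conditional normal posterior given rho
theta <- rnorm(n_draws, mean = xbar, sd = 1 / sqrt(n * rho))

# Equal-tailed 90% credible intervals estimated from the samples
ci_theta <- quantile(theta, c(0.05, 0.95))
ci_rho   <- quantile(rho, c(0.05, 0.95))
print(ci_theta)
print(ci_rho)
```

The same two steps apply to the bottom samples. Since the surface and bottom posteriors are independent under this prior, draws for the two depths can be compared pairwise to estimate quantities such as \(P(\theta_b > \theta_s \mid D)\).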

Exercise 80 (Gibbs: Bird feeders) A biologist counts the number of sparrows visiting six bird feeders placed on a given day.

Feeder Number of Birds
1 11
2 22
3 13
4 24
5 19
6 16
  • Assume that the bird counts are independent Poisson random variables with feeder-dependent means \(\lambda_i\), for \(i=1,\ldots,6\).
  • Assume that the means \(\lambda_i\) are independent and identically distributed gamma random variables with shape a and scale b (or equivalently, shape a and mean m = ab).
  • The mean m = ab of the gamma distribution is uniformly distributed on a grid of 200 equally spaced values starting at 5 and ending at 40.
  • The shape a is independent of the mean m and has a distribution that takes values on a grid of 200 equally spaced points starting at 1 and ending at 50, with prior probabilities proportional to a gamma density with shape 1 and scale 5.
  1. Use Gibbs sampling to draw 10000 samples from the joint posterior distribution of the mean m, the shape parameter a, and the six mean parameters \(\lambda_i\), \(i=1,\ldots,6\), conditional on the observed bird counts. Using your sample, calculate 95% credible intervals for the mean m, the shape a, and the six mean parameters \(\lambda_i\), \(i=1,\ldots,6\).
  2. Find the effective sample size for the Monte Carlo samples of the mean m, the shape parameter a, and the six mean parameters \(\lambda_i\), \(i=1,\ldots,6\).
  3. Do traceplots for the mean m, the shape parameter a, and the six mean parameters \(\lambda_i\), \(i=1,\ldots,6\).
  4. The fourth feeder had the highest bird count and the first feeder had the lowest bird count. Use your Monte Carlo sample to estimate the posterior probability that the first feeder has a smaller mean bird count than the fourth feeder. Explain how you obtained your estimate.
  5. Discuss your results.
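
One way to organize the Gibbs sampler for part 1 is sketched below. It is an illustration under the stated model, not a reference solution: each \(\lambda_i\) has a conjugate Gamma full conditional with shape \(a + y_i\) and rate \(a/m + 1\), while m and a are drawn from their discrete full conditionals on the grids. The helper names `log_dens_lambda` and `draw_grid`, the starting values, and the seed are arbitrary choices.

```r
set.seed(80)
y <- c(11, 22, 13, 24, 19, 16)            # bird counts from the table
k <- length(y)

m_grid <- seq(5, 40, length.out = 200)     # grid for the gamma mean m
a_grid <- seq(1, 50, length.out = 200)     # grid for the gamma shape a
log_prior_a <- dgamma(a_grid, shape = 1, scale = 5, log = TRUE)

# Sum over feeders of log Gamma(lambda_i | shape a, scale m / a);
# vectorized over a (with scalar m) or over m (with scalar a)
log_dens_lambda <- function(lambda, a, m) {
  length(lambda) * (a * log(a / m) - lgamma(a)) +
    (a - 1) * sum(log(lambda)) - (a / m) * sum(lambda)
}

# Draw one grid value with probabilities proportional to exp(log_w)
draw_grid <- function(grid, log_w) {
  sample(grid, size = 1, prob = exp(log_w - max(log_w)))
}

n_iter <- 10000
draws <- matrix(NA_real_, n_iter, k + 2,
                dimnames = list(NULL, c("m", "a", paste0("lambda", 1:k))))
m <- 20; a <- 5                            # arbitrary starting values

for (iter in seq_len(n_iter)) {
  # lambda_i | a, m, y_i ~ Gamma(shape a + y_i, rate a / m + 1)
  lambda <- rgamma(k, shape = a + y, rate = a / m + 1)
  # m | a, lambda: uniform prior on the grid, so weights come from the gamma density
  m <- draw_grid(m_grid, log_dens_lambda(lambda, a, m_grid))
  # a | m, lambda: discretized gamma prior times the gamma density
  a <- draw_grid(a_grid, log_prior_a + log_dens_lambda(lambda, a_grid, m))
  draws[iter, ] <- c(m, a, lambda)
}

# Equal-tailed 95% credible intervals for all eight quantities
print(t(apply(draws, 2, quantile, probs = c(0.025, 0.975))))
```

A traceplot for any column can then be drawn with, for example, `plot(draws[, "m"], type = "l")`, and the probability in part 4 can be estimated by `mean(draws[, "lambda1"] < draws[, "lambda4"])`.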

Exercise 81 (Poisson Distribution for EPL) We will analyze EPL data. Use epl.R as a starting script. This script uses the football-data.org API to download the results for two EPL teams: Manchester United and Chelsea. Model GoalsHome and GoalsAway for each of the teams using a Poisson distribution. Given these distributions, calculate the following:

  1. What’s the probability of a nil-nil (0 - 0) draw?
  2. What’s the probability that MU wins the match?
  3. Discuss how you could improve your model, which is based on four Poisson distributions.
  4. Use R and random number simulation to provide empirical answers. Using a random sample of size N = 10,000, check how close you get to the theoretical answers you found for the questions posed above. Provide histograms of the distributions you simulate.

Hint: The difference of two independent Poisson random variables follows a Skellam distribution, which is implemented in the skellam package. You can use dskellam to calculate the probability of a draw. You can use rpois to simulate draws from a Poisson distribution.
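
The comparison between exact and simulated answers can be sketched with base R alone. The two scoring rates below are hypothetical placeholders, not estimates from the data; in the exercise they would come from the downloaded results. The draw probability is computed here by summing over equal scores, which is the quantity the hint's dskellam would return at zero.

```r
set.seed(81)
# Hypothetical scoring rates (goals per match), for illustration only
lambda_mu      <- 1.6
lambda_chelsea <- 1.1

# Exact probabilities under independent Poisson goal counts
p_nil_nil <- dpois(0, lambda_mu) * dpois(0, lambda_chelsea)
goals     <- 0:50   # truncation point; the tail mass beyond 50 goals is negligible
p_draw    <- sum(dpois(goals, lambda_mu) * dpois(goals, lambda_chelsea))

# Simulation with N = 10,000 matches
N <- 10000
mu_goals      <- rpois(N, lambda_mu)
chelsea_goals <- rpois(N, lambda_chelsea)

sim <- c(nil_nil = mean(mu_goals == 0 & chelsea_goals == 0),
         draw    = mean(mu_goals == chelsea_goals),
         mu_win  = mean(mu_goals > chelsea_goals))
print(round(c(exact_nil_nil = p_nil_nil, exact_draw = p_draw, sim), 3))
```

A histogram of the simulated goal difference is available through `hist(mu_goals - chelsea_goals)`.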

Exercise 82 (Homicide Rate (Poisson Distribution)) Suppose that there are \(1.5\) homicides a week on average. This is the rate, so \(\lambda=1.5\). This tells us that there is still a \(1.4\)% chance of seeing \(5\) homicides in a week: \[ p( X= 5 ) = \frac{e^{ - 1.5 } ( 1.5 )^5 }{5!} = 0.014 \] On average this will happen once every \(71\) weeks, somewhat less often than once a year.

What’s the chance of having zero homicides in a week?
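
The probability quoted for five homicides can be checked numerically with `dpois`, R's Poisson probability mass function. The same call with a different first argument handles other counts.

```r
lambda <- 1.5
p_five <- dpois(5, lambda)   # P(X = 5) for X ~ Poisson(1.5)
print(round(p_five, 3))      # 0.014
print(round(1 / p_five))     # 71, the expected number of weeks between such events
```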

Exercise 83 (True/False (Binomial Distribution))  

  1. The binomial distribution is a discrete probability distribution.
  2. Assuming that Joe DiMaggio’s batting average is \(0.325\) per at-bat and his hits are independent, he has a probability of about \(12\)% of getting more than \(2\) hits in \(4\) at-bats.
  3. Suppose that you toss a fair coin with probability \(0.5\) of a head. The probability of getting five heads in a row is less than three percent.
  4. Suppose that you toss a biased coin with probability \(0.25\) of getting a head. The probability of getting five heads out of ten tosses is less than thirty percent.
  5. Suppose that you toss a coin \(5\) times. Then there are \(10\) ways of getting \(3\) heads.
  6. The probability of observing three heads out of five tosses of a fair coin is \(0.6\).
  7. A mortgage bank knows from experience that \(2\)% of residential loans will go into default. Suppose it makes \(10\) such loans. Then the probability that at least one goes into default is \(95\)%.
  8. Jessica Simpson is not a professional bowler and \(40\)% of her bowling swings are gutter balls. She is planning to take \(90\) bowling swings. The mean and standard deviation of the number of gutter balls are \(\mu = 36\) and \(\sigma = 3.65\).
  9. The probability of at least one head when tossing a fair coin \(4\) times is \(0.9375\).
  10. The Red Sox are to play the Yankees in a seven game series. Assume that the Red Sox have a 50% chance of winning each game, with the results being independent of each other. Then the probability of the series ending 4-3 in favor of the Red Sox is \(0.5^{7}=0.0078\).
  11. Suppose that \(X\) is Binomially distributed with \(E(X)=5\) and \(Var(X)=2\). Then \(n=10\) and \(p=0.5\).
  12. If \(X\) is a Bernoulli random variable with probability of success \(p\), then its variance is \(V(X)=p(1-p)\).
  13. Historically 15% of chips manufactured by a computer company are defective. The probability of a random sample of 10 chips containing exactly one defect is 0.15.

Exercise 84 (True/False (Poisson Distribution))  

  1. If \(X \sim Poi (2)\) and \(Y \sim Poi (3)\), then \(X+Y \sim Poi (6)\).
  2. Arsenal are playing Burnley at home in an English Premier League (EPL) game this weekend. They are favorites to win. They have a Poisson distribution for the number of goals they will score with a mean rate of \(2.5\) per game. Given this, the odds of Arsenal scoring at least two goals is greater than \(50\)%.
  3. Arsenal are playing Swansea tomorrow in an English Premier League (EPL) game. They are favourites to win. The number of goals they expect to score is Poisson with a mean rate of \(2.2\). Given this, the odds of Arsenal scoring at least one goal is greater than \(60\%\).
  4. Arsenal are playing Liverpool at home in an EPL game this weekend. You think that the number of goals to be scored by both teams follow a Poisson distribution with rates \(2.2\) and \(1.6\) respectively. Given this, the odds of a scoreless \(0-0\) draw are \(45-1\).
  5. Suppose your website gets on average \(2\) hits per hour. Then the probability of at least one hit in the next hour is \(0.135\).
  6. The soccer team Manchester United scores on average two goals per game. Given that the distribution of goals is Poisson, the chance that they score two or fewer goals is \(87\)%.

Exercise 85 (True/False (Normal Distribution))  

  1. The returns for Google stock on the day of earnings are normally distributed with a mean of \(5\)% and a standard deviation of \(5\)%. The probability that you will make money on the day of earnings is approximately \(60\)%.
  2. For any normal random variable, \(X\), we have \(\mathbb{P} \left ( \mu - \sigma < X < \mu + \sigma \right ) = 0.64\). Hint: You may use \(pnorm(1) = 0.841\)
  3. Suppose that the annual returns for Facebook stock are normally distributed with a mean of \(15\)% and a standard deviation of \(20\)%. The probability that Facebook has returns greater than \(10\)% for next year is \(60\)%.
  4. Consider the standard normal random variable \(Z \sim N ( 0 , 1 )\). Then the random variable \(-Z\) is also standard normal.
  5. A local bank experiences a \(2\)% default rate on residential loans made in a certain city. Suppose that the bank makes \(2000\) loans. Then the probability of more than \(50\) defaults is \(25\) percent.
  6. The Binomial distribution can be approximated by a normal distribution when the number of trials is large.
  7. Let \(X \sim N(5, 10)\). Then \(P \left (X>5 \right ) = \frac{1}{2}\).
  8. Shaquille O’Neal has a \(55\)% chance of making a free throw in basketball. Suppose he has \(900\) free throws this year. Then the chance he makes more than \(500\) free throws is \(45\)%.
  9. Suppose that the random variable \(X \sim N ( -2 , 4 )\). Then \(- 2 X \sim N ( 4 , 16 )\).
  10. Mortimer’s steak house advertises that it is the home of the \(16\) ounce steak. They claim that the weight of their steaks is normally distributed with mean \(16\) and standard deviation \(2\). If this is so, then the probability that a steak weighs less than \(14\) ounces is \(16\)%.
  11. Advertising costs for a \(30\)-second commercial are assumed to be normally distributed with a mean of \(10,000\) and a standard deviation of \(1000\). Then the probability that a given commercial costs between \(9000\) and \(10,000\) is \(50\)%.
  12. In a sample of \(120\) Zagat’s ratings of Chicago restaurants, the average restaurant had a rating of \(19.6\) with a standard deviation of \(2.5\). If you randomly pick a restaurant, the chance that you pick one with a rating over \(25\) is less than \(1\)%.
  13. A hospital finds that \(20\)% of its bills are at least one month in arrears. A random sample of \(50\) bills was taken. Then the probability that fewer than \(10\) bills in the sample were at least one month in arrears is \(50\)%.
  14. A Chicago radio station believes \(30\)% of its listeners are younger than \(30\). Out of a sample of \(500\) they find that \(250\) are younger than \(30\). This data supports their claim at the \(1\)% level.
  15. The probability that a standard normal distribution is more than \(1.96\) standard deviations from the mean is \(0.05\).
  16. Suppose that the amount of money spent at Disney World is normally distributed with a mean of $60 and a standard deviation of $15. Then approximately 45% of people spend more than $70 per visit.
  17. A Normal distribution with mean 4 and standard deviation 3.6 will provide a good approximation to a Binomial random variable with parameters \(n=40\) and \(p= 0.10\).
  18. If \(X\) is normally distributed with mean \(3\) and variance \(9\) then the probability that \(X\) is greater than \(1\) is \(0.254\).

Exercise 86 (Tarone Study) @tarone1982use reports data from 71 studies on tumor incidence in rats.

  1. In one of the studies, 2 out of 13 rats had tumors. Assume there are 20 possible tumor probabilities: 0.025, 0.075, …, 0.975. Assume that the tumor probability is uniformly distributed. Find the posterior distribution for the tumor probability given the data for this study.
  2. Repeat Part a for a second study in which 1 in 18 rats had a tumor.
  3. Parts a and b assumed that each study had a different tumor probability, and that these tumor probabilities were uniformly distributed a priori. Now, assume the tumor probabilities are the same for the two studies, and that this probability has a uniform prior distribution. Find the posterior distribution for the common tumor probability given the combined results from the two studies.
  4. Compare the three distributions for Parts a, b, and c. Comment on your results.

Exercise 87 Let \(X\) and \(Y\) be independent and identically distributed, each with an \(Exp(1)\) distribution. Their joint density is given by

\[f_{X,Y}(x,y) = \exp(-x)\exp(-y), \quad \text{where } 0 < x,y < \infty\]

  1. Use the convolution formula to find the distribution of \(X + \frac{1}{2} Y\). Check your answer by also using a moment generating function approach.
  2. Guess what happens if you consider \(X + \frac{1}{2} Y + \frac{1}{3} Z\), where \(Z\) is also \(Exp(1)\) and independent of \(X\) and \(Y\).

Solution:

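First, we need to show that the family of exponential distributions is closed under scaling. For \(Y \sim Exp(1)\) and \(a > 0\), let \(V = aY\), so that \[P(V \leq v) = P(Y \leq v/a) = 1 - \exp(-v/a)\] Therefore \(V\) is exponential with mean \(a\) (rate \(1/a\)) and density \(f_{aY}(v) = \frac{1}{a}\exp(-v/a)\). In particular, \(f_{\frac{1}{2}Y}(v) = 2\exp(-2v)\).

Let \(W = X + \frac{1}{2}Y\). By the convolution formula, the density of \(W\) for \(w > 0\) is \[\begin{aligned} f_W(w) &= \int_0^w f_X(x)f_{\frac{1}{2}Y}(w-x)dx \\ &= \int_0^w \exp(-x)\, 2\exp(-2(w-x))dx \\ &= 2\exp(-2w)\int_0^w \exp(x)dx \\ &= 2\exp(-w)\left(1 - \exp(-w)\right)\end{aligned}\]

As a check, the moment generating function of \(W\) for \(t < 1\) is \[M_W(t) = M_X(t)M_Y(t/2) = \frac{1}{1-t}\cdot\frac{1}{1-t/2} = \frac{2}{1-t} - \frac{2}{2-t}\] which is the moment generating function of the density \(2\exp(-w) - 2\exp(-2w)\), in agreement with the convolution result.

The mean also agrees: \[\begin{aligned} E(W) &= \int_0^\infty w f_W(w)dw \\ &= 2\int_0^\infty w\exp(-w)dw - 2\int_0^\infty w\exp(-2w)dw \\ &= 2 - \frac{1}{2} = \frac{3}{2} = E(X) + \frac{1}{2}E(Y)\end{aligned}\]

Note that \(F_W(w) = (1 - \exp(-w))^2\), which is the distribution function of the maximum of two independent \(Exp(1)\) random variables. For \(W = X + \frac{1}{2} Y + \frac{1}{3} Z\), convolving \(f_{X + \frac{1}{2}Y}\) with \(f_{\frac{1}{3}Z}(v) = 3\exp(-3v)\) gives the density \(3\exp(-w)\left(1 - \exp(-w)\right)^2\), the density of the maximum of three independent \(Exp(1)\) random variables.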
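
A simulation offers a quick sanity check on the convolution calculation. The sketch below is illustrative only: it draws \(W = X + \frac{1}{2}Y\) and compares the sample mean with \(E(X) + \frac{1}{2}E(Y) = 3/2\), and the empirical distribution function with \((1 - e^{-w})^2\), the distribution function obtained by integrating the density \(2e^{-w}(1-e^{-w})\) that the convolution integral yields.

```r
set.seed(87)
n <- 100000
x <- rexp(n, rate = 1)
y <- rexp(n, rate = 1)
w <- x + y / 2

# Sample mean versus the theoretical value 3/2
print(mean(w))

# Empirical CDF versus (1 - exp(-w))^2 at a few points
pts <- c(0.5, 1, 2, 3)
comparison <- cbind(w           = pts,
                    empirical   = ecdf(w)(pts),
                    theoretical = (1 - exp(-pts))^2)
print(round(comparison, 3))
```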

Exercise 88 (Poisson MLE) Let \(x_i \sim \mathrm{Poi}(\lambda)\), \(i=1,\ldots,N\). Find the maximum likelihood estimate of \(\lambda\).

Exercise 89 (Poisson Bayes) Let \(x_i \sim \mathrm{Poi}(\lambda)\), \(i=1,\ldots,N\). Find \(p(\lambda | x_1,\ldots, x_N)\), assuming the Gamma prior \(\lambda \sim \Gamma(\lambda|a,b)\).

Exercise 90 (Exponential Distribution) Let \(x_1, x_2,\ldots, x_N\) be an independent sample from the exponential distribution with density \(p (x | \lambda) = \lambda\exp (-\lambda x)\), \(x \ge 0\), \(\lambda> 0\). Find the maximum likelihood estimate \(\lambda_{\text{ML}}\). Choose the conjugate prior distribution \(p (\lambda)\), and find the posterior distribution \(p (\lambda | x_1,\ldots, x_N)\) and calculate the Bayesian estimate for \(\lambda\) as the expectation over the posterior.

Exercise 91 (Bernoulli Bayes) We have \(N\) Bernoulli trials, each with success probability \(q\), and we observe \(k\) successes. Find the conjugate prior distribution \(p (q)\). Find the posterior distribution \(p (q | k, N)\) and its expectation.

Exercise 92 (Exponential Family (Gamma)) Write the density function of the Gamma distribution in standard exponential family form. Find \(Ex\) and \(E\log x\) by differentiating the normalizing constant.

Exercise 93 (Exponential Family (Binomial)) Write the density function of the Binomial distribution in standard exponential family form. Find \(Ex\) and \(Var x\) by differentiating the normalizing constant.

Exercise 94 (Normal) Let \(X_1\) and \(X_2\) be independent \(N(0,1)\) random variables. Let \[Y_1 = X_1^2 + X_2^2, \quad Y_2 = \frac{X_1}{\sqrt{Y_1}}\]

  1. Find the joint distribution of \(Y_1\) and \(Y_2\).
  2. Are \(Y_1\) and \(Y_2\) independent or not?
  3. Can you interpret the result geometrically?

The density of \(N(\mu, \sigma^2)\) is \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

AB Testing

Exercise 95 (A/B Testing) During a recent outbreak of the flu, 850 out of 6,224 people diagnosed with the virus presented severe symptoms. During the same flu season, an experimental antiviral drug was being tested. The drug was given to 238 people with the flu and only 6 of them developed severe symptoms. Based on this information, can you conclude, for sure, that the drug is a success?

Exercise 96 (Tesla Supplier) Tesla purchases Lithium as a raw material for their batteries from either of two suppliers and is concerned about the amounts of impurity the material contains. The percentage impurity levels in consignments of the Lithium closely follow a normal distribution with the means and standard deviations given in the table below. The company is particularly anxious that the impurity level not exceed \(5\)% and wants to purchase from the supplier who is more likely to meet that specification.

Mean Standard Deviation
Supplier A 4.4 0.4
Supplier B 4.2 0.6
  1. Which supplier should be chosen?
  2. What if Supplier B implements some quality control which has no effect on the standard deviation but raises their mean to \(4.6\)?

Exercise 97 (Red Sox) On September 24, 2003, Pete Thamel in the New York Times reported that the Boston Red Sox had been accused of cheating by another American League Team. The claim was that the Red Sox had a much better winning record at home games than at games played in other cities.

The following table provides the wins and losses for home and away games for the Red Sox in the 2003 season

Record
Team Home Wins Home Losses Away Wins Away Losses
Boston Red Sox 53 28 42 39

Is there any evidence that the proportion of wins is significantly different between home and away games?

Discuss any other issues that are relevant.

Hint: a 95% confidence interval for a difference in proportions \(p_1 - p_2\) is given by \[ ( \hat{p}_1 - \hat{p}_2 ) \pm 1.96 \sqrt{ \frac{ \hat{p}_1 ( 1 - \hat{p}_1 ) }{ n_1 } + \frac{ \hat{p}_2 ( 1 - \hat{p}_2 ) }{ n_2 } } \]
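
The interval in the hint is straightforward to compute. A minimal sketch follows; the helper name `prop_diff_ci` and the counts in the usage line are hypothetical, chosen only to show the mechanics.

```r
# Normal-approximation confidence interval for p1 - p2.
# x1, x2: numbers of successes; n1, n2: sample sizes.
prop_diff_ci <- function(x1, n1, x2, n2, level = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(estimate = p1 - p2,
    lower    = p1 - p2 - z * se,
    upper    = p1 - p2 + z * se)
}

# Hypothetical counts for illustration only
ci <- prop_diff_ci(x1 = 60, n1 = 100, x2 = 45, n2 = 100)
print(round(ci, 4))
```

If the interval excludes zero, the difference is significant at the corresponding level. Calling the helper with `level = 0.99` uses the multiplier `qnorm(0.995)`, roughly 2.58.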

Exercise 98 (Myocardial Infarction) In a five year study of the effects of aspirin on Myocardial Infarction (MI), or heart attack, you have the following dataset on the reduction of the probability of getting MI from taking aspirin versus a Placebo, or control.

Treatment with MI without MI
Placebo 198 10845
Aspirin 104 10933
  1. Find a \(95\)% confidence interval for the difference in proportions \(p_1-p_2\).
  2. Perform a hypothesis test of the null \(H_0 : p_1 = p_2\) at a \(1\)% significance level.

Exercise 99 (San Francisco Giants) In October 1992, the ownership of the San Francisco Giants considered a sale of the franchise that would have resulted in a move to Florida. A survey from the San Francisco Chronicle found that in a random sample of \(625\) people, 50.7% would be disappointed by the move. Find a 95% confidence interval for the population proportion.

Exercise 100 (Pfizer) Pfizer introduced Viagra in early 1998 and during \(1998\) of the \(6\) million Viagra users \(77\) died from coronary problems such as heart attacks. Pfizer claimed that this rate is no more than that in the general population.

You find from a clinical study of \(1,500,000\) men who were not on Viagra that \(11\) of them died of coronary problems over the same length of time as the period in which the \(77\) Viagra users died in \(1998\).

Do you agree with Pfizer’s claim that the proportion of Viagra users dying from coronary problems is no more than that of other comparable men?

Exercise 101 (Voter: CI) Given a random sample of \(1000\) voters, \(400\) say they will vote for Donald Trump if he wins the Republican nomination for the 2016 US Presidential Election. Given that he gets the nomination, a \(95\%\) confidence interval for the true proportion of voters that will vote for him includes \(45\%\).

Exercise 102 (Survey AB testing) In a pre-election survey of \(435\) voters, \(10\) indicated that they planned to vote for Ralph Nader. In a separate survey of \(500\) people, \(250\) said they planned to vote for George Bush.

  1. Perform a hypothesis test at the 5% level for the hypothesis that Nader will get 3% or less of the vote.
  2. Find a 95% confidence interval for the difference in the proportion of people that will vote for George Bush versus Ralph Nader.

Exercise 103 (Significance Tests.) Some defendants in southern states have challenged verdicts made during the ’50s and ’60s, because of possible unfair jury selections. Juries are supposed to be selected at random from the population. In one specific case, only 12 jurors in an 80 person panel were African American. In the state in question, 50% of the population was African American. Could a jury with 12 African Americans out of 80 people be the result of pure chance?

Using \(n = 80\), \(X = 12\) find a 99% confidence interval for the proportion \(p\). Is that significantly different from \(p = 0.5\)?

Exercise 104 A marketing firm is studying the effects of background music on people’s buying behavior. A random sample of \(150\) people had classical music playing while shopping and \(200\) had pop music playing. The group that listened to classical music spent on average $74 with a standard deviation of $18, while the pop music group spent on average $78.4 with a standard deviation of $12.

  1. Test whether there is a significant difference in purchasing habits between the two groups. Describe clearly your null and alternative hypotheses and any test statistics that you use.
  2. Is there a difference between using a \(5\)% and a \(1\)% significance level?

Exercise 105 (Sensitivity and Specificity) Nvidia’s graphics chips are of high quality: the probability that a randomly chosen chip is defective is only \(0.1\)%. You have invented a new technology for testing whether a given chip is defective or not. This test will always identify a defective chip as defective and only “falsely” identify a good chip as defective with probability \(1\)%.

  1. What are the sensitivity and specificity of your testing device?
  2. Given that the test identifies a chip as defective, what’s the posterior probability that it is actually defective?
  3. What percentage of the chips will the new technology identify as being defective?
  4. Should you advise Nvidia to go ahead and implement your testing device? Explain.

Exercise 106 (CI for Google) Google is test marketing a new website design to see if it increases the number of click-throughs on banner ads. In a small study of a million page views they find the following table of responses:

Total Viewers Click-Throughs
new design 700,000 10,000
old design 300,000 2,000

Find a \(99\)% confidence interval for the increase in the proportion of people who click-through on banner ads using the new web design.

Hint: a 99% confidence interval for a difference in proportions \(p_1 - p_2\) is given by \(( \hat{p}_1 - \hat{p}_2 ) \pm 2.58\sqrt{ \frac{ \hat{p}_1 ( 1 - \hat{p}_1 ) }{ n_1 } + \frac{ \hat{p}_2 ( 1 - \hat{p}_2 ) }{ n_2 } }\)

Exercise 107 (Amazon) Amazon is test marketing a new package delivery system. It wants to see if same-day service is feasible for packages bought with Amazon Prime. In a small study of a hundred thousand deliveries they find the following times for delivery:

Deliveries Mean-Time (Hours) Standard Deviation-Time
new system 80,000 4.5 2.1
old system 20,000 5.6 2.5
  1. Find a \(95\)% confidence interval for the decrease in delivery time.
  2. If they switch to the new system, what proportion of deliveries will take under \(5\) hours, which is required to guarantee same-day service?

Hint: a 95% confidence interval for a difference in means \(\mu_1 - \mu_2\) is given by \(( \bar{x}_1 - \bar{x}_2 ) \pm 1.96 \sqrt{ \frac{s_1^2 }{ n_1 } + \frac{ s_2^2 }{ n_2 } }\)
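
The interval for a difference in means has the same structure as the one for proportions, with sample variances in place of \(\hat{p}(1-\hat{p})\). A minimal sketch from summary statistics follows; the helper name `mean_diff_ci` and the inputs in the usage line are hypothetical.

```r
# Large-sample confidence interval for mu1 - mu2 from summary statistics.
# xbar: sample means; s: sample standard deviations; n: sample sizes.
mean_diff_ci <- function(xbar1, s1, n1, xbar2, s2, n2, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  d  <- xbar1 - xbar2
  c(estimate = d, lower = d - z * se, upper = d + z * se)
}

# Hypothetical summary statistics for illustration only
ci <- mean_diff_ci(xbar1 = 10, s1 = 2, n1 = 100, xbar2 = 9, s2 = 3, n2 = 100)
print(round(ci, 4))
```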

Exercise 108 (Vitamin C) In the very famous study of the benefits of Vitamin C, \(279\) people were randomly assigned to a dose of vitamin C or a placebo (control of nothing). The objective was to study whether vitamin C reduces the incidence of a common cold. The following table provides the responses from the experiment:

Group Colds Total
Vitamin C 17 139
Placebo 31 140
  1. Is there a significant difference in the proportion of colds between the vitamin C and placebo groups?
  2. Find a \(99\)% confidence interval for the difference. Would you recommend the use of vitamin C to prevent a cold?

Hint: a 95% confidence interval for a difference in proportions \(p_1 - p_2\) is given by \[ ( \hat{p}_1 - \hat{p}_2 ) \pm 1.96 \sqrt{ \frac{ \hat{p}_1 (1- \hat{p}_1) }{ n_1 } + \frac{ \hat{p}_2 (1- \hat{p}_2) }{ n_2 } } \] For a 99% interval, replace \(1.96\) with \(2.58\).

Exercise 109 (Facebook vs Reading) In a recent article it was claimed that “\(96\)% of Americans under the age of \(50\)” spent more than three hours a day on Facebook.

To test this hypothesis, a survey of \(418\) people under the age of \(50\) was taken and it was found that \(401\) used Facebook for more than three hours a day.

Test the hypothesis at the \(5\)% level that the claim of \(96\)% is correct.

Exercise 110 (Paired T-Test) The following table shows the outcome of eight years of a ten year bet that Warren Buffett placed with Protege Partners, a New York hedge fund. Buffett claimed that a simple index fund would beat a portfolio strategy (fund-of-funds) picked by Protege over a ten year time frame. At Buffett’s shareholder meeting, he provided an update of the current state of the bet. The bundle of hedge funds picked by Protege had returned \(21.9\)% in the eight years through \(2015\) and the S&P500 index fund had soared \(65.7\)%.

SP Index Hedge Funds
2008 -37.0% -23.9%
2009 26.6% 15.9%
2010 15.1% 8.5%
2011 2.1% -1.9%
2012 16.0% 6.5%
2013 32.3% 11.8%
2014 13.6% 5.6%
2015 1.4% 1.7%
cumulative 65.7% 21.9%
  1. Use a paired \(t\)-test to assess the statistical significance between the two return strategies
  2. How likely is Buffett to win his bet in two years?

Exercise 111 (Shaquille O’Neal) Shaquille O’Neal (nicknamed Shaq) is a former American professional basketball player. He was notoriously bad at free throws (an uncontested shot given to a player when they are fouled). The following table compares the first three years of Shaq’s career to his last three years.

Group Free Throws Made Free Throws Attempted
Early Years 1352 2425
Later Years 1121 2132

Did Shaq get worse at free throws over his career?

Exercise 112 (Furniture Website) Furniture.com wishes to estimate the average annual expenditure on furniture among its subscribers. A sample of 1000 subscribers is taken. The sample standard deviation is found to be $670 and the sample mean is $3790.

  1. Give a 90% confidence interval for the average expenditure.
  2. Test the hypothesis that the average expenditure is less than $2,500 at the 1% level.

Exercise 113 (Grocery AB Testing) A grocery delivery service is studying the impact of weather on customer ordering behavior. To do so, they identified households which made orders both in October and in February on the same day of the week and approximately the same time of day. They found 112 such matched orders. The mean purchase amount in October was $121.45, with a standard deviation of $32.78, while the mean purchase amount in February was $135.99 with a standard deviation of $24.81. The standard deviation of the difference between the two orders was $38.28. To quantify the evidence in the data regarding a difference in the average purchase amount between the two months, compute the \(p\)-value of the data.
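
For matched data like these, the relevant spread is the standard deviation of the within-pair differences rather than the two separate standard deviations. A minimal sketch of the paired \(t\) computation from summary statistics follows; the helper name `paired_t_pvalue` and the inputs in the usage line are hypothetical.

```r
# Two-sided p-value for a paired t-test from summary statistics.
# mean_diff: mean of the within-pair differences
# sd_diff:   standard deviation of the within-pair differences
# n:         number of pairs
paired_t_pvalue <- function(mean_diff, sd_diff, n) {
  t_stat <- mean_diff / (sd_diff / sqrt(n))
  p_val  <- 2 * pt(-abs(t_stat), df = n - 1)
  c(t = t_stat, df = n - 1, p_value = p_val)
}

# Hypothetical inputs for illustration only
res <- paired_t_pvalue(mean_diff = 2, sd_diff = 5, n = 25)
print(res)
```

With a large number of pairs, the \(t\) distribution is close to the standard normal, so `pnorm` in place of `pt` would give nearly the same answer.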

Exercise 114 (SimCity AB Testing) SimCity 5 is one of Electronic Arts (EA’s) most popular video games. As EA prepared to release the new version, they released a promotional offer to drive more pre-orders. The offer was displayed on their webpage as a banner across the top of the pre-order page. They decided to test some other options to see what design or layout would drive more revenue.

The control removed the promotional offer from the page altogether. The test led to some very surprising results. Out of a sample of \(1000\) visitors, \(143\) of the \(500\) who saw the promotional offer wanted to purchase the game, while \(199\) of the \(500\) who saw the control wanted to buy the new version of SimCity.

Test at the \(1\)% level whether EA should provide a promotional offer or not.

Exercise 115 (True/False)  

  1. At the Apple conference it was claimed that “\(97\)% of people love theiWatch”. From a market survey, you found empirically that \(1175\) out ofa sample of \(1400\) people love the new iWatch. Statistically speaking,you can reject Apple’s claim at the \(1\)% significance level. Hint: Youmay use \(pnorm(2.58) = 0.995\)
  2. Given a random sample of \(2000\) voters, \(800\) say they will vote forHillary Clinton in the 2016 US Presidential Election. At the \(95\)%level, I can reject the null hypothesis that Hillary has an evens chanceof winning the election.
  3. The average movie is Netflix’s database has an average customer ratingof \(3.1\) with a standard deviation of \(1\). The last episode of BreakingBad had a rating of \(4.7\) with a standard deviation of \(0.5\). The\(p-value\) for testing whether Breaking Bad’s rating is statisticaldifferent from the average is a lot less than \(1\)%.
  4. A chip manufacturer needs to add the right amount of chemicals to makethe chips resistant to heat. On average the population of chips needs tobe able to withstand heat of 300 degrees. Suppose you have a randomsample of \(30\) chips with a mean of \(305\) and a standard deviation of\(8\). Then you can reject the null hypothesis \(H_0: \mu= 300\) versus\(H_a: \mu> 300\) at the \(5\)% level.
  5. Zagats rates restaurants on food quality. In a random sample of \(100\)restaurants you observe a mean of \(20\) with a standard deviation of\(2.5\). Your favorite restaurant has a score of \(25\). This is statistically different from the population mean at the \(5\)% level.
  6. The \(t\)-score is used to test whether a null hypothesis can be rejected.
  7. An oil company introduces a new fuel that they claim has on average no more than \(100\) milligrams of toxic matter per liter. They take a sample of \(100\) liters and find that \(\bar{X} = 105\) with a given \(\sigma_X = 20\). Then there is evidence at the \(5\)% level that their claim is wrong.
  8. A wine producer claims that the proportion of customers who cannot distinguish his product from grape juice is at most 5%. For a sample of \(100\) people he finds that \(10\) fail the taste test. He should reject his null hypothesis \(H_{0}:p=0.05\) at the \(5\)% level.
  9. As the \(t\)-ratio increases the \(p\)-value of a hypothesis test decreases.
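
Most of the statements above come down to computing a standardized test statistic and comparing it with a normal quantile. The sketch below shows those two ingredients using only the Python standard library; `proportion_z` and `mean_t_ratio` are hypothetical helper names, and `NormalDist().cdf` plays the role of R's `pnorm` from the hint.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(successes, n, p0):
    """z statistic for H0: p = p0, using the null standard error."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def mean_t_ratio(xbar, mu0, s, n):
    """t-ratio for H0: mu = mu0 from a sample mean and standard deviation."""
    return (xbar - mu0) / (s / sqrt(n))

std_normal = NormalDist()
# Analogue of pnorm(2.58) from the hint in statement 1.
print("Phi(2.58)        =", round(std_normal.cdf(2.58), 4))
# Critical values: two-sided 1% test and one-sided 5% test.
print("z for 0.995      =", round(std_normal.inv_cdf(0.995), 3))
print("z for 0.95       =", round(std_normal.inv_cdf(0.95), 3))
# Statistics for statements 1 and 4, using the numbers quoted there.
print("iWatch z         =", round(proportion_z(1175, 1400, 0.97), 2))
print("chip t-ratio     =", round(mean_t_ratio(305, 300, 8, 30), 2))
```

For a sample as small as \(n = 30\), a \(t\) quantile would be slightly larger than the normal one; the normal quantile is used here only to keep the sketch free of third-party dependencies.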

Exercise 116 (True/False: CI)  

  1. There is much discussion of the effects of second-hand smoke. In a survey of \(500\) children who live in families where someone smokes, it was found that \(10\) children were in poor health. A \(95\)% confidence interval for the probability of a child living in a smoking family being in poor health is then \(2\)% to \(4\)%.
  2. You are finding a confidence interval for a population mean. Holding everything else constant, an interval based on an unknown standard deviation will be wider than one based on a known standard deviation no matter what the sample size is.
  3. There is a \(95\)% probability that a normal random variable lies between \(\mu \pm \sigma\).
  4. A recent CNN poll found that \(49\)% of \(10000\) voters said they would vote for Obama versus Romney if that was the election in November 2012. A \(95\)% confidence interval for the true proportion of voters that would vote for Obama is then \(0.49 \pm 0.03\).
  5. A mileage test for a new electric car model called the “Pizzazz” is conducted. With a sample size of \(n=30\) the mean mileage for the sample is \(36.8\) miles with a sample standard deviation of \(4.5\). A \(95\)% confidence interval for the population mean is \((32.3,41.3)\) miles.
  6. In a random sample of \(100\) NCAA basketball games, the team leading after one quarter won the game \(72\) times. Then a 95% confidence interval for the proportion of teams leading after the first quarter that go on to win is approximately \((0.6,0.84)\).
  7. For the same sample, a 95% prediction interval for a particular team winning is also \((0.6,0.84)\).
  8. In playing poker in Vegas, from \(100\) hours of play, you make an average of $50 per hour with a standard deviation of $10. A 95% confidence interval for your mean gain per hour is approximately \((48, 52)\) dollars.
  9. If 27 out of 100 respondents to a survey state that they drink Pepsi then a 95% confidence interval for the proportion \(p\) of the population that drinks Pepsi is \(( 0.26 , 0.28 )\).
  10. The \(p\)-value is the probability that the Null hypothesis is true.
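
The interval statements above can be checked with two small helpers. This is a minimal sketch assuming large-sample normal intervals (a \(t\)-based interval would be slightly wider for small \(n\)); the names `proportion_ci` and `mean_ci` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def mean_ci(xbar, s, n, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Inputs taken from statements 9 and 8 of the exercise.
pepsi = proportion_ci(27, 100)
poker = mean_ci(50, 10, 100)
print("Pepsi proportion:", tuple(round(v, 3) for v in pepsi))
print("Poker mean gain :", tuple(round(v, 2) for v in poker))
```

The same two functions cover the smoking, polling, mileage, and basketball statements by changing the inputs.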

Exercise 117 (True/False: Sampling Distribution)  

  1. The Central Limit Theorem states that the distribution of a sample mean is approximately Normal.
  2. The Central Limit Theorem states that the distribution of the sample mean \(\bar{X}\) is Normally distributed for large samples.
  3. The Central Limit Theorem guarantees that the distribution of \(\bar{X}\) is constant.
  4. The sample mean, \(\bar{x}\), approximates the population mean for large random samples.
  5. The trimmed mean of a dataset is more sensitive to outliers than the mean.
  6. The sample mean of a dataset must be larger than its standard deviation.
  7. Selection bias is not a problem when you are estimating a population mean.
  8. The kurtosis of a distribution is not sensitive to outliers.
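
A short simulation can help build intuition for the Central Limit Theorem statements. The sketch below draws repeated samples from an exponential distribution with mean 1 and standard deviation 1 (a skewed population chosen here purely for illustration) and compares the spread of the sample means with \(\sigma/\sqrt{n}\).

```python
import random
from statistics import mean, stdev

def simulate_sample_means(n, reps, rng):
    """Return `reps` sample means, each from n Exponential(1) draws."""
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (2, 30, 200):
    means = simulate_sample_means(n, 2000, rng)
    print(f"n={n:4d}  mean of means={mean(means):.3f}  "
          f"sd of means={stdev(means):.3f}  sigma/sqrt(n)={1 / n ** 0.5:.3f}")
```

Plotting a histogram of `means` for each `n` shows the skewness fading as the sample size grows, while the centre stays near the population mean.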

Field vs Observational

Exercise 118 What is the difference between a randomized trial and an observational study?

Exercise 119 Consider a complex A/B experiment with 6 alternatives: 5 variations of your page, plus the original.

  1. Use Bonferroni correction for multiple comparisons. Calculate the significance level for each of the 5 tests and find the number of samples needed to achieve a power of 0.95. Assume that the significance level for each test is .05.
  2. Implement Thompson sampling (TS) for the same experiment. Assume an original arm with a 4% conversion rate, and an optimal arm with a 5% conversion rate. The other 4 arms include one suboptimal arm that beats the original with a conversion rate of 4.5%, and three inferior arms with rates of 3%, 2%, and 3.5%. Plot the savings from a six-armed experiment, relative to a Bonferroni-adjusted power calculation for a classical experiment. The first plot should show the number of days required to end the experiment, with a vertical line showing the time required by the classical power calculation. The second plot should show the number of conversions that were saved by the bandit. What is the overall cost savings due to ending the experiment more quickly, and due to the experiment being less wasteful while it is running?
  3. Run your simulator 500 times and show the history of the serving weights for all the arms in the first of the 500 simulation runs. Comment on the results.
  4. Plot the daily cost of running the multi-armed bandit relative to an “oracle” strategy of always playing arm 2, the optimal arm.
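
As a possible starting point for the simulator, the sketch below implements Beta-Bernoulli Thompson sampling for the six arms described above. It makes several simplifying assumptions that are not part of the exercise: uniform Beta(1, 1) priors, posterior updates after every visitor rather than once per day, a fixed traffic volume of 20,000 visitors, and no stopping rule. The function name `thompson_sampling` is hypothetical.

```python
import random

# Conversion rates from the exercise: original, optimal, suboptimal, three inferior arms.
RATES = [0.04, 0.05, 0.045, 0.03, 0.02, 0.035]

def thompson_sampling(rates, n_visitors, rng):
    """Beta-Bernoulli Thompson sampling, updating after every visitor.

    Returns per-arm counts of conversions and non-conversions.
    """
    k = len(rates)
    wins = [0] * k
    losses = [0] * k
    for _ in range(n_visitors):
        # Draw one plausible rate per arm from its Beta(1 + wins, 1 + losses) posterior.
        draws = [rng.betavariate(1 + wins[a], 1 + losses[a]) for a in range(k)]
        arm = draws.index(max(draws))          # serve the arm with the largest draw
        if rng.random() < rates[arm]:          # simulate the visitor's response
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

n_visitors = 20000  # assumed traffic volume, not given in the exercise
wins, losses = thompson_sampling(RATES, n_visitors, random.Random(42))
pulls = [w + l for w, l in zip(wins, losses)]
equal_split = n_visitors * sum(RATES) / len(RATES)  # expected conversions, equal allocation
oracle = n_visitors * max(RATES)                    # expected conversions, always best arm

print("traffic share per arm:", [round(p / n_visitors, 3) for p in pulls])
print("bandit conversions:", sum(wins))
print("expected under equal split:", round(equal_split), "| oracle:", round(oracle))
```

Recording `pulls` at the end of each simulated day would give the serving-weight history asked for in part 3, and the gap between `oracle` and the bandit's conversions corresponds to the cost in part 4.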

Added After Course Started

Exercise 120 (Emily, Car, Stock Market, Sweepstakes, Vacation and Bayes)

Emily is taking a Bayesian Analysis course. She believes she will get an A with probability 0.6, a B with probability 0.3, and a C or less with probability 0.1. At the end of the semester she will get a car as a present from her (very) rich uncle, depending on her class performance. For getting an A in the course Emily will get a car with probability 0.8, for a B with probability 0.5, and for anything less than a B she will get a car with probability 0.2. These are the probabilities if the market is bullish. If the market is bearish, the uncle is less likely to make expensive presents, and the above probabilities are 0.5, 0.3, and 0.1, respectively. The probabilities of a bullish and a bearish market are equal, 0.5 each. If Emily gets a car, she will travel to Redington Shores with probability 0.7, or stay on campus with probability 0.3. If she does not get a car, these two probabilities are 0.2 and 0.8, respectively. Independently, Emily may be a lucky winner of a sweepstakes lottery for a free air ticket and vacation in hotel Sol at Redington Shores. The chance of winning the sweepstakes is 0.001, but if Emily wins, she will go on vacation with probability 0.99, irrespective of what happened with the car.

After the semester was over you learned that Emily is at Redington Shores.

  1. What is the probability that she got a car?
  2. What is the probability that she won the sweepstakes?
  3. What is the probability that she got a B in the course?
  4. What is the probability that the market was bearish?

Hint: You can solve this problem in either of two ways: (i) direct simulation using R or Python, or (ii) exact calculation. Use just one of the two ways to solve it. The exact solution, although straightforward, may be quite messy.
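
The direct-simulation route can be sketched as follows: draw many scenarios from the joint model, keep only those in which Emily travels, and report relative frequencies among them. This is one possible reading of the problem, with the assumption that a sweepstakes win overrides the car-based travel probability; the function name `simulate_emily` and the number of draws are hypothetical choices.

```python
import random

P_GRADE = {"A": 0.6, "B": 0.3, "C": 0.1}            # "C" stands for C or less
P_CAR = {"bull": {"A": 0.8, "B": 0.5, "C": 0.2},
         "bear": {"A": 0.5, "B": 0.3, "C": 0.1}}

def simulate_emily(n_draws, rng):
    """Estimate posterior probabilities given that Emily is at Redington Shores."""
    grades, weights = list(P_GRADE), list(P_GRADE.values())
    counts = {"car": 0, "sweepstakes": 0, "grade_B": 0, "bear": 0}
    trips = 0
    for _ in range(n_draws):
        grade = rng.choices(grades, weights)[0]
        market = "bull" if rng.random() < 0.5 else "bear"
        car = rng.random() < P_CAR[market][grade]
        won = rng.random() < 0.001
        # Assumption: a sweepstakes win sets the travel probability to 0.99.
        p_trip = 0.99 if won else (0.7 if car else 0.2)
        if rng.random() < p_trip:                    # condition on the observed trip
            trips += 1
            counts["car"] += car
            counts["sweepstakes"] += won
            counts["grade_B"] += grade == "B"
            counts["bear"] += market == "bear"
    return {name: count / trips for name, count in counts.items()}

estimates = simulate_emily(200_000, random.Random(2024))
for name, value in estimates.items():
    print(f"P({name} | trip) is roughly {value:.4f}")
```

The estimates carry Monte Carlo error, which matters most for the sweepstakes probability because wins are rare; increasing the number of draws or switching to the exact calculation tightens it.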

Exercise 121 (Poisson: Website visits) A web designer is analyzing traffic on a web site. Assume the number of visitors arriving at the site at a given time of day is modeled as a Poisson random variable with a rate of \(\lambda\) visitors per minute. Based on prior experience with similar web sites, the following estimates are given:

  • There is a 90% probability that the rate is greater than 5 visitors per minute.
  • The rate is equally likely to be greater than or less than 14 visitors per minute.
  • There is a 90% probability that the rate is less than 27 visitors per minute.

Find a Gamma prior distribution for the arrival rate that fits these judgments as well as possible. Comment on your results.

Hint: there is no “right” answer to this problem. You can use trial and error to find a distribution that fits as well as possible. You can also use an optimization method such as Excel Solver to minimize a measure of how far apart the given quantiles are from the ones in the target distribution.
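
In place of Excel Solver, a coarse grid search can play the same role. The sketch below assumes a shape/rate parameterization of the Gamma distribution and measures misfit as the sum of squared differences between the Gamma CDF and the stated probabilities at 5, 14, and 27; both the grid ranges and that misfit measure are illustrative choices, and matching quantiles instead would be equally reasonable. The CDF is computed from the standard power series for the regularized lower incomplete gamma function so that no third-party library is needed.

```python
from math import exp, lgamma, log

def gamma_cdf(t, shape, rate):
    """P(lambda <= t) for a Gamma(shape, rate) distribution (series expansion)."""
    x = rate * t
    if x <= 0:
        return 0.0
    term, total, n = 1.0, 1.0, 0
    while term > 1e-12 * total and n < 1000:
        n += 1
        term *= x / (shape + n)      # next term: x^n / ((shape+1)...(shape+n))
        total += term
    return min(1.0, total * exp(-x + shape * log(x) - lgamma(shape + 1)))

# Elicited judgments from the exercise, written as (rate value, P(lambda <= value)).
TARGETS = [(5, 0.10), (14, 0.50), (27, 0.90)]

def misfit(shape, rate):
    """Sum of squared gaps between the Gamma CDF and the elicited probabilities."""
    return sum((gamma_cdf(q, shape, rate) - p) ** 2 for q, p in TARGETS)

def fit_gamma_prior():
    """Grid search over an assumed range of shapes (0.5 to 8) and rates (0.05 to 0.6)."""
    shapes = [0.5 + 0.1 * i for i in range(76)]
    rates = [0.05 + 0.01 * j for j in range(56)]
    return min(((a, b) for a in shapes for b in rates), key=lambda ab: misfit(*ab))

shape, rate = fit_gamma_prior()
print(f"shape = {shape:.2f}, rate = {rate:.3f}, prior mean = {shape / rate:.1f}")
for q, p in TARGETS:
    print(f"P(lambda <= {q}) = {gamma_cdf(q, shape, rate):.3f} (target {p:.2f})")
```

Because three judgments are being matched with two parameters, some mismatch generally remains; the printed comparison shows where the fitted prior departs from the elicited values.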