Part 1
Here are the data we are working with.
We create a grid of 10001 points along \([0, 1]\).
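The grid itself isn't shown in this excerpt; presumably it was built along these lines (the point count and the column name `p` match the code below, but the exact construction is an assumption):

```r
library(tibble)

# A grid of 10001 evenly spaced values of p on [0, 1] (step size 1e-4).
# W, N (the observed data) and p_true are defined elsewhere in the original.
grid <- tibble(p = seq(0, 1, length.out = 10001))
```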
The probabilities are thus:
probs1 <- grid %>%
  mutate(
    prior = 1, # flat prior
    likelihood = dbinom(W, N, p),
    unst_posterior = prior * likelihood,
    posterior = unst_posterior / sum(unst_posterior)
  ) %>%
  select(-unst_posterior)
We can sample from the posterior many times, then just calculate the quantiles from the sample.
probs1 %>%
  sample_n(10000, replace = TRUE, weight = posterior) %>%
  summarise(
    q005 = quantile(p, 0.005),
    q25 = quantile(p, 0.25),
    q50 = quantile(p, 0.5),
    q75 = quantile(p, 0.75),
    q995 = quantile(p, 0.995)
  )
# A tibble: 1 x 5
   q005   q25   q50   q75  q995
  <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.234 0.449 0.531 0.614 0.808
Part 2
The only change here is specifying that the prior is 0 below 0.5.
probs2 <- grid %>%
  mutate(
    prior = if_else(p >= 0.5, 1, 0),
    likelihood = dbinom(W, N, p),
    unst_posterior = prior * likelihood,
    posterior = unst_posterior / sum(unst_posterior)
  ) %>%
  select(-unst_posterior)
The lower bound for the quantiles is now higher.
probs2 %>%
  sample_n(10000, replace = TRUE, weight = posterior) %>%
  summarise(
    q005 = quantile(p, 0.005),
    q25 = quantile(p, 0.25),
    q50 = quantile(p, 0.5),
    q75 = quantile(p, 0.75),
    q995 = quantile(p, 0.995)
  )
# A tibble: 1 x 5
   q005   q25   q50   q75  q995
  <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.501 0.547 0.596 0.654 0.830
We can calculate the probability that the parameter lies within 0.05 of the true value by sampling from each posterior and averaging.
probs1 %>%
  bind_rows(probs2, .id = 'exercise') %>%
  group_by(exercise) %>%
  sample_n(10000, replace = TRUE, weight = posterior) %>%
  summarise(prob_p_near_true = mean(between(p, p_true - 0.05, p_true + 0.05)))
# A tibble: 2 x 2
  exercise prob_p_near_true
  <chr>               <dbl>
1 1                   0.130
2 2                   0.219
Here we see that the prior from this exercise has more probability mass around the true value.
Point-mass priors
Let’s mess around with some strange priors to see what happens. Here we put a prior with mass only on two points.
probs_points <- grid %>%
  mutate(
    # near() avoids exact floating-point comparison against the grid values
    prior = if_else(near(p, 0.4) | near(p, 0.6), 1, 0),
    likelihood = dbinom(W, N, p),
    unst_posterior = prior * likelihood,
    posterior = unst_posterior / sum(unst_posterior)
  ) %>%
  select(-unst_posterior)
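The plotting code below references `probability_levels`, which is defined elsewhere in the original. Presumably it just fixes the facet order, something like:

```r
# Hypothetical definition: order the plot facets as prior -> likelihood -> posterior
probability_levels <- c('prior', 'likelihood', 'posterior')
```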
probs_points %>%
  gather(component, probability, -p) %>%
  mutate(component = component %>% ordered(probability_levels)) %>%
  ggplot() +
  aes(p, probability, colour = component) +
  geom_line() +
  facet_wrap(~component, scales = 'free_y')
probs_points %>%
  sample_n(10000, replace = TRUE, weight = posterior) %>%
  summarise(
    q005 = quantile(p, 0.005),
    q25 = quantile(p, 0.25),
    q50 = quantile(p, 0.5),
    q75 = quantile(p, 0.75),
    q995 = quantile(p, 0.995)
  )
# A tibble: 1 x 5
   q005   q25   q50   q75  q995
  <dbl> <dbl> <dbl> <dbl> <dbl>
1   0.4   0.4   0.6   0.6   0.6
Multiple interval priors
Now let’s put prior density on disjoint intervals.
probs_bars <- grid %>%
  mutate(
    prior = if_else(between(p, 0.2, 0.3) | between(p, 0.7, 0.8), 1, 0),
    likelihood = dbinom(W, N, p),
    unst_posterior = prior * likelihood,
    posterior = unst_posterior / sum(unst_posterior)
  ) %>%
  select(-unst_posterior)
probs_bars %>%
  gather(component, probability, -p) %>%
  mutate(component = component %>% ordered(probability_levels)) %>%
  ggplot() +
  aes(p, probability, colour = component) +
  geom_line() +
  facet_wrap(~component, scales = 'free_y')
probs_bars %>%
  sample_n(10000, replace = TRUE, weight = posterior) %>%
  summarise(
    q005 = quantile(p, 0.005),
    q25 = quantile(p, 0.25),
    q50 = quantile(p, 0.5),
    q75 = quantile(p, 0.75),
    q995 = quantile(p, 0.995)
  )
# A tibble: 1 x 5
   q005   q25   q50   q75  q995
  <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.206 0.298 0.719 0.746 0.798
Part 3
An easy way is just to simulate the calculation for different values of \(N\), using our best guess at the true value we are trying to estimate. Note that we run multiple iterations for each value of \(N\).
probs <- tibble(N = c(10^(1:4), 3000)) %>%
  crossing(iter = 1:10) %>%
  mutate(W = rbinom(n(), N, p_true)) %>%
  crossing(grid) %>%
  group_by(N, iter) %>%
  mutate(
    prior = 1,
    likelihood = dbinom(W, N, p),
    unst_posterior = prior * likelihood,
    posterior = unst_posterior / sum(unst_posterior)
  ) %>%
  select(-unst_posterior)
probs %>%
  sample_n(10000, replace = TRUE, weight = posterior) %>%
  summarise(
    q005 = quantile(p, 0.005),
    q50 = quantile(p, 0.5),
    q995 = quantile(p, 0.995)
  ) %>%
  mutate(width = q995 - q005) %>%
  summarise(
    q005 = mean(q005),
    q50 = mean(q50),
    q995 = mean(q995),
    width = mean(width)
  )
# A tibble: 5 x 5
      N  q005   q50  q995   width
  <dbl> <dbl> <dbl> <dbl>   <dbl>
1    10 0.247 0.589 0.875  0.629
2   100 0.570 0.695 0.801  0.231
3  1000 0.659 0.697 0.734  0.0743
4  3000 0.676 0.698 0.719  0.0431
5 10000 0.685 0.697 0.709  0.0237
Thus it seems we need around 3000 trials to get a 99% interval (from the 0.5th to the 99.5th percentile) with a width below 0.05.
Alternatively, we can use a normal approximation to get a rough idea of how large \(N\) must be. The standard deviation of the estimate is
\[ \sqrt{p_{true} (1 - p_{true}) / N}. \]
A 99% interval spans roughly plus or minus 3 standard deviations, so its width must satisfy
\[ 6\sqrt{p_{true} (1 - p_{true}) / N} < 0.05 \]
which means
\[ N > (p_{true} (1 - p_{true})) / (0.05 / 6)^2. \]
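Plugging in numbers gives a quick sanity check. This assumes \(p_{true} = 0.7\), which is consistent with the simulated posterior medians (about 0.697) above:

```r
# Assumed true proportion; matches the simulated medians in the table above
p_true <- 0.7

# Smallest N for which a ~6-standard-deviation interval is narrower than 0.05
n_required <- p_true * (1 - p_true) / (0.05 / 6)^2
n_required # approximately 3024
```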
Thus, \(N\) will be somewhere around
[1] 3024