# Difference between revisions of "Eleisha's Segment 23: Bootstrap Estimation of Uncertainty"

To Compute:

1. Generate 100 i.i.d. random draws from the beta distribution $\displaystyle \text{Beta}(2.5,5.)$ , for example using MATLAB's betarnd or Python's random.betavariate. Use these to estimate this statistic of the underlying distribution: "value of the 75% percentile point minus value of the 25th percentile point". Now use statistical bootstrap to estimate the distribution of uncertainty of your estimate, for example as a histogram.

After generating 100 i.i.d random values, my estimated value the statistic was approximately: 0.20363.

After bootstrapping (nboot = 100,000), the mean of my test statistic was approximately: 0.20417 and the standard deviation was approximately: 0.02427.

Below is a histogram of the bootstrapping values:

2. Suppose instead that you can draw any number of desired samples (each 100 draws) from the distribution. How does the histogram of the desired statistic from these samples compare with the bootstrap histogram from problem 1?

When drawing from the true distribution the mean was approximately: 0.23115 and the standard deviation was: 0.02653.

Below is a histogram of these values:

3. What is the actual value of the desired statistic for this beta distribution, computed numerically (that is, not by random sampling)? (Hint: I did this in Mathematica in three lines.)

The actual value of the desired statistic for this beta distribution is approximately: 0.23295. This was calculated using the inverse cdf of the Beta Distribution with the proper quartiles.

Below is the python script used to do the calculations and plot the histograms:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

ndata = 100
nboot = 100000

#Generates a new dataset of ndata i.i.d random draws from the Beta(2.5, 5.0)
def generate_data(ndata):
data = np.random.beta(2.5, 5.0, ndata)
return data

#Calculates the statistic of interest: "value of the 75% percentile point minus value of the 25th percentile point"
def calculate_test_stat(data):
data = np.sort(data)
return (data - data)

#Generate a dataset of values
data = generate_data(ndata)
estimated_stat = calculate_test_stat(data)
print "Estimated Test Statistic: " + str(estimated_stat)

#Creates a dataset of statistics calculated by sampling from the data
values = []
for x in xrange(0, nboot):
sample_data = np.random.choice(data, size=ndata, replace=True)
values.append(calculate_test_stat(sample_data))

#Plot the Histogram of the desired statistic when sampling from the data
plt.figure(1)
print "Bootstrapping Values"
print "Mean of values: " + str(np.mean(values))
print "Standard Deviation of values: " + str(np.std(values))
plt.title("Histogram of Bootstrapped Values")
plt.ylabel("Frequency")
plt.xlabel("Statistic Value")
plt.hist(values, 100)
plt.savefig("Eleisha_HW23_Figure1.png")
plt.show()

#Create a data set of statistics calculated by sampling from the true distribution
values = []
for x in xrange(0, nboot):
true_data = generate_data(ndata)
values.append(calculate_test_stat(true_data))

#Plot the Histogram of the desired statistic when sampling from the distribution
plt.figure(2)
print "Values when sampling from the True Distribution"
print "Mean of values: " + str(np.mean(values))
print "Standard Deviation of values: " + str(np.std(values))
plt.title("Histogram of Statistics from Actual Distribution")
plt.ylabel("Frequency")
plt.xlabel("Statistic Value")
plt.hist(values, 100)
plt.savefig("Eleisha_HW23_Figure2.png")
plt.show()

#Calculate the actual value of the statistic
true_value = beta.ppf(0.75, 2.5, 5.0) - beta.ppf(0.25, 2.5, 5.0)
print "Actual Value: " + str(true_value)


Sample output:

Estimated Test Statistic: 0.203631474755
Bootstrapping Values
Mean of values: 0.204169593645
Standard Deviation of values: 0.0242746089495
Values when sampling from the True Distribution
Mean of values: 0.231145899062
Standard Deviation of values: 0.0265314651351
Actual Value: 0.232952354264