Segment 13: Daniel Shepard
1. With p=0.3, and various values of n, how big is the largest discrepancy between the Binomial probability pdf and the approximating Normal pdf? At what value of n does this value become smaller than $10^{-15}$?
If you take the Binomial and Normal distributions, shift them left by the mean $Np$, and scale the x axis by dividing by the standard deviation $\sqrt{Np(1-p)}$, then the difference between the distributions has the same shape regardless of $N$. Therefore, we can identify the location of the maximum in this normalized space and easily determine the maximum difference for any $N$ by evaluating only points near this value. The shape of this difference is shown below, and the location of the maximum is between 0.741 and 0.743 standard deviations above the mean.
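One way to see why the peak sits near 0.742 is to assume that the discrepancy is dominated by the leading skewness (Edgeworth) correction to the Normal approximation, whose shape in the normalized variable z is proportional to $(z^3 - 3z)\,\varphi(z)$, with $\varphi$ the standard Normal pdf. That assumption is not stated in the solution itself. The sketch below simply locates the maximum of this assumed shape on a grid and compares it with the closed-form root.

```matlab
% Assumed shape of the normalized discrepancy: |z^3 - 3z| * phi(z),
% the leading skewness (Edgeworth) term; an assumption, not from the text.
phi   = @(z) exp(-z.^2/2) / sqrt(2*pi);      % standard Normal pdf
shape = @(z) abs(z.^3 - 3*z) .* phi(z);

z = linspace(0, 3, 300001);                  % grid spacing 1e-5
[peak, idx] = max(shape(z));
zGrid = z(idx);                              % numerical location of the maximum

% Setting the derivative to zero gives z^4 - 6 z^2 + 3 = 0,
% whose smaller positive root is sqrt(3 - sqrt(6)).
zExact = sqrt(3 - sqrt(6));

fprintf('grid max at z = %.5f, closed form z = %.5f\n', zGrid, zExact);
```

Under this assumption the maximum lands inside the 0.741 to 0.743 window, which is why it suffices to evaluate only the integers near that offset from the mean.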
MATLAB code to determine this maximum difference between the distributions is shown below.
p = 0.3;
f = @(x,N) abs(pdf('bino',x,N,p) - normpdf(x,N*p,sqrt(N*p*(1-p))));
N = 10.^(1:16);
maxDiff = zeros(size(N));
for i = 1:length(N)
    maxDiff(i) = max(f(max([0 floor(N(i)*p+0.741*sqrt(N(i)*p*(1-p)))]):min([N(i) ceil(N(i)*p+0.743*sqrt(N(i)*p*(1-p)))]),N(i)));
end
From the plot of these values below, one can see that the relationship is linear on a log-log scale.
Therefore, the minimum N for which the discrepancy is $10^{-15}$ or less can be determined using a linear fit to the data, as follows.
P = polyfit(log10(N),log10(maxDiff),1)
P =
  -0.999572910920835  -0.763144870503447
ceil(10^((-15 - P(2))/P(1)))
ans =
    1.749597432971020e+014
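As a rough cross-check on the fitted line, and assuming the leading skewness (Edgeworth) term dominates the discrepancy (an assumption added here, not part of the original solution), the maximum difference should behave like $c/N$ with $c = \frac{1-2p}{6p(1-p)}\,\max_z |z^3-3z|\,\varphi(z)$. The sketch below evaluates $c$ and the implied $N$ for a $10^{-15}$ discrepancy so that they can be compared with the intercept `P(2)` and the fitted answer.

```matlab
p    = 0.3;
z0   = sqrt(3 - sqrt(6));                 % assumed peak location in normalized units
phi0 = exp(-z0^2/2) / sqrt(2*pi);         % standard Normal pdf at z0

% Assumed leading-order model: maxDiff ~ c / N
c = (1 - 2*p) / (6*p*(1 - p)) * abs(z0^3 - 3*z0) * phi0;

interceptEst = log10(c);                  % compare with P(2) from polyfit
Nest = c / 1e-15;                         % N at which c/N reaches 1e-15

fprintf('log10(c) = %.4f, estimated N = %.4e\n', interceptEst, Nest);
```

The estimate comes out at roughly 1.7e14, close to the fitted value, which suggests that the slope of about -1 in the log-log fit reflects a 1/N decay of the discrepancy.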
2. Show that if four random variables are (together) multinomially distributed, each separately is binomially distributed.
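The multinomial distribution for four random variables $n_1$, $n_2$, $n_3$, and $n_4$ (with $n_1+n_2+n_3+n_4=N$) representing the number of occurrences of a particular event with probabilities of occurrence $p_1$, $p_2$, $p_3$, and $p_4$ (with $p_1+p_2+p_3+p_4=1$), respectively, is given by
$$P(n_1,n_2,n_3,n_4) = \frac{N!}{n_1!\,n_2!\,n_3!\,n_4!}\,p_1^{n_1}p_2^{n_2}p_3^{n_3}p_4^{n_4}$$
Note the following relation
$$\frac{N!}{n_1!\,n_2!\,n_3!\,n_4!} = \binom{N}{n_1}\,\frac{(N-n_1)!}{n_2!\,n_3!\,n_4!}$$
Therefore, by grouping terms, summing over all but one variable, and applying the multinomial generalization of the binomial expansion, it can be seen that
$$P(n_1) = \binom{N}{n_1}p_1^{n_1}\sum_{n_2+n_3+n_4=N-n_1}\frac{(N-n_1)!}{n_2!\,n_3!\,n_4!}\,p_2^{n_2}p_3^{n_3}p_4^{n_4} = \binom{N}{n_1}p_1^{n_1}(p_2+p_3+p_4)^{N-n_1} = \binom{N}{n_1}p_1^{n_1}(1-p_1)^{N-n_1}$$
This is simply a binomial distribution in $n_1$. The same manipulation can be done for $n_2$, $n_3$, and $n_4$.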
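The marginalization can also be checked numerically. The sketch below uses hypothetical values (12 trials and category probabilities 0.1, 0.2, 0.3, 0.4, none of which come from the problem), sums the four-category multinomial pmf over the last three counts by brute force, and compares the result with the binomial pmf for the first count.

```matlab
% Hypothetical example values (not from the text): N trials, four category probabilities.
N = 12;
prob = [0.1 0.2 0.3 0.4];

maxErr = 0;
for n1 = 0:N
    % Sum the multinomial pmf over all (n2, n3, n4) with n2 + n3 + n4 = N - n1.
    marginal = 0;
    for n2 = 0:(N - n1)
        for n3 = 0:(N - n1 - n2)
            n4 = N - n1 - n2 - n3;
            coeff = factorial(N) / (factorial(n1)*factorial(n2)*factorial(n3)*factorial(n4));
            marginal = marginal + coeff * prob(1)^n1 * prob(2)^n2 * prob(3)^n3 * prob(4)^n4;
        end
    end
    % Binomial pmf for n1 successes in N trials with success probability prob(1).
    binom = nchoosek(N, n1) * prob(1)^n1 * (1 - prob(1))^(N - n1);
    maxErr = max(maxErr, abs(marginal - binom));
end

fprintf('largest |marginal - binomial| = %.3e\n', maxErr);
```

The largest difference should be at the level of floating-point rounding. Changing which entry of `prob` plays the role of the first category checks the other three marginals in the same way.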
To Think About
1. The segment suggests that $A \ne T$ and $C \ne G$ comes about because genes are randomly distributed on one strand or the other. Could you use the observed discrepancies to estimate, even roughly, the number of genes in the yeast genome? If so, how? If not, why not?
You cannot, because you would need to know the lengths of the genes to even attempt this. At best, you might be able to estimate the number of genes as a function of the average gene length.
2. Suppose that a Bayesian thinks that the prior probability of the hypothesis that "$P_A = P_T$" is 0.9, and that the set of all hypotheses that "$P_A \ne P_T$" have a total prior of 0.1. How might he calculate the odds ratio $\mathrm{Prob}(P_A = P_T)/\mathrm{Prob}(P_A \ne P_T)$? Hint: Are there nuisance variables to be marginalized over?
He would need to compute the following