Segment 16: Daniel Shepard

Problems

To Calculate

1. Simulate the following: You have M=50 p-values, none actually causal, so that they are drawn from a uniform distribution. Not knowing this sad fact, you apply the Benjamini-Hochberg prescription with $\alpha=0.05$ and possibly call some discoveries as true. By repeated simulation, estimate the probability of thus getting N wrongly-called discoveries, for N=0, 1, 2, and 3.

Solution:

The following MATLAB code was used for the simulation:

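   M = 50;                          % number of p-values, all truly null
   alpha = 0.05;                    % Benjamini-Hochberg false discovery rate
   nTrials = 10^6;                  % number of simulated experiments
   N = 0:M;                         % possible numbers of discoveries
   P = zeros(size(N));              % P(n+1) counts trials with n discoveries
   for i = 1:nTrials
       p = sort(rand(1,M));         % sorted uniform p-values
       % BH is a step-up rule: take the largest k with p(k) <= k*alpha/M
       k = find(p <= (1:M)/M*alpha, 1, 'last');
       if isempty(k), k = 0; end    % no p-value passes: zero discoveries
       P(k+1) = P(k+1) + 1;
   end
   P = P/nTrials;                   % P(1:4) estimates N = 0, 1, 2, 3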


This resulted in the probabilities , , , and .
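As a cross-check on a simulation of this kind, one value is known exactly: when all M p-values are independent and null, the Benjamini-Hochberg step-up rule makes no discoveries with probability $1-\alpha$. The following is a minimal vectorized sketch, not the code used for the results above; `nTrials` is an illustrative choice, and the rule applied is "reject up to the largest k with $p_{(k)} \le k\alpha/M$".

```matlab
% Sketch: vectorized Benjamini-Hochberg simulation under the global null.
% nTrials is an illustrative assumption; M and alpha follow the problem statement.
M = 50;
alpha = 0.05;
nTrials = 1e5;

thresh = (1:M) / M * alpha;          % BH thresholds k*alpha/M
p = sort(rand(nTrials, M), 2);       % each row is one sorted experiment
below = bsxfun(@le, p, thresh);      % true where p_(k) <= k*alpha/M

% Discoveries per trial: the largest k that passes, or 0 if none passes
nDisc = max(bsxfun(@times, below, 1:M), [], 2);

% Fraction of trials with 0, 1, ..., M discoveries
P = accumarray(nDisc + 1, 1, [M + 1, 1]) / nTrials;

fprintf('P(N=0) estimate: %.4f (exact: %.4f)\n', P(1), 1 - alpha);
```

With this many trials the estimate of P(N=0) should agree with $1-\alpha$ to within sampling error of roughly 0.001; `P(2)`, `P(3)`, and `P(4)` hold the corresponding estimates for N = 1, 2, and 3.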

2. Does the distribution that you found in problem 1 depend on M? On $\alpha$? Derive its form analytically for the usual case of $\alpha \ll 1$.

Solution:

The distribution depends only very weakly on M (for large M it is essentially unchanged), but it does change with $\alpha$. First, we need to work out the distribution of each entry in the sorted list of p-values. I will refer to these p-values (for notational purposes) as $p_{(i)}$, where $i$ is the index of the value in the ordered list, starting with 1. Before being ordered, these p-values were distributed uniformly on $[0,1]$. However, ordering the p-values changes their distributions. For the first member of the list, the probability density is the density of that particular p-value taking on that particular value (which is 1) multiplied by the probability that every other p-value is greater than that value:

$$P(p_{(1)} = x) = M\,(1-x)^{M-1}$$




The reason for the multiplication by $M$ is that any of the p-values could be the smallest one. For the second member of the list, the probability density is the density of that particular p-value taking on that particular value (which is 1) multiplied by the probability that one p-value is below this value and the probability that the other $M-2$ p-values are above that value:

$$P(p_{(2)} = x) = 2\binom{M}{2}\,x\,(1-x)^{M-2}$$




The reason for the multiplication by $\binom{M}{2}$ is that we are choosing 2 of the p-values to be the lowest 2 p-values. The reason for the multiplication by 2 is that the particular p-value that takes on the role of $p_{(2)}$ could be either of the chosen 2 lowest p-values. By induction, the distribution of the $k$th entry in the ordered list of p-values is

$$P(p_{(k)} = x) = k\binom{M}{k}\,x^{k-1}(1-x)^{M-k}$$
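The general form can be checked numerically. The sketch below assumes the $k$th-entry density $k\binom{M}{k}x^{k-1}(1-x)^{M-k}$ and confirms two things: that it integrates to 1, and that its mean, which for this density is $k/(M+1)$, matches the mean of the $k$th smallest value in simulated lists of uniform p-values. The choices `k = 3` and `nTrials = 1e5` are illustrative assumptions.

```matlab
% Sketch: check the k-th order-statistic density for M uniform p-values.
% k and nTrials are illustrative assumptions, not values from the text.
M = 50;
k = 3;
nTrials = 1e5;

% Assumed density of the k-th smallest of M uniform draws
f = @(x) k * nchoosek(M, k) .* x.^(k-1) .* (1 - x).^(M-k);

% A valid density integrates to 1 over [0, 1]
total = integral(f, 0, 1);

% Mean implied by the density, by numerical integration
meanFromDensity = integral(@(x) x .* f(x), 0, 1);

% Empirical mean of the k-th smallest value over many simulated lists
p = sort(rand(nTrials, M), 2);       % sort each row in ascending order
meanFromSim = mean(p(:, k));

fprintf('integral of density:  %.6f\n', total);
fprintf('mean from density:    %.6f\n', meanFromDensity);
fprintf('mean from simulation: %.6f\n', meanFromSim);
```

Changing `k` exercises other positions in the ordered list; agreement of the two means for several values of `k` is a reasonable, though not exhaustive, check on the induction step.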




Now that we know the distribution of each member of the ordered list, we can determine the probability that a given number of null hypotheses are wrongly rejected. This probability is given by the equation




It is easily shown that this is a valid probability distribution (i.e., it sums to 1). Assuming results in , , , and .