Segment 16: Daniel Shepard
1. Simulate the following: You have M = 50 p-values, none actually causal, so that they are drawn from a uniform distribution. Not knowing this sad fact, you apply the Benjamini-Hochberg prescription with α = 0.05 and possibly wrongly call some discoveries. By repeated simulation, estimate the probability of thus getting N wrongly-called discoveries, for N = 0, 1, 2, and 3.
The following MATLAB code was used for the simulation:

M = 50; alpha = 0.05;
N = 0:M;                                % possible numbers of wrongly-called discoveries
P = zeros(size(N));
for i = 1:10^6
    p = sort(rand(1,M));                % ordered uniform p-values
    ind = find(p >= (1:M)/M*alpha, 1);  % first index where p_(i) crosses i*alpha/M
    if isempty(ind), ind = M+1; end     % rare case: every p-value is called a discovery
    P(ind) = P(ind) + 1;                % ind-1 wrongly-called discoveries, i.e. N(ind)
end
P = P/10^6;
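For readers without MATLAB, the same simulation can be sketched in Python (NumPy assumed available; variable names are illustrative):

```python
import numpy as np

M, alpha, trials = 50, 0.05, 100_000
rng = np.random.default_rng(0)
thresh = np.arange(1, M + 1) / M * alpha       # thresholds i*alpha/M
p = np.sort(rng.random((trials, M)), axis=1)   # sorted uniform p-values, one row per trial
crossed = p >= thresh                          # True where p_(i) >= i*alpha/M
N = np.argmax(crossed, axis=1)                 # first crossing index = number of discoveries
N[~crossed.any(axis=1)] = M                    # rare case: no crossing at all
P = np.bincount(N, minlength=M + 1) / trials
print(P[:4])                                   # estimated probabilities for N = 0, 1, 2, 3
```

With 10^5 trials the estimates are already stable to about three decimal places; the MATLAB run above uses 10^6.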
This resulted in estimated probabilities of approximately 0.951, 0.045, 0.003, and 0.0003 for N = 0, 1, 2, and 3, respectively.
2. Does the distribution that you found in problem 1 depend on M? On α? Derive its form analytically for the usual case of M ≫ 1.
The distribution does not change appreciably with M (for large M), but does change with α. First, we need to work out the distribution of each entry in the sorted list of p-values. I will refer to these p-values (for notational purposes) as p_(k), where k is the index of the value in the ordered list, starting with 1. Before being ordered, these p-values were distributed uniformly on [0, 1]. However, ordering the p-values changes the distribution. For the first member of the list, the probability density is the density of that particular p-value at that particular value (which is 1) multiplied by the probability that every other p-value is greater than that value:

f_(1)(p) = M (1 − p)^(M−1).
The reason for the multiplication by M is that any of the M p-values could be the smallest one. For the second member of the list, the probability density is the density of that particular p-value at that particular value (which is 1) multiplied by the probability that one p-value is below this value and the probability that the other M − 2 p-values are above it:

f_(2)(p) = 2 C(M,2) p (1 − p)^(M−2) = M(M−1) p (1 − p)^(M−2).
The reason for the multiplication by C(M,2) is that we are choosing 2 of the M p-values to be the lowest 2 p-values. The reason for the multiplication by 2 is that the particular p-value that takes on the role of p_(2) could be either of the chosen 2 lowest p-values. By induction, the distribution of the kth entry in the ordered list of p-values is

f_(k)(p) = [M! / ((k−1)! (M−k)!)] p^(k−1) (1 − p)^(M−k).
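As a sanity check: the density just derived is that of a Beta(k, M − k + 1) variable, whose mean is k/(M + 1), and a quick pure-stdlib Monte Carlo sketch in Python should reproduce that mean:

```python
import random

# Monte Carlo check: the kth smallest of M uniforms is Beta(k, M - k + 1),
# so its mean should be k / (M + 1).
random.seed(0)
M, k, trials = 50, 3, 20_000
mean = sum(sorted(random.random() for _ in range(M))[k - 1]
           for _ in range(trials)) / trials
print(mean)   # should be close to 3/51 ~ 0.0588
```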
Now that we know the distribution of each of these members of the ordered list, we can determine the probability that a certain number of the null hypotheses are wrongfully discarded. N wrongly-called discoveries occur exactly when the first N ordered p-values fall below their thresholds and the (N+1)th does not:

P(N) = Prob{ p_(i) < iα/M for i = 1, …, N, and p_(N+1) ≥ (N+1)α/M }.

In the limit M ≫ 1, the smallest p-values behave like the arrival times of a Poisson process with rate M, and carrying out the resulting integrals gives the M-independent form

P(N) = [(N+1)^(N−1) / N!] α^N e^(−(N+1)α).
It can be shown that this is a valid probability distribution (i.e., it sums to 1 for any 0 ≤ α ≤ 1). Assuming α = 0.05 results in P(0) ≈ 0.9512, P(1) ≈ 0.0452, P(2) ≈ 0.0032, and P(3) ≈ 0.00027, in agreement with the simulation in problem 1.
To Think About
1. Suppose you have M independent trials of an experiment, each of which yields an independent p-value. Fisher proposed combining them by forming the statistic

S = −2 Σ_{i=1}^{M} ln p_i.
Show that, under the null hypothesis, S is distributed as chi-square with 2M degrees of freedom, and describe how you would obtain a combined p-value for this statistic.
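A minimal sketch of how such a combined p-value could be computed (function name hypothetical): under the null, S is chi-square with 2M degrees of freedom, and for an even number of degrees of freedom the chi-square upper tail has the closed form e^(−S/2) Σ_{k<M} (S/2)^k / k!, so no special-function library is needed.

```python
import random
from math import log, exp, factorial

def fisher_combined_p(pvals):
    """Combined p-value via Fisher's method: S = -2*sum(ln p_i) ~ chi2(2M)."""
    M = len(pvals)
    S = -2.0 * sum(log(p) for p in pvals)
    # Chi-square upper tail for even dof 2M: exp(-S/2) * sum_{k<M} (S/2)^k / k!
    return exp(-S / 2) * sum((S / 2) ** k / factorial(k) for k in range(M))

# Under the null (uniform p-values) the combined p-value is itself uniform:
random.seed(1)
null_p = fisher_combined_p([random.random() for _ in range(10)])
# Several modestly small p-values combine into a much stronger result:
strong_p = fisher_combined_p([0.04, 0.03, 0.05])
print(null_p, strong_p)
```

Note that for M = 1 the combined p-value reduces to the single input p-value, as it should.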
2. Fisher is sometimes credited, on the basis of problem 1, with having invented "meta-analysis", whereby results from multiple investigations can be combined to get an overall more significant result. Can you see any pitfalls in this?