Segment 16: Daniel Shepard

From Computational Statistics Course Wiki

Problems

To Calculate

1. Simulate the following: You have <math>M = 50</math> p-values, none actually causal, so that they are drawn from a uniform distribution. Not knowing this sad fact, you apply the Benjamini-Hochberg prescription with <math>\alpha = 0.05</math> and possibly call some discoveries as true. By repeated simulation, estimate the probability of thus getting N wrongly-called discoveries, for N = 0, 1, 2, and 3.

Solution:

The following MATLAB code was used for the simulation:

   M = 50;                     % number of (all-null) p-values per trial
   alpha = 0.05;
   ntrial = 1e6;
   P = zeros(1, M+1);          % P(k) estimates Prob(N = k-1)
   for i = 1:ntrial
       p = sort(rand(1,M));    % ordered p-values under the global null
       ind = find(p >= (1:M)/M*alpha, 1);  % first index failing the BH comparison
       if isempty(ind)
           ind = M+1;          % (very rare) all M p-values pass
       end
       P(ind) = P(ind) + 1;    % N = ind-1 discoveries were wrongly called
   end
   P = P/ntrial;

This resulted in an estimate of <math>P(0) \approx 0.95</math>, with sharply decreasing probabilities for <math>N = 1</math>, <math>2</math>, and <math>3</math> (the chance of at least one wrongly-called discovery is close to <math>\alpha</math>).
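As a sanity check on the MATLAB run, the same Monte Carlo is easy to vectorize in Python (shown here with NumPy purely as a cross-check; the function and variable names are my own):

```python
import numpy as np

def simulate_bh_null(M=50, alpha=0.05, ntrials=200_000, seed=0):
    """Estimate P(N = n): the chance that the Benjamini-Hochberg procedure
    wrongly calls n discoveries when all M p-values are uniform (null)."""
    rng = np.random.default_rng(seed)
    p = np.sort(rng.random((ntrials, M)), axis=1)   # ordered p-values per trial
    fail = p >= np.arange(1, M + 1) / M * alpha     # True where the BH comparison fails
    # N = index of the first failure (if nothing fails, all M are called)
    N = np.where(fail.any(axis=1), fail.argmax(axis=1), M)
    return np.bincount(N, minlength=M + 1) / ntrials

P = simulate_bh_null()
print(P[:4])   # estimates for N = 0, 1, 2, 3; P[0] should be near 0.95
```

With these defaults, `P[0]` comes out near <math>(1-\alpha/M)^M \approx 0.95</math>, in agreement with the MATLAB result.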


2. Does the distribution that you found in problem 1 depend on <math>M</math>? On <math>\alpha</math>? Derive its form analytically for the usual case of <math>\alpha \ll 1</math>?

Solution:

The distribution does not change with <math>M</math>, but does change with <math>\alpha</math>. First, we need to work out the distribution of each entry in the sorted list of p-values. I will refer to these p-values (for notational purposes) as <math>p_{(i)}</math>, where <math>i</math> is the index of the value in the ordered list, starting with 1. Before being ordered, these p-values were distributed uniformly. However, ordering the p-values changes their distributions. For the first member of the list, the probability density is the density of that particular p-value taking on that particular value (which is 1) multiplied by the probability that every other p-value is greater than that value:

   <math>f_{(1)}(p) = M \cdot 1 \cdot (1-p)^{M-1}</math>

The reason for the multiplication by <math>M</math> is that any of the <math>M</math> p-values could be the smallest one. For the second member of the list, the probability density is the density of that particular p-value taking on that particular value (which is 1) multiplied by the probability that one p-value is below this value and the probability that the other <math>M-2</math> p-values are above that value:

   <math>f_{(2)}(p) = \binom{M}{2} \cdot 2 \cdot p\,(1-p)^{M-2} = M(M-1)\,p\,(1-p)^{M-2}</math>

The reason for the multiplication by <math>\binom{M}{2}</math> is that we are choosing 2 of the <math>M</math> p-values to be the lowest 2 p-values. The reason for the multiplication by 2 is that the particular p-value that takes on the role of <math>p_{(2)}</math> could be either of the chosen 2 lowest p-values. By induction, the distribution of the <math>k</math>th entry in the ordered list of p-values is

   <math>f_{(k)}(p) = \frac{M!}{(k-1)!\,(M-k)!}\, p^{k-1} (1-p)^{M-k}</math>
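These order-statistic densities are the Beta densities: the <math>k</math>th smallest of <math>M</math> uniforms is Beta<math>(k, M-k+1)</math>, with mean <math>k/(M+1)</math>, and integrating <math>M(1-p)^{M-1}</math> gives the CDF <math>1-(1-t)^M</math> for the smallest. Two quick empirical checks (a Python sketch with NumPy, used here only for illustration):

```python
import numpy as np

M = 50
rng = np.random.default_rng(2)
samples = np.sort(rng.random((100_000, M)), axis=1)  # each row: M ordered uniforms

# Check the smallest p-value: integrating M*(1-p)^(M-1) gives
# the CDF F1(t) = 1 - (1-t)^M.
t = 0.01
empirical_cdf = (samples[:, 0] < t).mean()
analytic_cdf = 1 - (1 - t) ** M
print(empirical_cdf, analytic_cdf)   # both near 0.395

# Check the k-th smallest: f_k is the Beta(k, M-k+1) density,
# so E[p_(k)] = k / (M + 1).
for k in (1, 2, 25):
    print(k, samples[:, k - 1].mean(), k / (M + 1))
```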

Now that we know the distribution of each of these members of the ordered list, we can determine the probability that a certain number of the null hypotheses are wrongly rejected. This probability is given by the equation

   <math>P(N = n) = \int_0^{n\alpha/M} f_{(n)}(p)\,dp \;-\; \int_0^{(n+1)\alpha/M} f_{(n+1)}(p)\,dp ,</math>

where the first term is taken to be 1 for <math>n = 0</math> and the second to be 0 for <math>n = M</math>.

It is easily shown that this is a valid probability distribution (i.e., it sums to 1), since the successive terms telescope. Assuming <math>\alpha = 0.05</math> results in <math>P(0) = (1-\alpha/M)^M \approx 0.951</math> and rapidly decreasing probabilities for <math>N = 1</math>, <math>2</math>, and <math>3</math>, consistent with the simulation in problem 1.
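Evaluating this distribution numerically only needs binomial tail sums, because <math>p_{(n)} < t</math> exactly when at least <math>n</math> of the <math>M</math> uniform p-values fall below <math>t</math>. A Python sketch (standard library only; the function name and the difference-of-tails form <math>P(N=n) = Q_n - Q_{n+1}</math> with <math>Q_n = P(p_{(n)} < n\alpha/M)</math> are my reading of the derivation above):

```python
from math import comb

def Q(n, M=50, alpha=0.05):
    """P(p_(n) < n*alpha/M): at least n of M uniform p-values fall below
    the n-th Benjamini-Hochberg comparison value (Q(0) = 1 by convention)."""
    if n == 0:
        return 1.0
    t = n * alpha / M
    return sum(comb(M, j) * t**j * (1 - t) ** (M - j) for j in range(n, M + 1))

M, alpha = 50, 0.05
P = [Q(n, M, alpha) - Q(n + 1, M, alpha) for n in range(M)] + [Q(M, M, alpha)]
print(P[:4])    # P(0) equals (1 - alpha/M)^M
print(sum(P))   # the terms telescope to Q(0) = 1
```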


To Think About

1. Suppose you have <math>M</math> independent trials of an experiment, each of which yields an independent p-value. Fisher proposed combining them by forming the statistic

   <math>S = -2 \sum_{i=1}^{M} \ln p_i</math>

Show that, under the null hypothesis, S is distributed as <math>\chi^2</math> with <math>2M</math> degrees of freedom, and describe how you would obtain a combined p-value for this statistic.
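A sketch of Fisher's method in Python (standard library only): under the null each <math>-2\ln p_i</math> is <math>\chi^2</math> with 2 degrees of freedom, so <math>S</math> is <math>\chi^2</math> with <math>2M</math> and has mean <math>2M</math>; for an even number of degrees of freedom the <math>\chi^2</math> survival function has the closed form used below. The three p-values at the end are made up purely for illustration.

```python
import math
import random

def fisher_statistic(pvals):
    """Fisher's combined statistic S = -2 * sum(ln p_i)."""
    return -2.0 * sum(math.log(p) for p in pvals)

def chi2_sf_even(s, dof):
    """Survival function of chi-squared with an even number of degrees
    of freedom: exp(-s/2) * sum_{k < dof/2} (s/2)^k / k!  (closed form)."""
    half = s / 2.0
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Monte Carlo check: under the null, S should average 2*M.
random.seed(0)
M = 20
S_vals = [fisher_statistic([random.random() for _ in range(M)])
          for _ in range(20_000)]
print(sum(S_vals) / len(S_vals))   # close to 2*M = 40

# Combined p-value for a hypothetical set of p-values:
p_combined = chi2_sf_even(fisher_statistic([0.04, 0.10, 0.08]), dof=6)
print(p_combined)
```

Note how three individually unremarkable p-values combine into a noticeably smaller one, which is exactly the appeal of the method.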


2. Fisher is sometimes credited, on the basis of problem 1, with having invented "meta-analysis", whereby results from multiple investigations can be combined to get an overall more significant result. Can you see any pitfalls in this?