Segment 16 Sanmit Narvekar

From Computational Statistics Course Wiki

Segment 16

To Calculate

1. Simulate the following: You have M=50 p-values, none actually causal, so that they are drawn from a uniform distribution. Not knowing this sad fact, you apply the Benjamini-Hochberg prescription with $\alpha=0.05$ and possibly call some discoveries as true. By repeated simulation, estimate the probability of thus getting N wrongly-called discoveries, for N=0, 1, 2, and 3.

Here is the Matlab code:


M = 50;
alpha = 0.05;
nIters = 1000000;
discoveries = zeros(10,1);

for iter=1:nIters

    % Draw 50 pvalues, uniformly distributed
    pvals = rand(M,1);

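    % Compare each sorted p-value with its Benjamini-Hochberg threshold i*alpha/M
    % and count how many fall below it. Note: the step-up rule proper takes the
    % largest such rank i, which can exceed this count when N > 0.
    sortedPvals = sort(pvals);
    nDiscoveries = sum(sortedPvals < (1:M)' .* alpha ./ M);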
    discoveries(nDiscoveries+1) = discoveries(nDiscoveries+1) + 1;
end

discoveries ./ nIters
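
For comparison, the Benjamini-Hochberg step-up rule is usually stated as: find the largest rank k with p_(k) <= k*alpha/M and call the k smallest p-values discoveries. Counting how many sorted p-values fall under their own threshold gives the same answer for N = 0 but can give a smaller one otherwise. Here is a minimal sketch of the difference on a small made-up vector; pExample, nStepUp and nBelow are illustrative names and values, not part of the simulation above:

```matlab
% Illustrative values only: pExample is a made-up vector, not simulation output
alpha = 0.05;
pExample = [0.6; 0.018; 0.7; 0.015; 0.5];
M = numel(pExample);

sortedPvals = sort(pExample);
thresholds = (1:M)' * alpha / M;      % BH line: i*alpha/M for rank i
below = sortedPvals <= thresholds;    % which ranks lie on or under the line

% Step-up rule: the largest rank under the line sets the number of discoveries
k = find(below, 1, 'last');
if isempty(k)
    nStepUp = 0;
else
    nStepUp = k;
end

% Plain count of ranks under the line
nBelow = sum(below);

fprintf('step-up: %d, count: %d\n', nStepUp, nBelow);
```

In this example only the second-smallest p-value (0.018 against a threshold of 0.02) is under its line, so the plain count is 1 while the step-up rule calls 2 discoveries.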

I actually calculated it for N = 0 to 9, since values of N above 3 also occurred. Here are the results (starting with N=0 at the top):

   0.950169
   0.046188
   0.003375
   0.000250
   0.000017
   0.000001
   0
   0
   0
   0


2. Does the distribution that you found in problem 1 depend on M? On $\alpha$? Derive its form analytically for the usual case of $\alpha \ll 1$.

The distribution does not appear to depend noticeably on M. I ran the code above with M = 1000 and got the following results, which agree with the ones above to within simulation noise:

   0.950202
   0.046139
   0.003332
   0.000297
   0.000025
   0.000005
   0
   0
   0
   0

It does depend on alpha: the N = 0 bin is close to 1 - alpha in both runs (about 0.950 for alpha = 0.05 above and about 0.900 for alpha = 0.10 below), and the change propagates to the other bins. Here is alpha = 0.10:

   0.899984
   0.086055
   0.011699
   0.001868
   0.000321
   0.000056
   0.000014
   0.000003
   0
   0
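
The N = 0 bin can also be checked analytically. As a sketch, assuming independent uniform p-values: no discovery is called exactly when every sorted p-value lies above its threshold, and a standard result (Simes' identity) gives that probability as exactly 1 - alpha for any M. The M = 2 case can be worked by hand:

```latex
\begin{align*}
P(N=0) &= P\left(p_{(1)} > \tfrac{\alpha}{2},\; p_{(2)} > \alpha\right) \\
       &= P\left(p_1 > \tfrac{\alpha}{2},\, p_2 > \tfrac{\alpha}{2}\right)
          - P\left(\tfrac{\alpha}{2} < p_1 \le \alpha,\, \tfrac{\alpha}{2} < p_2 \le \alpha\right) \\
       &= \left(1 - \tfrac{\alpha}{2}\right)^2 - \left(\tfrac{\alpha}{2}\right)^2
        = 1 - \alpha .
\end{align*}
```

This agrees with the simulated first bins of about 0.950 and 0.900. It says nothing about how the remaining probability alpha is split among N = 1, 2, and so on.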

To Think About

1. Suppose you have M independent trials of an experiment, each of which yields an independent p-value. Fisher proposed combining them by forming the statistic

$$S = -2\sum_{i=1}^{M}\log(p_i)$$

Show that, under the null hypothesis, S is distributed as $\text{Chisquare}(2M)$ and describe how you would obtain a combined p-value for this statistic.
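
One way to see this, sketched under the assumption that each $p_i$ is independent and uniform on (0,1) under the null (the symbols $X_i$ and $p_{\text{combined}}$ are introduced here for illustration):

```latex
\begin{align*}
X_i &= -2\log(p_i), \qquad
P(X_i > x) = P\left(p_i < e^{-x/2}\right) = e^{-x/2}, \quad x \ge 0, \\
&\Rightarrow\; X_i \sim \text{Exponential}(\text{mean } 2) = \text{Chisquare}(2), \\
S &= \sum_{i=1}^{M} X_i \sim \text{Chisquare}(2M)
   \quad \text{(sum of independent chi-square variables)}, \\
p_{\text{combined}} &= P\left(\text{Chisquare}(2M) \ge S\right)
   = e^{-S/2} \sum_{k=0}^{M-1} \frac{(S/2)^k}{k!} .
\end{align*}
```

The combined p-value is the upper tail of the Chisquare(2M) distribution at the observed S; the closed form in the last line holds because the number of degrees of freedom is even.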


2. Fisher is sometimes credited, on the basis of problem 1, with having invented "meta-analysis", whereby results from multiple investigations can be combined to get an overall more significant result. Can you see any pitfalls in this?