Segment 33 Sanmit Narvekar

From Computational Statistics Course Wiki
Revision as of 00:52, 29 April 2014 by Sanmit (talk | contribs) (Segment 33 - To Calculate)

Segment 33

To Calculate

1. How many distinct m by n contingency tables are there that have exactly N total events?

This reminds me of a very similar problem from when I took algebra. There are N books on a bookshelf, and you want to divide them into some number of groups (using, for example, those book dividers that hold them upright). Imagine a row of slots, each of which holds either a book or a divider. To divide N books into x groups you need x-1 dividers, so there must be N + x - 1 slots. Every arrangement is then fixed by choosing which slots hold the dividers (x-1 of them) or, equivalently, which slots hold the books (N of them). A group is allowed to be empty (two adjacent dividers, or a divider at either end), just as a cell of a table may contain zero events.

The problem here is equivalent: the N books are the N events. The x groups are the mn buckets we wish to drop the events into. So, the answer is:

$$\binom{N + mn - 1}{mn - 1} = \binom{N + mn - 1}{N}$$
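
As a sanity check on the counting argument, here is a minimal MATLAB sketch that compares the formula against a brute-force count for a small 2 by 2 case. The helper name numTables and the choice N = 3 are illustrative assumptions, not part of the original solution.

```matlab
% Number of m-by-n tables with N total events (books-and-dividers count).
% numTables is a hypothetical helper name introduced for this check.
numTables = @(m, n, N) nchoosek(N + m*n - 1, m*n - 1);

% Brute-force check for a small 2-by-2 case: enumerate every way to
% split N events over four cells and count the tables.
N = 3;
bruteCount = 0;
for c11 = 0:N
    for c12 = 0:(N - c11)
        for c21 = 0:(N - c11 - c12)
            % c22 is forced to N - c11 - c12 - c21, so each (c11, c12, c21)
            % triple corresponds to exactly one table
            bruteCount = bruteCount + 1;
        end
    end
end

fprintf('formula: %d, brute force: %d\n', numTables(2, 2, N), bruteCount);
```

For a 2 by 2 table with 14 events, the same formula gives nchoosek(17, 3) = 680 distinct tables.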

2. For every distinct 2 by 2 contingency table containing exactly 14 elements, compute its chi-square statistic, and also its Wald statistic. Display your results as a scatter plot of one statistic versus the other.

Here is the Matlab code:

close all;
hold on;

N = 14;

% Brute force!
count = 0;
for c11=0:N  
    for c12=0:(N-c11)       
        for c21=0:(N-c11-c12)
            c22 = N-c11-c12-c21;
            table = [c11 c12; c21 c22];
            % Chisquare statistic
            expectedTable = sum(table,2)*sum(table,1)/sum(sum(table));            
            chisquare = sum(sum(((table - expectedTable).^2) ./ expectedTable));
            % Wald statistic
            pcol = table(1,:) ./ sum(table, 1);
            phat = sum(table(1,:)) / sum(sum(table));           
            wald = (pcol(1) - pcol(2)) / sqrt(phat * (1-phat) * sum((1 ./ sum(table, 1))));
            % Plot (tables with an empty row or column give NaN statistics, which plot skips)
            plot(chisquare, wald, 'b.')
            count = count + 1;
        end
    end
end

title(sprintf('chisquare vs wald for 2x2 table with %d elements', N))
xlabel('chisquare statistic')
ylabel('wald statistic')

foundAll = (count == nchoosek(N+3,3))

And here is the resulting scatter plot:


One thing to note is that the chisquare statistic is never negative, whereas the wald statistic can take either sign, which is why the plot is symmetric about the x axis: swapping the two columns of a table flips the sign of wald but leaves chisquare unchanged. The squaring in chisquare is also evident in the parabolic shape of the curve.
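
The parabolic shape can be checked numerically. For the pooled form of the Wald statistic used in the code above, working through the algebra suggests that wald^2 and chisquare are the same quantity, so the points should lie on the curve chisquare = wald^2 up to rounding. The sketch below is only a check under that assumption. The names tbl, expectedTbl and maxGap are introduced here (tbl simply avoids shadowing MATLAB's built-in table type), and tables with an empty row or column are skipped because both statistics are 0/0 for them.

```matlab
N = 14;
maxGap = 0;   % largest |chisquare - wald^2| seen over the valid tables

for c11 = 0:N
    for c12 = 0:(N - c11)
        for c21 = 0:(N - c11 - c12)
            c22 = N - c11 - c12 - c21;
            tbl = [c11 c12; c21 c22];

            % Skip tables with an empty row or column (statistics undefined)
            if any(sum(tbl, 1) == 0) || any(sum(tbl, 2) == 0)
                continue;
            end

            % Chi-square statistic, computed as in the original code
            expectedTbl = sum(tbl, 2) * sum(tbl, 1) / N;
            chisquare = sum(sum((tbl - expectedTbl).^2 ./ expectedTbl));

            % Wald statistic with pooled variance, as in the original code
            pcol = tbl(1, :) ./ sum(tbl, 1);
            phat = sum(tbl(1, :)) / N;
            wald = (pcol(1) - pcol(2)) / sqrt(phat * (1 - phat) * sum(1 ./ sum(tbl, 1)));

            maxGap = max(maxGap, abs(chisquare - wald^2));
        end
    end
end

fprintf('max |chisquare - wald^2| = %g\n', maxGap);
```

If the printed gap is at the level of floating-point rounding, the scatter plot is a sideways parabola rather than merely parabola-like.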

To Think About

1. Suppose you want to find out if living under power lines causes cancer. Describe in detail how you would do this (1) as a case/control study, (2) as a longitudinal study, (3) as a snapshot study. Can you think of a way to do it as a study with all the marginals fixed (protocol 4)?

2. For an m by n contingency table, can you think of a systematic way to code "the loop over all possible contingency tables with the same marginals" in slide 8?
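
One possible approach, sketched here as an assumption rather than as the method from the slides: once the marginals are fixed, the top-left (m-1) by (n-1) block of cells determines the rest of the table, so it is enough to step through every filling of that block like an odometer and keep the fillings whose remaining cells come out nonnegative. The example marginals rowSums and colSums, and all variable names, are hypothetical. This is simple but not efficient, since many candidate fillings are rejected. It assumes m and n are both at least 2.

```matlab
rowSums = [3 2];      % hypothetical example marginals (must share a total)
colSums = [1 2 2];
m = numel(rowSums);
n = numel(colSums);
assert(sum(rowSums) == sum(colSums));

% Each free cell (i, j) can be at most min(rowSums(i), colSums(j))
ub = bsxfun(@min, rowSums(1:m-1)', colSums(1:n-1));
ub = ub(:)';                    % flatten in column-major order
k = numel(ub);
free = zeros(1, k);             % current odometer reading
tables = {};                    % collected valid tables

done = false;
while ~done
    % Fill the free block, then derive the last column and last row
    T = zeros(m, n);
    T(1:m-1, 1:n-1) = reshape(free, m-1, n-1);
    T(1:m-1, n) = rowSums(1:m-1)' - sum(T(1:m-1, 1:n-1), 2);
    T(m, 1:n-1) = colSums(1:n-1) - sum(T(1:m-1, 1:n-1), 1);
    T(m, n) = rowSums(m) - sum(T(m, 1:n-1));

    % Keep the table only if every derived cell is a valid count
    if all(T(:) >= 0)
        tables{end+1} = T;
    end

    % Odometer increment: reset maxed-out digits, then bump the next one
    pos = 1;
    while pos <= k && free(pos) == ub(pos)
        free(pos) = 0;
        pos = pos + 1;
    end
    if pos > k
        done = true;
    else
        free(pos) = free(pos) + 1;
    end
end

fprintf('%d tables share these marginals\n', numel(tables));
```

Each entry of tables can then be passed to whatever statistic the loop is supposed to accumulate. A recursive fill that prunes impossible partial tables early would be a natural refinement.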