Segment 34 Sanmit Narvekar

From Computational Statistics Course Wiki
Jump to navigation Jump to search

Segment 34

To Calculate

1. Use the permutation test to decide whether the contingency table Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "":): {\displaystyle \begin{bmatrix} 5 & 3 & 2\\ 2 & 3 & 6 \\ 0 & 2 & 3 \end{bmatrix}} shows a significant association. What is the p-value?

Here is Matlab code for both questions:

function pearson = pearsonTable(table)

expectedTable = sum(table,2)*sum(table,1)/sum(sum(table));            
pearson = sum(sum(((table - expectedTable).^2) ./ expectedTable));
function wald = waldTable(table)

pcol = table(1,:) ./ sum(table, 1);
phat = sum(table(1,:)) / sum(sum(table));           
wald = (pcol(1) - pcol(2)) / sqrt(phat * (1-phat) * sum((1 ./ sum(table, 1))));
table = [5 3 2; 2 3 6; 0 2 3];

% Chisquare statistic
originalTablePearson = pearsonTable(table)

% Wald statistic
originalTableWald = waldTable(table)

% Expand the data
[row col] = ndgrid(1:size(table,1),1:size(table,2));
d = [];
for k=1:numel(table); d = cat(1,d,repmat([row(k),col(k)],table(k),1)); end;

% Function to randomly permute tables and compute their pearson and wald
% statistics
genPearson = @(x) pearsonTable( accumarray([ d(randperm(size(d,1)),1) d(:,2)],1,size(table)));
genWald = @(x) waldTable( accumarray([ d(randperm(size(d,1)),1) d(:,2)],1,size(table)));

pearsonVals = arrayfun(genPearson,1:100000);
waldVals = arrayfun(genWald,1:100000);

% Plots
hist(pearsonVals, 100)
title('Distribution of Pearson Statistic')
xlabel('Pearson value')

hist(waldVals, 100)
title('Distribution of Wald Statistic')
xlabel('Wald value')

% Pvalue calculation (1 and 2 tailed tests for Wald)
pvalPearson = sum(pearsonVals>=originalTablePearson)/numel(pearsonVals)
pvalWald1t = sum(waldVals >= originalTableWald) / numel(waldVals)
pvalWald2t = (sum(waldVals >= originalTableWald)+sum(waldVals <= -originalTableWald))/numel(waldVals)

The Pearson statistic of the original table is 5.7560. The distribution of the pearson statistic over the population of permuted tables is shown in the histogram below. Our table has a pvalue of 0.2293, so we can't rule out the hypothesis that the rows and columns aren't correlated.


2. Repeat the calculation using the Pearson chi-square statistic instead of the Wald statistic, or vice versa.

The Wald statistic of the original table is 1.1643. The distribution of the Wald statistic over the population of tables is shown in the histogram below. Using a 2-tailed pvalue test gives a value of 0.2092 while the 1-tailed test gives 0.0772. In both cases, we still cannot rule out the null hypothesis, so the result is not significant. It's worth noting that the 2-tailed test is more appropriate for the Wald statistic, since values that are very negative/low are just as bad as those that are very positive/high.

EDIT: So apparently the Wald statistic was only defined for 2x2 tables. My implementation didn't rely on this, which is why it computed a value. However, there is no real meaning to that number.


To Think About

1. Is slide's 7 suggestion, that you figure out how to implement the permutation test without "expanding all the data", actually possible? If so, what is your method?