Difference between revisions of "Segment 34 Sanmit Narvekar"

From Computational Statistics Course Wiki
Jump to navigation Jump to search
(Segment 34 - To Calculate)
 
(Segment 34 Edits)
 
Line 14: Line 14:
  
 
Here is Matlab code for both questions:
 
Here is Matlab code for both questions:
 +
 +
<pre>
 +
function pearson = pearsonTable(table)
 +
 +
expectedTable = sum(table,2)*sum(table,1)/sum(sum(table));           
 +
pearson = sum(sum(((table - expectedTable).^2) ./ expectedTable));
 +
</pre>
 +
 +
<pre>
 +
function wald = waldTable(table)
 +
 +
pcol = table(1,:) ./ sum(table, 1);
 +
phat = sum(table(1,:)) / sum(sum(table));         
 +
wald = (pcol(1) - pcol(2)) / sqrt(phat * (1-phat) * sum((1 ./ sum(table, 1))));
 +
</pre>
  
 
<pre>
 
<pre>
Line 66: Line 81:
 
The Wald statistic of the original table is 1.1643. The distribution of the Wald statistic over the population of tables is shown in the histogram below. Using a 2-tailed pvalue test gives a value of 0.2092 while the 1-tailed test gives 0.0772. In both cases, we still cannot rule out the null hypothesis, so the result is not significant. It's worth noting that the 2-tailed test is more appropriate for the Wald statistic, since values that are very negative/low are just as bad as those that are very positive/high.  
 
The Wald statistic of the original table is 1.1643. The distribution of the Wald statistic over the population of tables is shown in the histogram below. Using a 2-tailed pvalue test gives a value of 0.2092 while the 1-tailed test gives 0.0772. In both cases, we still cannot rule out the null hypothesis, so the result is not significant. It's worth noting that the 2-tailed test is more appropriate for the Wald statistic, since values that are very negative/low are just as bad as those that are very positive/high.  
  
 +
EDIT: So apparently the Wald statistic was only defined for 2x2 tables. My implementation didn't rely on this, which is why it computed a value. However, there is no real meaning to that number.
  
 
[[File:SanmitSeg34Wald.png]]
 
[[File:SanmitSeg34Wald.png]]

Latest revision as of 20:57, 29 April 2014

Segment 34

To Calculate

1. Use the permutation test to decide whether the contingency table Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{bmatrix} 5 & 3 & 2\\ 2 & 3 & 6 \\ 0 & 2 & 3 \end{bmatrix}} shows a significant association. What is the p-value?


Here is Matlab code for both questions:

function pearson = pearsonTable(table)

expectedTable = sum(table,2)*sum(table,1)/sum(sum(table));            
pearson = sum(sum(((table - expectedTable).^2) ./ expectedTable));
function wald = waldTable(table)

pcol = table(1,:) ./ sum(table, 1);
phat = sum(table(1,:)) / sum(sum(table));           
wald = (pcol(1) - pcol(2)) / sqrt(phat * (1-phat) * sum((1 ./ sum(table, 1))));
table = [5 3 2; 2 3 6; 0 2 3];

% Chisquare statistic
originalTablePearson = pearsonTable(table)

% Wald statistic
originalTableWald = waldTable(table)

% Expand the data
[row col] = ndgrid(1:size(table,1),1:size(table,2));
d = [];
for k=1:numel(table); d = cat(1,d,repmat([row(k),col(k)],table(k),1)); end;

% Function to randomly permute tables and compute their pearson and wald
% statistics
genPearson = @(x) pearsonTable( accumarray([ d(randperm(size(d,1)),1) d(:,2)],1,size(table)));
genWald = @(x) waldTable( accumarray([ d(randperm(size(d,1)),1) d(:,2)],1,size(table)));

pearsonVals = arrayfun(genPearson,1:100000);
waldVals = arrayfun(genWald,1:100000);

% Plots
hist(pearsonVals, 100)
title('Distribution of Pearson Statistic')
xlabel('Pearson value')
ylabel('Counts')
figure;

hist(waldVals, 100)
title('Distribution of Wald Statistic')
xlabel('Wald value')
ylabel('Counts')

% Pvalue calculation (1 and 2 tailed tests for Wald)
pvalPearson = sum(pearsonVals>=originalTablePearson)/numel(pearsonVals)
pvalWald1t = sum(waldVals >= originalTableWald) / numel(waldVals)
pvalWald2t = (sum(waldVals >= originalTableWald)+sum(waldVals <= -originalTableWald))/numel(waldVals)

The Pearson statistic of the original table is 5.7560. The distribution of the pearson statistic over the population of permuted tables is shown in the histogram below. Our table has a pvalue of 0.2293, so we can't rule out the hypothesis that the rows and columns aren't correlated.


SanmitSeg34Pearson.png


2. Repeat the calculation using the Pearson chi-square statistic instead of the Wald statistic, or vice versa.

The Wald statistic of the original table is 1.1643. The distribution of the Wald statistic over the population of tables is shown in the histogram below. Using a 2-tailed pvalue test gives a value of 0.2092 while the 1-tailed test gives 0.0772. In both cases, we still cannot rule out the null hypothesis, so the result is not significant. It's worth noting that the 2-tailed test is more appropriate for the Wald statistic, since values that are very negative/low are just as bad as those that are very positive/high.

EDIT: So apparently the Wald statistic was only defined for 2x2 tables. My implementation didn't rely on this, which is why it computed a value. However, there is no real meaning to that number.

SanmitSeg34Wald.png


To Think About

1. Is slide's 7 suggestion, that you figure out how to implement the permutation test without "expanding all the data", actually possible? If so, what is your method?