Segment 13. The Yeast Genome

From Computational Statistics Course Wiki


from Segment 13. The Yeast Genome

To Calculate

1. With p=0.3, and various values of n, how big is the largest discrepancy between the Binomial probability pdf and the approximating Normal pdf? At what value of n does this value become smaller than ?

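Binomial pdf: $P_B(k) = \binom{n}{k} p^k (1-p)^{n-k}$, with $p = 0.3$
Normal pdf: $P_N(k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(k-\mu)^2}{2\sigma^2}\right)$, with $\mu = np$ and $\sigma = \sqrt{np(1-p)}$
I plotted these pdfs against each other in MATLAB for various n to get a general sense of which n to test, and then wrote a function that keeps increasing n while the largest difference between the pdfs is still above the target threshold.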


For , we get the difference smaller than .
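The search described above can be sketched in Python. This is not the MATLAB function from the write-up: the names `max_discrepancy` and `smallest_n_below`, the `n_max` cap, and the choice to compare the two pdfs only at integer k are assumptions made for illustration, and the threshold is left as a parameter.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def max_discrepancy(n, p=0.3):
    """Largest |Binomial - Normal| over k = 0..n, with mu = n*p, sigma = sqrt(n*p*(1-p))."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return max(abs(binom_pmf(k, n, p) - normal_pdf(k, mu, sigma))
               for k in range(n + 1))


def smallest_n_below(threshold, p=0.3, n_max=10_000):
    """First n (scanning upward from 1) whose largest discrepancy is below threshold."""
    for n in range(1, n_max + 1):
        if max_discrepancy(n, p) < threshold:
            return n
    raise ValueError("threshold not reached for n <= n_max")


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_discrepancy(n))
```

Scanning every n is quadratic in the final n, so for a very small threshold it would make sense to step n geometrically first and then refine.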

2. Show that if four random variables are (together) multinomially distributed, each separately is binomially distributed.

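Given four outcomes $i = 1, \dots, 4$, each with probability $p_i$ s.t. $\sum_i p_i = 1$ and occurring $n_i$ times, s.t. $\sum_i n_i = N$, we have the multinomial distribution:

$$P(n_1, n_2, n_3, n_4) = \frac{N!}{n_1!\, n_2!\, n_3!\, n_4!}\; p_1^{n_1} p_2^{n_2} p_3^{n_3} p_4^{n_4}$$

We claim that each $n_i$ is binomially distributed. WLOG let's find the probability distribution of $n_1$, i.e., the probability of $n_1$ successes in $N$ tries, using the multinomial formula.

This is equivalent to summing the multinomial equation above over all $n_2, n_3, n_4$ with $n_2 + n_3 + n_4 = N - n_1$, so:

$$P(n_1) = \sum_{n_2+n_3+n_4 = N-n_1} \frac{N!}{n_1!\, n_2!\, n_3!\, n_4!}\; p_1^{n_1} p_2^{n_2} p_3^{n_3} p_4^{n_4} = \frac{N!}{n_1!\,(N-n_1)!}\; p_1^{n_1} \sum_{n_2+n_3+n_4 = N-n_1} \frac{(N-n_1)!}{n_2!\, n_3!\, n_4!}\; p_2^{n_2} p_3^{n_3} p_4^{n_4}$$

$$= \binom{N}{n_1}\, p_1^{n_1} (p_2 + p_3 + p_4)^{N-n_1} = \binom{N}{n_1}\, p_1^{n_1} (1 - p_1)^{N-n_1}$$

by the multinomial expansion theorem, together with $p_2 + p_3 + p_4 = 1 - p_1$.

So $n_1$ is binomially distributed, and the proof is the same for the other $i$'s.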
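A quick numerical check of this claim in Python: sum the four-category multinomial pmf over the other three counts by brute force and compare the result with the binomial pmf. The function names, the probabilities (0.1, 0.2, 0.3, 0.4), and the total of 12 trials are made-up values for illustration only.

```python
import math


def multinomial_pmf(counts, probs):
    """Multinomial probability of the given counts under the given category probabilities."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)  # stays an exact integer at every step
    prob = float(coeff)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob


def marginal_of_first(n1, n_total, probs):
    """Sum the 4-category multinomial pmf over all (n2, n3, n4) with n1 held fixed."""
    rest = n_total - n1
    total = 0.0
    for n2 in range(rest + 1):
        for n3 in range(rest - n2 + 1):
            n4 = rest - n2 - n3
            total += multinomial_pmf((n1, n2, n3, n4), probs)
    return total


def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


if __name__ == "__main__":
    probs = (0.1, 0.2, 0.3, 0.4)  # illustrative values, must sum to 1
    n_total = 12
    for n1 in range(n_total + 1):
        print(n1, marginal_of_first(n1, n_total, probs), binomial_pmf(n1, n_total, probs[0]))
```

The two printed columns should agree to floating-point precision for every value of the first count. That is a sanity check, not a substitute for the proof.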

To Think About

1. The segment suggests that $A \ne T$ and $C \ne G$ comes about because genes are randomly distributed on one strand or the other. Could you use the observed discrepancies to estimate, even roughly, the number of genes in the yeast genome? If so, how? If not, why not?

2. Suppose that a Bayesian thinks that the prior probability of the hypothesis that "$P_A = P_T$" is 0.9, and that the set of all hypotheses that "$P_A \ne P_T$" have a total prior of 0.1. How might he calculate the odds ratio $\text{Prob}(P_A = P_T)/\text{Prob}(P_A \ne P_T)$? Hint: Are there nuisance variables to be marginalized over?

Class Activity

With Andrea, Eleisha, and Sanmit, I tried to figure out the length of each ORF in the file and save each value in an array using MATLAB, while my other group members worked in Python (a race!), but I didn't finish before class was up.

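Here's my code (the version from class didn't compile; the missing pieces are filled in here):

% fid is assumed to be an open handle to the sequence file (the fopen call was not recorded)
a = fscanf(fid,'%c'); a = a(:)'; fclose(fid);   % whole sequence as one row of characters

orflengths = [];                                % length in bases of each ORF found
c = nextstopcodon(a,1);                         % first stop codon in reading frame 1
while c > 0
   startindex = c+3;                            % ORF starts right after this stop codon
   endindex = nextstopcodon(a,startindex);      % and ends at the next in-frame stop codon
   if endindex > 0
       orflengths(end+1) = endindex - startindex;
   end
   c = endindex;                                % 0 when no further stop codon exists
end

function myindex = nextstopcodon(string,startin)
% Index of the first stop codon at or after startin (same frame), or 0 if there is none.
myindex = 0;
for c = startin:3:length(string)-2
   codon = string(c:c+2);
   if any(strcmp(codon,{'TGA','TAA','TAG'}))    % the class version compared against 'TAA ' with a trailing space
       myindex = c; return
   end
end
end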
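For comparison, here is a minimal Python sketch of the same idea. It is not the code my group members wrote. It assumes an ORF is the stretch between two consecutive in-frame stop codons (which is what `startindex = c+3` was aiming at), and the function name `orf_lengths` and the `frame` parameter are made up for illustration.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}


def orf_lengths(seq, frame=0):
    """Lengths (in bases) of the stretches between consecutive in-frame stop codons."""
    lengths = []
    start = None  # index just past the most recent stop codon, None until one is seen
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if start is not None:
                lengths.append(i - start)
            start = i + 3
    return lengths


if __name__ == "__main__":
    # Toy sequence: stop, 6 bases, stop, 3 bases, stop, 2 leftover bases
    print(orf_lengths("TGA" + "AAACCC" + "TAA" + "GGG" + "TAG" + "AC"))
```

Calling it with `frame` set to 0, 1, and 2 covers the three reading frames of one strand. The other strand would need the reverse complement of the sequence first.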


Sanmit Narvekar got a graph of p values and will post it later.