Segment 40 Sanmit Narvekar

Segment 40

To Calculate

1. The file Twoexondata.txt has 3000 pairs of (first, second) exon lengths. Choose 600 of the first exon lengths at random. Then, in your favorite programming language, repeat the calculation shown in the segment to model the chosen first exon lengths as a mixture of two Student distributions. That is (see slide 2): "6 parameters: two centers, two widths, ratio of peak heights, and Student t index." After running your Markov chain, plot the posterior distribution of the ratio of areas of the two Student components, as in slide 6.
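
For reference, the density the code below implements (matching the slide's parameterization, with t_nu the standard Student-t pdf) is

p(x) = [ t_nu((x - c1)/c2) + c3 * t_nu((x - c4)/c5) ] / (c2 + c3*c5)

The denominator normalizes p, since each scaled t term integrates to its width: c2 for the first component and c3*c5 for the second. This is also why the ratio of areas computed below is c2/(c3*c5).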

Here is my Matlab code:

% --- twostudloglikenu.m: negative log-likelihood of the two-Student mixture ---
% c = [center1 width1 ratio center2 width2 nu]
function el = twostudloglikenu(c,x)
p1 = tpdf((x-c(1))./c(2),c(6));           % first component, unit peak height
p2 = c(3)*tpdf((x-c(4))./c(5),c(6));      % second component, scaled by ratio c(3)
p = (p1 + p2) ./ (c(2)+c(3)*c(5));        % normalize so p integrates to 1
el = sum(-log(p));
end

% --- driver script ---
allPairs = load('Twoexondata.txt');


% Choose 600 of the first exon lengths at random
data = randsample(allPairs(:,1), 600);
%hist(data, 50)

% Negative log likelihood of the data given the parameters
negloglikefn = @(cc) twostudloglikenu(cc,data);

% Initial guess for parameters: [center1 width1 ratio center2 width2 nu]
cstart = [2.0 .2 .16 2.9 .4 4];
%covar = 0.1 * inv(0.5 * hessian(negloglikefn,cstart,.001));
%covar = real(covar);

% The Hessian-based covariance above came out with imaginary entries (see
% note below), so use a small random symmetric PSD matrix instead
covar = rand(length(cstart));
covar = covar * covar';    % A*A' is symmetric positive semidefinite
covar = covar * 0.001;     % keep the proposal steps small

% Generate the chain
chainLength = 100000;
chain = zeros(chainLength, length(cstart));

chain(1,:) = cstart;
fprintf('Constructing Markov chain...\n')
for step=2:chainLength
    chain(step,:) = mcmcstep(chain(step-1,:), covar, data);
end
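
% (Hedged aside, not something my original run did: one might discard an
% initial burn-in here before plotting, e.g. chain = chain(chainLength/10+1:end,:);)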

figure

plot(chain(:,5),chain(:,6),'.r')
ylabel('Student nu parameter')
xlabel('Second Student model width')

figure

areas = chain(:,2)./(chain(:,3).*chain(:,5));   % area1/area2 = c2/(c3*c5)
plot(areas,chain(:,6),'.')
xlabel('Ratio of areas')
ylabel('Student nu parameter')

figure
hist(areas,0:50)
xlabel('Ratio of areas')
axis([0 50 0 10000])

% MLE fit for comparison with the posterior samples
mle = fminsearch(negloglikefn, cstart);
mleAreaRatio = mle(2) / (mle(3)*mle(5))   % no semicolon: print the result
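
The loop above calls mcmcstep, which isn't listed here. Below is a minimal sketch of a Metropolis step consistent with the driver, assuming a symmetric Gaussian proposal built from covar and a uniform prior (the course's actual mcmcstep may differ, so treat this as an illustration only):

function cnew = mcmcstep(cold, covar, data)
% One Metropolis update: Gaussian proposal, accept with probability
% min(1, exp(Lold - Lprop)) where L is the negative log-likelihood.
cprop = mvnrnd(cold, covar);               % symmetric proposal (Statistics Toolbox)
Lold  = twostudloglikenu(cold, data);
Lprop = twostudloglikenu(cprop, data);
% Reject proposals that wander into invalid parameter values
% (e.g. negative widths give NaN or complex log-likelihoods)
if ~isreal(Lprop) || isnan(Lprop)
    cnew = cold;
    return
end
if log(rand) < Lold - Lprop                % Metropolis acceptance test
    cnew = cprop;
else
    cnew = cold;
end
end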

[Figure: SanmitSeg40Ex1Data.png]

[Figure: SanmitSeg40Ex1Scatter.png]

[Figure: SanmitSeg40Ex1Hist.png]


There are two peaks, as in the slides. However, the data I loaded and the resulting plots look different from the slides', so something else may be going on...

I also cheated a bit on generating the covariance for the proposal distribution: the covariance computed from the Hessian contained imaginary entries. Since the choice of proposal distribution doesn't really matter (caveat: so long as it doesn't reject nearly everything), I chose a small random symmetric covariance matrix instead.
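
A hedged alternative to the random matrix, assuming the course's hessian helper: take the real part of the Hessian-based estimate, symmetrize it, and floor its eigenvalues at a small positive value so the result is a valid covariance:

H = real(0.1 * inv(0.5 * hessian(negloglikefn, cstart, .001)));
H = 0.5 * (H + H');                            % force exact symmetry
[V, D] = eig(H);
covar = V * diag(max(diag(D), 1e-6)) * V';     % floor eigenvalues to make it PSD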

One thing I noticed while computing the MLE estimates for initializing MCMC is that nu jumps from 3 to 7 (and sometimes as high as 21...), which corresponds to the algorithm getting stuck in one of the two maxima, as seen in the slides.
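
A quick, hedged check on this (not something I ran): histogram the sampled nu values to see whether the posterior itself is bimodal, or whether only the optimizer jumps between basins.

figure
hist(chain(:,6), 50)
xlabel('Student nu parameter')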


2. Make a histogram of the 2nd exon lengths. Do they seem to require two separate components? If so, repeat the calculations of problem 1. If not, use MCMC to explore the posterior of a model with a single Student component. Plot the posterior distribution of the Student parameter nu.

Here is a repeat of problem 1 using the 2nd exon lengths. It looks like there is actually only one component (though even the first exon lengths looked like a single component to me...).
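
For the single-component model the problem suggests, the negative log-likelihood shrinks to three parameters. A minimal sketch, with onestudloglikenu a hypothetical name of mine rather than anything from the course materials:

% --- onestudloglikenu.m: negative log-likelihood, one Student component ---
% c = [center width nu]  (hypothetical helper, for illustration)
function el = onestudloglikenu(c,x)
p = tpdf((x-c(1))./c(2), c(3)) ./ c(2);   % divide by width so p integrates to 1
el = sum(-log(p));
end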

[Figure: SanmitSeg40Ex2Data.png]

[Figure: SanmitSeg40Ex2Scatter.png]

[Figure: SanmitSeg40Ex2Hist.png]


Apologies for being sloppy on this one...

To Think About

1. As a Bayesian, how would you decide whether, in problem 2 above, you need one vs. two components? What about 7 components? What about 200? Can you think of a way to enforce model simplicity?


2. After you have given a good "textbook" answer to the preceding problem, think harder about whether this can really work for large data sets. The problem is that even tiny differences in log-likelihood per data point become huge log-odds differences when the number of data points is large. So, given the opportunity, models are almost always driven to high complexity. What do you think that practical Bayesians actually do about this?