Travis: Segment 40

To Calculate

1. The file twoexondata.txt has 3000 pairs of (first, second) exon lengths. Choose 600 of the first exon lengths at random. Then, in your favorite programming language, repeat the calculation shown in the segment to model the chosen first exon lengths as a mixture of two Student distributions. That is (see slide 2): "6 parameters: two centers, two widths, ratio of peak heights, and Student t index." After running your Markov chain, plot the posterior distribution of the ratio of areas of the two Student components, as in slide 6.
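A minimal sketch of one way to do this, not the segment's exact code: flat priors (positivity constraints only), a plain random-walk Metropolis sampler, and the peak-height-ratio parameterization from slide 2. The log10 transform of the lengths is an assumption carried over from the segment's treatment of the data, and the starting guesses and step sizes below are placeholders to be tuned by hand.

<pre>
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t as student_t

rng = np.random.default_rng(0)
# Assumes two whitespace-separated columns: (first, second) exon lengths.
first = np.loadtxt("twoexondata.txt")[:, 0]
x = np.log10(rng.choice(first, size=600, replace=False))

def loglike(params, x):
    """Mixture of two Student-t components sharing one index nu.

    params = (c1, c2, w1, w2, r, nu): two centers, two widths,
    ratio of peak heights r = h1/h2, and the Student-t index.
    """
    c1, c2, w1, w2, r, nu = params
    if w1 <= 0 or w2 <= 0 or r <= 0 or nu <= 0:
        return -np.inf                     # flat prior, positivity only
    # Choose the heights so the mixture integrates to 1:
    # total area = h1*w1 + h2*w2 with h1 = r*h2.
    h2 = 1.0 / (r * w1 + w2)
    h1 = r * h2
    pdf = (h1 * student_t.pdf((x - c1) / w1, df=nu)
           + h2 * student_t.pdf((x - c2) / w2, df=nu))
    return np.sum(np.log(pdf))

def metropolis(logpost, start, step, nsteps, x):
    """Random-walk Metropolis with independent Gaussian proposals."""
    cur = np.asarray(start, dtype=float)
    cur_lp = logpost(cur, x)
    chain = np.empty((nsteps, cur.size))
    for i in range(nsteps):
        prop = cur + np.asarray(step) * rng.normal(size=cur.size)
        lp = logpost(prop, x)
        if np.log(rng.uniform()) < lp - cur_lp:    # Metropolis accept rule
            cur, cur_lp = prop, lp
        chain[i] = cur
    return chain

# Placeholder starting point and step sizes: eyeball the centers and widths
# from a histogram of x, then tune steps for a reasonable acceptance rate.
start = [2.1, 3.1, 0.3, 0.4, 1.0, 4.0]
step = [0.02, 0.02, 0.01, 0.01, 0.05, 0.2]
chain = metropolis(loglike, start, step, 20000, x)[2000:]   # drop burn-in

# Ratio of areas of the two components: (h1*w1)/(h2*w2) = r*w1/w2.
area_ratio = chain[:, 4] * chain[:, 2] / chain[:, 3]
plt.hist(area_ratio, bins=60, density=True)
plt.xlabel("ratio of areas of the two Student-t components")
plt.show()
</pre>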

2. Make a histogram of the second exon lengths. Do they seem to require two separate components? If so, repeat the calculation of problem 1. If not, use MCMC to explore the posterior of a model with a single Student component. Plot the posterior distribution of the Student parameter <math>\nu</math>.
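A minimal continuation of the same sketch, assuming the metropolis() helper, rng, and imports defined in problem 1, and the same (assumed) log10 transform. The histogram comes first; the single-component branch is the one shown.

<pre>
second = np.loadtxt("twoexondata.txt")[:, 1]   # second exon lengths
y = np.log10(second)                           # same log transform as above

plt.hist(y, bins=60)                           # eyeball: one bump or two?
plt.xlabel("log10 second exon length")
plt.show()

def loglike_one(params, x):
    """Single Student-t component: center c, width w, index nu."""
    c, w, nu = params
    if w <= 0 or nu <= 0:
        return -np.inf
    # logpdf of a t distribution with location c and scale w
    return np.sum(student_t.logpdf((x - c) / w, df=nu) - np.log(w))

# Placeholder start and steps again; tune as needed.
chain1 = metropolis(loglike_one, [np.median(y), y.std(), 4.0],
                    [0.01, 0.01, 0.2], 20000, y)[2000:]
plt.hist(chain1[:, 2], bins=60, density=True)  # posterior of nu
plt.xlabel(r"Student-t index $\nu$")
plt.show()
</pre>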

To Think About

1. As a Bayesian, how would you decide whether, in problem 2 above, you need one vs. two components? What about 7 components? What about 200? Can you think of a way to enforce model simplicity?
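For reference, one textbook device (not prescribed by the problem) is to compare models through their evidence. The odds ratio between models <math>M_1</math> and <math>M_2</math> is

<math>
\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
= \frac{P(M_1)}{P(M_2)}\,
  \frac{\int P(D \mid \theta_1, M_1)\,P(\theta_1 \mid M_1)\,d\theta_1}
       {\int P(D \mid \theta_2, M_2)\,P(\theta_2 \mid M_2)\,d\theta_2}
</math>

where the integrals over each model's prior supply an automatic Occam penalty: a 7- or 200-component model spreads its prior over a much larger parameter volume, and so pays for the extra components unless the data genuinely demand them.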

2. After you have given a good "textbook" answer to the preceding problem, think harder about whether this can really work for large data sets. The problem is that even tiny differences in log-likelihood per data point become huge log-odds differences when the number of data points is large. So, given the opportunity, models are almost always driven to high complexity. What do you think that practical Bayesians actually do about this?
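A one-line calculation shows the scale of the effect (the <math>N = 3000</math> is just this data set's size, and the per-point edge of 0.01 nats is an illustrative figure):

<math>
\Delta \ln(\mathrm{odds}) = N\,\overline{\Delta \ln L}
\approx 3000 \times 0.01 = 30, \qquad e^{30} \approx 10^{13}
</math>

so an advantage of a mere hundredth of a nat per data point already produces odds of roughly ten trillion to one in favor of the more complex model.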