Segment 6 Sanmit Narvekar

From Computational Statistics Course Wiki

Segment 6

To Calculate

1. Write down an explicit expression for what the slides denote as bin(n,N,r).

The binomial distribution describes the number of successes in repeated independent Bernoulli trials. Given N trials, each succeeding with probability r, the probability that exactly n of them "succeed" is:

$$\text{bin}(n,N,r) = \binom{N}{n} r^n (1-r)^{N-n}$$
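As a quick check of the expression above, here is a minimal Python sketch. The function name `bin_pmf` is my own choice (not from the slides), and the value r = 0.0026 is used only as an illustrative input.

```python
from math import comb


def bin_pmf(n: int, N: int, r: float) -> float:
    """Probability of exactly n successes in N independent trials,
    each succeeding with probability r."""
    return comb(N, n) * r**n * (1 - r) ** (N - n)


# With n = 0 the expression collapses to (1 - r)**N.
print(bin_pmf(0, 37, 0.0026))

# Summing over every possible n should give 1 (up to rounding error).
print(sum(bin_pmf(n, 37, 0.0026) for n in range(38)))
```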


2. There is a small error on slide 7 that carries through to the first equation on slide 8 and the graph on slide 9. Find the error, fix it, and redo the graph of slide 9. Does it make a big difference? Why or why not?

There are two problems:

1) Descendant T5 of Joseph Towne had 3 mutated loci, but (most likely due to a typo) the binomial factor in the slides implies there was only one mutation. This is fixed on slide 9.

2) The conditional distribution is missing a factor for the branch with Jacob Towne (one of William Towne's children). Note that adding this factor also changes the factors for Samuel Towne and T4, since Samuel is then only 2 generations away and T4 only 10 generations away.

The new Bayes posterior for the parameter r, using the log-uniform prior and dropping the normalizing constant, is:

$$
\begin{aligned}
P(r \mid \text{data}) \propto{} & \text{bin}(0, 1 \times 37, r) \times \text{bin}(0, 2 \times 37, r) \times \text{bin}(0, 3 \times 37, r) \\
& \times \text{bin}(1, 5 \times 37, r) \times \text{bin}(0, 5 \times 37, r) \times \text{bin}(0, 6 \times 37, r) \times \text{bin}(1, 10 \times 37, r) \times \text{bin}(3, 10 \times 37, r) \times \frac{1}{r}
\end{aligned}
$$
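The product above can be evaluated on a grid of r values to locate its peak. The following is a rough sketch, not the code used for the graph: the grid range, the variable names, and the choice to work in log space are all my own assumptions. Collecting powers of r, the product is proportional to r^4 (1-r)^1549, so the peak should land near 4/1553, which agrees with the value of about 0.0026 quoted below.

```python
from math import comb, log

import numpy as np

LOCI = 37
# (observed mutations, generations) for each bin(...) factor in the product
FACTORS = [(0, 1), (0, 2), (0, 3), (1, 5), (0, 5), (0, 6), (1, 10), (3, 10)]


def log_posterior(r):
    """Unnormalized log posterior of r under the log-uniform prior 1/r."""
    r = np.asarray(r, dtype=float)
    total = -np.log(r)  # log of the 1/r prior
    for n, gens in FACTORS:
        N = gens * LOCI
        total = total + log(comb(N, n)) + n * np.log(r) + (N - n) * np.log1p(-r)
    return total


# Grid limits are an arbitrary choice that comfortably brackets the peak.
r_grid = np.linspace(1e-4, 0.01, 9901)
dr = r_grid[1] - r_grid[0]

log_post = log_posterior(r_grid)
post = np.exp(log_post - log_post.max())  # subtract the max to avoid underflow
post /= post.sum() * dr                   # normalize to unit area on the grid

r_mode = r_grid[np.argmax(post)]
print(f"posterior mode on the grid: r = {r_mode:.4f}")
```

Working with logarithms keeps the many small factors from underflowing; the normalization is only over the chosen grid, so the density values depend on that choice.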

The new graph is shown below in red (the old graph from the slides is in blue). There is not a big difference, for two reasons. First, only one new factor was introduced, and it is close to 1: anything choose 0 is 1, as is r^0, so we are only multiplying by (1-r)^37. Second, the other factors had minor changes. Quantitatively, the calculations in the slides suggested r = 0.0025 was the most likely value with a probability of 0.4649. The corrected calculations give r = 0.0026 with a probability of 0.4643.

[Figure: SanmitSeg6Graph.png]
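To make the size of the added Jacob Towne factor concrete, it can be evaluated at the corrected estimate r = 0.0026 (a worked example added here for illustration):

$$\text{bin}(0, 37, r) = \binom{37}{0} r^0 (1-r)^{37} = (1-r)^{37} \approx 0.908 \quad \text{at } r = 0.0026$$

Because this factor changes slowly over the range of r where the posterior has most of its mass, multiplying by it barely moves the peak.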


To Think About

1. Suppose you knew the value of r (say, r = 0.0038). How would you simulate many instances of the Towne family data (e.g., the tables on slides 4 and 5)?

For each Towne family member, I would start with the sequence of his/her parent, generate 37 random numbers (one per locus) to decide which loci mutate, each with probability r, and then add or delete STR repeats at those loci. This approach has to be done "top-down," from the earliest ancestor to the descendants.
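The top-down procedure described above might look like the following sketch. Several things here are assumptions of mine rather than anything from the slides: the pedigree is a made-up three-generation tree (not the real Towne tree), the founder's repeat counts are random placeholders, and each mutation adds or deletes exactly one repeat.

```python
import numpy as np

LOCI = 37


def mutate_child(parent, r, rng):
    """Copy a parent's haplotype, mutating each locus independently with probability r."""
    child = parent.copy()
    hit = rng.random(LOCI) < r              # 37 random draws, one per locus
    steps = rng.choice([-1, 1], size=LOCI)  # assumed: add or delete one repeat
    child[hit] += steps[hit]
    return child


def simulate_family(tree, root, root_haplotype, r, rng):
    """Walk the pedigree top-down; `tree` maps each person to a list of children."""
    haplotypes = {root: root_haplotype}
    stack = [root]
    while stack:
        person = stack.pop()
        for child in tree.get(person, []):
            haplotypes[child] = mutate_child(haplotypes[person], r, rng)
            stack.append(child)
    return haplotypes


# Hypothetical pedigree, NOT the real Towne tree.
tree = {
    "founder": ["son_a", "son_b"],
    "son_a": ["grandson_a1"],
    "son_b": ["grandson_b1"],
}
rng = np.random.default_rng(0)
founder = rng.integers(10, 30, size=LOCI)  # made-up repeat counts
family = simulate_family(tree, "founder", founder, r=0.0038, rng=rng)

# Number of loci at which each person differs from the founder.
for name, hap in family.items():
    print(name, int(np.sum(hap != founder)))
```

Calling `simulate_family` repeatedly with fresh random draws gives many instances of the family data for the same r.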


2. How would you use your simulation to decide if the assumption of ignoring backmutations (the red note on slide 7) is justified?

The approach I described in 1 above doesn't prevent mutations from "correcting" themselves, so we can simply count how often a mutation is reversed. A reversal is still very unlikely, since the exact same locus would have to be picked again and then mutated in exactly the way that undoes the previous mutation. I didn't specify a procedure for mutating, but the more complex it is (that is, the more repeats that can be added or deleted at once), the less likely it is for a mutation to be exactly reversed.
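One way to count reversals is to track, for each locus along a single line of descent, both how many times it was hit and its net change in repeat count. The sketch below does that under assumptions of my own: one-repeat steps in either direction, a lineage of 10 generations, and 10,000 simulated lineages. A locus that was hit at least once but shows zero net change is one where backmutation hid the mutations.

```python
import numpy as np

LOCI = 37


def simulate_lineages(trials, generations, r, rng):
    """Return (hits, offset), each of shape (trials, LOCI).

    hits[i, j]   = number of mutation events at locus j in lineage i
    offset[i, j] = net change in repeat count at locus j in lineage i
    """
    shape = (trials, generations, LOCI)
    hit = rng.random(shape) < r
    steps = rng.integers(0, 2, size=shape) * 2 - 1  # -1 or +1 (assumed step model)
    offset = (hit * steps).sum(axis=1)
    hits = hit.sum(axis=1)
    return hits, offset


rng = np.random.default_rng(1)
hits, offset = simulate_lineages(trials=10_000, generations=10, r=0.0038, rng=rng)

# Loci that mutated but ended up back at the ancestral repeat count.
reverted = (hits > 0) & (offset == 0)
frac = reverted.any(axis=1).mean()
print(f"fraction of lineages with a hidden (reversed) mutation: {frac:.4f}")
```

If that fraction is small compared with the fraction of lineages that show any mutation at all, ignoring backmutations looks reasonable for this step model.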


3. How would you use your simulation to decide if our decision to trim T2, T11, and T13 from the estimation of r was justified? (This question anticipates several later discussions in the course, but thinking about it now will be a good start.)

I suppose one way would be to generate many simulated family data sets, and see with what probability a descendant ends up with as many mutations as T2, T11, and T13 show.
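A rough sketch of that idea follows. It draws the number of mutation events for one descendant directly from a binomial (so it ignores backmutations), and the inputs `k=8` and `generations=10` are placeholders of mine, not values read from the slides for T2, T11, or T13.

```python
import numpy as np

LOCI = 37


def tail_probability(k, generations, r, trials, rng):
    """Estimate P(at least k mutation events) for one descendant by simulation."""
    counts = rng.binomial(generations * LOCI, r, size=trials)
    return float(np.mean(counts >= k))


rng = np.random.default_rng(2)
# Placeholder inputs: substitute the observed mutation count and the
# number of generations for the descendant in question.
p = tail_probability(k=8, generations=10, r=0.0038, trials=200_000, rng=rng)
print(f"estimated P(at least 8 mutations in 10 generations) = {p:.5f}")
```

A very small estimate would suggest the descendant's mutation count is hard to explain with the same r, which supports treating that descendant separately.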
