Segment 6. The Towne Family Tree

From Computational Statistics Course Wiki

Problems from Segment 6. The Towne Family Tree

To Compute

1. Write down an explicit expression for what the slides denote as bin(n,N,r).

The probability of getting n events in N draws, each with i.i.d. probability r, is

$$\mathrm{bin}(n,N,r) = \binom{N}{n}\, r^{n} (1-r)^{N-n}$$
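As a quick numerical check of the binomial expression, here is a minimal MATLAB sketch. The helper name bin_pmf is hypothetical (the slides only name the quantity bin(n,N,r)), and N = 37 is just an example value. It works in log space through gammaln so that a large N does not overflow the factorials.

```matlab
% Binomial probability of n events in N draws, each with probability r.
% bin_pmf is a hypothetical helper name, not taken from the slides.
% gammaln(k+1) = log(k!), so the binomial coefficient is formed in log space.
bin_pmf = @(n, N, r) exp(gammaln(N+1) - gammaln(n+1) - gammaln(N-n+1) ...
                         + n.*log(r) + (N-n).*log(1-r));

r = 0.0038;              % mutation probability used later on this page
N = 37;                  % example: 37 loci over one generation
p = bin_pmf(0:N, N, r);  % probabilities of 0..N events

disp(sum(p));            % should be 1 up to rounding
disp(bin_pmf(0, N, r));  % should equal (1-r)^N
```

Using an anonymous function keeps the sketch runnable as a plain script; a function file would work equally well.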


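Some expressions that will be useful for question 2 are the special cases for small n:

$$\mathrm{bin}(0,N,r) = (1-r)^{N}, \qquad \mathrm{bin}(1,N,r) = N r (1-r)^{N-1}, \qquad \mathrm{bin}(2,N,r) = \frac{N(N-1)}{2}\, r^{2} (1-r)^{N-2}$$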

2. There is a small error on slide 7 that carries through to the first equation on slide 8 and the graph on slide 9. Find the error, fix it, and redo the graph of slide 9. Does it make a big difference? Why or why not?

The error is that the first Jacob is counted twice in our calculation; that is, in tree terminology, he is skipped as a parent node of Sam and T4.

[Figure: a snapshot of where the error is. The first Jacob is counted twice.]

Here's the fixed slide (well, not completely fixed: do you see the error that doesn't carry over to the next slides?):

[Figure: the fixed slide 7. Now it's correct.]

Re-doing slide 9 by adding a term and changing two terms in the original function, we get the new graph in red (original in blue):

[Figure: corrected slide 9, with new graph in red.]

We can see that the difference is not large: for each probability immediately to the right of 0.4%, the new graph lies slightly above the old one, which means that for each of these probabilities we will have slightly more mutations.

For example, before we had a 0.5% chance of having 93 total mutations in the Towne tree. Now it says the Towne tree has a 0.5% chance of having about 100 mutations. This makes sense because we are no longer counting the 0-mutation node twice, which would have pulled the distribution to the left.

Here's the MATLAB code:

[Figure: the code to redo slide 9.]

To Think About

1. Suppose you knew the value of r (say, r = 0.0038). How would you simulate many instances of the Towne family data (e.g., the tables on slides 4 and 5)?

Here's an example of how I would simulate t3 and t4, and the other T's would be similar.

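r = 0.0038;             % mutation probability per locus per generation
a = rand(1,37) < r;     % Jacob Towne I's DNA
b = rand(2,37) < r;     % 2 generations to Samuel Towne I's DNA
c = rand(6,37) < r;     % 6 generations to T3's DNA

% cumulative mutations per locus give T3's differences in DNA
t3 = sum([a; b; c], 1);

d = rand(10,37) < r;    % 10 generations to T4's DNA

% cumulative mutations per locus give T4's differences in DNA
% (assumes T4's line branches off at Jacob, not through Samuel)
t4 = sum([a; d], 1);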
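To simulate many instances rather than one, the same idea can be vectorized with a third array dimension. This is a minimal sketch under the same assumptions as the snippet above (37 loci, independent mutations with probability r, 1 + 2 + 6 generations down to T3 and 1 + 10 down to T4, with T4 assumed to branch off at Jacob). The names nloc, nsim, ndiff3 and ndiff4 and the choice of 1000 instances are illustrative, not from the slides.

```matlab
r    = 0.0038;   % mutation probability per locus per generation
nloc = 37;       % number of loci
nsim = 1000;     % number of simulated instances (illustrative choice)

% Mutation indicators, one page (third index) per simulated instance.
a = rand(1,  nloc, nsim) < r;   % Jacob Towne I (shared by both lines)
b = rand(2,  nloc, nsim) < r;   % 2 generations to Samuel Towne I
c = rand(6,  nloc, nsim) < r;   % 6 further generations to T3
d = rand(10, nloc, nsim) < r;   % 10 generations to T4 (assumed branch at Jacob)

% Mutation counts per locus: nloc-by-nsim after dropping the singleton dimension.
t3 = squeeze(sum(a,1) + sum(b,1) + sum(c,1));
t4 = squeeze(sum(a,1) + sum(d,1));

% Number of loci that differ from the root in each instance (no backmutation).
ndiff3 = sum(t3 > 0, 1);
ndiff4 = sum(t4 > 0, 1);

fprintf('mean differing loci: T3 %.3f, T4 %.3f\n', mean(ndiff3), mean(ndiff4));
```

Because a is shared between the two lines, the simulated T3 and T4 are correlated through their common ancestor, which independent draws per individual would miss.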

2. How would you use your simulation to decide if the assumption of ignoring backmutations (the red note on slide 7) is justified?

If there are any backmutations, they would show up in t3 or t4 in the above code as a 2, since each entry of the 1-by-37 array would otherwise be either a 1 or a 0, depending on whether there has been a change somewhere along the way. In the handful of times that I ran this code I never got a 2, but I would probably need to run it a hundred times or more to see how often 2's show up. Then, assuming a backmutation is as likely as a forward change, I would divide this frequency by two to get my estimate of the likelihood of backmutations, which would be small if the assumption is justified.
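The frequency of repeated hits can be estimated directly. This is a minimal sketch, assuming an 11-generation path (1 + 10, as for T4 above) and independent mutations. It also compares the simulated frequency with the analytic probability that a locus is hit two or more times. The names g, nsim, counts and p_any are illustrative.

```matlab
r    = 0.0038;    % mutation probability per locus per generation
nloc = 37;        % number of loci
g    = 11;        % generations on the path (1 + 10, as for T4 above)
nsim = 2000;      % number of simulated lineages (illustrative choice)

% counts(l, s) = number of mutations at locus l in simulation s
counts = squeeze(sum(rand(g, nloc, nsim) < r, 1));

% Fraction of simulated lineages in which any locus is hit two or more times.
frac_any_repeat = mean(any(counts >= 2, 1));

% Analytic check: P(a given locus is hit >= 2 times in g generations),
% then P(at least one of the nloc loci is hit >= 2 times).
p_locus = 1 - (1-r)^g - g*r*(1-r)^(g-1);
p_any   = 1 - (1 - p_locus)^nloc;

fprintf('simulated %.4f, analytic %.4f\n', frac_any_repeat, p_any);
```

A repeated hit is only an upper bound on backmutations, since the second change need not restore the original value; halving it, as suggested above, is one rough way to account for that.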

3. How would you use your simulation to decide if our decision to trim T2, T11, and T13 from the estimation of r was justified? (This question anticipates several later discussions in the course, but thinking about it now will be a good start.)

I would simulate T2, T11, and T13 and measure the variance in the number of changes at each DNA slot over a good number of simulations. If the actual T2, T11, and T13 fall well outside the spread implied by that variance, I would trim them.
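One way to make "fall outside" concrete is a simulated tail probability for the total number of differing loci. This is a variation on the per-slot variance described above, not the same statistic. The generation count g and the observed count n_obs below are HYPOTHETICAL placeholders; the real values would come from the tables on the slides.

```matlab
r     = 0.0038;  % mutation probability per locus per generation
nloc  = 37;      % number of loci
g     = 9;       % HYPOTHETICAL generations from the root to the individual
n_obs = 6;       % HYPOTHETICAL observed number of differing loci
nsim  = 5000;    % number of simulated lineages (illustrative choice)

% hits(l, s) is true if locus l changed at least once in simulation s
% (backmutations ignored).
hits  = squeeze(sum(rand(g, nloc, nsim) < r, 1)) > 0;
ndiff = sum(hits, 1);            % differing loci per simulated lineage

% Simulated tail probability of seeing n_obs or more differences.
p_tail = mean(ndiff >= n_obs);

fprintf('mean %.2f, std %.2f, P(ndiff >= %d) ~ %.4f\n', ...
        mean(ndiff), std(ndiff), n_obs, p_tail);
```

A very small p_tail for an individual would suggest that its data are not well described by the single-rate model, which is the kind of evidence that could support trimming it.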

Class Activity
Group 1 with Rene and Eleisha

Activity checkpoints
1. What does a joint uniform prior on w and b look like?

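p(w,b,d) is identically equal to 2 on the triangle w + b ≤ 1 (with d = 1 - w - b) in the unit square of (w,b) space, and zero elsewhere. The triangle has area 1/2, so a constant density of 2 makes the integral over it equal to 1.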

2. Suppose we know that w=0.4, b = 0.3, and d = 0.3. If we watch N = 10 games, what is the probability that W = 3, B = 5, and D = 2?
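Assuming the games are independent with fixed outcome probabilities, this is a multinomial probability, $\frac{10!}{3!\,5!\,2!}\,(0.4)^3 (0.3)^5 (0.3)^2 \approx 0.0353$. A minimal MATLAB sketch of the arithmetic (the variable names coef and P are illustrative):

```matlab
w = 0.4; b = 0.3; d = 0.3;   % outcome probabilities from the question
W = 3;   B = 5;   D = 2;     % observed counts
N = W + B + D;               % total games, here 10

% Multinomial coefficient N! / (W! B! D!) and the resulting probability.
coef = factorial(N) / (factorial(W) * factorial(B) * factorial(D));
P    = coef * w^W * b^B * d^D;

fprintf('coefficient = %d, P = %.4f\n', coef, P);
```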

3. For general w, b, d, W, B, D, what is P(W, B, D | w, b, d)?
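Assuming independent games with fixed outcome probabilities, the counts follow a multinomial distribution, so the general expression can be written as:

```latex
P(W, B, D \mid w, b, d) = \frac{N!}{W!\,B!\,D!}\; w^{W} b^{B} d^{D},
\qquad N = W + B + D, \quad w + b + d = 1 .
```

Setting w = 0.4, b = 0.3, d = 0.3 and (W, B, D) = (3, 5, 2) recovers the special case asked about in checkpoint 2.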

4. Applying Bayes, what is P(w, b, d | W, B, D)? (The Bayes denominator is tricky - if you present us with the integral to evaluate, we will provide the answer.)

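$$P(w,b,d \mid W,B,D) = \frac{P(W,B,D \mid w,b,d)\; p(w,b,d)}{\iint P(W,B,D \mid w',b',d')\; p(w',b',d')\; dw'\, db'}$$

where we use the formulas in 1 and 3, that is, the uniform prior p(w,b,d) = 2 and the multinomial probability $P(W,B,D \mid w,b,d) = \frac{N!}{W!\,B!\,D!}\, w^{W} b^{B} d^{D}$.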

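The denominator is

$$\int_0^1 \int_0^{1-w} 2\, \frac{N!}{W!\,B!\,D!}\; w^{W} b^{B} (1-w-b)^{D}\; db\, dw = \frac{2\, N!}{(N+2)!}$$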

5. Here is the real data - chess_outcomes.txt. Each line represents the outcome of one game. Count the outcomes of the first N games and produce a visualization of the joint posterior of the win rates for N = 0, 3, 10, 100, 1000, and 10000.

We would have taken the natural log of the formula in 4, which turns the difficult factorials (products) into sums and the exponents into coefficients, and then exponentiated at the end.
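The log-space approach can be sketched as follows. This is a minimal example, assuming the normalized posterior $\frac{(N+2)!}{W!\,B!\,D!}\, w^{W} b^{B} d^{D}$ that follows from the uniform prior. The counts W, B, D here are HYPOTHETICAL, since the layout of chess_outcomes.txt is not described on this page, and the grid spacing is an arbitrary choice.

```matlab
W = 3; B = 5; D = 2;           % HYPOTHETICAL counts from the first N games
N = W + B + D;

[w, b] = meshgrid(linspace(0.005, 0.995, 199));   % grid over the unit square
d = 1 - w - b;
inside = d > 0;                 % keep only the triangle w + b < 1

% Log of (N+2)!/(W! B! D!) * w^W * b^B * d^D, using gammaln(k+1) = log(k!).
logpost = -Inf(size(w));
logpost(inside) = gammaln(N+3) - gammaln(W+1) - gammaln(B+1) - gammaln(D+1) ...
    + W*log(w(inside)) + B*log(b(inside)) + D*log(d(inside));

post = exp(logpost);            % exponentiate at the end; zero outside the triangle

% Riemann-sum check that the density integrates to about 1 over the grid.
h = w(1,2) - w(1,1);
fprintf('integral ~ %.4f\n', sum(post(:)) * h^2);

% To visualize: contourf(w, b, post); xlabel('w'); ylabel('b');
```

For N in the thousands the factorials overflow double precision when computed directly, whereas the gammaln terms stay finite. For very large N the peak density itself can become large, so subtracting max(logpost(:)) before exponentiating is a common safeguard when only the shape is needed.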

Back to Ellen Le or Segment 6. The Towne Family Tree.