CS395T/CAM383M Computational Statistics  


#1
03-22-2010, 10:08 AM
namphuon (Member)
Nam Nguyen Term Project

This thread will contain my thoughts and ideas about the term project.

Attached is my proposal. Please give feedback on my ideas. I'll flesh this out more as I begin research on this project.
Attached Images
File Type: pdf paper.pdf (40.5 KB, 580 views)

Last edited by namphuon; 03-22-2010 at 10:53 AM.
#2
04-13-2010, 11:07 AM
wpress (Professor)

Hi, Nam. I'm looking for more detail on where you are, for your mid-course project report. See http://wpressutexas.net/forum/showthread.php?t=221 , and post something here soon.
#3
04-16-2010, 10:44 AM
namphuon (Member)

Quote:
Originally Posted by wpress
Hi, Nam. I'm looking for more detail on where you are, for your mid-course project report. See http://wpressutexas.net/forum/showthread.php?t=221 , and post something here soon.

Hi Bill. Sorry for missing the deadline. I've been busy writing scripts to analyze all my data before writing the mid-course project report. I will have a comprehensive report this Sunday.
#4
04-19-2010, 09:54 AM
namphuon (Member)

Project Description:

Multiple sequence alignment (MSA) is an important first step in many computational biology analyses. An MSA provides a wealth of information: it helps identify homology between sequences, infer phylogenetic trees, and find proteins of similar function. However, errors in an MSA can hamper efforts to infer information from the sequences. In addition, with real biological data, the true evolutionary history of the set of sequences is unknown. Therefore, it would be advantageous to have a method to estimate the reliability of an estimated MSA.

Landan and Graur presented a method to determine the local reliability of an MSA by generating many alternative MSAs using a variant of the Heads or Tails (HoT) method [1,2]. First, an estimated MSA is generated from the set of unaligned sequences. Next, a guide tree is generated from the estimated MSA. The guide tree is used to generate 8(N-3) alternate MSAs, where N is the number of taxa. The reliability of the estimated MSA is measured by the frequency with which each nucleotide (NT) pair appears in the set of alternate MSAs. Landan and Graur showed that NT pairs with high support were much more likely to be in the true alignment, and NT pairs with low support were less likely to be in the true alignment.

My project is to examine whether the NT pairing frequency is a good metric for inferring alignment accuracy. I will use SATe-II [3,4] to generate an estimated MSA. However, rather than using the HoT method to generate the set of alternate MSAs, I will use several different MSA methods: PRANK, OPAL, MUSCLE, MAFFT, PROBTREE. The advantage of this approach is that I only need to perform 5 alignments on the dataset rather than the 8(N-3) alignments necessary for the HoT method, where N is the number of taxa.

Once I have an estimated MSA and a set of alternate MSAs, I can score each NT pairing in the estimated MSA with a support value, where the support value is the proportion of alternate MSAs that contain the NT pairing. The goal is to examine whether NT pairs with high support values are also likely to be in the true alignment.
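
To make the support value concrete, here is a minimal Python sketch. It is not the Landan and Graur code that my scripts use; the dictionary representation of an MSA and the function names residue_pairs and support_values are assumptions made for illustration only.

```python
from itertools import combinations

def residue_pairs(msa):
    """Return the set of aligned residue (NT) pairs in an MSA.

    msa maps a sequence name to its gapped row; all rows have equal length.
    A pair is ((name_a, i), (name_b, j)): the i-th residue of a shares a
    column with the j-th residue of b (indices count non-gap characters).
    """
    names = sorted(msa)
    length = len(msa[names[0]])
    seen = {name: 0 for name in names}  # residues consumed so far in each row
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, seen[name]))
                seen[name] += 1
        pairs.update(combinations(present, 2))
    return pairs

def support_values(estimated, alternates):
    """Map each NT pair of the estimated MSA to the fraction of alternate
    MSAs that contain the same pair."""
    alt_pairs = [residue_pairs(m) for m in alternates]
    return {p: sum(p in s for s in alt_pairs) / len(alt_pairs)
            for p in residue_pairs(estimated)}

# Toy example (made-up sequences, two alternate alignments).
estimated = {"a": "AC-G", "b": "ACTG"}
alternates = [{"a": "AC-G", "b": "ACTG"}, {"a": "A-CG", "b": "ACTG"}]
for pair, support in sorted(support_values(estimated, alternates).items()):
    print(pair, support)
```

Indexing residues by their position in the ungapped sequence, rather than by column, is what lets the same pair be recognized across alignments of different lengths.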

Datasets:

I have two datasets. The first is a simulated dataset covering 10 different model conditions, each with 100 taxa, an average sequence length of 1000, and 20 replicates. Since the dataset is generated through simulation, I know the true tree and the true alignment. The second is a biological dataset of RNA data from the Gutell Lab. While the true tree and alignment are unknown for the biological dataset, there are curated alignments for these data that are based on secondary structure and are believed to be very accurate. I have not yet finalized which biological dataset to analyze, as I am running into memory problems on the larger datasets.

Progress:

I have written several scripts to generate the estimated MSA and the set of alternate MSAs. My scripts also leverage code written by Landan and Graur to calculate support values of NT pairs. I have run my scripts on all of the 100-taxon datasets and have some preliminary results. Result1 shows the proportion of NT pairs that are TP for a given support value. Result2 shows the proportion of NT pairs that have a given support value. From the graphs, it's clear that there is a correlation between the support value and whether an NT pair is a TP or FP.


To be done:

I plan to generate contingency tables from the data and perform different analyses on the tables. Since my counts are extremely large (hundreds of thousands), I believe I can perform a Chi-square test for significance. Just from looking at the data, I can already tell that the results will be highly significant.
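
As a rough sketch of the test on such a table, the Pearson Chi-square statistic can be computed as below. The counts are made up for illustration and are not my results, and the function name chi_square is hypothetical. The sketch assumes every row and column total is nonzero; the p-value would then come from a Chi-square distribution with the returned degrees of freedom.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# Made-up counts: rows are TP / FP, columns are support bins (low, mid, high).
counts = [[1000, 4000, 20000],
          [6000, 3000, 1000]]
stat, dof = chi_square(counts)
print(stat, dof)
```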

More interesting, however, is the link between the mean support value of an entire site in an MSA and the error at that site. I plan to find the error at each site in an estimated MSA, measured as the total FP divided by the total NT at the site, and the mean support of the NT pairs at the site, and then see if there is a correlation between these numbers. If there is, this suggests that we may be able to improve phylogenetic reconstruction by removing sites with low mean support. I am currently writing scripts to implement this step and should be done this weekend.
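
A small sketch of the per-site calculation follows. The input layout (one list of (support, in_true_alignment) tuples per site) and the names site_summary and pearson are assumptions for illustration, and for simplicity the error here is the fraction of NT pairs at the site that are not in the true alignment, which is one possible reading of the error measure described above.

```python
from math import sqrt

def site_summary(pairs):
    """pairs: list of (support, in_true_alignment) for one alignment site.
    Returns (mean support, fraction of pairs that are false positives)."""
    mean_support = sum(s for s, _ in pairs) / len(pairs)
    error = sum(1 for _, ok in pairs if not ok) / len(pairs)
    return mean_support, error

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Made-up data for three sites.
sites = [
    [(1.0, True), (0.8, True)],
    [(0.5, True), (0.3, False)],
    [(0.2, False), (0.0, False)],
]
supports, errors = zip(*(site_summary(site) for site in sites))
print(pearson(supports, errors))
```

A strongly negative coefficient would indicate that sites with low mean support tend to carry more alignment error.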

Final Report:

For my final report, I will report on the correlation between support values and whether NT pairs are TP or FP, and on the correlation between the mean support value of a site and the mean error at the site. In addition, I plan to plot the distributions of the per-site mean support values and mean errors and examine whether they have any interesting properties. Can we model them in some fashion with a probability density function?


Citations:
[1] G. Landan and D. Graur. Heads or tails: A simple reliability check for multiple sequence alignments. Mol. Biol. Evol., 24(6):1380–1383, 2007.
[2] G. Landan and D. Graur. Local reliability measures from sets of co-optimal multiple sequence alignments. Pacific Symposium on Biocomputing, 13:15–24, 2008.
[3] K. Liu, S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324(19):1561–1564, 2009.
[4] K. Liu, T. Warnow, M. Holder, S. Nelesen, J. Yu, A. Stamatakis, and C. R. Linder. SATé-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Submitted to Proceedings of the National Academy of Sciences of the United States of America.
[5] G. Talavera and J. Castresana. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Systematic Biology, 56(4):564–577, 2007.

Last edited by namphuon; 04-19-2010 at 10:31 AM.
#5
04-21-2010, 11:11 AM
namphuon (Member)

In my current research, I've been masking sites in an MSA that have a low mean support value and then constructing an ML tree from the masked MSA. My current results suggest that this method does not improve tree reconstruction, which in turn suggests that there may not be much correlation between the errors at a site and its support value. I am currently writing scripts to calculate the number of errors at a site and the mean support value. Hopefully, this result will shed light on my current research.
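
For readers unfamiliar with masking, the step amounts to dropping alignment columns whose mean support falls below a cutoff. This is a minimal sketch, not my actual script; the function name mask_sites, the dictionary layout, and the 0.5 cutoff are all assumptions for illustration.

```python
def mask_sites(msa, site_support, threshold):
    """Keep only the columns whose mean support is at least threshold.

    msa maps a sequence name to its gapped row; site_support[i] is the
    mean support of column i.
    """
    keep = [i for i, s in enumerate(site_support) if s >= threshold]
    return {name: "".join(row[i] for i in keep) for name, row in msa.items()}

# Toy example with made-up support values.
msa = {"a": "ACGT", "b": "A-GT"}
masked = mask_sites(msa, [0.9, 0.2, 0.8, 0.4], threshold=0.5)
print(masked)
```

The masked alignment would then be passed to the tree estimation step in place of the full alignment.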
#6
05-02-2010, 08:05 PM
namphuon (Member)

I've made some changes to my project. The basic idea behind my project is that given a multiple sequence alignment (MSA) M, we want to determine which nucleotide (NT) pairs in M are likely to be in the true alignment. One way to estimate whether an NT pair is in the true alignment is to generate a set of alternate alignments M', and find the proportion of alignments in M' that also contain the NT pair. This gives a support value for the NT pair. Thus, if the NT pair shows up in half of M', it has a support value of 0.5. Once every NT pair is annotated with a support value, we can build a classifier for whether an NT pair is in the true alignment by classifying all NT pairs with support above a threshold as estimated positives (P), and all others as estimated negatives (N). Since the true alignment is known, we will also be able to determine whether an NT pair was a true positive (TP), true negative (TN), false positive (FP), or false negative (FN).
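
The thresholding step can be sketched as follows. The input layout (a list of (support, in_true_alignment) tuples) and the name confusion_counts are assumptions for illustration, as is the convention that a support value equal to the threshold counts as a positive.

```python
def confusion_counts(scored, threshold):
    """scored: list of (support, in_true_alignment) for each NT pair in M.
    Returns a dict with the TP, FP, TN, and FN counts at the threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for support, in_true in scored:
        predicted_positive = support >= threshold  # assumed convention
        if predicted_positive:
            counts["TP" if in_true else "FP"] += 1
        else:
            counts["FN" if in_true else "TN"] += 1
    return counts

# Made-up NT pairs.
scored = [(0.9, True), (0.7, False), (0.4, True), (0.1, False)]
print(confusion_counts(scored, 0.5))
```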

For different thresholds, we will have different classifications of the NT pairs. For example, if we set the threshold for classifying NT pairs as positives to be high, we will have low FP, but will also have low TP and high FN. Thus, we need some method for rating which classifier has better performance. I plan to use ROC curves to perform this analysis. ROC curves were discussed in 2008's Spring class in Unit 17. I will generate ROC curves for the two different methods of generating M' on 2 different datasets, a "hard" dataset (high alignment error) and an "easy" dataset (low alignment error). From the ROC curves, I will see if one method dominates the other across thresholds. I also plan to examine the area under the curve (AUC) as another classifier metric.
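
Here is a minimal sketch of how an ROC curve and its AUC can be computed from support values by sweeping the threshold from high to low. It is not the code in my project; the names roc_points and auc and the input layout are assumptions for illustration, and the sketch assumes the data contain at least one true and one false NT pair.

```python
def roc_points(scored):
    """scored: list of (support, in_true_alignment) for each NT pair.
    Returns (false positive rate, true positive rate) points, one per
    distinct support value, starting from (0, 0)."""
    pos = sum(1 for _, y in scored if y)
    neg = len(scored) - pos
    ordered = sorted(scored, key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        s = ordered[i][0]
        # Consume all pairs tied at this support value before emitting a point.
        while i < len(ordered) and ordered[i][0] == s:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up NT pairs.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
curve = roc_points(scored)
print(curve)
print(auc(curve))
```

An AUC of 1 would mean every true NT pair has higher support than every false one, while 0.5 corresponds to support values that carry no information.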

I originally planned to generate a table with support values as the columns and TP/FP as the rows, and then perform several statistical tests (permutation test, Chi-square test) to see if there was some significance to the support value. I changed my statistic of interest because the Chi-square test made it extremely clear that support value is highly correlated with being a TP, which was not very interesting. Examining which method yields better support values for classification is a much more interesting question. Luckily, most of the code I wrote for the first part of the project still applies. I only need to write code to take my data and generate ROC curves.

Last edited by namphuon; 05-02-2010 at 08:20 PM.
#7
05-05-2010, 10:33 AM
namphuon (Member)

As discussed in class, it's pointless to test for statistical significance between support value and the likelihood of being a positive example. We are building classifiers, so the association should be highly significant, and from my calculations, it is: the p-value under the null hypothesis (no correlation) underflows to 0 in Matlab. Thus, I changed my project to compare 2 different methods for classifying whether an NT pair is in the true alignment or not. Attached are the report and supplemental figures.

I wrote lots of scripts and code for this project, and a readme file explaining each file. This can be found at:

http://www.cs.utexas.edu/~namphuon/c...ect/README.txt
http://www.cs.utexas.edu/~namphuon/c...oject/code.zip

The project gave me some interesting ideas. Previously, I was using the mean support score to mask sites in an MSA. I would then reconstruct a phylogenetic tree on the masked MSA. However, I think a better method may be to estimate the errors at a site using the MMSA method to classify the correctness of the NT pairs at the site. I will see if this method can improve tree reconstruction.
Attached Images
File Type: pdf suppl.pdf (377.0 KB, 480 views)
File Type: pdf paper.pdf (179.2 KB, 657 views)

Last edited by namphuon; 05-05-2010 at 02:56 PM.
#8
05-07-2010, 12:38 PM
wpress (Professor)

Very interesting! Of course it's a little bit unfair to SATe, because in somebody else's term project on some other planet, SATe might have been one of the methods included within MMSA, while another method might have been the one tried by itself.

Combining different methods is a hot topic these days, e.g., Netflix prize, and also, e.g., Boosting. I wonder if there's some way to "boost" SATe, generating independent alignments weighted in some manner towards the ones it got wrong on the previous alignment. (Of course this presumes that there is a training data set!) See Hastie, Tibshirani, Friedman Chapter 10.