Abstract

As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein-coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds that of all other methods we compared in a previous study, and we provide a software implementation for use by the community. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues, and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE.

Highlights

  • High-throughput transcriptome sequencing is yielding precise structures for novel transcripts in many species, including mammals (Guttman et al, 2010)

  • Consistent with our previous work, we trained and applied PhyloCSF on this dataset using 4-fold cross-validation to ensure that any observed performance differences are not due to overfitting

  • We have introduced PhyloCSF, a comparative genomics method for distinguishing protein-coding and non-coding regions, and shown that it outperforms previous methods
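The 4-fold cross-validation mentioned in the highlights can be illustrated with a minimal sketch. It assumes a generic list of labeled regions and does not reproduce the paper's actual training procedure. The function name `k_fold_splits`, the shuffling, and the seed are hypothetical choices made for illustration only.

```python
import random


def k_fold_splits(items, k=4, seed=0):
    """Yield (train, test) pairs so that each item is held out exactly once.

    Hypothetical helper: the shuffling and seeding are illustrative
    assumptions, not details taken from the paper.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Toy usage: 12 placeholder region identifiers split into 4 folds.
for train, test in k_fold_splits(range(12), k=4):
    print(len(train), len(test))
```

Because parameters are fitted only on the training folds and scored on the held-out fold, a performance difference measured this way is less likely to reflect overfitting.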

Summary

Introduction

High-throughput transcriptome sequencing (mRNA-Seq) is yielding precise structures for novel transcripts in many species, including mammals (Guttman et al, 2010). Methods that classify genomic regions as protein-coding or non-coding are needed to interpret these novel transcript models, and they also have applications in evaluating and revising existing gene annotations (Butler et al, 2009; Clamp et al, 2007; Kellis et al, 2003; Lin et al, 2007; Pruitt et al, 2009), and as input features for de novo gene structure predictors (Alioto and Guigó, 2009; Brent, 2008). As discussed in our previous work, the CSF (Codon Substitution Frequencies) metric has certain drawbacks arising from its ad hoc scheme for combining evidence from multiple species: it makes only partial use of the evidence available in a multispecies alignment, and it produces a score that lacks a precise theoretical interpretation and is meaningful only relative to its empirical distributions in known coding and non-coding regions.
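A score built on a statistical comparison of two models, unlike an ad hoc score, can be read directly as a likelihood ratio. The sketch below is a simplified illustration, not the PhyloCSF implementation. It assumes the log-likelihoods of an alignment under a coding model and a non-coding model have already been computed elsewhere. The function names, the decibel-style scaling, and the zero threshold are all illustrative assumptions.

```python
import math


def log_likelihood_ratio(loglik_coding, loglik_noncoding):
    """Return 10 * log10 of the likelihood ratio (coding vs. non-coding).

    Inputs are natural-log likelihoods. The 10 * log10 scaling is an
    illustrative assumption; any fixed scaling preserves the ranking.
    """
    return 10.0 * (loglik_coding - loglik_noncoding) / math.log(10.0)


def classify(loglik_coding, loglik_noncoding, threshold=0.0):
    """Label a region by comparing the score with a hypothetical threshold."""
    score = log_likelihood_ratio(loglik_coding, loglik_noncoding)
    return "coding" if score > threshold else "non-coding"


# Toy usage with made-up log-likelihoods (not real data).
print(classify(-100.0, -110.0))
```

A positive score means the alignment is more probable under the coding model than under the non-coding model, so the score has an interpretation that does not depend on empirical score distributions.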

