Abstract

Background

EST sequencing is a versatile approach for rapidly gathering protein-coding sequences. ESTs provide direct access to an organism's gene repertoire, bypassing the still error-prone procedure of gene prediction from genomic data. Therefore, ESTs are often the only source of biological sequence data from taxa outside mainstream interest. The widespread use of ESTs in evolutionary studies, and particularly in molecular systematics studies, is still hindered by the lack of efficient and reliable approaches for automated ortholog prediction in ESTs. Existing methods either depend on a known species tree or cannot cope with redundancy in EST data.

Results

We present a novel approach (HaMStR) to mine EST data for the presence of orthologs to a curated set of genes. HaMStR combines a profile Hidden Markov Model search and a subsequent BLAST search to extend existing ortholog clusters with sequences from further taxa. We show that the HaMStR results are consistent with those obtained with existing orthology prediction methods that require completely sequenced genomes. A case study on the phylogeny of 35 fungal taxa illustrates that HaMStR is well suited to compile informative data sets for phylogenomic studies from ESTs and protein sequence data.

Conclusion

HaMStR extends, in a standardized manner, a pre-defined set of orthologs with ESTs from further taxa. In the same fashion, HaMStR can be applied to protein sequence data, and thus provides a comprehensive approach to compile ortholog clusters from any protein-coding data. The resulting orthology predictions serve as the data basis for a variety of evolutionary studies. Here, we have demonstrated the application of HaMStR in a molecular systematics study. However, we envision that studies tracing the evolutionary fate of individual genes or functional complexes of genes will greatly benefit from HaMStR orthology predictions as well.

Highlights

  • EST sequencing is a versatile approach for rapidly gathering protein coding sequences

  • Approaches to resolve the evolutionary relationships of eukaryotes on a molecular basis (frequently referred to as molecular systematics) benefit from these data

  • We show that the joint application of a profile Hidden Markov Model based similarity search and a subsequent re-BLAST of the hit sequences against a reference proteome identifies candidate orthologs
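The two-step idea in the last highlight can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HaMStR implementation: the acceptance rule (a candidate is kept only if its best re-BLAST hit in the reference proteome is the reference protein already in the ortholog cluster), the E-value cutoff, and all names and identifiers (`HmmHit`, `filter_by_reblast`, `est_001`, `ref_prot_17`) are hypothetical. The HMM and BLAST searches themselves are assumed to have been run beforehand; only their parsed results are filtered here.

```python
from typing import Dict, List, NamedTuple


class HmmHit(NamedTuple):
    """One candidate found by the profile HMM search (hypothetical record)."""
    est_id: str      # identifier of the candidate sequence
    cluster_id: str  # ortholog cluster whose profile HMM produced the hit
    evalue: float    # E-value reported for the HMM hit


def filter_by_reblast(
    hmm_hits: List[HmmHit],
    reference_member: Dict[str, str],
    best_blast_hit: Dict[str, str],
    max_evalue: float = 1e-5,  # assumed cutoff, not taken from the source
) -> Dict[str, List[str]]:
    """Keep candidates whose best re-BLAST hit matches their cluster.

    reference_member: cluster id -> reference-proteome protein in that cluster
    best_blast_hit:   candidate id -> its best hit in the reference proteome
    """
    accepted: Dict[str, List[str]] = {}
    for hit in hmm_hits:
        if hit.evalue > max_evalue:
            continue  # HMM hit too weak to consider
        expected = reference_member.get(hit.cluster_id)
        observed = best_blast_hit.get(hit.est_id)
        # Assumed reciprocity rule: the re-BLAST must point back to the
        # reference protein that belongs to the same ortholog cluster.
        if expected is not None and observed == expected:
            accepted.setdefault(hit.cluster_id, []).append(hit.est_id)
    return accepted


# Toy data, invented for illustration only.
hits = [
    HmmHit("est_001", "cluster_A", 1e-30),
    HmmHit("est_002", "cluster_A", 1e-12),
    HmmHit("est_003", "cluster_B", 1e-2),
]
reference_member = {"cluster_A": "ref_prot_17", "cluster_B": "ref_prot_42"}
best_blast_hit = {
    "est_001": "ref_prot_17",
    "est_002": "ref_prot_99",
    "est_003": "ref_prot_42",
}

candidates = filter_by_reblast(hits, reference_member, best_blast_hit)
print(candidates)
```

In this toy run only `est_001` survives: `est_002` points back to a different reference protein, and `est_003` fails the assumed E-value cutoff before the re-BLAST result is consulted.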

Summary

Introduction

EST sequencing is a versatile approach for rapidly gathering protein-coding sequences. The number of protein-coding DNA sequences in the public databases is steadily increasing. [...] than 140 genes [1,2,3,4,5,6]. Still, these studies consider only a small fraction of the data available. Despite the potential value of ESTs, especially for molecular systematics [7], these data have rarely been used in phylogenetic studies so far. This is because ESTs are redundant, short and of low sequence quality. Annotating ESTs and, more importantly, inferring their relationships to known genes in other taxa is still problematic.

BMC Evolutionary Biology 2009, 9:157 http://www.biomedcentral.com/1471-2148/9/157
