Abstract

Background: Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact on genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database.

Results: The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach, a combination of sequence searches, statistical estimation and clustering, analyses the degree of sequence divergence between sets of protein sequences and classifies protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples.

Conclusion: Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples.

Highlights

  • Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact on genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes

  • We describe a novel approach for the assignment of protein sequences into genomic groups, using a simulation on a synthetic dataset composed of an assortment of proteins lacking taxonomic classification, originally retrieved from five species belonging to different taxonomic classes, orders and domains

  • The individual rank-BLAST profiles of two proteins from Streptococcus pyogenes, one classified into the main cluster (A) and one not classified into any genomic group (B), versus the mean rank of species in the main cluster
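
The idea of comparing an individual protein's rank profile with the mean rank of a cluster can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not state how scores are obtained or how ties are handled, so the use of per-genome similarity scores (for example BLAST bit scores), the average-rank treatment of ties, and the names `rank_profile` and `mean_rank_profile` are all assumptions made for illustration.

```python
from statistics import mean


def rank_profile(scores):
    """Convert {genome: similarity score} into {genome: rank}.

    Assumption: rank 1 is the best-scoring reference genome, and tied
    scores share the average of the ranks they span.
    """
    ordered = sorted(scores, key=lambda g: -scores[g])
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j over the run of genomes tied with position i.
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for genome in ordered[i:j + 1]:
            ranks[genome] = shared
        i = j + 1
    return ranks


def mean_rank_profile(profiles):
    """Average several rank profiles genome by genome.

    Assumption: every profile covers the same set of reference genomes.
    """
    genomes = list(profiles[0])
    return {g: mean(p[g] for p in profiles) for g in genomes}


# Hypothetical scores of two proteins against four reference genomes.
protein_1 = rank_profile({"A": 300.0, "B": 120.0, "C": 120.0, "D": 0.0})
protein_2 = rank_profile({"A": 250.0, "B": 90.0, "C": 140.0, "D": 10.0})
cluster_mean = mean_rank_profile([protein_1, protein_2])
```

Working with ranks rather than raw scores makes profiles of proteins with very different conservation levels comparable, which is presumably why a protein's profile is plotted against the cluster's mean rank; the actual statistical estimation and clustering steps are not described in this summary.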



Introduction

Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact on genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. Several approaches have been proposed to detect genomic signatures on the basis of nucleotide composition [3,4,5]. These approaches enable, with varying degrees of accuracy, the species classification of genes according to their compositional signatures, and their association with phylogenetic or environmental factors [6,7,8]. These methods have been applied to environmental sequencing samples in order to detect the origins of the sequence fragments they contain [9,10].
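
As a rough illustration of what a compositional signature can look like, the sketch below computes normalised k-mer frequencies for a nucleotide sequence. It is a generic example rather than the method of any cited work: the function name `kmer_signature`, the default `k=4`, and the choice to ignore windows containing characters other than A, C, G and T are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_signature(seq, k=4):
    """Return {k-mer: relative frequency} over all 4**k DNA k-mers.

    Assumption: windows containing anything other than A, C, G, T
    (for example N) are ignored rather than counted.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero signature.
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}


# Toy example with dinucleotides on a short hypothetical fragment.
signature = kmer_signature("ACGTACGT", k=2)
```

Signatures of this kind can be compared between a fragment and reference genomes with any vector distance; which features and distances the cited approaches use is not stated here.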
