INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences

Monzoorul Haque Mohammed,Rachamalla Maheedhar Reddy,Chennareddy Venkata Siva Kumar Reddy,Tarini Shankar Ghosh,Nitin Kumar Singh,Sharmila S Mande

doi:10.1186/1471-2164-12-s3-s4

Journal: BMC Genomics
Publication Date: Nov 30, 2011
Citations: 23
License type: cc-by

Affiliation: Tata Consultancy Services (India)

Abstract

Background: Taxonomic classification of metagenomic sequences is the first step in metagenomic analysis. Existing taxonomic classification approaches are of two types, similarity-based and composition-based. Similarity-based approaches, though accurate and specific, are extremely slow. Since metagenomic projects generate millions of sequences, adopting similarity-based approaches becomes virtually infeasible for research groups with modest computational resources. In this study, we present INDUS - a composition-based approach that incorporates the following novel features. First, INDUS discards the 'one genome-one composition' model adopted by existing compositional approaches. Second, INDUS uses 'compositional distance' information for identifying appropriate assignment levels. Third, INDUS incorporates steps that attempt to reduce biases due to database representation.

Results: INDUS is able to rapidly classify sequences in both simulated and real metagenomic sequence data sets with classification efficiency significantly higher than that of existing composition-based approaches. Although the classification efficiency of INDUS is observed to be comparable to that of similarity-based approaches, the binning time (as compared to alignment-based approaches) is 23-33 times lower.

Conclusion: Given its rapid execution time and high levels of classification efficiency, INDUS is expected to be of immense interest to researchers working in metagenomics and microbial ecology.

Availability: A web-server for the INDUS algorithm is available at http://metagenomics.atc.tcs.com/INDUS/

Highlights

Taxonomic classification of metagenomic sequences is the first step in metagenomic analysis
The extent of similarity between metagenomic sequences and reference database sequences is inferred from the BLAST output
Query sequences are assigned to an organism/clade based on the pattern and quality of the generated BLAST hits

Summary

Introduction

Taxonomic classification of metagenomic sequences is the first step in metagenomic analysis. Various approaches are available for obtaining the taxonomic affiliation of DNA sequences constituting a metagenomic sequence data set. These approaches can be broadly divided into two types, namely, similarity-based and composition-based. ‘Similarity-based’ approaches classify metagenomic sequences by comparing them with known sequences present in a reference database [2,3,4,5]. These comparisons are usually done using the BLAST algorithm [6]. Given the limited sequence information available in existing reference databases, the majority of sequences in metagenomic data sets fail to obtain BLAST hits and are categorized as ‘unassigned’. Similarity-based approaches also need enormous amounts of time and computing resources to generate alignments of millions of metagenomic sequences with existing reference database sequences. Composition-based approaches score query sequences against pre-computed genome-specific models and assign them to an organism/clade based on the pattern of scores obtained. Since composition-based methods do not involve alignment of query sequences with reference database sequences, they are quicker than similarity-based methods.
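To make the idea of scoring a query against pre-computed compositional models concrete, the following is a minimal illustrative sketch and not the INDUS algorithm itself. It assumes tetranucleotide (k = 4) frequency vectors as the compositional model, Euclidean distance as the score, and a nearest-model assignment rule; the function names (`kmer_profile`, `distance`, `classify`) and the toy reference sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
import math

K = 4  # assumed oligonucleotide length; not taken from the paper
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 256 tetranucleotides


def kmer_profile(seq, k=K):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of A, C, G, T are counted; ambiguous bases are ignored.
    total = sum(counts[m] for m in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[m] / total for m in KMERS]


def distance(p, q):
    """Euclidean distance between two frequency vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify(query, models):
    """Assign the query to the reference model with the smallest distance.

    `models` maps a reference name to its pre-computed k-mer profile.
    Returns (best_name, best_distance).
    """
    query_profile = kmer_profile(query)
    best_distance, best_name = min(
        (distance(query_profile, profile), name)
        for name, profile in models.items()
    )
    return best_name, best_distance


if __name__ == "__main__":
    # Toy reference "genomes", invented for this example only.
    models = {
        "genome_A": kmer_profile("AT" * 10),
        "genome_B": kmer_profile("GC" * 10),
    }
    name, dist = classify("ATATATATAT", models)
    print(name, round(dist, 4))
```

Because the reference profiles are computed once and each query needs only a vector comparison per model, no alignment is performed, which is the source of the speed advantage described above. A real classifier would additionally need a rule for deciding how far up the taxonomy to assign a query when no model is close; that step is not shown here.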
