Census-based rapid and accurate metagenome taxonomic profiling.

Amirhossein Shamsaddini, Amy Zanne, Yang Pan, W Evan Johnson, Raja Mazumder, Mariya Shcheglovitova, Vahan Simonyan, Konstantinos Krampis

doi:10.1186/1471-2164-15-918

Abstract

Background: Understanding the taxonomic composition of a sample, whether from patient, food or environment, is important to several types of studies including pathogen diagnostics, epidemiological studies, biodiversity analysis and food quality regulation. With the decreasing costs of sequencing, metagenomic data is quickly becoming the preferred type of data for such analysis.

Results: Rapidly defining the taxonomic composition (both taxonomic profile and relative frequency) of a metagenomic sequence dataset is challenging because mapping millions of sequence reads from a metagenomic study to a non-redundant nucleotide database such as the NCBI non-redundant nucleotide database (nt) is computationally intensive. We have developed a robust subsampling-based algorithm, implemented in a tool called CensuScope, meant to take a ‘sneak peek’ into the population distribution and estimate taxonomic composition as if a census was taken of the metagenomic landscape. CensuScope is a rapid and accurate metagenome taxonomic profiling tool that randomly extracts a small number of reads (based on user input) and maps them to NCBI’s nt database. This process is repeated multiple times to ascertain the taxonomic composition that is found in the majority of the iterations, thereby providing a robust estimate of the population and measures of the accuracy of the results.

Conclusion: CensuScope can be run on a laptop or on a high-performance computer. Based on our analysis we are able to provide some recommendations on the number of sequence reads to analyze and the number of iterations to use. For example, to quantify taxonomic groups present in the sample at a level of 1% or higher, a subsampling size of 250 random reads with 50 iterations yields a statistical power of >99%. Windows and UNIX versions of CensuScope are available for download at https://hive.biochemistry.gwu.edu/dna.cgi?cmd=censuscope. CensuScope is also available through the High-performance Integrated Virtual Environment (HIVE) and can be used in conjunction with other HIVE analysis and visualization tools.

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-918) contains supplementary material, which is available to authorized users.
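The subsampling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the CensuScope implementation: the alignment of reads against nt is replaced here by a pre-assigned taxon label per read (an assumption made so the example is self-contained), and the names `census_profile`, `majority_taxa`, and the toy read set are hypothetical. The defaults of 250 reads and 50 iterations are taken from the recommendation above.

```python
import random
from collections import Counter


def census_profile(read_taxa, sample_size=250, iterations=50, seed=0):
    """Estimate taxon frequencies by repeated random subsampling.

    read_taxa: one taxon label per read. ASSUMPTION: in a real run the
    label would come from aligning each sampled read to a reference
    database; here it is simply looked up.
    Returns (mean_fraction, detection_rate), both dicts keyed by taxon.
    """
    rng = random.Random(seed)
    totals = Counter()   # sum of per-iteration fractions for each taxon
    seen_in = Counter()  # number of iterations in which each taxon appeared
    for _ in range(iterations):
        # Draw reads without replacement for this iteration.
        subsample = rng.sample(read_taxa, sample_size)
        for taxon, n in Counter(subsample).items():
            totals[taxon] += n / sample_size
            seen_in[taxon] += 1
    mean_fraction = {t: totals[t] / iterations for t in totals}
    detection_rate = {t: seen_in[t] / iterations for t in seen_in}
    return mean_fraction, detection_rate


def majority_taxa(detection_rate):
    """Taxa observed in more than half of the iterations."""
    return sorted(t for t, rate in detection_rate.items() if rate > 0.5)


# Hypothetical toy sample: 10,000 reads from three made-up taxa.
reads = ["taxon_A"] * 7000 + ["taxon_B"] * 2900 + ["taxon_C"] * 100
mean_fraction, detection_rate = census_profile(reads)
print(majority_taxa(detection_rate))
print({t: round(f, 3) for t, f in sorted(mean_fraction.items())})
```

Averaging the per-iteration fractions gives the relative-frequency estimate, while the share of iterations in which a taxon appears is one simple way to express how consistently it is detected; the paper's own accuracy measures may be defined differently.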

Highlights

Understanding the taxonomic composition of a sample, whether from patient, food or environment, is important to several types of studies including pathogen diagnostics, epidemiological studies, biodiversity analysis and food quality regulation
This additional step can be used with any metagenomic sample where there is an abundance of reads from one specific organism and if the genome for that organism is available
The three key elements required in the analysis of metagenomic samples to obtain the taxonomic composition are: a) a comprehensive non-redundant reference sequence database coupled with b) a generally accepted taxonomy of known organisms, and c) the alignment software for sequence comparison

Summary

Introduction

Understanding the taxonomic composition of a sample, whether from patient, food or environment, is important to several types of studies including pathogen diagnostics, epidemiological studies, biodiversity analysis and food quality regulation. 392 ongoing metagenomic studies that involve sequencing of 3028 samples [1]. Disease-related studies that involve microbiome analysis using metagenomics include obesity [11,12,13], Crohn’s disease [14,15], type 2 diabetes [5] and many others [16]. Such studies have vastly extended the sequences currently available in databases and will likely lead to the discovery of new genes that have useful applications in biotechnology and medicine [17,18].

Methods

Results

Conclusion