Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations.

Aaron Y Lee, Cecilia S Lee, Russell N Van Gelder

doi:10.1186/s12859-016-1159-6

Open Access

Journal: BMC Bioinformatics	Publication Date: Jul 28, 2016
Citations: 26	License type: CC BY 4.0

Affiliation: University of Washington, Seattle University

Abstract

Background: Next-generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, have led to the challenge of identifying microbial organisms by DNA sequence. We sought to validate a Scalable Metagenomics Alignment Research Tool (SMART), a novel searching heuristic for shotgun metagenomics sequencing results.

Results: After retrieving all genomic DNA sequences from the NCBI GenBank, over 1 × 10¹¹ base pairs of 3.3 × 10⁶ sequences from 9.25 × 10⁵ species were indexed using 4 base pair hashtable shards. A MapReduce searching strategy was used to distribute the search workload in a computing cluster environment. In addition, a one base pair permutation algorithm was used to account for single nucleotide polymorphisms and sequencing errors. Simulated datasets used to evaluate Kraken, a similar metagenomics classification tool, were used to measure and compare precision and accuracy. Finally, using the same set of training sequences, we compared Kraken, CLARK, and SMART within the same computing environment. Utilizing 12 computational nodes, we completed the classification of all datasets in under 10 min each using exact matching, with an average throughput of over 1.95 × 10⁶ reads classified per minute. With permutation matching, we achieved sensitivity greater than 83% and precision greater than 94% with simulated datasets at the species classification level. We demonstrated the application of this technique to conjunctival and gut microbiome metagenomics sequencing results. In our head-to-head comparison, SMART and CLARK had similar accuracy gains over Kraken at the species classification level, but SMART required approximately half the amount of RAM of CLARK.

Conclusions: SMART is the first scalable, efficient, and rapid metagenomics classification algorithm capable of matching against all the species and sequences present in the NCBI GenBank, and it allows for a single-step classification of microorganisms as well as large plant, mammalian, or invertebrate genomes from which the metagenomic sample may have been derived.
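
The indexing and permutation-matching ideas described in the abstract can be illustrated with a small sketch. This is not the SMART implementation: the abstract does not state the k-mer length, what the 4 base pair shards are keyed on, or how hits are stored, so the code below assumes that shards are keyed on the first four bases of each k-mer and that each k-mer maps to a set of taxon labels. The names `build_index`, `one_base_variants`, and `lookup`, the toy sequences, and `k=8` are all hypothetical.

```python
from collections import defaultdict

BASES = "ACGT"
SHARD_PREFIX = 4  # the abstract's "4 base pair hashtable shards" (keying on the prefix is an assumption)


def build_index(references, k):
    """Return {shard prefix: {k-mer: set of taxon labels}} for the given sequences."""
    shards = defaultdict(lambda: defaultdict(set))
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            shards[kmer[:SHARD_PREFIX]][kmer].add(taxon)
    return shards


def one_base_variants(kmer):
    """Yield the k-mer itself, then every single-base substitution of it."""
    yield kmer
    for i, original in enumerate(kmer):
        for base in BASES:
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]


def lookup(shards, kmer, permute=False):
    """Collect taxa for an exact match, or for any one-base variant if permute is True."""
    candidates = one_base_variants(kmer) if permute else (kmer,)
    hits = set()
    for cand in candidates:
        # A substitution in the first four bases moves the candidate to another shard.
        shard = shards.get(cand[:SHARD_PREFIX])
        if shard is not None:
            hits |= shard.get(cand, set())
    return hits


# Toy data, purely illustrative.
refs = {"taxon_A": "ACGTACGTTGCA", "taxon_B": "TTGCAACGGTCA"}
index = build_index(refs, k=8)
exact = lookup(index, "ACGTACGA")                # read k-mer with one mismatch
fuzzy = lookup(index, "ACGTACGA", permute=True)  # same k-mer, one-base permutation allowed
```

In this sketch the exact lookup finds nothing for the mismatched k-mer, while the permuted lookup recovers `taxon_A`. Because each candidate is routed by its own four-base prefix, lookups for different shards are independent of one another, which is the property a MapReduce-style distribution across nodes would rely on.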

Highlights

Next-generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing
The library of sequenced DNA fragments mapped to an identified taxonomy species has been growing in parallel; the latest release of NCBI GenBank (v209) has catalogued 1.99 × 10¹¹ base pairs of cDNA and genomic DNA from 1.87 × 10⁸ records [3]
After transferring all genomic DNA reads from the latest release of the NCBI GenBank, a total of over 1 × 10¹¹ bp of 3.34 × 10⁹ sequences from 9.26 × 10⁵ species of 1.49 × 10³ classes were indexed

Summary

Introduction

Next-generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. Other sequence alignment software has been adapted to next-generation sequencing output, such as Bowtie2 [6], Burrows-Wheeler Aligner [7], and Short Oligonucleotide Analysis Package [8]. These alignment tools work well for the precise alignment of a large number of next-generation sequencing reads against single organism genomes but scale poorly when attempting to align reads against all known DNA sequences. MEGAN and MetaPhyler have been developed to work with BLAST for metagenomic sequencing classification [9, 10]. Even though these probabilistic approaches have high accuracy [11, 12], they remain limited by the computationally expensive nature of BLAST. Empirical approaches have used machine learning algorithms with both supervised [13, 14, 15, 16] and unsupervised methods [17, 18, 19].

Results

Discussion

Conclusion