Consistency of metagenomic assignment programs in simulated and real data

Koldo Garcia-Etxebarria,Marc Garcia-Garcerà,Francesc Calafell

doi:10.1186/1471-2105-15-90

Koldo Garcia-Etxebarria, Marc Garcia-Garcerà + Show 1 more

Open Access

https://doi.org/10.1186/1471-2105-15-90

Copy DOI

Abstract

BackgroundMetagenomics is the genomic study of uncultured environmental samples, which has been greatly facilitated by the advent of shotgun-sequencing technologies. One of the main focuses of metagenomics is the discovery of previously uncultured microorganisms, which makes the assignment of sequences to a particular taxon a challenge and a crucial step. Recently, several methods have been developed to perform this task, based on different methodologies such as sequence composition or sequence similarity. The sequence composition methods have the ability to completely assign the whole dataset. However, their use in metagenomics and the study of their performance with real data is limited. In this work, we assess the consistency of three different methods (BLAST + Lowest Common Ancestor, Phymm, and Naïve Bayesian Classifier) in assigning real and simulated sequence reads.ResultsBoth in real and in simulated data, BLAST + Lowest Common Ancestor (BLAST + LCA), Phymm, and Naïve Bayesian Classifier consistently assign a larger number of reads in higher taxonomic levels than in lower levels. However, discrepancies increase at lower taxonomic levels. In simulated data, consistent assignments between all three methods showed greater precision than assignments based on Phymm or Bayesian Classifier alone, since the BLAST + LCA algorithm performed best. In addition, assignment consistency in real data increased with sequence read length, in agreement with previously published simulation results.ConclusionsThe use and combination of different approaches is advisable to assign metagenomic reads. Although the sensitivity could be reduced, the reliability can be increased by using the reads consistently assigned to the same taxa by, at least, two methods, and by training the programs using all available information.

Highlights

Metagenomics is the genomic study of uncultured environmental samples, which has been greatly facilitated by the advent of shotgun-sequencing technologies
A crucial step in metagenomics is sequence assignment, both taxonomic and functional: sequence reads need to be allocated to a genomic unit, or, at least, to a particular taxonomic level, and, in a further step, they may be mapped to a gene or set of genes with known functions
In our simulated data, using the BLAST + LCA (Lowest Common Ancestor) method implemented in FCP, between 0% and 3.65% of reads were not assigned at a domain level

Summary

Introduction

Metagenomics is the genomic study of uncultured environmental samples, which has been greatly facilitated by the advent of shotgun-sequencing technologies. Several methods have been developed to perform this task, based on different methodologies such as sequence composition or sequence similarity. The sequence composition methods have the ability to completely assign the whole dataset. Their use in metagenomics and the study of their performance with real data is limited. We assess the consistency of three different methods (BLAST + Lowest Common Ancestor, Phymm, and Naïve Bayesian Classifier) in assigning real and simulated sequence reads. In the case of composition-based programs, these methods can classify all the reads [2], and report the associated likelihood of the read to be assigned to the different categories. There are few metagenomic studies where these methods were used, e.g. [9,10,11]

Objectives

Methods

Results

Conclusion