MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads.

Thomas Nordahl Petersen, Ole Lund, Martin Christen Frølund Thomsen, Thomas Sicheritz-Pontén, Oksana Lukjancenko, Maria Maddalena Sperotto, Frank Møller Aarestrup, Lingling An

doi:10.1371/journal.pone.0176469

Open Access

Journal: PLOS ONE	Publication Date: May 3, 2017
Citations: 61	License type: CC BY 4.0

Affiliation: Technical University of Denmark

Abstract

An increasing number of species and gene identification studies rely on next-generation sequence analysis of either single-isolate or metagenomics samples. Several methods are available to perform taxonomic annotation, and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package that processes raw next-generation sequence data and performs reference-based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species- and strain-level resolution. An in-vitro bacterial mock community sample comprising 8 genera, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision of 75% were obtained for strain-level annotations. A comparison between MGmapper and Kraken at species level shows that MGmapper assigns taxonomy using 84.8% of the sequence reads, compared to 70.5% for Kraken, and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and Excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible for the command-line version of MGmapper, and the complete pipeline is freely available as a Bitbucket package (https://bitbucket.org/genomicepidemiology/mgmapper). A web version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets.

Highlights

Excel sheets are provided as supplementary information in S1–S10 Files, at strain and species level, both for annotations that pass the post-processing criteria and for those that are rejected
The intention behind the development of the MGmapper pipeline is to simplify the processing of next-generation sequences from biological samples, and to give users easy access to an NGS analysis without necessarily understanding all the computational details of the process
In its present form MGmapper follows a mapping protocol against reference sequence databases, and provides BAM files and text and Excel summary files. These contain read-count statistics for the reference sequences that passed the post-processing procedure, and for the annotated reference sequences that did not meet the post-processing criteria, enabling a user to see discarded mapping results and possibly redo the post-processing if other threshold settings are preferred

Summary

Introduction

The task of assigning each nucleotide read to the genome it represents is challenging, and false positive predictions are always an issue for alignment-based methods in which a query sequence is mapped against a large database of target sequences. For S_Abundance, a threshold of 0.01 was the best cut-off on the benchmarking data used by Peabody et al. One drawback of using a size-normalized abundance as the criterion for true positive annotations is that, for small reference sequences, only a few assigned reads are needed to pass the cut-off. MGmapper therefore uses several measures rather than a single read count abundance threshold: the normalized read count abundance, a low-read-count value, the number of uniquely mapped reads and the edit distance, with the aim of reducing the number of false positive taxonomy annotations from next-generation sequence data. In total the bacteria database is composed of 7451 genomic sequences (created: Feb 23, 2016), where entries with the word ‘plasmid’ in the fasta header were compiled into a separate plasmid database composed of 4429 genomic sequences.
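The header-based split into a plasmid database can be illustrated with a minimal Python sketch. It is not the MGmapper implementation: the function names `parse_fasta` and `split_plasmids`, the example records, and the case-insensitive match are all assumptions made for illustration; the source only states that entries with the word ‘plasmid’ in the fasta header were separated out.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_plasmids(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records whose header contains the word 'plasmid'."""
    chromosomal: List[Record] = []
    plasmid: List[Record] = []
    for header, seq in records:
        # Assumption: case-insensitive substring match on the header text.
        target = plasmid if "plasmid" in header.lower() else chromosomal
        target.append((header, seq))
    return chromosomal, plasmid


# Hypothetical toy input, not taken from the MGmapper databases.
example = """>seq1 Escherichia coli chromosome
ACGT
ACGT
>seq2 Escherichia coli plasmid pX
GGCC
""".splitlines()

chromosomal, plasmid = split_plasmids(parse_fasta(example))
print(len(chromosomal), len(plasmid))
```

A plain substring match would also catch headers that mention a plasmid only in passing, so a real database build might need a stricter rule.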

Methods

Minimum ReadCount of 10
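
To show why a minimum read count complements the S_Abundance cut-off of 0.01 mentioned in the Introduction, the following Python sketch applies both criteria to a few hypothetical hits. The exact definition of S_Abundance is not given in this excerpt; the sketch assumes it is the read count per reference base expressed as a percentage, which may differ from the paper's definition. The `Hit` class, the function names and the example numbers are illustrative only, and the uniquely mapped read and edit-distance criteria are left out because their thresholds are not stated here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    ref_length: int  # reference sequence length in bases
    read_count: int  # reads assigned to this reference


def size_normalized_abundance(hit: Hit) -> float:
    # Assumed definition: reads per reference base, as a percentage.
    return 100.0 * hit.read_count / hit.ref_length


def passes(hit: Hit, min_abundance: float = 0.01, min_reads: int = 10) -> bool:
    """Accept a hit only if it meets both the abundance and read-count cut-offs."""
    return (
        size_normalized_abundance(hit) >= min_abundance
        and hit.read_count >= min_reads
    )


# Hypothetical hits, not taken from the paper's results.
hits = [
    Hit("large_genome", 5_000_000, 600),   # abundance 0.012, enough reads
    Hit("small_plasmid", 5_000, 2),        # abundance 0.04, but only 2 reads
    Hit("sparse_genome", 5_000_000, 50),   # abundance 0.001, below the cut-off
]

accepted = [h.name for h in hits if passes(h)]
rejected = [h.name for h in hits if not passes(h)]
print("accepted:", accepted)
print("rejected:", rejected)
```

Under the assumed formula, the small reference passes the abundance cut-off with only two reads, which is the drawback described in the Introduction; the read-count floor is what rejects it.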

Results

Discussion