Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps

Alexander T Dilthey,Chirag Jain,Sergey Koren,Adam M Phillippy

doi:10.1038/s41467-019-10934-2

Abstract

Metagenomic sequence classification should be fast, accurate and information-rich. Emerging long-read sequencing technologies promise to improve the balance between these factors but most existing methods were designed for short reads. MetaMaps is a new method, specifically developed for long reads, capable of mapping a long-read metagenome to a comprehensive RefSeq database with >12,000 genomes in <16 GB of RAM on a laptop computer. Integrating approximate mapping with probabilistic scoring and EM-based estimation of sample composition, MetaMaps achieves >94% accuracy for species-level read assignment and r² > 0.97 for the estimation of sample composition on both simulated and real data when the sample genomes or close relatives are present in the classification database. To address novel species and genera, which are comparatively harder to predict, MetaMaps outputs mapping locations and qualities for all classified reads, enabling functional studies (e.g. gene presence/absence) and detection of incongruities between sample and reference genomes.
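The EM-based estimation of sample composition mentioned above can be illustrated with a generic mixture-proportion sketch. This is not the MetaMaps implementation: the function name `estimate_composition`, the input layout (one row of per-genome likelihoods per read) and the stopping parameters are assumptions made for illustration only. The E-step distributes each read across candidate genomes in proportion to the current abundance estimate times the read's likelihood; the M-step re-estimates abundances from those fractional assignments.

```python
def estimate_composition(likelihoods, n_iter=100, tol=1e-9):
    """Generic EM for mixture proportions (illustrative, not MetaMaps code).

    likelihoods[i][j] is a hypothetical likelihood of read i given genome j,
    e.g. derived from a mapping score. Returns estimated genome proportions.
    """
    n_genomes = len(likelihoods[0])
    # Start from a uniform composition.
    pi = [1.0 / n_genomes] * n_genomes
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        for row in likelihoods:
            # E-step: posterior probability that this read came from genome j.
            weighted = [p * l for p, l in zip(pi, row)]
            total = sum(weighted)
            if total == 0.0:
                continue  # read has no support under any genome; skip it
            for j, w in enumerate(weighted):
                counts[j] += w / total
        used = sum(counts)
        if used == 0.0:
            raise ValueError("no read has non-zero likelihood for any genome")
        # M-step: new proportions are the normalized expected read counts.
        new_pi = [c / used for c in counts]
        delta = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if delta < tol:
            break
    return pi


if __name__ == "__main__":
    # Three reads map unambiguously, one is equally compatible with both genomes.
    reads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(estimate_composition(reads))
```

In this toy setting the ambiguous read is gradually attributed mostly to the genome that the unambiguous reads already favor, which is the behavior that makes EM useful when many reads map to several closely related references.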

Highlights

Metagenomic sequence classification should be fast, accurate and information-rich
There are approaches based on linear models or linear mixed models, for example PhyloPythia[16,17], DiTASiC[18], and MetaPalette[19]; methods based on structured output support vector machines, for example PhyloPythia+[20]; methods that combine Markov models with k-mers/alignment, for example Phymm/PhymmBL[21,22]; and methods that directly employ the Burrows-Wheeler transform[23], for example Centrifuge[24]
The large majority of these methods have been designed for the analysis of short-read data and only a small number of long-read-specific methods have been developed: Frank et al.[25] describe a method developed for Pacific Biosciences CCS data, and MEGAN-LR[26] aligns long reads to protein databases and carries out a lowest-common-ancestor-based analysis

Summary

Introduction

Metagenomic sequence classification should be fast, accurate and information-rich. Emerging long-read sequencing technologies promise to improve the balance between these factors but most existing methods were designed for short reads. Alignment-based methods operate on complete genomes or on signature or marker genes; this category includes tools like Megan[7,8], MetaPhlan[9], GASiC[10], and MG-RAST[11]. Some long-read sequencers (the Oxford Nanopore MinION in particular) support rapid, portable and robust sequencing workflows, enabling “in-field” metagenomics. This is expanding the range of applications and scenarios in which DNA sequencing and metagenomics can be used, such as the in-situ characterization of soil metagenomes in remote locations[27] or real-time pathogen sequencing during outbreaks[28]. In the space of long-read metagenomics, desirable algorithms are both fast (to deal with large volumes of incoming sequencing data on acceptable time scales, e.g., in the field) and produce highly informative output that includes per-read positional and quality information (because the availability of long-range spatial information is one of the key advantages of long-read sequencing).

Methods

Results

Conclusion