Abstract

Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarity to annotated genes to confidently predict gene function or homology. Recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also propose new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested using in silico spike-in, clinical and environmental NGS datasets, produced significantly better contigs than current approaches.

Highlights

  • With rapidly declining cost, next-generation sequencing (NGS) approaches have become common for comprehensive pathogen identification in clinical and environmental samples

  • In our previous report [1], we found empirically that a sequential de Bruijn graph (DBG) and overlap-layout-consensus (OLC) method that incorporates partitioning was more efficient at contig assembly of viral genomes from metagenomic NGS data

  • In silico-generated Bas-Congo virus (BASV) sequences were computationally spiked at various read lengths and depths of coverage (Table 1) into a complex in silico metagenomic background consisting of 10 million human reads, 2.5 million bacterial reads and 0.5 million viral reads, generating sets A through J
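
To make the relationship between read length, depth of coverage and read count in a spike-in concrete, here is a minimal Python sketch. It assumes the usual definition of depth as (number of reads × read length) / genome length and draws error-free reads at uniformly random positions. The function names `reads_for_coverage` and `simulate_reads` are hypothetical, and the source does not state how the BASV reads were actually generated, so this is an illustration rather than the authors' procedure.

```python
import math
import random
from typing import List


def reads_for_coverage(genome_length: int, read_length: int, depth: float) -> int:
    """Smallest read count N with N * read_length / genome_length >= depth."""
    return math.ceil(depth * genome_length / read_length)


def simulate_reads(genome: str, read_length: int, depth: float, seed: int = 0) -> List[str]:
    """Draw error-free reads at uniformly random start positions (an assumption)."""
    if read_length > len(genome):
        raise ValueError("read_length must not exceed the genome length")
    rng = random.Random(seed)  # fixed seed keeps the toy spike-in reproducible
    n_reads = reads_for_coverage(len(genome), read_length, depth)
    last_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive at both ends
        reads.append(genome[start:start + read_length])
    return reads
```

Under these assumptions, a 10,000-base genome at 10x depth with 100-base reads needs `reads_for_coverage(10000, 100, 10)` reads, and the simulated reads would then be mixed into the background read set.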


Summary

Introduction

Next-generation sequencing (NGS) approaches have become common for comprehensive pathogen identification in clinical and environmental samples. One school of assemblers, such as AMOS [8], CAP3 [9], Celera [10], VCAKE [11] and Newbler [12], uses traditional overlap-layout-consensus (OLC) algorithms, which identify overlaps between long reads and subsequently merge the read fragments into longer sequences. This approach requires pairwise evaluation of a large number of reads, which is computationally intensive.

[Table residue: metagenomic backgrounds of the data sets. Ten entries of 10M human + 2.5M bacterial + 0.5M viral reads; one 3.8M nasopharyngeal swab sample; one 9.6M stool background containing a norovirus.]
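
The pairwise overlap evaluation that OLC-style assembly relies on can be illustrated with a toy Python sketch. It finds the longest exact suffix-prefix overlap for every ordered pair of reads and greedily merges the best pair until no overlap of at least `min_overlap` bases remains. The names `overlap_length` and `greedy_assemble` and the `min_overlap` parameter are hypothetical. Real OLC assemblers such as those cited above tolerate sequencing errors and build an explicit layout and consensus, so this shows the idea and its cost, not any of those tools.

```python
from itertools import permutations
from typing import List


def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if k > 0 and a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_k, best_a, best_b = 0, None, None
        # Every ordered pair is compared: n * (n - 1) checks per round,
        # which is why this approach becomes expensive for millions of reads.
        for a, b in permutations(contigs, 2):
            k = overlap_length(a, b, min_overlap)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            return contigs  # no remaining pair overlaps enough to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_k:])


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
    # prints ['ATGGCGTACCTGA']
```

The quadratic number of comparisons in each round mirrors the computational burden described above for pairwise read evaluation.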
