Abstract

Fast and accurate classification tools for identifying the taxonomy of noisy long reads are still lacking, which is a bottleneck for the use of promising long-read metagenomic sequencing technologies. Herein, we propose de Bruijn graph-based Sparse Approximate Match Block Analyzer (deSAMBA), a tailored long-read classification approach that uses a novel pseudo alignment algorithm based on sparse approximate match blocks (SAMBs). Benchmarks on real sequencing datasets demonstrate that deSAMBA achieves high yields and fast speed simultaneously, outperforming state-of-the-art tools, and has great potential for cutting-edge metagenomics studies.

Highlights

  • Metagenomic sequencing is ubiquitously applied to comprehensively study environmental samples (Methé et al, 2012; Gilbert et al, 2014; Cheng et al, 2020)

  • We present de Bruijn graph-based Sparse Approximate Match Block Analyzer (deSAMBA), a novel approximate match-based pseudo alignment approach for the classification of long reads. deSAMBA is motivated by the fact (Chaisson and Tesler, 2012) that sequencing errors are unevenly distributed along the reads

  • Overview of the deSAMBA approach: deSAMBA combines several tailored designs and implementations to achieve high yields and fast speed simultaneously. It uses the Unitig–Burrows–Wheeler transform (BWT) data structure (Guan et al, 2018) to index the de Bruijn graph of the reference sequences and finds highly similar approximate match blocks through the index. These blocks are called sparse approximate match blocks (SAMBs), as they are usually sparsely placed along reads
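
The idea of locating sparsely placed match blocks along a noisy read can be illustrated with a toy sketch. This is not deSAMBA's implementation: it assumes a plain hash-based k-mer index in place of the Unitig-BWT index, uses exact k-mer hits rather than approximate matches, and scores taxa by total block length. All function names, sequences, and the value of k are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map every k-mer to the (taxon, reference position) pairs where it occurs.

    Assumption: a simple dictionary stands in for the Unitig-BWT index.
    """
    index = defaultdict(list)
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((taxon, i))
    return index


def find_match_blocks(read, index, k):
    """Merge consecutive k-mer hits on the same diagonal into match blocks.

    Returns {taxon: [(read_start, read_end), ...]} with half-open intervals.
    Sequencing errors break the run of hits, so blocks end up sparse.
    """
    hits = defaultdict(list)  # (taxon, ref_pos - read_pos) -> read positions
    for j in range(len(read) - k + 1):
        for taxon, i in index.get(read[j:j + k], ()):
            hits[(taxon, i - j)].append(j)

    blocks = defaultdict(list)
    for (taxon, _diagonal), positions in hits.items():
        start = prev = positions[0]
        for j in positions[1:]:
            if j > prev + 1:  # a gap in the hits closes the current block
                blocks[taxon].append((start, prev + k))
                start = j
            prev = j
        blocks[taxon].append((start, prev + k))
    return blocks


def classify(read, index, k):
    """Pick the taxon whose blocks cover the most read bases (toy scoring)."""
    blocks = find_match_blocks(read, index, k)
    if not blocks:
        return None
    score = {t: sum(end - start for start, end in bl) for t, bl in blocks.items()}
    return max(score, key=score.get)


# Hypothetical reference sequences and a read with one substitution error.
REFERENCES = {
    "taxA": "ACGGTCATTGCAAGCTTAGGCTCAATCGGA",
    "taxB": "TTGACCGTATGCCGATAGCTAGGATCCAAT",
}
K = 5
INDEX = build_kmer_index(REFERENCES, K)
NOISY_READ = "GGTCATTGCATGCTTAGGCTCA"  # taxA[2:24] with position 12 changed to T

print(find_match_blocks(NOISY_READ, INDEX, K))
print(classify(NOISY_READ, INDEX, K))
```

In this toy example the single substitution splits the read into two blocks on the same diagonal, which mirrors why match blocks on error-prone long reads tend to be sparse rather than contiguous.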


Summary

Introduction

Metagenomic sequencing is ubiquitously applied to comprehensively study environmental samples (Methé et al, 2012; Gilbert et al, 2014; Cheng et al, 2020). It reveals the compositions of microbial communities in various environments and makes it possible to study the functions of these communities and their interactions with their environments. With the rapid development of high-throughput sequencing technologies, metagenomic sequencing is promising for the analysis of microbiomes. Owing to their ability to sequence samples in real time on portable devices (Quick et al, 2016), long-read sequencing technologies have enormous potential for metagenomic studies. However, given the characteristics of long-read sequencing data, analytical challenges remain.

