Abstract

Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called ‘amplicon denoising’, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always successfully generalize to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available.

Highlights

  • The Pacific Biosciences platform allows complex populations of long DNA molecules to be sequenced at reasonable depth

  • When inferring templates using Robust Amplicon Denoising (RAD), Fast Amplicon Denoising (FAD), and other methods, we first trim the barcodes from the .fastq reads so that the true clustering is hidden from the methods

  • We have presented two algorithms, FAD and RAD, for denoising long PacBio amplicons


Summary

Introduction

The Pacific Biosciences platform allows complex populations of long DNA molecules to be sequenced at reasonable depth. FAD is designed for cases where an appreciable number of sequences are expected to be error free. These reads can reliably serve as our inferred templates, avoiding any form of clustering or consensus calling, and abundance and neighborhood information is used to keep or reject templates. This method performs better for shorter amplicons, higher-quality sequencing, and better read-per-template coverage. RAD is more complex, and is designed for cases where very few reads are error free. This can occur in PacBio amplicon sequencing when amplicons are very long (giving fewer passes per molecule), when movie lengths are short (reducing raw read lengths), or with older sequencing chemistries. We employ a kmer-domain clustering approach, inspired by a non-parametric Bayesian procedure [21, 22], to partition reads into clusters, followed by a recursive cluster refinement procedure (in the kmer domain).
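To make the FAD idea concrete, here is a minimal Python sketch of abundance-and-neighborhood filtering over exact read sequences. It is an illustration only, not the published Julia implementation: the function names, the `min_count` and `neighbor_ratio` parameters, the one-edit neighborhood, and the specific keep/reject rule are all assumptions chosen for this example.

```python
from collections import Counter


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one substitution, insertion, or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the shared prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substitution
    return a[i:] == b[i + 1:]  # exactly one extra base in `b`


def denoise_by_abundance(reads, min_count=2, neighbor_ratio=0.1):
    """Hypothetical keep/reject rule (an assumption, not the paper's rule).

    A distinct sequence is kept as a template if it was observed at least
    `min_count` times and is not a rare one-edit neighbor of an already
    kept, more abundant template. Returns template -> estimated frequency.
    """
    counts = Counter(reads)
    # Visit sequences from most to least abundant (ties broken alphabetically).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = []
    for seq, n in ranked:
        if n < min_count:
            continue  # too rare to trust as error free
        looks_like_error = any(
            within_one_edit(seq, template) and n < neighbor_ratio * template_n
            for template, template_n in kept
        )
        if not looks_like_error:
            kept.append((seq, n))
    total = sum(n for _, n in kept)
    return {seq: n / total for seq, n in kept}


if __name__ == "__main__":
    toy_reads = (
        ["ACGTACGT"] * 50    # abundant template
        + ["ACGTACGA"] * 20  # genuine variant, one nucleotide away
        + ["ACGTACGG"] * 2   # rare neighbor, treated as sequencing error
        + ["TTTTACGT"] * 1   # singleton, below min_count
    )
    for template, freq in denoise_by_abundance(toy_reads).items():
        print(template, round(freq, 3))
```

Because the rule compares abundances instead of simply merging near-identical sequences, two templates that differ by one nucleotide can both survive when each is well supported. A strategy of this kind only works when enough reads are error free, which is why the text restricts FAD to shorter amplicons and higher coverage.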

Methods
USEARCH
UNOISE
Results
Conclusion
Parallelization
F Performance
