Long-read amplicon denoising.

Venkatesh Kumar, Nikesh Kumar, Robert Ketteringham, Sanjay Mohan, Nicholas Bavafa, Michelli F Oliveira, Antonia Lorenzo, Kemal Eren, Ben Murrell, Thomas Vollbrecht, Brian Hanst, Mark Chernyshev, Michael Golden

doi:10.1093/nar/gkz657

Abstract

Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called ‘amplicon denoising’, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always successfully generalize to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available.

Highlights

The Pacific Biosciences platform allows complex populations of long DNA molecules to be sequenced at reasonable depth
When inferring templates using Robust Amplicon Denoising (RAD), Fast Amplicon Denoising (FAD), and other methods, we first trim off the barcodes from the .fastq reads, to ensure the true clustering is obscured
We have presented two algorithms, FAD and RAD, for denoising long PacBio amplicons

Summary

Introduction

The Pacific Biosciences platform allows complex populations of long DNA molecules to be sequenced at reasonable depth. FAD is designed for cases where an appreciable number of sequences are expected to be error free. These reads can reliably serve as our inferred templates, which avoids any form of clustering or consensus calling; abundance and neighborhood information is then used to keep or reject templates. This method performs better for shorter amplicons, higher quality sequencing, and better read-per-template coverage. RAD is more complex, and is designed for cases where very few reads are error free. This can occur in PacBio amplicon sequencing when amplicons are very long (giving fewer passes per molecule), when movie lengths are short (reducing raw read lengths), or with older sequencing chemistries. We employ a kmer-domain clustering approach, inspired by a non-parametric Bayesian procedure [21, 22], to partition reads into clusters, followed by a recursive cluster refinement procedure (in the kmer domain).
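The idea of using exact, abundant reads as templates and filtering them by abundance and neighborhood can be illustrated with a minimal Python sketch. This is not the authors' FAD implementation (the published tools are written in Julia). The function names, the squared-Euclidean kmer distance, and the thresholds `min_count`, `radius`, and `ratio` are all assumptions made for illustration only.

```python
from collections import Counter


def kmer_vector(seq, k=3):
    """Sparse k-mer count vector of a sequence (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b):
    """Squared Euclidean distance between two sparse k-mer count vectors."""
    return sum((a[key] - b[key]) ** 2 for key in set(a) | set(b))


def abundance_filter(reads, min_count=2, radius=8, ratio=10, k=3):
    """Keep exact read sequences as templates using abundance and neighborhood.

    Assumed rule (illustrative, not taken from the paper): a candidate is
    rejected when an already accepted template lies within `radius` in
    k-mer space and is at least `ratio` times more abundant.
    """
    counts = Counter(reads)
    # Candidates are exact sequences seen at least `min_count` times,
    # visited from most to least abundant (ties broken alphabetically).
    candidates = sorted(
        ((seq, n) for seq, n in counts.items() if n >= min_count),
        key=lambda item: (-item[1], item[0]),
    )
    kept = []  # tuples of (sequence, count, k-mer vector)
    for seq, n in candidates:
        vec = kmer_vector(seq, k)
        shadowed = any(
            kmer_distance(vec, kept_vec) <= radius and kept_n >= ratio * n
            for _, kept_n, kept_vec in kept
        )
        if not shadowed:
            kept.append((seq, n, vec))
    total = sum(n for _, n, _ in kept)
    # Report each accepted template with its relative frequency.
    return {seq: n / total for seq, n, _ in kept}
```

Under these assumed thresholds, two templates that differ by a single nucleotide are both retained when their abundances are comparable, while a rare sequence close to a far more abundant template is treated as an error variant and dropped. Reads seen fewer than `min_count` times never become candidates.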



Journal: Nucleic Acids Research	Publication Date: Aug 16, 2019
Citations: 38	License type: CC BY 4.0
