Family reunion via error correction: an efficient analysis of duplex sequencing data

Nicholas Stoler, Gundula Povysil, Irene Tiemann-Boege, Renato Salazar, Monika Heinzl, Kateryna D Makova, Anton Nekrutenko, Barbara Arbeithuber

doi:10.1186/s12859-020-3419-8

Open Access

https://doi.org/10.1186/s12859-020-3419-8

Abstract

Background: Duplex sequencing is the most accurate approach for identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of original DNA molecules, which allows distinguishing true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are thrown away.

Results: In this paper we demonstrate that a significant fraction of these reads contains PCR or sequencing errors within duplex tags. Correction of such errors allows “reuniting” these reads with their respective families, increasing the output of the method and making it more cost effective.

Conclusions: We combine an error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0. It is written in Python, C, AWK, and Bash. It is open source and readily available through Galaxy, Bioconda, and GitHub: https://github.com/galaxyproject/dunovo.
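The idea of reuniting singleton reads with their families can be illustrated with a minimal Python sketch. It is not Du Novo's implementation: the Hamming-distance rule, the one-mismatch threshold, and the function names `hamming`, `correct_tags`, and `family_sizes` are assumptions made only for this example.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_tags(tags, max_dist=1):
    """Map each observed tag to a corrected tag (illustrative rule only).

    A tag seen exactly once (a singleton) is reassigned to a tag seen
    more than once if exactly one such tag lies within `max_dist`
    mismatches. Ambiguous or distant singletons are left unchanged.
    """
    counts = Counter(tags)
    anchors = [t for t, n in counts.items() if n > 1]
    mapping = {}
    for tag, n in counts.items():
        mapping[tag] = tag
        if n > 1:
            continue
        close = [a for a in anchors
                 if len(a) == len(tag) and hamming(tag, a) <= max_dist]
        if len(close) == 1:
            mapping[tag] = close[0]
    return mapping


def family_sizes(tags, mapping):
    """Count reads per family after applying the correction mapping."""
    return Counter(mapping[t] for t in tags)


# Toy input: one singleton differs from a 3-read family by one base.
tags = ["ACGTACGT"] * 3 + ["ACGTACGA"] + ["TTTTCCCC"] * 2 + ["GGGGGGGG"]
mapping = correct_tags(tags)
sizes = family_sizes(tags, mapping)
```

In this toy input the singleton `ACGTACGA` joins the `ACGTACGT` family, while `GGGGGGGG` has no close neighbor and stays a singleton.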

Highlights

Duplex sequencing is the most accurate approach for identification of sequence variants present at very low frequencies
The first dataset was produced by Schmitt et al. [9], who employed Duplex Sequencing (DS) to identify a rare mutation at the ABL1 locus responsible for resistance to imatinib, a chronic myeloid leukemia therapeutic compound
Since each DNA fragment is labeled by two tags, one at each end, the theoretical upper bound for the number of unique combinations is 4^(12 + 12)
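
The upper bound in the last highlight can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `tag_combinations` and its parameters are hypothetical, and the tag length of 12 is read off the exponent in the highlight.

```python
def tag_combinations(tag_length=12, tags_per_fragment=2, alphabet_size=4):
    """Number of distinct tag combinations for one fragment.

    Each tag position can hold one of `alphabet_size` bases, and a
    fragment carries `tags_per_fragment` tags of `tag_length` bases.
    """
    return alphabet_size ** (tag_length * tags_per_fragment)


upper_bound = tag_combinations()
print(upper_bound)  # 281474976710656, i.e. 4^24 = 2^48
```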

Summary

Introduction

Duplex sequencing is the most accurate approach for identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of original DNA molecules, which allows distinguishing true nucleotide substitutions from PCR amplification and sequencing artifacts. This strategy comes at a cost—sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole genome duplex sequencing prohibitively expensive. The descendants of each original DNA fragment are identified and grouped together using tags—one sorts tags in sequencing reads lexicographically and all reads containing the same tag are bundled into a family.
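
The sort-and-bundle step described above might look like the following Python sketch. The `(tag, sequence)` input format and the function name `group_into_families` are assumptions for illustration, not the data structures used by Du Novo.

```python
from itertools import groupby
from operator import itemgetter


def group_into_families(reads):
    """Bundle reads that share a tag into families.

    `reads` is an iterable of (tag, sequence) pairs. Reads are sorted
    lexicographically by tag, then each run of identical tags becomes
    one family. Returns a dict mapping tag -> list of sequences.
    """
    ordered = sorted(reads, key=itemgetter(0))
    return {
        tag: [seq for _, seq in group]
        for tag, group in groupby(ordered, key=itemgetter(0))
    }


# Toy input with made-up tags and read names.
reads = [("TTAG", "read1"), ("ACGT", "read2"), ("TTAG", "read3")]
families = group_into_families(reads)
```

Sorting first matters because `groupby` only merges adjacent items with equal keys; on unsorted input the same tag could be split into several groups.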

Journal: BMC Bioinformatics
Publication Date: Mar 4, 2020
Citations: 11
License type: open-access

Similar Papers

Redefining “Gold Standard”: Ultra-Sensitive Characterization of Commercial DNA Standards with Duplex Sequencing
Jacob Higgins ... Jesse J Salk
Blood | VOL. 134
13 Nov 2019

Detection of BCR-ABL1 Compound and Polyclonal Mutants in Chronic Myeloid Leukemia Patients Using a Novel Next Generation Sequencing Approach That Minimises PCR and Sequencing Errors
Wendy T Parker ... Susan Branford
Blood | VOL. 124
06 Dec 2014

Abstract 441: The stochastic nature of errors in next-generation sequencing of circulating cell-free DNA
Hunter R Underhill ... Preetida J Bhetariya
Cancer Research | VOL. 79
01 Jul 2019

Abstract 441: The stochastic nature of errors in next-generation sequencing of circulating cell-free DNA
Hunter R Underhill ... Sabine Hellwig
01 Jul 2019
