Streamlined analysis of duplex sequencing data with Du Novo.

Nicholas Stoler, Anton Nekrutenko, Barbara Arbeithuber, Kateryna D Makova, Wilfried Guiblet

doi:10.1186/s13059-016-1039-4

Open Access

Abstract

Duplex sequencing was originally developed to detect rare nucleotide polymorphisms normally obscured by the noise of high-throughput sequencing. Here we describe a new, streamlined, reference-free approach for the analysis of duplex sequencing data. We show that the approach performs well on simulated data and precisely reproduces previously published results. We then apply it to a newly produced dataset, enabling us to type low-frequency variants in human mitochondrial DNA. Finally, we provide all necessary tools as stand-alone components and integrate them into the Galaxy platform. All analyses performed in this manuscript can be repeated exactly as described at http://usegalaxy.org/duplex.

Electronic supplementary material: The online version of this article (doi:10.1186/s13059-016-1039-4) contains supplementary material, which is available to authorized users.

Highlights

The term “genetic variation” is often used to imply allelic combinatorics within diploid organisms such as humans or Drosophila.
In order to group single-stranded families from the same fragment together, we normalize the order of the concatenation to produce a “canonical barcode”, which will be identical for both strands.
The order of the canonical barcode is determined by a simple string comparison.
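
The canonical-barcode idea can be illustrated with a minimal Python sketch. It is not the Du Novo implementation: the function name `canonical_barcode`, the `"ab"`/`"ba"` order labels, the 12-nt example tags, and the choice to place the lexicographically smaller tag first are all assumptions made for illustration. The source states only that the concatenation order is normalized by a simple string comparison.

```python
from typing import Tuple


def canonical_barcode(tag_a: str, tag_b: str) -> Tuple[str, str]:
    """Return (canonical barcode, order label) for the two tags of one read pair.

    tag_a and tag_b are the tags read from the first and second mate.
    Assumption: the lexicographically smaller tag is placed first.
    """
    if tag_a <= tag_b:
        return tag_a + tag_b, "ab"  # tags were already in canonical order
    return tag_b + tag_a, "ba"      # tags were swapped to reach canonical order


# Hypothetical tags: read pairs from opposite strands of one fragment
# carry the same two tags, but in swapped order.
top = canonical_barcode("ACGTACGTACGT", "TTGCATTGCATT")
bottom = canonical_barcode("TTGCATTGCATT", "ACGTACGTACGT")
```

Both calls return the same barcode string, so the two single-stranded families can be grouped under one key, while the order label still records which strand each family came from.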

Summary

Background

The term “genetic variation” is often used to imply allelic combinatorics within diploid organisms such as humans or Drosophila. Because high-throughput sequencing technologies exhibit considerable amounts of noise [3], it becomes increasingly difficult to reliably call variants with frequencies below 1 % [4,5,6,7,8,9]. In these situations, increased sequencing depth does not improve the predictive power but instead introduces additional noise. Today the vast majority of strategies for the identification of low-frequency sequence variants rely on next-generation sequencing technologies. Noise reduction in these approaches ranges from simple base-quality filtering to complex statistical strategies incorporating instrument and mapping errors [4, 7, 14]. We demonstrate the application of this approach by validating rare variants in the human mitochondrial genome.

Results and discussion

Conclusions

Methods