Denoising DNA deep sequencing data - high-throughput sequencing errors and their correction.

David Laehnemann, Alice Carolyn McHardy, Arndt Borkhardt

doi:10.1093/bib/bbv029

Open Access

Journal: Briefings in Bioinformatics
Publication Date: May 29, 2015
Citations: 346
License type: CC BY

Abstract

Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and the data types for which these hold, providing guidance on which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.

Highlights

We begin with a survey of the errors generated during sequencing by five commonly used high-throughput sequencing platforms: the GS FLX and the GS Junior by 454 [1], the Complete Genomics platform [2], the HiSeq and the MiSeq by Illumina [3], the Personal Genome Machine (PGM) by Ion Torrent [4, 5] and the Real-time Sequencer (RS) by Pacific Biosciences [6].
Instead of the Levenshtein edit distance, which measures single nucleotide insertions, deletions and substitutions [56], most error correction tools use the Hamming distance [57], which accounts for substitutions only. This is usually justified by two main arguments: firstly, substitution errors are an order of magnitude more frequent than indel errors in the predominant Illumina data and, secondly, the computational complexity of approaches using only the Hamming distance is lower, which is especially important for error correction procedures in high-throughput data and for computation-intensive tasks like de novo assembly.
Whereas most tools have no error model at all, others accommodate some of the error biases we have reviewed to distinguish more precisely between sequencing errors and genuine sequence variation at low frequencies.
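
The contrast between the two distance measures named in the highlights can be made concrete with a minimal Python sketch. It is only an illustration and is not taken from any of the tools the paper surveys; the function names and the example reads are hypothetical. It shows that a single substitution costs one unit under both measures, whereas a single deletion shifts every downstream base and inflates the Hamming distance, while the Levenshtein distance stays small.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions; defined only for equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-base insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if the bases match
            ))
        prev = curr
    return prev[-1]


# Hypothetical example reads, chosen for illustration only.
reference = "ACGTACGT"
substituted = "ACGAACGT"  # one substitution (T -> A at the fourth base)
shifted = "ACTACGTA"      # third base deleted, read continues one base further

print(hamming(reference, substituted), levenshtein(reference, substituted))  # 1 1
print(hamming(reference, shifted), levenshtein(reference, shifted))          # 6 2
```

The Hamming comparison is a single linear pass over the two sequences, whereas this dynamic-programming Levenshtein computation takes time proportional to the product of their lengths, which illustrates the complexity argument given above.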
