Levenshtein error-correcting barcodes for multiplexed DNA sequencing

Tilo Buschmann, Leonid V Bystrykh

doi:10.1186/1471-2105-14-272

Abstract

Background

High-throughput sequencing technologies are improving in quality, capacity and costs, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag or barcode that is attached to the sequencing or amplification primer and hence appears at the beginning of the sequence in every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence.

Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error-correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and Levenshtein codes.

Results

Levenshtein codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this paper we demonstrate the decreased error correction capability of Levenshtein codes in a DNA context and suggest an adaptation of Levenshtein codes that is proven capable of efficiently correcting nucleotide errors in DNA sequences. In our adaptation we take the DNA context into account and redefine the word length whenever an insertion or deletion is revealed. In simulations we show the superior error correction capability of the new method compared to traditional Levenshtein and Hamming based codes in the presence of multiple errors.

Conclusion

We present an adaptation of Levenshtein codes to DNA contexts capable of correcting a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of recovering the new length of the corrupted codeword and of correcting on average more random mutations than traditional Levenshtein or Hamming codes.

As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations and barcode length for highest performance.
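The difficulty with words of unknown length can be illustrated with a small sketch. It implements only the classical Levenshtein distance, not the adapted method of the paper, and the barcode "CAGG" and sample prefix "TTAC" are invented for illustration. When one base of the barcode is deleted, the sample DNA shifts into the barcode window, so a fixed-length window cut from the read appears to carry two errors instead of one.

```python
def levenshtein(a, b):
    """Classical Levenshtein distance: the minimum number of substitutions,
    insertions, and deletions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def decode(observed, barcodes):
    """Return the barcode closest to the observed word (first one on ties)."""
    return min(barcodes, key=lambda bc: levenshtein(observed, bc))


# Hypothetical toy data: a 4-base barcode followed by sample DNA.
barcode, sample = "CAGG", "TTAC"
corrupted = "CGG"                  # the barcode with its "A" deleted
read = corrupted + sample          # what the sequencer reports
window = read[:len(barcode)]       # fixed-length window: "CGGT"

print(levenshtein(corrupted, barcode))  # 1: true word boundary known
print(levenshtein(window, barcode))     # 2: boundary unknown, sample base leaks in
```

With the true boundary known, the deletion costs one edit; with a fixed-length window, the same event costs two, which is why a code designed to correct one error under the classical distance can fail on a single deletion inside a read.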

Highlights

High-throughput sequencing technologies are improving in quality, capacity and costs, providing versatile applications in DNA and RNA research
We present an adaptation of Levenshtein codes to DNA contexts capable of correction of a pre-defined number of insertion, deletion, and substitution mutations
Although any randomly picked synthetic nucleotide sequence can be used as a barcode, this approach is problematic because the basic parameters of the corresponding oligonucleotides, namely minimal distance, GC content, sequence redundancy etc., cannot be properly controlled [13]

Summary

Introduction

High-throughput sequencing technologies are improving in quality, capacity and costs, providing versatile applications in DNA and RNA research. Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error-correcting algorithms and codes. Since modern machines are (at the time of writing this manuscript) capable of generating up to 8 × 10^9 base pairs (8 Gbp) total read length in one lane, many samples are often combined in a single batch and sequenced as one sample. Using this multiplexed format, specific sample tags, called barcodes, are added to the amplification or sequencing primer to discriminate all sub-samples in the mixture. The protocol is efficient as long as barcodes can be read robustly [9]. It is known that multiple errors can occur with DNA sequencing due to defects in primer synthesis, the ligation process, sample pre-amplification, and sequencing. Although any randomly picked synthetic nucleotide sequence can be used as a barcode, this approach is problematic because the basic parameters of the corresponding oligonucleotides, namely minimal distance, GC content, sequence redundancy etc., cannot be properly controlled [13].
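To make the notion of controlled barcode parameters concrete, the following is a minimal sketch of building a barcode set with a guaranteed minimal pairwise distance and a bounded GC content. It is not the authors' software: the greedy enumeration, the function names, and the 40-60% GC window are assumptions chosen for illustration, and the distance used is the classical Levenshtein distance.

```python
from itertools import product


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def levenshtein(a, b):
    """Classical Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def greedy_code(length, min_dist, gc_min=0.4, gc_max=0.6):
    """Greedily collect barcodes of the given length.

    Candidates are enumerated in lexicographic order; one is accepted if its
    GC content lies in [gc_min, gc_max] (hypothetical thresholds) and it is at
    least min_dist away from every barcode accepted so far.
    """
    code = []
    for bases in product("ACGT", repeat=length):
        cand = "".join(bases)
        if not gc_min <= gc_content(cand) <= gc_max:
            continue
        if all(levenshtein(cand, c) >= min_dist for c in code):
            code.append(cand)
    return code


# Toy run: 4-base barcodes with minimal pairwise distance 3.
barcodes = greedy_code(length=4, min_dist=3)
print(len(barcodes), barcodes)
```

In the classical setting, where the length of each word is known, a minimal distance of 3 is enough to correct one error. A randomly picked set offers no such guarantee, because nothing prevents two of its members from lying only one edit apart.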

Journal: BMC Bioinformatics	Publication Date: Sep 11, 2013
Citations: 142	License type: CC BY 2.0
