DNA Codewords Research Articles

Background: Barcode multiplexing is a key strategy for sharing the rising capacity of next-generation sequencing devices: synthetic DNA tags, called barcodes, are attached to natural DNA fragments during library preparation. Different libraries can be labeled individually with barcodes and then sequenced jointly. A post-processing step, called demultiplexing, uses these DNA labels to sort the sequencing data by origin; its performance is mainly determined by the characteristics of the DNA code words used as labels. Currently, there are two strategies for barcoding: one is based on the Hamming distance, the other uses the edit metric to measure the distance between code words. The theory of channel coding provides well-known code constructions for the Hamming metric. They provide a large number of code words with variable lengths and maximal correction capability for substitution errors. However, some sequencing platforms are known to have exceptionally high numbers of insertion or deletion errors. Barcodes based on the edit distance can take insertion and deletion errors into account in the decoding process. Unfortunately, no explicit code construction is known that gives optimal codes for the edit metric.

Results: In the present work we take an entirely different perspective to obtain DNA barcodes. We consider a concatenated code construction producing so-called watermark codes, which were first proposed by Davey and Mackay to communicate via binary channels with synchronization errors. We adapt and extend the concepts of watermark codes to use them for DNA sequencing. Moreover, we provide an exemplary set of barcodes that are experimentally compatible with common next-generation sequencing platforms. Finally, a realistic simulation scenario is used to evaluate the proposed codes and to show that the watermark concept is suitable for DNA sequencing applications.

Conclusion: Our adaptation of watermark codes enables the construction of barcodes that are capable of correcting substitution, insertion and deletion errors. The presented approach has the advantage of not needing any markers or technical sequences to recover the position of the barcode in the sequencing reads, a requirement that poses a significant restriction in other approaches.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0482-7) contains supplementary material, which is available to authorized users.
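
The difference between the two barcoding strategies named in the abstract can be illustrated with a small Python sketch. This is not the watermark decoder described in the article; it is only a minimal comparison of Hamming and edit (Levenshtein) distance, with a naive nearest-barcode demultiplexer. The function names, the example barcodes and the `max_dist` threshold are assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def demultiplex(read_prefix: str, barcodes: list[str], max_dist: int):
    """Hypothetical helper: return the unique closest barcode, else None."""
    scored = sorted((levenshtein(read_prefix, bc), bc) for bc in barcodes)
    best_dist, best = scored[0]
    if best_dist > max_dist:
        return None  # too many errors to trust the assignment
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # ambiguous: two barcodes are equally close
    return best


# Illustrative barcode that lost its first base, so the fixed-length
# window picks up one extra base ("A") from the following fragment.
barcode = "ACGTACGT"
observed = "CGTACGTA"
print(hamming(barcode, observed))      # every position is shifted
print(levenshtein(barcode, observed))  # one deletion plus one insertion
print(demultiplex(observed, [barcode, "TTGGCCAA"], max_dist=2))
```

A single deletion shifts every following base, so the Hamming distance treats the read as almost entirely wrong, while the edit distance stays small. This is the motivation the abstract gives for codes that handle insertions and deletions.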

DNA codeword design has been a fundamental problem since the early days of DNA computing. The problem calls for finding large sets of single DNA strands that do not crosshybridize to themselves, to each other or to others' complements. Such strands represent so-called domains, particularly in the language of chemical reaction networks (CRNs). The problem has been shown to be of interest in other areas as well, including DNA memories and phylogenetic analyses, because of the error-correction and error-prevention properties of such sets. In prior work, a theoretical framework to analyze this problem was developed, and natural and simple versions of Codeword Design were shown to be NP-complete under any single reasonable metric that approximates the Gibbs energy. In practice, this makes it very difficult to find any general procedure that produces such maximal sets exactly and efficiently. In this framework, codeword design is partially reduced to finding large sets of strands maximally separated in DNA spaces; the size of such sets therefore depends on the geometry of these spaces. Here, the authors describe in detail a new general technique to embed DNA spaces in Euclidean spaces in such a way that oligonucleotides with high (low, respectively) hybridization affinity are mapped to neighboring (remote, respectively) points in a geometric lattice. This embedding materializes long-held metaphors that compare codeword design with error-correcting code design in information theory in terms of sphere packing. It leads to designs that are in some cases provably nearly optimal for small oligonucleotide sizes, whenever the corresponding spherical codes in Euclidean spaces are known to be so. It also leads to upper and lower bounds on estimates of the size of optimal codes for oligonucleotides under 20-mers in length, as well as for a few infinite families of DNA strand lengths, based on estimates of the kissing (or contact) number for sphere codes in high-dimensional Euclidean spaces. Conversely, the authors show how solutions to DNA codeword design obtained by experimental or other means can also provide solutions to difficult spherical packing geometric problems via these approaches. Finally, the reduction suggests a tool to provide some insight into the approximate structure of the Gibbs energy landscapes, which play a primary role in the design and implementation of biomolecular programs.
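
The non-crosshybridization requirement stated at the start of this abstract can be sketched as a pairwise check in Python. The sketch rests on a strong simplifying assumption: plain Hamming distance to a strand or to its reverse complement stands in for hybridization affinity. It is not the Gibbs-energy-approximating metric or the Euclidean embedding that the abstract describes, and the function names and the `min_dist` threshold are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Watson-Crick complement, read in the opposite direction."""
    return strand.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strands."""
    if len(a) != len(b):
        raise ValueError("strands must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_separated(strands: list[str], min_dist: int) -> bool:
    """Crude codeword-set check (assumption: Hamming as affinity proxy).

    Each strand must differ in at least `min_dist` positions from its own
    reverse complement, from every other strand, and from every other
    strand's reverse complement.
    """
    for i, s in enumerate(strands):
        if hamming(s, reverse_complement(s)) < min_dist:
            return False  # strand could pair with a copy of itself
        for t in strands[i + 1:]:
            if hamming(s, t) < min_dist:
                return False  # strands too similar to tell apart
            if hamming(s, reverse_complement(t)) < min_dist:
                return False  # s is close to the complement of t
    return True


print(is_separated(["AAAA", "CCCC"], min_dist=3))
print(is_separated(["AAAA", "TTTT"], min_dist=1))
```

The second set fails because `TTTT` is the exact reverse complement of `AAAA`, so the two strands would hybridize perfectly. A realistic design would replace the Hamming proxy with a thermodynamic model.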

Articles published on DNA Codewords

Construction of DNA Codes From Composite Matrices and a Bio-Inspired Optimization Algorithm

DNA Linear Block Codes: Generation, Error-Detection, and Error-Correction of DNA Codeword

DNA codes over finite local Frobenius non-chain rings of length 5 and nilpotency index 4

Fractal construction of constrained code words for DNA storage systems

DNA codes over finite local Frobenius non-chain rings of length 4

Correcting a Single Indel/Edit for DNA-Based Data Storage: Linear-Time Encoders and Order-Optimality

On conflict free DNA codes

Family of Constrained Codes for Archival DNA Data Storage

An Improved Non-dominated Sorting Genetic Algorithm-II (INSGA-II) applied to the design of DNA codewords

Testing DNA code words properties of regular languages

Insertion and deletion correcting DNA barcodes based on watermarks.

DNA Codeword Design: Theory and Applications

Geometric Approaches to Gibbs Energy Landscapes and DNA Oligonucleotide Design

A DNA assembly model of sentence generation

An Efficient Genetic Algorithm Based on the Cultural Algorithm Applied to DNA Codewords Design

DNA Code Word Design for DNA Computing with Real-Time Polymerase Chain Reaction

DNA Code Word Design for DNA Computing with Real-Time Polymerase Chain Reaction

Secondary Structure Prediction of Interacting RNA Molecules

RNA CODEWORDS AND PROTEIN SYNTHESIS. THE EFFECT OF TRINUCLEOTIDES UPON THE BINDING OF SRNA TO RIBOSOMES.
