Abstract

In recent years, DNA has emerged as a potentially viable storage technology. DNA synthesis, which refers to the task of writing the data into DNA, is perhaps the most costly part of existing storage systems. Consequently, the high cost and low throughput of available DNA synthesis technologies limit their practical use. It has been found that the homopolymer run (i.e., the repetition of the same nucleotide) is a major factor affecting synthesis and sequencing errors. Recently, [26] raised and studied the coding problem of efficient synthesis for DNA-based storage systems. Among other things, they studied the maximal code size under synthesis constraints. In [29], the authors studied the role of batch optimization in reducing the cost of large-scale DNA synthesis, for a given pool S of random quaternary strings of fixed length. This problem is related to the one posed in [26] and can be viewed as the opposite side of the same coin: instead of seeking the largest code in which every codeword can be synthesized in a certain amount of time, the authors of [29] asked what the average synthesis time of a randomly chosen string is. Following the lead of [29], in this paper we take a step towards the theoretical understanding of DNA synthesis and study the homopolymer run of length k ≥ 1. Specifically, we are given a set of DNA strands S, randomly drawn from a Markovian distribution modeling a general homopolymer run length constraint, that we wish to synthesize.
For this problem, we derive asymptotically tight high-probability lower and upper bounds on the cost of DNA synthesis, for any k ≥ 1. Our bounds imply that, perhaps surprisingly, the periodic sequence ACGT is asymptotically optimal in the sense of achieving the smallest possible cost. Our main technical contribution is the representation of the DNA synthesis process as a certain constrained system, for which string techniques can be applied.
