Abstract

DNA data storage systems have rapidly developed with novel error-correcting techniques, random access algorithms, and query systems. However, designing an algorithm for DNA storage systems is challenging, mainly due to the unpredictable nature of errors and the extremely high price of experiments. Thus, a simulator is of interest that can imitate the error statistics of a DNA storage system and replace the experiments in developing processes. We introduce novel generative adversarial networks that learn DNA storage channel statistics. Our simulator takes oligos (DNA sequences to write) as an input and generates a FASTQ file that includes output DNA reads and quality scores as if the oligos are synthesized and sequenced. We trained the proposed simulator with data from a single experiment consisting of 14,400 input oligo strands and 12,108,573 output reads. The error statistics between the input and the output of the trained generator match the actual error statistics, including the error rate at each position, the number of errors for each nucleotide, and high-order statistics. The code is available at <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><uri>https://github.com/gyfbianhuanyun/DNA_storage_simulator_GAN</uri></monospace> .

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call