Information-Theoretic Foundations of DNA Data Storage

Abstract

Due to its longevity and enormous information density, DNA is an attractive medium for archival data storage. Thanks to rapid technological advances, DNA storage is becoming practically feasible, as demonstrated by a number of experimental storage systems, making it a promising solution for our society's increasing need for data storage. While DNA molecules in living things can consist of millions of nucleotides, technological constraints mean that, in practice, data is stored on many short DNA molecules, which are preserved in a DNA pool and cannot be spatially ordered. Moreover, imperfections in sequencing, synthesis, and handling, as well as DNA decay during storage, introduce random noise into the system, making the task of reliably storing and retrieving information in DNA challenging. This unique setup raises a natural information-theoretic question: how much information can be reliably stored on and reconstructed from millions of short noisy sequences? The goal of this monograph is to address this question by discussing the fundamental limits of storing information on DNA. Motivated by current technological constraints on DNA synthesis and sequencing, we propose a probabilistic channel model that captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered fashion; (2) the molecules are corrupted by noise; and (3) the data is read by randomly sampling from the DNA pool. Our goal is to investigate the impact of each of these key aspects on the capacity of the DNA storage system. Rather than focusing on coding-theoretic considerations and computationally efficient encoding and decoding, we aim to build an information-theoretic foundation for the analysis of these channels, developing tools for achievability and converse arguments.
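
The three aspects of the channel model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the monograph's formal model: the noise is taken to be independent substitutions with probability `p` applied to each read, the sampling is taken to be uniform with replacement, and the names `substitute` and `dna_storage_channel` are hypothetical.

```python
import random

ALPHABET = "ACGT"

def substitute(molecule, p, rng):
    """Replace each nucleotide, independently with probability p, by a different one."""
    out = []
    for base in molecule:
        if rng.random() < p:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def dna_storage_channel(molecules, num_reads, p, seed=0):
    """Return num_reads noisy reads of the stored molecules.

    Assumptions (not taken from the source): uniform sampling with
    replacement and i.i.d. substitution noise applied per read.
    """
    rng = random.Random(seed)
    pool = list(molecules)
    rng.shuffle(pool)                               # (1) the order of the molecules is lost
    reads = []
    for _ in range(num_reads):
        molecule = rng.choice(pool)                 # (3) random sampling from the pool
        reads.append(substitute(molecule, p, rng))  # (2) noise corrupts the molecule
    return reads

if __name__ == "__main__":
    stored = ["ACGTACGT", "TTTTGGGG", "CCAACCAA"]
    for read in dna_storage_channel(stored, num_reads=5, p=0.1):
        print(read)
```

With `p = 0` every read is an exact copy of some stored molecule, but some molecules may be read several times and others not at all, which is the effect of sampling that the abstract singles out.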

Similar Papers
  • Research Article
  • Citations: 45
  • 10.1109/tit.2021.3058966
DNA-Based Storage: Models and Fundamental Limits
  • Feb 2, 2021
  • IEEE Transactions on Information Theory
  • Ilan Shomorony + 1 more

Due to its longevity and enormous information density, DNA is an attractive medium for archival storage. In this work, we study the fundamental limits and trade-offs of DNA-based storage systems by introducing a new channel model, which we call the noisy shuffling-sampling channel. Motivated by current technological constraints on DNA synthesis and sequencing, this model captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules; (2) the molecules are corrupted by noise during synthesis and sequencing and (3) the data is read by randomly sampling from the DNA pool. We provide capacity results for this channel under specific noise and sampling assumptions and show that, in many scenarios, a simple index-based coding scheme is optimal.
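
To make the effect of random sampling concrete, the following sketch estimates what fraction of the stored molecules is seen at least once. It assumes uniform sampling with replacement, which the abstract does not specify, and the function names are hypothetical. Under that assumption a given molecule is missed by all `num_reads` draws with probability `(1 - 1/num_molecules) ** num_reads`, which is close to `exp(-num_reads / num_molecules)` when the pool is large.

```python
import math
import random

def expected_fraction_seen(num_molecules, num_reads):
    """Exact expected fraction of molecules drawn at least once (uniform, with replacement)."""
    return 1.0 - (1.0 - 1.0 / num_molecules) ** num_reads

def simulated_fraction_seen(num_molecules, num_reads, seed=0):
    """Monte Carlo estimate of the same quantity from a single sampling run."""
    rng = random.Random(seed)
    seen = {rng.randrange(num_molecules) for _ in range(num_reads)}
    return len(seen) / num_molecules

if __name__ == "__main__":
    m = 10_000
    for coverage in (0.5, 1.0, 2.0, 5.0):
        n = int(coverage * m)
        print(coverage,
              round(expected_fraction_seen(m, n), 4),
              round(1.0 - math.exp(-coverage), 4),
              round(simulated_fraction_seen(m, n), 4))
```

Even when there are as many reads as molecules, a sizeable share of the pool is never observed under this assumption, so a code for such a channel has to tolerate missing molecules.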

  • Conference Article
  • Citations: 107
  • 10.1109/isit.2017.8007106
Fundamental limits of DNA storage systems
  • Jun 1, 2017
  • Reinhard Heckel + 3 more

Due to its longevity and enormous information density, DNA is an attractive medium for archival storage. In this work, we study the fundamental limits and tradeoffs of DNA-based storage systems under a simple model, motivated by current technological constraints on DNA synthesis and sequencing. Our model captures two key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered way and (2) the data is read by randomly sampling from this DNA pool. Under this model, we characterize the storage capacity, and show that a simple index-based coding scheme is optimal.
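
The idea behind an index-based coding scheme can be sketched as follows. This is an illustration rather than the authors' construction: it works on bit strings instead of nucleotides, ignores noise, and the names `encode` and `decode` are hypothetical. Each molecule spends `index_len` positions on its own position in the message, so the decoder can restore the order from unordered, possibly duplicated reads.

```python
def encode(bits, payload_len, index_len):
    """Split a bit string into chunks and prefix each chunk with its binary index."""
    if len(bits) % payload_len != 0:
        raise ValueError("message length must be a multiple of payload_len")
    num_molecules = len(bits) // payload_len
    if num_molecules > 2 ** index_len:
        raise ValueError("index_len is too small to address every molecule")
    return [
        format(i, f"0{index_len}b") + bits[i * payload_len:(i + 1) * payload_len]
        for i in range(num_molecules)
    ]

def decode(reads, index_len, num_molecules):
    """Reorder reads by index; None marks a molecule that was never read."""
    chunks = [None] * num_molecules
    for read in reads:
        i = int(read[:index_len], 2)
        if i < num_molecules:
            chunks[i] = read[index_len:]
    return chunks

if __name__ == "__main__":
    message = "1011001110001111"
    molecules = encode(message, payload_len=4, index_len=2)
    reads = molecules[::-1] + molecules[:2]   # shuffled order, with duplicates
    print("".join(decode(reads, 2, len(molecules))) == message)
```

The cost of the index is visible in the arithmetic: only `payload_len / (index_len + payload_len)` of each molecule carries data, and `index_len` must grow with the logarithm of the number of molecules.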

  • Research Article
  • Citations: 233
  • 10.1038/s41598-019-45832-6
A Characterization of the DNA Data Storage Channel
  • Jul 4, 2019
  • Scientific Reports
  • Reinhard Heckel + 2 more

Owing to its longevity and enormous information density, DNA, the molecule encoding biological information, has emerged as a promising archival storage medium. However, due to technological constraints, data can only be written onto many short DNA molecules that are stored in an unordered way, and can only be read by sampling from this DNA pool. Moreover, imperfections in writing (synthesis), reading (sequencing), storage, and handling of the DNA, in particular amplification via PCR, lead to a loss of DNA molecules and induce errors within the molecules. In order to design DNA storage systems, a qualitative and quantitative understanding of the errors and the loss of molecules is crucial. In this paper, we characterize those error probabilities by analyzing data from our own experiments as well as from experiments of two different groups. We find that errors within molecules are mainly due to synthesis and sequencing, while imperfections in handling and storage lead to a significant loss of sequences. The aim of our study is to help guide the design of future DNA data storage systems by providing a quantitative and qualitative understanding of the DNA data storage channel.

  • Research Article
  • Citations: 16
  • 10.1103/physreve.92.032703
Pulling short DNA molecules having defects on different locations.
  • Sep 2, 2015
  • Physical Review E
  • Amar Singh + 1 more

We present a study on the role of defects on the stability of short DNA molecules. We consider short DNA molecules (16 base pairs) and investigate the thermal as well as mechanical denaturation of these molecules in the presence of defects that occur anywhere in the molecule. For the investigation, we consider four different kinds of chains. Not only are the ratios of AT to GC different in these molecules but also the distributions of AT and GC along the molecule are different. With suitable modifications in the statistical model to show the defect in a pair, we investigate the denaturation of short DNA molecules in thermal as well as constant force ensembles. In the force ensemble, we pulled the DNA molecule from each end (keeping other end free) and observed some interesting features of opening of the molecule in the presence of defects in the molecule. We calculate the probability of opening of the DNA molecule in the constant force ensemble to explain the opening of base pairs and hence the denaturation of molecules in the presence of defects.

  • Research Article
  • Citations: 173
  • 10.1016/s0021-9258(17)42623-8
The DNA dependence of the ATPase activity of DNA gyrase.
  • Dec 1, 1984
  • Journal of Biological Chemistry
  • A Maxwell + 1 more

We have studied the ATPase activity of DNA gyrase both in the absence and presence of DNA. In the absence of DNA we show that the gyrase B protein alone has a very low level of ATPase activity which can be increased many-fold by pretreatment of the B protein with heat or urea. When both the gyrase A protein and linear DNA are also present, the ATPase activity of the untreated B protein is greatly stimulated. We find that the extent of stimulation is dependent upon the length of the DNA but largely independent of DNA sequence. DNA molecules greater than 100 base pairs in length are much more effective in stimulating the gyrase ATPase than those of 70 base pairs or less, although short DNA molecules will stimulate the ATPase at high concentrations. The behavior of long and short DNA molecules with respect to ATPase stimulation is also reflected in their abilities to bind DNA gyrase. To account for these data we propose a model for the interaction of gyrase with ATP and DNA in which ATP hydrolysis requires the binding of DNA to two sites on the enzyme.

  • Research Article
  • Citations: 2
  • 10.1002/elps.201400168
High-throughput DNA separation in nanofilter arrays.
  • Jul 14, 2014
  • Electrophoresis
  • Sungup Choi + 3 more

We numerically investigated the dynamics of short double-stranded DNA molecules moving through a deep-shallow alternating nanofilter, by utilizing Brownian dynamics simulation. We propose a novel mechanism for high-throughput DNA separation with a high electric field, which was originally predicted by Laachi et al. [Phys. Rev. Lett. 2007, 98, 098106]. In this work, we show that DNA molecules deterministically move along different electrophoretic streamlines according to their length, owing to geometric constraint at the exit of the shallow region. Consequently, it is more probable that long DNA molecules pass over a deep well region without significant lateral migration toward the bottom of the deep well, which is in contrast to the long dwelling time for short DNA molecules. We investigated the dynamics of DNA passage through a nanofilter facilitating electrophoretic field kinematics. The statistical distribution of the DNA molecules according to their size clearly corroborates our assumption. On the other hand, it was also found that the tapering angle between the shallow and deep regions significantly affects the DNA separation performance. The current results show that the nonuniform field effect combined with geometric constraint plays a key role in nanofilter-based DNA separation. We expect that our results will be helpful in designing and operating nanofluidics-based DNA separation devices and in understanding the polymer dynamics in confined geometries.

  • Conference Article
  • Citations: 48
  • 10.1109/isit.2019.8849789
Capacity Results for the Noisy Shuffling Channel
  • Jul 1, 2019
  • Ilan Shomorony + 1 more

Motivated by DNA-based storage, we study the noisy shuffling channel, which can be seen as the concatenation of a standard noisy channel (such as the BSC) and a shuffling channel, which breaks the data block into small pieces and shuffles them. This channel models a DNA storage system, by capturing two of its key aspects: (1) the data is written onto many short DNA molecules that are stored in an unordered way and (2) the molecules are corrupted by noise at synthesis, sequencing, and during storage. For the BSC-shuffling channel we characterize the capacity exactly (for a large set of parameters), and show that a simple index-based coding scheme is optimal.

  • Research Article
  • Citations: 41
  • 10.1021/jp110803a
Conductance through Short DNA Molecules
  • Feb 7, 2011
  • The Journal of Physical Chemistry C
  • Aleksandar Staykov + 2 more

The conductance through short DNA molecules connected to gold electrodes is studied with density functional theory and nonequilibrium Green’s function method combined with density functional theory. The anchoring of the molecules to the electrodes is investigated, and in addition to the covalent S−Au bond, weak interactions between the aromatic heterocyclic bases and the electrodes are found. These weak interactions are important for the electron transport through DNA molecules. A tunneling mechanism is suggested, and the conductive properties of the nucleotides in a metal−molecule−metal junction are compared. Different four-nucleotide DNA sequences are investigated. A significant value for the current, 20 pA, is calculated for 1.5 V applied bias for a DNA sequence consisting of guanine and cytosine nucleotides. It is shown that adenine-thymine nucleotide pairs introduce potential barriers for the electron transport and therefore significantly reduce the conductance. The obtained results are compared with recent experimental observations (Nanotechnology 2009, 20, 115502) and confirm the possibility for electron transport through DNA molecules as well as provide an explanation for the reduced conductance through DNA sequences that contain adenine-thymine nucleotide pairs. The results are compared with a previous theoretical study, performed with the extended Hückel method (ChemPhysChem 2003, 4, 1256), which reports low conductance for DNA molecules. The difference in the conclusions is due to the applied bias self-consistent field calculations used in the recent study, which take into account the changes of the transmission probabilities with the bias.

  • Research Article
  • Citations: 81
  • 10.1073/pnas.2114937118
Single-molecule sequencing reveals a large population of long cell-free DNA molecules in maternal plasma
  • Dec 6, 2021
  • Proceedings of the National Academy of Sciences
  • Stephanie C Y Yu + 11 more

In the field of circulating cell-free DNA, most of the studies have focused on short DNA molecules (e.g., <500 bp). The existence of long cell-free DNA molecules has been poorly explored. In this study, we demonstrated that single-molecule real-time sequencing allowed us to detect and analyze a substantial proportion of long DNA molecules from both fetal and maternal sources in maternal plasma. Such molecules were beyond the size detection limits of short-read sequencing technologies. The proportions of long cell-free DNA molecules in maternal plasma over 500 bp were 15.5%, 19.8%, and 32.3% for the first, second, and third trimesters, respectively. The longest fetal-derived plasma DNA molecule observed was 23,635 bp. Long plasma DNA molecules demonstrated predominance of A or G 5' fragment ends. Pregnancies with preeclampsia demonstrated a reduction in long maternal plasma DNA molecules, reduced frequencies for selected 5' 4-mer end motifs ending with G or A, and increased frequencies for selected motifs ending with T or C. Finally, we have developed an approach that employs the analysis of methylation patterns of the series of CpG sites on a long DNA molecule for determining its tissue origin. This approach achieved an area under the curve of 0.88 in differentiating between fetal and maternal plasma DNA molecules, enabling the determination of maternal inheritance and recombination events in the fetal genome. This work opens up potential clinical utilities of long cell-free DNA analysis in maternal plasma including noninvasive prenatal testing of monogenic diseases and detection/monitoring of pregnancy-associated disorders such as preeclampsia.

  • Research Article
  • Citations: 7
  • 10.1016/bs.mie.2016.08.020
How to Measure Separations and Angles Between Intramolecular Fluorescent Markers.
  • Jan 1, 2016
  • Methods in Enzymology
  • K.I Mortensen + 3 more


  • Research Article
  • Citations: 50
  • 10.1016/j.bpj.2021.02.027
DNA length tunes the fluidity of DNA-based condensates
  • Feb 26, 2021
  • Biophysical Journal
  • Fernando Muzzopappa + 2 more

Living organisms typically store their genomic DNA in a condensed form. Mechanistically, DNA condensation can be driven by macromolecular crowding, multivalent cations, or positively charged proteins. At low DNA concentration, condensation triggers the conformational change of individual DNA molecules into a compacted state, with distinct morphologies. Above a critical DNA concentration, condensation goes along with phase separation into a DNA-dilute and a DNA-dense phase. The latter DNA-dense phase can have different material properties and has been reported to be rather liquid-like or solid-like depending on the characteristics of the DNA and the solvent composition. Here, we systematically assess the influence of DNA length on the properties of the resulting condensates. We show that short DNA molecules with sizes below 1 kb can form dynamic liquid-like assemblies when condensation is triggered by polyethylene glycol and magnesium ions, binding of linker histone H1, or nucleosome reconstitution in combination with linker histone H1. With increasing DNA length, molecules preferentially condense into less dynamic more solid-like assemblies, with phage λ-DNA with 48.5 kb forming mostly solid-like assemblies under the conditions assessed here. The transition from liquid-like to solid-like condensates appears to be gradual, with DNA molecules of roughly 1–10 kb forming condensates with intermediate properties. Titration experiments with linker histone H1 suggest that the fluidity of condensates depends on the net number of attractive interactions established by each DNA molecule. We conclude that DNA molecules that are much shorter than a typical human gene are able to undergo liquid-liquid phase separation, whereas longer DNA molecules phase separate by default into rather solid-like condensates. We speculate that the local distribution of condensing factors can modulate the effective length of chromosomal domains in the cell. We anticipate that the link between DNA length and fluidity established here will improve our understanding of biomolecular condensates involving DNA.

  • Research Article
  • Citations: 16
  • 10.1063/1.3682984
DNA conformation in nanochannels: Monte Carlo simulation studies using a primitive DNA model
  • Mar 1, 2012
  • The Journal of Chemical Physics
  • Rakwoo Chang + 1 more

We have performed canonical ensemble Monte Carlo simulations of a primitive DNA model to study the conformation of 2.56 to 21.8 μm long DNA molecules confined in nanochannels at various ionic concentrations, in comparison with our previous experimental findings. In the model, the DNA molecule is represented as a chain of charged hard spheres connected by fixed bond length and the nanochannels as planar hard walls. System potentials consist of explicit electrostatic potential along with short-ranged hard-sphere and angle potentials. Our primitive model system provides valuable insight into the DNA conformation, which cannot be easily obtained from experiments or theories. First, the visualization and statistical analysis of DNA molecules in various channel dimensions and ionic strengths verified the formation of locally coiled structures such as backfolding or hairpin and their significance even in highly stretched states. Although the folding events mostly occur within the region of ~0.5 μm from both chain ends, a significant portion of the events still takes place in the middle region. Second, our study also showed that the two controlling factors widely used in stretching DNA molecules, channel dimension and ionic strength, have different influences on the local DNA structure. Ionic strength changes local correlation between neighboring monomers by controlling the strength of electrostatic interaction (and thus the persistence length of DNA), which leads to more coiled local conformation. On the other hand, channel dimension controls the overall stretch by applying the geometric constraint to the non-local DNA conformation instead of directly affecting local correlation. Third, the molecular weight dependence of DNA stretch was observed especially in the low stretch regime, which is mainly due to the fact that low stretch modes observed in short DNA molecules are not readily accessible to much longer DNA molecules, resulting in the increase in the stretch of longer DNA molecules.

  • Research Article
  • Citations: 10
  • 10.1101/gr.278556.123
Genomic origin, fragmentomics, and transcriptional properties of long cell-free DNA molecules in human plasma.
  • Feb 1, 2024
  • Genome Research
  • Huiwen Che + 17 more

Recent studies have revealed an unexplored population of long cell-free DNA (cfDNA) molecules in human plasma using long-read sequencing technologies. However, the biological properties of long cfDNA molecules (>500 bp) remain largely unknown. To this end, we have investigated the origins of long cfDNA molecules from different genomic elements. Analysis of plasma cfDNA using long-read sequencing reveals an uneven distribution of long molecules from across the genome. Long cfDNA molecules show overrepresentation in euchromatic regions of the genome, in sharp contrast to short DNA molecules. We observe a stronger relationship between the abundance of long molecules and mRNA gene expression levels, compared with short molecules (Pearson's r = 0.71 vs. -0.14). Moreover, long and short molecules show distinct fragmentation patterns surrounding CpG sites. Leveraging the cleavage preferences surrounding CpG sites, the combined cleavage ratios of long and short molecules can differentiate patients with hepatocellular carcinoma (HCC) from non-HCC subjects (AUC = 0.87). We also investigated knockout mice in which selected nuclease genes had been inactivated in comparison with wild-type mice. The proportion of long molecules originating from transcription start sites is lower in Dffb-deficient mice but higher in Dnase1l3-deficient mice compared with that of wild-type mice. This work thus provides new insights into the biological properties and potential clinical applications of long cfDNA molecules.

  • Research Article
  • Citations: 11
  • 10.1021/ma0609533
Entropic Elasticity of DNA with a Permanent Kink
  • Nov 16, 2006
  • Macromolecules
  • Jinyu Li + 2 more

Many proteins interact with and deform double-stranded DNA in cells. Single-molecule experiments have studied the elasticity of DNA with helix-deforming proteins, including proteins that bend DNA. These experiments increase the need for theories of DNA elasticity which include helix-deforming proteins. Previous theoretical work on bent DNA has examined a long DNA molecule with many nonspecifically binding proteins. However, recent experiments used relatively short DNA molecules with a single, well-defined bend site. Here we develop a simple, theoretical description of the effect of a single bend. We then include the description of the bend in the finite wormlike chain model (FWLC) of short DNA molecules attached to beads. We predict how the DNA force-extension relation changes due to formation of a single permanent kink, at all values of the applied stretching force. Our predictions suggest that high-resolution single-molecule experiments could determine the bend angle induced upon protein binding.

  • Research Article
  • Citations: 19
  • 10.1080/07391102.1991.10507916
DNA-Helix Bending, Stiffening and Elongation on Ligand Binding; Analysis for Several DNA-Drug Systems, General Viscometric DNA Response and Stereochemical Implications
  • Oct 1, 1991
  • Journal of Biomolecular Structure and Dynamics
  • Karl-Ernst Reinert

For several DNA-ligand systems the DNA helix bending, stiffening and elongation behaviour is treated quantitatively. The experimental basis is viscosity data from the literature as a function of r, the ratio of drug molecules bound per DNA monomer unit. If the relative viscosity changes Δy_l(r) and Δy_h(r) for DNA of low and high molar mass, respectively, are known, the relative changes of contour length, ΔL/L°, and of persistence length, Δa/a°, can be evaluated as a function of r, as repeatedly demonstrated. For random sequence-independent interactions, helix-bending is reflected by a helix-typical increment of Δa/a°(r), being zero at r = 0 and also at DNA saturation by bound ligand molecules [Reinert, Biophysical Chemistry 13, 1-14 (1981)]. This characteristic DNA behaviour often enables us to separate the bending and the stiffening increment of Δa/a°. The theoretical treatment of this problem (Schütz and Reinert, J. Biomolec. Struct. & Dynam. 9, 315-329, 1991) now permits a more detailed study of the ligand-induced DNA bending. The ligand-DNA systems treated here concern the following drugs (in parentheses DNA bending angle at low r-values): proflavin (8°), daunomycin (11°), aclacinomycin A (9.7°, on cooperative interaction), actinomycin D (16°), mitomycin C (16°), a double intercalating bisphenantridine (12°), 9-deacetyl-daunomycin (8°) and 9-epi-deacetyl-daunomycin (12-18°). We also demonstrate that the consideration of the DNA flexibility and its change on interaction of short DNA molecules with intercalating drugs delivers helix elongation values in better accord with the theoretical value. In the Appendix, a catalogue of simulated Δy(r)-dependences is given for both short and long DNA molecules. It systematically describes the DNA viscosity response upon typical DNA stiffening, elongation, and helix-bending effects.
