In nanopore sequencers, single-stranded DNA molecules (or k-mers) enter a small opening in a membrane called a nanopore and modulate the ionic current through the pore, producing a channel output in the form of a noisy, piecewise-constant signal. An important problem in DNA-based data storage is finding a set of k-mers, i.e., a DNA code, that is robust against the noisy sample duplication introduced by nanopore sequencers. A good DNA code should contain as many k-mers as possible whose current signals (squiggles), as measured by the sequencer, are distinguishable from one another. The dissimilarity between squiggles can be estimated using a bound on their pairwise error probability, which is used as a metric for code design. Unfortunately, code construction using the union bound is limited to small values of k because of the difficulty of finding maximum cliques in large graphs. In this paper, we construct large codes by concatenating codewords from a base code, thereby packing more information into a single strand while retaining the storage efficiency of the base code. To facilitate decoding, we include a circumfix in the base code to reduce the effect of the nanopore channel memory. We show that the decoding complexity scales as [Formula: see text], where m is the number of concatenated k-mers. Simulations show that the base code error rate is stable as m increases.
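As a rough illustration of the two construction steps described above, the following Python sketch builds a small base code by clique search and then concatenates its codewords. It is a toy under stated assumptions, not the construction used in the paper: Hamming distance stands in for the squiggle dissimilarity metric, the clique search is brute force, and the function names and the flanking bases used for the circumfix are hypothetical.

```python
from itertools import combinations, product
from math import log2


def hamming(a, b):
    """Toy stand-in for squiggle dissimilarity (assumption, not the paper's metric)."""
    return sum(x != y for x, y in zip(a, b))


def max_clique_code(kmers, min_dist):
    """Brute-force search for a largest set of k-mers whose pairwise
    distance is at least min_dist, i.e. a maximum clique in the graph
    that links sufficiently dissimilar k-mers. Only feasible for tiny k."""
    for size in range(len(kmers), 0, -1):
        for subset in combinations(kmers, size):
            if all(hamming(a, b) >= min_dist for a, b in combinations(subset, 2)):
                return list(subset)
    return []


def add_circumfix(word, prefix, suffix):
    """Wrap a codeword in fixed flanking bases (hypothetical choice of bases)."""
    return prefix + word + suffix


def concatenate(base_code, m):
    """All strands formed by joining m base codewords end to end."""
    return ["".join(words) for words in product(base_code, repeat=m)]


def rate(code):
    """Information bits per nucleotide for a fixed-length code."""
    return log2(len(code)) / len(code[0])


# Toy run: all 2-mers, minimum distance 2, circumfix "A"..."T", m = 3.
kmers = ["".join(p) for p in product("ACGT", repeat=2)]
core = max_clique_code(kmers, 2)
base = [add_circumfix(w, "A", "T") for w in core]
big = concatenate(base, 3)
print(len(base), len(big), rate(base), rate(big))
```

In this toy setting the concatenated code has |C|^m codewords of length m times the base codeword length, so its rate in bits per nucleotide equals that of the base code, which is the sense in which concatenation retains the storage efficiency of the base code.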