A common method for reading information stored in DNA molecules is shotgun sequencing. This method outputs a histogram of the frequencies of all the molecules’ substrings of a given length ℓ. To protect against noisy readings, the rank-modulation scheme encodes the information in the relative ranking of the substring frequencies, instead of their absolute values. However, the best rank-modulation codes for shotgun sequencing have low rates which are asymptotically vanishing. In this paper we propose new constructions of rank-modulation codes for shotgun sequencing. The first code construction is systematic, allowing the user to arbitrarily set the frequencies of a large subset of the substrings, which the encoder then completes to a permutation that may be realized by a DNA molecule. The construction is then improved by allowing the user to set the frequencies of additional substrings, at the cost of imposing constraints on the frequencies. The resulting codes have higher, non-vanishing rates, compared with previously known codes. As an example, for histograms of substrings of length ℓ = 2, and an alphabet of size 4 (as in DNA molecules), we are able to construct a code with rate ≈ 0.909, whereas previously, the best construction resulted in a code with rate ≈ 0.654. Additionally, the encoded information in our construction may be written to shorter DNA molecules than possible before. We also prove that the systematic codes constructed in this paper are the largest possible among all systematic codes.
Read full abstract