Abstract

Real-world data often comes in compressed form. Analyzing compressed data directly, without first decompressing it, can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compression schemes such as the Lempel–Ziv family, dictionary methods, and others. For two strings of total length N and total compressed size n, it is known that the edit distance and a longest common subsequence (LCS) can be computed exactly in Õ(nN) time, as opposed to O(N^2) in the uncompressed setting. Many real-world applications need to align multiple sequences simultaneously; the fastest known exact algorithms for the median edit distance and the LCS of k strings run in O(N^k) time, whereas the one for the center edit distance runs in O(N^{2k}) time. This naturally raises the question of whether compression can significantly reduce the running time for k ≥ 3, perhaps to O(N^{k/2} n^{k/2}) or, more optimistically, to O(N n^{k−1}). Unfortunately, we show new lower bounds that rule out any improvement beyond Ω(N^{k−1} n) time for any of these problems assuming the Strong Exponential Time Hypothesis (SETH), where again N and n denote the total length and the total compressed size, respectively. This answers an open question of Abboud, Backurs, Bringmann, and Künnemann (FOCS'17). In the presence of such negative results, we ask whether allowing approximation can help, and we show that approximation and compression together can be surprisingly effective both for multiple strings and for two strings. We develop an Õ(N^{k/2} n^{k/2})-time FPTAS for the median edit distance of k sequences, saving nearly half the dimensions for highly compressible sequences. In comparison, no O(N^{k−Ω(1)})-time PTAS is known for the median edit distance problem in the uncompressed setting.
We obtain an analogous improvement for the center edit distance problem. For two strings, we get an FPTAS for both edit distance and LCS whose running time is o(N) whenever n ≪ N^{1/4}. In contrast, for uncompressed strings, no subquadratic algorithm for LCS is known with less than a polynomial gap in the approximation factor. Building on the insights from our approximation algorithms, we also obtain several new and improved results for many fundamental distance measures, including the edit, Hamming, and shift distances.
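As background, the uncompressed O(N^2)-time baseline contrasted with the compressed-setting algorithms above is the classic Wagner–Fischer dynamic program for edit distance. The sketch below (the function name and space-saving layout are ours, not from the paper) illustrates why the quadratic cost arises: one table cell per pair of prefixes.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (classic O(|a|*|b|)-time DP)."""
    m, n = len(a), len(b)
    # dp[j] holds the distance between the current prefix a[:i]
    # and b[:j]; we keep a single rolling row for O(n) space.
    dp = list(range(n + 1))  # row i = 0: distance from "" to b[:j] is j
    for i in range(1, m + 1):
        prev_diag = dp[0]          # dp[i-1][j-1] before overwriting
        dp[0] = i                  # distance from a[:i] to ""
        for j in range(1, n + 1):
            prev_above = dp[j]     # dp[i-1][j]
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[j] = min(prev_above + 1,    # delete a[i-1]
                        dp[j - 1] + 1,     # insert b[j-1]
                        prev_diag + cost)  # match or substitute
            prev_diag = prev_above
    return dp[n]
```

Every pair of prefixes is visited once, which is exactly the O(N^2) barrier that the grammar-compressed Õ(nN)-time algorithms circumvent when n ≪ N.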
