The Dyck Language Edit Distance Problem in Near-Linear Time

Barna Saha

doi:10.1109/focs.2014.71

Abstract

Given a string σ over alphabet Σ and a grammar G defined over the same alphabet, how many minimum number of repairs (insertions, deletions and substitutions) are required to map σ into a valid member of G? The seminal work of Aho and Peterson in 1972 initiated the study of this language edit distance problem providing a dynamic programming algorithm for context free languages that runs in O(|G|2n3) time, where n is the string length and |G| is the grammar size. While later improvements reduced the running time to O(|G|n3), the cubic time complexity on the input length held a major bottleneck for applying these algorithms to their multitude of applications. In this paper, we study the language edit distance problem for a fundamental context free language, DYCK(s) representing the language of well-balanced parentheses of s different types, that has been pivotal in the development of formal language theory. We provide the very first it near-linear time} algorithm to tightly approximate the DYCK(s) language edit distance problem for any arbitrary s. DYCK(s) language edit distance significantly generalizes the well-studied string edit distance problem, and appears in most applications of language edit distance ranging from data quality in databases, generating automated error-correcting parsers in compiler optimization to structure prediction problems in biological sequences. Its nondeterministic counterpart is known as the hardest context free language. Our main result is an algorithm for edit distance computation to DYCK(s) for any positive integer s that runs in O(n1+ e polylog(n)) time and achieves an approximation factor of O(1/eβ(n)log|OPT|), for any e > 0. Here OPT is the optimal edit distance to DYCK(s) and β(n) is the best approximation factor known for the simpler problem of string edit distance running in analogous time. If we allow O(n1+e+|OPT|2ne) time, then the approximation factor can be reduced to O(1/e log|OPT|). Since the best known near-linear time algorithm for the string edit distance problem has β(n) = polylog(n), under near-linear time computation model both DYCK(s) language and string edit distance problems have polylog(n) approximation factors. This comes as a surprise since the former is a significant generalization of the latter and their exact computations via dynamic programming show a stark difference in time complexity. Rather less surprisingly, we show that the framework for efficiently approximating edit distance to DYCK(s) can be utilized for many other languages. We illustrate this by considering various memory checking languages (studied extensively under distributed verification) such as stack, queue, PQ and DEQUE which comprise of valid transcripts of stacks, queues, priority queues and double-ended queues respectively. Therefore, any language that can be recognized by these data structures, can also be repaired efficiently by our algorithm.

Full Text