We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern \(P[1\mathinner{..}m]\) on a large repetitive text collection \(T[1\mathinner{..}n]\) over an alphabet of size \(\sigma\), which is represented as a (hopefully much smaller) run-length context-free grammar of size \(g_{rl}\). We show that the problem can be solved in time \(O(m^{2}\log^{\epsilon} n)\), for any constant \(\epsilon > 0\), on a data structure of size \(O(g_{rl})\). Further, on a locally consistent grammar of size \(O(\delta\log\frac{n\log\sigma}{\delta\log n})\), the time decreases to \(O(m\log m(\log m+\log^{\epsilon} n))\). The value \(\delta\) is a function of the substring complexity of \(T\), and \(\Omega(\delta\log\frac{n\log\sigma}{\delta\log n})\) is a tight lower bound on the compressibility of repetitive texts \(T\), so our structure has optimal size in terms of \(n\), \(\sigma\), and \(\delta\). We extend our results to several related problems, such as finding \(k\)-MEMs, MUMs, rare MEMs, and applications.

Categories and Subject Descriptors: E.1 [Data structures]; E.2 [Data storage representations]; E.4 [Coding and information theory]: Data compaction and compression; F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems: Pattern matching, Computations on discrete structures, Sorting and searching
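To make the problem statement concrete, the following is a minimal brute-force sketch of MEM computation on an uncompressed text. It assumes the standard definition: a MEM is a substring \(P[i\mathinner{..}j]\) that occurs in \(T\) and cannot be extended to the left or to the right and still occur in \(T\). The function name `naive_mems` and its 0-based `(start, length)` output are illustrative choices, not part of the paper; the code works on plain strings and does not reflect the grammar-based structures or the time bounds stated above.

```python
def naive_mems(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return all MEMs of `pattern` in `text` as 0-based (start, length) pairs.

    Brute-force baseline: every extension step is checked with a plain
    substring search on the uncompressed text.
    """
    mems = []
    prev_end = 0  # exclusive end of the longest match starting at the previous position
    for i in range(len(pattern)):
        # pattern[i:prev_end] is a suffix of the previous match, so it already
        # occurs in text; resume extending from there instead of from i.
        end = max(i, prev_end)
        while end < len(pattern) and pattern[i:end + 1] in text:
            end += 1
        # Non-empty and reaching strictly past the previous match means the
        # match cannot be extended to the left; the loop above guarantees it
        # cannot be extended to the right.
        if end > i and end > prev_end:
            mems.append((i, end - i))
        prev_end = max(prev_end, end)
    return mems
```

For instance, with `text = "abcdxyz"` and `pattern = "cdabxy"`, the sketch reports three MEMs of length 2, starting at pattern positions 0, 2, and 4 (`"cd"`, `"ab"`, `"xy"`).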