Streaming Dictionary Matching with Mismatches

Paweł Gawrychowski, Tatiana Starikovskaya

doi:10.1007/s00453-021-00876-x

Abstract

In the k-mismatch problem we are given a pattern of length n and a text and must find all locations where the Hamming distance between the pattern and the text is at most k. A series of recent breakthroughs have resulted in an ultra-efficient streaming algorithm for this problem that requires only \(\mathcal {O}(k \log \frac{n}{k})\) space and \(\mathcal {O}(\log \frac{n}{k} (\sqrt{k \log k} + \log ^3 n))\) time per letter (Clifford, Kociumaka, Porat, SODA 2019). In this work, we consider a strictly harder problem called dictionary matching with k mismatches. In this problem, we are given a dictionary of d patterns, where the length of each pattern is at most n, and must find all substrings of the text that are within Hamming distance k from one of the patterns. We develop a streaming algorithm for this problem with \(\mathcal {O}(k d \log ^k d \mathop {\mathrm {polylog} {\,n}})\) space and \(\mathcal {O}(k \log ^{k} d \mathop {\mathrm {polylog} {\,n}} + |\mathrm {output}|)\) time per position of the text. The algorithm is randomised and outputs correct answers with high probability. On the lower bound side, we show that any streaming algorithm for dictionary matching with k mismatches requires \(\varOmega (k d)\) bits of space.
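To make the problem statement concrete, the following is a minimal brute-force sketch of dictionary matching with k mismatches. It is only a reference for what has to be reported, not the streaming algorithm of the paper: it keeps the whole text in memory and compares every window against every pattern. The function names `hamming` and `dictionary_k_mismatch`, and the choice to report the end position of each match, are assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def dictionary_k_mismatch(text: str, patterns: list[str], k: int) -> list[tuple[int, int]]:
    """Return sorted (end_position, pattern_index) pairs such that the
    substring of `text` ending at end_position (0-based, inclusive) is
    within Hamming distance k of patterns[pattern_index].

    Brute force: not streaming, and not the algorithm of the paper.
    """
    hits = []
    for idx, p in enumerate(patterns):
        m = len(p)
        for end in range(m, len(text) + 1):
            if hamming(text[end - m:end], p) <= k:
                hits.append((end - 1, idx))
    return sorted(hits)


matches = dictionary_k_mismatch("abcabd", ["abc", "bd"], 1)
```

With k = 1, the text "abcabd" and the dictionary {"abc", "bd"}, both patterns match at end positions 2 and 5; with k = 0 only the exact occurrences remain.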

Highlights

The pattern matching problem is the fundamental problem of string processing and has been studied for more than 40 years
The streaming model of computation was designed to overcome the restrictions of the word-RAM model
We show a streaming algorithm for dictionary matching with k mismatches based on a new randomised implementation of the k-errata tree, a data structure introduced by Cole, Gottlieb, and Lewenstein [12]

Summary

Introduction

The pattern matching problem is the fundamental problem of string processing and has been studied for more than 40 years. For a pattern of length m, an earlier streaming algorithm for exact pattern matching uses \(\mathcal {O}(\log m)\) space and \(\mathcal {O}(\log m)\) time per character. The classical dictionary matching algorithm assumes the word-RAM model of computation and, for a dictionary of d patterns of length at most m, uses \(\varOmega (md)\) space and \(\mathcal {O}(1 + occ)\) amortised time per character, where occ is the number of occurrences reported. In ESA 2015, Clifford et al. [9] showed a streaming dictionary matching algorithm that uses \(\mathcal {O}(d \log m)\) space and \(\mathcal {O}(\log \log (m + d) + occ)\) time per character. By reduction to streaming exact pattern matching, Porat and Porat [26] showed the first streaming k-mismatch algorithm with space \(\mathcal {O}(k^3 \log m / \log \log m)\) and time \(\mathcal {O}(k^2 \log m / \log \log m)\).
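The per-character interface used in these bounds can be sketched as follows: the text arrives one character at a time, and matches ending at the current position are reported before the next character is read. This sketch is a naive baseline under stated assumptions, not any of the cited algorithms: it buffers the last m characters and stores the patterns verbatim, so it uses space linear in the total pattern length rather than the sublinear space discussed above. The name `stream_matches` is hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator


def stream_matches(stream: Iterable[str], patterns: list[str], k: int) -> Iterator[tuple[int, int]]:
    """Yield (position, pattern_index) as soon as the character at
    `position` arrives and some pattern matches the text ending there
    with at most k mismatches.

    Naive baseline: keeps a window of the last m characters.
    """
    m = max(len(p) for p in patterns)
    window = deque(maxlen=m)  # older characters are discarded automatically
    for pos, ch in enumerate(stream):
        window.append(ch)
        recent = "".join(window)
        for idx, p in enumerate(patterns):
            if len(p) <= len(recent):
                suffix = recent[len(recent) - len(p):]
                if sum(x != y for x, y in zip(suffix, p)) <= k:
                    yield pos, idx


reported = list(stream_matches(iter("abcabd"), ["abc", "bd"], 1))
```

Because the function is a generator over an arbitrary iterable, it never looks ahead in the text, which is the property the streaming model requires; the space and time costs are where it differs from the algorithms summarised above.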

Our results

Preliminaries

Algorithm based on the randomised k-errata tree

Improving space

Reminder

Streaming algorithm for patterns with large periods

Streaming algorithm for patterns with small periods

Algorithm for Case 1

Proof of Theorem 1 – de-amortisation

De-amortised algorithm with a delay

Removing the delay

Proof of Lemma 3 – space lower bound

Reminder: the k-errata tree

Randomised implementation of the k-errata tree

Journal: Algorithmica
Publication Date: Oct 17, 2021
License type: CC BY
