Abstract
We consider a sliding window W over a stream of characters from some alphabet of constant size. We want to look up a pattern in the current sliding window content and obtain all positions of the matches. We present an indexed version of the sliding window, based on a suffix tree. The data structure of size Θ(|W|) has optimal time queries Θ(m+occ) and amortized constant time updates, where m is the length of the query string and occ is its number of occurrences.
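To make the query and update interface concrete, the following is a minimal sketch of a naive baseline, not the suffix-tree index described above. It assumes a fixed window capacity and reports matches as positions in the stream; the class name `NaiveSlidingWindow` and its methods `append` and `find_all` are hypothetical names chosen for illustration. Updates take constant time here, but each query rescans the whole window, which is the cost an indexed window is meant to avoid.

```python
from collections import deque


class NaiveSlidingWindow:
    """Brute-force baseline (illustrative only): a fixed-capacity window
    over a character stream, searched by rescanning on every query."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.buf = deque(maxlen=capacity)  # oldest character is dropped automatically
        self.start = 0  # stream position of the oldest character still in the window

    def append(self, ch):
        """Slide the window by one character: O(1) update."""
        if len(self.buf) == self.capacity:
            self.start += 1  # the oldest character is about to be evicted
        self.buf.append(ch)

    def find_all(self, pattern):
        """Return stream positions of all (possibly overlapping) matches.
        This rescans the window, so it is NOT the Θ(m+occ) query of the paper."""
        if not pattern:
            return []
        text = "".join(self.buf)
        hits = []
        i = text.find(pattern)
        while i != -1:
            hits.append(self.start + i)
            i = text.find(pattern, i + 1)
        return hits


window = NaiveSlidingWindow(capacity=5)
for ch in "abracadabra":
    window.append(ch)
print(window.find_all("a"))
```

After feeding `abracadabra` through a window of capacity 5, the window holds the last five characters, and matches are reported relative to the start of the stream rather than the start of the window.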
Highlights
Introduction and Related Work: Text indexing and pattern matching, and big data in general, are well-studied fields of computer science and engineering.
One way of implementing string matching, including regular expressions, is by using finite automata [4,5].
String matching is used in digital forensics, where we typically match multiple regular expressions on massive amounts of data, which involves multiple streams and parallelism.
Summary
Pattern matching, and big data in general, is a well-studied field of computer science and engineering. A practical suffix-array-based sliding window was proposed by Ferreira et al. [22,23], with speed improvements by Salson et al. [24]. Their approach supports efficient substring query operations, but updating the suffix array requires at least linear time due to the nature of the array, i.e., insertion and/or removal of an element requires the other elements to shift by one slot. It turns out that this operation is not trivial, due to details hidden in Ukkonen’s suffix tree construction algorithm. This is the first data structure for on-the-fly text indexing which requires amortized O(1) time for updates and worst-case optimal time for queries.