Knuth-Morris-Pratt Research Articles

In the present work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum redundancy codes, Proc. of the IRE, 40, 1098-1101]. A modified Knuth-Morris-Pratt (KMP) algorithm is used to overcome the problem of false matches, i.e., occurrences of the encoded pattern in the encoded text that do not correspond to an occurrence of the pattern itself in the original text. We propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary. To avoid processing any bit of the encoded text more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected; it is defined so that we are always able to align the start of the encoded pattern with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes that handle more than a single bit per machine operation: skeleton trees, defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of Huffman encoded texts. Information Retrieval, 3, 7-23], and numerical comparisons between special canonical values and portions of a sliding window, presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200-1207]. Experiments show that our algorithms run quickly compared with the decompress-then-search method, so files can be kept in their compressed form, saving memory space. When compression gain is important, these algorithms are preferable to cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C Library to over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.
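
The false-match problem described above can be illustrated with a minimal sketch. It is not the authors' bitwise algorithm: it runs classic KMP over a bit string and then filters hits against codeword boundaries. The prefix code `CODE`, the helper names, and the boundary check (which here is computed from the original text, something a real compressed search would not have available) are all assumptions made for illustration.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the classic KMP table)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # back up without re-reading text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


# Hypothetical prefix code, chosen only for this example.
CODE = {"a": "0", "b": "10", "c": "11"}


def encode(s):
    """Concatenate the codewords of the characters of s."""
    return "".join(CODE[ch] for ch in s)


def codeword_starts(s):
    """Bit offsets at which each codeword of encode(s) begins."""
    starts, pos = set(), 0
    for ch in s:
        starts.add(pos)
        pos += len(CODE[ch])
    return starts


def aligned_matches(text, pattern):
    """Keep only the hits that begin on a codeword boundary."""
    starts = codeword_starts(text)
    return [h for h in kmp_search(encode(text), encode(pattern)) if h in starts]


raw_hits = kmp_search(encode("cab"), encode("b"))
true_hits = aligned_matches("cab", "b")
print(raw_hits, true_hits)
```

Here `"cab"` encodes to `11010` and `"b"` to `10`. The raw search reports bit offsets 1 and 3, but offset 1 straddles the codewords for `c` and `a`, so only offset 3 is a real occurrence.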

The most important features of a string matching algorithm are its efficiency and its flexibility. Efficiency has traditionally received more attention, while flexibility in the search pattern is becoming an increasingly important issue. Most classical string matching algorithms are aimed at quickly finding an exact pattern in a text, the most famous being Knuth-Morris-Pratt (KMP) and the Boyer-Moore (BM) family. A recent development uses deterministic "suffix automata" to design new optimal string matching algorithms, e.g. BDM and TurboBDM. Flexibility has been addressed quite separately by the use of "bit-parallelism", which simulates automata in their nondeterministic form by using bits and exploiting the intrinsic parallelism inside the computer word, e.g. the Shift-Or algorithm. Those algorithms are extended to handle classes of characters and errors in the pattern and/or in the text; their drawback is their inability to skip text characters. In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive for searching with errors allowed. In particular, BNDM seems well suited to computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that area.
As a theoretical development related to flexible pattern matching, we introduce a new automaton to recognize suffixes of patterns with classes of characters. To the best of our knowledge, this automaton has not been studied before.
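
As an illustration of the bit-parallelism mentioned above, here is a minimal sketch of Shift-Or (not BNDM itself, which additionally scans windows backwards to skip characters). The function name and the convention that each pattern position is a string of allowed characters are assumptions made for this example. Python integers are unbounded, so the machine-word limit on pattern length is not modelled.

```python
def shift_or_search(text, pattern):
    """Shift-Or search. `pattern` is a sequence of character classes:
    a plain string (one character per position) or a list of strings,
    where each string lists the characters allowed at that position.
    Returns the start index of every match."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # masks[c] has bit i cleared iff character c is allowed at position i.
    masks = {}
    for i, cls in enumerate(pattern):
        for ch in cls:
            masks[ch] = masks.get(ch, all_ones) & ~(1 << i)

    # Bit i of state is 0 iff pattern[0..i] matches the text ending here.
    state = all_ones
    hits = []
    for j, ch in enumerate(text):
        # Shifting in a 0 at bit 0 starts a fresh match attempt each step.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(j - m + 1)
    return hits


exact = shift_or_search("abracadabra", "abra")
flexible = shift_or_search("GATTACA", ["A", "CT"])  # 'A' then 'C' or 'T'
print(exact, flexible)
```

Supporting a class of characters costs nothing extra at search time: it only clears more bits in the precomputed masks, which is why this family of algorithms adapts so easily to flexible patterns.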

Articles published on Knuth-Morris-Pratt

A Fast Exact String Matching Algorithm Based on Nested Classification

Functional verification of signature detection architectures for high speed network applications

Parallelization of KMP String Matching Algorithm on Different SIMD Architectures: Multi-Core and GPGPU's

Fast searching in packed strings

An Effective Method for a File Search Engine in a Mobile Environment

A simple fast hybrid pattern-matching algorithm

A computationally efficient engine for flexible intrusion detection

Adapting the Knuth–Morris–Pratt algorithm for pattern matching in Huffman encoded texts

On maximal suffixes and constant-space linear-time versions of KMP algorithm

On Obtaining Knuth, Morris, and Pratt's String Matcher by Partial Evaluation

Window-accumulated subsequence matching problem is linear

Fast and flexible string matching by combining bit-parallelism and suffix automata

Simple Optimal String Matching Algorithm

A positive supercompiler

A Left-to-Right Preprocessing Computation for the Boyer-Moore String Matching Algorithm

The Möbius function of factor order

Formal derivation of a pattern matching algorithm

An analytical comparison of two string searching algorithms

An alternative for the implementation of the Knuth-Morris-Pratt algorithm

Saving Space in Fast String-Matching
