Approximate String Matching with Compressed Indexes

Luís M Russo, Gonzalo Navarro, Pedro Morales, Arlindo Oliveira

doi:10.3390/a2031105

Abstract

A compressed full-text self-index for a text T is a data structure that requires reduced space and can search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest in compressed indexes, there has not been much progress on functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest in areas such as bioinformatics. We study ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays. Most compressed self-indexes belong to one of these classes. We start by adapting the classical method of partitioning into exact search to self-indexes, and optimize it over a representative of either class of self-index. Then, we show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to a Lempel-Ziv index. Finally, we improve hierarchical verification, a successful technique for sequential searching, so as to extend the matches of pattern pieces to the left or right. Most compressed suffix trees/arrays support the required bidirectionality, thus enabling the implementation of the improved technique. In turn, the improved verification largely reduces the accesses to the text, which are expensive in self-indexes. We show experimentally that our algorithms are competitive and provide useful space-time tradeoffs compared to classical indexes.

Highlights

Approximate string matching (ASM) is an important problem that arises in applications related to text searching, pattern recognition, signal processing, and computational biology, to name a few.
In this paper we presented two algorithms for ASM in compressed space: an adaptation of the hybrid index to Lempel-Ziv compressed indexes, and a hierarchical verification over fully compressed suffix trees (FCSTs).
We started by addressing the problem of approximate matching with q-samples indexes, where we described a new approach to this problem.

Summary

Introduction and Related Work

Approximate string matching (ASM) is an important problem that arises in applications related to text searching, pattern recognition, signal processing, and computational biology, to name a few. Indexes based on q-grams or q-samples are appealing because they require less space than suffix trees or arrays. The algorithms on those indexes do not offer worst-case guarantees, but perform well on average when the error level α = k/m is low enough, say O(1/log_σ u). One can use any compressed self-index to implement a filtration ASM method that relies on looking for exact occurrences of pattern substrings, as this is what all self-indexes provide. We explore the impact of hierarchical verification on hybrid searching, using a compressed suffix tree instead of a Lempel-Ziv index. Compressed suffix trees and arrays are usually self-indexes, meaning that they do not store the text T but are able to reproduce it. This completes the strong symbiotic exchange between hierarchical verification and compressed self-indexing, and provides a very important result for ASM over compressed indexes, both in theory and in practice.
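The filtration idea mentioned above (looking for exact occurrences of pattern pieces and verifying around them) can be illustrated with a minimal sketch. It rests on the classical observation that if P occurs with at most k errors, then at least one of k + 1 disjoint pieces of P must occur exactly. The sketch is not the paper's implementation: a plain Python string stands in for the self-index, `str.find` stands in for the index's exact search, and all function names (`split_pattern`, `exact_occurrences`, `match_ends`, `approximate_search`) are hypothetical. Verification uses a standard dynamic-programming scan rather than the hierarchical verification studied in the paper.

```python
def split_pattern(pattern, k):
    """Split pattern into k + 1 contiguous, non-empty pieces of near-equal length.

    Returns (offset, piece) pairs, where offset is the piece's start in pattern.
    """
    m = len(pattern)
    pieces = k + 1
    bounds = [i * m // pieces for i in range(pieces + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(pieces)]


def exact_occurrences(text, piece):
    """Assumption: stand-in for the exact search a self-index would provide."""
    pos = text.find(piece)
    while pos != -1:
        yield pos
        pos = text.find(piece, pos + 1)


def match_ends(pattern, window, k):
    """Return end positions (exclusive) in window where pattern matches
    some substring with edit distance at most k (column-wise DP)."""
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for j, c in enumerate(window):
        diag = col[0]
        col[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != c),  # match or substitution
                         old + 1,                       # extra text character
                         col[i - 1] + 1)                # missing pattern character
            diag = old
        if col[m] <= k:
            ends.append(j + 1)
    return ends


def approximate_search(text, pattern, k):
    """Report end positions (exclusive) of occurrences of pattern in text
    with at most k errors, using partitioning into exact search."""
    m = len(pattern)
    if k < 0 or k >= m:
        raise ValueError("need 0 <= k < len(pattern) so every piece is non-empty")
    ends = set()
    for offset, piece in split_pattern(pattern, k):
        for pos in exact_occurrences(text, piece):
            # Any occurrence containing this piece lies inside this window.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            for e in match_ends(pattern, text[lo:hi], k):
                ends.add(lo + e)
    return sorted(ends)


print(approximate_search("surgery", "survey", 2))  # [5, 6, 7]
```

In this sketch each candidate window has length at most m + 2k, so the verification cost depends on how many exact piece occurrences the filter yields. On a self-index, extracting each window from the compressed text is the expensive step, which is the cost the abstract says improved verification aims to reduce.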

A Simple Self-Indexing Method

A Hybrid q-samples Index

A Hybrid Lempel-Ziv Index

Findings

Conclusions and Future Work