SiEDM: An Efficient String Index and Search Algorithm for Edit Distance with Moves

Yoshimasa Takabatake, Hiroshi Sakamoto, Yasuo Tabei, Kenta Nakashima, Tetsuji Kuboyama

doi:10.3390/a9020026

Open Access

https://doi.org/10.3390/a9020026

Abstract

Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM. siEDM efficiently prunes the space for searching query strings by the proposed method, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM’s efficiency.

Highlights

Vast amounts of text data are created, replicated, and modified with the increasing use of the internet and advances in data-centric technology
F(X_i) can be computed as the rank_1(FB, i)-th characteristic vector if the i-th bit of FB is 1; otherwise, it is F(LeftChild(X_i)) + F(RightChild(X_i)) + (X_i, 1). Another data structure that siEDM uses is a non-negative integer vector called the length vector, each dimension of which is the length of the substring derived from the corresponding variable
We evaluated the performance of siEDM on one core of a quad-core Intel Xeon Processor
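
The recursion for F(X_i) and the length vector in the highlight above can be illustrated with a small sketch. It is not the authors' implementation, and it rests on several assumptions: FB is read as a bit vector marking the variables whose characteristic vector is stored explicitly, (X_i, 1) is read as "add one to the X_i dimension", terminal characters are left out of the vector for brevity, and a linear scan stands in for a succinct rank structure. The grammar RULES and the names char_vector and length_vector are hypothetical.

```python
from collections import Counter

# Hypothetical toy grammar: variable id -> (left child, right child).
# Integers are variables, one-character strings are terminals.
RULES = {
    1: ("a", "b"),  # X1 -> a b
    2: (1, 1),      # X2 -> X1 X1
    3: (2, 1),      # X3 -> X2 X1
}

# Assumption: FB[i] == 1 means F(X_i) is stored explicitly. Index 0 is padding.
FB = [0, 0, 1, 0]
# Explicitly stored vectors, in the order of the set bits of FB. Here: F(X2).
STORED = [Counter({1: 2, 2: 1})]


def rank1(bits, i):
    """Number of 1 bits in bits[1..i] (linear scan, for illustration only)."""
    return sum(bits[1:i + 1])


def char_vector(i):
    """Characteristic vector F(X_i): occurrences of each variable under X_i."""
    if FB[i] == 1:
        # The rank1(FB, i)-th stored vector, counted from 1.
        return STORED[rank1(FB, i) - 1]
    total = Counter({i: 1})  # the (X_i, 1) term
    for child in RULES[i]:
        if isinstance(child, int):  # terminals are ignored in this sketch
            total += char_vector(child)
    return total


def length_vector():
    """L[i] = length of the substring derived from X_i."""
    lengths = [0] * (len(RULES) + 1)
    for i in sorted(RULES):  # assumes children have smaller ids than parents
        lengths[i] = sum(lengths[c] if isinstance(c, int) else 1
                         for c in RULES[i])
    return lengths


print(char_vector(3))
print(length_vector())
```

Storing only some vectors and recomputing the rest from the children trades space for time; which variables get a stored vector is a design choice that this sketch leaves open.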

Summary

Introduction

Vast amounts of text data are created, replicated, and modified with the increasing use of the internet and advances in data-centric technology. Building indexes is the de facto standard method for searching large databases of highly repetitive texts, and several methods have been presented for indexing and searching large-scale and highly repetitive text collections. Although these methods enable fast query searches, their applicability is limited to exact match searches. To improve on the quadratic time upper bound for computing the edit distance, Cormode and Muthukrishnan introduced a technique called edit sensitive parsing (ESP) [8]. Despite several attempts to efficiently compute EDM and various extensions of ESP, there is no method for indexing and searching texts with EDM. We propose a novel method called siEDM that efficiently indexes massive text and performs query searches for EDM.
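
For context on the quadratic time bound mentioned above, the following is a minimal sketch of the textbook dynamic program for the ordinary edit distance (insertions, deletions, and substitutions only). It is not part of siEDM or ESP and it does not handle substring moves. The function name edit_distance is chosen here for illustration.

```python
def edit_distance(s, t):
    """Classic O(len(s) * len(t)) dynamic program, keeping one row at a time."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```

Every cell of the len(s) by len(t) table is visited once, which is where the quadratic cost comes from. Because a substring move is not one of the three operations, relocating a block of text is charged as a run of separate edits, which is the gap that EDM addresses.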

Basic Notations

Problem

ESP Revisit

Approximate Computations of EDM from ESP-Trees

Index Structure for ESP-Trees

Query Processing on Tree

Other Data Structures

Baseline Algorithm

Improvement

Candidate Finding

Computing Positions

Experiments

Conclusions

Journal: Algorithms
Publication Date: Apr 15, 2016
Citations: 35
License type: CC BY 4.0
