A sublinear algorithm for approximate keyword searching

E W Myers

doi:10.1007/bf01185432

Abstract

Given a relatively short query string W of length P, a long subject string A of length N, and a threshold D, the approximate keyword search problem is to find all substrings of A that align with W with not more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the “database” A is much larger than that of the query W, e.g., N is on the order of millions or billions and P is a hundred to a thousand. In this paper we present an algorithm that, given a precomputed index of the database A, finds rare matches in time that is sublinear in N, i.e., N^c for some c < 1. The sequence A must be over a finite alphabet Σ. More precisely, our algorithm requires O(D N^{pow(ε)} log N) expected time, where ε = D/P is the maximum number of differences as a percentage of query length, and pow(ε) is an increasing and concave function that is 0 when ε = 0. Thus the algorithm is superior to current O(DN) algorithms when ε is small enough to guarantee that pow(ε) < 1. As seen in the paper, this is true for a wide range of ε, e.g., ε up to 33% for DNA sequences (|Σ| = 4) and 56% for protein sequences (|Σ| = 20). In preliminary practical experiments, the approach gives a 50- to 500-fold improvement over previous algorithms for problems of interest in molecular biology.
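To make the problem statement concrete, the following is a minimal sketch of the classical column-by-column dynamic program for approximate keyword search. It is not the sublinear, index-based algorithm of the paper; it is a plain O(PN) baseline that only illustrates what "aligning with W with not more than D differences" means. The function name `approx_search` and its parameters are hypothetical and chosen for illustration.

```python
def approx_search(query, subject, max_diff):
    """Report (end_position, differences) for every 1-based end position in
    `subject` where some substring ending there aligns with `query` using at
    most `max_diff` insertions, deletions, and mismatches.

    Baseline dynamic program, O(P*N) time and O(P) space. This is NOT the
    sublinear algorithm described in the abstract.
    """
    p = len(query)
    # col[i] = fewest differences between query[:i] and any substring of
    # `subject` ending at the current position. Before any subject character
    # is read, matching query[:i] costs i deletions.
    col = list(range(p + 1))
    hits = []
    for j, ch in enumerate(subject, start=1):
        prev_diag = col[0]  # value of row i-1 in the previous column
        col[0] = 0          # a match may start at any position for free
        for i in range(1, p + 1):
            cost = 0 if query[i - 1] == ch else 1
            best = min(
                prev_diag + cost,  # match or mismatch
                col[i] + 1,        # extra character in the subject
                col[i - 1] + 1,    # query character left unmatched
            )
            prev_diag = col[i]
            col[i] = best
        if col[p] <= max_diff:
            hits.append((j, col[p]))
    return hits


if __name__ == "__main__":
    # Exact occurrence of ACGT ends at position 6.
    print(approx_search("ACGT", "TTACGTTT", 0))
    # One mismatch (C vs G) is tolerated when max_diff = 1.
    print(approx_search("ACGT", "AGGT", 1))
```

With P = 4 and D = 1 the difference ratio is ε = D/P = 25%, which falls inside the range the abstract reports as favorable for DNA sequences; the sketch above nevertheless scans every position of the subject, which is exactly the cost the indexed approach aims to avoid.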

Journal: Algorithmica
Publication Date: Nov 1, 1994
Citations: 160
