On the string matching with k differences in DNA databases

Yangjun Chen,Hoang Hai Nguyen

doi:10.14778/3447689.3447695

Yangjun Chen, Hoang Hai Nguyen

https://doi.org/10.14778/3447689.3447695

Copy DOI

Export

Save

Cite

Abstract
Full-Text
Similar Papers

Abstract

Listen

In this paper, we discuss an efficient and effective index mechanism for the string matching with k differences, by which we will find all the substrings of a target string y of length n that align with a pattern string x of length m with not more than k insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example, n is often on the order of millions or billions while m is just a hundred or a thousand. The main idea of our method is to transform y to a BWT-array as an index, denoted as BWT ( y ), and search x against it. The time complexity of our method is bounded by O( k · | T |), where T is a tree structure dynamically generated during a search of BWT ( y ). The average value of | T | is bounded by O(|Σ| 2 k ), where Σ is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies when k ≤ O(log |Σ| n ). The general working process consists of two steps. In the first step, x is decomposed into a series of l small subpatterns, and BWT ( y ) is utilized to speedup the process to figure out all the occurrences of such subpatterns with ⌊ k/l ⌋ differences. In the second step, all the found occurrences in the first step will be rechecked to see whether they really match x , but with k differences. Extensive experiments have been conducted, which show that our method for this problem is promising.

Full Text