A hash trie filter method for approximate string matching in genomic databases

Ye-In Chang,Min-Tze Hsu,Jiun-Rung Chen

doi:10.1007/s10489-010-0233-4

Abstract

In genomic databases, approximate string matching with k errors is often applied when searching genomic sequences, where k errors can be caused by substitution, insertion, or deletion operations. In this paper, we propose a new method, the hash trie filter, to efficiently support approximate string matching in genomic databases. First, we build a hash trie for indexing the genomic sequence stored in a database in advance. Then, we utilize an efficient technique to find the ordered subpatterns in the sequence, which could reduce the number of candidates by pruning some unreasonable matching positions. Moreover, our method will dynamically decide the number of ordered matching grams, resulting in the increase of precision. The simulation results show that the hash trie filter outperforms the well-known (k+s) q-samples filter in terms of the response time, the number of verified candidates, and the precision, under different lengths of the query patterns and different error levels.

Full Text