Breadth-first search strategies for trie-based syntactic pattern recognition

B John Oommen,Ghada Badr

doi:10.1007/s10044-006-0032-z

Abstract

Dictionary-based syntactic pattern recognition of strings attempts to recognize a transmitted string X *, by processing its noisy version, Y, without sequentially comparing Y with every element X in the finite, (but possibly, large) dictionary, H. The best estimate X + of X *, is defined as that element of H which minimizes the generalized Levenshtein distance (GLD) D(X, Y) between X and Y, for all X ?H. The non-sequential PR computation of X + involves a compact trie-based representation of H. In this paper, we show how we can optimize this computation by incorporating breadth first search schemes on the underlying graph structure. This heuristic emerges from the trie-based dynamic programming recursive equations, which can be effectively implemented using a new data structure called the linked list of prefixes that can be built separately or "on top of" the trie representation of H. The new scheme does not restrict the number of errors in Y to be merely a small constant, as is done in most of the available methods. The main contribution is that our new approach can be used for generalized GLDs and not merely for 0/1 costs. It is also applicable when all possible correct candidates need to be known, and not just the best match. These constitute the cases when the "cutoffs" cannot be used in the DFS trie-based technique (Shang and Merrettal in IEEE Trans Knowl Data Eng 8(4):540---547, 1996). The new technique is compared with the DFS trie-based technique (Risvik in United Patent 6377945 B1, 23 April 2002; Shang and Merrettal in IEEE Trans Knowl Data Eng 8(4):540---547, 1996) using three large and small benchmark dictionaries with different errors. In each case, we demonstrate marked improvements with regard to the operations needed up to 21%, while at the same time maintaining the same accuracy. Additionally, some further improvements can be obtained by introducing the knowledge of the maximum number or percentage of errors in Y.

Full Text