Tries in data retrieval and syntactic pattern recognition

Ghada Badr

doi:10.22215/etd/2006-06665

Abstract

String searching plays an important role in many problems, including text processing, information retrieval, speech and signal processing, pattern recognition, database operations, library systems, compilers, command interpreters, and bioinformatics. This Thesis deals with problems related to exact and inexact string matching, in particular when these problems involve tries. The main aim of this research is to enhance the search performance for strings stored in the trie data structure, and to develop methods that work well in practice, especially for dictionary-based techniques. The search is enhanced in both domains, namely the exact and the approximate search for strings. The Thesis presents contributions in two main fields, namely Information Retrieval and Syntactic Pattern Recognition. The following summarizes the problems addressed in each of the two fields.

Information retrieval: exact search. In this part of the Thesis, we consider the problem of performing a sequence of access operations on a set of strings S = {s1, s2, ..., sN}. We assume that the strings are accessed according to a set of access probabilities P = {p1, p2, ..., pN}. We also assume that P is not known a priori, and that it is time-invariant. The problems studied involve searching for exact patterns. This is achieved by applying self-adjusting techniques to the trie data structure when the nodes of the trie are implemented as binary search trees, and by incorporating the concept of direction through a new representation for the trie, namely the Dual-Trie (DT).

Syntactic pattern recognition: approximate string matching. In this part of the Thesis, we consider the traditional problem involved in the syntactic Pattern Recognition (PR) of strings, namely that of recognizing garbled words (sequences). Let Y be a misspelled (noisy) string obtained from an unknown word X*, which is an element of a finite (but possibly large) dictionary H stored as a trie, T. Y is assumed to contain Substitution, Insertion and Deletion (SID) errors, and we attempt to obtain an appropriate estimate X+ of X* by processing the information contained in Y. We propose to use various Artificial Intelligence (AI) search techniques within a trie, and to optimize the dynamic programming calculations for the edit distances.
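To make the approximate-matching setting concrete, the following is a minimal sketch of the standard textbook idea of sharing edit-distance rows along the paths of a trie. It is not the method developed in the Thesis. It assumes unit costs for substitution, insertion and deletion, a dictionary of non-empty words, and a simple depth-first traversal with a distance bound. All names (`TrieNode`, `build_trie`, `approximate_search`, `max_dist`) are hypothetical and chosen only for illustration.

```python
class TrieNode:
    """One trie node; `word` is set only where a dictionary word ends."""

    def __init__(self):
        self.children = {}
        self.word = None


def build_trie(words):
    """Store the dictionary H as a trie T (words assumed non-empty)."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def approximate_search(root, y, max_dist):
    """Return sorted (distance, word) pairs with edit distance <= max_dist.

    Each trie node receives one dynamic programming row, computed from its
    parent's row, so words sharing a prefix share the work for that prefix.
    Unit SID costs are an assumption of this sketch.
    """
    results = []
    # Row for the empty prefix: distance to y[:j] is j symbols extra in y.
    first_row = list(range(len(y) + 1))

    def visit(node, ch, prev_row):
        # row[j] = edit distance between this node's prefix and y[:j].
        row = [prev_row[0] + 1]
        for j in range(1, len(y) + 1):
            extra_in_y = row[j - 1] + 1        # y[j-1] has no counterpart
            missing_in_y = prev_row[j] + 1     # ch has no counterpart in y
            substitution = prev_row[j - 1] + (y[j - 1] != ch)
            row.append(min(extra_in_y, missing_in_y, substitution))
        if node.word is not None and row[-1] <= max_dist:
            results.append((row[-1], node.word))
        # Prune: if every entry exceeds the bound, no extension can recover.
        if min(row) <= max_dist:
            for c, child in node.children.items():
                visit(child, c, row)

    for c, child in root.children.items():
        visit(child, c, first_row)
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["trie", "tree", "tries", "try", "data"])
    print(approximate_search(trie, "tre", 1))
```

With the noisy input `"tre"` and a bound of 1, the search reports the dictionary words within one edit and abandons the `"data"` branch after two characters, which is the kind of saving a trie offers over comparing Y against every word separately.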

Full Text