Efficient Substring Discovery Using Suffix, LCP Array and Algorithm-Architecture Interaction

Anindya Poddar

doi:10.31390/gradschool_dissertations.490

Abstract

Preprocessing is essential for extracting information from large databases such as biological sequences of genes or proteins. Pattern discovery becomes very time efficient when the database is preprocessed into a suffix array. Because of the inherent organization of data in the suffix array and its secondary data structure, the longest common prefix (LCP) array (Manber and Myers 1990), only a limited portion of the database is accessed during a search, so a large amount of information can be retrieved in very little time relative to the size of the database. Unlike exact pattern matching, here we preprocess the database instead of the pattern. We found the suffix and LCP arrays to be well suited to computing N-grams (substrings) from various perspectives. Over the past couple of decades there has been significant research on the construction of suffix and LCP arrays. By comparison, research on using these promising data structures to retrieve substring information from various perspectives has received little attention. Our main focus in this work was to develop a number of algorithms that compute the present and missing N-grams of a text in linear time and report them non-redundantly for large databases. Information about present and missing N-grams, together with a time-efficient, non-redundant representation of them in large genome sequences, may lead to new discoveries in biology. We implemented all of our algorithms, applied them to various genome and proteome sequences, and found interesting results. We also measured their performance and other hardware parameters on various platforms in order to suggest an appropriate architecture for this kind of application.
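To make the idea concrete, the following minimal sketch shows how a suffix array and its LCP array can list each present N-gram exactly once, and how the missing N-grams over a given alphabet follow from that list. It is an illustration under stated assumptions, not the algorithms developed in this work: the function names are hypothetical, the suffix array is built by naive sorting rather than a linear-time construction, the LCP array uses the well-known Kasai et al. procedure, and missing N-grams are found by brute-force enumeration, which is practical only for small N and small alphabets.

```python
from itertools import product


def build_suffix_array(text):
    # Naive construction by sorting suffixes; for illustration only,
    # not a linear-time method.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    # Kasai-style computation: lcp[i] is the length of the longest common
    # prefix of the suffixes at sa[i - 1] and sa[i]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def present_ngrams(text, n):
    # Suffixes that share the same first n characters are adjacent in the
    # suffix array, so a new N-gram starts exactly where lcp[i] < n.
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    result = []
    for i, start in enumerate(sa):
        if start + n <= len(text) and lcp[i] < n:
            result.append(text[start:start + n])
    return result


def missing_ngrams(text, n, alphabet):
    # Brute-force complement over the alphabet (assumption: small n).
    present = set(present_ngrams(text, n))
    candidates = ("".join(p) for p in product(alphabet, repeat=n))
    return [c for c in candidates if c not in present]


if __name__ == "__main__":
    print(present_ngrams("banana", 2))
    print(missing_ngrams("banana", 2, "abn"))
```

Once the two arrays exist, a single left-to-right scan reports every distinct N-gram in lexicographic order without storing duplicates, which is the property that makes a non-redundant listing cheap.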

Full Text