Searching Strings by Substring

doi:10.1017/9781009128933.011

Abstract

This chapter deals with the design of data structures and algorithms for the substring search problem, which occurs mainly in computational biology and textual database applications to date. Most of the chapter is devoted to describing the two main data-structure champions in this context, the suffix array and the suffix tree. Several pseudocodes and illustrative examples enrich this discussion, which is accompanied by the evaluation of time, space, and I/O complexities incurred by their construction and by the execution of some powerful query operations. In particular, the chapter deals with the efficient/optimal construction of large suffix arrays in external memory, hence describing the DC3 algorithm and the I/O-efficient scan-based algorithm proposed by Gonnet, Baeza-Yates, and Snider, and the efficient direct construction of suffix trees, via McCreight’s algorithm, or via suffix arrays and LCP arrays. It will also detail the elegant construction of this latter array in internal memory, which is fundamental for several text-mining applications, some of which are described at the end of the chapter.

Full Text