Full-text search engine with suffix index for massive heterogeneous data

Wentao Xu, Haoyu Chen, Yidong Huan, Xuedong Hu, Ge Nong

doi:10.1016/j.is.2021.101893

Abstract

Popular search engines such as Elasticsearch (ES) commonly use inverted indices to quickly retrieve source data matching a given set of queries. However, an inverted index may fail to find all matching results, particularly in data that are hard to segment into words, such as data logs and scientific signals. This article presents SAES, a true full-text search system that replaces the inverted index in ES with a suffix index to guarantee a 100% recall ratio. We designed a distributed suffix index scheme with online building and offline merging that scales with the architecture of ES. The suffix index is constructed dynamically by several suffix array construction tools, chosen according to the data size and the available computing resources such as CPU cores, RAM, and disk capacity. Furthermore, the index can be compacted to trade searching speed against index storage space. An experimental study was conducted to test the functions and performance of single- and multi-node SAES on realistic datasets of texts, logs, genomes, and signals. The systems performed well for both exact and approximate search queries defined on units of bytes or half-bytes. This work provides a feasible reference design for extending ES with a suffix index to support true full-text searches over massive heterogeneous data.
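To illustrate why a suffix index can reach every occurrence of a byte pattern without word segmentation, the following is a minimal sketch of exact search over a suffix array. It is not the SAES implementation: the function names `build_suffix_array` and `find_all` are hypothetical, the sample record is invented for illustration, and the naive construction shown here stands in for the dedicated suffix array construction tools mentioned above.

```python
def build_suffix_array(data):
    """Return the start offsets of all suffixes of `data`, sorted lexicographically.

    Naive construction by direct sorting; an assumption made for brevity,
    suitable only for small inputs.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def find_all(data, sa, pattern):
    """Return every offset in `data` where `pattern` occurs, in ascending order."""
    m = len(pattern)
    if m == 0:
        return []

    # Suffixes are sorted, so their length-m prefixes are sorted too, and all
    # suffixes starting with `pattern` form one contiguous run in `sa`.
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


# Hypothetical log-like record with no natural word boundaries around the pattern.
record = b"id=ab12cd;id=xx12cdyy"
suffix_array = build_suffix_array(record)
hits = find_all(record, suffix_array, b"12cd")
```

Because the pattern is matched against suffixes of the raw bytes rather than against pre-segmented tokens, an occurrence embedded inside a longer token is still found, which is the property behind the recall guarantee described above.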

Full Text