Abstract

Popular search engines such as Elasticsearch (ES) commonly use inverted indices to quickly retrieve source data matching a given set of queries. However, an inverted index may not find all matching results, particularly in data that are hard to segment into words, such as data logs and scientific signals. This article presents SAES, a true full-text search system that replaces the inverted index in ES with a suffix index to guarantee a 100% recall ratio. We designed a distributed suffix index scheme with online building and offline merging that scales with the architecture of ES. The suffix index is constructed dynamically by several suffix array construction tools, which adapt to the data size and the available computing resources such as CPU cores, RAM, and disk capacity. Furthermore, the index can be compacted to trade searching speed against index storage space. An experimental study tested the functions and performance of single- and multi-node SAES on realistic datasets of texts, logs, genomes, and signals. The systems performed well for both exact and approximate search queries defined on units of bytes or half-bytes. This work provides a feasible reference design for extending ES with a suffix index to support true full-text searches over massive heterogeneous data.
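
The recall gap between the two index types can be illustrated with a minimal sketch. It is not the SAES implementation: the suffix array is built by naive sorting rather than by the construction tools the abstract mentions, the whitespace tokenizer only stands in for a real ES analyzer, and the function names and the sample log line are hypothetical, invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: List[bytes]) -> Dict[bytes, Set[int]]:
    """Map each whitespace-delimited token to the ids of documents containing it."""
    index: Dict[bytes, Set[int]] = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in doc.split():
            index[token].add(doc_id)
    return index


def build_suffix_array(data: bytes) -> List[int]:
    """Naive construction: sort all suffix start offsets lexicographically.

    Assumption: fine for a toy input; real tools use far faster algorithms.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def suffix_search(data: bytes, sa: List[int], pattern: bytes) -> List[int]:
    """Return every byte offset where pattern occurs, using two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


# Hypothetical log line: the query string sits inside larger tokens.
log_line = b"ERR0x7fA3 conn=10.0.0.7:8080 ERR0x7fB1"
query = b"0x7f"

inverted = build_inverted_index([log_line])
inverted_hits = inverted.get(query, set())   # token lookup misses the substring

sa = build_suffix_array(log_line)
suffix_hits = suffix_search(log_line, sa, query)  # finds every occurrence
```

In this sketch the token lookup returns no documents because `0x7f` never appears as a standalone token, while the suffix array reports both occurrences as byte offsets. Every substring of the data is a prefix of some suffix, which is the property that lets a suffix index promise full recall without depending on how the data is segmented.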
