Engineering basic algorithms of an in-memory text search engine

Frederik Transier,Peter Sanders

doi:10.1145/1877766.1877768

Abstract

Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operation on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup , a new data structure that allows intersection in expected time linear in the smaller list. Based on this result, we present the algorithmic core of a full text data base that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines. A similar system is now running in practice in each core of the distributed data base engine TREX of SAP.

Full Text