SCANNS: Towards Scalable and Concurrent Data Indexing and Searching in High-End Computing System

Alexandru Iulian Orhean, Ioan Raicu, Lavanya Ramakrishnan, Anna Giannakou, Kyle Chard

doi:10.1109/ccgrid54584.2022.00014

Abstract

Increasing data volumes, particularly in science and engineering, have resulted in the widespread adoption of parallel and distributed file systems for data storage and access. However, as file system sizes and the amount of data “owned” by users have grown, it has become increasingly difficult to discover and locate data among the terabytes or petabytes of accessible data. While it is now routine to search for data on a personal computer or discover data online at the click of a button, there is no equivalent method for discovering data on large parallel and distributed file systems in high-performance computing systems. Popular search solutions, such as Apache Lucene, were designed and implemented to run on commodity hardware, and thus face significant limitations in achieving good efficiency on large-scale storage systems with many-core architectures, multiple NUMA nodes, and multiple NVMe storage devices. In this work we revisit and propose methods and techniques to support efficient indexing of data in order to enable search. We propose SCANNS, an indexing framework that exploits the properties of modern high-performance computing systems to deliver an order of magnitude better performance. SCANNS supports the Term Frequency-Inverse Document Frequency (TF-IDF) information retrieval model out of the box. We evaluate SCANNS on the Mystic system with configurations of up to 192 cores, 768 GiB of RAM, 8 NUMA nodes, and up to 16 NVMe drives, and achieve up to 19x better indexing performance while delivering up to 280x lower search latency compared to Apache Lucene.

Full Text