Abstract

Increasing data volumes, particularly in science and engineering, have resulted in the widespread adoption of parallel and distributed file systems for data storage and access. However, as file system sizes and the amount of data “owned” by users have grown, it has become increasingly difficult to discover and locate data among the terabytes or petabytes of accessible data. While it is now routine to search for data on a personal computer or discover data online at the click of a button, there is no equivalent method for discovering data on large parallel and distributed file systems in high-performance computing systems. Popular search solutions, such as Apache Lucene, were designed and implemented to run on commodity hardware, which significantly limits their efficiency on large-scale storage systems with many-core architectures, multiple NUMA nodes, and multiple NVMe storage devices. In this work, we revisit and propose methods and techniques for efficient indexing of data in order to enable search. We propose SCANNS, an indexing framework that exploits the properties of modern high-performance computing systems to deliver an order of magnitude better performance. SCANNS supports the Term Frequency-Inverse Document Frequency (TF-IDF) information retrieval model out of the box. We evaluate SCANNS on the Mystic system with configurations of up to 192 cores, 768 GiB of RAM, 8 NUMA nodes, and up to 16 NVMe drives, and achieve up to 19x better indexing performance while delivering up to 280x lower search latency compared to Apache Lucene.
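For readers unfamiliar with the TF-IDF model mentioned above, the following is a minimal single-threaded sketch of an inverted index with TF-IDF ranking. It is not the SCANNS implementation and says nothing about its parallel, NUMA-aware, or NVMe-aware design; the function names `build_index` and `search`, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index mapping term -> {doc_id: term count}.

    `docs` is a dict of doc_id -> text. Tokenization is a plain
    lowercase whitespace split (an illustrative assumption).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the query terms.

    idf is log(n_docs / document frequency); terms absent from the
    index contribute nothing.
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

As a usage example, indexing three small files and calling `search(index, 3, "climate")` would rank a file that contains "climate" twice above one that contains it once, and omit files that do not contain it at all. A term that occurs in every document receives an idf of zero under this weighting, so it does not influence the ranking.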
