Succinct Multibit Tree: Compact Representation of Multibit Trees by Using Succinct Data Structures in Chemical Fingerprint Searches

Yasuo Tabei

doi:10.1007/978-3-642-33122-0_16

Abstract

Similarity search in databases of chemical fingerprints is a fundamental task in discovering novel drug-like molecules. The multibit tree is a data structure that enables fast similarity searches of chemical fingerprints (Kristensen et al., WABI'09). However, a standard pointer-based representation of multibit trees consumes a large amount of memory when indexing large-scale fingerprint databases. To make matters worse, the original fingerprint database must also be kept in memory to filter out false positives. Succinct data structures are compact representations that still support fast operations. Many have been proposed and applied in fields such as full-text indexing and genome mapping. We present compact representations of both multibit trees and fingerprint databases by applying these data structures. Experiments show that our representations use much less memory than the standard pointer-based representation, and that they enable efficient PubChem-scale similarity searches.

Full Text