The chemfp project

Andrew Dalke

doi:10.1186/s13321-019-0398-8

Abstract

The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. 
Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics.

Highlights

Molecular similarity search is a fundamental concept in cheminformatics
Complete search systems are available from many vendors, or a good programmer can implement a basic system with reasonable search performance in only a few hours
This paper focuses on exact methods for binary fingerprints

Summary

Introduction

Molecular similarity search is a fundamental concept in cheminformatics. The most common form is almost certainly a Tanimoto similarity search of bitstring fingerprints. High-performance search systems, which combine fast popcount evaluation and pruning algorithms, require significantly more development effort than a basic system. The chemfp project started in order to develop a de facto file format for chemical fingerprints. This requires some consideration of why such a format did not already exist, in order to understand which factors to focus on during format development. Willett [1], influenced by earlier work [2], showed how the Tanimoto similarity between two bitstring fingerprints is a useful mechanism to characterize molecular similarity. The term “fingerprint” first appeared in the literature in 1992 [3] to distinguish the then-new enumeration-based Daylight fingerprints from the older substructure dictionary approach. The late 1980s and 1990s brought an incredible growth of research as people explored ways to generate, compare, and cluster fingerprints, to extend the concept to sparse and count fingerprints, and to extend fingerprints beyond 2D substructures [4].
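To make the terms above concrete, the following is a minimal Python sketch of a Tanimoto threshold search over bitstring fingerprints, with a popcount-based pruning step in the spirit of BitBound. It is not chemfp's implementation or API. The function names, the choice to store each fingerprint as a Python integer, and the decision to score two all-zero fingerprints as 0.0 are all assumptions made for illustration. The pruning step relies on the bound that the Tanimoto of two fingerprints cannot exceed the smaller popcount divided by the larger one.

```python
def popcount(fp: int) -> int:
    """Number of set bits in a fingerprint stored as a non-negative int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: popcount(a AND b) / popcount(a OR b).

    Two all-zero fingerprints are scored 0.0 here (an assumed convention).
    """
    union = popcount(a | b)
    if union == 0:
        return 0.0
    return popcount(a & b) / union


def threshold_search(query, targets, threshold):
    """Return (id, score) pairs with score >= threshold, best first.

    `targets` is a list of (id, fingerprint) pairs (hypothetical layout).
    """
    q = popcount(query)
    hits = []
    for ident, fp in targets:
        t = popcount(fp)
        # Popcount bound: Tanimoto <= min(q, t) / max(q, t), so a target
        # whose bound is below the threshold can be skipped unscored.
        if min(q, t) < threshold * max(q, t):
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((ident, score))
    hits.sort(key=lambda hit: -hit[1])
    return hits


if __name__ == "__main__":
    targets = [("a", 0b1111), ("b", 0b0111), ("c", 0b0001), ("d", 0b11110000)]
    for ident, score in threshold_search(0b1111, targets, 0.7):
        print(ident, score)
```

In this toy data set, target "c" is rejected by the popcount bound alone, while "d" passes the bound but fails the actual Tanimoto test. A production system would typically presort targets by popcount so that whole ranges can be skipped, and would use hardware popcount instructions rather than string counting.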

Journal: Journal of Cheminformatics	Publication Date: Dec 1, 2019
Citations: 28	License type: open-access
