Large scale study of multiple-molecule queries

Ramzi J Nasr, Pierre F Baldi, S Joshua Swamidass

doi:10.1186/1758-2946-1-7

Open Access

Journal: Journal of Cheminformatics
Publication Date: Jun 4, 2009
Citations: 52
License type: CC BY 2.0

Affiliation: University of California, Irvine

Abstract

Background

In ligand-based screening, as well as in other chemoinformatics applications, one seeks to search large repositories of molecules effectively in order to retrieve molecules that are similar, typically, to a single lead molecule. However, in some cases, multiple molecules from the same family are available to seed the query and search for other members of the same family.

Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large, publicly available data sets and background data sets. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking for direct comparison in future studies across several performance metrics.

Results

Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and the remaining six fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics.

Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule to a family by the maximum similarity, or minimum ranking, obtained across the family. One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data.

Conclusion

Fourteen methods for multiple-molecule querying of chemical databases, including novel methods, (ETD) and (TPD), are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/.
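The two parameter-free scoring rules named in the abstract, MAX-SIM and MIN-RANK, can be illustrated with a minimal sketch. It assumes that molecules are represented as binary fingerprints stored as sets of "on" bit positions and that similarity is the Tanimoto coefficient; neither assumption is stated in the abstract, and the function names are hypothetical rather than taken from the authors' code.

```python
from typing import FrozenSet, List, Sequence

# Assumed representation: a fingerprint is the set of its "on" bit positions.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient |a & b| / |a | b| (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def max_sim(candidate: Fingerprint, family: Sequence[Fingerprint]) -> float:
    """MAX-SIM: score a candidate by its highest similarity to any family member."""
    return max(tanimoto(candidate, query) for query in family)


def min_rank(database: Sequence[Fingerprint],
             family: Sequence[Fingerprint]) -> List[int]:
    """MIN-RANK: run one single-molecule query per family member and keep,
    for each database molecule, the best (lowest) rank it achieves.
    Rank 1 is the most similar; ties are broken by database order
    (an arbitrary choice made for this sketch)."""
    best = [len(database)] * len(database)
    for query in family:
        order = sorted(range(len(database)),
                       key=lambda i: -tanimoto(database[i], query))
        for rank, i in enumerate(order, start=1):
            best[i] = min(best[i], rank)
    return best


family = [frozenset({1, 2, 3}), frozenset({3, 4, 5})]
database = [frozenset({1, 2, 3, 4}), frozenset({7, 8}), frozenset({3, 4, 5})]
max_sim_scores = [max_sim(mol, family) for mol in database]
min_rank_scores = min_rank(database, family)
```

Higher MAX-SIM scores and lower MIN-RANK values both indicate molecules more likely to belong to the query family, so a database would be sorted in descending order for the former and ascending order for the latter.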

Highlights

The rapid search of large repositories of molecules is a fundamental task of chemoinformatics
All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and Boltzmann-Enhanced Discrimination of Receiver Operating Characteristic (BEDROC) metrics
One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data
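Two of the metrics listed above can be sketched in a few lines. The sketch assumes that AUC refers to the area under the ROC curve, computed here as the probability that a randomly chosen family member outscores a randomly chosen background molecule, and that the F1-measure is evaluated at a fixed score threshold. The threshold choice and the function names are illustrative assumptions, not details given on this page.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.
    labels: 1 for family members (actives), 0 for background molecules.
    Ties between an active and a background score count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one background molecule")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int],
                    threshold: float) -> float:
    """F1-measure when molecules scoring >= threshold are predicted active."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)               # 3 of 4 active/background pairs ordered correctly
f1 = f1_at_threshold(scores, labels, 0.5)   # tp=1, fp=1, fn=1
```

The pairwise formulation is quadratic in the number of molecules; for background sets of the size mentioned above, a rank-based computation would be the more practical choice.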

Summary

Results

Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and the remaining six fit one or two free parameters to the data using a careful cross-validation protocol. One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data.
