Abstract

With the rapidly increasing availability of High-Throughput Screening (HTS) data in the public domain, such as the PubChem database, methods for ligand-based computer-aided drug discovery (LB-CADD) have the potential to accelerate and reduce the cost of probe development and drug discovery efforts in academia. We assemble nine data sets from realistic HTS campaigns representing major families of drug target proteins for benchmarking LB-CADD methods. Each data set is available in the public domain through PubChem and was carefully collated through confirmation screens validating active compounds. These data sets provide the foundation for benchmarking a new cheminformatics framework, BCL::ChemInfo, which is freely available for non-commercial use. Quantitative structure activity relationship (QSAR) models are built using Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), and Kohonen networks (KNs). Problem-specific descriptor optimization protocols are assessed, including Sequential Feature Forward Selection (SFFS) and various information content measures. Measures of predictive power and confidence are evaluated through cross-validation, and a consensus prediction scheme is tested that combines orthogonal machine learning algorithms into a single predictor. Enrichments ranging from 15 to 101 for a TPR cutoff of 25% are observed.
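The two quantities named at the end of the abstract, a consensus prediction and an enrichment at a true positive rate (TPR) cutoff, can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: the consensus is an unweighted mean of per-compound scores, enrichment is the fraction of actives among the top-ranked compounds divided by the fraction of actives in the whole set, and it is evaluated at the shortest ranked list that recovers the requested share of actives. The function names and the toy data are hypothetical and do not reflect the BCL::ChemInfo implementation.

```python
from typing import List, Sequence


def consensus_scores(model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted mean of per-compound scores across models (an assumed scheme)."""
    n_models = len(model_scores)
    if n_models == 0:
        raise ValueError("at least one model is required")
    return [sum(column) / n_models for column in zip(*model_scores)]


def enrichment_at_tpr(
    scores: Sequence[float], labels: Sequence[int], tpr_cutoff: float = 0.25
) -> float:
    """Enrichment at the shortest ranked prefix whose TPR reaches tpr_cutoff.

    Assumed definition: (actives found / compounds selected) divided by
    (all actives / all compounds). Labels are 1 for active, 0 for inactive.
    """
    total = len(labels)
    actives = sum(labels)
    if total == 0 or actives == 0:
        raise ValueError("need at least one compound and one active")
    if not 0.0 < tpr_cutoff <= 1.0:
        raise ValueError("tpr_cutoff must lie in (0, 1]")

    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    found = 0
    for n_selected, (_, is_active) in enumerate(ranked, start=1):
        found += is_active
        if found / actives >= tpr_cutoff:
            precision = found / n_selected
            return precision / (actives / total)
    # Not reachable: the full list always recovers every active.
    raise RuntimeError("TPR cutoff was never reached")


# Toy data (hypothetical): 8 compounds, 2 of them active, scored by 2 models.
toy_labels = [1, 0, 0, 0, 0, 1, 0, 0]
toy_model_a = [0.9, 0.2, 0.1, 0.4, 0.3, 0.8, 0.2, 0.1]
toy_model_b = [0.7, 0.4, 0.3, 0.2, 0.1, 0.6, 0.2, 0.3]

toy_consensus = consensus_scores([toy_model_a, toy_model_b])
print(enrichment_at_tpr(toy_consensus, toy_labels, tpr_cutoff=0.25))
```

Under this definition an enrichment of 1 corresponds to random selection, so the reported values of 15 to 101 would mean the top-ranked compounds contain that many times more actives than a random sample of the same size.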

Highlights

  • The development of quantitative structure activity relationship (QSAR) models in ligand-based computer-aided drug discovery (LB-CADD) has shown practical value for in silico high-throughput screening (HTS) to identify potential hit compounds, i.e., compounds that share a biological activity of interest [1]

  • We focus on HTS experiments with a single well-defined biological target protein

  • Nine large data sets were assembled from realistic HTS experiments for a range of common drug target proteins, including G-protein coupled receptors (GPCRs), ion channels, transporters, kinase inhibitors, and enzymes

Summary

Introduction

The development of quantitative structure activity relationship (QSAR) models in ligand-based computer-aided drug discovery (LB-CADD) has shown practical value for in silico (virtual) high-throughput screening (HTS) to identify potential hit compounds, i.e., compounds that share a biological activity of interest [1]. Public databases such as PubChem [4] contain biological activities for several hundred thousand compounds tested against different biological targets [5]. LB-CADD is attractive in the resource-limited environment of academia because it has the potential to reduce the cost and increase the quality of drug discovery and/or probe development for rare or neglected diseases. Methods [13,14]

Molecular Descriptors Numerically Encode Chemical Structure
Consensus of QSAR Models Has Potential to Improve Prediction Accuracy
Significance
Results and Discussion
Machine Learning Algorithms Relate Chemical Structure to Biological Activity
Quality Measures Assess the Predictive Power of Machine Learning Algorithms
Experimental
GPCR: Allosteric Modulators of M1 Muscarinic Receptor
Ion Channel
Transporter
Kinase Inhibitor
Enzyme
Numerical Description of Molecules for QSAR Model Development
Monitoring Data Set is Used for Early Termination of Training Process
Cross-Validation Ascertains Robustness of QSAR Models
Selection of an Optimized Descriptor Set Guides QSAR Model Training
Consensus Prediction Seeks Improved Accuracies of Trained QSAR Models
Implementation
Conclusions