Abstract

Developing realistic data sets for evaluating virtual screening methods is a task that the cheminformatics community has tackled for many years. Numerous artificially constructed data collections have been developed, such as DUD, DUD-E, and DEKOIS. However, they all suffer from multiple drawbacks, one of which is the absence of experimental results confirming the inactivity of presumably inactive molecules, which may lead to false negatives in the ligand sets. In light of this problem, the PubChem BioAssay database, an open-access repository providing bioactivity information for compounds that have already been tested on a biological target, is now a recommended source for data set construction. Nevertheless, several issues with the use of such data need to be properly addressed. This article provides an overview of benchmarking data collections built upon experimental PubChem BioAssay input, along with a thorough discussion of noteworthy issues that one must consider when designing new ligand sets from this database. The points raised in this review are expected to guide future developments in this regard, in the hope of offering better evaluation tools for novel in silico screening procedures.

Highlights

  • The PubChem BioAssay database was first introduced in 2004 as a part of the PubChem project initiated by the National Center for Biotechnology Information (NCBI), aiming to provide the scientific community with an open-access resource where experimental bioactivity high-throughput screening (HTS) data of chemical substances can be found [1,2,3,4,5]

  • Post-design analyses on the resulting data sets showed that (i) the ligands covered a large number of distinct molecular scaffolds (1.2 compounds/scaffold class), denoting the absence of “analog bias” and a good representation of drug-like chemical space, (ii) the correlation between the degree of data set clumping and retrospective virtual screening performance was no longer observed after the Maximum Unbiased Validation (MUV) design, suggesting that the final ligand sets were not affected by benchmarking data set bias, and (iii) the MUV data were significantly less biased than the standard DUD data set, as evidenced by a lower molecular self-similarity level and a greater difficulty in distinguishing true actives from true inactives in ligand-based virtual screening simulations

  • Due to the unreasonably rigorous data quality filters that were applied during the construction of this data collection, the number of target sets offered by the authors is relatively small, and several important protein families that have been extensively investigated by biochemists, e.g., G protein-coupled receptors (GPCRs) and nuclear receptors, are neglected
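
The "compounds/scaffold class" figure quoted in the highlights above is a simple ratio: the number of ligands divided by the number of distinct scaffolds they fall into. A minimal sketch of that arithmetic follows, assuming each compound has already been assigned a scaffold label by some external tool. The function name `compounds_per_scaffold`, the compound IDs, and the scaffold labels are all hypothetical and are not taken from the MUV work.

```python
from collections import Counter


def compounds_per_scaffold(scaffold_by_compound):
    """Return the mean number of compounds per distinct scaffold class.

    scaffold_by_compound maps a compound ID to a scaffold label that was
    computed elsewhere (the labelling method is an assumption of this sketch).
    """
    if not scaffold_by_compound:
        raise ValueError("empty ligand set")
    # Count how many compounds share each scaffold label.
    class_sizes = Counter(scaffold_by_compound.values())
    return len(scaffold_by_compound) / len(class_sizes)


# Hypothetical toy ligand set: six compounds spread over five scaffolds.
toy_set = {
    "cpd1": "benzimidazole",
    "cpd2": "benzimidazole",
    "cpd3": "quinoline",
    "cpd4": "indole",
    "cpd5": "piperazine",
    "cpd6": "pyrimidine",
}

ratio = compounds_per_scaffold(toy_set)
print(f"{ratio:.1f} compounds per scaffold class")
```

A ratio close to 1 means that almost every ligand carries its own scaffold, which is the situation the highlights describe as an absence of analog bias.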


Summary

Introduction

The PubChem BioAssay database (http://pubchem.ncbi.nlm.nih.gov/bioassay) was first introduced in 2004 as part of the PubChem project initiated by the National Center for Biotechnology Information (NCBI), aiming to provide the scientific community with an open-access resource where experimental high-throughput screening (HTS) bioactivity data for chemical substances can be found [1,2,3,4,5]. Having started out with small-molecule HTS input from the National Institutes of Health (NIH), the database now gathers data from over 700 different sources, including governmental organizations, world-renowned research centers, chemical vendors, and other biochemical databases, and features over 260 million bioactivity data points reported in both small-molecule assays and RNA interference reagent screening projects [5,6,7,8,9,10,11]. We give a thorough discussion of noteworthy issues that have to be addressed before such data are used in cheminformatics-related projects, with illustrations drawn from our recently introduced LIT-PCBA data collection [22], which was constructed from PubChem BioAssay data.

PubChem BioAssay Statistics
What We Can Do with PubChem BioAssay Data
The MUV Data Sets
The UCI Repository
The Butkiewicz et al. Data Collection
The Lindh et al. Data Collection
The LIT-PCBA Data Sets
Assay Selection as Regards the Data Size and Hit Rates
Assay Selection as Regards the Nature of Virtual Screening
Assay Selection as Regards the Screening Stage
Detecting False Positives among Active Substances
Possible Chemical Bias in Assembling Active and Inactive Substances
Potency Bias in the Composition of Active Ligand Sets
Processing Input Structures Prior to Virtual Screening
Conclusions
Findings
Methods
