Abstract
Structure–activity relationship modelling is frequently used in the early stage of drug discovery to assess the activity of a compound on one or several targets, and can also be used to assess the interaction of compounds with liability targets. Quantitative structure–activity relationship (QSAR) models have been used for these and related applications over many years, with good success. Conformal prediction is a relatively new QSAR approach that provides information on the certainty of a prediction, and so helps in decision-making. However, it is not always clear how best to make use of this additional information. In this article, we describe a case study that directly compares conformal prediction with traditional QSAR methods for large-scale predictions of target-ligand binding. The ChEMBL database was used to extract a data set comprising data from 550 human protein targets with different bioactivity profiles. For each target, a QSAR model and a conformal predictor were trained and their results compared. The models were then evaluated on new data published since the original models were built, to simulate a “real world” application. The comparative study highlights the similarities between the two techniques but also some differences that are important to bear in mind when the methods are used in practical drug discovery applications.
Highlights
Public databases of bioactivity data play a critical role in modern translational science.
In this article we focus on conformal prediction (CP) [11], but recognise that there are alternatives such as Venn–ABERS predictors [12, 13], which have been applied to drug discovery applications [14, 15, 16].
Data sets: Data were extracted from version 23 of the ChEMBL database (ChEMBL_23) [27] using a protocol adapted from the study of Lenselink et al. [24] (Fig. 1).
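To make concrete how a conformal predictor attaches certainty information to a prediction, the following is a minimal sketch of a class-conditional (Mondrian-style) inductive conformal classifier. It is an illustration under stated assumptions, not the implementation used in the article: the function names, the "active"/"inactive" labels and all nonconformity scores are hypothetical, and the scores are assumed to come from some underlying model (for example, one minus the predicted probability of the label).

```python
from typing import Dict, List, Set


def p_value(calibration_scores: List[float], test_score: float) -> float:
    """Share of calibration examples at least as nonconforming as the test object.

    The +1 terms count the test object itself, as is usual in conformal prediction.
    """
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(
    calibration_by_class: Dict[str, List[float]],
    test_scores_by_class: Dict[str, float],
    significance: float,
) -> Set[str]:
    """Return every label whose class-specific p-value exceeds the significance level."""
    return {
        label
        for label, score in test_scores_by_class.items()
        if p_value(calibration_by_class[label], score) > significance
    }


# Hypothetical calibration nonconformity scores, kept separately per class.
calibration = {
    "active": [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80],
    "inactive": [0.02, 0.08, 0.15, 0.25, 0.35, 0.45, 0.50, 0.65, 0.90],
}

# Hypothetical nonconformity of one new compound under each candidate label.
test_compound = {"active": 0.12, "inactive": 0.95}

set_at_80 = prediction_set(calibration, test_compound, significance=0.20)
set_at_95 = prediction_set(calibration, test_compound, significance=0.05)
print(sorted(set_at_80))
print(sorted(set_at_95))
```

In this toy case the compound receives a single label at the 80% confidence level but both labels at the 95% level. The output of a conformal predictor is therefore a set of labels that depends on the chosen significance level, whereas a traditional QSAR classifier returns one label.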
Summary
Public databases of bioactivity data play a critical role in modern translational science. They provide a central place to access the ever-increasing amounts of data that would otherwise have to be extracted from tens of thousands of different journal articles. They make the data easier to use through automated and/or manual classification, annotation and standardisation approaches. Because their content is freely accessible, the entire scientific community can query, extract and download information of interest. The latest release (version 24) of ChEMBL (ChEMBL_24) contains more than 6 million curated data points for around 7500 protein targets and 1.2 million distinct compounds [3]. This represents a gold mine for chemists, biologists, toxicologists and modellers alike.