The Development of Target-Specific Machine Learning Models as Scoring Functions for Docking-Based Target Prediction.

Mauro S Nogueira, Oliver Koch

doi:10.1021/acs.jcim.8b00773

Abstract

The identification of possible targets for a known bioactive compound is of the utmost importance for drug design and development. Molecular docking is one possible approach for in silico protein target prediction, in which a molecule is docked into several different protein structures to identify potential targets. This reverse docking approach is hampered by the limited ability of current scoring functions to discriminate correctly between targets and nontargets. In this work, we describe the development of target-specific scoring functions that showed improved target prediction performance for both actives and decoys on three validation data sets. In contrast to purely ligand-based approaches, which are generally faster and cover a larger target space, docking-based approaches can also cover unknown chemical space that lies outside the known bioactivity data. These target-specific scoring functions are based on known bioactivity data retrieved from ChEMBL and on supervised machine learning approaches. Neural network and support vector machine (SVM) models were trained for 20 different protein targets. Our protein-ligand interaction fingerprint PADIF (Protein Atom Score Contributions Derived Interaction Fingerprint) serves as the input for training; the PADIFs are calculated from docking poses of active and inactive compounds. Different data sets of previously unseen molecules were used for the final evaluation and analysis of the prediction performance of the resulting models. For a single-target selectivity data set, the correct target model returns, in most cases, the highest probability scores for its active molecules, with statistically significant differences from the other targets. Probability scores were also predicted for the molecules of a multitarget data set, with activity data reported simultaneously for two, three, and four to seven protein targets, and were successfully used to rank the targets.
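The target-ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that each target-specific model has already produced a probability score for one query molecule; the function name `rank_targets`, the target labels, and the score values are hypothetical and are not taken from the paper. The sketch only sorts the targets by score and does not reproduce the docking, PADIF calculation, or model training.

```python
from typing import Dict, List, Tuple


def rank_targets(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort targets by predicted probability of activity, highest first.

    `scores` maps a target name to the probability score returned by that
    target's model for a single query molecule (hypothetical input format).
    """
    for target, p in scores.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score for {target} is not a probability: {p}")
    # Ties are broken alphabetically by target name so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up probability scores for one query molecule, one per target model.
example_scores = {"target_A": 0.12, "target_B": 0.87, "target_C": 0.45}
ranking = rank_targets(example_scores)
top_target = ranking[0][0]  # the target whose model gave the highest score
```

In this sketch the top-ranked entry is simply the target whose model returned the highest score; for a multitarget molecule, one would inspect the first few entries of `ranking` rather than only the first.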

Full Text