The influence of the inactives subset generation on the performance of machine learning methods.

Sabina Smusz, Andrzej J Bojarski, Rafał Kurczab

doi:10.1186/1758-2946-5-17

Abstract

Background: Machine learning methods have become increasingly popular in virtual screening over the past few years, in both classification and regression tasks. However, their effectiveness depends strongly on many different factors.

Results: This study examined how the way the set of inactives is formed influences the classification process. The approaches compared were random and diverse selection from the ZINC database, from the MDDR database, and from libraries generated according to the DUD methodology. All learning methods were tested in two modes: using a single test set, the same for every method of generating inactive molecules, and using test sets whose inactives were prepared in the same way as those used for training. The experiments were carried out for 5 protein targets, 3 fingerprints for molecular representation, and 7 classification algorithms with varying parameters. The way the inactive set was formed had a substantial impact on the performance of the machine learning methods.

Conclusions: The degree to which chemical space was restricted determined the ability of the tested classifiers to select potentially active molecules in virtual screening tasks. For example, DUD sets (widely applied in docking experiments) did not enable proper selection of active molecules from databases with diverse structures. The study clearly showed that the inactive compounds forming the training set should be as representative as possible of the libraries that undergo screening.

Highlights

Machine learning methods have become increasingly popular in virtual screening over the past few years, in both classification and regression tasks
Six of the most frequently used ways of selecting assumed inactives were tested for their impact on the performance of machine learning methods: random and diverse selection from the ZINC database [12], from the MDDR database [13], and from libraries generated according to the DUD methodology [14]
The way inactives are selected when forming the training set may have a significant impact on the effectiveness of classification performed by machine learning methods

Summary

Introduction

Machine learning (ML) methods have become increasingly popular in virtual screening over the past few years, in both classification and regression tasks. Their effectiveness depends strongly on many different factors. When preparing machine learning experiments, the need arises to generate sets of compounds assumed to be inactive. Various approaches to this task have already been proposed. Only in very few cases is the number of available inactive compounds sufficient to perform ML experiments [11].
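To make the difference between random and diverse selection of assumed inactives concrete, the following is a minimal sketch, not the procedure used in the study. It assumes fingerprints are given as sets of on-bit indices, uses Tanimoto similarity, and implements diverse selection as a greedy MaxMin picker; the source does not state which diversity algorithm was applied, so that choice is an assumption. All function names and the toy pool are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def random_selection(pool, n, seed=0):
    """Pick n assumed inactives uniformly at random; returns pool indices."""
    rng = random.Random(seed)
    return rng.sample(range(len(pool)), n)


def diverse_selection(pool, n, first=0):
    """Greedy MaxMin picking (an assumed stand-in for 'diverse selection').

    Starting from pool[first], repeatedly add the compound whose highest
    similarity to any already-picked compound is the lowest.
    """
    picked = [first]
    # nearest[i] = highest similarity of compound i to any picked compound
    nearest = [tanimoto(fp, pool[first]) for fp in pool]
    while len(picked) < n:
        candidates = [i for i in range(len(pool)) if i not in picked]
        best = min(candidates, key=lambda i: nearest[i])
        picked.append(best)
        for i, fp in enumerate(pool):
            nearest[i] = max(nearest[i], tanimoto(fp, pool[best]))
    return picked


def mean_pairwise_similarity(pool, idx):
    """Average Tanimoto similarity over all pairs of selected compounds."""
    pairs = [(i, j) for k, i in enumerate(idx) for j in idx[k + 1:]]
    return sum(tanimoto(pool[i], pool[j]) for i, j in pairs) / len(pairs)


# Toy pool: 200 hypothetical 64-bit fingerprints clustered around 4 "scaffolds".
rng = random.Random(42)
scaffolds = [set(rng.sample(range(64), 16)) for _ in range(4)]
pool = []
for k in range(200):
    noise = set(rng.sample(range(64), 4))
    pool.append(scaffolds[k % 4] ^ noise)  # flip a few bits of the scaffold

rand_idx = random_selection(pool, 20)
div_idx = diverse_selection(pool, 20)
print("random :", round(mean_pairwise_similarity(pool, rand_idx), 3))
print("diverse:", round(mean_pairwise_similarity(pool, div_idx), 3))
```

In practice the pool would be fingerprints computed for a library such as ZINC or MDDR, and the selected indices would define the inactive part of the training set. Comparing the mean pairwise similarity of the two subsets gives a rough idea of how much each selection strategy restricts chemical space.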

Methods

Results

Discussion

Conclusion