Abstract

Machine learning methods may have the potential to significantly accelerate drug discovery. However, the increasing rate of new methodological approaches being published in the literature raises the fundamental question of how models should be benchmarked and validated. We reanalyze the data generated by a recently published large-scale comparison of machine learning models for bioactivity prediction and arrive at a somewhat different conclusion. We show that the performance of support vector machines is competitive with that of deep learning methods. Additionally, using a series of numerical experiments, we question the relevance of area under the receiver operating characteristic curve as a metric in virtual screening. We further suggest that area under the precision–recall curve should be used in conjunction with the receiver operating characteristic curve. Our numerical experiments also highlight challenges in estimating the uncertainty in model performance via scaffold-split nested cross validation.
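To make the comparison between these two metrics concrete, here is a minimal sketch (not taken from the paper; the synthetic data, scikit-learn calls, and choice of an SVM classifier are assumptions for illustration) that computes AUC–ROC and AUC–PR for a heavily imbalanced classification task of the kind encountered in virtual screening, where actives are rare.

```python
# Minimal illustrative sketch (not from the paper): contrast AUC-ROC with
# AUC-PR on a class-imbalanced dataset, as in a virtual screening library
# where actives are rare. Data, model, and sizes are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic, highly imbalanced data: roughly 2% "actives"
X, y = make_classification(n_samples=3000, n_features=50,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(probability=True, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# AUC-ROC is largely insensitive to class imbalance; AUC-PR (average
# precision) reflects how well the rare actives are ranked ahead of the
# many inactives, so the two can tell quite different stories.
print("AUC-ROC:", roc_auc_score(y_te, scores))
print("AUC-PR :", average_precision_score(y_te, scores))
```

On data this imbalanced, a high AUC–ROC can coexist with a low AUC–PR, which is why reporting both gives a fuller picture of screening performance.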

Highlights

  • Computational approaches to drug discovery are often justified as necessary due to the prohibitive time and cost of experiments

  • The questions we ask are: (1) Is one machine learning method significantly better than the rest, using the metrics adopted by Mayr et al.? (2) Are the metrics adopted by Mayr et al. the most relevant to ligand-based bioactivity prediction? Our key conclusion is an alternative interpretation of their results that considers both statistical and practical significance: we argue that deep learning methods do not significantly outperform all competing methods

  • In order to compare the performance of feedforward deep neural networks (FNN) to other models on each assay, the mean and standard deviation of the area under the receiver operating characteristic curve (AUC–ROC) scores over all three folds were calculated by Mayr et al.; this averaging completely discards the inherent uncertainty of each independent test fold, which can be useful information (see the sketch below)
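The sketch below is an assumed workflow, not Mayr et al.'s code; the synthetic data and the random forest classifier are placeholders. It contrasts reporting only the mean and standard deviation of three-fold AUC–ROC scores with retaining the individual per-fold estimates.

```python
# Assumed workflow for illustration (not Mayr et al.'s code): per-fold
# AUC-ROC on a 3-fold split, versus the single mean/std summary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1500, n_features=30,
                           weights=[0.9, 0.1], random_state=0)

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=3, shuffle=True,
                                           random_state=0).split(X, y):
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

# Collapsing three independent test-set estimates into a mean and standard
# deviation hides the spread across folds; keeping the per-fold values
# preserves that uncertainty.
print("per-fold AUC-ROC:", np.round(fold_aucs, 3))
print("mean, std       :", np.mean(fold_aucs), np.std(fold_aucs))
```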

Introduction

Computational approaches to drug discovery are often justified as necessary due to the prohibitive time and cost of experiments. Using the sign test, we calculated 95% Wilson score intervals for the sign-test statistic under the alternative hypothesis that FNN has better AUC–ROC performance than SVM, the second-best performing classifier according to Mayr et al. Using all 3930 test folds in the analysis (since each is an independent test set) gives an interval of (0.502, 0.534), while comparing only the mean AUC values per assay gives a confidence interval of (0.501, 0.564). While both of these tests are narrowly significant at the α = 0.05 level (neither interval includes 0.5), it is worth examining the practical meaning of these results. Some of these differences are likely noise due to small assay sizes; this suggests that classifier performance is assay dependent, and that one should try multiple classifiers for a given assay.
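As an illustration of the interval calculation described above, the sketch below computes a 95% Wilson score interval for the sign-test proportion. The fold counts are hypothetical (chosen only to be of the same order as the 3930 test folds), not the values underlying the intervals quoted here.

```python
# Illustrative sketch: 95% Wilson score interval for the proportion of
# independent test folds on which one classifier (e.g., FNN) beats another
# (e.g., SVM). The counts below are hypothetical, not the paper's data.
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical example: FNN outperforms SVM on 2035 of 3930 test folds.
low, high = wilson_interval(2035, 3930)
print(f"95% Wilson interval: ({low:.3f}, {high:.3f})")
# If the interval excludes 0.5 the sign test is significant at alpha = 0.05,
# but a proportion only slightly above 0.5 may have little practical impact.
```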
