Abstract

Measurements of protein–ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance, and they should ideally be factored into modelling and output predictions, for example via the standard deviation of experimental measurements (σ) or the comparability of activity values between aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. To improve upon the current state-of-the-art, we herein present a novel approach toward predicting protein–ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied to in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated under various scenarios of experimental standard deviation in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, since such information is not considered in any way by the original RF algorithm. For example, when σ ranged between 0.4 and 0.6 log units and ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident.
Finally, PRF models trained with putative inactives performed worse than PRF models trained without them. This may be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.

Highlights

  • The application of Machine Learning (ML) and Artificial Intelligence (AI) to the drug development process has increased in recent years [1,2,3], but the majority of research toward small molecule property prediction has predominantly focused on improving the reported accuracy of base algorithms, rather than factoring the experimental error into predictions [4]

  • In conclusion, the aim of this analysis was to investigate the performance of the Probabilistic Random Forest (PRF) as a method able to take into account experimental errors, which are usually a neglected aspect of model generation

  • By evaluating the current experimental error in ChEMBL v27, we identified that it is very similar to that reported for the earlier ChEMBL v14
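One common way to gauge the experimental error referred to above is to pool the standard deviations of replicate measurements for the same compound–target pair. The sketch below illustrates the idea with made-up pXC50 replicates and hypothetical ChEMBL identifiers; it is not the paper's analysis pipeline or its data:

```python
import statistics

# Hypothetical replicate pXC50 measurements, keyed by (compound, target).
replicates = {
    ("CHEMBL25", "CHEMBL204"): [6.1, 5.8, 6.0],
    ("CHEMBL25", "CHEMBL205"): [4.9, 5.3],
}

# Pool per-pair standard deviations into a single estimate of the
# experimental error (sigma) across the dataset.
per_pair_sd = [statistics.stdev(v) for v in replicates.values() if len(v) > 1]
sigma = statistics.mean(per_pair_sd)
print(f"estimated sigma = {sigma:.2f} log units")
```

With real data, this pooled σ can then be compared against the classification threshold to judge how many data points sit within one standard deviation of the decision boundary.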


Introduction

The application of Machine Learning (ML) and Artificial Intelligence (AI) to the drug development process has increased in recent years [1,2,3], but the majority of research toward small molecule property prediction has predominantly focused on improving the reported accuracy of base algorithms, rather than factoring the experimental error into predictions [4]. Since experimental error influences dataset generation and performance, it is important to investigate methods capable of accommodating experimental variability during training. This is especially important for binary classification tasks, which impose arbitrary cut-off(s) on the activity scale. pXC50 activity values of 5.1 and 4.9 are treated as contributing to opposing activity classes (e.g., with a classification threshold of 5), even though experimental error may not afford such discriminatory accuracy. This is detrimental in practice, and it is important to evaluate the presence of experimental error in databases and apply methodologies to account for variability in experiments.
