Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library

Jun Zhang, Weifeng Shen, Qin Wang

doi:10.1016/j.cjche.2022.04.004

Abstract

Owing to their outstanding performance in cheminformatics, machine learning algorithms are increasingly used to mine molecular properties and biomedical big data. The performance of machine learning models is known to depend critically on the choice of hyper-parameter configuration. However, many studies either searched for optimal hyper-parameters by grid search or used arbitrarily selected hyper-parameters, which can easily lead to a suboptimal configuration. In this study, the Hyperopt library, which embeds Bayesian optimization, is employed to find optimal hyper-parameters for different machine learning algorithms. Six drug discovery datasets, covering solubility, probe-likeness, hERG, Chagas disease, tuberculosis, and malaria, are used to compare different machine learning algorithms with ECFP6 fingerprints. This contribution evaluates whether the Bernoulli Naïve Bayes, Logistic Linear Regression, AdaBoost Decision Tree, Random Forest, support vector machine, and deep neural network algorithms with optimized hyper-parameters offer any improvement in testing compared with the Referenced Models, as assessed by an array of metrics including AUC, F1-score, Cohen’s kappa, Matthews correlation coefficient, recall, precision, and accuracy. Based on the rank normalized score approach, the Hyperopt Models achieve better or comparable performance on 33 out of 36 models across the drug discovery datasets, showing the significant improvement achieved by employing the Hyperopt library. The open-source code for all six machine learning frameworks implemented with the Hyperopt Python package is provided to make this approach accessible to more scientists who are not familiar with writing code.
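As an illustration of the threshold-based metrics named in the abstract, the following minimal sketch computes accuracy, precision, recall, F1-score, Cohen's kappa, and the Matthews correlation coefficient from the four cells of a binary confusion matrix, using their standard textbook definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper or its released code; AUC is omitted because it requires predicted scores rather than counts.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration only; zero denominators fall back to 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Made-up counts, purely to show the call.
example = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Reporting several such metrics together, as the abstract describes, guards against a model looking strong on one measure (for example accuracy on an imbalanced dataset) while performing poorly on another.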

Full Text