Effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints

Su-Yong Bae, Jonga Lee, Jaeseong Jeong, Changwon Lim, Jinhee Choi

doi:10.1016/j.comtox.2021.100178

Abstract

Machine learning and deep learning approaches are increasingly used in toxicology to build prediction models from various toxicity data. However, toxicity data are often class-imbalanced, which hinders the development of machine learning models with good performance. In this study, we therefore identified effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints. Data-balancing methods, such as random undersampling (RUS), sample weight (SW), synthetic minority oversampling technique (SMOTE), and random oversampling (ROS), were applied to the datasets. Model performance was evaluated using the F1 score for each combination of data-balancing method, machine learning algorithm, and molecular fingerprint. The five algorithms were gradient boosting tree (GBT), random forest (RF), support vector machine (SVM), multi-layer perceptron (MLP) network, and k-nearest neighbors (kNN); the five fingerprints were Morgan, MACCS, RDKit, Pattern, and Layered. The MACCS-GBT-SMOTE combination achieved the best F1 score, followed by RDKit-GBT-SW. Thus, this study demonstrated that data balancing conducted using oversampling methods improved model performance. The systematic approach used in this study can also be applied to other toxicity datasets, which may facilitate the development of improved classification models for toxicity screening.
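
To make the four balancing methods named in the abstract concrete, the following is a minimal pure-Python sketch on a toy binary-labelled dataset. It is not the authors' implementation: the function names, the toy data, the "balanced" weighting formula, and the simplified SMOTE-style interpolation are all assumptions chosen for illustration, and a real study would more likely rely on an established library.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Map each class label to the list of row indices carrying it."""
    groups = {}
    for i, lab in enumerate(y):
        groups.setdefault(lab, []).append(i)
    return groups


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    n_min = min(len(idx) for idx in groups.values())
    keep = sorted(i for idx in groups.values() for i in rng.sample(idx, n_min))
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """ROS: duplicate rows of smaller classes until they match the largest."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    n_max = max(len(idx) for idx in groups.values())
    X_out, y_out = list(X), list(y)
    for lab, idx in groups.items():
        for i in rng.choices(idx, k=n_max - len(idx)):
            X_out.append(X[i])
            y_out.append(lab)
    return X_out, y_out


def balanced_sample_weights(y):
    """SW (assumed scheme): weight = n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return [len(y) / (len(counts) * counts[lab]) for lab in y]


def smote_like(X, y, minority=1, k=2, seed=0):
    """Simplified SMOTE for two classes: interpolate between a minority row
    and one of its k nearest minority neighbours (needs >= 2 minority rows)."""
    rng = random.Random(seed)
    minority_rows = [x for x, lab in zip(X, y) if lab == minority]
    n_new = len(y) - 2 * len(minority_rows)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        base = rng.choice(minority_rows)
        others = [r for r in minority_rows if r is not base]
        neighbours = sorted(
            others, key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r))
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()
        X_out.append([a + gap * (b - a) for a, b in zip(base, partner)])
        y_out.append(minority)
    return X_out, y_out


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class; 0.0 if TP is 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical toy data: 3-bit "fingerprints", 2 positives vs 6 negatives.
X = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0],
     [0, 0, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

X_rus, y_rus = random_undersample(X, y)
X_ros, y_ros = random_oversample(X, y)
X_sm, y_sm = smote_like(X, y)
weights = balanced_sample_weights(y)
```

Note that interpolating between binary fingerprint vectors, as the SMOTE-style function does here, yields fractional feature values rather than bits. Sample weighting differs from the other three methods in that it leaves the dataset unchanged and instead passes per-row weights to the learning algorithm.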

Full Text