Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets

Gabriel Idakwo, Joseph Luttrell, Huixiao Hong, Chaoyang Zhang, Yan Li, Sundar Thangapandian, Nan Wang, Zhaoxian Zhou, Ping Gong, Bei Yang

doi:10.1186/s13321-020-00468-x

Abstract

The specificity of toxicant-target biomolecule interactions contributes to the highly imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling this imbalance. However, removing inactive compound instances from the majority class by undersampling can result in information loss, whereas increasing the number of active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlap and a higher false prediction rate. In this study, to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples and then clean the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consists of 12 in vitro bioassays for more than 10,000 chemicals that are distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods: RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which the F1 score, Matthews correlation coefficient and Brier score provided a more consistent assessment of overall performance across the 12 datasets. Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods.
We also found a strong negative correlation between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.
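The two-step rebalancing idea described above (interpolate new minority instances, then remove instances that disagree with their neighbors) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline and not the imbalanced-learn implementation: the function names `smote` and `enn`, the two-dimensional toy coordinates, the choice of k = 3, and the majority-vote editing rule are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=3, rng=None):
    """SMOTE-style oversampling: interpolate between a minority point
    and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def enn(samples, labels, k=3):
    """ENN-style cleaning (assumed rule): drop every sample whose label
    differs from the majority label of its k nearest neighbors."""
    keep = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        rest.sort(key=lambda sl: math.dist(x, sl[0]))
        votes = Counter(l for _, l in rest[:k])
        if votes.most_common(1)[0][0] == y:
            keep.append(i)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 20 inactive (label 0) and 4 active (label 1) points.
inactive = [(float(i), float(j)) for i in range(5) for j in range(4)]
active = [(4.5, 4.5), (5.0, 5.5), (5.5, 4.8), (6.0, 5.2)]

# Step 1: oversample the minority class until both classes have equal size.
synthetic = smote(active, n_new=len(inactive) - len(active), k=3)
X = inactive + active + synthetic
y = [0] * len(inactive) + [1] * (len(active) + len(synthetic))

# Step 2: clean instances that disagree with their neighborhood.
clean_X, clean_y = enn(X, y, k=3)
print("before cleaning:", Counter(y))
print("after cleaning: ", Counter(clean_y))
```

Because every synthetic point lies on a segment between two existing minority points, oversampling alone can place new instances close to majority instances; the cleaning step is what removes such borderline cases. In practice a tested library implementation would be preferable to this sketch.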

Highlights

Structure–activity relationship (SAR) has been frequently used to predict the biological activities of chemicals from their molecular structures
In the Results and discussion, we present (1) a summary of the curated and preprocessed Toxicology in the 21st Century program (Tox21) dataset, (2) the preliminary comparative results to justify the selection of Random Forest (RF) as the base classifier, (3) parameter optimization for RF and Edited Nearest Neighbor (ENN) algorithms, (4) performance metrics of four classification methods for the twelve imbalanced Tox21 datasets, (5) the impact of imbalance ratio (IR) and classification methods on prediction performance, and (6) a comparison between this study and published Tox21 studies
The original raw Tox21 datasets contained more than 12,000 chemicals, of which approximately 50% or fewer were retained for each assay after preprocessing

Summary

Introduction

Structure–activity relationship (SAR) has been frequently used to predict the biological activities of chemicals from their molecular structures. Despite the vast size of chemical space, only a few compounds are likely to interact with a given target biomacromolecule and cause biological effects; these are labelled as active compounds, whereas the remaining majority are labelled as inactive compounds. This gives rise to a common problem of class imbalance in SAR-based predictive modeling, both in chemical classification and in activity quantification using machine learning approaches [3,4,5].
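To make the class-imbalance problem concrete, the sketch below computes the imbalance ratio (inactive count divided by active count) for a hypothetical assay and scores a trivial classifier that labels every compound inactive. The counts (10 actives, 280 inactives), the function names, and the convention of returning 0.0 when the MCC denominator is zero are assumptions for illustration, not values or code from the study.

```python
import math


def imbalance_ratio(labels):
    """IR = number of inactive (0) compounds / number of active (1) compounds."""
    n_active = sum(labels)
    n_inactive = len(labels) - n_active
    return n_inactive / n_active


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = active."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if undefined (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical assay: 10 active and 280 inactive compounds.
y_true = [1] * 10 + [0] * 280
# A trivial classifier that predicts "inactive" for every compound.
y_pred = [0] * len(y_true)

ir = imbalance_ratio(y_true)
counts = confusion_counts(y_true, y_pred)
print("IR:", ir)
print("accuracy:", round(accuracy(*counts), 3))
print("F1:", f1_score(*counts))
print("MCC:", mcc(*counts))
```

The trivial classifier reaches high accuracy while finding no active compound at all, which is why metrics such as the F1 score and MCC are more informative than accuracy on imbalanced data.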

Methods

Results

Conclusion