Abstract

Substances that do not degrade over time are harmful to the environment and dangerous to living organisms, so being able to predict the biodegradability of a substance without costly experiments is useful. Quantitative structure–activity relationship (QSAR) models have recently offered effective solutions to this problem. However, molecular descriptor datasets usually suffer from unbalanced class distributions, which adversely affect the efficiency and generalization of the derived models. Accordingly, this study validates the performance of balanced random trees (RTs) and boosted C5.0 decision trees (DTs) for constructing QSAR models that classify the ready biodegradability of substances, and assesses their ability to deal with unbalanced data. The balanced RTs algorithm builds individual trees using balanced bootstrap samples, while the boosted C5.0 DT is modeled using cost-sensitive learning. We employed the two-dimensional molecular descriptor dataset that is publicly available through the University of California, Irvine (UCI) machine learning repository. The molecular descriptors were ranked according to their contributions to the balanced RTs classification process. The performance of the proposed models was compared with previously reported results. Classification performance was analyzed in terms of accuracy, sensitivity, specificity, precision, false positive rate, false negative rate, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUROC). Based on these measures, the experimental results showed that the proposed models outperform the previously reported support vector machine (SVM), K-nearest neighbors (KNN), and discriminant analysis (DA) classifiers.
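Most of the classification measures listed in the abstract follow directly from the four cells of a binary confusion matrix. The following is a minimal sketch of those definitions, not the study's evaluation code. The function name `classification_measures` and the example counts are hypothetical, and the sketch assumes that every denominator is non-zero.

```python
def classification_measures(tp, fn, fp, tn):
    """Compute common binary-classification measures from confusion-matrix counts.

    tp / fn: positive-class samples classified correctly / incorrectly
    fp / tn: negative-class samples classified incorrectly / correctly
    Assumes every denominator below is non-zero.
    """
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # equals 1 - sensitivity
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for illustration only; these are not results from the study.
measures = classification_measures(tp=80, fn=20, fp=30, tn=170)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

On unbalanced data, accuracy alone can look favorable even when the minority class is poorly recognized, which is one reason to report sensitivity and specificity alongside it.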

Highlights

  • The objective of quantitative structure–activity relationship (QSAR) modeling is to discover the relationships between molecular structures and various physical, chemical, and biological activities [1,2]. Computationally, the molecular composition can be described by molecular descriptors that are mathematical representations of chemical information as follows: $A = f(x_1, x_2, \ldots, x_p)$

  • The molecular descriptor dataset was taken from the University of California, Irvine (UCI) machine learning repository

  • The training and testing data are unbalanced: the not-ready biodegradable (NRB) samples are about twice as numerous as the ready biodegradable (RB) samples
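The abstract states that the balanced RTs algorithm builds each tree from a balanced bootstrap sample, which is one way to counter the NRB/RB imbalance noted above. The following is a minimal sketch of one such sampling scheme, under the assumption that every class contributes as many draws (with replacement) as the minority class has samples. The source does not spell out the exact scheme used in the study, and the function name `balanced_bootstrap` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return sample indices with equal class counts, drawn with replacement.

    Assumption: each class contributes as many draws as the minority
    class has samples, so the majority class is effectively downsampled.
    """
    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    n_per_class = min(len(indices) for indices in by_class.values())

    sample = []
    for indices in by_class.values():
        # rng.choices draws with replacement, as in an ordinary bootstrap.
        sample.extend(rng.choices(indices, k=n_per_class))
    rng.shuffle(sample)
    return sample


# Toy labels with a 2:1 NRB-to-RB ratio (hypothetical data, not the UCI dataset).
labels = ["NRB"] * 20 + ["RB"] * 10
rng = random.Random(0)
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
print(counts)  # each class appears 10 times in the sample
```

In an ensemble, a fresh sample of this kind would be drawn for each tree, so that different trees see different subsets of the majority class.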


Summary

Introduction

The objective of quantitative structure–activity relationship (QSAR) modeling is to discover the relationships between molecular structures and various physical, chemical, and biological activities [1,2]. Computationally, the molecular composition can be described by molecular descriptors that are mathematical representations of chemical information as follows:

$$A = f(x_1, x_2, \ldots, x_p)$$

Int. J. Environ. Res. Public Health 2020, 17, 9322; doi:10.3390/ijerph17249322

