Abstract

Following the explosive growth in chemical and biological data, the shift from traditional methods of drug discovery to computer-aided approaches has made data mining and machine learning integral parts of today’s drug discovery process. In this paper, extreme gradient boosting (Xgboost), an ensemble of Classification and Regression Trees (CART) and a variant of the Gradient Boosting Machine, was investigated for predicting biological activity from quantitative descriptions of a compound’s molecular structure. Seven datasets that are well known in the literature were used, and the experimental results show that Xgboost can outperform machine learning algorithms such as Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) in predicting biological activities. In addition to detecting minority activity classes in highly imbalanced datasets, it showed remarkable performance on both high- and low-diversity datasets.

Highlights

  • Recent advancement in technology has been crucial to the explosive growth in the amount of chemical and biological data available in the public domain

  • The one-run definition of area under the curve (AUC) (Equation (11)), known as balanced accuracy and given by the average of sensitivity and specificity, has been used in this work, while sensitivity (SEN) (Equation (12)) and specificity (SPC) (Equation (13)) show how well a model identifies the positive and negative classes, respectively

  • This paper investigated the performance of Xgboost on bioactivity prediction and found that Xgboost is a robust predictive algorithm
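
The balanced-accuracy measure named in the highlights can be illustrated with a short sketch. It assumes the standard confusion-matrix definitions, SEN = TP / (TP + FN) and SPC = TN / (TN + FP), which are taken here to correspond to the paper's Equations (12) and (13); the function names and the toy labels are hypothetical and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """Fraction of actual positives predicted as positive (assumed form of SEN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives predicted as negative (assumed form of SPC)."""
    return tn / (tn + fp)


def balanced_accuracy(y_true, y_pred, positive=1):
    """Average of sensitivity and specificity (the one-run AUC in the text)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# Toy, made-up labels: 2 active compounds (1) among 10, i.e. an imbalanced set.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
plain_accuracy = (tp + tn) / len(y_true)
print(sensitivity(tp, fn), specificity(tn, fp))
print(balanced_accuracy(y_true, y_pred), plain_accuracy)
```

On this toy example, plain accuracy is 0.8 while balanced accuracy is 0.6875, because only one of the two active compounds is recovered. This is one reason a class-averaged measure is often preferred for highly imbalanced activity data.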


Summary

Introduction

Recent advances in technology have been crucial to the explosive growth in the amount of chemical and biological data available in the public domain. Data-driven drug discovery and development has attracted increased research interest in the last decade, with a view to designing, analyzing and applying effective learning methodologies to the rapidly growing data. By leveraging one of the important principles of chemical/molecular similarity [1], under which structurally similar compounds are expected to have similar biological activities and properties, approaches to drug design that screen large chemical databases have increased over the years. Virtual Screening (VS), the use of computational approaches and tools to search large databases for target or activity prediction, has notably shifted from traditional similarity searching based on reference compounds to the use of machine learning tools that learn from the massive data through training and then predict unknown activity. Support Vector Machines (SVM) [2,3], Decision Trees (DT) [4], Random Forest [5], K Nearest
