Abstract

Drug discovery is a long and expensive process because of its extensive steps and trials. Over the last decade, and under added pressure from the COVID-19 pandemic, advances in computational technology, including the application of Machine Learning, have come to assist the screening process. Classification has become one of the major Machine Learning approaches to drug discovery. Unfortunately, this practice relies on discretized labels, which can discard quantitative properties that may be meaningful. In this paper, we therefore compare various Machine Learning regression algorithms for predicting inhibitory bioactivity, specifically the IC50 value, with the SARS-CoV-2 Replicase Polyprotein 1ab as the target. Using 1,138 non-duplicated data points downloaded from the ChEMBL database and engineered into four dataset variants, we applied 42 regression algorithms to the prediction task. We found that using regression algorithms to predict bioactivity poses computational challenges: only a handful of algorithms, and only on a specific dataset variant, returned valid performance parameters upon testing. The three algorithms that yielded the highest counts of valid performance parameters are the Histogram Gradient Boosting Regressor (HGBR), the Light Gradient Boosting Machine Regressor (LGBR), and Random Forest Regression (RFR). Further statistical analyses show no significant difference among these three algorithms except in the time taken to train and test the model, where the LGBR excels. These three algorithms should therefore be the primary candidates for studies of the same nature.
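
The concern about discretized labels can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the compound names, the IC50 values, and the 1000 nM activity cutoff are hypothetical and are not taken from the ChEMBL data used in this study.

```python
# Hypothetical IC50 values in nM (illustrative only, not from the study's dataset).
ic50_nm = {
    "cmpd_A": 12.0,
    "cmpd_B": 950.0,
    "cmpd_C": 1050.0,
    "cmpd_D": 48000.0,
}

# Assumed cutoff for a binary "active"/"inactive" label (illustrative only).
ACTIVE_CUTOFF_NM = 1000.0


def discretize(ic50: float) -> str:
    """Map a continuous IC50 value to a binary class label."""
    return "active" if ic50 <= ACTIVE_CUTOFF_NM else "inactive"


def fold_difference(a: str, b: str) -> float:
    """Ratio of the larger IC50 to the smaller one for two compounds."""
    high = max(ic50_nm[a], ic50_nm[b])
    low = min(ic50_nm[a], ic50_nm[b])
    return high / low


# Class labels as a classifier would see them.
labels = {name: discretize(value) for name, value in ic50_nm.items()}

for name, value in ic50_nm.items():
    print(f"{name}: IC50 = {value} nM, label = {labels[name]}")

# Same label, very different potency.
print("A vs B fold difference:", round(fold_difference("cmpd_A", "cmpd_B"), 2))
# Different labels, nearly identical potency.
print("B vs C fold difference:", round(fold_difference("cmpd_B", "cmpd_C"), 2))
```

In this toy case, two compounds with widely different potency share one label, while two compounds with nearly identical potency fall on opposite sides of the cutoff. A regression target keeps that quantitative information.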
