Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms

Sahar K Hussin, Adel Alkhalil, Mahmoud I Marie, Yasser M Omar, Rabie A Ramadan, Salah M Abdelmageid

doi:10.1155/2021/6675279

Abstract

Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous amounts of data and suffers from drawbacks such as high dimensionality and class imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for the minority class. When a dataset is classified without considering its imbalanced nature, most classification methods tend to achieve high predictive accuracy for the majority category but significantly poorer accuracy for the minority category. The paper proposes the K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was run on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared with both a no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
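The gap between majority and minority accuracy described in the abstract can be made concrete with per-class recall. The sketch below is an illustration only: the class sizes (990 inactive, 10 active compounds), the always-majority classifier, and the helper name `per_class_recall` are assumptions chosen for the example, and the metrics shown are not necessarily the ones used in the paper's evaluation.

```python
def per_class_recall(y_true, y_pred):
    """Fraction of each class's samples that were predicted correctly."""
    recall = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recall[label] = hits / total
    return recall


# Hypothetical screen: 990 inactive compounds (0) and 10 active ones (1).
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

print(accuracy)           # 0.99
print(recall)             # {0: 1.0, 1: 0.0}
print(balanced_accuracy)  # 0.5
```

Overall accuracy looks excellent even though no active compound is found, which is why class-wise measures are the more informative view on imbalanced screening data.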

Highlights

Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process
The accuracy was significantly poor for the minority category. The paper proposes the K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named K-Mean Synthetic Minority Oversampling Technique (KSMOTE).
Current classification models such as k-nearest neighbor (KNN), random forest (RF), multilayer perceptron (MLP), support vector machine (SVM), decision tree (DT), logistic regression (LG), and gradient boosting (GBT) depend on a sufficient, representative, and reasonably balanced collection of training data to draw an approximate boundary for decision-making between different groups. These learning algorithms are utilized in a variety of fields, including financial forecasting and text classification [9].
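The highlights name KSMOTE only as K-means coupled with SMOTE and do not spell out how the two steps interact. The following is a minimal sketch of one plausible reading, not the authors' implementation: cluster all samples with k-means, then apply SMOTE to the minority samples inside each cluster until that cluster is balanced. The function names (`kmeans`, `smote`, `ksmote`), the per-cluster balancing rule, the parameter defaults, and the toy data are all assumptions made for illustration. The paper's experiments ran on Apache Spark, whereas this sketch uses only the Python standard library.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def smote(minority, n_new, k_neighbors=3, seed=0):
    """SMOTE: interpolate between a minority point and a near minority neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k_neighbors]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def ksmote(X, y, minority_label=1, k_clusters=2, seed=0):
    """Assumed combination: balance each k-means cluster with SMOTE."""
    clusters = kmeans(X, k_clusters, seed=seed)
    X_out, y_out = list(X), list(y)
    for c in range(k_clusters):
        idx = [i for i, lab in enumerate(clusters) if lab == c]
        minority = [X[i] for i in idx if y[i] == minority_label]
        deficit = (len(idx) - len(minority)) - len(minority)
        # SMOTE needs at least two minority points to interpolate between.
        if len(minority) >= 2 and deficit > 0:
            X_out.extend(smote(minority, deficit, seed=seed + c))
            y_out.extend([minority_label] * deficit)
    return X_out, y_out


# Toy data (hypothetical): two blobs, each mostly majority with a few minority points.
rng = random.Random(42)
X = [(rng.gauss(m, 1.0), rng.gauss(m, 1.0)) for m in (0.0, 6.0) for _ in range(20)]
y = [0] * 40
X += [(rng.gauss(m, 0.5), rng.gauss(m, 0.5)) for m in (0.0, 6.0) for _ in range(3)]
y += [1] * 6

X_bal, y_bal = ksmote(X, y)
print("minority before:", y.count(1), "after:", y_bal.count(1))
```

Clustering first keeps synthetic points local: each new sample is interpolated between minority neighbors from the same cluster, rather than between minority points drawn from distant regions of the descriptor space.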

Summary

Related Work

For the paper to be self-contained, this section reviews the work most related to virtual screening (VS) research: techniques, problems, and state-of-the-art solutions. Several approaches have been proposed in the literature to handle big data classification, including classification algorithms such as random forest, decision tree, multilayer perceptron, logistic regression, and gradient boosting. In a cost-sensitive survey, SVM and Tree C4.5 (J48) performed well when minority group sizes were taken into account. Prior work also shows that a hybrid of majority class undersampling and SMOTE can improve overall classification performance on an imbalanced dataset. These findings indicate the quality of the proposed approach for managing the intrinsic imbalance problem in high-throughput screening (HTS) data used to cluster possible interference compounds for virtual screening with luciferase-based HTS experiments. Such predictive models can help in evaluating cell-based screening outcomes in general by bringing feature-based data into such datasets, and they could be used to recognize and remove potentially undesirable compounds. Although some of the approaches proposed in the literature have partly succeeded in addressing the imbalance in PubChem datasets, they still lack time efficiency during calculations.
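The hybrid of majority class undersampling and SMOTE mentioned above can be illustrated through its first half. The sketch below is an assumption-laden example, not a method taken from the cited work: it randomly undersamples the majority class to a chosen size and then reports how many synthetic minority samples a subsequent SMOTE step would need to generate to reach balance. The helper name `undersample_majority`, the `keep` parameter, and the class sizes are hypothetical.

```python
import random


def undersample_majority(X, y, majority_label, keep, seed=0):
    """Randomly keep `keep` majority samples; minority samples are untouched."""
    rng = random.Random(seed)
    maj_idx = [i for i, lab in enumerate(y) if lab == majority_label]
    kept = set(rng.sample(maj_idx, min(keep, len(maj_idx))))
    rows = [i for i, lab in enumerate(y) if lab != majority_label or i in kept]
    return [X[i] for i in rows], [y[i] for i in rows]


# Hypothetical data: 1000 inactive compounds (0) and 20 active ones (1).
X = [(float(i),) for i in range(1020)]
y = [0] * 1000 + [1] * 20

X_u, y_u = undersample_majority(X, y, majority_label=0, keep=200)
# Number of synthetic minority samples SMOTE would still have to create.
smote_budget = y_u.count(0) - y_u.count(1)
print(y_u.count(0), y_u.count(1), smote_budget)  # 200 20 180
```

Undersampling first shrinks the amount of synthetic data that oversampling has to invent (180 samples here instead of 980), at the cost of discarding some majority examples.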

Proposed KSMOTE Framework

Results

Evaluation Metrics

Discussion

Conclusion

Journal: Complexity	Publication Date: Jan 28, 2021
Citations: 10	License type: CC BY 4.0

Similar Papers

An Empirical Investigation of Virtual Screening
Amir Ali Rafati-Afshar ... Abdelhamid Bouchachia
01 Oct 2013

Artificial intelligence to deep learning: machine intelligence approach for drug discovery.
Rohan Gupta ... Pravir Kumar
Molecular diversity | VOL. 25
12 Apr 2021

Automated semiconductor wafer defect classification dealing with imbalanced data
Po-Hsuan Lee ... Zhe Wang
20 Mar 2020

An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data
Ming Hao ... Stephen H Bryant
Analytica Chimica Acta | VOL. 806
06 Nov 2013
