Abstract

The prediction of Compound-Protein Interactions (CPI) is an essential step in drug-target analysis, both for developing new drugs and for drug repositioning. One challenge in this field is that non-interacting compound-protein pairs usually outnumber interacting pairs. This imbalance causes bias, which may degrade CPI prediction. Moreover, little research on CPI prediction has compared data sampling techniques for handling the class imbalance problem. To address this issue, we compare four data sampling techniques: Random Under-sampling (RUS), Combination of Over-Under-sampling (COUS), Synthetic Minority Over-sampling Technique (SMOTE), and Tomek Link (T-Link). Two benchmark CPI datasets, Nuclear Receptor and G-Protein Coupled Receptor (GPCR), are used to test these techniques. The Area Under the Curve (AUC) is used to evaluate the CPI prediction performance of each technique. Results show that the AUC values for RUS, COUS, SMOTE, and T-Link are 0.75, 0.77, 0.85, and 0.79, respectively, on the Nuclear Receptor data and 0.70, 0.85, 0.91, and 0.72, respectively, on the GPCR data. SMOTE thus achieves the highest AUC on both datasets, which indicates that it handles the class imbalance problem in CPI prediction better than the other three techniques.
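
As a rough illustration of two of the techniques compared above, the following Python sketch applies RUS and a simplified SMOTE to a toy feature matrix. The function names, the toy data, and the choice of k = 2 neighbours are assumptions made for this illustration and are not taken from the study; COUS and T-Link are not shown.

```python
import random


def random_undersample(X, y, seed=0):
    """RUS: randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def smote(X, y, k=2, seed=0):
    """Simplified SMOTE: add synthetic minority rows by interpolating between
    a minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    minority, min_label = (pos, 1) if len(pos) <= len(neg) else (neg, 0)
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    n_new = abs(len(pos) - len(neg))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Rank the other minority rows by squared Euclidean distance to base.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return X + synthetic, y + [min_label] * n_new


# Toy data (hypothetical): 3 interacting pairs (1) vs 7 non-interacting pairs (0).
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
     [0.90, 0.80], [0.80, 0.90], [0.85, 0.95], [0.70, 0.90],
     [0.95, 0.70], [0.75, 0.75], [0.90, 0.90]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

X_rus, y_rus = random_undersample(X, y)
X_sm, y_sm = smote(X, y)
print("RUS class counts:  ", y_rus.count(1), y_rus.count(0))
print("SMOTE class counts:", y_sm.count(1), y_sm.count(0))
```

RUS balances the classes by discarding majority rows, whereas SMOTE keeps every original row and adds interpolated minority rows, so the two techniques leave the classifier with very different amounts of training data.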

Highlights

  • The identification of Compound-Protein Interaction (CPI) plays a key role in the development of drugs and herbal medicines

  • An experiment has shown that Tomek Link (T-Link) can improve classification performance on arterial blood pressure and Ecoli2 datasets (Elhassan et al., 2017). Based on those three studies, we conclude that Random Under-sampling (RUS), Synthetic Minority Over-sampling Technique (SMOTE), and T-Link are suitable sampling techniques for handling class imbalance in Compound-Protein Interactions (CPI)

  • Result and Discussion: Figures 3 and 4 show the CPI prediction results, evaluated with Receiver Operating Characteristic (ROC) curves, for the Bipartite Local Model (BLM) combined with each data sampling technique (RUS, Combination of Over-Under-sampling (COUS), SMOTE, and Tomek Link (T-Link)) on two Yamanishi datasets, i.e., Nuclear Receptor and G-Protein Coupled Receptor (GPCR)


Summary

INTRODUCTION

The identification of Compound-Protein Interaction (CPI) plays a key role in the development of drugs and herbal medicines. Many medicinal properties of herbal formulas cannot be predicted by IJAH because of a lack of CPI data. To solve this problem, a previous study by Kurnia (2017) predicted CPI in IJAH using the Bipartite Local Model–Neighbor Interaction profile Inferring (BLMNII). An experiment has shown that Tomek Link (T-Link) can improve classification performance on arterial blood pressure and Ecoli datasets (Elhassan et al., 2017). Based on those three studies, we conclude that RUS, SMOTE, and T-Link are suitable sampling techniques for handling class imbalance in CPI. After the CPI matrix has been balanced with a data sampling technique, it may have missing values in the interacting class caused by duplication or reduction. To overcome this problem, we use k-Nearest Neighbors (k-NN) to impute the missing values. AUC has proven to be a reliable performance measure for class imbalance problems (Fawcett, 2004).
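
To make the point about AUC and class imbalance concrete, here is a minimal Python sketch that computes AUC as a rank statistic: the fraction of (interacting, non-interacting) pairs in which the interacting example receives the higher score, with ties counted as one half. The function name and the toy labels and scores are assumptions made for this illustration; they are not data from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive example is scored
    above a random negative example (ties count as one half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example (hypothetical): 2 interacting vs 6 non-interacting pairs.
labels = [1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.7, 0.4, 0.05]

# A model that always predicts "non-interacting" gives every pair the same score.
constant_scores = [0.0] * len(labels)
accuracy_all_negative = labels.count(0) / len(labels)

print("AUC of the toy scores:       ", auc(labels, scores))
print("AUC of the constant model:   ", auc(labels, constant_scores))
print("Accuracy of the constant model:", accuracy_all_negative)
```

In this toy case the constant model reaches 75% accuracy simply because most pairs are non-interacting, yet its AUC stays at 0.5, which is why AUC is the more informative measure when the classes are imbalanced.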

MATERIALS AND METHODS
RESULT
Findings
CONCLUSION
