Abstract

Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interacting residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, in which binding residues are far fewer than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of machine-learning-based predictors on class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results on protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
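
To make the idea of synthesizing minority class samples concrete, the following is a minimal sketch of a generic interpolation-based over-sampler in the style of SMOTE. It is not the supervised over-sampling algorithm proposed in this study, whose details are not given in the abstract. The function name `oversample_minority`, the parameter `k`, the toy two-dimensional feature vectors, and the assumed count of 40 non-binding residues are all hypothetical and chosen only for illustration.

```python
import random


def oversample_minority(minority, n_new, k=3, seed=0):
    """Synthesize n_new samples by interpolating between a randomly chosen
    minority sample and one of its k nearest minority neighbours.

    Generic SMOTE-style sketch; NOT the SOS algorithm of the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k closest other minority samples to the chosen base sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy data (hypothetical): 4 binding residues versus 40 non-binding residues.
binding = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9], [0.95, 0.85]]
non_binding_count = 40

new_samples = oversample_minority(binding, n_new=non_binding_count - len(binding))
balanced_minority = binding + new_samples
print(len(balanced_minority), "binding samples after over-sampling")
```

Each synthetic sample lies on the line segment between two real minority samples, so the minority class grows without leaving the region it already occupies. A supervised scheme such as the one proposed here would presumably add a step that decides which synthetic samples to keep, but that step is not sketched because the abstract does not describe it.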

Highlights

  • Protein-ligand interactions are ubiquitous in virtually all biological processes [1,2,3], and the prediction of protein-ligand interactions using automated computational methods has been an area of intense research in bioinformatics fields [4,5,6,7,8,9,10,11,12,13,14,15]

  • The second dataset [14], NUC5, is a multiple nucleotide-interacting dataset that consists of five training sub-datasets, each for a specific type of nucleotide; NUC5 consists of 227, 321, 140, 56, and 105 protein sequences that interact with five types of nucleotides, i.e., ATP, ADP, AMP, GTP, and GDP, respectively, and the maximal pairwise identity of the sequences of each of the five sub-datasets is less than 40%

  • Feature Representation and Classifier: The main purpose of this study is to demonstrate the feasibility of the proposed supervised over-sampling (SOS) algorithm and its effectiveness in protein-nucleotide binding residue prediction

Introduction

Protein-ligand interactions are ubiquitous in virtually all biological processes [1,2,3], and the prediction of protein-ligand interactions using automated computational methods has been an area of intense research in bioinformatics fields [4,5,6,7,8,9,10,11,12,13,14,15]. Nucleotides (e.g., ATP, ADP, AMP, GDP, and GTP) play critical roles in various metabolic processes, such as providing chemical energy, signaling, and replication and transcription of DNA [10,11,12,13,14,15]. Protein-nucleotide (e.g., protein-ATP) binding residues are considered valuable targets of therapeutic drugs [12]. Accurate identification of nucleotide-binding residues in protein sequences is of significant importance for protein function analysis and drug design [16], especially in the post-genomic era, as large volumes of protein data have not been functionally annotated.
