Abstract

DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activities. However, identifying DBPs with wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs based solely on protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden Markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, four base classifiers are trained on the HMM-based composition features. In the second stage, the prediction probabilities of these base classifiers are used as inputs to a meta-classifier, which performs the final prediction of DBPs. On the PDB1075 benchmark dataset, a jackknife cross-validation of the proposed PredDBP-Stack predictor yields a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This outcome outperforms most of the existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. These results demonstrate that PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.
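
To make the feature representation concrete, the sketch below computes AAC and TPC from a profile matrix. It is a minimal illustration and not the authors' code: it assumes the HMM profile has already been converted to an L x 20 matrix of non-negative values (one row per residue, one column per amino acid), and it uses one common formulation of TPC in which each transition term is normalized over its source column. The function names `aac`, `tpc`, and `aatp` are hypothetical.

```python
from typing import List

Profile = List[List[float]]  # assumed shape: L rows x 20 columns


def aac(profile: Profile) -> List[float]:
    """Amino acid composition: the mean of each profile column."""
    if not profile:
        raise ValueError("profile must contain at least one row")
    length = len(profile)
    width = len(profile[0])
    return [sum(row[j] for row in profile) / length for j in range(width)]


def tpc(profile: Profile) -> List[float]:
    """Transition probability composition.

    raw[i][j] sums profile[k][i] * profile[k + 1][j] over adjacent
    positions k, k + 1. Each row of raw is then divided by its own sum
    (assumed normalization), giving width * width features.
    """
    if not profile:
        raise ValueError("profile must contain at least one row")
    width = len(profile[0])
    raw = [
        [
            sum(profile[k][i] * profile[k + 1][j] for k in range(len(profile) - 1))
            for j in range(width)
        ]
        for i in range(width)
    ]
    features: List[float] = []
    for row in raw:
        total = sum(row)
        # Guard against an all-zero row (for example, a one-residue profile).
        features.extend(v / total if total else 0.0 for v in row)
    return features


def aatp(profile: Profile) -> List[float]:
    """Concatenate AAC (20 values) and TPC (400 values) into one vector."""
    return aac(profile) + tpc(profile)
```

With a 20-column profile, `aatp` returns a 420-dimensional vector (20 AAC values followed by 400 TPC values), which would then be passed to the base classifiers in the first stage.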

Highlights

  • DNA-binding proteins (DBPs) are fundamental in the process of composing DNA and regulating genes

  • Based on the AATP-hidden Markov model (HMM) feature representation, we first analyze the predictive power of six classifiers employed at the base level of stacking: decision tree (DT), k-nearest neighbor (KNN), logistic regression (LR), XGBoost (XGB), random forest (RF), and support vector machine (SVM)

  • To keep the figure clear and legible, only three receiver operating characteristic (ROC) curves, for the LR, DT, and SVM models, are shown in Figure 3; they are consistent with the results in Table 1
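
A ROC curve is often summarized by the area under it (AUC). The sketch below shows one way to compute AUC directly from predicted scores, as the fraction of positive-negative pairs that are ranked correctly, with ties counted as half. It is an illustration only: the function name `roc_auc` and the toy inputs are hypothetical, and the labels are assumed to be 1 for DBPs and 0 for non-DBPs.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking.

    Counts how often a positive example receives a higher score than a
    negative one; a tie contributes half a count.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: one of the four positive-negative pairs is ranked wrongly.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a benchmark of about a thousand proteins; a sort-based version would be preferable for much larger sets.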

Introduction

DNA-binding proteins (DBPs) are fundamental in the process of composing DNA and regulating genes. They execute intercellular and intracellular functions such as transcription, DNA replication, recombination, modification, and other biological activities associated with DNA [1]. With the rapid development of high-throughput sequencing technology and the rapid growth of protein sequence data, more efficient and accurate machine learning (ML) methods have been developed and applied to the classification of DBPs [4, 5]. Structure-based methods rely on structural information about proteins, such as spatial distribution, net charge, electrostatic potential, dipole moment, and quadrupole moment tensors [6, 7]. The great difficulty of acquiring high-resolution crystal structures and the small number of proteins with known structures heavily limit the use of structure-based predictors [8].
