Abstract

Developing an efficient method for determination of the DNA-binding proteins, due to their vital roles in gene regulation, is becoming highly desired since it would be invaluable to advance our understanding of protein functions. In this study, we proposed a new method for the prediction of the DNA-binding proteins, by performing the feature rank using random forest and the wrapper-based feature selection using forward best-first search strategy. The features comprise information from primary sequence, predicted secondary structure, predicted relative solvent accessibility, and position specific scoring matrix. The proposed method, called DBPPred, used Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers, including decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. As a result, the proposed DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 according to the five-fold cross validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 by the proposed model trained on the entire PDB594 dataset and by other five existing methods (including iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader) were performed, resulting in that the proposed DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. The independent tests performed by the proposed DBPPred on completely a large non-DNA binding protein dataset and two RNA binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that majority of the selected features by the proposed method are statistically significantly different between the mean feature values of the DNA-binding and the non DNA-binding proteins. All of the experimental results indicate that the proposed DBPPred can be an alternative perspective predictor for large-scale determination of DNA-binding proteins.

Highlights

  • DNA-binding proteins play key roles in a wide variety of molecular functions, including recognizing specific nucleotide sequences, maintenance of cellular DNA, transcriptional and translational regulation, DNA replication, and DNA damage repair [1,2,3]

  • Performance of the Proposed Method The proposed method, called DBPPred, was implemented by ranking features using random forest algorithm and selecting features using forward best-first search strategy based on Gaussian naıve Bayes

  • Total of 300 features according to the feature rank were input to the subsequent feature selection, and each step in the forward best-first search by adding one remaining feature was performed based on 5 CV with 10 runs

Read more

Summary

Introduction

DNA-binding proteins play key roles in a wide variety of molecular functions, including recognizing specific nucleotide sequences, maintenance of cellular DNA, transcriptional and translational regulation, DNA replication, and DNA damage repair [1,2,3]. Sequence based prediction methods for DNA-binding proteins have been investigated with several classifiers such as logistic regression [24], random forest [8,22], support vector machine [20,21,23,25,26,27,28,29,30], and threading based method [19], using various features including (pseudo) amino acid composition [8,20,22,23,24,25,26,27,28,29,30], physicochemical properties [20,21,22,28], predicted secondary structure [20,22,28], predicted solvent accessibility [28], evolutionary profile [20,23], and their various transformations

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call