Improved DNA-Binding Protein Identification by Incorporating Evolutionary Information Into the Chou’s PseAAC

Xiangzheng Fu, Lijun Cai, Wen Zhu, Jialiang Yang, Bo Liao, Lihong Peng

doi:10.1109/access.2018.2876656

Abstract

DNA-binding proteins play critical roles in various cellular biological processes, such as gene expression and transcription. However, experimental methods for identifying these proteins, such as ChIP-sequencing, are expensive and time-consuming, which motivates in silico methods, especially machine learning-based ones. In recent years, the accuracy of machine learning-based DNA-binding protein prediction has increased significantly. However, some critical problems remain unsolved, such as how to convert protein sequences into an appropriate discrete model or vector. In this paper, we propose a novel feature construction method based on a position-specific scoring matrix (PSSM), named K-PSSM-Composition. The proposed features efficiently capture evolutionary information about the 20 amino acid residues and the local information of a given sequence. We perform recursive feature elimination to extract the optimal set of features, which are then used to train a support vector machine model for predicting DNA-binding proteins. We evaluate our proposed predictor and compare it with other advanced predictors on two standard benchmark data sets. The proposed method achieves accuracies of 89.77% and 88.71% on the jackknife test and the independent test, respectively, outperforming the compared methods. This finding demonstrates the effectiveness of the proposed method in predicting DNA-binding proteins. The source code and data are available at https://github.com/Excelsior511/DNA-Binding-Proteins.
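
To make the idea of a PSSM-derived composition feature concrete, the following is a minimal sketch of the plain PSSM-composition encoding, in which the PSSM rows are grouped by residue type, summed, and divided by the sequence length to give a fixed 400-dimensional vector. It is an illustration only and is not the K-PSSM-Composition method proposed in the paper, whose definition is not given in the abstract. The function name `pssm_composition`, the amino acid ordering, and the toy PSSM values are assumptions made for this example.

```python
# Plain PSSM-composition sketch (assumption: NOT the paper's K-PSSM-Composition).
# The amino acid order below is an illustrative choice, not taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_composition(sequence, pssm):
    """Return a 400-dimensional vector from an L x 20 PSSM.

    For each of the 20 residue types, the PSSM rows at positions holding that
    residue are summed column-wise, then divided by the sequence length L.
    """
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    if length != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")

    # One 20-element accumulator per residue type.
    sums = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    for residue, row in zip(sequence, pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must have 20 scores")
        if residue not in sums:
            continue  # skip non-standard residues such as 'X'
        acc = sums[residue]
        for j, score in enumerate(row):
            acc[j] += score

    # Flatten the 20 x 20 table row by row and normalize by sequence length.
    return [value / length for aa in AMINO_ACIDS for value in sums[aa]]


# Toy input (hypothetical values): a 3-residue sequence with constant rows.
toy_sequence = "ACA"
toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]
features = pssm_composition(toy_sequence, toy_pssm)
```

A vector of this kind has the same length for every protein regardless of sequence length, which is what allows it to be passed to feature selection and a classifier such as a support vector machine.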

Full Text