DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues.

Xin Ma,Xiao Sun,Jing Guo,Bin Liu

doi:10.1371/journal.pone.0167345

Open Access

https://doi.org/10.1371/journal.pone.0167345

Journal: PLoS ONE
Publication Date: Dec 1, 2016
Citations: 32
License type: CC BY 4.0

Affiliation: Nanjing Audit University, Southeast University

Abstract

DNA-binding proteins are fundamentally important in cellular processes. Several computational methods have been developed in recent years to improve the prediction of DNA-binding proteins. However, insufficient work has been done on the prediction of DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using a random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect (i) the conservation of physicochemical properties of the amino acids and (ii) the binding propensity of DNA-binding residues and the non-binding propensity of non-binding residues. Comparisons across the individual features demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during model construction. The results showed that the DNABP model achieved 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. The high prediction accuracy and the performance comparisons with previous research suggest that DNABP could be a useful approach for identifying DNA-binding proteins from sequence information. The DNABP web server is freely available at http://www.cbi.seu.edu.cn/DNABP/.
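The four performance measures quoted in the abstract can all be derived from the confusion matrix of a binary classifier. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from the paper.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: DNA-binding proteins predicted as binding / non-binding.
    tn/fp: non-binding proteins predicted as non-binding / binding.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the function.
print(binary_metrics(tp=8, fn=2, tn=9, fp=1))
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is commonly reported alongside sensitivity and specificity.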

Highlights

DNA-protein interactions play significant roles in various biological processes, such as gene regulation, DNA replication and repair, transcription and other biological activities associated with DNA [1,2,3].
The hybrid feature comprises 64 features selected with the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS). The candidates were the position-specific scoring matrix (PSSM)-PP features, DNA-binding propensity measures obtained from the information of DNA-binding residues, non-binding propensity measures obtained from the information of non-binding residues, and physicochemical property features.
Different DNA-binding protein prediction models were constructed on the main dataset (Mainset) using random forest (RF) classifiers with various features.
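The mRMR-plus-IFS procedure mentioned in the highlights can be sketched generically as follows. This is a minimal illustration and not the authors' implementation: it assumes discrete feature values, uses the mutual-information-difference form of mRMR, and takes the model scorer as a caller-supplied function. All names (`mutual_information`, `mrmr_rank`, `incremental_feature_selection`, `toy_accuracy`) and the toy data are hypothetical. In practice the scorer would be something like the cross-validated accuracy of a random forest trained on the chosen features.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(columns, labels):
    """Greedily rank features by relevance minus mean redundancy (mRMR, MID form)."""
    relevance = [mutual_information(col, labels) for col in columns]
    remaining = set(range(len(columns)))
    ranked = []
    while remaining:
        def score(j):
            if not ranked:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in ranked
            ) / len(ranked)
            return relevance[j] - redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def incremental_feature_selection(ranked, evaluate):
    """Score the top-k prefixes of the ranking and return the best (subset, score)."""
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        current = evaluate(subset)
        if current > best_score:  # strict: prefer the smaller subset on ties
            best_subset, best_score = subset, current
    return best_subset, best_score


# Toy data (hypothetical): column 1 duplicates column 0, column 2 is independent of it.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]


def toy_accuracy(subset):
    """Stand-in scorer: predict 1 if any selected feature is 1."""
    preds = [int(any(columns[j][i] for j in subset)) for i in range(len(labels))]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


ranked = mrmr_rank(columns, labels)
best_subset, best_score = incremental_feature_selection(ranked, toy_accuracy)
print(ranked, best_subset, best_score)
```

In this toy run the duplicated column is ranked last because its redundancy with the already selected column cancels its relevance, which is the behaviour mRMR is designed to produce. IFS then keeps the shortest prefix of the ranking that reaches the best score.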

Summary

Introduction

DNA-protein interactions play significant roles in various biological processes, such as gene regulation, DNA replication and repair, transcription and other biological activities associated with DNA [1,2,3]. Identification of DNA-binding proteins is fundamentally important for understanding how proteins interact with DNA. DNA-binding proteins can be identified by many experimental techniques, such as chromatin immunoprecipitation on microarrays, X-ray crystallography and nuclear magnetic resonance (NMR).

Methods

Results

Conclusion