PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context.

Jiyun Zhou, Hongpeng Wang, Ruifeng Xu, Yulan He, Qin Lu, Bing Kong

doi:10.1038/srep27653


Open Access

https://doi.org/10.1038/srep27653


Abstract

Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most existing computational approaches use only the sequence context of the target residue for prediction. In the present study, for each target residue, we used both the spatial context and the sequence context to construct the feature space. Latent Semantic Analysis (LSA) was then applied to remove redundancies in the feature space. Finally, a predictor (PDNAsite) was developed by integrating a support vector machine (SVM) classifier with ensemble learning. Results on the PDNA-62 and PDNA-224 datasets demonstrate that features extracted from the spatial context provide more information than those from the sequence context, and that combining them yields a further performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that interactions between neighboring binding sites are important for protein-DNA recognition and binding ability. A comparison between PDNAsite and existing methods indicates that PDNAsite outperforms most of them and is a useful tool for DNA-binding site identification. A web server for our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely accessible to the biological research community.

Highlights

Protein-DNA interactions play important roles in a wide range of fundamental biological processes such as gene regulation, transcription, DNA replication, DNA repair and DNA packaging[1,2,3,4,5].
0.034 area under ROC curve (AUC) when Latent Semantic Analysis (LSA) was applied on the whole feature space, while the prediction performance increased by 0.018 Matthews Correlation Coefficient (MCC), 0.83% ST and 0.008 AUC when LSA was applied on the feature subspace spanned by position-specific scoring matrix (PSSM) features.
DBindR, BindN, Dp-bind, BindN-RF, BindN+ and DNABR are predictors trained on sequence information only, whereas DNABINDPROT, PreDNA, DNABind and our proposed PDNAsite are built on both sequence information and structural information.

Summary

Introduction

Protein-DNA interactions play important roles in a wide range of fundamental biological processes such as gene regulation, transcription, DNA replication, DNA repair and DNA packaging[1,2,3,4,5]. Wang et al.[14] constructed a DNA-binding site classifier using evolutionary information in the form of PSSM and several new sequence descriptors, including the BLAST-based conservation score and the mean and standard deviation of biochemical feature values. Ahmad and his coworkers[15] developed a DNA-binding site predictor based on Artificial Neural Networks (ANNs) using only evolutionary information in the form of PSSM. Kuznetsov et al.[21] developed an SVM predictor for the identification of DNA-binding sites using several categories of structure and sequence information, including PSSM, BLOSUM62, solvent accessibility, and secondary structure. Tjong and his coworkers[22] constructed a DNA-binding site predictor, DISPLAR, by training an ANN classifier on solvent accessibility and evolutionary information. To the best of our knowledge, this is the best-performing predictor up to now.

Methods
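
The abstract describes building features for each target residue from both its sequence context and its spatial context. The minimal sketch below illustrates one way to select those two neighborhoods. It is not the authors' implementation: the function names, the half-window size, the choice of k nearest residues, the use of one coordinate per residue, and the exclusion of the target itself are all assumptions made for illustration.

```python
import math


def sequence_context(n_residues, target, half_window=5):
    """Indices of residues within `half_window` chain positions of `target`.

    Assumption: a symmetric sliding window, truncated at the chain ends,
    with the target residue itself excluded.
    """
    start = max(0, target - half_window)
    end = min(n_residues, target + half_window + 1)
    return [i for i in range(start, end) if i != target]


def spatial_context(coords, target, k=10):
    """Indices of the `k` residues closest to `target` in 3D space.

    Assumption: `coords` holds one (x, y, z) point per residue, for
    example a C-alpha position; ties are broken by residue index.
    """
    dists = [
        (math.dist(coords[target], c), i)
        for i, c in enumerate(coords)
        if i != target
    ]
    dists.sort()
    return [i for _, i in dists[:k]]


# Toy hairpin: residues 0 and 7 are far apart in sequence but adjacent in space.
toy_coords = [
    (0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0),
    (3, 1, 0), (2, 1, 0), (1, 1, 0), (0, 1, 0),
]
seq_neighbors = sequence_context(len(toy_coords), target=0, half_window=2)
spa_neighbors = spatial_context(toy_coords, target=0, k=3)
```

In the toy hairpin, the spatial context of residue 0 picks up residues from the opposite strand that a sequence window of the same size would miss, which is the kind of extra information the abstract attributes to spatial features.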

Results
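
The Highlights report performance changes in MCC and ST. As a reference for reading those numbers, the sketch below computes both from a confusion matrix. The MCC formula is the standard one; defining ST as the mean of sensitivity and specificity is an assumption based on common usage in DNA-binding site prediction, and the function name and the zero-denominator handling are illustrative choices.

```python
import math


def binding_site_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, ST and MCC for a binary predictor.

    Assumption: ST ("strength") is the average of sensitivity and
    specificity. Undefined ratios are reported as 0.0.
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    st = (sn + sp) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ST": st, "MCC": mcc}


# Hypothetical counts, not taken from the paper.
example = binding_site_metrics(tp=40, tn=40, fp=10, fn=10)
```

Because binding residues are usually a small minority of all residues, MCC and ST are more informative than raw accuracy for this task, which is presumably why the Highlights quote them.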

Discussion

Conclusion