Abstract

Background

DNA-binding proteins play important roles in many biological processes. They can interact with single-stranded DNA (ssDNA) or double-stranded DNA (dsDNA), and are accordingly categorized as single-stranded DNA-binding proteins (SSBs) or double-stranded DNA-binding proteins (DSBs). Identifying DNA-binding proteins from amino acid sequences can help to annotate protein functions and to understand binding specificity.

In this study, we systematically consider several schemes for representing protein sequences: overall amino acid composition (OAAC), dipeptide composition, position-specific scoring matrix (PSSM) profiles and split amino acid composition (SAA). We then adopt support vector machine (SVM) and random forest (RF) classification models to distinguish SSBs from DSBs.

Results

Our results suggest that some sequence features can significantly differentiate DSBs and SSBs. Evaluated by 10-fold cross-validation on the benchmark datasets, our prediction method achieves an accuracy of 88.7% and an AUC (area under the curve) of 0.919. Moreover, our method performs well in independent testing.

Conclusions

Using various sequence-derived features, we propose a novel method to distinguish DSBs from SSBs accurately. The method also explores novel features, which could help to uncover the binding specificity of DNA-binding proteins.
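
The two composition-based representations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 20-letter alphabet ordering, and the choice to skip non-standard residues are assumptions made here for illustration. OAAC is taken to be the fraction of each standard amino acid in the sequence (20 values), and dipeptide composition the fraction of each ordered residue pair among adjacent positions (400 values).

```python
from itertools import product

# The 20 standard amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def oaac(seq):
    """Overall amino acid composition: fraction of each standard residue (20 values)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered pair of adjacent standard residues (400 values)."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    seq = seq.upper()
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        # Pairs containing a non-standard symbol (e.g. X) are skipped.
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    # Dict insertion order matches the product() order used to build it.
    return [counts[pair] / total for pair in counts]


if __name__ == "__main__":
    toy_sequence = "MKRHACDK"  # made-up sequence, not taken from the dataset
    features = oaac(toy_sequence) + dipeptide_composition(toy_sequence)
    print(len(features))  # 20 + 400 values per sequence
```

Concatenating the two vectors gives one fixed-length feature vector per protein, which is the kind of input that classifiers such as SVM or RF require regardless of sequence length.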

Highlights

  • DNA-binding proteins perform important functions in a great number of biological activities

  • Positively charged residues (Arg, His and Lys) are more abundant in double-stranded DNA-binding proteins (DSBs) than in single-stranded DNA-binding proteins (SSBs). This is consistent with double-stranded DNA (dsDNA) carrying a higher negative charge than single-stranded DNA (ssDNA), and with dsDNA having a stable double-helix structure whereas ssDNA is unwound and irregular

  • In this study, we compile a non-redundant sequence dataset consisting of 873 DSBs and 183 SSBs, and build four kinds of typical features derived from DNA-binding protein sequences
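
The charge observation in the highlights can be checked on any pair of sequence sets with a short sketch like the one below. The function names are hypothetical and the example sequences are invented placeholders, not entries from the 873 DSB / 183 SSB dataset; the code only shows how the fraction of Arg, His and Lys residues might be compared between two groups.

```python
# One-letter codes for Arg, His and Lys.
POSITIVE_RESIDUES = frozenset("RHK")


def positive_fraction(seq):
    """Fraction of residues in one sequence that are Arg, His or Lys."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(residue in POSITIVE_RESIDUES for residue in seq) / len(seq)


def mean_positive_fraction(sequences):
    """Average positive-residue fraction over a collection of sequences."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    return sum(positive_fraction(s) for s in sequences) / len(sequences)


if __name__ == "__main__":
    # Placeholder sequences for illustration only.
    group_a = ["MKRKHLAR", "KKRAHG"]
    group_b = ["MADLEGSA", "GSTAKLE"]
    print(mean_positive_fraction(group_a), mean_positive_fraction(group_b))
```

On real data, a difference in group means would still need a significance test before it could support the claim made in the highlight.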

Summary

Introduction

DNA-binding proteins play important roles in many biological processes. Identifying DNA-binding proteins from amino acid sequences can help to annotate protein functions and to understand binding specificity. We systematically consider several schemes for representing protein sequences: overall amino acid composition (OAAC), dipeptide composition, position-specific scoring matrix (PSSM) profiles and split amino acid composition (SAA), and we adopt support vector machine (SVM) and random forest (RF) classification models to distinguish SSBs from DSBs. Protein-DNA interaction is important for many biological processes, such as DNA replication, transcription, DNA repair and gene expression [1,2,3,4]. Structure-based methods can achieve high accuracy, but they cannot be applied in high-throughput
