Abstract

Background: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundred protein-DNA complexes. With the rapid accumulation of sequence data, it has become an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data.

Results: A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labeled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures.

Conclusion: The results suggest that the RF-based approach predicts DNA-binding residues more accurately than the methods of previous studies. A new web server called BindN-RF has thus been developed to make the RF classifier accessible to the biological research community.

Highlights

  • Protein-DNA interactions are involved in many biological processes essential for cellular function

  • Since the dataset was imbalanced, with only 15% of the amino acid residues being DNA-binding sites, the performance of the random forest (RF) classifier was measured by the average of sensitivity and specificity and by the area under the receiver operating characteristic (ROC) curve (AUC = 0.7837)

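  • Different training parameters were tested for constructing the RF classifier, and the above performance measures were obtained with 1000 decision trees in the forest and m = 5, where m is taken here to be the number of input variables randomly sampled at each tree split, as is conventional for random forests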
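
The two performance measures named in the highlights can be illustrated with a short sketch. It is not the authors' evaluation code: the function names, the toy labels, and the rank-based way of computing the AUC are assumptions made for illustration, and only the Python standard library is used.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity); label 1 = DNA-binding residue."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def mean_sens_spec(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive residue receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 20 residues, 3 of them (15%) DNA-binding.
toy_labels = [1, 1, 1] + [0] * 17
all_negative = [0] * 20  # trivial predictor: "nothing binds"

accuracy = sum(t == p for t, p in zip(toy_labels, all_negative)) / len(toy_labels)
balanced = mean_sens_spec(toy_labels, all_negative)
print(f"accuracy={accuracy:.2f}, mean of sensitivity and specificity={balanced:.2f}")
```

On the toy data, the trivial predictor is correct for 17 of 20 residues yet has zero sensitivity, so the averaged measure falls to 0.5. This is why plain accuracy is a poor yardstick when only about 15% of residues are binding sites.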


Summary

Introduction

Protein-DNA interactions are involved in many biological processes essential for cellular function, and many nuclear proteins perform essential functions through interaction with DNA. To understand the molecular mechanism of protein-DNA recognition, it is important to identify the DNA-binding residues in proteins. This identification is straightforward if the structure of a protein-DNA complex is known, but only a few hundred protein-DNA complexes have structural data available in the Protein Data Bank [2]. With the rapid accumulation of sequence data from many genomes, computational methods are needed for accurate prediction of DNA-binding residues in protein sequences. The prediction results can provide useful information for protein functional annotation, protein-DNA docking, and experimental studies such as site-directed mutagenesis.

