PreDBA: A heterogeneous ensemble approach for predicting protein-DNA binding affinity

Wenyi Yang, Lei Deng

doi:10.1038/s41598-020-57778-1

Open Access

https://doi.org/10.1038/s41598-020-57778-1

Journal: Scientific Reports	Publication Date: Jan 28, 2020
Citations: 27	License type: open-access

Affiliation: Central South University, Xinjiang University

Abstract

Interactions between proteins and DNA play an essential role in many critical biological processes, such as DNA replication, transcription, splicing, and repair. Studying the binding affinity of proteins to DNA helps to elucidate the recognition mechanism of protein-DNA complexes. Because experimentally measured protein-DNA binding affinity data remain limited, accurate and reliable computational methods are needed. In this paper we propose PreDBA, a computational approach that predicts protein-DNA binding affinity using heterogeneous ensemble models. One hundred protein-DNA complexes were manually collected from the literature as a binding affinity data set, and 52 sequence and structural features were extracted. We then calculated the correlation between these 52 features and protein-DNA binding affinity. We found that binding affinity is affected by the structure of the DNA molecule in the complex. We therefore classify the protein-DNA complexes into five classes based on the structure of the DNA bound to the protein. For each class, a stacked heterogeneous ensemble model is constructed from the extracted features. Finally, we used leave-one-out cross-validation on the binding affinity data set to evaluate the proposed method. Across the five classes, the Pearson correlation coefficients of the proposed method range from 0.735 to 0.926. We demonstrate the advantages of the proposed method over other machine learning methods and over existing protein-DNA binding affinity prediction approaches.
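
The evaluation protocol named in the abstract (leave-one-out cross-validation scored by the Pearson correlation coefficient) can be illustrated with a minimal sketch. This is not the PreDBA implementation: the two base learners (a one-feature least-squares line and a 1-nearest-neighbor regressor) and the plain averaging that combines them are simple stand-ins for the stacked heterogeneous ensemble, and the feature values and affinities are made-up toy numbers. All function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def fit_line(xs, ys):
    """Base learner 1 (assumed): ordinary least squares on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_nearest(xs, ys):
    """Base learner 2 (assumed): 1-nearest-neighbor regression."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[nearest]
    return predict

def loocv_predictions(xs, ys, fitters):
    """Hold out each sample once, fit on the rest, predict the held-out one.

    The base-learner outputs are averaged; a real stacked model would
    instead train a meta-learner on them.
    """
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        models = [fit(train_x, train_y) for fit in fitters]
        preds.append(sum(m(xs[i]) for m in models) / len(models))
    return preds

# Toy data (made up for illustration): one feature value and one
# binding free energy per complex.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
affinity = [-7.1, -8.0, -9.2, -9.9, -11.1, -12.0]

predicted = loocv_predictions(feature, affinity, [fit_line, fit_nearest])
print("LOOCV Pearson r:", round(pearson(affinity, predicted), 3))
```

Because every prediction comes from a model that never saw the held-out complex, the resulting correlation estimates performance on unseen data, which matters for a data set of only about one hundred complexes split across five classes.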

Highlights

Interactions between proteins and DNA play an essential role in many critical biological processes, such as DNA replication, transcription, splicing, and repair
We selected only protein-DNA crystal structures deposited in the PDB with a resolution better than 3 Å
Using three features for prediction, we obtain a correlation coefficient of 0.843. For this class of complexes, we found that the nearest-neighbor bases of the DNA play a decisive role in binding affinity prediction

Summary

Introduction

Interactions between proteins and DNA play an essential role in many critical biological processes, such as DNA replication, transcription, splicing, and repair. Electrophoretic mobility shift assays (EMSAs)[5,6], conventional chromatin immunoprecipitation (ChIP)[7], peptide nucleic acid (PNA) assisted identification of RNA-binding proteins (RBPs) (PAIR)[8], X-ray crystallography[9] and nuclear magnetic resonance (NMR) spectroscopy[10] have been applied to identify protein-DNA binding residues. These laboratory methods are expensive and time-consuming. Many computational prediction techniques, including empirical scoring functions[11,12,13,14,15], knowledge-based methods[16,17,18] and quantitative structure-activity relationships[19,20], have been proposed for predicting the binding affinity of protein-ligand complexes and protein-protein complexes[21,22,23]. We classified the complexes into five classes based on the type of DNA associated with the proteins.

Methods

Results

Conclusion