PiDNA: predicting protein–DNA interactions with structural models

Chih-Kang Lin, Chien-Yu Chen

doi:10.1093/nar/gkt388

Abstract

Predicting the binding sites of a transcription factor in the genome is an important but challenging problem in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of the interaction mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To make further use of the available structural models, the proposed Web server, PiDNA, first constructs reliable PWMs by applying an atomic-level knowledge-based scoring function to numerous in silico mutated complex structures, and then uses the PWM built from the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, users can easily predict, in a single run, the relative preference of all DNA sequences with limited mutations from the native sequence co-crystallized in the model. Further predictions on sequences with unlimited mutations can be obtained through additional requests or file uploads. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed from the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or the literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy.
With PiDNA, users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. In addition, it is expected that PiDNA can be combined with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
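
The two-step idea described in the abstract (keep the mutated sequences whose structure models show small energy changes, build a matrix from them, then score other sequences) can be illustrated with a minimal Python sketch. This is not the PiDNA implementation: the function names, the `max_delta` threshold, the pseudocount, the uniform background, and the toy energy values are all assumptions made for illustration, and the energy changes are simply taken as given rather than computed with a knowledge-based scoring function.

```python
import math

BASES = "ACGT"

def build_pfm(sequences, energy_changes, max_delta):
    """Count bases per position over the sequences whose (hypothetical)
    energy change relative to the native complex is at most max_delta."""
    kept = [s for s, d in zip(sequences, energy_changes) if d <= max_delta]
    if not kept:
        raise ValueError("no sequence passes the energy threshold")
    pfm = [{b: 0 for b in BASES} for _ in range(len(kept[0]))]
    for seq in kept:
        for i, base in enumerate(seq):
            pfm[i][base] += 1
    return pfm

def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background
    (the pseudocount and background are illustrative assumptions)."""
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def score(pwm, seq):
    """Sum the per-position weights of a candidate site."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Toy data: a native site plus three in silico mutants with made-up
# energy changes; only the first three pass the threshold of 1.0.
sequences = ["ACGT", "ACGA", "TCGT", "GGGG"]
energy_changes = [0.0, 0.5, 0.8, 5.0]

pfm = build_pfm(sequences, energy_changes, max_delta=1.0)
pwm = pfm_to_pwm(pfm)
ranked = sorted(sequences, key=lambda s: score(pwm, s), reverse=True)
```

In this toy setting the native sequence scores highest and the energetically unfavourable mutant scores lowest, which mirrors the ranked list of mutated sequences that the server offers for download.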

Highlights

Interactions between transcription factors (TFs) and their binding sites play important roles in many biological processes
This study first evaluates whether the position frequency matrix (PFM) constructed using the highly reliable structures with limited mutations is consistent with the known binding sites of the query protein
Mouse and yeast TFs with structure models available in the Protein Data Bank (PDB) are examined to see if annotated PFMs can be found in the literature or databases

Summary

Introduction

Interactions between transcription factors (TFs) and their binding sites play important roles in many biological processes. The number of well-characterized PWMs is still far behind the number of known TFs. In this regard, it is desirable to exploit other resources, such as protein–DNA complexes in protein structure databases, to improve the coverage of TFs for which the prediction of binding sites can be made or improved. Many potential functions, both physics-based and knowledge-based [7,9,10,11], have been developed for improving protein–DNA docking [11,12,13]. These potential functions are being applied to predict binding specificity and construct

Methods

Results

Conclusion