A general pairwise interaction model provides an accurate description of in vivo transcription factor binding sites.

Marc Santolini,Thierry Mora,Vincent Hakim

doi:10.1371/journal.pone.0099015


Open Access

https://doi.org/10.1371/journal.pone.0099015


Journal: PLOS ONE
Publication Date: Jun 13, 2014
Citations: 74
License type: CC BY 4.0

Affiliation: French National Centre for Scientific Research

Abstract

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to the transcription factor (TF) binding. However, this description ignores correlations between nucleotides at different positions, and is generally inaccurate: analysing fly and mouse in vivo ChIPseq data, we show that in most cases the PWM model fails to reproduce the observed statistics of TFBSs. To overcome this issue, we introduce the pairwise interaction model (PIM), a generalization of the PWM model. The model is based on the principle of maximum entropy and explicitly describes pairwise correlations between nucleotides at different positions, while being otherwise as unconstrained as possible. It is mathematically equivalent to considering a TF-DNA binding energy that depends additively on each nucleotide identity at all positions in the TFBS, like the PWM model, but also additively on pairs of nucleotides. We find that the PIM significantly improves over the PWM model, and even provides an optimal description of TFBS statistics within statistical noise. The PIM generalizes previous approaches to interdependent positions: it accounts for co-variation of two or more base pairs, and predicts secondary motifs, while outperforming multiple-motif models consisting of mixtures of PWMs. We analyse the structure of pairwise interactions between nucleotides, and find that they are sparse and dominantly located between consecutive base pairs in the flanking region of TFBS. Nonetheless, interactions between pairs of non-consecutive nucleotides are found to play a significant role in the obtained accurate description of TFBS statistics. 
The PIM is computationally tractable, and provides a general framework that should be useful for describing and predicting TFBSs beyond PWMs.
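
As an illustration of the two descriptions contrasted in the abstract, the following minimal sketch computes a PWM-style energy (independent positions) and a pairwise-interaction energy (single-site terms plus pair terms) for a short sequence. It is a sketch under stated assumptions, not the authors' implementation: the function names, the pseudocount, the toy binding sites, the array layout of `h` and `J`, and the sign convention (lower energy means more favourable binding, with probability proportional to exp(-energy)) are all choices made here for illustration. Fitting `h` and `J` to data by maximum entropy is not shown.

```python
import numpy as np

BASES = "ACGT"  # assumed nucleotide ordering for all arrays below


def to_indices(seq):
    """Map a DNA string to an integer array of base indices (0..3)."""
    return np.array([BASES.index(b) for b in seq])


def pwm_from_sites(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned sites.

    The pseudocount is an illustrative regularisation choice.
    """
    length = len(sites[0])
    counts = np.full((length, 4), pseudocount, dtype=float)
    for site in sites:
        counts[np.arange(length), to_indices(site)] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)


def pwm_energy(seq, pwm):
    """Independent-site energy: minus the sum of log-probabilities."""
    idx = to_indices(seq)
    return -np.sum(np.log(pwm[np.arange(len(seq)), idx]))


def pim_energy(seq, h, J):
    """Pairwise energy: single-site terms h[i, a] plus pair terms J[i, j, a, b].

    h has shape (L, 4); J has shape (L, L, 4, 4) and only entries with
    i < j are used. Both are hypothetical parameter arrays.
    """
    idx = to_indices(seq)
    length = len(seq)
    energy = -np.sum(h[np.arange(length), idx])
    for i in range(length):
        for j in range(i + 1, length):
            energy -= J[i, j, idx[i], idx[j]]
    return energy


# Toy aligned binding sites, invented for this example only.
sites = ["ACGT", "ACGA", "TCGT"]
pwm = pwm_from_sites(sites)

# With all pair terms set to zero, the pairwise model reduces to the PWM.
h = np.log(pwm)
J = np.zeros((4, 4, 4, 4))
print(pwm_energy("ACGT", pwm), pim_energy("ACGT", h, J))
```

Setting every entry of `J` to zero recovers the PWM energy exactly, which is the sense in which the pairwise model generalizes the PWM; non-zero entries of `J` shift the energy of sequences carrying specific nucleotide pairs.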

Highlights

Gene regulatory networks are at the basis of our understanding of cell states and of the dynamics of their response to environmental cues.
The availability of ChIPseq data for many transcription factors (TFs) is an opportunity to revisit the question of nucleotide correlations in Transcription Factor Binding Sites (TFBSs), and to propose alternative descriptions of TFBS ensembles beyond Position Weight Matrices (PWMs) [17].
To allow for a fair and consistent comparison between different models, we have developed a workflow in which the TFBS collection and the model describing it are obtained and refined together.

Summary

Introduction

Gene regulatory networks are at the basis of our understanding of cell states and of the dynamics of their response to environmental cues. Central effectors of this regulation are Transcription Factors (TFs), which bind to short DNA regulatory sequences and interact with the transcription apparatus or with histone-modifying proteins to alter target gene expression [1]. In eukaryotes the TF binding specificity is only moderate, meaning that a given TF may bind to a variety of different sequences in vivo [3]. The collection of such binding sequences is typically described by a Position Weight Matrix (PWM), which gives the probability that a particular base pair stands at a given position in the TFBS.

Initial PWM refinement

Along with the ChIPseq data for the different factors, we retrieved corresponding PWMs from the literature [32], from the JASPAR database [48], or from the TRANSFAC database version 2010:3 [49]. These initial PWMs were refined according to the following protocol.

Methods

Results

Conclusion