Abstract
Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs that directly regulate gene expression from those that are indirectly associated with gene expression. Graphical models, in particular Bayesian networks, provide a powerful mathematical framework to infer different types of dependencies. However, existing methods do not perform well when the features (here: TF binding profiles) are highly correlated, when their association with the biological outcome is weak, and when the sample size is small. Here, we develop a novel computational method, the Neighbourhood Consistent PC (NCPC) algorithms, which deal with these scenarios much more effectively than existing methods do. We further present a novel graphical representation, the Direct Dependence Graph (DDGraph), to better display the complex interactions among variables. NCPC and DDGraph can also be applied to other problems involving highly correlated biological features. Both methods are implemented in the R package ddgraph, available as part of Bioconductor (http://bioconductor.org/packages/2.11/bioc/html/ddgraph.html). Applied to real data, our method identified TFs that specify different classes of cis-regulatory modules (CRMs) in Drosophila mesoderm differentiation. Our analysis also found depletion of the early transcription factor Twist binding at the CRMs regulating expression in visceral and somatic muscle cells at later stages, which suggests a CRM-specific repression mechanism that so far has not been characterised for this class of mesodermal CRMs.
Highlights
A major area in genome research is understanding how the regulatory information is encoded
Work over the past few decades has resulted in the notion of a combinatorial regulatory code: the concerted binding of a context-specific set of transcription factors (TFs) to regulatory sequences, which is crucial for proper gene expression
Transcription factors (TFs) are proteins that bind to DNA and regulate gene expression
Summary
A major area in genome research is understanding how the regulatory information is encoded. A canonical example of this traditional dissection is the identification of the various stripe enhancers of the Drosophila even-skipped gene that respond to different TFs involved in early patterning (see [2,3] for review). Whereas the inference of the regulatory code may greatly benefit from having additional data, such as the expression patterns of the genes of interest under mutant conditions, it is often difficult to collect at the genome level. Recent studies provide evidence for so-called ‘‘hotspots’’ to which many interacting or non-interacting TFs may bind [6,8,9], which leads to high correlations among binding profiles of both functionally ‘‘relevant’’ and functionally ‘‘irrelevant’’ TFs. It remains a significant challenge to distinguish relevant and important TFs from the others in the understanding of the combinatorial regulatory code
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.