Genome-Wide Signatures of Transcription Factor Activity: Connecting Transcription Factors, Disease, and Small Molecules

Jing Chen, Mukta Phatak, Johannes M Freudenberg, John Reichard, Siva Sivaganesan, Zhen Hu, Mario Medvedovic, Quaid Morris

doi:10.1371/journal.pcbi.1003198

Abstract

Identifying the transcription factors (TFs) involved in producing a genome-wide transcriptional profile is an essential step in building a mechanistic model that can explain observed gene expression data. We developed a statistical framework for constructing genome-wide signatures of TF activity, and for using such signatures in the analysis of gene expression data produced by complex transcriptional regulatory programs. Our framework integrates ChIP-seq data and appropriately matched gene expression profiles to identify True REGulatory (TREG) TF-gene interactions. It provides a genome-wide quantification of the likelihood of a regulatory TF-gene interaction, which can be used either to identify regulated genes or as a genome-wide signature of TF activity. To use ChIP-seq data effectively, we introduce a novel statistical model that integrates information from all binding “peaks” within a 2 Mb window around a gene's transcription start site (TSS) and provides gene-level binding scores and probabilities of regulatory interaction. In the second step, we integrate these binding scores and regulatory probabilities with gene expression data to assess the likelihood of TREG TF-gene interactions. We demonstrate the advantages of the TREG framework in identifying genes regulated by two TFs with widely different distributions of functional binding events (ERα and E2f1). We also show that TREG signatures of TF activity vastly improve our ability to detect the involvement of ERα in producing complex disease-related transcriptional profiles. Through a large study of disease-related transcriptional signatures and transcriptional signatures of drug activity, we demonstrate that the increase in statistical power associated with the use of TREG signatures makes the crucial difference in identifying key targets for treatment and drugs to use for treatment. All methods are implemented in an open-source R package, treg. The package also contains all data used in the analysis, including 494 TREG binding profiles based on ENCODE ChIP-seq data. The treg package can be downloaded at http://GenomicsPortals.org.

Highlights

The specificity of transcriptional initiation in the genomes of eukaryotes is maintained through regulatory programs entailing complex interactions among transcription factors (TFs), epigenetic modifications of regulatory DNA regions and associated histones, chromatin-remodeling proteins, and the basal transcriptional machinery [1].
The two main findings of our study are: 1) True REGulatory (TREG) binding scores derived from ChIP-seq data are more informative than simple alternative summaries of ChIP-seq data; and 2) TREG signatures that integrate the binding and gene expression data are more sensitive in detecting evidence of TF regulatory activity than commonly used alternatives.
We show that this advantage of TREG signatures can determine whether TF regulatory activity can be inferred from complex transcriptional profiles at all.

Summary

Introduction

The specificity of transcriptional initiation in the genomes of eukaryotes is maintained through regulatory programs entailing complex interactions among transcription factors (TFs), epigenetic modifications of regulatory DNA regions and associated histones, chromatin-remodeling proteins, and the basal transcriptional machinery [1]. High-throughput sequencing of immuno-precipitated DNA fragments (ChIP-seq) provides a means to assess genome-wide events that regulate expression, such as TF-DNA interactions [2]. Sophisticated statistical methodologies have been developed for identifying TF binding events in terms of “peaks” in the distributions of ChIP-seq data [3,4,5,6,7,8]. The evidence provided by ChIP-seq binding data that a gene’s expression is regulated by a TF is a function of the number of peaks, their intensity, and their proximity to the transcription start site (TSS) [9]. Identifying true regulatory TF-gene relationships therefore requires per-gene summary scores that measure the totality of the evidence in ChIP-seq data, integrated with measurements of gene expression levels.
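
To make the idea of a per-gene summary concrete, the following is a minimal illustrative sketch in Python of a score that depends on the number of peaks, their intensity, and their distance from the TSS. It is not the TREG model or the treg package API: the `Peak` class, the `gene_binding_score` function, the exponential distance weighting, and the `decay` value are all hypothetical choices made for illustration. The sketch also assumes that the 2 Mb window is centered on the TSS, that is, 1 Mb on each side.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """A called ChIP-seq peak (hypothetical minimal representation)."""
    position: int     # genomic coordinate of the peak summit, in bp
    intensity: float  # peak strength as reported by a peak caller


def gene_binding_score(tss, peaks, half_window=1_000_000, decay=5_000.0):
    """Sum peak intensities near a TSS, down-weighting distant peaks.

    Assumptions (not taken from the paper): the window is symmetric
    around the TSS, and a peak's contribution decays exponentially
    with its distance from the TSS at a rate set by `decay` (in bp).
    """
    score = 0.0
    for peak in peaks:
        distance = abs(peak.position - tss)
        if distance <= half_window:
            # More peaks, stronger peaks, and closer peaks all raise the score.
            score += peak.intensity * math.exp(-distance / decay)
    return score


if __name__ == "__main__":
    example_peaks = [
        Peak(position=1_000, intensity=10.0),      # at the TSS
        Peak(position=6_000, intensity=10.0),      # 5 kb away
        Peak(position=5_000_000, intensity=100.0), # outside the window
    ]
    print(gene_binding_score(tss=1_000, peaks=example_peaks))
```

A weighted sum like this is only one simple way to combine the three sources of evidence; the framework described here instead uses a statistical model that also yields probabilities of regulatory interaction, which a plain score of this kind does not provide.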
