T-KDE: a method for genome-wide identification of constitutive protein binding sites from multiple ChIP-seq data sets

Yuanyuan Li,Leping Li,David M Umbach

doi:10.1186/1471-2164-15-27


Open Access

https://doi.org/10.1186/1471-2164-15-27


Abstract

Background

A protein may bind to its target DNA sites constitutively, i.e., regardless of cell type. Intuitively, constitutive binding sites should be biologically functional. A prerequisite for understanding their functional relevance is knowing all their locations for a protein of interest. Genome-wide discovery of constitutive binding sites requires robust and efficient computational methods to integrate results from numerous binding experiments. Such methods are lacking, however.

Results

To locate constitutive binding sites for a protein using ChIP-seq data for that protein from multiple cell lines, we developed a method, T-KDE, which combines a binary range tree with a kernel density estimator. Using 132 CTCF (CCCTC-binding factor) ChIP-seq datasets, we showed that the number of constitutive sites identified by T-KDE is robust to the choice of tuning parameter and that T-KDE identifies binding site locations more accurately than a binning approach. Furthermore, T-KDE can identify constitutive sites that are missed by a motif-based approach either because a bound site failed to reach the motif significance cutoff or because the peak sequence scanned was too short. By studying sites declared constitutive by T-KDE but not by the motif-based approach, we discovered two new CTCF motif variants. Using ENCODE data on 22 transcription factors (TF) in 132 cell lines, we identified constitutive binding sites for each TF and provide evidence that, for some TFs, they may be biologically meaningful.

Conclusions

T-KDE is an efficient and effective method to predict constitutive protein binding sites using ChIP-seq peaks from multiple cell lines. Besides constitutive binding sites for a given protein, T-KDE can identify genomic “hot spots” where several different proteins bind and, conversely, cell-type-specific sites bound by a given protein.

Highlights

A protein may bind to its target DNA sites constitutively, i.e., regardless of cell type
Data: We downloaded data on ChIP-seq peaks for 22 transcription factors from the ENCODE portal at the UCSC Genome Browser [23]. (The complete list of datasets and their unique identifiers can be found in Additional file 1: Table S2.) For each ChIP-seq peak, we calculated the location of the peak center as half the sum of the start and end coordinates for the peak, and we used these locations for subsequent analysis
Utility of the binary range tree: Without an initial partition of the data using the binary range tree, kernel density estimator (KDE) analysis and mode finding on even a single chromosome are computationally prohibitive; estimating the density, rather than finding the local maxima/minima, is the bottleneck
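
To make the last two highlights concrete, the following is a minimal Python sketch of the general idea: compute peak centers, partition them so that each density estimate only touches nearby points, and then look for local maxima of the estimated density. It is not the authors' implementation. In particular, `split_at_gaps` is a simple gap-based stand-in for the binary range tree, and the function names, the Gaussian kernel, and the `max_gap`, `bandwidth`, and `step` values are all illustrative assumptions.

```python
import math


def peak_center(start, end):
    """Peak center as half the sum of the start and end coordinates."""
    return (start + end) / 2


def split_at_gaps(centers, max_gap):
    """Group sorted centers, starting a new group at any gap wider than max_gap.

    Assumption: a simplified stand-in for the binary range tree partition.
    """
    groups, current = [], []
    for c in sorted(centers):
        if current and c - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(c)
    if current:
        groups.append(current)
    return groups


def gaussian_kde(points, x, bandwidth):
    """Gaussian kernel density estimate at position x."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def find_modes(points, bandwidth, step=1.0):
    """Grid positions where the estimated density has a local maximum."""
    lo = math.floor(min(points)) - step
    hi = math.ceil(max(points)) + step
    n = int((hi - lo) / step) + 1
    grid = [lo + i * step for i in range(n)]
    dens = [gaussian_kde(points, x, bandwidth) for x in grid]
    return [grid[i] for i in range(1, n - 1)
            if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]


# Hypothetical (start, end) peak coordinates pooled from several cell lines.
peaks = [(100, 200), (110, 210), (105, 195), (5000, 5100), (5010, 5110)]
centers = [peak_center(s, e) for s, e in peaks]
groups = split_at_gaps(centers, max_gap=1000)
modes = [find_modes(g, bandwidth=20) for g in groups]
```

Because each call to `find_modes` evaluates the density only over the span of one group, the cost scales with the size of the groups rather than with the whole chromosome, which is the motivation the highlight gives for partitioning first.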

Summary

Introduction

A protein may bind to its target DNA sites constitutively, i.e., regardless of cell type. We believe that these constitutive binding sites are likely to have unique or fundamental biological roles. We carried out a comprehensive analysis of the 116 CTCF ChIP-seq datasets from 56 cell lines from the ENCODE (Encyclopedia of DNA Elements) consortium and identified ~24,000 CTCF binding sites that were bound in more than 90% of the 56 cell lines [3]. We see value in locating constitutive binding sites for other DNA binding proteins as a possible window into highly conserved biological processes.

Objectives

Methods

Results

Discussion

Conclusion