Abstract

BackgroundUnderstanding the mechanisms by which transcription factors (TF) are recruited to their physiological target sites is crucial for understanding gene regulation. DNA sequence intrinsic features such as predicted binding affinity are often not very effective in predicting in vivo site occupancy and in any case could not explain cell-type specific binding events. Recent reports show that chromatin accessibility, nucleosome occupancy and specific histone post-translational modifications greatly influence TF site occupancy in vivo. In this work, we use machine-learning methods to build predictive models and assess the relative importance of different sequence-intrinsic and chromatin features in the TF-to-target-site recruitment process.MethodsOur study primarily relies on recent data published by the ENCODE consortium. Five dissimilar TFs assayed in multiple cell-types were selected as examples: CTCF, JunD, REST, GABP and USF2. We used two types of candidate target sites: (a) predicted sites obtained by scanning the whole genome with a position weight matrix, and (b) cell-type specific peak lists provided by ENCODE. Quantitative in vivo occupancy levels in different cell-types were based on ChIP-seq data for the corresponding TFs. In parallel, we computed a number of associated sequence-intrinsic and experimental features (histone modification, DNase I hypersensitivity, etc.) for each site. Machine learning algorithms were then used in a binary classification and regression framework to predict site occupancy and binding strength, for the purpose of assessing the relative importance of different contextual features.ResultsWe observed striking differences in the feature importance rankings between the five factors tested. PWM-scores were amongst the most important features only for CTCF and REST but of little value for JunD and USF2. Chromatin accessibility and active histone marks are potent predictors for all factors except REST. Structural DNA parameters, repressive and gene body associated histone marks are generally of little or no predictive value.ConclusionsWe define a general and extensible computational framework for analyzing the importance of various DNA-intrinsic and chromatin-associated features in determining cell-type specific TF binding to target sites. The application of our methodology to ENCODE data has led to new insights on transcription regulatory processes and may serve as example for future studies encompassing even larger datasets.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-015-0846-z) contains supplementary material, which is available to authorized users.

Highlights

  • Understanding the mechanisms by which transcription factors (TF) are recruited to their physiological target sites is crucial for understanding gene regulation

  • As we found again that Support Vector Machines (SVM) with radial kernel performed better than the other two methods tested, we used this method here and in all subsequent analyses described in this paper

  • There are 30,073 common CTCF sites (18.6 %) out of 161,438 sites in total. We extended this type of regression-based machine modeling framework for predicting tag counts to three additional factors (REST, GABP and USF2) and three additional cell types (GM12878, HeLa and HepG2)

Read more

Summary

Introduction

Understanding the mechanisms by which transcription factors (TF) are recruited to their physiological target sites is crucial for understanding gene regulation. DNA sequence intrinsic features such as predicted binding affinity are often not very effective in predicting in vivo site occupancy and in any case could not explain cell-type specific binding events. Genes are regulated by transcription factors (TF) binding to physiological target sites in the genome. Research in this area has been hampered by the lack of powerful assays to study TF binding events in vivo This has drastically changed with the advent of the ChIP-seq technology which allows for comprehensive, genome-wide mapping of all in vivo bound sites of a given TF in a particular cell type at near base-pair resolution [2]. The recruitment of TFs to target sites depends on both DNA-intrinsic properties and cell type specific covariates

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.