Abstract
BackgroundChromatin immunoprecipitation followed by deep sequencing (ChIP-seq) is the most widely used method for characterizing the epigenetic states of chromatin on a genomic scale. With the recent availability of large genome-wide data sets, often comprising several epigenetic marks, novel approaches are required to explore functionally relevant interactions between histone modifications. Computational discovery of "chromatin states" defined by such combinatorial interactions enabled descriptive annotations of genomes, but more quantitative approaches are needed to progress towards predictive models.ResultsWe propose non-negative matrix factorization (NMF) as a new unsupervised method to discover combinatorial patterns of epigenetic marks that frequently co-occur in subsets of genomic regions. We show that this small set of combinatorial "codes" can be effectively displayed and interpreted. NMF codes enable dimensionality reduction and have desirable statistical properties for regression and classification tasks. We demonstrate the utility of codes in the quantitative prediction of Pol2-binding and the discrimination between Pol2-bound promoters and enhancers. Finally, we show that specific codes can be linked to molecular pathways and targets of pluripotency genes during differentiation.ConclusionsWe have introduced and evaluated a new computational approach to represent combinatorial patterns of epigenetic marks as quantitative variables suitable for predictive modeling and supervised machine learning. To foster widespread adoption of this method we make it available as an open-source software-package – epicode at https://github.com/mcieslik-mctp/epicode.
Highlights
Chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) is the most widely used method for characterizing the epigenetic states of chromatin on a genomic scale
We have shown that negative matrix factorization (NMF) applied to epigenetic marks yield sparse codes with an important nesting property
Further we have demonstrated the benefits of using codes over individual marks in predictive modeling of Pol2binding
Summary
Chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) is the most widely used method for characterizing the epigenetic states of chromatin on a genomic scale. With the recent availability of large genome-wide data sets, often comprising several epigenetic marks, novel approaches are required to explore functionally relevant interactions between histone modifications. Chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) is becoming the standard method for the genome-wide mapping of histone modifications and transcription factor (TF) binding sites [2]. Most of the existing analysis tools are focused on the delineation of enriched sites from a single sample with optional “input control” [4] For histone modifications this task becomes more challenging as their enrichments are often weaker and less localized. None of the standard peak-based method deals with multiple marks and the reconciliation of several sets of peaks is an added challenge [11,12,13]
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.