Abstract

Genome-wide maps of transcription factor (TF) occupancy and regions of open chromatin implicitly contain DNA sequence signals for multiple factors. We present SeqGL, a novel de novo motif discovery algorithm to identify multiple TF sequence signals from ChIP-, DNase-, and ATAC-seq profiles. SeqGL trains a discriminative model using a k-mer feature representation together with group lasso regularization to extract a collection of sequence signals that distinguish peak sequences from flanking regions. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. Furthermore, SeqGL can be naturally used with multitask learning to identify genomic and cell-type context determinants of TF binding. SeqGL also scales to the large multiplicity of sequence signals in DNase- or ATAC-seq maps. In particular, SeqGL identified a number of ChIP-seq-validated sequence signals that were not found by traditional motif discovery algorithms. Thus, compared to widely used motif discovery algorithms, SeqGL demonstrates both greater discriminative accuracy and higher sensitivity for detecting the DNA sequence signals underlying regulatory element maps. SeqGL is available at http://cbio.mskcc.org/public/Leslie/SeqGL/.
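
The two ingredients named above, a k-mer feature representation and a group lasso penalty, can be illustrated with a minimal sketch. This is not the SeqGL implementation: the helper names `kmer_counts` and `group_lasso_penalty`, the choice of k, the example grouping of features, and the square-root group-size weighting (the standard textbook form of the group lasso) are all assumptions made here for illustration.

```python
from itertools import product
import math


def kmer_counts(seq, k=3):
    """Count every DNA k-mer in `seq` (hypothetical helper, not SeqGL's API).

    Returns a list of length 4**k, ordered lexicographically over "ACGT".
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows containing N or other ambiguity codes
            counts[index[kmer]] += 1
    return counts


def group_lasso_penalty(weights, groups, lam=1.0):
    """Standard group lasso penalty: lam * sum_g sqrt(|g|) * ||w_g||_2.

    `groups` is a list of lists of feature indices; how k-mers are assigned
    to groups is an assumption left to the caller.
    """
    total = 0.0
    for g in groups:
        norm = math.sqrt(sum(weights[j] ** 2 for j in g))
        total += math.sqrt(len(g)) * norm
    return lam * total


# Example: 2-mer features of a short sequence, and a penalty over two groups.
features = kmer_counts("ACGTACG", k=2)
penalty = group_lasso_penalty([3.0, 4.0, 0.0, 0.0], [[0, 1], [2, 3]])
```

A group whose weights are all zero contributes nothing to the penalty, which is why this form of regularization tends to keep or discard whole groups of features together rather than individual k-mers.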

Highlights

  • Transcription factor (TF) ChIP-seq profiles and genome-wide regulatory element maps based on DNase I hypersensitive site sequencing (DNase-seq) or transposase-accessible chromatin sequencing (ATAC-seq) implicitly contain rich information about the cell-type-specific and genomic-context-dependent binding of multiple factors

  • Transcription factors (TFs) are proteins that recognize and bind specific DNA sequence signals to regulate the expression of target genes

  • Recent years have seen the rapid development of genome-wide assays to profile the binding locations of a single TF or, more generally, regions of open chromatin that are occupied by a complex repertoire of DNA binding factors



Introduction

Transcription factor (TF) ChIP-seq profiles and genome-wide regulatory element maps based on DNase I hypersensitive site sequencing (DNase-seq) or transposase-accessible chromatin sequencing (ATAC-seq) implicitly contain rich information about the cell-type-specific and genomic-context-dependent binding of multiple factors. Several methods use DNase-seq profiles to scan for instances of known motifs [10, 11], and one recently proposed approach exploits the read-level properties of high-depth digital genomic footprinting (DGF) to improve localization of known motifs [12]. These methods do not enable de novo discovery of binding signals that are not represented in TF motif databases, and methods that rely on the depth and read-level properties of DNase I cleavage in DGF may not readily generalize to newer assays like ATAC-seq, which can be used in low-cell-number settings where DNase-seq is not feasible. We show how SeqGL can be trained in a multi-task setting, where we jointly train on experiments from multiple cell types in order to identify shared and cell-type-specific binding signals, or encode information about genomic context, such as gene proximity or chromatin state, into the task structure to reveal more detailed regulatory sequence information.
