Abstract
The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it combines a dynamic search space and a learned threshold for finding discriminative motifs with the modeling of motifs as a full PWM (position weight matrix) rather than as k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than, motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.
Highlights
Multiple mechanisms exist to modulate protein levels in a cell and create a dynamic cellular phenotype from a static genotype
MotifSpec consistently found longer motifs than either DREME or Amadeus. As these algorithms are word-based, we speculate that finding high-scoring exact matches to individual long k-mers or regular expressions is less likely and that these long words get filtered out at an early stage in the algorithm
The motifs found were almost identical to those found using each complete dataset, and the area under the ROC curve (auROC) for the test sets was consistently within 1% of the auROC found on the whole dataset
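To make the auROC comparison concrete, the following is a minimal sketch of how a PWM can be used to score sequences and how an auROC can then be computed from those scores. It is not MotifSpec's implementation: the function names (`log_odds_pwm`, `best_window_score`, `auroc`), the uniform background, the pseudocount, and the toy sequences are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into a log-odds PWM.

    Assumes a uniform background and a fixed pseudocount (illustrative choices).
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_window_score(seq, pwm):
    """Score every window of the PWM's width and return the best score."""
    width = len(pwm)
    scores = [
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]
    # Sequences shorter than the motif have no windows to score.
    return max(scores, default=float("-inf"))

def auroc(pos_scores, neg_scores):
    """auROC as the probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical toy motif with consensus TGAC, built from made-up counts.
consensus = "TGAC"
counts = [{b: (10 if b == c else 0) for b in BASES} for c in consensus]
pwm = log_odds_pwm(counts)

positives = ["AATGACGG", "CCTGACAA"]   # toy "bound" sequences
negatives = ["AAAAAAAA", "CCGGCCGG"]   # toy "unbound" sequences

pos_scores = [best_window_score(s, pwm) for s in positives]
neg_scores = [best_window_score(s, pwm) for s in negatives]
toy_auroc = auroc(pos_scores, neg_scores)
```

In a held-out evaluation of the kind described above, the PWM would be learned on a training split and `auroc` would be computed on the scores of the test split; the pairwise definition used here is quadratic in the number of sequences and is chosen for readability, not speed.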
Summary
Multiple mechanisms exist to modulate protein levels in a cell and create a dynamic cellular phenotype from a static genotype. Transcription factors (TFs) bind to intergenic cis-regulatory elements and enhance or inhibit the transcription of their target genes. Identifying the DNA binding specificities of transcription factors is necessary to decipher the regulatory network in the cell, identify disease-causing mutations in these elements, and engineer synthetic organisms to perform specific biochemical functions.
[14]) have shown that the use of a higher-order background model can prove beneficial. Since our objective function penalizes motifs according to their actual frequency in the negative/background set, it is likely that the background model, which is a summary statistic of the background set, is not as important to performance.
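The idea of penalizing a motif by its actual frequency in the negative set can be illustrated with a small sketch. The objective below (true positive rate minus false positive rate, maximized over candidate thresholds) is an assumed stand-in and is not stated in the source as MotifSpec's objective function; `learn_threshold` and the score lists are hypothetical.

```python
def learn_threshold(pos_scores, neg_scores):
    """Pick the score threshold that maximizes TPR - FPR.

    Every hit in the negative set directly lowers the objective, so the
    penalty comes from observed negatives, not from a background model.
    Returns (threshold, objective); the earliest best threshold wins ties.
    """
    best_threshold, best_objective = None, float("-inf")
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        objective = tpr - fpr
        if objective > best_objective:
            best_threshold, best_objective = t, objective
    return best_threshold, best_objective

# Made-up PWM scores for sequences in the positive and negative sets.
pos_scores = [6.6, 5.1, 4.0, 1.0]
neg_scores = [-3.8, -0.3, 2.5, 0.5]

threshold, objective = learn_threshold(pos_scores, neg_scores)
```

With these toy scores, a low threshold calls every positive but also admits negatives, while a high threshold excludes negatives at the cost of missed positives; the learned threshold is the point where that trade-off is best under the assumed objective.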