Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets.

Hisaki Ikebata, Ryo Yoshida, John Hancock

doi:10.1093/bioinformatics/btv017

Abstract

Motivation: The motif discovery problem consists of finding recurring patterns of short strings in a set of nucleotide sequences. This classical problem is receiving renewed attention because most early motif discovery methods cannot handle the large datasets produced by recent genome-wide ChIP studies. New ChIP-tailored methods focus on reducing computation time and pay little regard to the accuracy of motif detection. Unlike such methods, our method focuses on increasing detection accuracy while keeping computational efficiency at an acceptable level. The major advantage of our method is that it can mine diverse multiple motifs undetectable by current methods.

Results: The repulsive parallel Markov chain Monte Carlo (RPMCMC) algorithm that we propose is a parallel version of the widely used Gibbs motif sampler. RPMCMC runs on parallel interacting motif samplers. A repulsive force is generated when motifs produced by different samplers come close to each other, so different samplers explore different motifs. In this way, we can detect far more diverse motifs than conventional methods can. Through application to 228 transcription factor ChIP-seq datasets of the ENCODE project, we show that the RPMCMC algorithm can find many reliable cofactor interacting motifs that existing methods are unable to discover.

Availability and implementation: A C++ implementation of RPMCMC and discovered cofactor motifs for the 228 ENCODE ChIP-seq datasets are available from http://daweb.ism.ac.jp/yoshidalab/motif.

Contact: ikebata.hisaki@ism.ac.jp, yoshidar@ism.ac.jp

Supplementary information: Supplementary data are available from Bioinformatics online.

Highlights

The motif discovery problem has been receiving renewed attention since recent experimental technologies, such as ChIP-seq, posed new challenges
On the Supplementary Website, we provide all the discovered cofactor motifs which were associated with annotated motifs in JASPAR
We report the performance of several motif discovery algorithms on two types of data: (i) promoter sequences into which strings generated from position probability matrices (PPMs) in the JASPAR CORE database are planted and (ii) 228 transcription factor (TF) ChIP-seq datasets of the ENCODE project
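
To make the first type of benchmark data concrete, the following is a minimal Python sketch of how strings can be drawn from a PPM and planted into sequences. It is an illustration only, not the authors' data-generation procedure: the uniform random background (standing in for real promoter sequences), the overwrite-style planting, the toy matrix `TOY_PPM` (not taken from JASPAR), and all function names are assumptions made for this example.

```python
import random

ALPHABET = "ACGT"

# Hypothetical 4-column PPM; each row is one motif position and gives
# the probabilities of A, C, G, T at that position (rows sum to 1).
TOY_PPM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]


def sample_site(ppm, rng):
    """Draw one string from a PPM, one base per position."""
    return "".join(rng.choices(ALPHABET, weights=row, k=1)[0] for row in ppm)


def plant(sequence, site, rng):
    """Overwrite a random window of `sequence` with `site`.

    Returns the modified sequence and the start index of the planted site.
    """
    start = rng.randrange(len(sequence) - len(site) + 1)
    planted = sequence[:start] + site + sequence[start + len(site):]
    return planted, start


def make_planted_set(n_seqs, seq_len, ppm, seed=0):
    """Build `n_seqs` sequences, each carrying one planted site.

    Assumption: the background is uniform random DNA, used here only
    as a stand-in for real promoter sequences.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_seqs):
        background = "".join(rng.choice(ALPHABET) for _ in range(seq_len))
        site = sample_site(ppm, rng)
        seq, start = plant(background, site, rng)
        dataset.append((seq, start, site))
    return dataset
```

Because the planted positions are recorded, a motif finder run on such a set can be scored against known ground truth, which is the appeal of planted benchmarks.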

Summary

Introduction

The motif discovery problem has been receiving renewed attention since recent experimental technologies, such as ChIP-seq, posed new challenges. The problem is to identify recurring patterns of conserved short strings that appear in a large fraction of nucleotide sequences. A genome-wide ChIP study produces thousands or more DNA fragments, each several hundred base pairs long, which cover the binding sites of a transcription factor (TF). By discovering motifs in the given sequences, which are associated with known TF-binding motifs in a database, e.g. JASPAR (Sandelin et al., 2004), TRANSFAC.
