DiNAMO: highly sensitive DNA motif discovery in high-throughput sequencing data

Chadi Saad, Hélène Touzet, Hugues Richard, Martin Figeac, Laurent Noé, Julie Leclerc, Marie-Pierre Buisine

doi:10.1186/s12859-018-2215-1

Abstract

Background: Discovering over-represented approximate motifs in DNA sequences is an essential part of bioinformatics. This topic has been studied extensively because of the increasing number of potential applications. However, it remains a difficult challenge, especially with the huge quantity of data generated by high-throughput sequencing technologies. To overcome this problem, existing tools use greedy algorithms and probabilistic approaches to find motifs in reasonable time. Nevertheless, these approaches lack sensitivity and have difficulty coping with rare and subtle motifs.

Results: We developed DiNAMO (for DNA MOtif), a new software tool based on an exhaustive and efficient algorithm for IUPAC motif discovery. We evaluated DiNAMO on synthetic and real datasets with two different applications, namely ChIP-seq peaks and Systematic Sequencing Error analysis. DiNAMO compares favorably with other existing methods and is robust to noise.

Conclusions: We have shown that the DiNAMO software can serve as a tool to search for degenerate motifs in an exact manner using IUPAC models. DiNAMO can be used in scanning mode with sliding windows or in fixed position mode, which makes it suitable for numerous potential applications.

Availability: https://github.com/bonsai-team/DiNAMO.

Highlights

Discovering over-represented approximate motifs in DNA sequences is an essential part of bioinformatics
Given a set of DNA sequences, motif discovery consists of finding over-represented motifs, that is, motifs that are significantly more frequent in the sequences than one would expect by chance
Nodes are IUPAC motifs of length L, and there is an edge between two motifs M1 and M2 if M1 and M2 differ at exactly one position i and the ith letter of M1 is directly connected to the ith letter of M2 in the IUPAC character lattice
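
The motif graph in the last highlight can be illustrated with a minimal Python sketch. It assumes that "directly connected in the IUPAC character lattice" means that one letter's nucleotide set contains the other's and is larger by exactly one nucleotide (for example A and R, or R and D). The names `directly_connected` and `neighbors` are hypothetical and are not taken from the DiNAMO code base.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
SETS = {code: frozenset(bases) for code, bases in IUPAC.items()}


def directly_connected(x, y):
    """Assumed lattice adjacency: strict inclusion with a size gap of one."""
    a, b = SETS[x], SETS[y]
    return abs(len(a) - len(b)) == 1 and (a < b or b < a)


def neighbors(motif):
    """Return all motifs that share an edge with `motif` in the graph.

    A neighbor differs at exactly one position i, and its ith letter is
    directly connected to the ith letter of `motif`.
    """
    result = []
    for i, letter in enumerate(motif):
        for other in SETS:
            if directly_connected(letter, other):
                result.append(motif[:i] + other + motif[i + 1:])
    return result


if __name__ == "__main__":
    print(sorted(neighbors("AC")))
```

Under this assumption each exact nucleotide has three neighbors (the two-base codes that contain it), so an exact motif of length L has 3L neighbors in the graph.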

Summary

Results

We used the nucleotide-level correlation coefficient (nCC) to evaluate performance [30]. DiNAMO achieves the best nCC value of all the tools compared for the detection of degenerate motifs. We ran DiNAMO, MEME-CHIP, HOMER and Discrover with IUPAC motifs of length L = 7 containing up to 3 degenerate letters (d = 3). For the GATA1, SOX2, OCT4, STAT3 and KLF1 datasets respectively, DiNAMO finds 18, 17, 13, 17 and 5 cofactors, MEME-CHIP finds 11, 16, 10, 3 and 2, HOMER finds 9, 10, 12, 7 and 6, and Discrover finds 3, 4, 4, 2 and 2. Most of these motifs have already been validated experimentally as cofactors of the principal transcription factor (see Table S2 in Additional file 2). HOMER achieves good results, but it is less sensitive than MEME-CHIP and DiNAMO (the difference in the proportion of TFBS motifs detected reaches approximately 20%), and HOMER's curve is inexplicably reversed for the smallest datasets (0.5% of peak files). No significant motifs were found, showing the selectivity of the algorithm.
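
For readers unfamiliar with the nCC, the following sketch computes it from nucleotide-level counts. It assumes the usual definition, a Matthews-style correlation over true/false positives and negatives counted per nucleotide position. The function name `ncc` and the choice to return 0.0 when the denominator vanishes are assumptions made for this illustration, not details taken from the paper.

```python
from math import sqrt


def ncc(n_tp, n_fp, n_tn, n_fn):
    """Nucleotide-level correlation coefficient.

    n_tp: positions in both the known and the predicted sites
    n_fp: positions predicted but not in a known site
    n_tn: positions in neither
    n_fn: positions in a known site but not predicted
    """
    denom = (n_tp + n_fn) * (n_tn + n_fp) * (n_tp + n_fp) * (n_tn + n_fn)
    if denom == 0:
        # Assumed convention: the coefficient is undefined here, report 0.0.
        return 0.0
    return (n_tp * n_tn - n_fn * n_fp) / sqrt(denom)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(ncc(n_tp=80, n_fp=20, n_tn=880, n_fn=20))
```

The value ranges from -1 (predictions perfectly anti-correlated with the known sites) to +1 (perfect agreement), which is why a higher nCC indicates a better tool.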

Conclusions

Background

Methods

Results and discussion

The motif

Conclusion