SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets

Qiang Yu, Dingbang Wei, Hongwei Huo

doi:10.1186/s12859-018-2242-y

Abstract

Background: Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more.

Results: We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D’ with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D’. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D’ efficiently and (2) that the qPMS algorithms executed on D’ can find implanted or real motifs in a significantly shorter time than when executed on D.

Conclusions: We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D’, rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.

Highlights

Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences
The quorum planted motif search (qPMS) algorithms are based on searching possible combinations of motif instances or possible candidate motifs and are either sample driven or pattern driven
The sample-driven qPMS algorithms, such as WINNOWER [5], DPCFG [12] and RecMotif [13], have an initial search space of (n – l + 1)^t t-tuples (x1, x2, ..., xt) in the case of q = 1; each t-tuple is composed of t l-mers from t input sequences, i.e., a group of possible motif instances
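To make the problem statement above concrete, the following is a minimal brute-force sketch in Python of the qPMS acceptance condition (an l-length string occurring in at least qt sequences with up to d mismatches) and of the (n – l + 1)^t count for the sample-driven search space when q = 1. It is an illustration only, not SamSelect and not any of the cited algorithms. The function names, the toy sequences, and the value l = 15 used in the usage line are assumptions introduced here; only t = 20 and n = 600 come from the text.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(motif: str, seq: str, d: int) -> bool:
    """True if some l-mer of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def is_quorum_motif(motif: str, seqs: list[str], q: float, d: int) -> bool:
    """True if motif occurs (up to d mismatches) in at least q*t sequences."""
    hits = sum(occurs_in(motif, s, d) for s in seqs)
    # Small tolerance guards against floating-point error in q * t.
    return hits + 1e-9 >= q * len(seqs)


def sample_driven_search_space(n: int, l: int, t: int) -> int:
    """Number of t-tuples of l-mers, one l-mer per sequence, for q = 1."""
    return (n - l + 1) ** t


# Toy data (hypothetical): "ACGT" matches 3 of the 4 sequences with d = 1.
toy = ["ACGTACGT", "TTACGAAC", "GGGGGGGG", "CCACTTCC"]
print(is_quorum_motif("ACGT", toy, q=0.5, d=1))
print(is_quorum_motif("ACGT", toy, q=1.0, d=1))

# Standard small dataset from the text (t = 20, n = 600); l = 15 is assumed.
print(sample_driven_search_space(n=600, l=15, t=20))
```

Because the tuple count grows exponentially in t, even this small standard setting yields an astronomically large initial search space, which is consistent with the observation that a large t lengthens computation time.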

Summary

Introduction

Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. The sample-driven qPMS algorithms, such as WINNOWER [5], DPCFG [12] and RecMotif [13], have an initial search space of (n – l + 1)^t t-tuples (x1, x2, ..., xt) in the case of q = 1; each t-tuple is composed of t l-mers from t input sequences, i.e., a group of possible motif instances. Because of the much smaller initial search space, the pattern-driven qPMS algorithms

Objectives

Methods

Results

Conclusion