Efficient mining gapped sequential patterns for motifs in biological sequences

Vance Liao, Ming-Syan Chen

doi:10.1186/1752-0509-7-s4-s7

Abstract

Background: Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Biological data mining has an impact on diverse biological fields, for example through the discovery of co-occurring biosequences, which is important for biological data analysis. Sequential pattern mining approaches can discover motifs of all lengths in biological sequences. Traditional approaches, however, mine DNA and protein data inefficiently because such data have small alphabets and long sequences. Furthermore, gap constraints are important in computational biology because they cope with irrelevant regions, which are not conserved during the evolution of biological sequences.

Results: We devise an approach to efficiently mine sequential patterns (motifs) with gap constraints in biological sequences. The approach is the Depth-First Spelling algorithm for mining sequential patterns of biological sequences with Gap constraints (termed DFSG).

Conclusions: PrefixSpan is one of the most efficient traditional methods for mining sequential patterns, and it is the basis of GenPrefixSpan, which extends PrefixSpan with gap constraints. We therefore compare DFSG with GenPrefixSpan. In the experimental results, DFSG mines biological sequences much faster than GenPrefixSpan.

Highlights

Pattern mining for biological sequences is an important problem in bioinformatics and computational biology
The runtime rate rises as the number of sequences grows. This experiment confirms that DFSG is more efficient than GenPrefixSpan as the number of sequences increases
Mining sequential patterns of biological sequences is important in computational biology

Summary

Introduction

Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Sequential pattern mining approaches can discover motifs of all lengths in biological sequences. Traditional sequential pattern mining methods discover general sequential patterns, to which various constraints can be applied. For a sequential pattern of length l, these methods also discover its 2^l subsequences, because every subsequence of a frequent pattern is itself frequent. The resulting number of patterns is too large, so the maximal sequential pattern mining method [5] was proposed to efficiently identify maximal sequential patterns, which have no frequent supersequences. An alternative is to mine closed sequential patterns [6], which have no frequent supersequences with the same occurrence frequency. Mining sequential patterns in data streams [7] takes place in a different environment and has additional constraints, such as strictly restricted memory, continuous identification of sequential patterns, and linear-time execution.
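The difference between all, maximal, and closed sequential patterns can be illustrated with a small Python sketch. It is not the DFSG algorithm or any method from the cited works: the function names and the toy support table are hypothetical, chosen only to show the subsequence count and the two filtering definitions.

```python
from itertools import combinations


def subsequences(pattern):
    """List every subsequence of `pattern`, one per subset of positions.

    A pattern of length l yields 2**l entries (including the empty one).
    """
    n = len(pattern)
    return [
        "".join(pattern[i] for i in idx)
        for k in range(n + 1)
        for idx in combinations(range(n), k)
    ]


def is_subsequence(small, big):
    """True if `small` can be obtained from `big` by deleting letters."""
    it = iter(big)
    return all(ch in it for ch in small)


def maximal_patterns(supports):
    """Frequent patterns that have no frequent proper supersequence."""
    return {
        p for p in supports
        if not any(p != q and is_subsequence(p, q) for q in supports)
    }


def closed_patterns(supports):
    """Frequent patterns with no frequent proper supersequence of equal support."""
    return {
        p for p, s in supports.items()
        if not any(
            p != q and supports[q] == s and is_subsequence(p, q)
            for q in supports
        )
    }


# Hypothetical support counts for the frequent patterns of a toy database
# (illustrative values only, not taken from the paper).
toy_supports = {"A": 4, "C": 3, "G": 2, "AC": 3, "AG": 2, "CG": 2, "ACG": 2}

print(len(subsequences("ACGT")))
print(sorted(maximal_patterns(toy_supports)))
print(sorted(closed_patterns(toy_supports)))
```

In this toy table the maximal set keeps only the longest pattern and loses the support counts of its subsequences, whereas the closed set also keeps the shorter patterns whose support differs from that of every supersequence.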

Methods

Results

Conclusion