Discovering common recurrent patterns in multiple strings over large alphabets

Aqil M Azmi,Abdulrakeeb M Al-Ssulami

doi:10.1016/j.patrec.2014.12.009

Abstract

Motifs are short recurring patterns that are of much interest as they help us understand the mechanism behind regulating gene expression. This paper presents an efficient deterministic algorithm that exhaustively discovers all identical string motifs of all sizes that appears in all or most of the input strings. The input is a set of n different strings each of length N. The problem is to discover all exact instances of any size occurring at least θ times per string in q or more input strings. The quorums θ and q are user set positive integers. The algorithm starts with substrings of size 2, and keeps augmenting them one symbol a time. We encode the substrings, such that it is not possible to have two different substrings with the same encoding. After the sorting of the encoded substring, we can tell where, and which of the substrings have multiple copies. Next, we eliminate all the substrings that appear less than the quorums, and repeat the process. Theoretically the algorithm is competitive and scales extremely well with the size of the alphabet. For strings over the alphabet Σ, it is O(nN + |Σ|) in space, and O(nNL − nL2) in time, where L is the length of the longest identical string motif satisfying the constraint. For experimental purposes we applied the algorithm on real biological sequences (DNA and protein), and on randomly generated sequences using alphabet of sizes up to 1000 symbols. These experiments confirm that in practice the algorithm has indeed a linear complexity, both in space and time, making it practical for the important and difficult case of large-alphabet inputs.

Full Text