Finding Surprisingly Frequent Patterns of Variable Lengths in Sequence Data

Reza Sadoddin, Joerg Sander, Davood Rafiei

doi:10.1137/1.9781611974348.4

Abstract

We address the problem of finding ‘surprising’ patterns of variable length in sequence data, where a surprising pattern is defined as a subsequence of a longer sequence whose observed frequency is statistically significant with respect to a given distribution. Finding statistically significant patterns in sequence data is the core task in applications such as biological motif discovery and anomaly detection. We show that the presence of a few ‘true’ surprising patterns in the data can cause a large number of highly correlated patterns to appear statistically significant merely because of those few patterns. Our approach to this ‘redundant patterns’ problem is based on capturing the dependencies between patterns through an ‘explain’ relationship, in which a set of patterns can explain the statistical significance of another pattern. This allows us to address redundancy by choosing a few ‘core’ patterns that explain the significance of all other significant patterns. We propose a greedy algorithm for efficiently finding an approximate core pattern set of minimum size. Using both synthetic and real-world sequential data, chosen from different domains including medicine and bioinformatics, we show that the proposed notion of core patterns very closely matches the notion of ‘true’ surprising patterns in data.
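
As a rough illustration of what "observed frequency is statistically significant with respect to a given distribution" can mean, the sketch below scores a pattern against an i.i.d. background model. It is not the authors' method: the abstract does not specify the null model or the test, so the i.i.d. assumption, the binomial-style variance (which ignores dependence between overlapping occurrences), and the names count_occurrences and surprise_z_score are all assumptions made for this example.

```python
from math import prod, sqrt


def count_occurrences(seq, pattern):
    """Count (possibly overlapping) occurrences of pattern in seq."""
    m = len(pattern)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == pattern)


def surprise_z_score(seq, pattern, probs):
    """Return (observed, expected, z) for pattern under an assumed i.i.d. model.

    probs maps each symbol to its background probability. The variance is a
    binomial approximation and treats start positions as independent, which
    is only a rough approximation for self-overlapping patterns.
    """
    n_pos = len(seq) - len(pattern) + 1        # candidate start positions
    p = prod(probs[c] for c in pattern)        # chance of a match at one position
    expected = n_pos * p
    variance = n_pos * p * (1 - p)
    observed = count_occurrences(seq, pattern)
    return observed, expected, (observed - expected) / sqrt(variance)


# Hypothetical toy input: a short DNA-like string with a uniform background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
observed, expected, z = surprise_z_score("ACGTACGTACGTTTGA", "ACGT", uniform)
```

In this toy input, sub-patterns such as "ACG" or "CGT" also score highly simply because "ACGT" is frequent, which is the kind of redundancy the abstract's ‘explain’ relationship is meant to address.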

Full Text