Learning Regular Expressions from Noisy Sequences

Ugo Galassi, Attilio Giordana

doi:10.1007/11527862_7

Abstract

The presence of long gaps dramatically increases the difficulty of detecting and characterizing complex events hidden in long sequences. To cope with this problem, a learning algorithm based on an abstraction mechanism is proposed: it can infer the general model of complex events from a set of learning sequences. Events are described by means of regular expressions, and the abstraction mechanism is based on the substitution property of regular languages. The induction algorithm proceeds bottom-up, progressively coarsening the sequence granularity so that correlations between subsequences separated by long gaps emerge naturally. Two abstraction operators are defined. The first one detects regular expressions that contain no iterative constructs and abstracts them into non-terminal symbols. The second one detects and abstracts iterated subsequences. By interleaving the two operators, regular expressions in general form may be inferred. Both operators are based on string alignment algorithms taken from bioinformatics. A restricted form of the algorithm has already been outlined in previous papers, where the emphasis was on applications. Here, an extended version of the algorithm is described and analyzed in detail.
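The interplay of the two abstraction operators can be illustrated with a minimal sketch. It is not the authors' algorithm: exact substring matching stands in for the string alignment step, the motifs are given rather than discovered, and the function names (`abstract_motif`, `abstract_iteration`, `expand`) and the convention that upper-case letters are non-terminals are assumptions made only for this example.

```python
import re

def abstract_motif(seq: str, motif: str, symbol: str) -> str:
    """First operator (simplified): replace every exact occurrence of a
    non-iterative motif with a non-terminal symbol."""
    return seq.replace(motif, symbol)

def abstract_iteration(seq: str, unit: str, symbol: str) -> str:
    """Second operator (simplified): replace each run of two or more
    consecutive copies of `unit` with a single non-terminal symbol."""
    pattern = "(?:" + re.escape(unit) + "){2,}"
    return re.sub(pattern, symbol, seq)

def expand(symbol: str, rules: dict) -> str:
    """Substitution property: recursively replace non-terminals by their
    definitions to recover a regular expression over terminal symbols."""
    parts = [expand(ch, rules) if ch in rules else ch for ch in rules[symbol]]
    return "(?:" + "".join(parts) + ")"

# Hypothetical rules: A is a plain motif, B is an iteration of A, C is a plain motif.
rules = {"A": "abc", "B": "A+", "C": "de"}

seq = "xxabcabcabcyyyyyydezz"
step1 = abstract_motif(seq, "abc", "A")        # "xxAAAyyyyyydezz"
step2 = abstract_iteration(step1, "A", "B")    # "xxByyyyyydezz"
step3 = abstract_motif(step2, "de", "C")       # "xxByyyyyyCzz"

print(step3)
print(expand("B", rules))                      # "(?:(?:abc)+)"
```

After the three steps, the two events appear as the single symbols `B` and `C` in a shorter sequence, which is the coarsening effect described above; `expand` shows how a non-terminal maps back to a regular expression over the original alphabet.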
