Compression and the wheel of fortune

A Apostolico,L Parida

doi:10.1109/dcc.2003.1194005

Abstract

Data compression techniques hinged on the notion of a motif are presented, interpreted here as a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Correspondingly, motif discovery techniques and tools have been devised. This task is made difficult by the circumstance that the number of motifs identifiable in general in a sequence can be exponential in the size of that sequence. A significant gain in the direction of reducing the number of motifs is achieved through the introduction of irredundant motifs, which in intuitive terms are a combination of other motif occurrences. The number of abundant motifs in a sequence is at worst linear in the sequence. It is shown that irredundant motifs can be usefully exploited in lossy compression methods based on textual substitution and suitable for signals as well as text. Preliminary experiments with these fungible strategies at the crossroads of lossless and lossy data compression show performances that improve over popular methods by more than 20% in lossy and 10% in lossless implementations.

Full Text