Abstract

BackgroundMotif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Motif discovery techniques and tools tend to be computationally imposing, however, special classes of "rigid" motifs have been identified of which the discovery is affordable in low polynomial time.ResultsIn the present work, "extensible" motifs are considered such that each sequence of gaps comes endowed with some elasticity, whereby the same pattern may be stretched to fit segments of the source that match all the solid characters but are otherwise of different lengths. A few applications of this notion are then described. In applications of data compression by textual substitution, extensible motifs are seen to bring savings on the size of the codebook, and hence to improve compression. In germane contexts, in which compressibility is used in its dual role as a basis for structural inference and classification, extensible motifs are seen to support unsupervised classification and phylogeny reconstruction.ConclusionOff-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences.

Highlights

  • Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion in the design of data compression schemes

  • Off-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences

  • We present lossy off-line data compression techniques by textual substitution in which the patterns used in compression are chosen among the extensible motifs that are found to recur in the textstring with a minimum pre-specified frequency

Read more

Summary

Introduction

Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion in the design of data compression schemes. Allowing for spacers in a string is what makes it extensible Such spacers are indicated by annotating the dot characters. D will denote the maximum number of consecutive dots allowed in a string. In such cases, for clarity of notation, we use the extensible wild card denoted by the dash symbol "-" instead of the annotated dot character, .[1,d] in the string. A motif m is extensible if it contains at least one annotated dot, otherwise m is rigid. Given an extensible string m, a rigid string m' is a realization of m if each annotated dot .α is replaced by l ∈ α dots. An extensible string m occurs at position l in s if (page number not for citation purposes)

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call