Compression schemes for advanced data structures have become a central modern challenge. Information theory has traditionally dealt with conventional data such as text, images, or video. In contrast, most data available today is multitype and context-dependent. To meet this challenge, we have recently initiated a systematic study of advanced data structures such as unlabeled graphs [8]. In this paper, we continue this program by considering trees with statistically correlated vertex names. Trees come in many forms, but here we deal with binary plane trees (where order of subtrees matters) and their non-plane version (where order of subtrees doesn't matter). Furthermore, we assume that each name is generated by a known memoryless source (horizontal independence), but a symbol of a vertex name depends in a Markovian sense on the corresponding symbol of the parent vertex name (vertical Markovian dependency). Such a model is closely connected to models of phylogenetic trees. While in general the problem of multimodal compression and associated analysis can be extremely complicated, we find that in this natural setting, both the entropy analysis and optimal compression are analytically tractable. We evaluate the entropy for both types of trees. For the plane case, with or without vertex names, we find that a simple two-stage compression scheme is both efficient and optimal. We then present efficient and optimal compression algorithms for the more complicated non-plane case.
Read full abstract