In this article we develop an information-theoretic framework for multiple sequence alignments (MSAs) based on sub-sampling. The key component of this framework is an information-theoretic potential defined on pairs of sites (links) within the MSA. This potential quantifies the expected drop in the variation of information between the two constituent sites, where the expectation is taken over all possible sub-alignments obtained by removing a finite, fixed number of rows. We show that the potential is zero for links whose sites represent columns with symbols in bijective correspondence, and that it is strictly positive otherwise. It is furthermore shown that the potential assumes its unique minimum for links at which each symbol pair appears with the same multiplicity. We then show that the established drop in the variation of information exceeds the finite-size effects inherent to the construction of the potential. Finally, as a proof of concept, we apply our results to a specific MSA composed of the inverse fold solutions of three distinguished secondary structures.
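To make the quantities in the abstract concrete, the following is a minimal Python sketch and not the construction used in the article. It assumes the standard definition of the variation of information, VI(X, Y) = 2H(X, Y) - H(X) - H(Y), computed from empirical symbol frequencies in two columns, and it assumes the drop is the VI of the full alignment minus the VI of a sub-alignment, averaged uniformly over all ways of removing k rows. The function names and the toy columns are hypothetical; the article's exact normalization and sign convention may differ.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(symbols):
    """Empirical Shannon entropy (in bits) of a non-empty sequence of symbols."""
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in Counter(symbols).values())


def variation_of_information(col_x, col_y):
    """VI(X, Y) = 2 H(X, Y) - H(X) - H(Y) for two columns of equal length."""
    joint = list(zip(col_x, col_y))
    return 2 * entropy(joint) - entropy(col_x) - entropy(col_y)


def expected_vi_drop(col_x, col_y, k):
    """Average of VI(full) - VI(sub) over all sub-alignments with k rows removed.

    Assumption: sub-alignments are weighted uniformly; requires 0 < k < number of rows.
    """
    n = len(col_x)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of rows")
    full = variation_of_information(col_x, col_y)
    drops = []
    for keep in combinations(range(n), n - k):
        sub_x = [col_x[i] for i in keep]
        sub_y = [col_y[i] for i in keep]
        drops.append(full - variation_of_information(sub_x, sub_y))
    return sum(drops) / len(drops)


# Toy columns (hypothetical data): symbols in bijective correspondence ...
bijective_x, bijective_y = "AACGGU", "UUGCCA"
# ... and a link in which every symbol pair occurs exactly once.
uniform_x, uniform_y = "AACC", "AUAU"

print(expected_vi_drop(bijective_x, bijective_y, k=1))
print(expected_vi_drop(uniform_x, uniform_y, k=1))
```

Under these assumptions, the bijective pair has zero variation of information in the full alignment and in every sub-alignment, so its averaged drop vanishes, while the second pair yields a positive value. Enumerating all row subsets is only feasible for very small alignments; it is used here purely for illustration.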