Abstract
A basic task of exploratory data analysis is the characterisation of "structure" in multivariate datasets. For bivariate Gaussian distributions, natural measures of dependence (the predictive relationship between individual variables) and compactness (the degree of concentration of the probability density function (pdf) around a low-dimensional axis) are respectively provided by ordinary least-squares regression and Principal Component Analysis. This study considers general measures of structure for non-Gaussian distributions and demonstrates that these can be defined in terms of the information theoretic "distance" (as measured by relative entropy) between the given pdf and an appropriate "unstructured" pdf. The measure of dependence, mutual information, is well-known; it is shown that this is not a useful measure of compactness because it is not invariant under an orthogonal rotation of the variables. An appropriate rotationally invariant compactness measure is defined and shown to reduce to the equivalent PCA measure for bivariate Gaussian distributions. This compactness measure is shown to be naturally related to a standard information theoretic measure of non-Gaussianity. Finally, straightforward geometric interpretations of each of these measures in terms of "effective volume" of the pdf are presented.
Highlights
A fundamental question in exploratory data analysis is: given observations of two variables x1 and x2, to what extent is the joint distribution of these variables “interesting”, in the sense that it is “structured”? Different kinds of structure can be considered, among which some of the most important are:

I. Dependence: to what extent does knowledge of x1 imply knowledge about x2?

II. Compactness: to what extent is variance shared between x1 and x2; that is, how tightly concentrated around a lower-dimensional surface is the joint probability density function p(x1, x2)?

That these are distinct measures of structure is illustrated by the pdfs displayed in Fig. 1, all of which by construction have the same total variance var(x1) + var(x2).
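The distinction between dependence and compactness can be seen already in the Gaussian case. The sketch below (my illustration, not code from the paper) compares three bivariate Gaussian covariance matrices with the same total variance tr(C) = 2, using the closed-form Gaussian mutual information as the dependence measure and the fraction of variance along the leading principal axis as a simple PCA-style proxy for compactness:

```python
import numpy as np

def gaussian_mi(cov):
    # Mutual information (nats) between the components of a bivariate
    # Gaussian: I = (1/2) * ln( C11*C22 / det(C) ).
    return 0.5 * np.log(cov[0, 0] * cov[1, 1] / np.linalg.det(cov))

def leading_variance_fraction(cov):
    # Fraction of total variance carried by the leading principal axis
    # (eigvalsh returns eigenvalues in ascending order).
    eig = np.linalg.eigvalsh(cov)
    return eig[-1] / eig.sum()

# Three covariances, all with total variance tr(C) = 2:
covs = {
    "isotropic":    np.array([[1.0, 0.0], [0.0, 1.0]]),  # no structure
    "axis-aligned": np.array([[1.8, 0.0], [0.0, 0.2]]),  # compact, independent
    "correlated":   np.array([[1.0, 0.8], [0.8, 1.0]]),  # compact, dependent
}

results = {name: (gaussian_mi(c), leading_variance_fraction(c))
           for name, c in covs.items()}
```

The axis-aligned case has zero mutual information yet concentrates 90% of the variance on one axis, while the correlated case has both nonzero dependence and the same compactness, which is exactly the sense in which the two notions of structure are distinct.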
A similar measure of compactness was introduced in Pena and van der Linde (2007); the present study demonstrates the connection of the compactness measure to principal component analysis (PCA), emphasizes the fundamental difference between it and mutual information as measures of structure, and illustrates how all of these measures of structure can be expressed as relative entropies.
This study has considered three measures of structure for multivariate datasets, all defined in terms of the relative entropy between a given pdf p(x) and an appropriate “unstructured” pdf.
Summary
II. Compactness: to what extent is variance shared between x1 and x2; that is, how tightly concentrated around a lower-dimensional surface is the joint probability density function (pdf) p(x1, x2)? This measure will be contrasted with the well-established measures of dependence and non-Gaussianity provided by mutual information and negentropy. The measure of compactness will be seen to be a combined measure of Gaussianity and covariance isotropy, and to have a natural connection to the standard information theoretic measure of non-Gaussianity. This discussion presents a unifying notion of “structure” in probability distributions: each of the measures of dependence, compactness, and non-Gaussianity is defined in terms of the information theoretic “distance” (as measured by relative entropy) between the given pdf and the appropriate “unstructured” pdf. One must have a clear idea of what constitutes “interesting structure” without regard to estimation questions before complexities due to finite data can be addressed.
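The non-Gaussianity measure mentioned above, negentropy, is the entropy deficit of a pdf relative to the Gaussian with the same variance: J(p) = H(Gaussian) − H(p), which is zero iff p is Gaussian. A minimal sketch (the uniform distribution is my illustrative choice, not an example from the paper):

```python
import numpy as np

def gaussian_entropy(var):
    # Differential entropy (nats) of a 1-D Gaussian with variance var:
    # H = (1/2) * ln(2*pi*e*var).
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

# The uniform distribution on [0, 1] has differential entropy
# H = ln(1) = 0 and variance 1/12, so its negentropy is
# J = (1/2) * ln(2*pi*e/12) - 0 = (1/2) * ln(pi*e/6).
var_uniform = 1.0 / 12.0
J_uniform = gaussian_entropy(var_uniform) - 0.0

# Sanity check: a Gaussian's negentropy is zero by construction.
J_gaussian = gaussian_entropy(2.5) - gaussian_entropy(2.5)
```

Because the Gaussian maximises entropy for a given variance, J is always non-negative, which is what makes it usable as a "distance" from the unstructured (Gaussian) reference pdf.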