Quantifying Data Dependencies with Rényi Mutual Information and Minimum Spanning Trees

Anne Eggels, Daan Crommelin

doi:10.3390/e21020100

Abstract

In this study, we present a novel method for quantifying dependencies in multivariate datasets, based on estimating the Rényi mutual information by minimum spanning trees (MSTs). The extent to which random variables are dependent is an important question, e.g., for uncertainty quantification and sensitivity analysis. The latter is closely related to the question of how strongly the output of, e.g., a computer simulation depends on the individual random input variables. To estimate the Rényi mutual information from data, we use a method due to Hero et al. that relies on computing MSTs of the data and uses the length of the MST in an estimator for the entropy. To reduce the computational cost of constructing the exact MST for large datasets, we explore methods to compute approximations to the exact MST, and find the multilevel approach introduced recently by Zhong et al. (2015) to be the most accurate. Because the MST computation does not require knowledge (or estimation) of the distributions, our methodology is well-suited for situations where only data are available. Furthermore, we show that, in the case where only the ranking of several dependencies is required rather than their exact value, it is not necessary to compute the Rényi divergence, but only an estimator derived from it. The main contributions of this paper are the introduction of this quantifier of dependency, as well as the novel combination of using approximate methods for MSTs with estimating the Rényi mutual information via MSTs. We applied our proposed method to an artificial test case based on the Ishigami function, as well as to a real-world test case involving an El Niño dataset.
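
As a rough illustration of how the length of an MST can enter an entropy estimator, the sketch below computes the exact Euclidean MST of a point cloud and returns a Rényi entropy estimate up to an additive constant. It is not the authors' implementation: the function names, the choice gamma = 1, the relation alpha = (d - gamma)/d, and the omission of the distribution-independent constant in the estimator of Hero et al. are assumptions made for illustration, and the exact MST is built with SciPy rather than with the approximate multilevel approach discussed above.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_length(points, gamma=1.0):
    """Sum of gamma-powered Euclidean edge lengths of the exact MST.

    Hypothetical helper. Assumes all points are distinct: SciPy treats a
    zero entry in a dense distance matrix as "no edge".
    """
    dists = squareform(pdist(points))      # dense n x n distance matrix
    mst = minimum_spanning_tree(dists)     # sparse matrix with the n - 1 tree edges
    return float(np.sum(mst.data ** gamma))


def renyi_entropy_up_to_constant(points, gamma=1.0):
    """MST-based Rényi entropy estimate, omitting the additive constant.

    Assumed form: log(L_gamma / n**alpha) / (1 - alpha) with
    alpha = (d - gamma) / d, so d must exceed gamma. The omitted constant
    does not depend on the data distribution, so it does not affect a
    ranking of datasets with the same dimension d.
    """
    n, d = points.shape
    alpha = (d - gamma) / d
    length = mst_length(points, gamma)
    return np.log(length / n ** alpha) / (1.0 - alpha)
```

Because the omitted constant is the same for all datasets of a given dimension, values from this sketch are only meaningful relative to one another, which is in the spirit of the ranking use case mentioned in the abstract.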

Highlights

In the field of uncertainty quantification (UQ), the question of to what extent random variables are dependent emerges in various places
Because many methods in UQ are designed for situations where the random input variables are mutually independent, the (in)dependence of these inputs is of obvious importance
A topic in UQ that is relevant for this study is sensitivity analysis, where it is investigated which random inputs induce the largest uncertainties in the simulation output [4,5,6]

Summary

Introduction

In the field of uncertainty quantification (UQ), the question of to what extent random variables are dependent emerges in various places. Because many methods in UQ are designed for situations where the random input variables are mutually independent, the (in)dependence of these inputs is of obvious importance. Sensitivity analysis is closely related to the question of how strongly the output depends on the individual input variables. By quantifying these dependencies, one can order the inputs by the extent to which the output depends on them (from strongly to weakly dependent), providing relevant information for sensitivity analysis.

Objectives

Methods

Results

Conclusion