Abstract

This article presents a novel and efficient method of computing the statistical G-test, made possible by exploiting its connection with fundamental elements of information theory: by writing the G statistic as a sum of joint entropy terms, its computation is decomposed into easily reusable partial results without changing the resulting value. This method greatly improves the efficiency of applications that perform long series of G-tests on permutations of the same features, such as feature selection and causal inference applications, because the decomposition allows intensive reuse of these partial results. The efficiency of the method is demonstrated by implementing it in an experiment involving IPC–MB, an efficient Markov blanket discovery algorithm applicable both as a feature selection algorithm and as a causal inference method. The results show substantial efficiency gains for IPC–MB when the G-test is computed with the proposed method, compared both to the unoptimized G-test and to IPC–MB++, a variant of IPC–MB enhanced with an AD–tree (both static and dynamic). Although the proposed method of computing the G-test is presented here in the context of IPC–MB, it is bound neither to IPC–MB in particular nor to feature selection or causal inference applications in general, because it targets the information-theoretic concept that underlies the G-test, namely conditional mutual information. This grants it wide applicability in data science.
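
As an informal sketch of the decomposition described above (the article's own notation and normalization may differ), the G statistic of a test of conditional independence between features X and Y given a conditioning set Z, over N samples, can be expressed through empirical conditional mutual information, which in turn expands into joint entropies:

    G(X, Y \mid Z) = 2N \cdot \hat{I}(X; Y \mid Z)

    \hat{I}(X; Y \mid Z) = \hat{H}(X, Z) + \hat{H}(Y, Z) - \hat{H}(Z) - \hat{H}(X, Y, Z)

where \hat{H}(\cdot) denotes an empirical joint entropy computed with the natural logarithm. Each entropy term depends only on the subset of features it involves, which is what makes it a reusable partial result across tests.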

Highlights

  • Statistical tests of independence are mathematical tools used to determine whether two random variables, recorded in a data set as features, are more likely to be independent of each other, as opposed to being dependent, given a significance threshold (a minimal numeric illustration follows these highlights)

  • The article discusses the theoretical relationship between the G-test and information theory, and how it is exploited by 'decomposed conditional mutual information' (dcMI), the proposed optimization

  • The experiment focuses on the most intensive use case of the IPC–MB algorithm, namely finding the Markov blankets of all the features in each data set. This exhaustive use case emulates how IPC–MB would be applied to reconstruct the entire Bayesian network of a data set, as opposed to the simpler use case of applying IPC–MB to find the Markov blanket (MB) of a single feature

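As a minimal illustration of such a test of independence (the contingency table below is invented for demonstration and does not come from the article), SciPy's chi2_contingency function computes the G statistic when the log-likelihood-ratio variant is requested, and the resulting p-value is then compared against the significance threshold:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 table of observed co-occurrence counts for two binary features
    observed = np.array([[30, 10],
                         [20, 40]])

    # lambda_="log-likelihood" selects the G statistic instead of Pearson's chi-square;
    # correction=False disables Yates' continuity correction so the raw G value is returned
    g, p_value, dof, _ = chi2_contingency(observed, correction=False, lambda_="log-likelihood")

    alpha = 0.05  # significance threshold
    verdict = "independent" if p_value > alpha else "dependent"
    print(f"G = {g:.3f}, p = {p_value:.4f}, dof = {dof} -> {verdict}")
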

Introduction

Statistical tests of independence are mathematical tools used to determine whether two random variables, recorded in a data set as features, are more likely to be independent of each other, as opposed to being dependent, given a significance threshold. Most Markov blanket discovery algorithms found in the literature are designed to perform a large number of statistical tests of conditional independence (CI) on the features of the given data set. The optimization proposed in this article, decomposed conditional mutual information (dcMI), involves writing the G statistic in the form of conditional mutual information, which can be further decomposed into a sum of reusable joint entropy terms. These terms are computed and cached once, then reused many times when computing subsequent G-tests, whether during the same run of the algorithm or during future runs on the same data set (a code sketch of this caching scheme follows this introduction).

The comparative experiment revealed that computing the G statistic in its canonical (unoptimized) form consumes an average of 99.74% of the total running time of IPC–MB, even when applied to a small data set. This further emphasizes the performance bottleneck inherent to CI tests and the necessity of optimizing them.

This article is structured as follows: Section 2 presents the dcMI optimization in the context of existing Markov blanket discovery algorithms, while describing other optimizations found in the literature with which it shares features; Sections 3 and 4 provide the theoretical background required to describe the dcMI optimization, namely the elements of information theory it relies on and the statistical G-test, respectively; Section 5 describes the theoretical design of dcMI and its relation to the G-test, including an example of how dcMI operates; Section 6 describes IPC–MB, the Markov blanket discovery algorithm used to demonstrate the efficiency of dcMI; Section 7 presents the comparative experiment in which dcMI is empirically evaluated using IPC–MB and compared to alternative optimizations of IPC–MB found in the literature; Section 8 summarizes the conclusions that emerge from the design of dcMI and from the experimental evaluation.
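
To make the caching idea concrete, the following is a minimal sketch, not the authors' implementation: the class and method names (EntropyCache, joint_entropy, g_statistic) are illustrative, and the data layout (one tuple of discrete values per sample) is assumed for demonstration only.

    from collections import Counter
    from math import log

    class EntropyCache:
        """Caches joint entropies H(S) of feature subsets S so that every G-test,
        computed as G = 2 * N * (H(XZ) + H(YZ) - H(Z) - H(XYZ)), can reuse
        previously computed terms."""

        def __init__(self, rows):
            # rows: list of tuples, one discrete value per feature
            self.rows = rows
            self.n = len(rows)
            self._cache = {}

        def joint_entropy(self, columns):
            # columns: tuple of feature indices; sorted so the same subset
            # always maps to the same cache key
            key = tuple(sorted(columns))
            if key not in self._cache:
                counts = Counter(tuple(row[c] for c in key) for row in self.rows)
                self._cache[key] = -sum(
                    (c / self.n) * log(c / self.n) for c in counts.values()
                )
            return self._cache[key]

        def g_statistic(self, x, y, z=()):
            # G = 2 * N * I(X;Y|Z), with I expressed through joint entropies
            h = self.joint_entropy
            cmi = h((x,) + z) + h((y,) + z) - h(z) - h((x, y) + z)
            return 2 * self.n * cmi

    # Example usage on a tiny hypothetical data set with three binary features
    rows = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 0)]
    cache = EntropyCache(rows)
    print(cache.g_statistic(0, 1, z=(2,)))  # G-test of features 0 and 1 given feature 2

Because every G-test over features of the same data set reduces to four joint-entropy lookups, a term computed once (for example, the entropy of a conditioning set Z) is available for free to every later test that involves the same subset of features.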

Related Work
Notes on Information Theory
The G-Test
The G Statistic and dcMI
The Iterative Parent-Child Markov Blanket Algorithm
IPC–MB and AD–Trees
IPC–MB and dcMI
A Comparative Experiment
Implementation
Data Sets
The ALARM Subexperiment
The ANDES Subexperiment
Findings
Conclusions