Abstract

This article presents a novel and efficient method of computing the statistical G-test, made possible by exploiting its connection with fundamental elements of information theory: by writing the G statistic as a sum of joint entropy terms, its computation is decomposed into easily reusable partial results without changing the resulting value. This method greatly improves the efficiency of applications that perform long series of G-tests on permutations of the same features, such as feature selection and causal inference applications, because the decomposition allows intensive reuse of these partial results. The efficiency of the method is demonstrated by implementing it in an experiment involving IPC–MB, an efficient Markov blanket discovery algorithm applicable both as a feature selection algorithm and as a causal inference method. The results show substantial efficiency gains for IPC–MB when the G-test is computed with the proposed method, compared both to the unoptimized G-test and to IPC–MB++, a variant of IPC–MB enhanced with an AD–tree (both static and dynamic). Although the proposed method of computing the G-test is presented here in the context of IPC–MB, it is bound neither to IPC–MB in particular nor to feature selection or causal inference applications in general, because it targets the information-theoretic concept that underlies the G-test, namely conditional mutual information. This grants it wide applicability in data science.
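
As an informal sketch of the decomposition described above (the article's own notation and normalization may differ), the G statistic of a test of conditional independence between features X and Y given a conditioning set Z, over N samples, can be expressed through empirical conditional mutual information, which in turn expands into joint entropies:

    G(X, Y \mid Z) = 2N \cdot \hat{I}(X; Y \mid Z)

    \hat{I}(X; Y \mid Z) = \hat{H}(X, Z) + \hat{H}(Y, Z) - \hat{H}(Z) - \hat{H}(X, Y, Z)

where \hat{H}(\cdot) denotes an empirical joint entropy computed with the natural logarithm. Each entropy term depends only on the subset of features it involves, which is what makes it a reusable partial result across tests.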

Highlights

  • Statistical tests of independence are mathematical tools used to determine whether two random variables, recorded in a data set as features, are more likely to be independent of each other, as opposed to being dependent, given a significance threshold (a minimal numeric illustration follows these highlights)

  • The article discusses the theoretical relationship between the G-test and information theory, and how it is exploited by 'decomposed conditional mutual information' (dcMI), the proposed optimization

  • The experiment focuses on the most intensive use case of the IPC–MB algorithm, namely finding the Markov blankets of all the features in each data set. This exhaustive use case emulates how IPC–MB would be applied to reconstruct the entire Bayesian network of a data set, as opposed to the simpler use case of applying IPC–MB to find the Markov blanket (MB) of a single feature

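As a minimal illustration of such a test of independence (the contingency table below is invented for demonstration and does not come from the article), SciPy's chi2_contingency function computes the G statistic when the log-likelihood-ratio variant is requested, and the resulting p-value is then compared against the significance threshold:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 table of observed co-occurrence counts for two binary features
    observed = np.array([[30, 10],
                         [20, 40]])

    # lambda_="log-likelihood" selects the G statistic instead of Pearson's chi-square;
    # correction=False disables Yates' continuity correction so the raw G value is returned
    g, p_value, dof, _ = chi2_contingency(observed, correction=False, lambda_="log-likelihood")

    alpha = 0.05  # significance threshold
    verdict = "independent" if p_value > alpha else "dependent"
    print(f"G = {g:.3f}, p = {p_value:.4f}, dof = {dof} -> {verdict}")
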

Introduction

Statistical tests of independence are mathematical tools used to determine whether two random variables, recorded in a data set as features, are more likely to be independent of each other, as opposed to being dependent, given a significance threshold. Most Markov blanket discovery algorithms found in the literature are designed to perform a large number of statistical tests of conditional independence (CI) on the features of the given data set. The optimization proposed in this article, decomposed conditional mutual information (dcMI), involves writing the G statistic in the form of conditional mutual information, which can be further decomposed into a sum of reusable joint entropy terms. These terms are computed and cached once, then reused many times when computing subsequent G-tests, whether during the same run of the algorithm or during future runs on the same data set (a code sketch of this caching scheme follows this introduction).

The comparative experiment revealed that computing the G statistic in its canonical (unoptimized) form consumes an average of 99.74% of the total running time of IPC–MB, even when applied to a small data set. This further emphasizes the performance bottleneck inherent to CI tests and the necessity of optimizing them.

This article is structured as follows: Section 2 presents the dcMI optimization in the context of existing Markov blanket discovery algorithms, while describing other optimizations found in the literature with which it shares features; Sections 3 and 4 provide the theoretical background required to describe the dcMI optimization, namely the elements of information theory it relies on and the statistical G-test, respectively; Section 5 describes the theoretical design of dcMI and its relation to the G-test, including an example of how dcMI operates; Section 6 describes IPC–MB, the Markov blanket discovery algorithm used to demonstrate the efficiency of dcMI; Section 7 presents the comparative experiment in which dcMI is empirically evaluated using IPC–MB and compared to alternative optimizations of IPC–MB found in the literature; Section 8 summarizes the conclusions that emerge from the design of dcMI and from the experimental evaluation.
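
To make the caching idea concrete, the following is a minimal sketch, not the authors' implementation: the class and method names (EntropyCache, joint_entropy, g_statistic) are illustrative, and the data layout (one tuple of discrete values per sample) is assumed for demonstration only.

    from collections import Counter
    from math import log

    class EntropyCache:
        """Caches joint entropies H(S) of feature subsets S so that every G-test,
        computed as G = 2 * N * (H(XZ) + H(YZ) - H(Z) - H(XYZ)), can reuse
        previously computed terms."""

        def __init__(self, rows):
            # rows: list of tuples, one discrete value per feature
            self.rows = rows
            self.n = len(rows)
            self._cache = {}

        def joint_entropy(self, columns):
            # columns: tuple of feature indices; sorted so the same subset
            # always maps to the same cache key
            key = tuple(sorted(columns))
            if key not in self._cache:
                counts = Counter(tuple(row[c] for c in key) for row in self.rows)
                self._cache[key] = -sum(
                    (c / self.n) * log(c / self.n) for c in counts.values()
                )
            return self._cache[key]

        def g_statistic(self, x, y, z=()):
            # G = 2 * N * I(X;Y|Z), with I expressed through joint entropies
            h = self.joint_entropy
            cmi = h((x,) + z) + h((y,) + z) - h(z) - h((x, y) + z)
            return 2 * self.n * cmi

    # Example usage on a tiny hypothetical data set with three binary features
    rows = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 0)]
    cache = EntropyCache(rows)
    print(cache.g_statistic(0, 1, z=(2,)))  # G-test of features 0 and 1 given feature 2

Because every G-test over features of the same data set reduces to four joint-entropy lookups, a term computed once (for example, the entropy of a conditioning set Z) is available for free to every later test that involves the same subset of features.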

Related Work
Notes on Information Theory
The G-Test
The G Statistic and dcMI
The Iterative Parent-Child Markov Blanket Algorithm
IPC–MB and AD–Trees
IPC–MB and dcMI
A Comparative Experiment
Implementation
Data Sets
The ALARM Subexperiment
The ANDES Subexperiment
Findings
Conclusions