Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks.

Bradley Butcher, Christopher Robinson, Novi Quadrianto, Jeremy Reffin, Vincent S Huang, Grace Charles, Sema K Sgaier

doi:10.3389/frai.2021.612551

Open Access

https://doi.org/10.3389/frai.2021.612551

Abstract

Developing data-driven solutions that address real-world problems requires an understanding of these problems’ causes and how their interactions affect the outcome, often with only observational data available. Causal Bayesian Networks (BNs) have been proposed as a powerful method for discovering and representing the causal relationships in observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for global health research in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partly because the results cannot be validated against a ground truth, as the true DAG is not available. The problem is especially acute when a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, we developed a tool that generates synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. We recorded the results given by well-known structure learning algorithms and by a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score. These results were used to populate the Causal Datasheet, and recommendations could be made depending on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process.
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
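The synthetic-benchmark idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: it assumes binary variables, draws a random DAG and random conditional probability tables, forward-samples a dataset, and scores a candidate graph against the known ground truth with a structural Hamming distance. All function names (`random_dag`, `random_cpts`, `forward_sample`, `structural_hamming_distance`) and parameter values are hypothetical, and the structure learning step itself is deliberately left out.

```python
import random
from itertools import product


def random_dag(n_nodes, edge_prob, rng):
    """Return a DAG as {child: [parents]}; edges only run from lower to higher index."""
    return {
        child: [p for p in range(child) if rng.random() < edge_prob]
        for child in range(n_nodes)
    }


def random_cpts(dag, rng):
    """Assumption: binary variables. Map each parent configuration to P(node = 1)."""
    return {
        node: {
            config: rng.uniform(0.1, 0.9)
            for config in product((0, 1), repeat=len(parents))
        }
        for node, parents in dag.items()
    }


def forward_sample(dag, cpts, n_rows, rng):
    """Ancestral sampling: node indices are already a topological order."""
    rows = []
    for _ in range(n_rows):
        row = {}
        for node in sorted(dag):
            config = tuple(row[p] for p in dag[node])
            row[node] = int(rng.random() < cpts[node][config])
        rows.append(row)
    return rows


def edges(dag):
    """Set of directed (parent, child) edges."""
    return {(p, c) for c, parents in dag.items() for p in parents}


def structural_hamming_distance(true_dag, learned_dag):
    """Count missing, extra, and reversed edges; a reversal counts once."""
    t, l = edges(true_dag), edges(learned_dag)
    reversed_count = sum(1 for (a, b) in t if (b, a) in l)
    missing = sum(1 for (a, b) in t if (a, b) not in l and (b, a) not in l)
    extra = sum(1 for (a, b) in l if (a, b) not in t and (b, a) not in t)
    return missing + extra + reversed_count


rng = random.Random(0)
true_dag = random_dag(n_nodes=6, edge_prob=0.4, rng=rng)
data = forward_sample(true_dag, random_cpts(true_dag, rng), n_rows=100, rng=rng)
```

In a datasheet-style experiment, a structure learning algorithm would be run on `data` for several sample sizes, and the distance between its output and `true_dag` would be recorded for each size.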

Highlights

To meet ambitious global health and development goals in lower and middle income countries (LMICs), policy decisions have been increasingly reliant on data-driven approaches to provide necessary insights.
While we had determined that 5,000 respondents was likely a large enough sample to have sufficient power for predictive regression models, we did not know whether this sample size would have sufficient performance for a causal Bayesian network model.
Having a Causal Datasheet that describes the expected performance in recovering ground truth structures for any given dataset can be tremendously valuable to both machine learning scientists and practitioners.

Summary

Introduction

To meet ambitious global health and development goals in lower and middle income countries (LMICs), policy decisions have been increasingly reliant on data-driven approaches to provide necessary insights. This has spawned numerous programs ranging from specific subjects at the sub-national and national level (e.g., Community Behavior Tracking Survey in Uttar Pradesh, India, and the Social and Living Standards Measurement in Pakistan) to broad health topics with multinational participation (e.g., the Multiple Indicator Cluster Surveys developed by the United Nations Children’s Fund, and the USAID-backed Demographic and Health Survey) (Croft et al., 2018; Khan and Hancioglu, 2019; Huang et al., 2020; Pakistan Bureau of Statistics, 2020).

Randomized controlled trials (RCTs) are by design conducted with the intent to test a narrow set of hypotheses, not to explore unknown causal structures, which is a potential missed opportunity to target public health solutions more precisely.

Journal: Frontiers in Artificial Intelligence
Publication Date: Apr 14, 2021
Citations: 12
License type: CC BY 4.0