Abstract

Access to healthcare data such as electronic health records (EHRs) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data address this problem by preserving privacy while enabling researchers and policymakers to base decisions and methods on realistic data. Healthcare data can include information about multiple in- and out-patient visits, making them time-series data that are often influenced by protected attributes such as age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups so that the conclusions drawn on synthetic data are correct and the results generalize to real data. In this article, we develop two fairness metrics for synthetic data and apply them to all subgroups defined by protected attributes to assess bias in three published synthetic research datasets. These covariate-level disparity metrics reveal that synthetic data may not be representative at the univariate and multivariate subgroup levels; thus, fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models that create more equitable synthetic healthcare datasets.
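As a concrete illustration of the kind of covariate-level disparity the abstract describes, the sketch below compares how often each protected-attribute subgroup appears in real versus synthetic data. This is a minimal, hypothetical example (the function name `subgroup_disparity` and the toy labels are our own, not the metrics proposed in the article), assuming each record carries a single categorical protected attribute such as gender.

```python
from collections import Counter

def subgroup_disparity(real_attrs, synth_attrs):
    """Absolute difference in subgroup proportions between real and
    synthetic data, per protected-attribute value (illustrative only)."""
    real_counts = Counter(real_attrs)
    synth_counts = Counter(synth_attrs)
    n_real, n_synth = len(real_attrs), len(synth_attrs)
    groups = set(real_counts) | set(synth_counts)
    # A disparity of 0 means the subgroup is equally represented in both
    # datasets; larger values indicate under- or over-representation.
    return {g: abs(real_counts[g] / n_real - synth_counts[g] / n_synth)
            for g in groups}

# Toy gender labels: synthetic data under-represents the "F" subgroup.
real = ["F", "F", "M", "F", "M", "F"]
synth = ["M", "M", "F", "M", "M", "M"]
print(subgroup_disparity(real, synth))  # {'F': 0.5, 'M': 0.5}
```

A multivariate analogue would key the counts on tuples of attributes (e.g., age band, gender, and race together), which is where representativeness tends to degrade fastest as subgroups shrink.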

Highlights

  • The COVID-19 pandemic brought to the forefront the urgent need to rapidly create healthcare solutions and responses to emerging and existing health problems

  • Resemblance in synthetic data generation is a measure of how closely the synthetic data generated by a model match the real data

  • We develop two metrics to quantify fairness on three previously published datasets, Multiparameter Intelligent Monitoring in Intensive Care (MIMIC)-III, the American Time Use Survey (ATUS), and Autism Spectrum Disorder (ASD) claims data, for different protected attributes such as age, gender, and race

Introduction

The COVID-19 pandemic brought to the forefront the urgent need to rapidly create healthcare solutions and responses to emerging and existing health problems. Data-driven approaches based on artificial intelligence (AI), machine learning (ML), and statistics offer powerful ways to address these problems rapidly. Medical records are generated by millions of individuals every day, creating an abundance of data for developing healthcare solutions and facilitating new research. Supervised and unsupervised machine learning on public health data has been used for outbreak detection, hospital readmission prediction, feature association with outcomes, and more [4]. Despite this abundance of healthcare data, research in the field is often restricted by limited public access to healthcare records. The records are protected by privacy laws such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States [5,6] and the General Data Protection Regulation (GDPR) in Europe.
