The National COVID Cohort Collaborative: Analyses of Original and Computationally Derived Electronic Health Record Data.

Randi Foraker, Adam Wilcox, Jason Thomas, Noa Zamstein, Aixia Guo, Philip Ro Payne

doi:10.2196/30697

Abstract

Background: Computationally derived (“synthetic”) data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic.

Objective: We aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes.

Methods: We used the National COVID Cohort Collaborative’s instance of MDClone, a big data platform with data-synthesizing capabilities (MDClone Ltd). We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19–positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19–related measures and outcomes, and constructing their epidemic curves. We compared the results from synthetic data to those from original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data.

Results: For each use case, the results of the synthetic data analyses successfully mimicked those of the original data such that the distributions of the data were similar and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded overall nearly the same results, there were exceptions that included an odds ratio on either side of the null in multivariable analyses (0.97 vs 1.01) and differences in the magnitude of epidemic curves constructed for zip codes with low population counts.

Conclusions: This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.

Highlights

COVID-19 presents data and knowledge sharing challenges [1]
Novel data whose features are queried independently for each distinct use case are produced in a multidimensional space that adheres to the statistical properties of the original source data
We found the random forest (RF) model achieved an area under the receiver operating characteristic (ROC) curve of 0.814 (0.816 by logistic regression [LR]) using original data, and 0.812 (0.815 by LR) using synthetic data (Figure 1 A and C)
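
The comparison in the last highlight comes down to computing the area under the ROC curve once on original data and once on synthetic data, then checking how far apart the two values are. The following is a minimal sketch of that idea in plain Python, not the study's pipeline: the `roc_auc` helper, the variable names, and the toy labels and scores are all hypothetical and chosen only for illustration. It uses the rank-based (Mann-Whitney) identity for the AUC rather than any particular library.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative toy values only; these are NOT data from the study.
labels_original = [0, 0, 1, 1, 0, 1]
scores_original = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
labels_synthetic = [0, 1, 0, 1, 0, 1]
scores_synthetic = [0.15, 0.45, 0.30, 0.75, 0.70, 0.65]

auc_original = roc_auc(labels_original, scores_original)
auc_synthetic = roc_auc(labels_synthetic, scores_synthetic)
auc_gap = abs(auc_original - auc_synthetic)

print(f"original={auc_original:.3f} synthetic={auc_synthetic:.3f} gap={auc_gap:.3f}")
```

In practice the scores would come from a model (such as the RF or LR models named above) fitted separately on each data set, and a small gap between the two AUC values would suggest that the synthetic data preserve the signal the model relies on.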

Summary

Introduction

COVID-19 presents data and knowledge sharing challenges [1]. Clinical data exist at individual institutions, and these data are rarely shared with external entities. Big data from multiple institutions allow for more comprehensive analyses, including the characterization of rare outcomes [2,3]. In response to this need, the National COVID Cohort Collaborative (N3C), an open science community, was formed to ingest and harmonize COVID-19 data from institutions across the United States [4].

Objective: We aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes.

Methods: We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19–positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19–related measures and outcomes, and constructing their epidemic curves.

Conclusions: This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.



Journal: Journal of Medical Internet Research
Publication Date: Oct 4, 2021
Citations: 13
License type: cc-by


