Objective: To access Electronic Health Record (EHR) data, hospitals have implemented Clinical Data Warehouses (CDWs) using Extract, Transform, and Load (ETL) processes. While ETL performance is typically evaluated one process at a time, our study examines the cumulative impact of ETLs on data availability.

Methods: Using a real multi-hospital CDW as a case study, we modeled EHR data processing from the software sources to the CDW's data store. We simulated a scenario where researchers aimed to reconstruct breast cancer care trajectories using EHR data. We calculated the size and characteristics of the data store population and compared them to those of the original population.

Results: EHR data are recorded in various software systems depending on data category, hospital, and year, each requiring a specific series of ETLs for integration into the CDW. Despite acceptable transfer rates for each ETL (range 73 %-100 %), cumulative losses led to study populations in the data store being up to 90 % smaller than anticipated when researchers required data exhaustivity for patients. Population size decreased steeply as more data categories were required. No difference was found in population characteristics between the data store and the original cohorts.

Discussion & Conclusion: Researchers should scrutinize data availability in CDWs, as missing data could result from outsourced care, incomplete input, or underperforming ETLs. Integrating more data sources into CDWs increases the number of data routes, which requires time for ETL implementation and maintenance and increases the risk of data loss. Though commonly perceived as a “black box”, data transformation can significantly influence the reliability of populations studied in CDWs.

Public Interest Summary: To access data generated during care, researchers build Clinical Data Warehouses (CDWs). CDWs are infrastructures composed of a series of processing steps that extract the data from the data source, transform it as needed, and load it into a data store. Usually, the performance of these processing steps is evaluated one at a time. However, each data point goes through a series of processing steps before being made available for research. In this study, we aim to evaluate the impact of the entire data processing pipeline on the availability of data points in a CDW by simulating a study on breast cancer and measuring the effect on the size and characteristics of the final cohort. The cumulative losses of the processing steps resulted in a population up to 90 % smaller than anticipated. The characteristics of the final population did not differ from those of the original cohort.
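
To illustrate how individually acceptable transfer rates can compound, the following minimal Python sketch multiplies per-step rates along each data route, then multiplies across the data categories a study requires. It is not the model used in the study. The route names, the number of steps, and every rate below are hypothetical (chosen only to fall within the reported 73 %-100 % range), and the sketch assumes that losses are independent across steps and across categories.

```python
from math import prod


def route_retention(transfer_rates):
    """Fraction of records surviving a chain of ETL steps.

    Assumes each step loses records independently of the others.
    """
    return prod(transfer_rates)


def complete_patient_fraction(route_retentions):
    """Fraction of patients for whom every required category is available.

    Assumes one record per patient per category and independence
    across categories (a simplifying assumption, not a study finding).
    """
    return prod(route_retentions)


# Hypothetical routes: each list holds the transfer rates of the
# successive ETLs a data category passes through before the data store.
routes = {
    "diagnosis": [0.95, 0.90],
    "surgery": [0.90, 0.85, 0.80],
    "chemotherapy": [0.73, 0.90],
    "radiotherapy": [1.00, 0.80, 0.75],
}

per_category = {name: route_retention(rates) for name, rates in routes.items()}
overall = complete_patient_fraction(per_category.values())

for name, kept in per_category.items():
    print(f"{name}: {kept:.1%} of records retained")
print(f"patients with all four categories: {overall:.1%}")
```

Under these assumptions, the fraction of patients with complete data is always at most the retention of the weakest route, and it shrinks with each additional category required, which is consistent with the steep decrease described above.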