Efficient Enriching of Synthesized Relational Patient Data with Time Series Data

Simon Schiff, Marcel Gehrke, Ralf Möller

doi:10.1016/j.procs.2018.10.130

Abstract

Analysing data from electronic healthcare records supports decision making and can thereby improve healthcare. However, obtaining sufficient healthcare data for machine learning analysis is challenging due to, e.g., privacy aspects of medical data. For machine learning tasks, carefully prepared synthesized medical records can be as good as real records, as shown in [17]. Existing tools for medical data provision generate either relational records or streams of measurements over time, but not an appropriate combination of both. In this paper, we contribute an approach to enriching synthesized relational data with time series (longitudinal data) of real patients. We use Synthea to synthesize relational data and enrich the records with time series from the anonymized MIMIC III database. In our data integration scenario, we need to find, for each relational record, the best-matching time series in order to obtain a sufficient amount of medical data for machine learning analyses. Our experiments show that we can enrich huge amounts of relational data with real time series data. However, without any processing optimizations, the runtime does not scale well with the number of synthesized relational records. With several optimizations and a distributed execution engine, such as Apache Spark SQL, we can efficiently enrich synthesized relational data with time series data.

Full Text