A novel pipeline for realistic synthetic longitudinal EHR data generation

Abstract

Background: Synthetic health data offers a promising means of sharing clinical information without compromising patient privacy. However, existing methods often produce outputs that differ in structure from real data and are evaluated in narrow contexts, limiting their practical use in downstream analytical workflows. This study introduces a pipeline that builds upon existing methods for generating realistic synthetic longitudinal electronic health record data, evaluates it across three diverse datasets, and offers evidence-based guidance on the use of synthetic data to replace or augment real data.
Methods: The pipeline extends the existing state-of-the-art HALO and ConSequence frameworks with a post-processing step that reconstructs continuous variables and timestamps, producing synthetic data that closely matches the structure of real medical record datasets. It was applied to three clinically diverse datasets: a small longitudinal cohort, a medium-sized intensive-care dataset, and a very large multi-hospital administrative dataset. Realism was assessed alongside utility for machine learning, statistical modelling, and time series analysis tasks.
Results: Across all datasets, the pipeline generated realistic synthetic data that preserved key statistical properties and relationships. Machine learning models trained on synthetic data achieved similar predictive accuracy and feature importance patterns to those trained on real data, indicating strong utility. Synthetic data also performed well in statistical modelling, with the direction and magnitude of effects generally closely aligned with the real data. However, it may be less suitable when precise estimates are required or when modelling relatively rare conditions. Importantly, although the pipeline reconstructed timestamp structures, it did not capture aggregate temporal patterns and the resulting data was therefore unsuitable for time series analysis.
Conclusions: The pipeline produces realistic and analytically useful synthetic longitudinal electronic health record data across datasets of widely varying scales. These findings provide practical guidance on when synthetic data can meaningfully substitute for or complement real data.
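The abstract does not say how the post-processing step reconstructs continuous variables and timestamps. As a rough illustration only, the sketch below assumes the generator emits bin indices for discretised values and for inter-visit gaps, and that a value is recovered by drawing uniformly within its bin. The function names, bin edges, and the uniform-sampling choice are all hypothetical and are not taken from the paper.

```python
import random
from datetime import datetime, timedelta

def reconstruct_value(bin_index, edges, rng):
    """Draw a continuous value uniformly from the bin [edges[i], edges[i + 1]]."""
    low, high = edges[bin_index], edges[bin_index + 1]
    return rng.uniform(low, high)

def reconstruct_timestamps(start, gap_bins, gap_edges_days, rng):
    """Turn binned inter-visit gaps (in days) into absolute visit timestamps."""
    stamps = [start]
    for b in gap_bins:
        gap_days = reconstruct_value(b, gap_edges_days, rng)
        stamps.append(stamps[-1] + timedelta(days=gap_days))
    return stamps

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical bin edges for a heart-rate variable (beats per minute).
heart_rate_edges = [40.0, 60.0, 80.0, 100.0, 140.0]
hr = reconstruct_value(2, heart_rate_edges, rng)  # somewhere in the 80-100 bin

# Hypothetical gap bins: 0-7 days, 7-30 days, 30-90 days.
gap_edges_days = [0.0, 7.0, 30.0, 90.0]
visits = reconstruct_timestamps(datetime(2020, 1, 1), [0, 2, 1], gap_edges_days, rng)
```

Sampling each value independently like this restores per-record structure but does nothing to shape aggregate trends over calendar time, which is in keeping with the limitation the Results describe for time series analysis.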

Similar Papers
  • Research Article
  • Cite count: 12
  • 10.1093/jamia/ocab111
Utilizing timestamps of longitudinal electronic health record data to classify clinical deterioration events.
  • Jul 16, 2021
  • Journal of the American Medical Informatics Association
  • Li-Heng Fu + 7 more

To propose an algorithm that utilizes only timestamps of longitudinal electronic health record data to classify clinical deterioration events. This retrospective study explores the efficacy of machine learning algorithms in classifying clinical deterioration events among patients in intensive care units using sequences of timestamps of vital sign measurements, flowsheet comments, order entries, and nursing notes. We design a data pipeline to partition events into discrete, regular time bins that we refer to as timesteps. Logistic regressions, random forest classifiers, and recurrent neural networks are trained on datasets with different timestep lengths against a composite outcome of death, cardiac arrest, and Rapid Response Team calls. Then these models are validated on a holdout dataset. A total of 6720 intensive care unit encounters meet the criteria and the final dataset includes 830 578 timestamps. The gated recurrent unit model utilizes timestamps of vital signs, order entries, flowsheet comments, and nursing notes to achieve the best performance on the time-to-outcome dataset, with an area under the precision-recall curve of 0.101 (0.06, 0.137), a sensitivity of 0.443, and a positive predictive value of 0.092 at the threshold of 0.6. This study demonstrates that our recurrent neural network models using only timestamps of longitudinal electronic health record data that reflect healthcare processes achieve well-performing discriminative power.
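The partitioning of events into regular timesteps can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the event types, the one-hour step, and the function name `bin_events` are assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

def bin_events(events, start, step, n_steps):
    """Count events of each type in consecutive, equally sized time bins.

    events: iterable of (timestamp, event_type) pairs.
    Returns a list of n_steps Counters, one per timestep.
    """
    bins = [Counter() for _ in range(n_steps)]
    for ts, kind in events:
        idx = (ts - start) // step  # floor division of timedeltas gives an int
        if 0 <= idx < n_steps:      # drop events outside the observed window
            bins[idx][kind] += 1
    return bins

start = datetime(2021, 1, 1, 0, 0)
events = [
    (datetime(2021, 1, 1, 0, 10), "vital_sign"),
    (datetime(2021, 1, 1, 0, 50), "order_entry"),
    (datetime(2021, 1, 1, 1, 5), "vital_sign"),
    (datetime(2021, 1, 1, 2, 59), "nursing_note"),
    (datetime(2021, 1, 1, 3, 0), "vital_sign"),  # falls outside three 1-hour bins
]
timesteps = bin_events(events, start, timedelta(hours=1), n_steps=3)
```

Each Counter can then be turned into a fixed-length feature vector per timestep, which is the kind of input a recurrent model consumes.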

  • Research Article
  • Cite count: 22
  • 10.1007/s11263-024-02102-x
Synthetic Data for Video Surveillance Applications of Computer Vision: A Review
  • May 17, 2024
  • International Journal of Computer Vision
  • Rita Delussu + 2 more

In recent years, there has been a growing interest in synthetic data for several computer vision applications, such as automotive, detection and tracking, surveillance, medical image analysis and robotics. Early use of synthetic data was aimed at performing controlled experiments under the analysis by synthesis approach. Currently, synthetic data are mainly used for training computer vision models, especially deep learning ones, to address well-known issues of real data, such as manual annotation effort, data imbalance and bias, and privacy-related restrictions. In this work, we survey the use of synthetic training data focusing on applications related to video surveillance, whose relevance has rapidly increased in the past few years due to their connection to security: crowd counting, object and pedestrian detection and tracking, behaviour analysis, person re-identification and face recognition. Synthetic training data are even more interesting in this kind of application, to address further, specific issues arising, e.g., from typically unconstrained image or video acquisition conditions and cross-scene application scenarios. We categorise and discuss the existing methods for creating synthetic data, analyse the synthetic data sets proposed in the literature for each of the considered applications, and provide an overview of their effectiveness as training data. We finally discuss whether and to what extent the existing synthetic data sets mitigate the issues of real data, highlight existing open issues, and suggest future research directions in this field.

  • Research Article
  • 10.1002/alz.056279
Accuracy in estimating prevalence and incidence of dementia using longitudinal electronic health record data from the Indian Health Service
  • Dec 1, 2021
  • Alzheimer's & Dementia
  • Luohua Jiang + 5 more

Background: Our knowledge regarding dementia epidemiology for American Indians and Alaska Natives (AI/ANs) is very limited. Longitudinal electronic health record (EHR) data available in the Indian Health Service (IHS) provides an opportunity to estimate dementia prevalence and incidence among AI/ANs at the national level. However, as with other EHR‐based studies, identifying dementia patients via clinical diagnostic codes likely underestimates the prevalence of dementia. Furthermore, longitudinal studies of dementia using EHR could be challenging if some prevalent cases of dementia cannot be identified at baseline, which might be mistakenly considered as incident cases in subsequent years. Methods: We extracted data from the IHS National Data Warehouse and related EHR databases between fiscal year (FY) 2007‐2013. Adults were identified as having dementia if they had at least one qualifying ICD‐9 diagnostic code for all‐cause dementia. A total of 1,117 AI/AN adults who were 45+ years old were identified as dementia patients in FY2007. Among these patients, we evaluated the number of years needed to correctly classify them as prevalent cases using data after FY2007. We also examined the impact of different dementia definitions on our ability to identify these prevalent cases using FY2008‐2013 data. Results: Among the FY2007 dementia patients who used IHS services in FY2008, only 63.7% of them were identified as having dementia in FY2008. The remaining 36.3% prevalent cases might be classified as incident cases if they have a qualifying diagnostic code after FY2008. Even when we used a 5‐year window (FY2008‐FY2012), only 78.9% of the FY2007 dementia patients were identified as prevalent cases. Among dementia patients who were 65+ years old in FY2007, a 5‐year window correctly identified 87.7% of the FY2007 dementia patients as prevalent cases.
Altering the definition of dementia by adding dementia medications or diagnostic codes for other types of cognitive disorders did not substantially change our ability to distinguish prevalent vs. incident cases of dementia. Conclusions: It is challenging to distinguish dementia prevalent and incident cases using EHR data from the IHS. The accuracy of estimating prevalence and incidence of dementia using this data source might be higher among older patients.
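The window-based check described in the Methods can be illustrated with a small sketch: take the patients who carry a qualifying code in the baseline year and ask what fraction carry one again within the next k years. The data layout (a mapping from patient ID to the set of fiscal years with a qualifying code) and the toy records are assumptions made for illustration. The sketch also ignores the study's restriction to patients who used IHS services in the follow-up year.

```python
def reidentified_fraction(code_years, baseline_year, window_years):
    """Fraction of baseline-year cases with a qualifying code in the next k years.

    code_years: dict mapping patient ID -> set of fiscal years with a qualifying code.
    """
    baseline = [pid for pid, years in code_years.items() if baseline_year in years]
    window = set(range(baseline_year + 1, baseline_year + 1 + window_years))
    found = sum(1 for pid in baseline if code_years[pid] & window)
    return found / len(baseline)

# Toy records, not study data.
code_years = {
    "a": {2007, 2008},
    "b": {2007, 2010},
    "c": {2007},
    "d": {2009},        # not a baseline case, so excluded from the denominator
    "e": {2007, 2013},  # coded again only after a 5-year window has closed
}
one_year = reidentified_fraction(code_years, 2007, 1)
five_year = reidentified_fraction(code_years, 2007, 5)
```

In this toy data a longer window recovers more of the baseline cases, mirroring the direction of the reported 63.7% (one year) and 78.9% (five years) figures without reproducing them.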

  • Abstract
  • Cite count: 5
  • 10.1182/blood-2023-190151
Harnessing Artificial Intelligence for Risk Stratification in Acute Myeloid Leukemia (AML): Evaluating the Utility of Longitudinal Electronic Health Record (EHR) Data Via Graph Neural Networks
  • Nov 2, 2023
  • Blood
  • Riya Sinha + 9 more


  • Conference Article
  • 10.54941/ahfe1006801
Data Synthetization and Feature Analysis: A Study in Bladder Cancer Recurrence Data
  • Jan 1, 2025
  • AHFE international
  • Sandi Baressi Šegota + 7 more

The application of synthetic data within the biomedical domain is rapidly gaining momentum, driven by the growing need for robust datasets suitable for machine learning (ML) and statistical modeling. In scenarios where access to real patient data is limited due to privacy concerns or scarcity, synthetic data offers an attractive alternative. These artificially generated datasets aim to mimic the statistical characteristics of original data, enabling researchers to conduct exploratory analysis, develop predictive models, or validate findings without compromising patient confidentiality. However, the increasing use of synthetic data raises several methodological and interpretative challenges, particularly regarding the correct sequence and context for applying statistical analyses. One of the central issues identified in contemporary literature concerns the timing of data analysis relative to the synthetic data generation process. Some studies conduct statistical or ML analyses directly on real datasets and use synthetic data for validation or augmentation. Others, conversely, perform all stages of analysis, including feature importance estimation, correlation assessment, and model training, on synthetic data. This inconsistency raises the question of whether statistical analysis conducted solely on synthetic datasets yields reliable insights, or whether it constitutes a methodological flaw. The prevailing assumption is that analysis should ideally be performed on real data to preserve statistical integrity, but empirical evaluation of this notion remains limited. In the current study, the authors address this issue by applying a synthetic data generation method, specifically the Tabular Variational Autoencoder (TVAE), to a biomedical dataset focused on bladder cancer recurrence. This dataset includes various diagnostic variables, and the primary goal is to assess how well synthetic data replicates analytical insights drawn from the original data.
To achieve this, the authors conduct both correlational analysis and machine learning-based feature importance estimation. The results derived from synthetic datasets of varying sizes are then compared to those obtained from the original data. The findings indicate that while synthetic data can approximate general trends observed in the original dataset, there are notable differences depending on the analytical technique employed. In particular, models such as Random Forest appear more sensitive to variations introduced during the synthetization process. This sensitivity manifests as shifts in feature importance rankings and variability in predictive performance, especially when working with smaller synthetic datasets. On the other hand, simpler statistical methods such as correlation coefficients display more stability, suggesting that some analytical approaches may be more robust to data generation artifacts than others. These observations underscore the importance of methodological caution when interpreting results based on synthetic biomedical data. While synthetic datasets hold considerable promise for advancing data-driven research in biomedicine, they are not a one-size-fits-all solution. The sequence in which synthetic data is introduced into the research pipeline, whether before or after statistical analysis, can significantly influence the validity of the findings. As such, researchers must critically assess the suitability of synthetic data for specific analytical tasks and ensure transparency in reporting their methodological choices. Future work should further explore the impact of different generative models and dataset properties on the reliability of synthetic-data-driven insights.

  • Research Article
  • 10.1161/circ.146.suppl_1.13653
Abstract 13653: Leveraging Natural Language Processing and Machine Learning to Predict Worsening Heart Failure Events
  • Nov 8, 2022
  • Circulation
  • Rishi V Parikh + 12 more

Background: Prior risk models in patients with heart failure (HF) have focused on hospitalizations for worsening HF (WHF) and have not evaluated for differences in predictors by left ventricular ejection fraction (LVEF). We used natural language processing (NLP) and machine learning methods with access to longitudinal electronic health record (EHR) data to develop risk prediction models for WHF events across practice settings and by LVEF category. Methods: We identified all adults with HF and known LVEF on January 1st of each year from 2011-2019 in an integrated health care system. WHF events within 1 year were defined as any hospitalization, emergency department, or outpatient encounter with ≥1 symptom, ≥2 objective findings including ≥1 sign, and ≥1 change in HF-related therapy. Signs and symptoms were ascertained using rule-based NLP. We conducted boosted decision tree-based ensemble models for any WHF event within each LVEF category: HF with reduced EF (HFrEF; LVEF ≤40%), HF with mildly reduced EF (HFmrEF; LVEF 41-49%), and HF with preserved EF (HFpEF; LVEF ≥50%). We evaluated model discrimination using area under the curve (AUC) and model calibration using Brier scores. Results: Among 359,298 patients from 2011-2019, 65,838 (18%) had HFrEF, 52,491 (15%) had HFmrEF, and 240,969 (67%) had HFpEF. Mean age was 75±12, 47% were women, and 37% were minorities including 10% Black, 11% Asian/Pacific Islander, and 12% of Hispanic ethnicity. WHF events occurred in 22% of patients with HFrEF, 17% with HFmrEF, and 16% with HFpEF. The models displayed an AUC of 0.75 and Brier score of 0.15 for HFrEF and an AUC of 0.77 and Brier scores of 0.12 for both HFmrEF and HFpEF. Clinical predictors were similar across LVEF categories (Table). Conclusions: Longitudinal EHR data can be leveraged using NLP and machine learning for accurate risk estimation that reliably identifies clinical predictors across a range of LVEF.
These findings may provide novel insight into the natural history of HF.
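For readers unfamiliar with the two metrics reported above, the sketch below computes them from scratch on toy labels and predicted probabilities. The AUC is the fraction of (event, non-event) pairs in which the event received the higher predicted risk, with ties counted as half. The Brier score is the mean squared difference between predicted probability and outcome. The data are made up for illustration and bear no relation to the study's models.

```python
def roc_auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Toy example: four patients, two with an event.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_prob)        # 3 of 4 positive/negative pairs ranked correctly
brier = brier_score(y_true, y_prob)  # lower is better; 0 is a perfect forecast
```

The pairwise AUC is quadratic in the number of patients, so for cohorts of the size reported here a rank-based implementation from an established library would be the practical choice.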

  • Research Article
  • Cite count: 25
  • 10.1111/add.14374
AUDIT-C and ICD codes as phenotypes for harmful alcohol use: association with ADH1B polymorphisms in two US populations.
  • Aug 1, 2018
  • Addiction
  • Amy C Justice + 13 more

Longitudinal electronic health record (EHR) data offer a large-scale, untapped source of phenotypical information on harmful alcohol use. Using established, alcohol-associated variants in the gene that encodes the enzyme alcohol dehydrogenase 1B (ADH1B) as criterion standards, we compared the individual and combined validity of three longitudinal EHR-based phenotypes of harmful alcohol use: Alcohol Use Disorders Identification Test-Consumption (AUDIT-C) trajectories; mean age-adjusted AUDIT-C; and diagnoses of alcohol use disorder (AUD). With longitudinal EHR data from the Million Veteran Program (MVP) linked to genetic data, we used two population-specific polymorphisms in ADH1B that are associated strongly with AUD in African Americans (AAs) and European Americans (EAs): rs2066702 (Arg369Cys, AAs) and rs1229984 (Arg48His, EAs) as criterion measures. United States Department of Veterans Affairs Healthcare System. A total of 167 721 veterans (57 677 AAs and 110 044 EAs; 92% male, mean age = 63 years) took part in this study. Data were collected from 1 October 2007 to 1 May 2017. Using all AUDIT-C scores and AUD diagnostic codes recorded in the EHR, we calculated age-adjusted mean AUDIT-C values, longitudinal statistical trajectories of AUDIT-C scores and ICD-9/10 diagnostic groupings for AUD. A total of 19 793 AAs (34.3%) had one or two minor alleles at rs2066702 [minor allele frequency (MAF)=0.190] and 6933 EAs (6.3%) had one or two minor alleles at rs1229984 (MAF=0.032). In both populations, trajectories and age-adjusted mean AUDIT-C were correlated (r=0.90) but, when considered separately, highest score (8+ versus 0) of age-adjusted mean AUDIT-C demonstrated a stronger association with the ADH1B variants [adjusted odds ratio (aOR) 0.54 in AAs and 0.37 in EAs] than did the highest trajectory (aOR 0.71 in AAs and 0.53 in EAs); combining AUDIT-C metrics did not improve discrimination.
When age-adjusted mean AUDIT-C score and AUD diagnoses were considered together, age-adjusted mean AUDIT-C (8+ versus 0) was associated with lower odds of having the ADH1B minor allele than were AUD diagnostic codes: aOR=0.59 versus 0.86 in AAs and 0.48 versus 0.68 in EAs. These independent associations combine to yield an even lower aOR of 0.51 for AAs and 0.33 for EAs. The age-adjusted mean AUDIT-C score is associated more strongly with genetic polymorphisms of known risk for alcohol use disorder than are longitudinal trajectories of AUDIT-C or AUD diagnostic codes. AUD diagnostic codes modestly enhance this association.
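The reported carrier percentages and minor allele frequencies can be cross-checked with a short calculation. Assuming Hardy-Weinberg equilibrium (an assumption made here for illustration, not one stated in the abstract), a minor allele frequency q implies that a fraction 1 - (1 - q)^2 of people carry one or two minor alleles.

```python
def carrier_fraction(maf):
    """Expected share of people with one or two minor alleles under Hardy-Weinberg."""
    return 1.0 - (1.0 - maf) ** 2

# Figures quoted in the abstract.
aa_expected = carrier_fraction(0.190)  # rs2066702 in African Americans
aa_observed = 19793 / 57677

ea_expected = carrier_fraction(0.032)  # rs1229984 in European Americans
ea_observed = 6933 / 110044

print(round(aa_expected, 3), round(aa_observed, 3))  # 0.344 0.343
print(round(ea_expected, 3), round(ea_observed, 3))  # 0.063 0.063
```

The expected and observed shares agree to within about a tenth of a percentage point, so the quoted counts, percentages, and MAF values are mutually consistent under that assumption.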

  • Research Article
  • Cite count: 61
  • 10.1161/circoutcomes.118.005114
Recurrent Neural Networks for Early Detection of Heart Failure From Longitudinal Electronic Health Record Data: Implications for Temporal Modeling With Respect to Time Before Diagnosis, Data Density, Data Quantity, and Data Type.
  • Oct 1, 2019
  • Circulation: Cardiovascular Quality and Outcomes
  • Robert Chen + 4 more

We determined the impact of data volume and diversity and training conditions on recurrent neural network methods compared with traditional machine learning methods. Using longitudinal electronic health record data, we assessed the relative performance of machine learning models trained to detect a future diagnosis of heart failure in primary care patients. Model performance was assessed in relation to data parameters defined by the combination of different data domains (data diversity), the number of patient records in the training data set (data quantity), the number of encounters per patient (data density), the prediction window length, and the observation window length (ie, the time period before the prediction window that is the source of features for prediction). Data on 4370 incident heart failure cases and 30 132 group-matched controls were used. Recurrent neural network model performance was superior under a variety of conditions that included (1) when data were less diverse (eg, a single data domain like medication or vital signs) given the same training size; (2) as data quantity increased; (3) as density increased; (4) as the observation window length increased; and (5) as the prediction window length decreased. When all data domains were used, the performance of recurrent neural network models increased in relation to the quantity of data used (ie, up to 100% of the data). When data are sparse (ie, fewer features or low dimension), model performance is lower, but a much smaller training set size is required to achieve optimal performance compared with conditions where data are more diverse and includes more features. Recurrent neural networks are effective for predicting a future diagnosis of heart failure given sufficient training set size. Model performance appears to continue to improve in direct relation to training set size.

  • Research Article
  • Cite count: 1
  • 10.23889/ijpds.v9i5.2766
SynD: Australian synthetic health data community of practice
  • Sep 10, 2024
  • International Journal of Population Data Science
  • Ben Hachey + 10 more

Objectives: The current workflow for health data research in Australia is inefficient. After funding is secured, researchers often face delays of months or years to access the necessary data. Synthetic data could significantly improve the pace and impact of health data research but lacks foundational infrastructure. We aim to develop this infrastructure and support the use of synthetic data to improve data access and research quality across Australia. Approach: We held two workshops with Australian groups working on synthetic data. The format included participant updates and invited talks on international approaches to synthetic data and health data research. Workshops collected use cases and stimulated discussion on national collaboration. A facilitator then led thematic analysis to draft a consensus roadmap and terms of reference towards national synthetic data infrastructure. Results: We recruited 18 participants. Participants were cross-sectoral: universities (9), research funding bodies (5), state health departments (4). Six states and territories were represented: Queensland (6), New South Wales (3), Victoria (3), Western Australia (3), Australian Capital Territory (2), South Australia (1). Gender: women (11), men (7). The roadmap includes stakeholder engagement, a governance framework, and training events. Conclusion: SynD is an Australian community of practice for synthetic health data. Our mission is to unlock the value of health information through synthetic data to advance research, education, innovation and service delivery within the health and care sector. This collaborative effort should ensure a harmonised approach to the safe and effective utilisation of synthetic data to enhance health outcomes across Australia.

  • Research Article
  • Cite count: 7
  • 10.1177/20539517251318289
The ontological politics of synthetic data: Normalities, outliers, and intersectional hallucinations
  • Apr 13, 2025
  • Big Data & Society
  • Francis Lee + 2 more

Synthetic data is increasingly used as a substitute for real data due to ethical, legal, and logistical reasons. However, the rise of synthetic data also raises critical questions about its entanglement with the politics of classification and the reproduction of social norms and categories. This paper aims to problematize the use of synthetic data by examining how its production is intertwined with the maintenance of certain worldviews and classifications. We argue that synthetic data, like real data, is embedded with societal biases and power structures, leading to the reproduction of existing social inequalities. Through empirical examples, we demonstrate how synthetic data tends to highlight majority elements as the “normal” and minimize minority elements, and that the slight changes to the data structures that create synthetic data will also inevitably result in what we term “intersectional hallucinations.” These hallucinations are inherent to synthetic data and cannot be entirely eliminated without compromising the purpose of creating synthetic datasets. We contend that decisions about synthetic data involve determining which intersections are essential and which can be disregarded, a practice which will imbue these decisions with norms and values. Our study underscores the need for critical engagement with the mathematical and statistical choices in synthetic data production and advocates for careful consideration of the ontological and political implications of these choices during curatorial style production of synthetic structured data.

  • Research Article
  • Cite count: 1
  • 10.1109/jbhi.2025.3551312
EvolveFNN: An Interpretable Framework for Early Detection Using Longitudinal Electronic Health Record Data.
  • Jul 1, 2025
  • IEEE journal of biomedical and health informatics
  • Yufeng Zhang + 4 more

The extensive adoption of artificial intelligence in clinical decision support systems requires greater model interpretability. Hence, we introduce EvolveFNN, an interpretable model based on the recurrent neural network that merges fuzzy logic principles with recurrent units. This model is designed to train precise and understandable models using high-dimensional longitudinal electronic health records data. Through supervised learning, our method allows the identification of variable encoding functions and significant rules. To demonstrate performance and capabilities in classification and rule discovery, we first test our method on a simulated dataset. The proposed methods achieve the best model performance compared to other methods, and the rules learned are almost identical to what we used to generate the synthetic data. Furthermore, we showcase a pilot application that proves its potential in the early detection of cardiac event onset. Our proposed algorithm obtains a comparable model performance to vanilla GRU models and remains relatively stable when the prediction window size changes. Examining the rules generated by our proposed model, we find that the extracted rules not only align with clinical practices and existing literature but also provide potential risk factors not explored in the population. The additional experiments on the MIMIC-III benchmark dataset show the algorithm's generalizability. In conclusion, our proposed approach can effectively train accurate, interpretable, and reliable models using large longitudinal electronic health records, offering clinicians valuable insights.

  • Research Article
  • Cite count: 16
  • 10.1002/cpt.3001
A case for synthetic data in regulatory decision-making in Europe.
  • Aug 24, 2023
  • Clinical Pharmacology & Therapeutics
  • Clara Alloza + 12 more

Regulators are faced with many challenges surrounding health data usage, including privacy, fragmentation, validity, and generalizability, especially in the European Union (EU), for which synthetic data may provide innovative solutions. Synthetic data, defined as data artificially generated rather than captured in the real world, are increasingly being used for healthcare research purposes as a proxy to real-world data (RWD). Currently, there are barriers that are particularly challenging in Europe, where sharing patients' data is strictly regulated, costly, and time consuming, causing delays in evidence generation and regulatory approvals. Recent initiatives are encouraging the use of synthetic data in regulatory decision-making and health technology assessment to overcome these challenges, but synthetic data have still to overcome realistic obstacles before their adoption by researchers and regulators in Europe. Thus, the emerging use of RWD and synthetic data by pharmaceutical and medical device industries calls regulatory bodies to provide a framework for proper evidence generation and informed regulatory decision-making. As the provision of data becomes more ubiquitous in scientific research, so will innovations in artificial intelligence, machine learning, and generation of synthetic data, making the exploration and intricacies of this topic all the more important and timely. In this review, we discuss the potential merits and challenges of synthetic data in the context of decision-making in the European regulatory environment. We explore the current uses of synthetic data and ongoing initiatives, the value of synthetic data for regulatory purposes, and realistic barriers to the adoption of synthetic data in healthcare.

  • Research Article
  • Cite count: 2
  • 10.2196/53241
Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation.
  • Apr 22, 2024
  • JMIR Formative Research
  • Elnaz Karimian Sichani + 3 more

Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data. This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients. 
The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.

  • Research Article
  • Cite count: 4
  • 10.29012/jpc.v5i1.628
On Regression-Tree-Based Synthetic Data Methods for Business Data
  • Aug 1, 2013
  • Journal of Privacy and Confidentiality
  • Joo Ho Lee + 2 more

This paper concerns the use of synthetic data for protecting the confidentiality of business data during statistical analysis. Synthetic data sets are traditionally constructed by replacing sensitive values in a confidential data set with draws from statistical models estimated on the confidential data set. Unfortunately, the process of generating effective statistical models can be a difficult and labour-intensive task. Recently, it has been proposed to use easily-implemented methods from machine learning instead of statistical model estimation in the data synthesis task. J. Drechsler and J.P. Reiter (2011) have conducted an evaluation of four such methods, and have found that regression trees could give rise to synthetic data sets which provide reliable analysis results as well as low disclosure risks. Their conclusion was based on simulations using a subset of the 2002 Uganda census public use file. It is an interesting question whether the same conclusion applies to other types of data with different characteristics, for example business data which have quite different characteristics from population census and survey data. In particular, business data generally have few variables that are mostly categorical, and often have highly skewed distributions with outliers. In this paper we investigate the applicability of regression-tree-based methods for constructing synthetic business data. We give a detailed example comparing exploratory data analysis and linear regression results under two variants of a regression-tree-based synthetic data approach. We also include an evaluation of the analysis results with respect to the results of analysis of the original data. We further investigate the impact of different stopping criteria on performance. 
While it is certainly true that any method designed to protect confidentiality introduces error, and may indeed give misleading conclusions, our analysis of the results for synthesisers based on CART models has provided some evidence that this error is not random but is due to the particular characteristics of business data. We conclude that more careful analysis needs to be done in applying these methods and end users certainly need to be aware of possible discrepancies.

  • Research Article
  • Cite count: 1
  • 10.69554/lqom5698
High-fidelity synthetic patient data applications and privacy considerations
  • Jun 1, 2024
  • Journal of Data Protection & Privacy
  • Puja Myles + 4 more

This paper explores the potential applications of high-fidelity synthetic patient data in the context of healthcare research, including challenges and benefits. The paper starts by defining synthetic data, types of synthetic data and approaches to generating synthetic data. It then discusses the potential applications of synthetic data beyond its use as a privacy-enhancing technology, and current debates around whether synthetic data should be considered personal data and, therefore, should be subjected to privacy controls to minimise reidentification risks. This is followed by a discussion of privacy preservation approaches and privacy metrics that can be applied in the context of synthetic data. The paper includes a case study based on synthetic electronic healthcare record data from the Clinical Practice Research Datalink on how privacy concerns due to reidentification have been addressed in order to make this data available for research purposes. The authors conclude that synthetic data, particularly high-fidelity synthetic patient data, has the potential to add value over and above real data for public health and that it is possible to address privacy concerns to make synthetic data available via a combination of privacy measures applied during the synthetic data generation process and post-generation reidentification risk assessments as part of data protection impact assessments.
