Validation Assessment of Privacy-Preserving Synthetic Electronic Health Record Data: Comparison of Original Versus Synthetic Data on Real-World COVID-19 Vaccine Effectiveness.

Abstract

This study assessed the validity of privacy-preserving synthetic data by comparing results from analyses of synthetic versus original EHR data. A published retrospective cohort study on the real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between the synthetic and original datasets. The endpoints included COVID-19 infection, symptomatic COVID-19 infection, and hospitalization due to infection, and were also assessed in several demographic and clinical subgroups. To compare synthetic and original data estimates, several metrics were used: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and the Wald test. Synthetic data were generated five times to assess the stability of results. The distributions of demographic and clinical characteristics showed very small differences (SMD < 0.01). In the comparison of vaccine effectiveness, assessed as relative risk reduction, between synthetic and original data, there was 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between the respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in the estimate and decision metrics in some subgroups and replicates. Overall, the comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency on the process used to generate high-fidelity synthetic data and assurances of patient privacy are warranted.
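The abstract names its comparison metrics but does not give formulas. As a minimal sketch, the Python below uses common textbook definitions of two of them: the standardized mean difference for a binary characteristic and a two-sided Wald z-test comparing two effect estimates on the log scale. The function names, the pooled-variance form of the SMD, and the treatment of the two estimates as independent are assumptions for illustration, not the study's documented method.

```python
import math


def smd_binary(p_orig, p_synth):
    """Standardized mean difference for a binary characteristic.

    Assumed definition: absolute difference in proportions divided by the
    square root of the average of the two binomial variances.
    """
    pooled_var = (p_orig * (1 - p_orig) + p_synth * (1 - p_synth)) / 2
    if pooled_var == 0:
        # Both proportions are 0 or 1: identical means no difference,
        # otherwise the standardized difference is unbounded.
        return 0.0 if p_orig == p_synth else math.inf
    return abs(p_orig - p_synth) / math.sqrt(pooled_var)


def wald_test(log_est_orig, se_orig, log_est_synth, se_synth):
    """Two-sided Wald z-test for the difference between two log-scale
    estimates (for example log hazard ratios), assuming independence."""
    z = (log_est_orig - log_est_synth) / math.sqrt(se_orig**2 + se_synth**2)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical inputs, not values from the study.
print(smd_binary(0.30, 0.31))
print(wald_test(math.log(0.10), 0.20, math.log(0.12), 0.22))
```

A large p-value from `wald_test` means the test found no evidence of a difference between the estimates, which is how the abstract reports its hospitalization results. It does not by itself show that the estimates are equivalent.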

Similar Papers
  • Abstract
  • 10.23889/ijpds.v7i3.1984
Barriers and facilitators to generating synthetic administrative data for research.
  • Aug 25, 2022
  • International Journal of Population Data Science
  • Theodora Kokosi + 4 more

Objectives: Generation of synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to synthetic administrative data in the UK based on our experience of generating, assessing, and evaluating the performance of different approaches. We aim to provide guidance on the appropriate uses of synthetic administrative data. Approach: We generated synthetic versions of one large-population survey (Natsal-3) and two administrative datasets (Hospital Episode Statistics [HES] and National Pupil Database [NPD]). A range of methods were used based on the statistical techniques of sampling and prediction. We implemented non-parametric (e.g., Classification and Regression Tree) and parametric (e.g., generalised linear models) methods, and multiple imputation and Bayesian networks in R software. We attempted to generate low- and high-fidelity datasets and assessed utility by visualising marginal distributions of key variables, estimating the standardised propensity mean square error, and deriving standardised coefficient differences of model estimates and overlap of confidence intervals. Results: Our analysis highlighted some facilitators related to low-fidelity synthetic data, which are quicker to generate, can retain the data types, format, and privacy, and could be used to support training and code development. Conversely, the barriers included computational issues when generating high-fidelity synthetic data from complex data structures. High-fidelity data are achievable but only in the context of a specific research question and a limited number of variables. Results from the Natsal-3 data showed that parametric methods produced slightly better data utility than non-parametric methods. Results for HES and NPD will also be presented. Conclusions: Low-fidelity synthetic data can provide a useful resource to support users of administrative data, minimising data access timelines while retaining privacy and confidentiality of personal data. High-utility datasets can be generated but take considerable resources, and current approaches cannot fully handle the complexity of longitudinal administrative data.

  • Research Article
  • Citations: 22
  • 10.1038/s41598-024-57207-7
An evaluation of the replicability of analyses using synthetic health data
  • Mar 24, 2024
  • Scientific Reports
  • Khaled El Emam + 3 more

Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
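The abstract refers to "combining rules for fully synthetic datasets" without stating them. As a hedged sketch, the Python below applies one widely cited form of such rules (attributed to Raghunathan, Reiter and Rubin): the point estimate is the mean across the m synthetic datasets, and the total variance is (1 + 1/m) times the between-dataset variance minus the mean within-dataset variance. Whether the paper used exactly this variant is an assumption, and the function name and example numbers are hypothetical.

```python
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Combine one coefficient estimated on m fully synthetic datasets.

    estimates: point estimates, one per synthetic dataset
    variances: squared standard errors, one per synthetic dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching lengths")
    q_bar = mean(estimates)   # pooled point estimate
    b = variance(estimates)   # between-dataset variance (divisor m - 1)
    u_bar = mean(variances)   # mean within-dataset variance
    total_var = (1 + 1 / m) * b - u_bar
    # This estimator can be negative for small m; that case needs an
    # adjusted estimator, which this sketch does not attempt to provide.
    return q_bar, total_var


# Hypothetical log odds ratios and variances from three synthetic datasets.
print(combine_fully_synthetic([1.0, 2.0, 3.0], [0.1, 0.1, 0.1]))
```

The possibility of a negative variance for small m is one practical reason to generate more datasets, consistent with the abstract's recommendation of at least ten.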

  • Research Article
  • Citations: 42
  • 10.1053/j.gastro.2021.06.076
BNT162b2 Messenger RNA COVID-19 Vaccine Effectiveness in Patients With Inflammatory Bowel Disease: Preliminary Real-World Data During Mass Vaccination Campaign
  • Jul 2, 2021
  • Gastroenterology
  • Amir Ben-Tov + 5 more

  • Research Article
  • Citations: 21
  • 10.1016/j.kint.2022.07.018
The effectiveness and safety of mRNA (BNT162b2) and inactivated (CoronaVac) COVID-19 vaccines among individuals with chronic kidney diseases
  • Aug 11, 2022
  • Kidney International
  • Franco Wing Tak Cheng + 9 more

  • Research Article
  • Citations: 39
  • 10.1186/s12874-023-01869-w
A method for generating synthetic longitudinal health data
  • Mar 23, 2023
  • BMC Medical Research Methodology
  • Lucy Mosquera + 11 more

Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health’s administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that assess the structure and general patterns in the data as well as by recreating a specific analysis in the real data commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), and Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2 mean Hellinger distance: 0.2195, sd: 0.2724). The Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating synthetic data produces similar analytic results to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
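The Hellinger distance used throughout this abstract can be illustrated for a single categorical variable. The sketch below is the standard definition for discrete distributions (0 for identical distributions, 1 for distributions with no shared support). The helper names and the event-type counts are hypothetical and are not taken from the paper.

```python
import math


def normalize(counts):
    """Turn {category: count} into {category: probability}."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as
    {category: probability} dicts; missing categories count as 0."""
    categories = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in categories
    )
    return math.sqrt(squared / 2)


# Hypothetical event-type counts in a real and a synthetic dataset.
real = normalize({"visit": 700, "lab": 200, "admission": 100})
synthetic = normalize({"visit": 680, "lab": 210, "admission": 110})
print(hellinger(real, synthetic))
```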

  • Research Article
  • Citations: 2
  • 10.1016/j.amj.2022.02.007
Vaccination
  • Mar 17, 2022
  • Air Medical Journal
  • David J Dries

  • Research Article
  • Citations: 187
  • 10.1111/rssa.12358
General and Specific Utility Measures for Synthetic Data
  • Mar 7, 2018
  • Journal of the Royal Statistical Society Series A: Statistics in Society
  • Joshua Snoke + 4 more

Summary: Data holders can produce synthetic versions of data sets when concerns about potential disclosure restrict the availability of the original records. The paper is concerned with methods to judge whether such synthetic data have a distribution that is comparable with that of the original data: what we term general utility. We consider how general utility compares with specific utility: the similarity of results of analyses from the synthetic data and the original data. We adapt a previous general measure of data utility, the propensity score mean-squared error pMSE, to the specific case of synthetic data and derive its distribution for the case when the correct synthesis model is used to create the synthetic data. Our asymptotic results are confirmed by a simulation study. We also consider two specific utility measures, confidence interval overlap and standardized difference in summary statistics, which we compare with the general utility results. We present two contrasting examples of data syntheses: one illustrating synthetic data that is evaluated as being useful by both general and specific measures and the second where neither is the case. For the second case we show how the general utility measures can identify the deficiencies of the synthetic data and suggest how this can inform possible improvements to the synthesis method.
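To make the pMSE idea concrete, the following sketch computes it from propensity scores that are assumed to have been estimated already: the original and synthetic records are stacked, a classifier predicts the probability that each record is synthetic, and pMSE is the mean squared deviation of those probabilities from the synthetic share c. The classifier step is deliberately left out, and the function name and inputs are hypothetical.

```python
def pmse(propensity_scores, n_synthetic):
    """Propensity score mean-squared error.

    propensity_scores: predicted probability of being synthetic for every
        record in the stacked (original + synthetic) dataset
    n_synthetic: number of synthetic records in the stacked dataset
    """
    n_total = len(propensity_scores)
    if n_total == 0 or not 0 < n_synthetic < n_total:
        raise ValueError("need both original and synthetic records")
    c = n_synthetic / n_total  # share of synthetic records
    return sum((p - c) ** 2 for p in propensity_scores) / n_total


# A classifier that cannot tell the datasets apart scores every record at c.
print(pmse([0.5, 0.5, 0.5, 0.5], n_synthetic=2))  # 0.0
# A classifier that separates them perfectly gives the largest value.
print(pmse([0.0, 0.0, 1.0, 1.0], n_synthetic=2))  # 0.25
```

Values near zero suggest the synthetic and original distributions are hard to distinguish, which is what the abstract calls general utility.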

  • Research Article
  • Citations: 262
  • 10.1002/14651858.cd006776.pub2
Using alternative statistical formats for presenting risks and risk reductions.
  • Mar 16, 2011
  • The Cochrane database of systematic reviews
  • Elie A Akl + 8 more

The success of evidence-based practice depends on the clear and effective communication of statistical information. To evaluate the effects of using alternative statistical presentations of the same risks and risk reductions on understanding, perception, persuasiveness and behaviour of health professionals, policy makers, and consumers. We searched Ovid MEDLINE (1966 to October 2007), EMBASE (1980 to October 2007), PsycLIT (1887 to October 2007), and the Cochrane Central Register of Controlled Trials (The Cochrane Library, 2007, Issue 3). We reviewed the reference lists of relevant articles, and contacted experts in the field. We included randomized and non-randomized controlled parallel and cross-over studies. We focused on four comparisons: a comparison of statistical presentations of a risk (eg frequencies versus probabilities) and three comparisons of statistical presentation of risk reduction: relative risk reduction (RRR) versus absolute risk reduction (ARR), RRR versus number needed to treat (NNT), and ARR versus NNT. Two authors independently selected studies for inclusion, extracted data, and assessed risk of bias. We contacted investigators to obtain missing information. We graded the quality of evidence for each outcome using the GRADE approach. We standardized the outcome effects using adjusted standardized mean difference (SMD). We included 35 studies reporting 83 comparisons. None of the studies involved policy makers. Participants (health professionals and consumers) understood natural frequencies better than probabilities (SMD 0.69 (95% confidence interval (CI) 0.45 to 0.93)). Compared with ARR, RRR had little or no difference in understanding (SMD 0.02 (95% CI -0.39 to 0.43)) but was perceived to be larger (SMD 0.41 (95% CI 0.03 to 0.79)) and more persuasive (SMD 0.66 (95% CI 0.51 to 0.81)). 
Compared with NNT, RRR was better understood (SMD 0.73 (95% CI 0.43 to 1.04)), was perceived to be larger (SMD 1.15 (95% CI 0.80 to 1.50)) and was more persuasive (SMD 0.65 (95% CI 0.51 to 0.80)). Compared with NNT, ARR was better understood (SMD 0.42 (95% CI 0.12 to 0.71)), was perceived to be larger (SMD 0.79 (95% CI 0.43 to 1.15)). There was little or no difference for persuasiveness (SMD 0.05 (95% CI -0.04 to 0.15)). The sensitivity analyses including only high quality comparisons showed consistent results for persuasiveness for all three comparisons. Overall there were no differences between health professionals and consumers. The overall quality of evidence was rated down to moderate because of the use of surrogate outcomes and/or heterogeneity. None of the comparisons assessed behaviour. Natural frequencies are probably better understood than probabilities. Relative risk reduction (RRR), compared with absolute risk reduction (ARR) and number needed to treat (NNT), may be perceived to be larger and is more likely to be persuasive. However, it is uncertain whether presenting RRR is likely to help people make decisions most consistent with their own values and, in fact, it could lead to misinterpretation. More research is needed to further explore this question.
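The three risk-reduction formats compared in this review are simple functions of the same two risks, which a short worked example may make clearer. The sketch below uses the standard definitions (ARR is control risk minus treated risk, RRR is ARR divided by control risk, NNT is the reciprocal of ARR). The function name and the example risks are hypothetical.

```python
import math


def risk_measures(control_risk, treated_risk):
    """Return ARR, RRR and NNT for two event risks given as proportions."""
    if not 0 < control_risk <= 1 or not 0 <= treated_risk <= 1:
        raise ValueError("risks must be proportions, control risk above 0")
    arr = control_risk - treated_risk  # absolute risk reduction
    rrr = arr / control_risk           # relative risk reduction
    # A negative ARR yields a negative NNT, i.e. a number needed to harm.
    nnt = math.inf if arr == 0 else 1 / arr
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}


# Same relative reduction, very different absolute benefit.
print(risk_measures(0.10, 0.05))      # RRR 0.5, NNT about 20
print(risk_measures(0.001, 0.0005))   # RRR 0.5, NNT about 2000
```

Both calls give a relative risk reduction of one half while the number needed to treat differs a hundredfold, which illustrates why the review cautions that RRR can be perceived as larger than the underlying absolute effect.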

  • Research Article
  • Citations: 2
  • 10.1200/jco.2023.41.16_suppl.1554
Can synthetic data accurately mimic oncology clinical trials?
  • Jun 1, 2023
  • Journal of Clinical Oncology
  • Samer El Kababji + 9 more

Background: There is strong interest by researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data. Reusing data extracts the most utility possible from patient contributions. The majority of patients do want to share their data for secondary research purposes. However, data access for secondary analysis remains a challenge. A key reason why individual-level data is not made directly available to data users by authors and data custodians is concern over breaches of patient privacy. Synthetic data generation (SDG) is an effective way to address privacy concerns that can enable the broader sharing of clinical trial datasets. However, a key question is whether the reproducibility of the generated data is adequate to draw reliable conclusions. Methods: We synthesized datasets from five pragmatic breast cancer clinical trials performed by the REaCT group (https://react.ohri.ca/). A sequential synthesis method, a type of machine learning, was used. The published analysis of each trial was repeated on each synthetic dataset to evaluate reproducibility. We evaluated reproducibility on three criteria: (a) decision agreement: the direction and statistical significance of the primary endpoint effect estimates are the same as the real data, (b) estimate agreement: the parameter estimates from the synthetic data are within the 95% confidence interval of the real data, and (c) the confidence interval overlap between real and synthetic parameters is above 50%. In addition, we evaluated privacy using a membership disclosure metric. This evaluates the ability of an adversary to determine that a target individual was in the original dataset using the synthetic data, computed as an F1 classification accuracy score. Results: Our results show that decision and estimate agreements held true across all five trials, and the confidence interval overlap was high. The risks of membership disclosure are all below the established 0.2 threshold. Conclusions: In this study, we were able to successfully generate synthetic datasets that accurately replicated original data from 5 oncology trials and yielded the same results as in the original published studies, with a very low risk of membership disclosure. With proper modeling techniques, synthetic datasets can play a key role in data democratization and the reuse of oncology clinical trials.
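The three reproducibility criteria in the Methods can be sketched directly from their wording. In the Python below, decision agreement and estimate agreement follow the abstract's definitions. The confidence interval overlap formula (the intersection length averaged over the two interval lengths) is a commonly used definition that the abstract does not spell out, so it is an assumption, as are the function names, the `null` parameter, and the example numbers.

```python
def decision_agreement(est_real, ci_real, est_synth, ci_synth, null=0.0):
    """True when direction and statistical significance both match.

    `null` is the no-effect value: 0 for differences or log-scale
    estimates, 1 for ratios such as hazard ratios.
    """
    def direction(est):
        return (est > null) - (est < null)

    def significant(ci):
        return not (ci[0] <= null <= ci[1])

    return (direction(est_real) == direction(est_synth)
            and significant(ci_real) == significant(ci_synth))


def estimate_agreement(est_synth, ci_real):
    """True when the synthetic estimate lies inside the real-data CI."""
    return ci_real[0] <= est_synth <= ci_real[1]


def ci_overlap(ci_real, ci_synth):
    """Average share of each interval covered by their intersection.

    Equals 1.0 for identical intervals and is negative when the
    intervals do not intersect.
    """
    intersection = min(ci_real[1], ci_synth[1]) - max(ci_real[0], ci_synth[0])
    return 0.5 * (intersection / (ci_real[1] - ci_real[0])
                  + intersection / (ci_synth[1] - ci_synth[0]))


# Hypothetical log hazard ratios with 95% confidence intervals.
real_est, real_ci = 0.50, (0.20, 0.80)
synth_est, synth_ci = 0.60, (0.30, 0.90)
print(decision_agreement(real_est, real_ci, synth_est, synth_ci))
print(estimate_agreement(synth_est, real_ci))
print(ci_overlap(real_ci, synth_ci) > 0.5)
```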

  • Research Article
  • Citations: 3
  • 10.1177/20543581241242550
Perceptions and Information-Seeking Behavior Regarding COVID-19 Vaccination Among Patients With Chronic Kidney Disease in 2023: A Cross-Sectional Survey.
  • Jan 1, 2024
  • Canadian Journal of Kidney Health and Disease
  • Omosomi Enilama + 9 more

People living with chronic kidney disease (CKD) face an increased risk of severe outcomes such as hospitalization or death from COVID-19. COVID-19 vaccination is a vital approach to mitigate the risk and severity of infection in patients with CKD. Limited information exists regarding the factors that shape COVID-19 vaccine uptake, including health information-seeking behavior and perceptions, within the CKD population. The objectives were to describe among CKD patients, (1) health information-seeking behavior on COVID-19, (2) their capacity to comprehend and trust COVID-19 information from different sources, and (3) their perceptions concerning COVID-19 infection and vaccination. Cross-sectional web-based survey administered in British Columbia and Ontario from February 17, 2023, to April 17, 2023. Chronic kidney disease G3b-5D patients and kidney transplant recipients (CKD G1T-5T) enrolled in a longitudinal COVID-19 vaccine serology study. The survey consisted of a questionnaire that included demographic and clinical data, perceived susceptibility of contracting COVID-19, the ability to collect, understand, and trust information on COVID-19, as well as perceptions regarding COVID-19 vaccination. Descriptive statistics were used to present the data with values expressed as count (%) and chi square tests were performed with a significance level set at P ≤ .05. A content analysis was performed on one open-ended response regarding respondents' questions surrounding COVID-19 infection and vaccination. Among the 902 patients who received the survey via email, 201 completed the survey, resulting in a response rate of 22%. The median age was 64 years old (IQR 53-74), 48% were male, 51% were university educated, 32% were on kidney replacement therapies, and 57% had received ≥5 COVID-19 vaccine doses. 
65% of respondents reported that they had sought out COVID-19-related information in the last 12 months, with 91% and 84% expressing having understood and trusted the information they received, respectively. A higher number of COVID-19 vaccine doses was associated with having sought out (P = .017), comprehended (P < .001), and trusted (P = .005) COVID-19-related information. Female sex was associated with expressing more concern about contracting COVID-19 (P = .011). Most respondents strongly agreed with statements regarding the benefits of COVID-19 vaccination. Respondents' questions about COVID-19 infection and vaccination centered on 4 major themes: COVID-19 vaccination strategy, vaccine effectiveness, vaccine safety, and the impact of COVID-19 infection and vaccination on kidney health. This survey was administered within the Canadian health care context to patients with CKD who had at least 1 COVID-19 vaccine dose. Race/ethnicity of participants was not captured. In this survey of individuals with CKD, COVID-19 information-seeking behavior was high and almost all respondents understood and trusted the information they received. Perceptions toward the COVID-19 vaccine and booster were mostly favorable.

  • Research Article
  • 10.1089/derm.2023.0379
Patient-Reported Association Between COVID-19 Infection or Vaccination and Onset of Allergic Contact Dermatitis®.
  • Mar 27, 2024
  • Dermatitis : contact, atopic, occupational, drug
  • Nicholas Battis + 3 more

  • Research Article
  • Citations: 3
  • 10.1155/2023/2206498
Herpes Zoster after COVID-19 Infection or Vaccination: A Prospective Cohort Study in a Tertiary Dermatology Clinic.
  • Dec 31, 2023
  • Dermatology Research and Practice
  • Charussri Leeyaphan + 6 more

Herpes zoster (HZ) has been observed to occur after COVID-19 infection and vaccination; however, knowledge regarding the demographic data, clinical presentations, and treatment outcomes of HZ is limited. To compare the demographic data, clinical manifestations, treatments, and outcomes of patients with and without HZ within 14 days of COVID-19 infection or vaccination. This prospective cohort study involving patients diagnosed with cutaneous HZ was conducted at a dermatology clinic from October 2021 to January 2023. Among a total of 232 patients with HZ, the median age was 62.0 years and 59.1% were female. HZ developed in 23 (9.9%) and four (1.7%) patients after COVID-19 vaccination and infection, respectively. The mean duration from vaccination and the median duration from infection to HZ onset were 5.7 and 8.5 days, respectively. The proportion of female patients was significantly higher in the group of patients with COVID-19 vaccination or infection than in those without such a history (P = 0.035). Patients who developed HZ following the recent COVID-19 infection had a median age of 42.5 years, which was lower than that of the other groups. Dissemination occurred in 8.7% of the patients after COVID-19 vaccination. HZ recurrence was reported in five cases, of which 80% had been vaccinated or infected with COVID-19 during the previous 21 days. All patients had similar durations of antiviral treatment, crust-off time, and duration of neuralgia. HZ after COVID-19 vaccination is more frequently observed in females, while HZ after COVID-19 infection tends to occur in younger patients. Disseminated HZ is more common in patients recently vaccinated against COVID-19. COVID-19 vaccination or infection may trigger recurrent HZ infection.

  • Abstract
  • 10.1016/j.jval.2022.04.1501
PCR158 Hungarians' Attitudes Toward the COVID-19 Disease and Vaccination: An Online Survey
  • Jun 25, 2022
  • Value in Health
  • H Khatatbeh + 8 more

  • Research Article
  • Citations: 66
  • 10.1053/j.gastro.2021.06.014
COVID-19 Vaccination Is Safe and Effective in Patients With Inflammatory Bowel Disease: Analysis of a Large Multi-institutional Research Network in the United States
  • Jun 15, 2021
  • Gastroenterology
  • Yousaf Bashir Hadi + 5 more

  • Research Article
  • Citations: 22
  • 10.1002/hpm.3449
Knowledge, attitude and practice survey towards COVID-19 vaccination: A mediation analysis.
  • Feb 28, 2022
  • The International Journal of Health Planning and Management
  • Mitali Sengupta + 4 more

Background and Aim: The COVID‐19 pandemic has significantly impacted human lives across the world. In a country like India, with the second highest population in the world, the impact of COVID‐19 has been diverse and multidimensional. Under such circumstances, vaccination against COVID‐19 infection is claimed to be one of the major solutions to contain the pandemic. An understanding of Knowledge, Attitude and Practice (KAP) measures is an essential prerequisite for designing suitable intervention programs. This paper examines the different KAP factors in Indians' decisions on vaccine uptake. Method: An online questionnaire was administered to Indian respondents (pilot study: n = 100; main study: n = 221) to assess their existing knowledge on COVID‐19 infections and vaccination, attitude and intentions towards COVID‐19 vaccines and their decision towards COVID‐19 vaccine uptake. Result: The findings highlighted that existing knowledge on COVID‐19 infections and vaccination directly impacted their attitude and intention towards vaccination. The attitude and intention towards COVID‐19 vaccines directly impacted their practice of undergoing COVID‐19 vaccination. Further, there was a statistically significant and considerably large indirect effect of existing knowledge on COVID‐19 infections and vaccination on the practice of undergoing COVID‐19 vaccination through attitude and intention towards the vaccine. There was no direct effect of Knowledge (existing knowledge on COVID‐19 infections and vaccination) on Practice (decision to undergo COVID‐19 vaccination). Therefore, Attitude and intention towards COVID‐19 vaccine is the primary mediator between Knowledge (existing knowledge on COVID‐19 infections and vaccination) and Practice (decision to undergo COVID‐19 vaccination). Conclusion: Participants' decisions on COVID‐19 vaccination are strongly related to their attitude and intentions, which confirms the strong role of attitude in the success of the COVID‐19 vaccination programme. Therefore, ‘person‐centric’, attitude‐based positive intervention strategies that link their prior knowledge on COVID‐19 infections and vaccination must be designed for greater vaccine acceptance amongst Indians.
