Clinical data analysis could lead to breakthroughs. However, clinical data contain sensitive information about participants that could be utilized for unethical activities, such as blackmailing, identity theft, mass surveillance, or social engineering. Data anonymization is a standard step during data collection, before sharing, to overcome the risk of disclosure. However, conventional data anonymization techniques are not foolproof and also hinder the opportunity for personalized evaluations. Much research has been done for synthetic data generation using generative adversarial networks and many other machine learning methods; however, these methods are either not free to use or are limited in capacity. This study evaluates the performance of an emerging tool named synthpop, an R package producing synthetic data as an alternative approach for data anonymization. This paper establishes data standards derived from the original data set based on the utilities and quality of information and measures variations in the synthetic data set to evaluate the performance of the data synthesis process. The methods to assess the utility of the synthetic data set can be broadly divided into two approaches: general utility and specific utility. General utility assesses whether synthetic data have overall similarities in the statistical properties and multivariate relationships with the original data set. Simultaneously, the specific utility assesses the similarity of a fitted model’s performance on the synthetic data to its performance on the original data. The quality of information is assessed by comparing variations in entropy bits and mutual information to response variables within the original and synthetic data sets. The study reveals that synthetic data succeeded at all utility tests with a statistically non-significant difference and not only preserved the utilities but also preserved the complexity of the original data set according to the data standard established in this study. Therefore, synthpop fulfills all the necessities and unfolds a wide range of opportunities for the research community, including easy data sharing and information protection.
Read full abstract