Abstract
Synthetic data are becoming an increasingly important mechanism for sharing data among collaborators and with the public. Multiple methods for generating synthetic data have been proposed, but many have shortcomings with respect to maintaining the statistical properties of the original data. We propose a new method for fully synthetic data generation that leverages linear and integer mathematical programming models to match the moments of the original data in the synthetic data. This method has no inherent disclosure risk and does not require parametric or distributional assumptions. We demonstrate this methodology using the Framingham Heart Study and compare it with existing synthetic data methods that use chained equations. We fitted Cox proportional hazards, logistic regression, and nonparametric models to the synthetic data and compared them with models fitted to the original data. True coverage, the proportion of synthetic data parameter confidence intervals that include the original data's parameter estimate, was 100% for parametric models when up to four moments were matched, and the moment matching approach consistently outperformed the chained equations approach on this measure. The area under the curve and accuracy of the nonparametric models trained on synthetic data differed only marginally when tested on the full original data. Models were also trained on synthetic data and on a partition of the original data, and were tested on a held-out portion of the original data. Fourth-order moment matched synthetic data outperformed the other synthetic data with respect to fitted parametric models but did not always outperform them with fitted nonparametric models. No single synthetic data method consistently outperformed the others when assessing the performance of nonparametric models. The performance of fourth-order moment matched synthetic data in fitting parametric models supports its use in these cases.
Our empirical results also suggest that the performance of synthetic data generation techniques, including the moment matching approach, is less stable for use with nonparametric models. The benefits of the moment matching approach should be weighed against additional computational costs. In summary, our results demonstrate that the introduced moment matching approach may be considered as an alternative to existing synthetic data generation methods.
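As an illustration of the true coverage measure defined above, the following minimal Python sketch computes the proportion of synthetic data confidence intervals that contain the original data's parameter estimate. The function name `true_coverage` and the numeric values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from typing import Sequence, Tuple


def true_coverage(
    original_estimate: float,
    synthetic_intervals: Sequence[Tuple[float, float]],
) -> float:
    """Proportion of synthetic-data confidence intervals (lower, upper)
    that contain the parameter estimate from the original data."""
    if not synthetic_intervals:
        raise ValueError("at least one confidence interval is required")
    hits = sum(
        1
        for lower, upper in synthetic_intervals
        if lower <= original_estimate <= upper
    )
    return hits / len(synthetic_intervals)


# Hypothetical values for illustration only (not results from the study):
# one interval per synthetic data set, for a single model parameter.
intervals = [(0.10, 0.50), (0.20, 0.60), (0.35, 0.70), (0.05, 0.45)]
coverage = true_coverage(0.30, intervals)
```

A value of 1.0 corresponds to the 100% true coverage reported for parametric models; in practice the measure would be computed per parameter across all generated synthetic data sets.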