Abstract

Machine learning (ML) utility has been the main evaluation metric for data synthesizers. However, because ML utility cannot be computed in a simple closed form, no previous synthesizer has been trained with ML utility as an explicit training objective. This study aims to integrate ML utility into data synthesizer training by using a transformer-based model as a learned loss function. The transformer was trained to estimate the ML utility of synthetic datasets, and it was then integrated into synthesizer training by backpropagating the difference between the estimated and the expected value. The integration significantly improved the average ML utility of LCT-GAN and Realtabformer. The ML utility of LCT-GAN improved by 0.0158 on the Contraceptive dataset, 0.031 on the Insurance dataset, and 0.0561 on the Treatment dataset. The ML utility of Realtabformer improved by 0.02 on the Contraceptive dataset and 0.0024 on the Insurance dataset. The increase in ML utility also affects the dataset distribution, the correlation between features, and privacy, although the direction of the effect varies. Correlation coefficients indicate that the synthetic data distribution gets closer to that of the real data as ML utility improves. In addition to integrating ML utility, this study also shows that patterns between rows in a dataset can be learned, so better synthesizers can be developed based on them.
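The idea of a learned loss function can be illustrated with a minimal toy sketch. It is not the method of this study, and everything in it is an assumption made for illustration: a frozen sigmoid-of-linear scorer (`estimate_utility`) stands in for the trained transformer, a noise-plus-offset generator (`synthesize`) stands in for a synthesizer such as LCT-GAN, and a hand-derived chain-rule gradient stands in for automatic backpropagation. The weights, the target value of 0.9, and the learning rate are arbitrary.

```python
import math
import random


def estimate_utility(rows, weights, bias):
    """Hypothetical frozen estimator: sigmoid of a linear score on column means."""
    n = len(rows)
    means = [sum(row[j] for row in rows) / n for j in range(len(weights))]
    score = sum(w * m for w, m in zip(weights, means)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def synthesize(theta, noise):
    """Toy synthesizer: each row is a noise vector shifted by learnable offsets."""
    return [[z + t for z, t in zip(row, theta)] for row in noise]


def train(theta, weights, bias, target, steps=300, lr=2.0, seed=0):
    """Update only the synthesizer parameters; the estimator stays frozen."""
    rng = random.Random(seed)
    theta = list(theta)
    history = []
    for _ in range(steps):
        noise = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(32)]
        u = estimate_utility(synthesize(theta, noise), weights, bias)
        # Learned loss: squared gap between estimated and expected utility.
        history.append((u - target) ** 2)
        # Chain rule by hand: dL/dtheta_j = 2 (u - target) * u (1 - u) * w_j,
        # because each column mean moves one-for-one with its offset theta_j.
        common = 2.0 * (u - target) * u * (1.0 - u)
        theta = [t - lr * common * w for t, w in zip(theta, weights)]
    return theta, history


# Assumed, arbitrary settings for the illustration.
WEIGHTS, BIAS, TARGET = [1.0, -0.5], -1.0, 0.9
theta, history = train([0.0, 0.0], WEIGHTS, BIAS, TARGET)
# Noise-free check: estimated utility of the learned offsets alone.
final_utility = estimate_utility(synthesize(theta, [[0.0, 0.0]]), WEIGHTS, BIAS)
```

In this toy setting the estimated utility moves from its starting value toward the target as the offsets are updated. In a real setup the estimator would be a trained network and the gradient would come from an autodiff framework rather than a hand-written derivative.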
