Abstract

Data synthesis is of great significance for the privacy protection of real credit data. Synthesizing credit data poses unique challenges: a mix of discrete and continuous features, a lack of prior information, high feature complexity, and class imbalance. To address these challenges, we propose a data-driven prior-based tabular variational autoencoder (DPTVAE) that synthesizes credit data end to end, without any expert experience. It contains three main innovations. 1) Binning Gaussian probability density (BGPD)-based feature type classification: Previous work relies on expert experience to classify feature types, which is limited and may be unavailable. We propose a BGPD-based calculation of class-value importance to automatically classify columns as discrete or continuous, so that synthesis can target either values or distributions as each column requires. 2) Encoding based on BGPD-Variational Gaussian Mixture (BGPD-VGM): Continuous columns of financial data usually follow skewed, multi-peaked, or mixture distributions. To adapt to this distributional complexity, we propose BGPD-VGM to encode a data-driven prior. 3) Conditional decoding: We design a conditional decoding strategy for DPTVAE to synthesize imbalanced discrete columns. Compared with seven existing advanced models, DPTVAE demonstrates exceptional synthesis performance on two datasets with a 33-fold difference in data size, particularly in identifying real default users based on synthetic data. This achievement is significant for data applications based on privacy protection. The code for this work is available at https://github.com/jinxtan/DPTVAE.
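To make the idea of mixture-based encoding of a continuous column more concrete, the following is a minimal sketch of a generic mode-specific encoding, in which a value is represented by the mixture component it most likely belongs to plus its normalized offset within that component. It is not the BGPD-VGM procedure itself, whose details are not given in the abstract. The mixture parameters (`WEIGHTS`, `MEANS`, `STDS`), the function names, and the scaling factor of 4 standard deviations are all assumptions chosen for illustration; in practice the parameters would be fitted from data.

```python
import math

# Hypothetical, hand-picked parameters of a two-component Gaussian mixture
# for one continuous column (e.g. a loan amount). These are assumptions for
# illustration; a real pipeline would fit them from the data.
WEIGHTS = [0.7, 0.3]
MEANS = [1000.0, 20000.0]
STDS = [300.0, 5000.0]


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def encode(x):
    """Encode x as (scalar, one_hot).

    one_hot marks the component with the highest weighted density;
    scalar is the offset from that component's mean, scaled by 4 * std.
    """
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in zip(WEIGHTS, MEANS, STDS)]
    k = max(range(len(densities)), key=densities.__getitem__)
    scalar = (x - MEANS[k]) / (4.0 * STDS[k])
    one_hot = [1 if i == k else 0 for i in range(len(WEIGHTS))]
    return scalar, one_hot


def decode(scalar, one_hot):
    """Invert encode: recover the original value from its encoding."""
    k = one_hot.index(1)
    return scalar * 4.0 * STDS[k] + MEANS[k]


if __name__ == "__main__":
    for value in (1300.0, 20000.0):
        scalar, one_hot = encode(value)
        print(value, "->", scalar, one_hot, "->", decode(scalar, one_hot))
```

Under these assumed parameters, a value near the first peak is encoded relative to that peak rather than relative to the global mean, which is what lets an encoding of this kind cope with skewed or multi-peaked columns.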
