Synthetic Data by Principal Component Analysis

Natsuki Sano

doi:10.1109/icdmw51313.2020.00023

Abstract

In statistical disclosure control, releasing synthetic data makes it difficult to identify individual records, because the synthetic values differ from the original values. We propose two methods of generating synthetic data based on principal component analysis: an orthogonal transformation (linear method) and sandglass-type neural networks (nonlinear method). Whereas the typical approach of generating synthetic data by multiple imputation requires the existence of common variables between population and survey data, the proposed methods can generate synthetic data without common variables. Additionally, the linear method can explicitly evaluate information loss as the ratio of discarded eigenvalues. We generate synthetic data for decathlon data with the proposed methods and evaluate four information loss measures: our proposed information loss measure, the mean absolute error for each record, the mean absolute error of the mean of each variable, and the mean absolute error of the covariance between variables. We find that information loss in the linear method is less than that in the nonlinear method.
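
As a rough illustration of the linear method, the sketch below projects centered data onto its leading principal components and maps the result back to the original variable space. It is only a sketch under stated assumptions, not the paper's implementation: the function name `pca_synthetic`, the choice of `k`, and the random input matrix (a stand-in for the decathlon data) are hypothetical, and "ratio of discarded eigenvalues" is read here as the sum of the discarded eigenvalues divided by the sum of all eigenvalues of the covariance matrix.

```python
import numpy as np

def pca_synthetic(X, k):
    """Hypothetical helper: rebuild X from its top-k principal components.

    Returns the reconstructed (synthetic) data and the information loss,
    assumed here to be sum(discarded eigenvalues) / sum(all eigenvalues).
    """
    mean = X.mean(axis=0)
    Xc = X - mean                              # center each variable
    cov = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                         # orthonormal basis of kept components
    X_syn = Xc @ W @ W.T + mean                # project, then map back
    loss = eigvals[k:].sum() / eigvals.sum()   # share of variance discarded
    return X_syn, loss

# Made-up correlated data standing in for a real survey table.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))

X_syn, loss = pca_synthetic(X, k=3)

# The three error-based measures named in the abstract, as assumed here.
mae_record = np.abs(X - X_syn).mean()
mae_mean = np.abs(X.mean(axis=0) - X_syn.mean(axis=0)).mean()
mae_cov = np.abs(np.cov(X, rowvar=False) - np.cov(X_syn, rowvar=False)).mean()

print(f"information loss: {loss:.4f}")
print(f"MAE per record:   {mae_record:.4f}")
print(f"MAE of means:     {mae_mean:.2e}")
print(f"MAE of covariance:{mae_cov:.4f}")
```

Under this reading, the eigenvalue ratio equals the relative squared reconstruction error of the centered data, so the loss can be computed before any synthetic records are generated. Keeping all components (`k` equal to the number of variables) reproduces the original data exactly and gives zero loss, which is why some components must be discarded for the output to differ from the original records.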
