Abstract

In statistical disclosure control, global recoding is a typical masking method for categorical variables. When categories are merged in global recoding, the choice of the minimum frequency ratio is essential both for reducing disclosure risk and for preserving the utility of the masked data. Previous utility measures, such as comparing record values themselves or contingency tables between the original and masked data, cannot evaluate data masked by global recoding. We propose a new information loss measure, based on model performance, that can evaluate data masked by global recoding. Using the proposed information loss measure, we examined three hypotheses concerning global recoding through numerical experiments on data from the 2010 Population Census of Japan: (a) as the minimum frequency ratio increases, information loss increases; (b) as the number of input variables in the model increases, information loss increases; and (c) models that perform well on the original data suffer large information loss under global recoding. When performance was measured by recall, the hypotheses were supported; when it was measured by accuracy, the hypotheses were supported except for models that predict a dominant category. We also report the values of information loss for specific numbers of input variables and minimum frequency ratios. In the worst case, the mean information loss in accuracy was 0.041, with 4 input variables and a minimum frequency ratio of 0.05. These results provide useful insight for publishers of protected data in deciding the minimum frequency ratio, and for users conducting statistical analyses.
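
As a rough illustration of the masking step described above, the following minimal sketch merges every category whose relative frequency falls below a minimum frequency ratio into a single pooled category. The function name `global_recode`, the `other_label` parameter, and the pooling rule itself are assumptions made for illustration; the abstract does not state how the study merges categories, and this sketch does not check whether the pooled category itself reaches the threshold.

```python
from collections import Counter


def global_recode(values, min_ratio, other_label="OTHER"):
    """Pool rare categories of one categorical variable (illustrative only).

    A category is treated as rare when its relative frequency is
    strictly below min_ratio. The pooling rule is an assumption,
    not the procedure used in the study.
    """
    counts = Counter(values)
    n = len(values)
    # Categories whose share of all records is below the threshold
    rare = {cat for cat, k in counts.items() if k / n < min_ratio}
    # Replace each rare category with the pooled label, keep the rest
    return [other_label if v in rare else v for v in values]


if __name__ == "__main__":
    # Hypothetical toy variable: 6 records of "a", 3 of "b", 1 of "c"
    sample = ["a"] * 6 + ["b"] * 3 + ["c"]
    print(global_recode(sample, min_ratio=0.2))
```

With a larger `min_ratio`, more categories are pooled, which is the mechanism behind hypothesis (a): coarser categories carry less information for a downstream model.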
