Abstract

Statistical disclosure control is the collective name for a range of tools used by data providers such as government departments to protect the confidentiality of individuals or organizations. When the published tables contain magnitude data such as turnover or health statistics, the preferred method is to suppress the values of certain cells. Assigning a cost to the information lost by suppressing any given cell creates the “cell suppression problem.” This consists of finding the minimum cost solution which meets the confidentiality constraints. Solving this problem simultaneously for all of the sensitive cells in a table is NP-hard and computationally infeasible for medium to large tables. In this paper, we describe the development of a heuristic tool for this problem which hybridizes linear programming (to solve a relaxed version for a single sensitive cell) with a genetic algorithm (to seek an order for considering the sensitive cells which minimizes the final cost). Considering a range of real-world and representative “artificial” datasets, we show that the method is able to provide relatively low cost solutions for far larger tables than the optimal approach can tackle. We show that our genetic approach is able to significantly improve on the initial solutions provided by existing heuristics for cell ordering, and outperforms local search. This approach is then extended and applied to large statistical tables with over 200,000 cells.
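The hybrid described in the abstract, a genetic algorithm searching over the order in which sensitive cells are protected, can be illustrated with a minimal sketch. It is not the authors' implementation: the per-cell linear program is replaced here by a toy greedy rule (every row and column that contains a suppressed cell must contain at least two), and all names (`protect_in_order`, `pattern_cost`, `evolve_order`), the cost matrix, and the GA parameters are hypothetical choices made for illustration.

```python
import random

def protect_in_order(order, costs):
    """Toy stand-in for the per-cell LP (an assumption, not the paper's method):
    process cells in the given order and add the cheapest complementary
    suppression wherever a row or column holds a single suppressed cell."""
    n_rows, n_cols = len(costs), len(costs[0])
    suppressed = set(order)
    queue = list(order)
    while queue:
        r, c = queue.pop(0)
        lines = ([(r, j) for j in range(n_cols)],   # the cell's row
                 [(i, c) for i in range(n_rows)])   # the cell's column
        for line in lines:
            if sum(cell in suppressed for cell in line) < 2:
                extra = min((cell for cell in line if cell not in suppressed),
                            key=lambda cell: costs[cell[0]][cell[1]])
                suppressed.add(extra)
                queue.append(extra)  # secondary suppressions need protection too
    return suppressed

def pattern_cost(order, costs):
    """Information loss: summed cost of the secondary suppressions only."""
    secondary = protect_in_order(order, costs) - set(order)
    return sum(costs[r][c] for r, c in secondary)

def order_crossover(p1, p2, rng):
    """Keep a slice of p1 in place and fill the rest in p2's relative order."""
    a, b = sorted(rng.sample(range(len(p1) + 1), 2))
    middle = p1[a:b]
    rest = [gene for gene in p2 if gene not in middle]
    return rest[:a] + middle + rest[a:]

def swap_mutation(perm, rng):
    """Exchange two positions of the ordering."""
    child = list(perm)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def evolve_order(sensitive, costs, generations=40, pop_size=20, seed=0):
    """Permutation GA: the fitness of an ordering is the cost of its pattern."""
    rng = random.Random(seed)
    fitness = lambda perm: pattern_cost(perm, costs)
    # Seed the population with the given ordering plus random shuffles.
    population = [list(sensitive)] + [rng.sample(sensitive, len(sensitive))
                                      for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness)
        next_gen = population[:2]  # elitism: the two best orderings survive
        while len(next_gen) < pop_size:
            p1 = min(rng.sample(population, 3), key=fitness)  # tournament
            p2 = min(rng.sample(population, 3), key=fitness)
            child = order_crossover(p1, p2, rng)
            if rng.random() < 0.3:
                child = swap_mutation(child, rng)
            next_gen.append(child)
        population = next_gen
    return min(population, key=fitness)

# Hypothetical 4x4 table of suppression costs and four sensitive cells.
costs = [[5, 1, 4, 2],
         [3, 6, 1, 7],
         [2, 4, 8, 1],
         [9, 2, 3, 5]]
sensitive = [(0, 0), (1, 2), (2, 3), (3, 1)]
best = evolve_order(sensitive, costs)
print(best, pattern_cost(best, costs))
```

Because the initial ordering is seeded into the population and the best orderings are carried over unchanged, the returned ordering can never cost more than the starting one under this toy rule. In a realistic setting the greedy step would be replaced by the relaxed per-cell LP, which is what makes each fitness evaluation expensive.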

Highlights

  • In today’s “Knowledge Economy” many organisations hold large amounts of data gathered from a variety of sources, some of which they wish to publish, sell, or otherwise exploit and disseminate, whilst respecting the privacy of individual sources

  • To reduce this problem, Castro [5] has developed a new minimum-L2-distance perturbation method which maintains both additivity and margin totals and has been shown to protect three-dimensional tables with up to 1,000,000 cells

  • [...] processing a specified sequence of the sensitive cells, gradually building up a secondary suppression pattern so as to meet the protection constraints, while minimising the information loss.

  • As currently implemented, the output from the linear programs (LPs) heuristic is not available to the user, and because of the large numbers of constraints and variables, the “optimal” approach is only possible for tables with a few hundreds

  • [...] methods as it involves solving a difficult combinatorial optimisation. It is the objective of this paper to extend cell suppression, which preserves more of the original cell values than perturbation methods, so that it can be [...]


Summary

A Genetic Approach to Statistical Disclosure

Abstract—Statistical Disclosure Control is the collective name for a range of tools used by data providers such as government departments to protect the confidentiality of individuals or organizations. Assigning a cost to the information lost by suppressing any given cell creates the “Cell Suppression Problem.” This consists of finding the minimum cost solution which meets the confidentiality constraints. Solving this problem simultaneously for all of the sensitive cells in a table is NP-hard and computationally infeasible for medium to large tables. We show that our genetic approach is able to significantly improve on the initial solutions provided by existing heuristics for cell ordering, and outperforms local search. This approach is extended and applied to large statistical tables with over 200,000 cells.

INTRODUCTION
BACKGROUND
The Incremental Attacker
METHODOLOGY
Procedure
Analysis
REDUCING THE COST OF THE FITNESS FUNCTION
PROTECTING LARGER STATISTICAL TABLES
Findings
VIII. CONCLUSIONS AND SUGGESTED FUTURE