Statistical Disclosure Control Methods Research Articles

The 2011 Population and Housing Census in the Czech Republic was accompanied by a significant change in the technology used to prepare the fieldwork, along with changes in how the data are processed and how the outputs are disseminated. Grids are regular polygon networks that divide the territory of a country in a grid-like pattern into equally sized territorial units, to which aggregate statistical data are assigned. The disadvantage of grids is that they are territorially small units that are often minimally populated. This mainly has implications for the protection of individual data, which is the concern of statistical disclosure control (SDC). The research question addressed in this paper is whether data protection (perturbation methods) changes the characteristics of the file, either in terms of statistics of the whole file (i.e. for all grids) or in terms of spatial statistics, which indicate the spatial distribution of the analysed phenomenon. Two possible solutions to the issue of grid data protection are discussed. One comes from the Statistical Office of the European Communities (Eurostat) and the other from Cantabular, which is a product of the Sensible Code Company (SCC) based in Belfast. One variant was processed according to the Cantabular methodology, while two variants were calculated according to the Eurostat methodology; these differ in the parameter settings for the maximum noise D and the variance of noise V. The results of the descriptive statistics show a difference in absolute differences when the Cantabular and Eurostat solutions are compared. In the case of other statistics, the results are fully comparable. This paper is devoted to one specific type of census output. The question is to what extent these results are relevant for other types of census outputs, which differ fundamentally in the number of dimensions (grids have only two dimensions). It would therefore be appropriate to use SDC procedures that allow greater flexibility in defining SDC parameters.
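To make the roles of the two parameters concrete, the following is a minimal sketch of bounded noise perturbation applied to grid cell counts. It is not the Eurostat or Cantabular procedure: the function names `perturb_counts` and `summarize`, the rounded and clipped Gaussian noise, and the example counts are all assumptions chosen for illustration. `max_noise` plays the role of D and `noise_variance` the role of V; note that clipping reduces the realised variance below the nominal value.

```python
import random
import statistics


def perturb_counts(counts, max_noise, noise_variance, seed=0):
    """Add bounded, zero-centred integer noise to each non-zero cell count.

    Hypothetical illustration only: the noise is a rounded Gaussian
    clipped to [-max_noise, max_noise], not an official noise distribution.
    """
    rng = random.Random(seed)
    sd = noise_variance ** 0.5
    perturbed = []
    for count in counts:
        if count == 0:
            perturbed.append(0)  # empty grid cells stay empty
            continue
        noise = round(rng.gauss(0, sd))
        noise = max(-max_noise, min(max_noise, noise))  # enforce |noise| <= D
        perturbed.append(max(0, count + noise))  # counts never go negative
    return perturbed


def summarize(original, perturbed):
    """Whole-file statistics comparing original and perturbed counts."""
    abs_diffs = [abs(o - p) for o, p in zip(original, perturbed)]
    return {
        "total_original": sum(original),
        "total_perturbed": sum(perturbed),
        "mean_abs_diff": statistics.mean(abs_diffs),
        "max_abs_diff": max(abs_diffs),
    }


# Made-up population counts for ten grid cells.
grid_counts = [0, 1, 3, 12, 0, 2, 45, 7, 1, 130]
protected = perturb_counts(grid_counts, max_noise=3, noise_variance=1.5)
report = summarize(grid_counts, protected)
print(protected)
print(report)
```

Running the same function with two different (D, V) settings and comparing the `summarize` output mirrors, in a very simplified way, the comparison of variants described in the abstract.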

Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce.

Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.

Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed.

Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.

Conclusions: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
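The two comparisons reported in the results (the accuracy deviation per model, and whether the winning classifier is the same for real and synthetic training data) can be sketched as follows. The accuracy values, the model keys, and the function names `winning_classifier` and `compare_training_sources` are hypothetical and are not taken from the study; both sets of accuracies are assumed to have been measured on real test data.

```python
def winning_classifier(accuracies):
    """Return the name of the model with the highest accuracy."""
    return max(accuracies, key=accuracies.get)


def compare_training_sources(real_acc, synth_acc, exclude=()):
    """Compare models trained on real vs. synthetic data (both tested on real data).

    `exclude` lists model names to leave out, e.g. the tree-based models.
    """
    models = [m for m in real_acc if m not in exclude]
    real = {m: real_acc[m] for m in models}
    synth = {m: synth_acc[m] for m in models}
    winner_real = winning_classifier(real)
    winner_synth = winning_classifier(synth)
    return {
        # Positive deviation: the synthetic-trained model is less accurate.
        "deviations": {m: round(real[m] - synth[m], 3) for m in models},
        "winner_real": winner_real,
        "winner_synth": winner_synth,
        "winner_matches": winner_real == winner_synth,
    }


# Made-up accuracies for one dataset, for illustration only.
real_accuracy = {"decision_tree": 0.90, "random_forest": 0.92,
                 "knn": 0.85, "svm": 0.84, "sgd": 0.80}
synthetic_accuracy = {"decision_tree": 0.72, "random_forest": 0.74,
                      "knn": 0.79, "svm": 0.78, "sgd": 0.73}

all_models = compare_training_sources(real_accuracy, synthetic_accuracy)
no_trees = compare_training_sources(
    real_accuracy, synthetic_accuracy,
    exclude=("decision_tree", "random_forest"),
)
print(all_models)
print(no_trees)
```

Counting `winner_matches` across datasets would give match rates of the kind quoted above, such as 5/19 with all models and 14/19 once tree-based models are excluded.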

Articles published on Statistical Disclosure Control Methods

Database Reconstruction Is Not So Easy and Is Different from Reidentification

Using Saturated Count Models for User-Friendly Synthesis of Large Confidential Administrative Databases

Vine copula statistical disclosure control for mixed-type data

Statistical disclosure control for continuous variables using an extended skew‐t copula

Statistical Disclosure Control Methods for Harmonised Protection of Census Data: a Grid Case

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing.

Statistical Disclosure Control Methods for Microdata from the Labour Force Survey

General Confidentiality and Utility Metrics for Privacy-Preserving Data Publishing Based on the Permutation Model

An Empirical Study of Applying Statistical Disclosure Control Methods to Public Health Research

Feedback-Based Integration of the Whole Process of Data Anonymization in a Graphical Interface

Using the Complex Measure in an Assessment of the Information Loss Due to the Microdata Disclosure Control

Towards the adaptation of SDC methods to stream mining

Disclosure risk reduction for generalized linear model output in a remote analysis system

Measuring Disclosure Risk and Data Utility for Flexible Table Generators

Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro

A minimum spanning tree equipartition algorithm for microaggregation

A derivative‐free algorithm for refining numerical microaggregation solutions

Providing Data With High Utility And No Disclosure Risk For The Public and Researchers: An Evaluation By Advanced Statistical Disclosure Risk Methods

Secure and efficient anonymization of distributed confidential databases

Column generation bounds for numerical microaggregation
