A method for managing re-identification risk from small geographic areas in Canada

Khaled El Emam, Philip Abdelmalik, Jim Bottomley, Ann Brown, Tyson Roffey, Angelica Neisa, Mark Walker

doi:10.1186/1472-6947-10-18

Abstract

Background

A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones. A recent study provided a method for deciding when an area is too small based on the uniqueness criterion. The uniqueness criterion stipulates that an area is no longer too small when the proportion of unique individuals on the relevant variables (the quasi-identifiers) approaches zero. However, a uniqueness value of zero is a stringent threshold, suitable only when the risks from data disclosure are quite high. Other uniqueness thresholds that have been proposed for health data are 5% and 20%.

Methods

We estimated uniqueness for urban Forward Sortation Areas (FSAs) by using the 2001 long form Canadian census data representing 20% of the population. We then constructed two logistic regression models to predict when the uniqueness is greater than the 5% and 20% thresholds, and validated their predictive accuracy using 10-fold cross-validation. Predictor variables included the population size of the FSA and the maximum number of possible values on the quasi-identifiers (the number of equivalence classes).

Results

All model parameters were significant and the models had very high prediction accuracy, with specificity above 0.9, and sensitivity at 0.87 and 0.74 for the 5% and 20% threshold models respectively. The application of the models was illustrated with an analysis of the Ontario newborn registry and an emergency department dataset. At the 5% and 20% thresholds, considerably fewer records than at the 0% threshold would be considered to be in small areas and therefore undergo disclosure control actions. We have also included concrete guidance for data custodians in deciding which one of the three uniqueness thresholds to use (0%, 5%, 20%), depending on the mitigating controls that the data recipients have in place, the potential invasion of privacy if the data is disclosed, and the motives and capacity of the data recipient to re-identify the data.

Conclusion

The models we developed can be used to manage the re-identification risk from small geographic areas. By choosing among three possible thresholds, a data custodian can adjust the definition of "small geographic area" to the nature of the data and recipient.
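The accuracy figures reported above (sensitivity and specificity under 10-fold cross-validation) can be illustrated with a minimal sketch. It assumes that a "positive" is an FSA whose uniqueness exceeds the chosen threshold; the helper names `k_fold_indices` and `sensitivity_specificity` and the example labels are hypothetical and are not taken from the paper. The fold splitter is unshuffled for simplicity, which may differ from the procedure the authors used.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k roughly equal folds.

    Assumption: a simple unshuffled, interleaved split, for illustration only.
    """
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]          # every k-th item, offset by the fold number
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for two equal-length boolean sequences.

    True means "uniqueness exceeds the threshold" (hypothetical labelling).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for ten areas, not data from the study.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In a full cross-validation, a model would be fitted on each `train` split and scored on the matching `test` split, with the predictions pooled before computing the two measures.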

Highlights

A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones
Because our desired analysis is at the Forward Sortation Areas (FSAs) geographic unit, we developed a gridding methodology, described in Additional file 1, to assign the FSAs to individual records based on their census tracts
Disclosure control practices for small geographic areas often result in health datasets that have significantly reduced utility

Summary

Introduction

A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones. A recent study provided a method for deciding when an area is too small based on the uniqueness criterion. The uniqueness criterion stipulates that an area is no longer too small when the proportion of unique individuals on the relevant variables (the quasi-identifiers) approaches zero. A uniqueness value of zero is a stringent threshold, suitable only when the risks from data disclosure are quite high. Other uniqueness thresholds that have been proposed for health data are 5% and 20%. Records from individuals living in small geographic areas tend to have a higher probability of being re-identified [21,22,23].
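As a rough illustration of the uniqueness criterion described above, the sketch below counts how many records in one area are unique on a set of quasi-identifiers and compares that proportion with a threshold. It treats the listed records as the whole population of the area, which is a simplifying assumption; the function names, field names, and toy records are hypothetical and do not come from the paper.

```python
from collections import Counter


def uniqueness(records, quasi_identifiers):
    """Proportion of records that are unique on the quasi-identifiers."""
    if not records:
        return 0.0
    # Each distinct combination of quasi-identifier values is one equivalence class.
    class_sizes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    unique_records = sum(1 for size in class_sizes.values() if size == 1)
    return unique_records / len(records)


def is_small_area(records, quasi_identifiers, threshold):
    """True when uniqueness is greater than the threshold (e.g. 0.0, 0.05, 0.20)."""
    return uniqueness(records, quasi_identifiers) > threshold


# Hypothetical toy area: 10 residents, one of whom is unique on (sex, age_group).
area = (
    [{"sex": "F", "age_group": "30-39"}] * 4
    + [{"sex": "M", "age_group": "30-39"}] * 3
    + [{"sex": "F", "age_group": "40-49"}] * 2
    + [{"sex": "M", "age_group": "40-49"}]
)
quasi_identifiers = ["sex", "age_group"]

for threshold in (0.0, 0.05, 0.20):
    flagged = is_small_area(area, quasi_identifiers, threshold)
    print(f"threshold={threshold:.2f} too small: {flagged}")
```

With these toy records the area counts as too small at the 0% and 5% thresholds but not at 20%, which mirrors how a more permissive threshold leaves fewer areas subject to disclosure control.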

Journal: BMC Medical Informatics and Decision Making
Publication Date: Apr 2, 2010
Citations: 71
License type: CC BY 2.0

Similar Papers

Artificial intelligence to predict estimates of physical activity in small geographic areas
D Carvalho Malta ... J M Almeida
European Journal of Public Health | VOL. 34
28 Oct 2024

Influence of Socioeconomic Deprivation on the Relation Between Air Pollution and β-Agonist Sales for Asthma
Olivier Laurent ... Emmanuel Rivière
Chest | VOL. 135
01 Mar 2009

Evaluating Predictors of Geographic Area Population Size Cut-offs to Manage Re-identification Risk
K El Emam ... P Abdelmalik
Journal of the American Medical Informatics Association | VOL. 16
25 Feb 2009

Spacing of point counts for grassland bird surveys in small geographical areas: Biases and tradeoffs
Lloyd W Morrison ... David G Peitz
The Wilson Journal of Ornithology | VOL. 132
14 Sep 2021
