Abstract

In this article, we propose a k-anonymity approach that prioritizes the generalization of attributes based on their utility. We focus on transport data, which we consider a special case in which many or all attributes are quasi-identifiers (e.g., origin, destination, ride start time), as they allow correlation with easily observable auxiliary data. The novelty of our approach lies in introducing normalization techniques as well as distance and utility metrics that cover not only numerical attributes but also categorical attributes, which are represented in tree or graph form. The prioritization of attributes in the generalization process is based on each attribute's utility and can further be influenced by attribute weights that are assigned either automatically or manually. We evaluate and compare different options for all components of our mechanism and present an extensive performance evaluation of our approach using real-world data. Lastly, we show in which cases suppression of records can, counter-intuitively, lead to higher data utility.
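To make the idea of a distance over tree-structured categorical attributes concrete, the following is a minimal sketch, not the metric defined in the article. It assumes a single generalization tree stored as a child-to-parent mapping and measures distance as the number of generalization steps needed to reach the lowest common ancestor, normalized by the depth of the deeper value. The station names, the `PARENT` mapping, and the function names are hypothetical.

```python
# Hypothetical generalization tree for a "station" attribute (child -> parent).
PARENT = {
    "Station A": "North",
    "Station B": "North",
    "Station C": "South",
    "North": "City",
    "South": "City",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def tree_distance(a, b):
    """Normalized distance in [0, 1] between two values of the same tree.

    Assumption (not from the article): the distance is the number of
    generalization steps to the lowest common ancestor, divided by the
    depth of the deeper of the two values.
    """
    if a == b:
        return 0.0
    path_a, path_b = path_to_root(a), path_to_root(b)
    # First node on a's path that also lies on b's path.
    common = next(n for n in path_a if n in path_b)
    steps = max(path_a.index(common), path_b.index(common))
    height = max(len(path_a), len(path_b)) - 1
    return steps / height
```

With this toy tree, two stations in the same district are closer to each other than two stations that only share the city-level ancestor, which is the property a utility-aware generalization would need from such a distance.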

Highlights

  • Publishing or exchanging datasets is often hindered by privacy protection requirements and by the concern of disclosing too much information

  • Seniors may be less likely to travel during the morning rush hour, stations in a residential area are more likely to be destinations in the evening hours, or the choice of boarding station may be used to predict the destination station to a certain extent

  • While we do not include a specific mechanism that incorporates these correlations into the utility metric, we avoid a privacy attack in which one attribute could be used to de-generalize another: we treat every attribute as a quasi-identifier and require all entries in an equivalence class to be identical with respect to every attribute
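The requirement in the last highlight can be illustrated with a short check. This is a minimal sketch under the assumption that records are already generalized and stored as tuples; the function names and the sample rides are hypothetical and not taken from the article. Because every attribute is treated as a quasi-identifier, the full tuple defines the equivalence class.

```python
from collections import Counter


def equivalence_classes(records):
    """Count records per equivalence class.

    Every attribute is treated as a quasi-identifier, so two records share
    a class only if they agree on all attributes.
    """
    return Counter(tuple(r) for r in records)


def is_k_anonymous(records, k):
    """Return True if every equivalence class contains at least k records."""
    return all(size >= k for size in equivalence_classes(records).values())


# Hypothetical generalized rides: (origin zone, destination zone, start window)
rides = [
    ("North", "Center", "08:00-09:00"),
    ("North", "Center", "08:00-09:00"),
    ("South", "Center", "17:00-18:00"),
    ("South", "Center", "17:00-18:00"),
    ("South", "Center", "17:00-18:00"),
]
```

Here `is_k_anonymous(rides, 2)` holds because both classes contain at least two rides, while `is_k_anonymous(rides, 3)` fails because the first class contains only two.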

Summary

INTRODUCTION

Publishing or exchanging datasets is often (rightfully) hindered by privacy protection requirements and by the concern of disclosing too much information. With regard to microdata, that is, data that includes user-specific information such as the travel routes of single users, the privacy implications are too severe to allow disclosure. As a result, Open Data initiatives often suffer from data sparsity, and research in the smart mobility domain cannot be enriched, evaluated, and validated using real datasets [1]. Advancements of k-anonymity include l-diversity [5], t-closeness [6], and m-invariance [7], which consider the distribution of the sensitive attributes. What these approaches have in common is that their mechanisms focus on the privacy aspects but do not take into account the impact on, or allow control of, data utility, that is, usefulness.

PRELIMINARIES
UTILITY FACTOR OF AN ATTRIBUTE
UTILITY-WEIGHTED RECORD DISTANCE
DISTORTION
USER-ASSIGNED ATTRIBUTE WEIGHTS
RESULTS AND DISCUSSIONS
RELATED WORK
CONCLUSION AND FUTURE WORKS
