Rapid Re-Identification Risk Assessment for Anonymous Data Set in Mobile Multimedia Scene

Zhigang Yang, Daizhong Luo, Yu Xiong, Ruyan Wang

doi:10.1109/access.2020.2977404

Abstract

Ubiquitous mobile multimedia applications bring great convenience to users. However, when using mobile multimedia services, users provide personal data to service platforms. Although the service platforms always claim that the collected personal data are de-identified, the risk of re-identifying users through linkage attacks still exists and is hard to quantify. This paper proposes a rapid prediction model for the overall re-identification risk based on the statistics of a data set (i.e., the number of individuals, the number of attributes, the distribution of attribute values, and the attribute dependency). The proposed model reveals the impact of these statistics on the overall re-identification risk. It adopts a random sampling method to predict the overall re-identification risk of data sets without strong dependency ordered attribute pairs, and a semi-random sampling method for data sets with such pairs. Experimental results show that for data sets without strong dependency ordered attribute pairs, the random sampling method has a high prediction accuracy (the prediction error is less than 0.05). For data sets with strong dependency ordered attribute pairs, the semi-random sampling method has a high prediction accuracy (the prediction error is less than 0.09). Using our model, governments and individuals can quickly assess the privacy leakage risk of their data sets, given only the statistics of the data sets. In addition, the model can evaluate the privacy risk of data collection schemes in advance according to historical statistics, and identify suspicious services.

Highlights

With the wide popularity of smart terminals and the development of wireless communication technology, mobile multimedia applications have become an indispensable tool for daily life and work [1]–[3].
We propose the R3A model, in which the overall re-identification risk (ORR) of a target data set is predicted by the average ORR of random data sets with the same statistics.
Given that the confidence of the frequent tuple (a, b) in the target data set is b_a, the semi-random sampling method is given as Algorithm 2.
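
The random sampling idea behind R3A can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes, for illustration only, that the ORR of a data set can be approximated by the fraction of records whose combination of attribute values is unique, and that attributes are drawn independently from their marginal distributions (that is, there are no strong dependency ordered attribute pairs). The function names `overall_reidentification_risk`, `sample_random_dataset`, and `predict_orr`, and the example distributions, are hypothetical.

```python
import random
from collections import Counter


def overall_reidentification_risk(records):
    """Assumed ORR proxy: fraction of records with a unique attribute combination."""
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)


def sample_random_dataset(n_individuals, distributions, rng):
    """Draw every attribute independently from its marginal distribution.

    `distributions` is a list with one dict per attribute, mapping each
    attribute value to its probability (or relative weight).
    """
    columns = []
    for dist in distributions:
        values = list(dist.keys())
        weights = list(dist.values())
        columns.append(rng.choices(values, weights=weights, k=n_individuals))
    # Transpose the columns into one tuple per individual.
    return list(zip(*columns))


def predict_orr(n_individuals, distributions, n_trials=200, seed=0):
    """Average the assumed ORR proxy over random data sets with the same statistics."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        data = sample_random_dataset(n_individuals, distributions, rng)
        total += overall_reidentification_risk(data)
    return total / n_trials


# Hypothetical statistics: three quasi-identifiers with made-up distributions.
example_distributions = [
    {"F": 0.5, "M": 0.5},                          # gender
    {age: 1.0 for age in range(18, 78)},           # age, uniform weights
    {f"zip{i}": 1.0 for i in range(50)},           # zip code, uniform weights
]
print(predict_orr(1000, example_distributions, n_trials=50))
```

Only the statistics of the data set enter `predict_orr`, which mirrors the claim that the risk can be assessed without access to the records themselves. Handling strong dependency ordered attribute pairs would require the semi-random sampling method, which this sketch does not attempt.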

Summary

Introduction

With the wide popularity of smart terminals and the development of wireless communication technology, mobile multimedia applications have become an indispensable tool for daily life and work [1]–[3]. Ubiquitous access, rich functions, and a good user experience make mobile multimedia applications increasingly popular. To increase user stickiness, improve user experience, or accumulate data resources, mobile multimedia service providers collect users' personal information while providing services. While enjoying the convenience of mobile multimedia services, users must therefore accept the risk of privacy disclosure. For example, user trajectories can expose sensitive information such as home address and workplace. Information collectors always claim that the purpose of collecting personal data is to provide better services to users, and that personal information will be de-identified and properly safeguarded. However, many service provider data breaches, such as the Facebook data privacy scandal and the Equifax data breach, suggest that improper data sharing and ubiquitous hacking make data stored on servers highly vulnerable. Although the leaked data may not contain users' identities, quasi-identifiers in the anonymous data, such as age, gender, and zip code, can be collected by many

Results

Discussion

Conclusion