Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning

Liangyuan Na, Cong Yang, Fangyuan Zhao, Yoshimi Fukuoka, Chi-Cheng Lo, Anil Aswani

doi:10.1001/jamanetworkopen.2018.6040

Abstract

Despite data aggregation and removal of protected health information, there is concern that deidentified physical activity (PA) data collected from wearable devices can be reidentified. Organizations collecting or distributing such data suggest that the aforementioned measures are sufficient to ensure privacy. However, no studies, to our knowledge, have been published that demonstrate the possibility or impossibility of reidentifying such activity data. The objective of this study was to evaluate the feasibility of reidentifying accelerometer-measured PA data, from which geographic and protected health information had been removed, using support vector machines (SVMs) and random forest methods from machine learning. In this cross-sectional study, the National Health and Nutrition Examination Survey (NHANES) 2003-2004 and 2005-2006 data sets were analyzed in 2018. The accelerometer-measured PA data were collected in a free-living setting for 7 continuous days. NHANES uses a multistage probability sampling design to select a sample that is representative of the civilian noninstitutionalized household population (both adults and children) of the United States. The NHANES data sets contain objectively measured movement intensity as recorded by accelerometers worn during all walking for 1 week. The primary outcome was the ability of the random forest and linear SVM algorithms to match demographic and 20-minute aggregated PA data to individual-specific record numbers; the measure was the percentage of correct matches by each machine learning algorithm. A total of 4720 adults (mean [SD] age, 40.0 [20.6] years) and 2427 children (mean [SD] age, 12.3 [3.4] years) in NHANES 2003-2004 and 4765 adults (mean [SD] age, 45.2 [19.9] years) and 2539 children (mean [SD] age, 12.1 [3.4] years) in NHANES 2005-2006 were included in the study.
The random forest algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4478 adults (94.9%) and 2120 children (87.4%) in NHANES 2003-2004 and 4470 adults (93.8%) and 2172 children (85.5%) in NHANES 2005-2006 (P < .001 for all). The linear SVM algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4043 adults (85.6%) and 1695 children (69.8%) in NHANES 2003-2004 and 4041 adults (84.8%) and 1705 children (67.2%) in NHANES 2005-2006 (P < .001 for all). This study suggests that current practices for deidentification of accelerometer-measured PA data might be insufficient to ensure privacy. This finding has important policy implications because it appears to show the need for deidentification that aggregates the PA data of multiple individuals to ensure privacy for single individuals.
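The 20-minute aggregation mentioned above can be illustrated with a minimal sketch. It assumes, for illustration only, that the accelerometer reports one intensity count per minute and that aggregation means summing counts over consecutive non-overlapping windows; the abstract does not specify either detail, and the function name aggregate_counts is hypothetical.

```python
from typing import List


def aggregate_counts(minute_counts: List[int], window: int = 20) -> List[int]:
    """Sum per-minute intensity counts into consecutive non-overlapping windows.

    Assumptions (not stated in the source): aggregation is a plain sum,
    and a trailing partial window is dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n_windows = len(minute_counts) // window
    return [
        sum(minute_counts[i * window:(i + 1) * window])
        for i in range(n_windows)
    ]


# One week of per-minute readings: 7 days * 24 hours * 60 minutes.
MINUTES_PER_WEEK = 7 * 24 * 60
week = [1] * MINUTES_PER_WEEK  # placeholder values, not real NHANES data
features = aggregate_counts(week)
print(len(features))  # 10080 minutes / 20 = 504 windows
```

Under these assumptions, a full week of wear yields 504 aggregated values per person, which gives a sense of how many characteristics remain for each individual even after aggregation.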

Highlights

This study suggests that current practices for deidentification of accelerometer-measured physical activity (PA) data might be insufficient to ensure privacy.
We evaluated the feasibility of this scenario by attempting to match a second data set of physical activity data and demographic information to a first data set of record numbers, physical activity data, and demographic information.
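The matching task described above can be sketched as follows. This is not the study's method: the study used random forest and linear SVM classifiers, whereas this sketch swaps in a simple nearest-neighbor rule so that it runs with the Python standard library alone. The synthetic records, the noise model, and all names (Record, match, first, second) are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Record = Tuple[int, str, List[float]]  # (age, sex, aggregated activity)


def distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two activity vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match(first: Dict[int, Record], query: Record) -> int:
    """Return the record number in `first` that best matches `query`.

    Candidates are restricted to records with identical demographics when
    any exist; among candidates, the closest activity vector wins.
    """
    age, sex, activity = query
    candidates = [
        rec_id for rec_id, rec in first.items()
        if rec[0] == age and rec[1] == sex
    ] or list(first)
    return min(candidates, key=lambda rec_id: distance(first[rec_id][2], activity))


# Synthetic data only: the "second" data set is a noisy copy of the
# "first" one with record numbers withheld from the matcher.
rng = random.Random(0)
first: Dict[int, Record] = {}
second: Dict[int, Record] = {}
for rec_id in range(200):
    age = rng.randint(18, 80)
    sex = rng.choice(["F", "M"])
    activity = [rng.uniform(0, 1000) for _ in range(24)]
    first[rec_id] = (age, sex, activity)
    second[rec_id] = (age, sex, [v + rng.gauss(0, 50) for v in activity])

# Percentage of correct matches, the measure named in the abstract.
correct = sum(match(first, query) == rec_id for rec_id, query in second.items())
accuracy = 100.0 * correct / len(second)
print(f"{accuracy:.1f}% correct matches")
```

The sketch only illustrates the shape of the task: demographics narrow the candidate set, and the activity profile picks out one record within it. Its accuracy on synthetic data says nothing about real NHANES data.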

Introduction

Policymakers[1,2] have raised the possibility of identifying individuals or their actions based on activity data, whereas device manufacturers and exercise-focused social networks maintain that sharing deidentified data poses no privacy risks.[3,4,5] Wearable device users are concerned with privacy issues,[6] and ethical consequences have been discussed.[7,8] There are potentially legal requirements from the Health Insurance Portability and Accountability Act (HIPAA) on the privacy of activity data.[9,10,11] One key unresolved question is whether it is possible to reidentify activity data. Demographics in an anonymized data set can function as a quasi-identifier that is capable of being used to reidentify individuals.[12] Reidentification is possible using online search data,[13] movie rating data,[14] social network data,[15] and genetic data.[16] A key feature in these examples is a type of data sparsity: each individual has a large number of characteristics, which produces such a diversity of combinations that any particular combination of the data is identifying. Individuals’ movie ratings are highly revealing because of the many permutations of likes and dislikes.[14] As another example, the particular genetic sequence combinations (and especially single-nucleotide polymorphisms) of a single individual are unique and capable of identifying that individual.[16]
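The sparsity argument above can be made concrete with a toy calculation: as more characteristics are recorded per person, a larger share of people have a combination that no one else shares. The records and the helper name fraction_unique below are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def fraction_unique(rows: Sequence[Tuple]) -> float:
    """Fraction of rows whose combination of values appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


# Hypothetical toy records: (age band, sex, region code, favorite genre).
people = [
    ("30-39", "F", "941", "drama"),
    ("30-39", "F", "941", "comedy"),
    ("30-39", "M", "941", "drama"),
    ("40-49", "F", "100", "drama"),
    ("30-39", "F", "100", "drama"),
    ("30-39", "F", "941", "drama"),
]

# Reveal one more characteristic at a time and measure uniqueness.
for k in range(1, 5):
    projected = [row[:k] for row in people]
    print(k, round(fraction_unique(projected), 2))
```

In this toy example, the share of uniquely identifiable records grows from 1 in 6 with one characteristic to 4 in 6 with all four, which mirrors the way many characteristics per individual make combinations identifying.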

Methods

Results

Discussion

Conclusion