Abstract

With the Big Data revolution, the education sector is being reshaped. Today's data-driven education systems offer many opportunities to use the enormous amount of data collected about students' activities and performance for personalized education, adaptive teaching methods, and decision making. Such benefits, however, come at a cost to privacy: a released dataset may, for example, reveal a student's poor performance across multiple courses. While several works have quantified the re-identification risks of individuals in released datasets, they assume the adversary has specific prior knowledge about the target individuals, and most do not utilize all the information available in the data, such as event-level information that associates multiple records with the same individual and correlations between attributes. In this work, we propose a method based on a Markov Model (MM) that quantifies re-identification risk using all the available information, under a more realistic threat model that assumes different levels of adversary knowledge about the target individual, ranging from any single attribute to all given attributes. We also propose a workflow for efficiently calculating the MM risk that scales to a large number of attributes. Experimental results on real education datasets show the efficacy of our model for quantifying re-identification risk.

Practitioner notes

What is already known about this topic?
- A number of works have studied privacy risk quantification in datasets and on the Web.
- The majority make strong assumptions about the adversary's prior knowledge of the target individual(s).
- Most do not utilize all the information available in the datasets, eg, event-level or duplicate records and correlations between attributes.

What this paper adds?
- A new re-identification risk quantification model based on Markov models, addressing the shortcomings of existing work: strong assumptions about adversary knowledge, unexplainable models, and underuse of the information available in the data.
- The proposed model considers not only the uniqueness of data points in the dataset (as most existing methods do) but also their uniformity and correlation characteristics.
- Because re-identification risk quantification is computationally expensive and scales poorly with the number of attributes, the paper introduces a workflow that data custodians can use to efficiently evaluate the worst-case re-identification risk in their datasets before release.
- Extensive experimental evaluation of the proposed model on several real education datasets.

Implications for practice and/or policy?
- Empirical results on real education datasets validate the significance and efficacy of the proposed model for re-identification risk quantification compared to existing approaches.
- Data custodians can use the model as a tool to evaluate the worst-case risk of a dataset.
- It empowers data custodians to make informed decisions on appropriate actions to mitigate these risks (eg, data perturbation) before sharing or releasing their datasets to third parties.
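To make the modelling idea concrete, the following is a minimal sketch of a Markov-chain score over attribute values, paired with a uniqueness-based baseline for partial adversary knowledge. It illustrates the general approach only, not the paper's exact MM formulation; the toy dataset, the function names, and the chain-rule factorisation over adjacent columns are all assumptions made for this example.

```python
# Minimal sketch, NOT the authors' exact MM model: a first-order
# Markov chain over adjacent attribute columns (capturing attribute
# correlation) plus a uniqueness-based risk baseline.
from collections import Counter, defaultdict

# Toy dataset: each row is one student's record (attribute tuple).
records = [
    ("CS101", "A", "online"),
    ("CS101", "B", "online"),
    ("CS101", "A", "campus"),
    ("MA201", "A", "online"),
]

def fit_markov_chain(rows):
    """Estimate start and transition probabilities between adjacent
    attribute columns from the data."""
    starts = Counter(r[0] for r in rows)
    trans = defaultdict(Counter)
    for r in rows:
        for a, b in zip(r, r[1:]):
            trans[a][b] += 1
    n = len(rows)
    start_p = {v: c / n for v, c in starts.items()}
    trans_p = {a: {b: c / sum(cs.values()) for b, cs in [(a, cs)] for b, c in cs.items()}
               for a, cs in trans.items()}
    return start_p, trans_p

def record_probability(row, start_p, trans_p):
    """Chain-rule probability of a full record under the model;
    rare (low-probability) records are the risky, unique ones."""
    p = start_p.get(row[0], 0.0)
    for a, b in zip(row, row[1:]):
        p *= trans_p.get(a, {}).get(b, 0.0)
    return p

def reidentification_risk(rows, known_idx):
    """Worst-case risk when the adversary knows the attributes at
    positions `known_idx`: 1 / size of the matching equivalence
    class, maximised over all records (uniqueness baseline)."""
    classes = Counter(tuple(r[i] for i in known_idx) for r in rows)
    return max(1.0 / classes[tuple(r[i] for i in known_idx)] for r in rows)

start_p, trans_p = fit_markov_chain(records)
print(record_probability(records[0], start_p, trans_p))
# Risk when the adversary knows only course and grade (columns 0, 1):
print(reidentification_risk(records, (0, 1)))
```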
A typical use case is one where the data custodian is an online course/program provider that collects data about students' engagement with its courses and wants to share it with third parties who run learning analytics and return value-added insights to the custodian. We specifically study privacy risk quantification for education data; however, our model is applicable to any tabular data release. A naive version of the custodian's pre-release check is sketched below.
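Continuing the sketch above, a custodian's pre-release check might sweep every adversary knowledge level, from one known attribute up to all of them, and report the worst-case risk at each level. The brute-force subset enumeration shown here is exponential in the number of attributes; the paper's workflow is aimed precisely at making this evaluation scalable, so this loop stands in only as a naive baseline.

```python
# Naive pre-release sweep (assumes `records` and
# `reidentification_risk` from the previous sketch).
from itertools import combinations

def worst_case_risk_by_level(rows):
    """Worst-case re-identification risk for each adversary
    knowledge level k (number of attributes known)."""
    n_attrs = len(rows[0])
    return {
        k: max(reidentification_risk(rows, idx)
               for idx in combinations(range(n_attrs), k))
        for k in range(1, n_attrs + 1)
    }

print(worst_case_risk_by_level(records))
# A custodian could generalise or perturb attributes until every
# level's worst-case risk falls below a chosen release threshold.
```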