Preserving output-privacy in data stream classification

Radhika Kotecha,Sanjay Garg

doi:10.1007/s13748-017-0114-8

Abstract

Privacy-preservation has emerged to be a major concern in devising a data mining system. But, protecting the privacy of data mining input does not guarantee a privacy-preserved output. This paper focuses on preserving the privacy of data mining output and particularly the output of classification task. Further, instead of static datasets, we consider the classification of continuously arriving data streams: a rapidly growing research area. Due to the challenges of data stream classification such as vast volume, a mixture of labeled and unlabeled instances throughout the stream and timely classifier publication, enforcing privacy-preservation techniques becomes even more challenging. In order to achieve this goal, we propose a systematic method for preserving output-privacy in data stream classification that addresses several applications like loan approval, credit card fraud detection, disease outbreak or biological attack detection. Specifically, we propose an algorithm named Diverse and k-Anonymized HOeffding Tree (DAHOT) that is an amalgamation of popular data stream classification algorithm Hoeffding tree and a variant of k-anonymity and l-diversity principles. The empirical results on real and synthetic data streams verify the effectiveness of DAHOT as compared to its bedrock Hoeffding tree and two other techniques, one that learns sanitized decision trees from sampled data stream and other technique that uses ensemble-based classification. DAHOT guarantees to preserve the private patterns while classifying the data streams accurately.

Full Text