Abstract

When a company needs analytical capabilities over data that may include sensitive information, it is important to use a solution that protects those sensitive portions while preserving the data's usefulness. An analysis of existing anonymization approaches shows that some only permit the disclosure of aggregated information about large groups, or require knowing in advance the type of analysis to be performed, which is not viable in Big Data projects; others scale poorly, making them unsuitable for large data sets. A further group of works is presented only theoretically, without evidence of evaluation or testing in real environments. To fill this gap, this paper presents Anonylitics, an implementation of the k-anonymity principle for small and Big Data settings, intended for contexts where small or large data sets must be disclosed for applying supervised or unsupervised techniques. Anonylitics improves on available implementations of k-anonymity by using a hybrid approach during the creation of the anonymized blocks, maintaining the data types of the original attributes, and guaranteeing scalability on large data sets. Considering the diverse infrastructures and data volumes managed by companies today, Anonylitics was implemented in two versions: a centralized version, for companies with small data sets, or with large data sets but good vertical-scaling infrastructure, and a Big Data version, for companies with large data sets and horizontal-scaling infrastructure. Evaluation on different data sets with diverse protection requirements demonstrates that our solution maintains the utility of the data, guarantees its privacy, and has good time-complexity performance.
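To make the k-anonymity principle referenced above concrete, the following is a minimal sketch (not the Anonylitics implementation) of how one might verify that a table is k-anonymous: every combination of quasi-identifier values must be shared by at least k records. The column names and toy data are hypothetical, chosen only for illustration.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """Return True if every quasi-identifier equivalence class has at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical toy data: ages generalized into ranges, ZIP codes truncated,
# with 'diagnosis' as the sensitive attribute that is left untouched.
records = pd.DataFrame({
    "age_range":  ["20-30", "20-30", "20-30", "30-40", "30-40", "30-40"],
    "zip_prefix": ["110**", "110**", "110**", "111**", "111**", "111**"],
    "diagnosis":  ["flu", "cold", "flu", "asthma", "flu", "cold"],
})

print(is_k_anonymous(records, ["age_range", "zip_prefix"], k=3))  # True
print(is_k_anonymous(records, ["age_range", "zip_prefix"], k=4))  # False
```

An anonymization system such as the one described in the abstract would generalize or suppress quasi-identifier values until a check of this kind passes for the chosen k, while trying to preserve as much analytical utility as possible.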
