Feature grouping-based parallel outlier mining of categorical data using Spark

Junli Li, Jifu Zhang, Xiao Qin, Yaling Xun

doi:10.1016/j.ins.2019.07.045

Open Access

https://doi.org/10.1016/j.ins.2019.07.045

Abstract

This paper proposes POS, a feature-grouping-based parallel outlier mining method for high-dimensional categorical datasets. Existing outlier mining methods are inadequate for datasets that are this voluminous and complex. We address this problem with a parallel framework, built on the Spark platform, for massive categorical data. POS is composed of two modules: parallel feature grouping and parallel outlier mining. Additionally, vertical transformation is used to improve the performance of POS. We implement POS on the Spark platform and evaluate it using synthetic and real-world datasets. Our experimental results confirm that POS is a promising and practical parallel algorithm for mining outliers in high-dimensional categorical datasets, because it achieves high performance in terms of extensibility and scalability.
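The abstract does not spell out what vertical transformation involves. As a rough illustration only, the sketch below assumes the common data-mining meaning: converting row-oriented records into a layout where each (feature, value) pair maps to the list of record IDs that contain it, so that value frequencies can be read off without rescanning the rows. It is plain single-machine Python, not Spark code and not the POS implementation; the function names `vertical_transform` and `value_frequencies` and the toy data are hypothetical.

```python
from collections import defaultdict


def vertical_transform(records, feature_names):
    """Map each (feature, value) pair to the list of record IDs containing it.

    Assumption: "vertical" means an item -> record-ID-list layout; the
    paper's actual transformation may differ.
    """
    vertical = defaultdict(list)
    for record_id, record in enumerate(records):
        for feature, value in zip(feature_names, record):
            vertical[(feature, value)].append(record_id)
    return dict(vertical)


def value_frequencies(vertical, n_records):
    """Relative frequency of each (feature, value) pair across all records."""
    return {key: len(ids) / n_records for key, ids in vertical.items()}


# Toy categorical dataset: one tuple per record, one column per feature.
records = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("red", "small"),
]
features = ("color", "size")

vertical = vertical_transform(records, features)
freqs = value_frequencies(vertical, len(records))
```

In this layout, a rare value such as `("color", "blue")` shows up as a short ID list, which is one way infrequent attribute values could be surfaced as outlier candidates.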