Abstract

This paper proposes a feature-grouping based parallel outlier mining method called POS for high-dimensional categorical datasets. Existing methods of outlier mining are inadequate to deal with datasets which are so voluminous and complex. We solve this problem by proposing a parallel framework using the Spark platform for categorical and mass data. POS is composed of two modules, which are parallel feature grouping, and parallel outlier mining. Additionally, Vertical transformation is utilized to improve the performance of POS. We implement our POS on the Spark platform and evaluate it using synthetic and real-world datasets. Our experimental results confirm that POS is a promising and practical parallel algorithm to mine outliers in high-dimensional categorical datasets because POS achieves high performance in terms of extensibility and scalability.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call