Abstract

This paper proposes a feature-grouping based parallel outlier mining method called POS for high-dimensional categorical datasets. Existing methods of outlier mining are inadequate to deal with datasets which are so voluminous and complex. We solve this problem by proposing a parallel framework using the Spark platform for categorical and mass data. POS is composed of two modules, which are parallel feature grouping, and parallel outlier mining. Additionally, Vertical transformation is utilized to improve the performance of POS. We implement our POS on the Spark platform and evaluate it using synthetic and real-world datasets. Our experimental results confirm that POS is a promising and practical parallel algorithm to mine outliers in high-dimensional categorical datasets because POS achieves high performance in terms of extensibility and scalability.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.