Abstract

In many fields, such as data mining and machine learning, distance-based outlier detection (DOD) is widely employed to remove noise and find abnormal phenomena, because DOD is unsupervised, can be employed in any metric space, and makes no assumptions about the data distribution. Data mining and machine learning applications now have to deal with large datasets, which requires efficient DOD algorithms. We address the DOD problem under two different definitions. Our key idea for solving both problems is to exploit an in-memory proximity graph. For each problem, we propose a new algorithm that exploits a proximity graph, and we analyze which type of proximity graph is appropriate for the algorithm. Our empirical study using real datasets confirms that our DOD algorithms are significantly faster than state-of-the-art ones.

Highlights

  • Outlier detection is a fundamental task in many applications, such as fraud detection, health check, and noise data removal [5,17,50]

  • We propose a new solution for the (r, k)-distance-based outlier detection (DOD) problem that filters non-outliers efficiently while guaranteeing correctness by exploiting a proximity graph

  • We evaluated the pre-processing efficiency of proximity graphs: NSW, KGraph, MRPG-basic, and MRPG


Summary

Introduction

Outlier detection is a fundamental task in many applications, such as fraud detection, health checks, and noisy data removal [5,17,50]. As described later, these applications often employ distance-based outlier detection (DOD) [32], because DOD is unsupervised, can be employed in any metric space, and makes no assumptions about the data distribution. To train high-performance machine learning models, noise (i.e., outliers) should be removed from training datasets, because model performance tends to be affected by outliers [5,37,51]. Many applications commonly remove noise as a pre-processing step before training [14,29], and DOD can contribute to this noise removal.
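To make the (r, k)-DOD problem mentioned in the highlights concrete, the following is a minimal sketch of a naive quadratic-time baseline, not the proximity-graph algorithm proposed in the paper. It assumes the commonly used definition in which an object is an outlier if fewer than k other objects lie within distance r of it; the function name `naive_rk_outliers`, the helper `euclidean`, and the toy data are hypothetical and chosen only for illustration. Because the distance function is passed in as a parameter, the same sketch applies to any metric space.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def naive_rk_outliers(
    points: Sequence[T],
    dist: Callable[[T, T], float],
    r: float,
    k: int,
) -> List[int]:
    """Return indices of (r, k)-outliers under the assumed definition:
    an object is an outlier if fewer than k other objects are within
    distance r of it. Runs in O(n^2) distance computations."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and dist(p, q) <= r:
                neighbors += 1
                if neighbors >= k:
                    # Early termination: p already has k neighbors,
                    # so it cannot be an outlier.
                    break
        if neighbors < k:
            outliers.append(i)
    return outliers


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    # Any metric could be substituted here (hypothetical helper).
    return math.dist(a, b)


# Toy data: three nearby points and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(naive_rk_outliers(data, euclidean, r=0.5, k=2))  # prints [3]
```

The early termination in the inner loop hints at why filtering non-outliers quickly matters: most objects are not outliers, so a method that can confirm k neighbors for them cheaply avoids most of the quadratic cost.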

