Abstract
Many real-world use cases benefit from fast training and prediction times, and much research has gone into scaling distance-based outlier detection methods to millions of data points. Contrary to popular belief, our findings suggest that a small amount of data is often enough for distance-based outlier detection models. We show that training such models on only a tiny fraction of the data often causes no significant degradation in predictive performance or detection variance across a wide range of tabular datasets. Furthermore, we compare data reduction based on random subsampling with clustering-based prototypes and show that both approaches yield similar outlier detection results. Simple random subsampling thus proves to be a useful benchmark and baseline for future research on speeding up distance-based outlier detection.
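The subsampling idea in the abstract can be sketched with a minimal, self-contained example. The snippet below is an illustrative assumption, not the paper's actual experimental setup: it uses a standard k-nearest-neighbor distance as the outlier score, scores synthetic 2-D data once against the full dataset and once against a small random subsample, and compares the two rankings. All names (`knn_outlier_scores`, sample sizes, `k`) are hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a dense Gaussian cluster (inliers) plus clearly separated outliers.
inliers = rng.normal(0.0, 1.0, size=(1000, 2))
outliers = rng.uniform(4.0, 8.0, size=(20, 2)) * rng.choice([-1.0, 1.0], size=(20, 2))
X = np.vstack([inliers, outliers])

def knn_outlier_scores(X_query, X_train, k=5):
    """Distance to the k-th nearest training point, a common distance-based outlier score.

    When a query point is itself in the training set, its self-distance of 0 occupies
    the first slot, which is acceptable for this illustration.
    """
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k - 1]

# Score every point against the full dataset, then against a ~5% random subsample.
scores_full = knn_outlier_scores(X, X, k=5)
subsample_idx = rng.choice(len(X), size=50, replace=False)
scores_sub = knn_outlier_scores(X, X[subsample_idx], k=5)

# Crude similarity check between the two scorings: rank correlation.
rank_full = scores_full.argsort().argsort()
rank_sub = scores_sub.argsort().argsort()
corr = np.corrcoef(rank_full, rank_sub)[0, 1]
print(f"rank correlation between full and subsampled scoring: {corr:.3f}")
```

On data with well-separated outliers, the subsampled scoring tends to rank the same points as anomalous as the full scoring does, which is the effect the abstract describes; real tabular benchmarks are of course harder than this toy setup.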