Outlier Mining Methods Based on Graph Structure Analysis

Pablo Amil, Cristina Masoller, Nahuel Almeira

doi:10.3389/fphy.2019.00194

Abstract

Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph, where the nodes are the elements of the dataset and the links have associated weights that are the distances between the nodes. The first method then assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap nonlinear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter free and therefore does not require any training; the IsoMap method, on the other hand, has two integer parameters, and when they are appropriately selected, the method performs similarly to or better than all the other methods tested.
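
The percolation idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact scoring rule, so the code below makes an assumption: links are removed from the longest to the shortest, and the score of a node is the weight of the link whose removal separates it from the largest connected component. The function names `components` and `percolation_scores` are hypothetical, Euclidean distance stands in for "a meaningful measure of distance", and the brute-force loop is only suitable for small datasets.

```python
import math
from collections import deque


def components(n, adj):
    """Label connected components with breadth-first search."""
    label = [-1] * n
    sizes = []
    for start in range(n):
        if label[start] != -1:
            continue
        comp = len(sizes)
        label[start] = comp
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if label[v] == -1:
                    label[v] = comp
                    queue.append(v)
        sizes.append(size)
    return label, sizes


def percolation_scores(points):
    """Assumed rule: score = link weight at which a node leaves the largest component."""
    n = len(points)
    if n < 2:
        return [0.0] * n
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Distinct link weights, longest first: the order in which links are removed.
    thresholds = sorted(
        {dist[i][j] for i in range(n) for j in range(i + 1, n)}, reverse=True
    )
    scores = [0.0] * n
    in_giant = set(range(n))
    for t in thresholds:
        if not in_giant:
            break
        # Keep only links strictly shorter than t (links of weight >= t are removed).
        adj = [[j for j in range(n) if j != i and dist[i][j] < t] for i in range(n)]
        label, sizes = components(n, adj)
        giant = sizes.index(max(sizes))
        leaving = {i for i in in_giant if label[i] != giant}
        for i in leaving:
            scores[i] = t
        in_giant -= leaving
    # Whatever is left when every link is gone split off at the shortest link.
    for i in in_giant:
        scores[i] = thresholds[-1]
    return scores


# Usage: a 5 x 5 grid of points plus one distant point (index 25).
grid = [(0.25 * i, 0.25 * j) for i in range(5) for j in range(5)]
scores = percolation_scores(grid + [(10.0, 10.0)])
print(max(range(len(scores)), key=scores.__getitem__))
```

Under this assumed rule no parameter has to be chosen, which matches the parameter-free property claimed for the percolation method; a larger score means the element detaches earlier and is therefore a stronger outlier candidate.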

Highlights

When working with large databases, it is common to have entries that may not belong to the database.
As removing outliers should improve the performance of machine learning algorithms, we performed two tests: first, we recalculated the correlation metrics presented in Amil et al. ([46], Table 1), removing the first n outliers that were identified by each method.
We have proposed two methods for outlier mining that rely on the definition of a meaningful measure of distance between pairs of elements in the dataset: one is fully unsupervised and requires no parameters to be set, and the other has two integer parameters that can be set using a labeled training set.
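
The IsoMap-based method can be sketched as follows, with several assumptions. The page does not say which two integer parameters are meant or how the geodesic and reduced-space distances are compared; this sketch assumes the parameters are the number of nearest neighbors `k` and the output dimension `d`, and it scores each element by one minus the Pearson correlation between its geodesic distances and its distances in the embedding. The function name `isomap_outlier_scores` is hypothetical, and the small NumPy implementation of IsoMap (neighbor graph, Floyd-Warshall shortest paths, classical multidimensional scaling) is a stand-in for a library implementation.

```python
import numpy as np


def isomap_outlier_scores(X, k=5, d=2):
    """Assumed score: 1 - correlation(geodesic distances, embedded distances)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Symmetric k-nearest-neighbor graph; np.inf marks a missing link.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    neighbors = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(n):
        G[i, neighbors[i]] = D[i, neighbors[i]]
        G[neighbors[i], i] = D[i, neighbors[i]]

    # Geodesic distances: all-pairs shortest paths (Floyd-Warshall).
    for m in range(n):
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    if np.isinf(G).any():
        raise ValueError("neighbor graph is disconnected; increase k")

    # Classical MDS on the geodesic distances gives the d-dimensional embedding.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:d]
    Y = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

    # Compare, per element, geodesic distances with distances in the embedding.
    E = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return np.array([1.0 - np.corrcoef(G[i], E[i])[0, 1] for i in range(n)])


# Usage: points along a helix, which is intrinsically one-dimensional.
t = np.linspace(0.0, 3.0, 40)
helix = np.column_stack([np.cos(t), np.sin(t), t])
print(isomap_outlier_scores(helix, k=3, d=1).round(3))
```

Under these assumptions, elements whose geodesic distances are poorly reproduced by the low-dimensional embedding receive scores further from zero. Selecting `k` and `d` on a labeled training set, as the highlights describe, would amount to trying a few integer pairs and keeping the one that ranks the known outliers best.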

Summary

Introduction

When working with large databases, it is common to have entries that may not belong to the database. Anomalous items that appear not to belong may be legitimate, just extreme cases of the variability of a large sample. All these elements are usually referred to as outliers [1, 2]. Rogue waves (or freak waves), which are extremely high waves that might have different generating mechanisms than normal waves [3], have been studied in many fields [4,5,6,7,8], including hydrodynamics and optics. Although they are usually defined as the extremes in the tail of the distribution of wave heights, their precise definition varies: in hydrodynamics, a wave whose height is larger than three times the average can be considered extreme, while in optics, waves much higher than the average can be observed [9].

Journal: Frontiers in Physics
Publication Date: Nov 26, 2019
Citations: 4
License type: CC BY 4.0
