Abstract

This paper formulates the problem of ranking multivariate unlabeled observations according to their degree of abnormality as an unsupervised statistical learning task. In the 1-d situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more ‘abnormal’ as they are located far in the tail(s) of the underlying probability distribution. It would likewise be desirable to have a scalar-valued ‘scoring’ function that allows the degree of abnormality of multivariate observations to be compared. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume curve (MV curve for short), whose optimal elements are strictly increasing transforms of the density almost everywhere on the support of the density. We first study the statistical estimation of the MV curve of a given scoring function and provide a strategy to build confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is tackled next. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve, while controlling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference in sup norm between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.

Highlights

  • In a wide variety of applications, ranging from the monitoring of aircraft engines in aeronautics to non-destructive quality control in industry, through fraud detection, network intrusion surveillance or system management in data centers (see for instance Viswanathan et al., 2012), anomaly detection is of crucial importance

  • The true MV curve of a given scoring function is unknown in practice and its performance must be statistically assessed based on a data sample

  • Where, given $D_n$, $MV_s^{Boot} = \lambda_s \circ (\alpha_s^{Boot})^{-1}$ is the empirical MV curve of the scoring function $s$ based on a sample of i.i.d. random variables with distribution $F_s$, and where $MV_s = \lambda_s \circ \alpha_s^{-1}$ is the smooth version of the empirical MV curve, $\alpha_s^{-1}$ being the generalized inverse of $\alpha_s$
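As a rough illustration of what assessing an MV curve from a data sample can look like, the sketch below estimates the curve of a given scoring function on a grid of mass levels. It is not the estimator studied in the paper: the function names (`empirical_mv_curve`, `neg_sq_norm`), the bounding box and the Monte Carlo approximation of the volume are all assumptions made for this example. For each mass level α, the threshold is taken as the empirical (1 − α)-quantile of the scores on the sample, and the volume of the corresponding level set is approximated by the fraction of uniform points in the box whose score exceeds that threshold.

```python
import numpy as np


def empirical_mv_curve(score, sample, box_low, box_high, alphas,
                       n_mc=20000, seed=0):
    """Estimate alpha -> volume of {x : score(x) >= t_alpha}.

    t_alpha is the empirical (1 - alpha)-quantile of the scores on `sample`,
    so the level set holds roughly a fraction alpha of the data.
    The volume is approximated by Monte Carlo inside a bounding box
    (an assumption of this sketch, not the paper's procedure).
    """
    rng = np.random.default_rng(seed)
    box_low = np.asarray(box_low, dtype=float)
    box_high = np.asarray(box_high, dtype=float)

    # Score thresholds, one per mass level.
    thresholds = np.quantile(score(sample), 1.0 - np.asarray(alphas))

    # Uniform points in the box, used to approximate Lebesgue measure.
    uniform = rng.uniform(box_low, box_high, size=(n_mc, box_low.size))
    uniform_scores = score(uniform)

    box_volume = np.prod(box_high - box_low)
    inside = uniform_scores[:, None] >= thresholds[None, :]
    return box_volume * inside.mean(axis=0)


def neg_sq_norm(x):
    # Increasing transform of the standard Gaussian density, so it ranks
    # points the same way the density does.
    return -np.sum(x ** 2, axis=1)


rng = np.random.default_rng(1)
sample = rng.standard_normal((5000, 2))
alphas = np.array([0.25, 0.5, 0.75, 0.9])
mv = empirical_mv_curve(neg_sq_norm, sample, [-5, -5], [5, 5], alphas)
```

For a standard 2-d Gaussian, the level set of mass α is a disk of area −2π log(1 − α), which gives a reference value against which the estimate can be compared. The bounding box must contain the level sets of interest, otherwise the volumes are truncated.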


Summary

Introduction

In a wide variety of applications, ranging from the monitoring of aircraft engines in aeronautics to non-destructive quality control in industry, through fraud detection, network intrusion surveillance or system management in data centers (see for instance Viswanathan et al., 2012), anomaly detection is of crucial importance. The performance criterion considered here applies to the unsupervised setting (no label, ‘normal’ vs ‘abnormal’, is assigned to the sample data) and is referred to as the Mass Volume curve (MV curve for short). It induces a partial preorder on the set of scoring functions: the collection of optimal elements is defined as the set of scoring functions whose MV curve is minimum everywhere. We give an example showing that the nature of the problem tackled here is very different from that of density estimation, and we give a simpler formula for the derivative of the optimal MV curve (that of the underlying density) than the one originally given in (Clemencon and Jakubowicz, 2013).

Framework and Notations
Minimum Volume Sets
Ranking Anomalies
Overall Objective
A Functional Criterion: the Mass Volume Curve
Statistical Estimation
Consistency and Asymptotic Normality
Confidence Regions in the Mass Volume Space
Bootstrap consistency
Illustrative Numerical Experiments
An M-estimation Approach to Anomaly Scoring
Piecewise Constant Scoring Functions
Adaptive Approximation of the Optimal MV Curve
Empirical Adaptive Estimation of the Optimal MV Curve
The Anomaly Ranking algorithm A-Rank
Performance Bounds for the A-Rank Algorithm
Conclusion
Properties of the MV Curve
Strong approximation: proof of Theorem 1
Proof of Theorem 3
Proof of Theorem 4
Proof of Lemma 1
Proof of Lemma 3
Proof of Lemma 4
Proof of Lemma 5
Other technical results