Abstract

In a data mining process, outlier detection exploits the high marginality of outliers, identifying them by measuring their degree of deviation from representative patterns and thereby yielding relevant knowledge. Rough sets (RS) theory has been applied to the field of knowledge discovery in databases (KDD) since its formulation in the 1980s, and in recent years outlier detection has been increasingly regarded as a KDD process with its own usefulness. The application of RS theory as a basis to characterise and detect outliers is a novel approach with great theoretical relevance and practical applicability. However, it is difficult to develop algorithms whose spatial and temporal complexity allows their application to realistic scenarios involving vast amounts of data and requiring very fast responses. This study presents a theoretical framework based on a generalisation of RS theory, termed the variable precision rough sets model (VPRS), which allows the establishment of a stochastic approach to solving the problem of assessing whether a given element is an outlier within a specific universe of data. An algorithm derived from quasi-linearisation is developed based on this theoretical framework, thus enabling its application to large volumes of data. The experiments conducted demonstrate the feasibility of the proposed algorithm, whose usefulness is contextualised by comparison with different algorithms analysed in the literature.

Highlights

  • Outlier detection is an area of increasing relevance within the more general data mining process

  • Whereas VPRS has been applied to problems in multiple fields [13,14,15,16], including statistics [17], this study aimed to develop a new application of this model to the outlier detection problem, breaking with the traditional scheme followed by most existing detection methods

Summary

Introduction

Outlier detection is an area of increasing relevance within the more general data mining process. The starting hypothesis is summarised as follows: “a new theory may be developed by extending the basic concepts and the formal tools provided by RS theory [1, 11] and VPRS [7], applied to the outlier detection problem, which allows the unsupervised determination, for each element of a universe of data, of the region of threshold values (μ, β) in which such element is an outlier.” Based on this approach, which was termed the βμ Method (see Figure 1), “the outlier probability.
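As background for the β threshold mentioned above, the following is a minimal sketch of the standard VPRS β-lower approximation, in which an equivalence class is accepted when a fraction of at least β of its elements belongs to the target set. It is not the βμ Method itself: the μ threshold and the outlier-probability estimate are not specified in this excerpt and are not modelled here. The function names `equivalence_classes` and `beta_lower_approximation`, the grouping key, and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def equivalence_classes(universe, key):
    """Partition `universe` into classes of elements sharing the same `key` value.

    `key` is a hypothetical stand-in for the indiscernibility relation.
    """
    classes = defaultdict(set)
    for element in universe:
        classes[key(element)].add(element)
    return list(classes.values())


def beta_lower_approximation(classes, target, beta):
    """Union of the classes whose inclusion degree in `target` is at least `beta`.

    Assumes the inclusion-degree formulation of VPRS, with 0.5 < beta <= 1.
    With beta = 1 this reduces to the classical rough-set lower approximation.
    """
    if not 0.5 < beta <= 1.0:
        raise ValueError("beta must lie in (0.5, 1]")
    accepted = set()
    for cls in classes:
        inclusion = len(cls & target) / len(cls)
        if inclusion >= beta:
            accepted |= cls
    return accepted


# Toy universe: integers 0..9, indiscernible when they share the same x // 3.
toy_classes = equivalence_classes(range(10), key=lambda x: x // 3)
toy_target = {0, 1, 2, 3, 4, 9}

strict = beta_lower_approximation(toy_classes, toy_target, beta=1.0)
relaxed = beta_lower_approximation(toy_classes, toy_target, beta=0.6)
```

In this toy case the class {3, 4, 5} has two of its three elements in the target set, so it is rejected at β = 1 but accepted at β = 0.6. Relaxing β therefore enlarges the approximation, which is the kind of threshold dependence that a region of (μ, β) values would have to capture.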

Outlier Region
Computational Implementation
Estimation of the Outlier Probability of Each Element
Theoretical Framework
Validation of the Results
Conclusions