Outlier Detection Forest for Large-Scale Categorical Data Sets

Zhipeng Sun, Yuying Li, Chuang Liu, Hongwei Du, Qiang Ye, Patricia Lilian Kibenge, Hui Huang

doi:10.1007/978-3-030-34980-6_4

Abstract

Outlier detection is one of the most important data mining problems and has attracted much attention over the past years. A variety of outlier detection schemes have been proposed, but most existing methods are designed for numeric data sets. These methods cannot be directly applied to categorical data sets because it is not straightforward to define a practical similarity measure for categorical data. Furthermore, the existing outlier detection schemes that are tailored for categorical data tend to scale poorly, which makes them infeasible for large-scale data sets. In this paper, we propose Outlier Detection Forest (ODF), a tree-based outlier detection algorithm for large-scale categorical data sets. Our experimental results indicate that, compared with state-of-the-art outlier detection schemes, ODF achieves the same level of outlier detection precision with much better scalability.

Full Text