FAST-ODT: A Lightweight Outlier Detection Scheme for Categorical Data Sets

Hongwei Du, Wen Xu, Zhipeng Sun, Chuang Liu, Qiang Ye

doi:10.1109/tnse.2020.3022869

Abstract

Outlier detection is a key data analysis technique that aims to find unusual data objects in a data set. It has been widely used in areas such as communication networks, finance, medicine, and environmental studies. Many applications in these areas involve categorical data. For example, the data set used in intrusion detection normally includes a group of captured packets, which tend to have categorical attributes such as “protocol”. Although there are many outlier detection algorithms for numerical data, only a few existing schemes can handle categorical data, and those schemes suffer from two problems: low detection precision and high time complexity. In this paper, we present two novel outlier detection algorithms for categorical data sets. First, we describe a simple entropy-based scheme, Outlier Detection Tree (ODT). With ODT, a classification tree is constructed to classify the data set into two classes: a normal class and an abnormal class. Thereafter, each data object is identified as an outlier or a normal object using the if-then rules in the tree. Second, we propose an advanced outlier detection algorithm, FAST-ODT, which achieves both high detection accuracy and low time complexity. Our experimental results indicate that FAST-ODT outperforms the existing algorithms in terms of outlier detection precision and computational complexity.
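To make the two ingredients named in the abstract concrete, the following minimal Python sketch computes the Shannon entropy of a categorical attribute and applies a single if-then rule of the kind a tree leaf might encode. It is an illustration only, not the ODT or FAST-ODT algorithm: the abstract does not specify how the tree is built, so the function names `shannon_entropy` and `label_packet`, the toy packet data, and the rule itself are all assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Return the Shannon entropy (in bits) of a sequence of categorical values."""
    counts = Counter(values)
    total = len(values)
    # Sum -p * log2(p) over the observed categories.
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# Hypothetical toy data: the "protocol" attribute of six captured packets.
protocols = ["tcp", "tcp", "tcp", "tcp", "udp", "icmp"]


def label_packet(packet):
    """Hypothetical if-then rule; a real tree would derive its rules from data."""
    if packet["protocol"] == "icmp":
        return "abnormal"
    return "normal"


if __name__ == "__main__":
    print(f"entropy of protocol: {shannon_entropy(protocols):.4f} bits")
    for proto in protocols:
        print(proto, "->", label_packet({"protocol": proto}))
```

An attribute whose values are all identical has zero entropy, while an attribute split evenly between two values has exactly one bit. Entropy-based tree methods generally use such measures to decide which attribute to split on.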

Full Text