Deterministic attribute selection for isolation forest

Łukasz Gałka,Paweł Karczmarek

doi:10.1016/j.patcog.2024.110395

Abstract

Modern data mining techniques have been gained importance in recent years. In particular, anomaly detection algorithms, applied in key sectors of information technology, have been growing in popularity. One of the efficient and fast algorithms is Isolation Forest. The method consists of two separated stages: Forest formation and evaluation of elements. The first stage relies on forming a forest of isolation trees. Each tree is built in the same manner according to drawn samples and random divisions of data attributes. In this study, an innovative deterministic attribute selection method is proposed, maintaining its random value. New ideas based on imbalance, clustering, and a dispersion of values through non-linear transformation of elements are introduced and thoroughly analyzed. These novel anomaly detection approaches are applied to 25 real datasets, as well as our own artificially generated databases. The Area Under the ROC Curve and the Area Under the PR Curve are used as a measure of the outliers classification quality. The results of the numerical experiment have proven high efficiency and competitive evaluation speed of the proposals in comparison to other Isolation Forest-based approaches, as well as several other popular techniques.

Full Text