An In-Depth Study and Improvement of Isolation Forest

Yousra Chabchoub, Maurras Ulbricht Togbe, Aliou Boly, Raja Chiky

doi:10.1109/access.2022.3144425

Abstract

Historically, anomaly detection has been an important issue in industrial applications, such as detecting a manufacturing failure or defect. It remains a current topic, driven by ever-increasing demand in fields such as intrusion detection, fraud detection, ecosystem change detection and event detection in sensor networks. This is why anomaly detection continues to attract great interest from various research communities. In this paper, we focused on Isolation Forest (IForest), a well-known and efficient anomaly detection algorithm, and provided an in-depth view of it. We evaluated the impact of its input parameters (number of trees, sample size and decision threshold) on detection efficiency and on execution time. We discussed the benefit of including some anomalies in the training phase. To explore the limits of IForest, we performed experiments on commonly used real datasets and on synthetic datasets with non-trivial distributions. We designed multidimensional datasets in which anomalies are carried by several dimensions simultaneously. Moreover, we varied the density of anomalies and normal data and the distance between them, so that the similarity between the two classes varies. We compared the performance of IForest against its improved version, Extended IForest. Finally, we designed and validated a new extension of IForest, which we call Majority Voting IForest (MVIForest), based on the decisions of the individual trees instead of a global forest decision. The experiments show that MVIForest has a shorter execution time than IForest, with almost the same accuracy.
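For context on the three input parameters, the anomaly score commonly given for the original Isolation Forest formulation is recalled below. It is not stated in this abstract, and the symbols are assumptions made for illustration: \psi is the sample size, h(x) is the path length of point x in one tree, the expectation is taken over the trees of the forest, and H(i) is the i-th harmonic number.

```latex
s(x, \psi) = 2^{-\frac{E[h(x)]}{c(\psi)}},
\qquad
c(\psi) = 2\,H(\psi - 1) - \frac{2(\psi - 1)}{\psi},
\qquad
H(i) \approx \ln(i) + 0.5772
```

Under this reading, a point is reported as an anomaly when s(x, \psi) exceeds the decision threshold; the number of trees controls how many path lengths are averaged, and the sample size sets the normalization c(\psi).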

Highlights

Anomalies are often defined as elements with different behavior compared to normal data
For the choice of the input parameters, we found that using a large number of trees does not really improve the ability of Isolation Forest (IForest) to detect anomalies, while it increases the execution time
We propose Majority Voting IForest, an extension of IForest improving its execution time
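
The contrast between a global forest decision and per-tree decisions can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fit_forest`, `forest_score`, `majority_vote`), the per-tree score, and the 0.6 threshold are all hypothetical, and the sketch does not attempt to reproduce the execution-time gain reported for MVIForest.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup on n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Isolate points with random axis-parallel splits until the depth limit."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # no split possible on this dimension
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x, depth=0):
    """Depth at which x reaches a leaf, adjusted for the points left in that leaf."""
    if "dim" not in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if x[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)

def fit_forest(data, n_trees=100, sample_size=256, seed=0):
    """Build n_trees trees, each on a random subsample (needs at least 2 points)."""
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    forest = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
              for _ in range(n_trees)]
    return forest, psi

def forest_score(forest, psi, x):
    """Global decision input: one score from the path length averaged over all trees."""
    mean_depth = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / c_factor(psi))

def majority_vote(forest, psi, x, threshold=0.6):
    """Hypothetical per-tree decision: each tree votes, and the majority wins."""
    votes = sum(2.0 ** (-path_length(t, x) / c_factor(psi)) > threshold
                for t in forest)
    return votes > len(forest) / 2

if __name__ == "__main__":
    rng = random.Random(1)
    normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    forest, psi = fit_forest(normal)
    for point in [(0.0, 0.0), (10.0, 10.0)]:
        print(point, forest_score(forest, psi, point), majority_vote(forest, psi, point))
```

In this sketch the global decision compares a single averaged score with the threshold, whereas the voting variant applies the threshold inside every tree and then counts the votes. How MVIForest actually defines the per-tree decision is not specified in this section.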

Summary

Introduction

Anomalies are often defined as elements with different behavior compared to normal data. Anomaly detection is widely studied by different research communities: statistics, data mining, machine learning and, more recently, deep learning. It has many real-world applications in areas such as intrusion detection, astronomy, finance and cybersecurity. The best-known reviews of existing anomaly detection techniques are [1], [4], [5] and [19]. They identify the following main approaches: statistical, clustering-based and nearest-neighbor-based.

Objectives

Methods

Findings

Conclusion