Improved software defect prediction using Pruned Histogram-based isolation forest

Zhiguo Ding, Liudong Xing

doi:10.1016/j.ress.2020.107170

Abstract

Software defect prediction (SDP) is an active topic in modern software engineering research. It is used to evaluate software quality and reliability and to allocate limited testing resources effectively. Many data mining and machine learning methods have been applied to SDP, working from critical metrics extracted by analyzing the software source code and development process. However, these learning methods have difficulty handling the imbalanced data distribution of the accumulated training data. Isolation forest, an anomaly detection method based on ensemble learning, has been studied as a way to deal with the imbalanced data distribution and obtain high prediction performance. However, the isolation forest method suffers from a major drawback, slow convergence, which is caused by selecting the splitting feature value at random while building isolation trees. To overcome this problem, this paper constructs a histogram over the value set of the selected isolation feature to help identify feature values that are preferable for building isolation trees. Motivated by the “many could be better than all” principle in ensemble learning, an ensemble pruning strategy is further employed to optimize the resulting isolation forest, leading to a novel SDP method named PHIForest (Pruned Histogram-based Isolation Forest). The proposed method provides fast convergence through histogram-based selection of splitting feature values, and it reduces the ensemble size and improves prediction performance through ensemble pruning. Comprehensive experiments on ten real datasets demonstrate the effectiveness of the proposed SDP method.
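
To make the idea of histogram-guided split selection more concrete, the following is a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how "preferable" feature values are identified, so the rule used here (split at the midpoint of the sparsest equal-width bin) is an assumption chosen for illustration only. The function name `histogram_split_value` and the parameter `n_bins` are likewise hypothetical.

```python
import random


def histogram_split_value(values, n_bins=10, rng=None):
    """Return a split value taken from the sparsest histogram bin.

    Illustrative heuristic only (an assumption, not the paper's rule):
    a plain isolation tree draws the split uniformly between min and max,
    whereas this sketch prefers low-density regions of the feature.
    """
    rng = rng or random.Random()
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # constant feature: there is nothing to split

    # Build an equal-width histogram over [lo, hi].
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the max into the last bin
        counts[idx] += 1

    # Choose among the bins holding the fewest points, breaking ties at random.
    min_count = min(counts)
    sparsest = [i for i, c in enumerate(counts) if c == min_count]
    chosen = rng.choice(sparsest)

    # Use the midpoint of the chosen bin as the split value.
    return lo + (chosen + 0.5) * width


# Example: five clustered values and one outlying value.
sample = [1.0, 1.1, 1.2, 1.3, 1.4, 9.0]
split = histogram_split_value(sample, n_bins=4, rng=random.Random(0))
```

With this sample the split falls in the empty region between the cluster and the outlying value, so a single cut separates them. A uniformly random split could instead land at or beyond 9.0, isolating nothing, or inside the dense cluster, leaving the outlier grouped with normal points; either wastes a tree level.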

Full Text