MICHAC: Defect Prediction via Feature Selection Based on Maximal Information Coefficient with Hierarchical Agglomerative Clustering

Zhou Xu, Jin Liu, Jifeng Xuan, Xiaohui Cui

doi:10.1109/saner.2016.34

Abstract

Defect prediction aims to estimate software reliability by learning from historical defect data. A defect prediction method identifies whether a software module is defect-prone according to metrics mined from software projects. These metric values, also known as features, may be irrelevant or redundant, which hurts the performance of defect prediction methods. Existing work employs feature selection to preprocess defect data and filter out useless features. In this paper, we propose a novel feature selection framework, MICHAC, short for defect prediction via Maximal Information Coefficient with Hierarchical Agglomerative Clustering. MICHAC consists of two major stages. First, MICHAC employs the maximal information coefficient to rank candidate features and filter out irrelevant ones. Second, MICHAC groups features with hierarchical agglomerative clustering and selects one feature from each resulting group to remove redundant features. We evaluate our proposed method on 11 widely studied NASA projects and four open-source AEEEM projects using three different classifiers with four performance metrics (precision, recall, F-measure, and AUC). Comparison with five existing methods demonstrates that MICHAC is effective in selecting features for defect prediction.
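
The two-stage structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. In particular, the relevance score below is the absolute Pearson correlation, used only as a dependency-free stand-in for the maximal information coefficient. The function names, the keep_ratio and n_clusters parameters, the average-linkage rule, the 1 - |correlation| distance, and the choice of the highest-scoring feature per cluster are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def relevance(feature, labels):
    # ASSUMPTION: stand-in for MIC. A faithful version would compute the
    # maximal information coefficient between the feature and the label.
    return abs(pearson(feature, labels))


def select_features(columns, labels, keep_ratio=0.5, n_clusters=2):
    """columns maps a feature name to its list of values, one per module."""
    # Stage 1: rank features by relevance to the label and drop the tail.
    scores = {name: relevance(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = ranked[:max(1, int(round(len(ranked) * keep_ratio)))]

    # Stage 2: agglomerative clustering of the kept features.
    # ASSUMPTION: distance is 1 - |correlation|, merged with average linkage.
    def dist(a, b):
        return 1.0 - abs(pearson(columns[a], columns[b]))

    def linkage(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[name] for name in kept]  # every feature starts alone
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]

    # ASSUMPTION: the representative of a cluster is its most relevant feature.
    return [max(cluster, key=scores.get) for cluster in clusters]


# Toy data: six modules, hypothetical metric names, label 1 = defective.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "loc": [10, 12, 11, 30, 32, 31],
    "loc_copy": [20, 25, 22, 60, 63, 62],  # nearly redundant with "loc"
    "complexity": [1, 5, 2, 6, 3, 9],
    "noise": [5, 1, 4, 2, 5, 3],           # uncorrelated with the label
}
selected = select_features(columns, labels, keep_ratio=0.75, n_clusters=2)
print(selected)
```

On this toy data the first stage discards the uncorrelated feature, and the second stage places the two near-duplicate size metrics in one cluster so that only one of them survives.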

Full Text