Using Code Evolution Information to Improve the Quality of Labels in Code Smell Datasets

Yijun Wang, Linfeng Yin, Songyuan Hu, Xiaocong Zhou

doi:10.1109/compsac.2018.00015

Abstract

Several approaches have been proposed to detect code smells. An important group of these approaches is based on machine learning algorithms, which require source code in which code smells have already been labeled as training data. Common labeling approaches rely on manual inspection or on tools, but it is difficult for current approaches to produce reliable large-scale datasets. In this paper, we propose an approach that uses the evolution information of source code to obtain large-scale and more reliable training datasets for machine-learning-based code smell detection. Our approach analyzes how code smells, first labeled by a tool in the baseline version of a software system, evolve in the contrastive version, and then constructs training datasets based on those smells. Experiments on three open source software projects, covering four types of code smells (Data Class, God Class, Brain Class, and Brain Method), show that models trained on datasets of changed smells perform better at code smell detection than models trained on datasets of unchanged smells (with an average improvement rate of 7.8% and a maximum increase of 30%). The experimental results indicate that using the evolution information of source code can yield more reliable training datasets for machine-learning-based code smell detection.
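
As a rough illustration of the idea described in the abstract, the following Python sketch partitions tool-labeled smells into changed and unchanged groups by comparing each entity between the baseline and contrastive versions. The abstract does not say how a change is detected, so the hash comparison, the LabeledEntity type, the split_by_evolution function, and the example names are all hypothetical assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledEntity:
    """A class or method that a detection tool flagged in the baseline version."""
    name: str           # fully qualified name (hypothetical identifier scheme)
    smell: str          # e.g. "God Class"
    baseline_hash: str  # hash of the entity's source in the baseline version


def split_by_evolution(labeled, contrastive_hashes):
    """Partition tool-labeled smells into (changed, unchanged).

    contrastive_hashes maps entity name -> source hash in the contrastive
    version. Assumption: an entity that is missing from the contrastive
    version is treated as changed.
    """
    changed, unchanged = [], []
    for entity in labeled:
        later_hash = contrastive_hashes.get(entity.name)
        if later_hash == entity.baseline_hash:
            unchanged.append(entity)
        else:
            changed.append(entity)
    return changed, unchanged


# Hypothetical example data: three entities flagged in the baseline version.
labeled = [
    LabeledEntity("org.example.Order", "Data Class", "a1"),
    LabeledEntity("org.example.Engine", "God Class", "b2"),
    LabeledEntity("org.example.Gone", "Brain Class", "d4"),
]
contrastive = {"org.example.Order": "a1", "org.example.Engine": "c3"}

changed, unchanged = split_by_evolution(labeled, contrastive)
```

Each group would then serve as a separate training dataset, so that models trained on changed smells can be compared against models trained on unchanged smells, as in the experiments summarized above.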

Full Text