Ml-Codesmell: A code smell prediction dataset for machine learning approaches

Binh Nguyen Thanh,Minh Nguyen N H,Hanh Le Thi My,Binh Nguyen Thanh

doi:10.1145/3568562.3568643

Abstract

In recent years, many studies on detecting code smells in source code have published datasets with limited characteristics, such as the ambiguity of code smell definitions leads to different interpretations for each code smell, the number of samples of the datasets is small, and the features of the datasets are heterogeneous. Therefore, comparing performance between detecting code smell models is challenging, and the datasets are often not reusable in other code smell detection studies. In this work, we propose the ml-Codesmell dataset created by analyzing source code and extracting massive source code metrics with many labelled code smells. The proposed dataset has been used to train and predict code smell using machine learning algorithms. Based on the high confidential F1-score in evaluation, the ml-Codesmell dataset demonstrates a strong correlation between features and labels. Regarding these advantages, the ml-Codesmell dataset is expected to be helpful for studies on detecting code smell using machine learning approaches in software development.

Full Text