Abstract

In recent years, many studies on detecting code smells in source code have published datasets with notable limitations: ambiguous code smell definitions lead to different interpretations of each code smell, the datasets contain few samples, and their features are heterogeneous. As a result, comparing the performance of code smell detection models is challenging, and the datasets are often not reusable in other code smell detection studies. In this work, we propose the ml-Codesmell dataset, created by analyzing source code and extracting a large number of source code metrics together with many labelled code smells. The proposed dataset has been used to train machine learning algorithms and predict code smells. The high F1-score obtained in evaluation indicates a strong correlation between the dataset's features and labels. Given these advantages, the ml-Codesmell dataset is expected to be helpful for studies on detecting code smells with machine learning approaches in software development.
