Semi-supervised detection of Long Method and God Class code smells

Ilija Brdar, Jelena Slivka, Jelena Vlajkov, Aleksandar Kovacevic, Katarina-Glorija Grujic

doi:10.1109/sisy56759.2022.10036248

Abstract

Code smells are poorly designed parts of code whose removal is essential for sustainable software development. However, recognizing code smells in practice is challenging. Machine Learning (ML)-based code smell detectors could solve this problem. Current ML-based code smell detection approaches rely on supervised learning (SL), which requires a large and diverse dataset for training. Unfortunately, the existing code smell datasets are small, which hinders the performance of the trained SL models. This paper aims to improve the performance of ML-based code smell detectors by employing semi-supervised learning (SSL). SSL models are trained by combining a manually labeled code smell dataset with unlabeled code snippets collected from open-source repositories. Two major SSL techniques are employed: self-training and co-training. Experiments were performed for two code smell types: God Class and Long Method. SSL classifiers significantly outperformed SL classifiers for God Class detection (by 6% F-measure). For Long Method detection, SSL classifiers slightly outperformed SL classifiers (by 1% F-measure). This paper is the first to consider applying SSL to code smell detection. That SSL models outperformed SL models in all experiments suggests that SSL holds great potential to improve current code smell detectors, which is essential for their adoption in practice.
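To make the self-training idea concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes each method is described by two hypothetical metrics (lines of code and cyclomatic complexity), uses a simple nearest-centroid classifier as a stand-in for the actual models, and picks an arbitrary confidence threshold. All function names and data values are illustrative assumptions.

```python
import math

def centroids(X, y):
    """Mean feature vector per class label (assumes at least two classes)."""
    out = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        out[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return out

def predict_with_confidence(cents, x):
    """Return (label, confidence); confidence is the relative margin
    between the nearest and second-nearest centroid, in [0, 1]."""
    dists = sorted((math.dist(c, x), label) for label, c in cents.items())
    (d0, label), (d1, _) = dists[0], dists[1]
    conf = (d1 - d0) / (d1 + d0) if d1 + d0 > 0 else 0.0
    return label, conf

def self_train(X_lab, y_lab, X_unl, threshold=0.5, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled examples and retrain."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_rounds):
        cents = centroids(X, y)
        confident, rest = [], []
        for x in pool:
            label, conf = predict_with_confidence(cents, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing left that the current model is sure about
        for x, label in confident:
            X.append(x)
            y.append(label)
        pool = [x for x, _ in rest]
    return centroids(X, y), len(X) - len(X_lab)

# Hypothetical data: [lines_of_code, cyclomatic_complexity]; 1 = Long Method.
X_labeled = [[10, 2], [15, 3], [120, 25], [150, 30]]
y_labeled = [0, 0, 1, 1]
X_unlabeled = [[12, 2], [140, 28], [20, 4], [110, 22], [70, 14]]

model, n_pseudo_labeled = self_train(X_labeled, y_labeled, X_unlabeled)
```

In this sketch, borderline examples that never reach the confidence threshold stay unlabeled, which limits the risk of reinforcing early mistakes. Co-training follows the same loop but trains two classifiers on different feature views, each supplying confident pseudo-labels to the other.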

Full Text