Abstract

Code smells in software indicate poor design and implementation choices. Detecting and removing them is critical for sustainable software development. Machine learning (ML) can automate code smell detection. Most ML solutions train models from scratch on code smell datasets, using handcrafted source code metrics as features. Pretrained language models such as BERT fueled a paradigm shift in natural language processing: from handcrafted features to automatically inferred features, and from training models from scratch to using pretrained models. Code embeddings offer the potential to bring a similar paradigm shift to code analysis. Nevertheless, the potential of pretrained neural code embeddings for code smell detection has yet to be fully explored. To this end, we evaluated ML models trained on different code representations: code metrics and state-of-the-art neural code embeddings (CodeT5 and CuBERT). We experimented with the CodeT5 variants (base and small) and explored multiple ways of embedding code snippets (combining line-level embeddings or passing the entire snippet as input). We tested our approaches on the tasks of detecting the Data Class and Feature Envy smells on the MLCQ dataset. Considering the results of this study and our previous research, there is no clear performance winner between code metrics and code embeddings across code smell types and programming languages. However, because code embeddings, unlike code metrics, can automatically adapt to new programming constructs and are expected to scale better with dataset size, they are likely to become the state-of-the-art feature generation technique for code smell detection.
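
To illustrate the two snippet-embedding strategies mentioned above, the following minimal sketch (not the paper's exact pipeline; model choice, truncation length, and mean pooling are assumptions) shows how a code snippet could be embedded with CodeT5 via the Hugging Face transformers library, either as a whole or by averaging line-level embeddings.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumed CodeT5 checkpoint; "Salesforce/codet5-base" is the larger variant.
MODEL_NAME = "Salesforce/codet5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = T5EncoderModel.from_pretrained(MODEL_NAME)
encoder.eval()

def embed_snippet(code: str) -> torch.Tensor:
    """Embed the whole snippet: mask-aware mean pooling of encoder token states."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, dim)

def embed_snippet_by_lines(code: str) -> torch.Tensor:
    """Alternative: embed each non-empty line separately, then average the vectors."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    line_vecs = torch.cat([embed_snippet(ln) for ln in lines], dim=0)  # (n_lines, dim)
    return line_vecs.mean(dim=0, keepdim=True)                         # (1, dim)
```

In such a setup, the resulting fixed-length vectors replace handcrafted code metrics as features and can be fed to a conventional classifier (e.g., logistic regression or random forest) trained to flag Data Class or Feature Envy instances.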
