Abstract

In the context of recent research on the software defect prediction (SDP) task using pre-trained language models, the present study analyzes the relevance of semantic features extracted with BERT-based language models for detecting defective source code. The RoBERTa and CodeBERT-MLM language models are used to generate source code embeddings that capture the semantic and contextual features needed for language understanding tasks such as SDP. The learned representations are then fed to a neural network-based SDP classifier in order to determine which code embeddings are more informative for discriminating between faulty and non-faulty software entities. Extensive experiments are conducted in a cross-version SDP scenario on Apache Calcite, an open-source framework for data management. The evaluation results show a statistically significant improvement of the defect classifiers when the code representations are learned by the pre-trained models, compared with the semantic representations provided by other natural language-based models, doc2vec and LSI.

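The sketch below illustrates the kind of pipeline the abstract describes: a pre-trained BERT-family encoder produces fixed-size embeddings of source code, which are then passed to a small neural defect classifier. The checkpoint names, the use of the first-token embedding as a pooled representation, and the classifier layout are assumptions for illustration, not the authors' exact setup.

    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModel

    # Assumed checkpoint; "roberta-base" would be the analogous RoBERTa choice.
    MODEL_NAME = "microsoft/codebert-base-mlm"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    encoder = AutoModel.from_pretrained(MODEL_NAME)
    encoder.eval()

    def embed(source_code: str) -> torch.Tensor:
        """Return a fixed-size embedding for one code snippet,
        taken from the first token of the last hidden layer."""
        inputs = tokenizer(source_code, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            outputs = encoder(**inputs)
        return outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)

    class DefectClassifier(nn.Module):
        """Illustrative feed-forward SDP classifier on top of the embeddings."""
        def __init__(self, dim: int = 768, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),  # single logit: probability of defect
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    # Example usage on a hypothetical snippet (the classifier is untrained here,
    # so the score is only a placeholder).
    snippet = "public int divide(int a, int b) { return a / b; }"
    logit = DefectClassifier()(embed(snippet))
    print(torch.sigmoid(logit).item())

In a cross-version SDP setting such as the one studied, a classifier like this would be trained on embeddings from one release of the project and evaluated on a later release.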