Abstract

In the context of recent research on the software defect prediction (SDP) task using pre-trained language models, the present study analyzes the relevance of semantic features extracted with BERT-based language models for detecting defective source code. The RoBERTa and CodeBERT-MLM language models are used to generate source code embeddings that capture the semantic and contextual features needed for language understanding tasks such as SDP. The learned representations are then fed to a neural network-based SDP classifier in order to determine which code embeddings are more informative for discriminating between faulty and non-faulty software entities. Extensive experiments are conducted in a cross-version SDP scenario on Apache Calcite, an open-source framework for data management. The evaluation results show a statistically significant improvement of the defect classifiers when the code representations are learned by the pre-trained models, compared with the semantic representations provided by other natural language-based models, doc2vec and LSI.

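The sketch below illustrates the kind of pipeline the abstract describes: a pre-trained BERT-family encoder produces fixed-size embeddings of source code, which are then passed to a small neural defect classifier. The checkpoint names, the use of the first-token embedding as a pooled representation, and the classifier layout are assumptions for illustration, not the authors' exact setup.

    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModel

    # Assumed checkpoint; "roberta-base" would be the analogous RoBERTa choice.
    MODEL_NAME = "microsoft/codebert-base-mlm"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    encoder = AutoModel.from_pretrained(MODEL_NAME)
    encoder.eval()

    def embed(source_code: str) -> torch.Tensor:
        """Return a fixed-size embedding for one code snippet,
        taken from the first token of the last hidden layer."""
        inputs = tokenizer(source_code, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            outputs = encoder(**inputs)
        return outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)

    class DefectClassifier(nn.Module):
        """Illustrative feed-forward SDP classifier on top of the embeddings."""
        def __init__(self, dim: int = 768, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),  # single logit: probability of defect
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    # Example usage on a hypothetical snippet (the classifier is untrained here,
    # so the score is only a placeholder).
    snippet = "public int divide(int a, int b) { return a / b; }"
    logit = DefectClassifier()(embed(snippet))
    print(torch.sigmoid(logit).item())

In a cross-version SDP setting such as the one studied, a classifier like this would be trained on embeddings from one release of the project and evaluated on a later release.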