Semantic feature learning for software defect prediction from source code and external knowledge

Jingyu Liu, Jun Ai, Minyan Lu, Jie Wang, Haoxiang Shi

doi:10.1016/j.jss.2023.111753

Abstract

Software defects not only reduce operational reliability but also significantly increase overall maintenance costs. Consequently, it is necessary to predict software defects at an early stage. Existing software defect prediction studies perform classification using either manually designed metrics or features extracted from source code by machine learning-based approaches. However, these methods fail to make full use of defect-related information beyond the code itself, such as code comments and commit messages. Therefore, in this paper, additional information extracted from natural language text is combined with programming language code to enrich the semantic features. A novel model based on the Transformer architecture and a multi-channel CNN, PM2-CNN, is proposed for software defect prediction. A pretrained language model and a CNN-based classifier are used in the model to obtain context-sensitive representations and capture the local correlation of sequences. A large and widely used dataset is used to verify the effectiveness of the proposed method. The results show that the proposed method improves on the best-performing baseline method in generic evaluation metrics. Accordingly, external information can have a positive impact on software defect prediction, and our model effectively incorporates such information to improve detection performance.

Full Text