Semantic Feature Learning via Dual Sequences for Defect Prediction

Junhao Lin, Lu Lu

doi:10.1109/access.2021.3051957

Abstract

Software defect prediction (SDP) helps developers allocate limited resources for locating bugs and prioritize their testing efforts. Existing methods often serialize an Abstract Syntax Tree (AST) obtained from the program source code into a token sequence, which is then fed into a deep learning model to learn semantic features. However, different ASTs can yield the same token sequence, so a token sequence alone cannot distinguish their tree structures. To solve this problem, this paper proposes a framework called Semantic Feature Learning via Dual Sequences (SFLDS), which captures both the semantic and the structural information in the AST for feature generation. Specifically, we select the representative nodes in the AST and convert the program source code into a simplified AST (S-AST). Our method introduces two sequences to represent the semantic and structural information of the S-AST: one is the result of traversing the S-AST nodes in pre-order, and the other is composed of parent nodes. Each token in the dual sequences is then encoded as a numerical vector via mapping and word embedding. Finally, we use a bi-directional long short-term memory (BiLSTM) based neural network to automatically generate semantic features from the dual sequences for SDP. In addition, to leverage the statistical characteristics contained in handcrafted metrics, we also propose a framework called Defect Prediction via SFLDS (DP-SFLDS), which combines the semantic features generated by SFLDS with handcrafted metrics to perform SDP. In our empirical study, eight open-source Java projects from the PROMISE repository are chosen as subjects. Experimental results show that the proposed approach performs better than several state-of-the-art baseline SDP methods.
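
The ambiguity described in the abstract can be made concrete with a small sketch. It is not the authors' implementation: the `Node` class, the node-type names, the `"<ROOT>"` placeholder used as the root's parent, the shared vocabulary, and the zero-padding scheme are all assumptions introduced for illustration. The sketch builds two toy trees whose pre-order token sequences are identical, shows that their parent sequences differ, and then maps tokens to integer ids as a stand-in for the mapping step that precedes word embedding.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical S-AST node: a token plus its ordered children."""
    token: str
    children: List["Node"] = field(default_factory=list)


def dual_sequences(root: Node, root_parent: str = "<ROOT>") -> Tuple[List[str], List[str]]:
    """Return (pre-order tokens, parent tokens), aligned position by position."""
    tokens: List[str] = []
    parents: List[str] = []
    stack = [(root, root_parent)]
    while stack:
        node, parent = stack.pop()
        tokens.append(node.token)
        parents.append(parent)
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, node.token))
    return tokens, parents


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer id; id 0 is reserved for padding."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], length: int) -> List[int]:
    """Map tokens to ids, then truncate or zero-pad to a fixed length."""
    ids = [vocab[tok] for tok in seq][:length]
    return ids + [0] * (length - len(ids))


# Tree A: the method invocation is nested inside the if statement.
tree_a = Node("MethodDeclaration", [Node("IfStatement", [Node("MethodInvocation")])])
# Tree B: the method invocation is a sibling of the if statement.
tree_b = Node("MethodDeclaration", [Node("IfStatement"), Node("MethodInvocation")])

tokens_a, parents_a = dual_sequences(tree_a)
tokens_b, parents_b = dual_sequences(tree_b)

# Assumption: one vocabulary shared by both sequences.
vocab = build_vocab([tokens_a, parents_a, tokens_b, parents_b])
encoded_tokens_a = encode(tokens_a, vocab, 5)
encoded_parents_b = encode(parents_b, vocab, 5)
```

In this toy case the two trees share the pre-order sequence `MethodDeclaration, IfStatement, MethodInvocation`, while the last entry of the parent sequence is `IfStatement` for tree A and `MethodDeclaration` for tree B. The integer sequences would then be passed through an embedding layer before reaching the BiLSTM.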

Highlights

With the increasing scale and complexity of software, software testing has become one of the most critical phases in the software life cycle [1], [2].
Handcrafted software metrics such as lines of code, number of methods, and cyclomatic complexity play important roles in the development of software defect prediction (SDP). Many existing studies, such as [3]–[6], use handcrafted software metrics to describe the features of software and take them as input to train various machine learning models. These handcrafted software metrics are manually designed by researchers (e.g., McCabe metrics [7] based on dependencies, MOOD metrics [8] built on polymorphic factors and coupling factors, Halstead metrics [9] based on operation and operand counts, and CK metrics [10] developed from function and inheritance counts).
We propose to learn contextual semantic features from the pre-order and parent token sequences extracted from the simplified AST (S-AST) using a bi-directional long short-term memory (BiLSTM) based neural network.

Summary

Introduction

With the increasing scale and complexity of software, software testing has become one of the most critical phases in the software life cycle [1], [2]. The idea behind software defect prediction (SDP) is to use the historical versions of the software as a data set to train a machine learning model and predict whether new instances of code regions (e.g., files, changes, and functions) contain defects. Handcrafted software metrics such as lines of code, number of methods, and cyclomatic complexity play important roles in the development of SDP. Many existing studies, such as [3]–[6], use handcrafted software metrics to describe the features of software and take them as input to train various machine learning models. The trained model is then used to predict whether a new instance contains defects.
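
The train-on-history, predict-on-new workflow with handcrafted metrics can be sketched as follows. This is a minimal illustration only: the nearest-centroid classifier is a deliberately simple stand-in for the machine learning models used in the cited studies, the function names are hypothetical, and the metric values are made up rather than taken from any project.

```python
from typing import Dict, List, Sequence, Tuple

# (lines of code, number of methods, cyclomatic complexity)
Metrics = Tuple[float, float, float]


def centroid(rows: Sequence[Metrics]) -> List[float]:
    """Column-wise mean of a non-empty list of metric vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(history: Sequence[Tuple[Metrics, bool]]) -> Dict[bool, List[float]]:
    """Fit one centroid per class from a labeled historical version."""
    model: Dict[bool, List[float]] = {}
    for label in (True, False):
        rows = [metrics for metrics, defective in history if defective == label]
        model[label] = centroid(rows)
    return model


def predict(model: Dict[bool, List[float]], metrics: Metrics) -> bool:
    """Label a new instance defective if it is closer to the defective centroid."""
    def sq_dist(center: List[float]) -> float:
        return sum((x - c) ** 2 for x, c in zip(metrics, center))
    return sq_dist(model[True]) < sq_dist(model[False])


# Made-up metric vectors standing in for files of an older release.
old_release = [
    ((520.0, 30.0, 45.0), True),
    ((610.0, 42.0, 60.0), True),
    ((80.0, 5.0, 4.0), False),
    ((120.0, 8.0, 7.0), False),
]

model = train(old_release)

# A made-up file from a newer release, described by the same three metrics.
new_file = (450.0, 25.0, 38.0)
is_defective = predict(model, new_file)
```

The sketch leaves the features unscaled, so lines of code dominates the distance; a real pipeline would normally normalize the metrics and use a stronger classifier.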



Journal: IEEE Access
Publication Date: Jan 1, 2021
Citations: 57
License type: CC BY 4.0

