Abstract

Software defect prediction (SDP) helps developers allocate limited resources to locating bugs and prioritize their testing efforts. Existing methods often serialize an Abstract Syntax Tree (AST) obtained from the program source code into a token sequence, which is then fed into a deep learning model to learn semantic features. However, different ASTs can yield the same token sequence, so the tree structure of an AST cannot be distinguished from a token sequence alone. To solve this problem, this paper proposes a framework called Semantic Feature Learning via Dual Sequences (SFLDS), which captures both the semantic and the structural information in the AST for feature generation. Specifically, we select the representative nodes in the AST and convert the program source code into a simplified AST (S-AST). Our method introduces two sequences to represent the semantic and structural information of the S-AST: one is the result of traversing the S-AST nodes in pre-order, and the other is composed of parent nodes. Each token in the dual sequences is then encoded as a numerical vector via mapping and word embedding. Finally, we use a bi-directional long short-term memory (BiLSTM) based neural network to automatically generate semantic features from the dual sequences for SDP. In addition, to leverage the statistical characteristics contained in handcrafted metrics, we also propose a framework called Defect Prediction via SFLDS (DP-SFLDS), which combines the semantic features generated by SFLDS with handcrafted metrics to perform SDP. In our empirical studies, eight open-source Java projects from the PROMISE repository are chosen as subjects. Experimental results show that the proposed approach performs better than several state-of-the-art baseline SDP methods.
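The ambiguity described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the `Node` class, the `<ROOT>` placeholder for the root's missing parent, the node-type names, and the choice to align each parent token with its node's position in the pre-order sequence are all assumptions made for illustration. The sketch builds two different trees that share one pre-order token sequence, shows that their parent sequences differ, and then maps tokens to integer ids (the step that would precede word embedding).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

ROOT_PARENT = "<ROOT>"  # assumed placeholder: the root node has no parent


@dataclass
class Node:
    """A hypothetical S-AST node: a token (node type) plus ordered children."""
    token: str
    children: List["Node"] = field(default_factory=list)


def dual_sequences(root: Node) -> Tuple[List[str], List[str]]:
    """Return (pre-order tokens, parent tokens), aligned position by position."""
    pre_order: List[str] = []
    parents: List[str] = []
    stack = [(root, ROOT_PARENT)]
    while stack:
        node, parent_token = stack.pop()
        pre_order.append(node.token)
        parents.append(parent_token)
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, node.token))
    return pre_order, parents


def build_vocab(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Map each distinct token to an integer id; 0 is left free for padding (assumption)."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(seq: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace every token by its integer id."""
    return [vocab[tok] for tok in seq]


# Tree A: the return statement sits inside the if statement.
tree_a = Node("MethodDeclaration", [
    Node("IfStatement", [Node("MethodInvocation"), Node("ReturnStatement")]),
])

# Tree B: the return statement is a sibling of the if statement.
tree_b = Node("MethodDeclaration", [
    Node("IfStatement", [Node("MethodInvocation")]),
    Node("ReturnStatement"),
])

pre_a, par_a = dual_sequences(tree_a)
pre_b, par_b = dual_sequences(tree_b)
vocab = build_vocab([pre_a, par_a, pre_b, par_b])

print(pre_a == pre_b)   # the pre-order sequences coincide
print(par_a == par_b)   # the parent sequences do not
print(encode(pre_a, vocab), encode(par_a, vocab))
print(encode(pre_b, vocab), encode(par_b, vocab))
```

In this toy case the pre-order sequence alone cannot tell the two trees apart, while the pair of sequences can, because the last position records `IfStatement` as the parent in one tree and `MethodDeclaration` in the other. The integer ids are only the mapping step; the embedding and the BiLSTM that would consume them are outside the scope of this sketch.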

Highlights

  • With the increasing scale and complexity of software, software testing has become one of the most critical phases in the software life cycle [1], [2]

  • Handcrafted software metrics such as lines of code, number of methods, and cyclomatic complexity play important roles in the development of software defect prediction (SDP). Many existing studies, such as [3]–[6], use handcrafted software metrics to describe the features of software and take them as input to train various machine learning models. These handcrafted software metrics are manually designed by researchers (e.g., McCabe metrics [7] based on dependencies, MOOD metrics [8] built on polymorphic factors and coupling factors, Halstead metrics [9] based on operation and operand counts, and CK metrics [10] developed from function and inheritance counts)

  • We propose to learn contextual semantic features from the pre-order and parent token sequences extracted from the simplified Abstract Syntax Tree (S-AST) using a bi-directional long short-term memory (BiLSTM)-based neural network


Summary

Introduction

With the increasing scale and complexity of software, software testing has become one of the most critical phases in the software life cycle [1], [2]. The idea behind SDP is to use the historical versions of the software as a data set to train a machine learning model and predict whether new instances of code regions (e.g., files, changes, and functions) contain defects. Handcrafted software metrics such as lines of code, number of methods, and cyclomatic complexity play important roles in the development of SDP. Many existing studies, such as [3]–[6], use handcrafted software metrics to describe the features of software and take them as input to train various machine learning models. The trained model is then used to predict whether a new instance contains defects.

