Abstract
In recent years, with the development of deep learning theory, researchers have tried to introduce artificial intelligence methods into the field of software defect prediction (SDP) to improve prediction performance. To be fed into a neural network, sample code is represented as an abstract syntax tree (AST), and the AST is encoded as real numbers. However, in most cross-project defect prediction (CPDP) tasks, the method for converting the AST into real numbers cannot effectively estimate the semantic distance between ASTs, which significantly degrades training. To solve this problem, we present a new encoding framework, tree-based-embedding (TBE), which converts ASTs into real-valued vectors and makes the semantic gap between ASTs measurable. To evaluate this encoding method, we propose a tree-based-embedding convolutional neural network with transferable hybrid feature learning (TBCNN-THFL) to perform CPDP tasks. TBCNN-THFL is fed data encoded with the TBE method to learn transferable joint features between different projects; in addition, TBCNN-THFL introduces a transfer component analysis algorithm. Furthermore, the model combines handcrafted and deep-learning-generated features and feeds them into a classifier to train a defect prediction model. Experiments demonstrate that TBCNN-THFL is superior to reference models on 72 CPDP task pairs formed from 9 open-source projects.
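As an illustration of what "transferable" means here: in its standard formulation, transfer component analysis measures the gap between source-project and target-project feature distributions with the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space. The sketch below only computes that distance; it does not learn transfer components. The RBF kernel, the `gamma` value, the function names, and the toy feature vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two samples.

    A value near zero means the two samples look alike in the kernel
    feature space; larger values mean a larger distribution gap.
    """
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Hypothetical two-dimensional feature vectors (made up, not real project data).
source_project = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
near_target = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
far_target = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

mmd_near = mmd_squared(source_project, near_target)
mmd_far = mmd_squared(source_project, far_target)
print(f"MMD^2 to similar project:    {mmd_near:.4f}")
print(f"MMD^2 to dissimilar project: {mmd_far:.4f}")
```

A feature mapping that transfers well across projects would be one under which this quantity is small for the source and target projects while the features remain useful for classification.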
Highlights
In the process of developing and maintaining software, the scale and complexity of the software will increase, making the task of debugging more difficult
To enhance the transferability of hybrid features in cross-project defect prediction (CPDP) tasks, we introduce transfer component analysis (TCA), which reduces the distance between the data distributions of different projects and learns transfer components among projects in a reproducing kernel Hilbert space (RKHS)
The performance of the tree-based embedding method (answer to RQ1): to demonstrate that our tree-based embedding method can improve the performance of deep learning models in CPDP, we compare the area under the curve (AUC) of TBCNN-TCA/TBCNN-THFL with that of models without TBE
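For readers unfamiliar with the AUC comparison mentioned in the last highlight, the following minimal sketch computes AUC as the probability that a randomly chosen defective sample receives a higher score than a randomly chosen clean one, with ties counted as one half. The function name, the labels, and both score lists are hypothetical and made up for illustration; they are not results from the paper.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for defective, 0 for clean.
    scores: predicted defect-proneness, higher means more likely defective.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one defective and one clean sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # defective sample ranked above clean sample
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truth and scores from two unnamed models (illustrative only).
labels = [1, 0, 1, 0, 0, 1]
model_a_scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7]
model_b_scores = [0.6, 0.5, 0.4, 0.7, 0.2, 0.8]

auc_a = auc_score(labels, model_a_scores)
auc_b = auc_score(labels, model_b_scores)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

Because AUC depends only on how samples are ranked, it is insensitive to the decision threshold, which makes it a common choice when defective and clean samples are imbalanced.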
Summary
In the process of developing and maintaining software, the scale and complexity of the software will increase, making the task of debugging more difficult. Features for determining whether software is defective are divided into manually extracted features and deep-learning-generated features. Manually extracted features are designed by researchers to distinguish between defect-prone code and bug-free code (e.g., MOOD features [5] built on polymorphic factors and coupling factors, CK features [6] developed from function and inheritance counts, Halstead features [7] based on operator and operand counts, and McCabe features [8] based on dependencies). Machine learning models such as naive Bayes (NB) [9], decision tree (DT) [10], [11], and support vector machine (SVM) [12] are fed the features described above and trained to determine whether the code is defective. As deep learning has rapidly developed, many researchers [13]–[15] have begun to introduce deep learning into SDP, leveraging its powerful feature