A Suitable AST Node Granularity and Multi-Kernel Transfer Convolutional Neural Network for Cross-Project Defect Prediction

Jiehan Deng, Shaojian Qiu, Lu Lu, Yangpeng Ou

doi:10.1109/access.2020.2985780

Abstract

Cross-project defect prediction (CPDP) is a feasible way to perform software defect prediction (SDP) when historical data are lacking. Recent CPDP approaches have employed deep learning techniques to better exploit the information in programs' abstract syntax trees (ASTs). However, the granularity of the AST nodes and the difference in data distribution between projects may harm prediction performance, and many CPDP studies did not take these factors into consideration. To handle these issues, this paper explores a better AST node granularity and proposes a CPDP framework based on multi-kernel transfer convolutional neural networks. Specifically, for AST node granularity, we explore the differences among three AST node granularities and then compare the prediction performance of each granularity on several prediction models. For the CPDP framework, we first parse the program source code into ASTs and then encode the AST nodes into numerical vectors using an embedding technique. Second, to mine transferable semantic features, the encoded ASTs are fed into a convolutional neural network, to which a multi-kernel matching layer is added to minimize the data distribution divergence between the source and target projects. Finally, to make use of the information in the handcrafted features, the semantic features mined from the ASTs are combined with the handcrafted features to form the joint features for CPDP. We evaluate our approach on 110 CPDP tasks formed by 11 open-source projects, and the results show that the proposed CPDP method outperforms most deep learning-based approaches.
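As a rough illustration of what a multi-kernel matching term could compute, the sketch below estimates a squared maximum mean discrepancy (MMD) between source and target feature vectors, using an equal-weight mix of Gaussian kernels. The abstract does not name the divergence measure, so the choice of MMD, the kernel family, the bandwidths, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Equal-weight average of Gaussian kernels (assumed weighting)."""
    return sum(gaussian_kernel(x, y, bw) for bw in bandwidths) / len(bandwidths)


def mean_kernel(xs, ys, bandwidths):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(multi_kernel(x, y, bandwidths) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mk_mmd_squared(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared multi-kernel MMD between two samples."""
    return (
        mean_kernel(source, source, bandwidths)
        + mean_kernel(target, target, bandwidths)
        - 2.0 * mean_kernel(source, target, bandwidths)
    )


if __name__ == "__main__":
    # Hypothetical 2-dimensional semantic features for a few files.
    source_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
    target_features = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
    print(mk_mmd_squared(source_features, source_features))
    print(mk_mmd_squared(source_features, target_features))
```

In a transfer network, a term of this kind would typically be added to the classification loss so that training pulls the source and target feature distributions closer together; how the paper weights or places it is not stated in the abstract.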

Highlights

Modern software is becoming more and more powerful, and its continuously growing scale and complexity threaten its quality and reliability.
Evaluation: We evaluate the cross-project defect prediction (CPDP) performance under three granularities of abstract syntax tree (AST) nodes, and the effectiveness of our MK-TCNN approach by comparing its F-measure score with those of other CPDP models.
Evaluation on AST node granularity: To evaluate the AST node granularity, we conducted 110 CPDP tasks, formed by pairing 11 open-source projects, under each of the three AST node granularities.
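The task count and the metric mentioned above can be sketched as follows. Treating each ordered (source, target) pair of distinct projects as one task is an assumption that matches the stated count (11 × 10 = 110); the project names are placeholders, and the F-measure is the standard harmonic mean of precision and recall.

```python
from itertools import permutations


def cpdp_tasks(projects):
    """All ordered (source, target) pairs of distinct projects."""
    return list(permutations(projects, 2))


def f_measure(tp, fp, fn):
    """F-measure from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder names; the 11 real projects are not listed in this summary.
    projects = [f"P{i}" for i in range(1, 12)]
    tasks = cpdp_tasks(projects)
    print(len(tasks))          # number of source-to-target tasks
    print(f_measure(8, 2, 2))  # hypothetical confusion counts for one task
```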

Summary

Introduction

Modern software is becoming more and more powerful, and its continuously growing scale and complexity threaten its quality and reliability. Companies have to employ quality assurance teams to find defects in software, which is labor-intensive and costly work [1]. Software defect prediction (SDP) has been proposed to reduce the cost and time of software testing and to support quality assurance. The prerequisite for SDP is sufficient historical data (i.e., files from previous versions that are tagged as buggy or clean), which is hard to obtain in the early stages of software development. To address this lack of historical data, cross-project defect prediction (CPDP) has been proposed. Its main idea is to use the defect information of a mature project (called the source project) to build defect predictors and apply them to a new project (called the target project) to predict defect-prone software modules.

Methods

Results

Discussion

Conclusion