Graph Neural Network for Source Code Defect Prediction

Lucija Sikic, Adrian Satja Kurdija, Marin Silic, Klemo Vladimir

doi:10.1109/access.2022.3144598

Open Access

https://doi.org/10.1109/access.2022.3144598

Abstract

Predicting defective software modules before testing helps reduce the time and cost of software testing. In recent years, several models have been proposed for this purpose, most of them built with deep learning methods. However, most of these models do not take full advantage of the source code: they either ignore its tree structure or focus on only a small part of the code. To investigate whether, and to what extent, information from this structure is beneficial in predicting defective source code, we developed an end-to-end defect prediction model based on a convolutional graph neural network (GCNN). Its architecture can be adapted to the analyzed software, so that projects of different sizes are processed at the same level of detail. The model processes the node and edge information of the abstract syntax tree (AST) of a software module’s source code and, based on this information, classifies the module as defective or not defective. Experiments on open source projects written in Java show that the proposed model performs significantly better than traditional defect prediction models in terms of AUC and F-score. Measured by F-score, its predictive capability on the analyzed projects is comparable to that of existing state-of-the-art models.
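To make the idea of classifying a module from the nodes and edges of its AST concrete, the following is a minimal sketch of a single graph-convolution step followed by a graph-level readout. It is not the architecture of the proposed model: the mean aggregation, the undirected treatment of edges, the mean-pool readout, the sigmoid output, and all names (`gcn_layer`, `readout`, `predict_defect_probability`, the toy weights) are assumptions chosen for illustration.

```python
import math

def gcn_layer(features, edges, weight):
    """One simplified graph-convolution step (illustrative assumption):
    average each node's vector with its neighbours' vectors, apply a
    linear map, then ReLU. Edges are treated as undirected and every
    node gets a self-loop."""
    n = len(features)
    neighbours = [{i} for i in range(n)]  # self-loops
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    dim_in, dim_out = len(weight), len(weight[0])
    out = []
    for i in range(n):
        mean = [sum(features[j][k] for j in neighbours[i]) / len(neighbours[i])
                for k in range(dim_in)]
        out.append([max(0.0, sum(mean[k] * weight[k][c] for k in range(dim_in)))
                    for c in range(dim_out)])
    return out

def readout(features):
    """Mean-pool all node vectors into one graph-level vector."""
    n = len(features)
    return [sum(row[c] for row in features) / n
            for c in range(len(features[0]))]

def predict_defect_probability(features, edges, weight, classifier):
    """Graph convolution, readout, then a sigmoid over a linear score."""
    pooled = readout(gcn_layer(features, edges, weight))
    logit = sum(p * w for p, w in zip(pooled, classifier))
    return 1.0 / (1.0 + math.exp(-logit))

# Toy AST with three nodes: node 0 is the parent of nodes 1 and 2.
# Node features are one-hot encodings of hypothetical node types.
toy_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_edges = [(0, 1), (0, 2)]
toy_weight = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]  # untrained, made up
toy_classifier = [1.0, -1.0]                          # untrained, made up

probability = predict_defect_probability(
    toy_features, toy_edges, toy_weight, toy_classifier)
print(f"defect probability (untrained toy model): {probability:.3f}")
```

In a trained model the weights would be learned from labeled modules, and several such layers would typically be stacked so that information from more distant AST nodes reaches each node.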

Highlights

The development of source code defect prediction models plays an important role in improving software quality
The experimental results should allow us to determine whether, in cross-version defect prediction, the proposed model outperforms state-of-the-art models that rely on information from the abstract syntax trees (ASTs) generated from the modules’ source code
We present DP-GCNN, a defect prediction model based on a neural network architecture tailored to graph data, of which ASTs are an instance

Summary

INTRODUCTION

The development of source code defect prediction models plays an important role in improving software quality. Although existing defect prediction models extract features representing software modules from the ASTs of the modules’ source code, the vast majority of them treat ASTs as linear sequences of nodes and use natural language processing models to generate embedding vectors of these sequences. In this way, they ignore the structure of ASTs and miss the opportunity to use the additional information it carries. The contributions of this work include an end-to-end software defect prediction (SDP) model that identifies defective software modules by using a GCNN to capture the entire information of the ASTs representing the modules’ source code.
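The difference between treating an AST as a linear sequence and treating it as a graph can be illustrated with a short sketch. Python’s standard `ast` module is used here purely as a stand-in parser, which is an assumption of convenience, since the analyzed projects are written in Java. The helper names `ast_as_sequence` and `ast_as_graph` are hypothetical.

```python
import ast

def ast_as_sequence(tree):
    """Flatten the AST into a pre-order list of node-type names.
    The parent-child relations are lost in this view."""
    sequence = []
    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)
    visit(tree)
    return sequence

def ast_as_graph(tree):
    """Return (labels, edges): the same node-type names in pre-order,
    plus one (parent_index, child_index) pair per tree edge."""
    labels, edges = [], []
    def visit(node, parent_index):
        index = len(labels)
        labels.append(type(node).__name__)
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)
    visit(tree, None)
    return labels, edges

source = "def f(x):\n    return x + 1\n"
tree = ast.parse(source)

sequence = ast_as_sequence(tree)
labels, edges = ast_as_graph(tree)

print(sequence)  # node types only
print(edges)     # the structure a sequence model does not see
```

Both views contain the same node labels; only the graph view keeps the edge list, which is the structural information a graph neural network can exploit.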

BACKGROUND

REPRESENTING SOFTWARE MODULES WITH ASTs

EXPERIMENTS AND RESULTS

RELATED WORK

THREATS TO VALIDITY

CONCLUSION AND FUTURE WORK