Code Aggregate Graph: Effective Representation for Graph Neural Networks to Detect Vulnerable Code

Hoang Viet Nguyen, Tetsutaro Uehara, Junjun Zheng, Atsuo Inomata

doi:10.1109/access.2022.3216395

Open Access

https://doi.org/10.1109/access.2022.3216395

Journal: IEEE Access
Publication Date: Jan 1, 2022
Citations: 1
License type: CC BY 4.0

Affiliation: Ritsumeikan University, Osaka University

Abstract

Deep learning, especially graph neural networks (GNNs), provides efficient, fast, and automated methods to detect vulnerable code. However, the accuracy could be improved as previous studies were limited by existing code representations. Additionally, the diversity of embedding techniques and GNN models can make selecting the appropriate method challenging. Herein we propose Code Aggregate Graph (CAG) to improve vulnerability detection efficiency. CAG combines the principles of different code analyses such as abstract syntax tree, control flow graph, and program dependence graph with dominator and post-dominator trees. This extensive representation empowers deep graph networks for enhanced classification. We also implement different data encoding methods and neural networks to provide a multidimensional view of the system performance. Specifically, three word embedding approaches and three deep GNNs are utilized to build classifiers. Then CAG is evaluated using two datasets: a real-world open-source dataset and the software assurance reference dataset. CAG is also compared with seven state-of-the-art methods and six classic representations. CAG shows the best performance. Compared to previous studies, CAG has an increased accuracy (5.4%) and F1-score (5.1%). Additionally, experiments confirm that encoding has a positive impact on accuracy (4–6%) but the network type does not. The study should contribute to a meaningful benchmark for future research on code representations, data encoding, and GNNs.
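To make the idea of an aggregate representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy control flow graph for a single if/else, derives dominator and post-dominator trees from it with a simple iterative data-flow algorithm, and merges all three into one edge list in which each edge carries a type label. The node names, the helper functions (`dominators`, `dominator_tree`, `reverse`, `aggregate`), and the layer labels are hypothetical. AST and program dependence edges, which would come from a parser and a dependence analysis, could be added as further layers in the same way.

```python
from collections import defaultdict


def dominators(succ, entry):
    """Iteratively compute the dominator set of every node.

    Assumes every node is reachable from `entry`.
    """
    nodes = set(succ) | {t for targets in succ.values() for t in targets}
    pred = defaultdict(set)
    for src, targets in succ.items():
        for dst in targets:
            pred[dst].add(src)
    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in pred[n]]
            common = set.intersection(*pred_doms) if pred_doms else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def dominator_tree(succ, entry):
    """Return sorted (immediate dominator, node) edges."""
    dom = dominators(succ, entry)
    edges = []
    for n, ds in dom.items():
        strict = ds - {n}
        if strict:
            # Strict dominators form a chain; the closest one to n is the
            # one with the largest dominator set of its own.
            idom = max(strict, key=lambda d: len(dom[d]))
            edges.append((idom, n))
    return sorted(edges)


def reverse(succ):
    """Flip every edge, so post-dominators become dominators from the exit."""
    rev = defaultdict(set)
    for src, targets in succ.items():
        for dst in targets:
            rev[dst].add(src)
    return dict(rev)


def aggregate(layers):
    """Merge per-analysis edge lists into one list of (src, dst, kind)."""
    return sorted(
        (src, dst, kind)
        for kind, edges in layers.items()
        for src, dst in edges
    )


# Hypothetical CFG for: if (cond) { then } else { else }; join; exit
cfg = {
    "entry": {"cond"},
    "cond": {"then", "else"},
    "then": {"join"},
    "else": {"join"},
    "join": {"exit"},
}

cfg_edges = [(s, d) for s, targets in cfg.items() for d in targets]
dom_tree = dominator_tree(cfg, "entry")
postdom_tree = dominator_tree(reverse(cfg), "exit")

cag = aggregate({"cfg": cfg_edges, "dom": dom_tree, "postdom": postdom_tree})
for edge in cag:
    print(edge)
```

In this sketch the typed edge list is what a GNN pipeline would consume after node features are attached; keeping the edge type as an explicit label lets a relational model treat control flow, dominance, and post-dominance edges differently.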

Full Text