Abstract

Detecting vulnerabilities in source code is crucial for protecting software systems from cyberattacks. Pre-trained language models such as CodeBERT and GraphCodeBERT have achieved notable success in code-related downstream tasks such as code search and code translation. Recently, this pre-train-then-fine-tune paradigm has also been applied to detecting code vulnerabilities. However, fine-tuning pre-trained language models with a cross-entropy loss has several limitations, such as poor generalization and a lack of robustness to noisy labels. In particular, when vulnerable code and benign code are very similar, deep learning methods struggle to differentiate them accurately. In this context, we introduce SCL-CVD, a novel approach for code vulnerability detection based on supervised contrastive learning and GraphCodeBERT, which aims to enhance the effectiveness of existing vulnerable code detection approaches. SCL-CVD represents source code as data flow graphs, which are processed by GraphCodeBERT fine-tuned with a supervised contrastive loss combined with R-Drop; this fine-tuning process is designed to produce more robust and representative code embeddings. Additionally, we incorporate LoRA (Low-Rank Adaptation) to streamline fine-tuning, significantly reducing the time required for model training. Finally, a Multilayer Perceptron (MLP) detects vulnerable code from the learned code representations. We designed and conducted experiments on three public benchmark datasets (Devign, Reveal, and Big-Vul) as well as a combined dataset created by merging them. The experimental results demonstrate that SCL-CVD effectively improves the performance of code vulnerability detection. Compared with the baselines, the proposed approach achieves relative improvements of 0.48%∼3.42% in accuracy, 0.93%∼45.99% in precision, 35.68%∼67.48% in recall, and 16.31%∼49.67% in F1-score. Furthermore, it reduces model fine-tuning time by 16.67%∼93.03% compared to the baselines. In conclusion, SCL-CVD offers significantly greater cost-effectiveness than existing approaches.
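To make the model side of the pipeline concrete, the following is a minimal PyTorch sketch of a GraphCodeBERT encoder wrapped with LoRA adapters (via the PEFT library) and topped with an MLP classification head. The checkpoint name `microsoft/graphcodebert-base` refers to the public Hugging Face release; the LoRA rank, alpha, and dropout values and the class name `VulnClassifier` are illustrative assumptions rather than the paper's reported settings, and the data-flow-graph input construction that GraphCodeBERT uses is omitted for brevity.

```python
import torch.nn as nn
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Load the GraphCodeBERT encoder; LoRA hyperparameters below are
# illustrative assumptions, not the paper's reported configuration.
encoder = AutoModel.from_pretrained("microsoft/graphcodebert-base")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                      target_modules=["query", "value"])
encoder = get_peft_model(encoder, lora_cfg)  # only the low-rank adapters train


class VulnClassifier(nn.Module):
    """Encoder + MLP head (sketch); data-flow-graph inputs are omitted."""

    def __init__(self, encoder, hidden=768, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Dropout(0.1), nn.Linear(hidden, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the code embedding
        return self.head(cls), cls         # logits + embedding for the SupCon loss
```

Freezing the base weights and training only the low-rank adapters is what yields the reduced fine-tuning time reported in the abstract.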
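The training objective pairs a supervised contrastive loss over the learned code embeddings with R-Drop's symmetric KL consistency between two dropout-perturbed forward passes of the same batch. Below is a hedged sketch of both terms; the function names, the temperature of 0.07, and how the terms are weighted in the total loss are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F


def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss (Khosla et al., 2020).
    features: (B, D) L2-normalized embeddings; labels: (B,) class ids."""
    device = features.device
    sim = features @ features.T / temperature  # (B, B) scaled similarities
    # Exclude self-similarity on the diagonal.
    logits_mask = ~torch.eye(len(labels), dtype=torch.bool, device=device)
    # Positive pairs share a label (excluding self).
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & logits_mask
    # Log-softmax over all non-self pairs, computed stably via logsumexp.
    sim = sim.masked_fill(~logits_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability of positives per anchor that has any positive.
    pos_count = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_count)
    return loss[pos_mask.sum(1) > 0].mean()


def rdrop_kl(logits1, logits2):
    """Symmetric KL between two dropout-perturbed passes (R-Drop)."""
    p = F.log_softmax(logits1, dim=-1)
    q = F.log_softmax(logits2, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))
```

In training, each batch would be forwarded twice with dropout active, and the total loss would sum the cross-entropy of both passes with weighted `rdrop_kl` and `supcon_loss` terms; the weighting coefficients are hyperparameters not given in the abstract.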
