Code vulnerabilities expose millions of software systems to attack, as evidenced by the growing number of security issues reported every year, such as information leaks, system compromise, and denial of service. Although many vulnerability detection models have been proposed, their effectiveness remains limited because they neglect the syntactic structure of source code and handle labeling errors improperly. To address these issues, we propose Graph Confident Learning for Software Vulnerability Detection (GCL4SVD), a machine learning model that detects software vulnerabilities during the development phase. It comprises two components: code graph embedding and graph confident learning denoising. To overcome the lack of syntactic structure analysis, the code graph embedding component extracts the structural and semantic information of source code with a sliding window mechanism, and then encodes the source code into a graph structure that captures the patterns and characteristics of code vulnerabilities. To handle labeling errors, the graph confident learning denoising component identifies mislabeled samples to improve the quality of the training set. Experimental results show that GCL4SVD outperforms state-of-the-art vulnerability detection models on four open source datasets by 3.7%, 3.3%, 2.5%, and 0.8% in Accuracy, respectively, and by 10.2%, 21.8%, 8.2%, and 11.2% in F1-score, respectively.
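The two components can be illustrated with a minimal sketch. The abstract does not give the exact construction, so everything here is an assumption: the graph is built from token co-occurrence inside a fixed-size sliding window, and label errors are flagged with the per-class threshold rule commonly used in confident learning. The names `build_token_graph` and `find_label_issues` are hypothetical and are not taken from GCL4SVD.

```python
from collections import defaultdict


def build_token_graph(tokens, window=3):
    """Hypothetical helper: link distinct tokens that co-occur in a sliding window.

    Returns a dict mapping an undirected edge (sorted token pair) to the
    number of windows in which the pair appears.
    """
    edges = defaultdict(int)
    # Slide the window one token at a time; short inputs get a single window.
    for start in range(max(1, len(tokens) - window + 1)):
        span = tokens[start:start + window]
        for i in range(len(span)):
            for j in range(i + 1, len(span)):
                if span[i] != span[j]:  # no self-loops
                    edges[tuple(sorted((span[i], span[j])))] += 1
    return dict(edges)


def find_label_issues(labels, probs):
    """Hypothetical helper: flag samples whose given label looks wrong.

    Assumed rule: the threshold for class c is the mean predicted
    probability of c over samples labeled c. A sample is flagged when the
    most probable class that clears its own threshold differs from the
    given label.
    """
    n_classes = len(probs[0])
    thresholds = []
    for c in range(n_classes):
        own = [p[c] for y, p in zip(labels, probs) if y == c]
        # A class with no labeled samples can never be confidently predicted.
        thresholds.append(sum(own) / len(own) if own else float("inf"))

    issues = []
    for idx, (y, p) in enumerate(zip(labels, probs)):
        confident = [c for c in range(n_classes) if p[c] >= thresholds[c]]
        if confident and max(confident, key=lambda c: p[c]) != y:
            issues.append(idx)
    return issues


if __name__ == "__main__":
    graph = build_token_graph(["int", "x", "=", "x", "+", "1"], window=3)
    print(sorted(graph.items()))

    labels = [0, 0, 0, 1, 1, 1]  # 1 = vulnerable, 0 = benign (toy data)
    probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
             [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
    print(find_label_issues(labels, probs))
```

In this toy data the third sample is labeled benign while the classifier confidently predicts vulnerable, so it is the one flagged for removal or relabeling before training. A real pipeline would obtain `probs` from out-of-sample predictions of a graph-based classifier rather than from hand-written values.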