Deep Neural Embedding for Software Vulnerability Discovery: Comparison and Optimization

Xue Yuan, Jun Zhang, Guanjun Lin, Yonghang Tai, Weizhi Meng

doi:10.1155/2022/5203217

Open Access

https://doi.org/10.1155/2022/5203217

Abstract

Because sophisticated software programs contain numerous vulnerabilities, the detection performance of existing approaches requires further improvement. Multiple vulnerability detection approaches have been proposed to aid code inspection. Among them, one line of approaches applies deep learning (DL) techniques and achieves promising results. This paper uses CodeBERT, a deep contextualized model, as an embedding solution to facilitate the detection of vulnerabilities in open-source C projects. Applying CodeBERT to code analysis allows the rich, latent patterns within software code to be revealed, which can facilitate downstream tasks such as software vulnerability detection. CodeBERT inherits the architecture of BERT, providing a stacked transformer encoder in a bidirectional structure. This facilitates the learning of vulnerable code patterns, which requires long-range dependency analysis. Additionally, the multihead attention mechanism of the transformer allows the model to focus on multiple key variables of a data flow, which is crucial for analyzing and tracing potentially vulnerable data flows and eventually results in improved detection performance. To evaluate the effectiveness of the proposed CodeBERT-based embedding solution, four mainstream embedding methods are compared for generating software code embeddings: CodeBERT, Word2Vec, GloVe, and FastText. Experimental results show that the CodeBERT-based embedding outperforms the other embedding models on the downstream vulnerability detection tasks. To further boost performance, we propose to include synthetic vulnerable functions and to fine-tune on both synthetic and real-world data so that the model learns C-related vulnerable code patterns. We also explore suitable configurations of CodeBERT. The evaluation results show that the model with the new parameters outperforms some state-of-the-art detection methods on our dataset.
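As a rough illustration of the multihead attention idea mentioned in the abstract, the following Python sketch computes self-attention over a short token sequence with two heads. It is not CodeBERT: the projections are the identity (real transformers learn query, key, value, and output matrices), and the token vectors in TOY are hand-picked, hypothetical values chosen only so that each head attends to a different token.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * v[i] for w, v in zip(row, values))
                        for i in range(len(values[0]))])
    return outputs, weights

def split_heads(vectors, num_heads):
    """Slice every vector into num_heads contiguous chunks, one list per head."""
    dim = len(vectors[0])
    assert dim % num_heads == 0, "dimension must be divisible by num_heads"
    size = dim // num_heads
    return [[v[h * size:(h + 1) * size] for v in vectors] for h in range(num_heads)]

def multi_head_attention(vectors, num_heads):
    """Self-attention per head, then concatenation of the head outputs.

    Simplification (assumption): identity projections instead of learned ones.
    """
    head_outputs, head_weights = [], []
    for head in split_heads(vectors, num_heads):
        out, w = attention(head, head, head)
        head_outputs.append(out)
        head_weights.append(w)
    merged = [sum((head_outputs[h][t] for h in range(num_heads)), [])
              for t in range(len(vectors))]
    return merged, head_weights

# Hypothetical 4-dimensional token vectors; not produced by any trained model.
TOY = {
    "buf":    [2.0, 0.0, 0.0, 1.0],
    "malloc": [0.0, 0.0, 2.0, 0.0],
    "n":      [0.0, 2.0, 1.0, 0.0],
    "strcpy": [0.0, 0.0, 0.0, 2.0],
}
tokens = ["buf", "malloc", "n", "strcpy", "buf"]
vectors = [TOY[t] for t in tokens]

if __name__ == "__main__":
    _, head_weights = multi_head_attention(vectors, num_heads=2)
    for h, rows in enumerate(head_weights):
        focus = max(range(len(tokens)), key=lambda j: rows[0][j])
        print(f"head {h}: first 'buf' attends most to position {focus} ({tokens[focus]})")
```

With these toy vectors, one head links the first `buf` to its distant second occurrence while the other head links it to `strcpy`, which is the kind of multi-variable focus the abstract attributes to multihead attention.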

Highlights

Software vulnerability has long been a severe but crucial research issue in cybersecurity [1,2,3]. These security vulnerabilities threaten the IT infrastructure of organizations and government sectors. There are increasingly more vulnerabilities being discovered.
Research Question 1 (RQ1): Selecting a suitable embedding method can be a critical task since it can affect the performance of the vulnerability detectors. Thus, we compare the effectiveness of CodeBERT with other traditional embedding models to determine the most viable embedding model for vulnerability detection.
This paper proposes an embedding solution for vulnerability detection which is based on CodeBERT.

Summary

Introduction

Software vulnerability has long been a severe but crucial research issue in cybersecurity [1,2,3]. These security vulnerabilities threaten the IT infrastructure of organizations and government sectors. There are increasingly more vulnerabilities being discovered. Therefore, researchers are motivated to improve the usefulness of deep learning-based vulnerability detection solutions from various aspects. The process of applying deep learning techniques in the context of vulnerability detection can be divided into four steps: data collection, data preparation, model building, and evaluation/test. CodeBERT is based on a bidirectional transformer, which can capture long-distance dependencies in code sequences. It can preserve the relationship between contexts, capture latent vulnerable code patterns, and minimize the loss of information. Recent research has achieved impressive results on embedding source code by applying natural language techniques such as Word2Vec, GloVe, and FastText. CodeBERT has achieved promising results in many code processing and analysis tasks such as clone detection, defect detection, and natural language code search. However, it has not been used in the context of C language vulnerability detection.
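The four steps named above (data collection, data preparation, model building, and evaluation/test) can be sketched end to end in a few lines of Python. This is a minimal illustration under loud assumptions, not the paper's method: the C snippets and labels are hand-written and hypothetical, a bag-of-tokens count stands in for a learned embedding such as CodeBERT, and a nearest-centroid rule stands in for a neural classifier.

```python
import math
import re
from collections import Counter

# Step 1: data collection. Hypothetical hand-written samples; label 1 = vulnerable.
TRAIN = [
    ("void f(char *s) { char b[8]; strcpy(b, s); }", 1),
    ("void g(char *s) { char b[8]; gets(b); }", 1),
    ("void h(char *s) { char b[8]; strncpy(b, s, sizeof(b) - 1); b[7] = 0; }", 0),
    ("int k(int a, int b) { return a + b; }", 0),
]
HELD_OUT = [
    ("void p(char *in) { char out[16]; strcpy(out, in); }", 1),
    ("int q(int x) { return x * 2; }", 0),
]

# Step 2: data preparation. Tokenize, then map each function to a vector.
def tokenize(code):
    """Split C source into identifiers, integer literals, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def embed(code):
    """Bag-of-tokens counts: an assumed stand-in for a learned embedding."""
    return Counter(tokenize(code))

# Step 3: model building. One averaged vector (centroid) per class.
def centroid(vectors):
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {tok: count / len(vectors) for tok, count in total.items()}

def cosine(a, b):
    dot = sum(a[tok] * b.get(tok, 0.0) for tok in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(samples):
    return {label: centroid([embed(code) for code, y in samples if y == label])
            for label in (0, 1)}

def predict(model, code):
    vec = embed(code)
    return max(model, key=lambda label: cosine(vec, model[label]))

# Step 4: evaluation/test. Precision, recall, and F1 for the vulnerable class.
def evaluate(model, samples):
    tp = fp = fn = 0
    for code, label in samples:
        guess = predict(model, code)
        if guess == 1 and label == 1:
            tp += 1
        elif guess == 1 and label == 0:
            fp += 1
        elif guess == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    model = train(TRAIN)
    print("precision, recall, F1:", evaluate(model, HELD_OUT))
```

Swapping `embed` for a different embedding function while keeping the other three steps fixed is one way to compare embedding methods on the same downstream task.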

Related Work

Methodology

Result

Experiment and Evaluation

Conclusions and Future