Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities can pose significant security risks. Recently, reusing various code pre-trained models (e.g., CodeBERT, CodeT5, and CodeGen) for code embedding in vulnerability detection has become common practice, often without reasonable justification. The premise behind casually utilizing pre-trained models (PTMs) is that the code embeddings generated by different PTMs would have a similar impact on performance. Is that TRUE? To answer this important question, we systematically investigate the effects of code embeddings generated by ten different code PTMs on vulnerability detection performance, and find that it is NOT true. We observe that code embeddings generated by various code PTMs can indeed influence performance, and that selecting an embedding technique based on parameter scale or embedding dimension is not reliable. Our findings highlight the necessity of quantifying and evaluating the characteristics of the code embeddings generated by various code PTMs to understand these effects. To achieve this goal, we analyze the numerical representation and data distribution of the code embeddings generated by different PTMs to evaluate their differences and characteristics. Based on these insights, we propose Coding-PTMs, a recommendation framework to assist engineers in selecting optimal code PTMs for their specific vulnerability detection tasks. Specifically, we define thirteen code embedding metrics across three dimensions (i.e., statistics, norm, and distribution) for constructing a specialized code PTM recommendation dataset. We then employ a Random Forest classifier to train a recommendation model and identify the optimal code PTMs from the candidate model zoo.
We encourage engineers to use Coding-PTMs to evaluate the characteristics of code embeddings generated by candidate code PTMs and to obtain recommendations of optimal code PTMs for code embedding in their vulnerability detection tasks.
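The metric-then-recommend pipeline described in the abstract can be sketched as follows. This is only an illustration of the shape of the approach (embedding matrix → statistics/norm/distribution features → Random Forest); the metric names, the synthetic embeddings, and the toy labels below are assumptions, not the paper's actual thirteen metrics or recommendation dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def embedding_metrics(emb):
    """Illustrative metrics over an (n_samples, dim) embedding matrix,
    grouped by the paper's three dimensions. These six example metrics
    are assumptions, not the paper's exact thirteen."""
    flat = emb.ravel()
    hist, _ = np.histogram(flat, bins=10, density=True)
    hist = hist / hist.sum()
    entropy = -np.sum(hist * np.log(hist + 1e-12))
    return np.array([
        flat.mean(), flat.std(), np.median(flat),  # statistics
        np.linalg.norm(emb, axis=1).mean(),        # norm: mean L2 row norm
        np.abs(emb).sum(axis=1).mean(),            # norm: mean L1 row norm
        entropy,                                   # distribution
    ])

rng = np.random.default_rng(0)
# Hypothetical recommendation dataset: each row holds the metrics of one
# PTM's embeddings on one task; the label marks whether that PTM was optimal.
X = np.stack([embedding_metrics(rng.normal(scale=s, size=(64, 32)))
              for s in rng.uniform(0.5, 2.0, size=200)])
y = (X[:, 1] > 1.2).astype(int)  # toy label, for illustration only

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```

In the paper's setting, the classifier would be trained once on the recommendation dataset and then queried with the metrics of a new task's candidate embeddings.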

Similar Papers
  • Research Article
  • Cited by 10
  • 10.1016/j.infsof.2024.107581
A dual graph neural networks model using sequence embedding as graph nodes for vulnerability detection
  • Sep 7, 2024
  • Information and Software Technology
  • Miaogui Ling + 4 more


  • Research Article
  • 10.1038/s41598-026-36196-9
Long-range context modeling for software vulnerability detection using an XLNet-based approach
  • Jan 16, 2026
  • Scientific Reports
  • Yinhu Zhao + 2 more

Software vulnerability detection is a critical area of research in cybersecurity. Recently, various Language Model (LM)-based approaches have shown strong potential in this domain. However, most existing methods rely on Transformer architectures that, while powerful, struggle to capture very long-range code dependencies essential for identifying subtle vulnerabilities. To address this limitation, we introduce XLNetVD, an XLNet-based function-level Vulnerability Detection framework leveraging a bidirectional Transformer-XL model for extended context modeling. XLNet effectively captures long code sequences encompassing data flow, control flow, and variable dependencies that are key factors in vulnerability identification. We benchmark XLNet against six mainstream contextual embedding models and three non-contextual embedding models to evaluate its representation capability for vulnerability detection. Experimental results show that XLNet surpasses CodeBERT and GPT-2, achieving the best F1-score of 68%. Furthermore, by applying the Low-Rank Adaptation (LoRA) fine-tuning technique, we demonstrate that XLNet-LoRA achieves the best trade-off between performance and efficiency among LoRA-enhanced LMs. We further integrate XLNet into an end-to-end framework, XLNetVD, and conduct extensive evaluations on two datasets: a real-world dataset with a highly imbalanced vulnerable-to-non-vulnerable ratio of 1:65, and the SARD dataset, which contains balanced, synthetic samples. Results confirm that XLNetVD consistently delivers competitive performance across both real-world and synthetic datasets, establishing it as one of the state-of-the-art vulnerability detection solutions.
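The 1:65 vulnerable-to-non-vulnerable ratio mentioned in this abstract is why F1, not accuracy, is the headline metric. A minimal illustration with synthetic labels (the data here is made up, not drawn from the paper's dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
# Synthetic labels at roughly the 1:65 vulnerable/non-vulnerable ratio.
y_true = (rng.random(6600) < 1 / 66).astype(int)
y_all_clean = np.zeros_like(y_true)  # degenerate "always safe" predictor

# Accuracy looks excellent even though no vulnerability is ever found...
print(accuracy_score(y_true, y_all_clean))
# ...while F1 on the vulnerable class exposes the failure.
print(f1_score(y_true, y_all_clean, zero_division=0))  # 0.0
```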

  • Research Article
  • Cited by 7
  • 10.1002/nem.2198
Intelligent detection of vulnerable functions in software through neural embedding‐based code analysis
  • Mar 14, 2022
  • International Journal of Network Management
  • Peng Zeng + 3 more

Software vulnerability is a fundamental problem in cybersecurity, which poses severe threats to the secure operation of devices and systems. In this paper, we propose a new vulnerability detection framework employing advanced neural embedding. For example, CodeBERT is a large-scale pre-trained embedding model for natural language and programming language. It achieves state-of-the-art performance on various natural language processing and code analysis tasks, demonstrating improved generalization ability compared with conventional models. The proposed framework encapsulates CodeBERT as a code representation generator and combines it with transfer learning to conduct cross-project vulnerability detection. Considering the lack of code embedding models for C source code, we extract knowledge from C source code to fine-tune the pre-trained embedding model, so as to better facilitate the detection of function-level vulnerabilities in C open-source projects. To address the severe data imbalance issue in real-world scenarios, we introduce a code augmentation idea and use a large amount of synthetic vulnerability data to further improve the robustness of the detection method. Experimental results show that the proposed vulnerability detection framework achieves better performance than existing methods.
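The rebalancing-with-synthetic-data idea in this abstract can be sketched as follows. The Gaussian "embeddings", the jitter-based augmentation, and the logistic-regression classifier are all stand-in assumptions; the paper's actual pipeline fine-tunes CodeBERT on real and synthetic C code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
# Toy stand-ins for code embeddings: majority = non-vulnerable,
# minority = vulnerable (shifted mean).
X_maj = rng.normal(0.0, 1.0, size=(950, 8))
X_min = rng.normal(1.0, 1.0, size=(50, 8))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 950 + [1] * 50)

# "Augmentation" in the spirit of the abstract: add synthetic minority
# samples (here simply jittered copies) to rebalance the training set.
X_syn = X_min[rng.integers(0, 50, size=400)] + rng.normal(0, 0.1, (400, 8))
X_aug = np.vstack([X, X_syn])
y_aug = np.concatenate([y, np.ones(400, dtype=int)])

f1_plain = f1_score(y, LogisticRegression(max_iter=1000).fit(X, y).predict(X))
f1_aug = f1_score(y, LogisticRegression(max_iter=1000).fit(X_aug, y_aug).predict(X))
print(f1_plain, f1_aug)
```

Rebalancing typically shifts the decision boundary toward the majority class, trading some precision for recall on the rare vulnerable class.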

  • Conference Article
  • Cited by 1
  • 10.1109/gcce53005.2021.9621783
Feature Extraction Method for Cross-Architecture Binary Vulnerability Detection
  • Oct 12, 2021
  • Ziyang Li + 2 more

Vulnerability detection identifies defects in various commercial software. Because most vulnerability detection methods are based on source code, they are not useful when the source code is unavailable. In this paper, we propose a binary vulnerability detection method, implemented in a tool named BVD, that extracts binary features with the help of an intermediate language and then detects vulnerabilities using an embedding model. Sufficiently robust features allow binaries compiled for different architectures to be compared; consequently, the similarity evaluation provides more accurate results.
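A common way to compare such binary feature vectors is cosine similarity; the abstract does not state which measure BVD uses, so the sketch below is a generic illustration with hypothetical feature vectors for the same function compiled for two architectures.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors; values near 1.0
    suggest similar binaries under this (simplified) measure."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical features: robust, architecture-neutral features should
# stay close across compilation targets despite small perturbations.
f_x86 = np.array([4.0, 2.0, 7.0, 1.0, 3.0])
f_arm = f_x86 + np.array([0.2, -0.1, 0.3, 0.0, -0.2])
f_other = np.array([9.0, 0.5, 1.0, 6.0, 0.1])

print(cosine_similarity(f_x86, f_arm))    # close to 1
print(cosine_similarity(f_x86, f_other))  # noticeably lower
```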

  • Research Article
  • Cited by 6
  • 10.1002/stvr.1867
Tensor‐based gated graph neural network for automatic vulnerability detection in source code
  • Nov 27, 2023
  • Software Testing, Verification and Reliability
  • Jia Yang + 2 more

The rapid expansion of smart devices leads to an increasing demand for vulnerability detection in the cybersecurity field. Writing secure source code is crucial to protecting applications and software. Recent vulnerability detection methods mainly use machine learning and deep learning. However, some challenges remain: how to learn accurate source code semantic embeddings at the function level, how to effectively perform vulnerability detection using the learned semantic embeddings, and how to solve the overfitting problem of learning-based models. In this paper, we treat code as graphs with node features and propose a tensor-based gated graph neural network, TensorGNN, to produce code embeddings for function-level vulnerability detection. First, we propose a high-dimensional tensor for combining different code graph representations. Second, inspired by work on tensor technology, we propose the TensorGNN model to produce accurate code representations using the graph tensor. We evaluate our model on seven large C and C++ open-source code corpora (e.g., SARD&NVD, Debian, SATE IV, FFmpeg, libpng&LibTiff, Wireshark, and GitHub datasets), which contain 13 types of vulnerabilities. Our TensorGNN model improves on existing state-of-the-art works by 10%–30% on average in terms of vulnerability detection accuracy and F1, while needing less training time and fewer model parameters. Specifically, compared with other existing works, our model uses 25–47 times fewer parameters and 3–10 times less training time. The evaluation results show that TensorGNN performs better while using fewer training parameters and less training time.
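The idea of combining several code-graph views into one tensor can be illustrated as follows; the node count, edge sets, and view names below are invented for illustration and are not drawn from the paper.

```python
import numpy as np

# Toy function with 4 statements (nodes). Each slice of the tensor is the
# adjacency matrix of one code-graph view (view names are illustrative).
n = 4
ast_edges = [(0, 1), (0, 2), (2, 3)]  # abstract syntax tree
cfg_edges = [(0, 1), (1, 2), (2, 3)]  # control flow
dfg_edges = [(0, 3), (1, 3)]          # data flow

def adjacency(edges, n):
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = 1.0
    return A

# Stack the views into one (views, n, n) graph tensor, in the spirit of
# TensorGNN's high-dimensional tensor over code graph representations.
tensor = np.stack([adjacency(e, n) for e in (ast_edges, cfg_edges, dfg_edges)])
print(tensor.shape)  # (3, 4, 4)
```

A GNN can then consume this tensor jointly instead of processing each view in isolation, which is the source of the information gain the abstract describes.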

  • Research Article
  • Cited by 29
  • 10.1155/2022/5203217
Deep Neural Embedding for Software Vulnerability Discovery: Comparison and Optimization
  • Jan 18, 2022
  • Security and Communication Networks
  • Xue Yuan + 3 more

Due to the multitudinous vulnerabilities in sophisticated software programs, the detection performance of existing approaches requires further improvement. Multiple vulnerability detection approaches have been proposed to aid code inspection, among them a line of approaches that apply deep learning (DL) techniques and achieve promising results. This paper utilizes CodeBERT, a deep contextualized model, as an embedding solution to facilitate the detection of vulnerabilities in C open-source projects. Applying CodeBERT to code analysis allows the rich and latent patterns within software code to be revealed, with the potential to facilitate various downstream tasks such as software vulnerability detection. CodeBERT inherits the architecture of BERT, providing a stacked Transformer encoder in a bidirectional structure, which facilitates the learning of vulnerable code patterns that require long-range dependency analysis. Additionally, the multi-head attention mechanism of the Transformer enables multiple key variables of a data flow to be focused on, which is crucial for analyzing and tracing potentially vulnerable data flows, eventually resulting in optimized detection performance. To evaluate the effectiveness of the proposed CodeBERT-based embedding solution, four mainstream embedding methods are compared for generating software code embeddings, including Word2Vec, GloVe, and FastText. Experimental results show that CodeBERT-based embedding outperforms the other embedding models on downstream vulnerability detection tasks. To further boost performance, we propose to include synthetic vulnerable functions and perform fine-tuning on synthetic and real-world data to facilitate the model's learning of C-related vulnerable code patterns. Meanwhile, we explore the suitable configuration of CodeBERT. The evaluation results show that the model with the new parameters outperforms some state-of-the-art detection methods on our dataset.

  • Research Article
  • Cited by 72
  • 10.1016/j.jss.2023.111623
CSGVD: A deep learning approach combining sequence and graph embedding for source code vulnerability detection
  • Jan 31, 2023
  • Journal of Systems and Software
  • Wei Tang + 4 more


  • Research Article
  • Cited by 5
  • 10.1145/3721481
VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity
  • Oct 4, 2025
  • ACM Transactions on Software Engineering and Methodology
  • S Venkatakeerthy + 7 more

Binary similarity involves determining whether two binary programs exhibit similar functionality, with applications in vulnerability detection, malware analysis, and copyright detection. However, variations in compiler settings, target architectures, and deliberate code obfuscations significantly complicate the similarity measurement by effectively altering the syntax, semantics, and structure of the underlying binary. To address these challenges, we propose VexIR2Vec, a robust, architecture-neutral approach based on VEX-IR to solve binary similarity tasks. VexIR2Vec consists of three key components: a peephole extractor, a normalization engine (VexINE), and an embedding model (VexNet). Building program embeddings starts with the extraction of sequences of basic blocks, or peepholes, from control-flow graphs via random walks, capturing structural information. The generated peepholes are then normalized using VexINE, which applies compiler-inspired transformations to reduce architectural and compiler-induced variations. Embeddings of peepholes are generated using representation learning techniques, avoiding Out-of-Vocabulary (OOV) issues. These embeddings are then fine-tuned with VexNet, a feed-forward Siamese network that maps functions into a high-dimensional space for diffing and searching tasks in an application-independent manner. We evaluate VexIR2Vec against five baselines (BinDiff, DeepBinDiff, SAFE, BinFinder, and histograms of opcodes) on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. The experiments span four adversarial settings (cross-optimization, cross-compilation, cross-architecture, and obfuscations) that are typically exploited by malware and vulnerabilities. In diffing experiments, VexIR2Vec outperforms the nearest baseline in these four scenarios by 40%, 18%, 21%, and 60%, respectively.
In the searching experiment, VexIR2Vec achieves a mean average precision of 0.76, outperforming the nearest baseline by 46%. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open-source tools. VexIR2Vec is approximately 3.1–3.5× faster than the closest baselines and orders of magnitude faster than other tools.
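The peephole-extraction step (random walks over a control-flow graph) can be sketched as follows. The toy CFG, walk count, and length cap are assumptions; real VexIR2Vec operates on VEX-IR basic blocks rather than integer node ids.

```python
import random

# Toy control-flow graph: basic-block id -> list of successor ids.
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}

def extract_peepholes(cfg, entry, walks=5, max_len=4, seed=0):
    """Sample bounded-length random walks ("peepholes") from a CFG,
    loosely following the peephole-extraction step described above."""
    rng = random.Random(seed)
    paths = []
    for _ in range(walks):
        node, path = entry, [entry]
        while cfg[node] and len(path) < max_len:
            node = rng.choice(cfg[node])
            path.append(node)
        paths.append(path)
    return paths

for p in extract_peepholes(cfg, entry=0):
    print(p)
```

Each sampled walk would then be normalized and embedded; the random walks let structural context enter the embedding without encoding the full graph.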

  • Research Article
  • Cited by 1
  • 10.4108/eetsis.5056
E-GVD: Efficient Software Vulnerability Detection Techniques Based on Graph Neural Network
  • Mar 21, 2024
  • ICST Transactions on Scalable Information Systems
  • Haiye Wang + 2 more

INTRODUCTION: Vulnerability detection is crucial for preventing severe security incidents like hacker attacks, data breaches, and network paralysis. Traditional methods, however, face challenges such as low efficiency and insufficient detail in identifying code vulnerabilities. OBJECTIVES: This paper introduces E-GVD, an advanced method for source code vulnerability detection, aiming to address the limitations of existing methods. The objective is to enhance the accuracy of function-level vulnerability detection and provide detailed, understandable insights into the vulnerabilities. METHODS: E-GVD combines Graph Neural Networks (GNNs), which are adept at handling graph-structured data, with residual connections and advanced Programming Language (PL) pre-trained models. RESULTS: Experiments conducted on the real-world vulnerability dataset CodeXGLUE show that E-GVD significantly outperforms existing baseline methods in detecting vulnerabilities. It achieves a maximum accuracy gain of 4.98%, indicating its effectiveness over traditional methods. CONCLUSION: E-GVD not only improves the accuracy of vulnerability detection but also contributes by providing fine-grained explanations. These explanations are made possible through an interpretable Machine Learning (ML) model, which aids developers in quickly and efficiently repairing vulnerabilities, thereby enhancing overall software security.

  • Research Article
  • Cited by 1
  • 10.1002/cpe.8292
Vulnerability detection based on transformer and high‐quality number embedding
  • Sep 23, 2024
  • Concurrency and Computation: Practice and Experience
  • Yang Cao + 2 more

Software vulnerability detection is an important problem in software security. In recent years, deep learning has offered a novel approach to source code vulnerability detection. Due to the similarities between programming languages and natural languages, many natural language processing techniques have been applied to vulnerability detection tasks. However, specific problems within vulnerability detection, such as buffer overflow, involve numerical reasoning. For these problems, the model needs not only to consider long dependencies and the multiple relationships between code statements, but also to capture the magnitude of numerical literals in the program through high-quality number embeddings. Therefore, we propose VDTransformer, a Transformer-based method that improves source code embedding by integrating word and number embeddings. Furthermore, we employ Transformer encoders to construct a hierarchical neural network that extracts semantic features from the code and enables line-level vulnerability detection. To evaluate the effectiveness of the method, we construct a dataset named OverflowGen based on templates for buffer overflow. Experimental comparisons on OverflowGen with a well-known static vulnerability detector and two state-of-the-art deep learning-based methods confirm the effectiveness of VDTransformer and the importance of high-quality number embeddings in vulnerability detection tasks involving numerical features.
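A magnitude-aware number embedding of the kind motivated above might look like this toy sketch. The chosen features (sign, log magnitude, leading digit) are illustrative assumptions; VDTransformer's actual embedding scheme is not detailed in the abstract.

```python
import math

def number_embedding(x):
    """Toy magnitude-aware embedding of a numeric literal: sign,
    log-scaled magnitude, and leading digit. A simplification, not
    the paper's actual scheme."""
    sign = 0.0 if x == 0 else math.copysign(1.0, x)
    mag = math.log10(abs(x)) if x != 0 else 0.0
    lead = int(str(abs(int(x)))[0]) if x != 0 and abs(x) >= 1 else 0
    return [sign, mag, float(lead)]

# Unlike plain token embeddings, nearby magnitudes get nearby vectors,
# which matters for reasoning about buffer sizes and offsets:
print(number_embedding(255))    # [1.0, ~2.41, 2.0]
print(number_embedding(256))    # [1.0, ~2.41, 2.0]
print(number_embedding(65536))  # [1.0, ~4.82, 6.0]
```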

  • Research Article
  • Cited by 9
  • 10.1016/j.cose.2023.103508
BinAIV: Semantic-enhanced vulnerability detection for Linux x86 binaries
  • Sep 27, 2023
  • Computers & Security
  • Yeming Gu + 2 more


  • Research Article
  • Cited by 1
  • 10.3390/electronics11152446
Position Distribution Matters: A Graph-Based Binary Function Similarity Analysis Method
  • Aug 5, 2022
  • Electronics
  • Zulie Pan + 3 more

Binary function similarity analysis evaluates the similarity of functions at the binary level to aid program analysis, and is popular in many fields, such as vulnerability detection, binary clone detection, and malware detection. Graph-based methods perform relatively well in practice, but they currently cannot capture similarity in terms of graph position distribution and lose information during graph processing, which leads to low accuracy. This paper presents PDM, a graph-based method that increases the accuracy of binary function similarity detection by considering position distribution information. First, an enhanced Attributed Control Flow Graph (ACFG+) of a function is constructed from its control flow graph, assisted by instruction embedding and data flow analysis. Then, ACFG+ is fed to a graph embedding model using the CapsGNN and DiffPool mechanisms, to enrich the information retained in graph processing by considering the position distribution. The model outputs the corresponding embedding vector, and the similarity between different function embeddings is computed using cosine distance. Similarity detection is completed in a Siamese network. Experiments show that, compared with VulSeeker and PalmTree+VulSeeker, PDM consistently achieves three and two times higher accuracy, respectively, in binary function similarity detection, and detects up to six times more results in vulnerability detection. Compared with some state-of-the-art tools, PDM has Top-5, Top-10, and Top-20 ranking results comparable to BinDiff, Diaphora, and Kam1n0, and significant advantages in the Top-50, Top-100, and Top-200 detection results.

  • Conference Article
  • Cited by 18
  • 10.1145/3544902.3546248
Heterogeneous Graph Neural Networks for Software Effort Estimation
  • Sep 19, 2022
  • Hung Phan + 1 more

Software effort can be measured in story points [35]. Current approaches for automatically estimating story points focus on applying pre-trained embedding models and deep learning for text regression, which requires expensive embedding models. We propose HeteroSP, a tool for estimating story points from the textual input of Agile software project issues. We select GPT2SP [12] and Deep-SE [8] as the baselines for comparison. First, from an analysis of the story point dataset [8], we conclude that software issues are actually a mixture of natural language sentences and quoted code snippets, and suffer from a large vocabulary. Second, we provide a module to normalize the input text, including the words and code tokens of the software issues. Third, we design an algorithm to convert an input software issue to a graph with different types of nodes and edges. Fourth, we construct a heterogeneous graph neural network model, with fastText [6] support for constructing the initial node embeddings, to learn and predict the story points of new issues. We compare against our baselines over three estimation scenarios: within project, cross-project within the repository, and cross-project cross-repository. We achieve an average Mean Absolute Error (MAE) of 2.38, 2.61, and 2.63 for the three scenarios. We outperform GPT2SP in 2/3 of the scenarios and outperform Deep-SE in the most challenging scenario with significantly less running time. We also compare our approach with different homogeneous graph neural network models, and the results show that the heterogeneous graph neural network model outperforms the homogeneous models in story point estimation. For time performance, we achieve about 570 seconds across all three processes: node embedding initialization, model construction, and story point estimation.

  • Research Article
  • Cited by 70
  • 10.1109/tse.2023.3286586
Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code
  • Aug 1, 2023
  • IEEE Transactions on Software Engineering
  • Junwei Zhang + 4 more

Vulnerability detection is essential to protect software systems. Various approaches based on deep learning have been proposed to learn the pattern of vulnerabilities and identify them. Although these approaches have shown vast potential in this task, they still suffer from the following issues: (1) It is difficult for them to distinguish vulnerability-related information from a large amount of irrelevant information, which hinders their effectiveness in capturing vulnerability features. (2) They are less effective in handling long code because many neural models would limit the input length, which hinders their ability to represent the long vulnerable code snippets. To mitigate these two issues, in this work, we proposed to decompose the syntax-based Control Flow Graph (CFG) of the code snippet into multiple execution paths to detect the vulnerability. Specifically, given a code snippet, we first build its CFG based on its Abstract Syntax Tree (AST), refer to such CFG as syntax-based CFG, and decompose the CFG into multiple paths from an entry node to its exit node. Next, we adopt a pre-trained code model and a convolutional neural network to learn the path representations with intra- and inter-path attention. The feature vectors of the paths are combined as the representation of the code snippet and fed into the classifier to detect the vulnerability. Decomposing the code snippet into multiple paths can filter out some redundant information unrelated to the vulnerability and help the model focus on the vulnerability features. Besides, since the decomposed paths are usually shorter than the code snippet, the information located in the tail of the long code is more likely to be processed and learned. To evaluate the effectiveness of our model, we build a dataset with over 231k code snippets, in which there are 24k vulnerabilities. 
Experimental results demonstrate that the proposed approach outperforms state-of-the-art baselines by at least 22.30%, 42.92%, and 32.58% in terms of Precision, Recall, and F1-Score, respectively. Our further analysis investigates the reason for the proposed approach's superiority.
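The CFG-decomposition step described above (entry-to-exit execution paths) can be sketched with a simple path enumeration. The toy CFG below is invented, and loop handling is reduced to simple-path pruning; the paper's approach additionally embeds each path with a pre-trained code model.

```python
def cfg_paths(cfg, entry, exit_node):
    """Enumerate all simple paths from entry to exit in a CFG,
    mirroring the decomposition step described above (toy version)."""
    paths, stack = [], [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        if node == exit_node:
            paths.append(path)
            continue
        for succ in cfg.get(node, []):
            if succ not in path:  # avoid revisiting nodes (loops)
                stack.append((succ, path + [succ]))
    return paths

# Diamond-shaped CFG: a branch at node 0 that re-joins at node 3.
cfg = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
for p in cfg_paths(cfg, 0, 4):
    print(p)
# Two execution paths: [0, 1, 3, 4] and [0, 2, 3, 4]
```

Each enumerated path is shorter than the whole function, which is what lets information near the tail of long code survive the model's input-length limit.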

  • Supplementary Content
  • Cited by 7
  • 10.48550/arxiv.2306.01754
Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?
  • May 22, 2023
  • arXiv (Cornell University)
  • Alvin T S Chan + 7 more

Software vulnerabilities impose significant costs on enterprises. Despite extensive efforts in the research and development of software vulnerability detection methods, uncaught vulnerabilities continue to put software owners and users at risk. Many current vulnerability detection methods require that code snippets compile and build before attempting detection. This, unfortunately, introduces a long latency between the time a vulnerability is injected and the time it is removed, which can substantially increase the cost of fixing a vulnerability. We recognize that current advances in machine learning can be used to detect vulnerable code patterns in syntactically incomplete code snippets as the developer is writing the code, at EditTime. In this paper we present a practical system that leverages deep learning on a large-scale dataset of vulnerable code patterns to learn complex manifestations of more than 250 vulnerability types and detect vulnerable code patterns at EditTime. We discuss zero-shot, few-shot, and fine-tuning approaches on state-of-the-art pre-trained Large Language Models (LLMs). We show that, in comparison with state-of-the-art vulnerability detection models, our approach improves the state of the art by 10%. We also evaluate our approach on detecting vulnerabilities in code auto-generated by code LLMs. Evaluation on a benchmark of high-risk code scenarios shows a reduction of up to 90% in vulnerabilities.
