Analysis of software code preprocessing methods to improve the effectiveness of using large language models in vulnerability detection tasks

Abstract

As software systems grow in scale and complexity, the need for intelligent methods of vulnerability detection increases. One such method involves the use of large language models trained on source code, which are capable of analyzing and classifying vulnerable code segments at early stages of development. The effectiveness of these models depends on how the code is represented and how the input data is prepared. Preprocessing methods can significantly impact the accuracy and robustness of the model. The purpose of the study: to analyze the impact of various code preprocessing methods on the accuracy and robustness of large language models (CodeBERT, GraphCodeBERT, UniXcoder) in vulnerability detection tasks. The analysis is conducted using source code changes extracted from commits associated with vulnerabilities documented in the CVE database. The research methodology is an experimental analysis based on evaluation of the effectiveness and robustness of CodeBERT, GraphCodeBERT, and UniXcoder in the task of vulnerability classification. The models are assessed based on their performance using Accuracy and F1 score metrics. Research results: estimates of the effectiveness of different code preprocessing methods when applying large language models to vulnerability classification tasks.
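The abstract names Accuracy and F1 score as the evaluation metrics. As a minimal sketch of how those two scores are computed for a binary vulnerable/non-vulnerable classifier, the following plain-Python function may help. The function name `accuracy_and_f1`, the label convention (1 = vulnerable, 0 = not vulnerable), and the sample labels are assumptions made for illustration; the study's actual evaluation code is not shown in the source.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, where 1 is assumed to mean 'vulnerable'."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)

    # F1 is the harmonic mean of precision and recall, which simplifies to
    # 2*TP / (2*TP + FP + FN). It is defined as 0.0 here when there are no
    # positive labels and no positive predictions.
    denominator = 2 * tp + fp + fn
    f1 = (2 * tp / denominator) if denominator else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Hypothetical labels for six code changes (illustrative only).
    truth = [1, 0, 1, 1, 0, 0]
    preds = [1, 0, 0, 1, 1, 0]
    acc, f1 = accuracy_and_f1(truth, preds)
    print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

Under this setup, each preprocessing variant would yield its own set of predictions, and the variants could then be compared by the resulting pair of scores.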

Similar Papers
  • Conference Article
  • Citations: 52
  • 10.1109/icsme.2014.46
Mining Co-change Information to Understand When Build Changes Are Necessary
  • Sep 1, 2014
  • Shane Mcintosh + 3 more

As a software project ages, its source code is modified to add new features, restructure existing ones, and fix defects. These source code changes often induce changes in the build system, i.e., the system that specifies how source code is translated into deliverables. However, since developers are often not familiar with the complex and occasionally archaic technologies used to specify build systems, they may not be able to identify when their source code changes require accompanying build system changes. This can cause build breakages that slow development progress and impact other developers, testers, or even users. In this paper, we mine the source and test code changes that required accompanying build changes in order to better understand this co-change relationship. We build random forest classifiers using language-agnostic and language-specific code change characteristics to explain when code-accompanying build changes are necessary based on historical trends. Case studies of the Mozilla C++ system, the Lucene and Eclipse open source Java systems, and the IBM Jazz proprietary Java system indicate that our classifiers can accurately explain when build co-changes are necessary with an AUC of 0.60-0.88. Unsurprisingly, our highly accurate C++ classifiers (AUC of 0.88) derive much of their explanatory power from indicators of structural change (e.g., was a new source file added?). On the other hand, our Java classifiers are less accurate (AUC of 0.60-0.78) because roughly 75% of Java build co-changes do not coincide with changes to the structure of a system, but rather are instigated by concerns related to release engineering, quality assurance, and general build maintenance.

  • Research Article
  • Citations: 3
  • 10.1145/3748263
Leveraging LLMs for Memory Forensics: A Comparative Analysis of Malware Detection
  • Dec 13, 2025
  • Digital Threats: Research and Practice
  • Jan-Hendrik Lang + 1 more

Memory forensics plays an important role in modern digital investigations in terms of detecting stealthy, fileless malware, and advanced persistent threats. Moreover, large language models (LLMs) have shown promise in different cybersecurity tasks. In this article, we integrate intelligence based on LLM into memory forensic workflows and evaluate multiple LLMs, including OpenAI GPT4o, OpenAI o1, Gemini 2.0 Flash, Gemini 2.0 Flash-Thinking, Grok 3, and Grok 3 with thinking mode enabled. We collect memory dumps encompassing a variety of attack scenarios such as process injection (using MSFVenom), a PowerShell Empire-based attack, and real-world malware such as Quasar RAT, MassLogger, DarkCloud, LockBit, and LockiBot. Our evaluation includes accuracy, precision, recall, and F1 score metrics and statistical analyses (ANOVA and correlation tests). The findings show that the reasoning-based (“thinking”) LLM models outperform standard models. OpenAI o1 and Gemini Flash-Thinking excel at decoding base64 obfuscated payloads, while Grok 3 leads in detecting network anomalies. All LLM-based approaches suffer from high false-positive (FP) rates, reflected in low precision (often < 20%). This tendency appears to stem from the precautionary principle in AI safety orientation, leading to models erring on the side of caution and occasionally hallucinating plausible threats when faced with ambiguous or incomplete evidence. The LockBit indicator of compromise (IoC) could not be detected with the LLM because the IoCs lie beyond the Volatility3 modules used. Due to this reason and the limited size of the context window from the LLM, it is essential to select appropriate data. Despite limitations, the study demonstrates the practical viability of integrating LLM-driven intelligence into a forensic system. The study lays the foundation for hybrid forensic systems combining symbolic analysis, domain-specific heuristics, and LLM-driven intelligence.

  • Research Article
  • Citations: 4
  • 10.3897/jucs.134739
An Empirical Evaluation of Large Language Models in Static Code Analysis for PHP Vulnerability Detection
  • Sep 14, 2024
  • JUCS - Journal of Universal Computer Science
  • Orçun Çetin + 3 more

Web services play an important role in our daily lives. They are used in a wide range of activities, from online banking and shopping to education, entertainment and social interactions. Therefore, it is essential to ensure that they are kept as secure as possible. However – as is the case with any complex software system – creating a sophisticated software free from any security vulnerabilities is a very challenging task. One method to enhance software security is by employing static code analysis. This technique can be used to identify potential vulnerabilities in the source code before they are exploited by bad actors. This approach has been instrumental in tackling many vulnerabilities, but it is not without limitations. Recent research suggests that static code analysis can benefit from the use of large language models (LLMs). This is a promising line of research, but there are still very few and quite limited studies in the literature on the effectiveness of various LLMs at detecting vulnerabilities in source code. This is the research gap that we aim to address in this work. Our study examined five notable LLM chatbot models: ChatGPT 4, ChatGPT 3.5, Claude, Bard/Gemini1, and Llama-2, assessing their abilities to identify 104 known vulnerabilities spanning the Top-10 categories defined by the Open Worldwide Application Security Project (OWASP). Moreover, we evaluated issues related to these LLMs’ false-positive rates using 97 patched code samples. We specifically focused on PHP vulnerabilities, given its prevalence in web applications. We found that ChatGPT-4 has the highest vulnerability detection rate, with over 61.5% of vulnerabilities found, followed by ChatGPT-3.5 at 50%. Bard has the highest rate of vulnerabilities missed, at 53.8%, and the lowest detection rate, at 13.4%. 
For all models, there is a significant percentage of vulnerabilities that were classified as partially found, indicating a level of uncertainty or incomplete detection across all tested LLMs. Moreover, we found that ChatGPT-4 and ChatGPT-3.5 are consistently more effective across most categories, compared to other models. Bard and Llama-2 display limited effectiveness in detecting vulnerabilities across the majority of categories listed. Surprisingly, our findings reveal high false positive rates across all LLMs. Even the model demonstrating the best performance (ChatGPT-4) notched a false positive rate of nearly 63%, while several models glaringly under-performed, hitting startlingly bad false positive rates of over 90%. Finally, simultaneously deploying multiple LLMs for static analysis resulted in only a marginal enhancement in the rates of vulnerability detection. We believe these results are generalizable to most other programming languages, and hence far from being limited to PHP only.

  • Research Article
  • 10.1109/access.2026.3676577
Evaluating Retrieval-Augmented Generation for LLM-Based Vulnerability Detection: An Empirical Study on Real-World Java Vulnerabilities
  • Jan 1, 2026
  • IEEE Access
  • Gábor Antal + 3 more

Software vulnerabilities are growing as fast as the digital platforms and applications that contain them. Thus, the timely and effective detection of software vulnerabilities is becoming increasingly important. Another emerging trend is the widespread use of large language models (LLMs) for software engineering tasks such as source and test code generation, refactoring, and debugging. Finding security-related issues is crucial, but it is a very resource-intensive task. This work offers an empirical benchmark of the impact of retrieval-augmented prompts on LLM-based vulnerability detection, rather than proposing a new detection method. In this work, we explore the source code vulnerability detection capabilities of large language models (LLMs) in a realistic, language-specific setting. We investigate to what extent retrieval-augmented generation (RAG) improves their performance when applied to real-world Java vulnerabilities. We evaluated seven widely used LLMs (GPT-5, GPT-4o, Claude Sonnet 4, Gemini 2.5, LLaMa 4, DeepSeek 3.1, and Grok Code Fast) on Java source code. We used the Vul4J dataset, a manually curated benchmark of real-world vulnerabilities, comparing a basic zero-shot approach against a RAG-enhanced approach that provides up to three vulnerable and three secure code examples based on their semantic similarity to the code under analysis. To mitigate the non-deterministic nature of LLMs, each experiment was repeated three times and the results were averaged. We empirically found that RAG helps LLMs reduce false positives without significantly affecting recall, leading to improved MCC scores under specific conditions. However, retrieval was often sparse, and false positives remained common, limiting practical impact. Our study serves as an empirical evaluation rather than a deployment-ready methodology.

  • Research Article
  • Citations: 3
  • 10.1145/3771923
M2CVD: Enhancing Vulnerability Understanding through Multi-Model Collaboration for Code Vulnerability Detection
  • Oct 16, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Ziliang Wang + 6 more

Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization; conversely, fine-tuned models such as CodeBERT are easy to fine-tune, but it is often difficult to learn vulnerability semantics from complex code languages. To address these challenges, this paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) that leverages the strong capability of analyzing vulnerability semantics from LLMs to improve the detection accuracy of fine-tuned models. M2CVD employs a novel collaborative process: first enhancing the quality of vulnerability description produced by LLMs through the understanding of project code by fine-tuned models, and then using these improved vulnerability descriptions to boost the detection accuracy of fine-tuned models. M2CVD includes three main phases: 1) Initial Vulnerability Detection: The initial vulnerability detection is conducted by fine-tuning a detection model (e.g., CodeBERT) and interacting with an LLM (e.g., ChatGPT) respectively. The vulnerability description will be generated by the LLM when the code is detected vulnerable by the LLM. 2) Vulnerability Description Refinement: By informing the LLM of the vulnerability assessment results of the detection model, we refine the vulnerability description by interacting with the LLM. Such refinement can enhance LLM’s vulnerability understanding in specific projects, effectively bridging the previously mentioned alignment gap; 3) Integrated Vulnerability Detection: M2CVD integrates code fragments and the refined vulnerability descriptions inferred to form synthetic data. Then, the synthetic data is used to fine-tune a validation model, optimize the defect feature learning efficiency of the model, and improve the detection accuracy. We demonstrated M2CVD’s effectiveness on two real-world datasets, where M2CVD significantly outperformed the baseline. 
In addition, we demonstrate that the M2CVD collaborative method can extend to other different LLMs and fine-tuned models to improve their accuracy in vulnerability detection tasks.

  • Research Article
  • Citations: 32
  • 10.3390/fi15100326
A New Approach to Web Application Security: Utilizing GPT Language Models for Source Code Inspection
  • Sep 28, 2023
  • Future Internet
  • Zoltán Szabó + 1 more

Due to the proliferation of large language models (LLMs) and their widespread use in applications such as ChatGPT, there has been a significant increase in interest in AI over the past year. Multiple researchers have raised the question: how will AI be applied and in what areas? Programming, including the generation, interpretation, analysis, and documentation of static program code based on prompts, is one of the most promising fields. With the GPT API, we have explored a new aspect of this: static analysis of the source code of front-end applications at the endpoints of the data path. Our focus was the detection of the CWE-653 vulnerability: inadequately isolated sensitive code segments that could lead to unauthorized access or data leakage. This type of vulnerability detection consists of the detection of code segments dealing with sensitive data and the categorization of the isolation and protection levels of those segments that were previously not feasible without human intervention. However, we believed that the interpretive capabilities of GPT models could be explored to create a set of prompts to detect these cases on a file-by-file basis for the applications under study, and the efficiency of the method could pave the way for additional analysis tasks that were previously unavailable for automation. In the introduction to our paper, we characterize in detail the problem space of vulnerability and weakness detection, the challenges of the domain, and the advances that have been achieved in similarly complex areas using GPT or other LLMs. Then, we present our methodology, which includes our classification of sensitive data and protection levels. This is followed by the process of preprocessing, analyzing, and evaluating static code. 
This was achieved through a series of GPT prompts containing parts of static source code, utilizing few-shot examples and chain-of-thought techniques that detected sensitive code segments and mapped the complex code base into manageable JSON structures. Finally, we present our findings and evaluation of the open source project analysis, comparing the results of the GPT-based pipelines with manual evaluations, highlighting that the field yields a high research value. The results show a vulnerability detection rate for this particular type of model of 88.76%, among others.

  • Research Article
  • Citations: 1
  • 10.1177/20552076251342078
Reference decisions enhance LLM performance, amplified by source disclosure
  • Apr 1, 2025
  • DIGITAL HEALTH
  • Yongxiang Zhang + 4 more

Objective: The rapid integration of large language models (LLMs) has propelled advancements in automated dialog technologies, improving the public's access to healthcare services. Drawing inspiration from the collaborative decision-making practices of medical professionals in complex cases, we investigated whether LLMs could enhance their diagnostic accuracy through interaction. Methods: An experimental study was conducted in China (September–December 2024) to investigate the impact of LLM-generated reference decisions and source disclosure on LLMs’ diagnostic performance. We used a Chinese clinical diagnostic task in a controlled comparative design, where three Chinese LLMs interpreted symptoms and conditions based on patient queries. LLMs’ outcomes were evaluated through accuracy and weighted F1 score metrics, with statistical analysis to determine significance. Results: Analysis of variance on LLMs’ diagnostic accuracy scores demonstrated that incorporating LLM-generated decisions as a reference significantly improved diagnostic outcomes, with source disclosure amplifying this improvement. Conclusion: Our findings underscore the potential of LLM collaboration in healthcare, offering strategies to refine response generation and decision-making across various applications.

  • Research Article
  • Citations: 14
  • 10.1016/j.mlwa.2024.100598
Vulnerability detection using BERT based LLM model with transparency obligation practice towards trustworthy AI
  • Nov 2, 2024
  • Machine Learning with Applications
  • Jean Haurogné + 2 more

  • Supplementary Content
  • Citations: 2
  • 10.5167/uzh-61703
Fine-grained code changes and bugs: Improving bug prediction
  • Jan 1, 2012
  • Zurich Open Repository and Archive (University of Zurich)
  • Emanuel Giger

Software development and, in particular, software maintenance are time consuming and require detailed knowledge of the structure and the past development activities of a software system. Limited resources and time constraints make the situation even more difficult. Therefore, a significant amount of research effort has been dedicated to learning software prediction models that allow project members to allocate and spend the limited resources efficiently on the (most) critical parts of their software system. Prominent examples are bug prediction models and change prediction models: Bug prediction models identify the bug-prone modules of a software system that should be tested with care; change prediction models identify modules that change frequently and in combination with other modules, i.e., they are change coupled. By combining statistical methods, data mining approaches, and machine learning techniques, software prediction models provide a structured and analytical basis to make decisions. Researchers proposed a wide range of approaches to build effective prediction models that take into account multiple aspects of the software development process. They achieved especially good prediction performance, guiding developers towards those parts of their system where a large share of bugs can be expected. For that, they rely on change data provided by version control systems (VCS). However, due to the fact that current VCS track code changes only on file-level and textual basis, most of those approaches suffer from coarse-grained and rather generic change information. More fine-grained change information, for instance, at the level of source code statements, and the type of changes, e.g., whether a method was renamed or a condition expression was changed, are often not taken into account. 
Therefore, investigating the development process and the evolution of software at a fine-grained change level has recently received increasing attention in research. The key contribution of this thesis is to improve software prediction models by using fine-grained source code changes. Those changes are based on the abstract syntax tree structure of source code and allow us to track code changes at the fine-grained level of individual statements. We show with a series of empirical studies using the change history of open-source projects how prediction models can benefit in terms of prediction performance and prediction granularity from the more detailed change information. First, we compare fine-grained source code changes and code churn, i.e., lines modified, for bug prediction. The results with data from the Eclipse platform show that fine-grained source code changes significantly outperform code churn when classifying source files into bug- and not bug-prone, as well as when predicting the number of bugs in source files. Moreover, these results give more insights about the relation of individual types of code changes, e.g., method declaration changes and bugs. For instance, in our dataset method declaration changes exhibit a stronger correlation with the number of bugs than class declaration changes. Second, we leverage fine-grained source code changes to predict bugs at method-level. This is beneficial as files can grow arbitrarily large. Hence, if bugs are predicted at the level of files a developer needs to manually inspect all methods of a file one by one until a particular bug is located. Third, we build models using source code properties, e.g., complexity, to predict whether a source file will be affected by a certain type of code change. 
Predicting the type of changes is of practical interest, for instance, in the context of software testing as different change types require different levels of testing: While for small statement changes local unit-tests are mostly sufficient, API changes, e.g., method declaration changes, might require system-wide integration-tests which are more expensive. Hence, knowing (in advance) which types of changes will most likely occur in a source file can help to better plan and develop tests, and, in case of limited resources, prioritize among different types of testing. Finally, to assist developers in bug triaging we compute prediction models based on the attributes of a bug report that can be used to estimate whether a bug will be fixed fast or whether it will take more time for resolution. The results and findings of this thesis give evidence that fine-grained source code changes can improve software prediction models to provide more accurate results.

  • Research Article
  • 10.1016/j.jbi.2026.105034
A study of large language models for patient information extraction: Model architecture, fine-tuning strategy, and multi-task instruction tuning.
  • Mar 1, 2026
  • Journal of biomedical informatics
  • Cheng Peng + 5 more

  • Conference Article
  • Citations: 10
  • 10.1109/cee-secr.2011.6188468
Mining source code changes from software repositories
  • Oct 1, 2011
  • Crt Gerlec + 3 more

The primary goal of software repositories is to store the source code of software during its development. Developers constantly store small parts (i.e. software modifications) of code into the repository and share those parts with others until the software is finished. However, software repositories store a significant amount of information about software and development processes. With the appropriate tool, source code modifications could be identified. In the article, we will introduce a tool for identifying structural source code changes from software repositories. With this tool, three open source projects were analyzed and different source code changes were identified during their development. We showed that the tool could be used to identify source code changes from software repositories.

  • Research Article
  • Citations: 84
  • 10.1016/j.procs.2020.04.217
A Comparative Study of Static Code Analysis tools for Vulnerability Detection in C/C++ and JAVA Source Code
  • Jan 1, 2020
  • Procedia Computer Science
  • Arvinder Kaur + 1 more

  • Conference Article
  • Citations: 32
  • 10.1145/3637528.3671709
LLM4DyG: Can Large Language Models Solve Spatial-Temporal Problems on Dynamic Graphs?
  • Aug 24, 2024
  • Zeyang Zhang + 5 more

In an era marked by the increasing adoption of Large Language Models (LLMs) for various tasks, there is a growing focus on exploring LLMs' capabilities in handling web data, particularly graph data. Dynamic graphs, which capture temporal network evolution patterns, are ubiquitous in real-world web data. Evaluating LLMs' competence in understanding spatial-temporal information on dynamic graphs is essential for their adoption in web applications, which remains unexplored in the literature. In this paper, we bridge the gap via proposing to evaluate LLMs' spatial-temporal understanding abilities on dynamic graphs, to the best of our knowledge, for the first time. Specifically, we propose the LLM4DyG benchmark, which includes nine specially designed tasks considering the capability evaluation of LLMs from both temporal and spatial dimensions. Then, we conduct extensive experiments to analyze the impacts of different data generators, data statistics, prompting techniques, and LLMs on the model performance. Finally, we propose Disentangled Spatial-Temporal Thoughts (DST2) for LLMs on dynamic graphs to enhance LLMs' spatial-temporal understanding abilities. Our main observations are: 1) LLMs have preliminary spatial-temporal understanding abilities on dynamic graphs, 2) Dynamic graph tasks show increasing difficulties for LLMs as the graph size and density increase, while not sensitive to the time span and data generation mechanism, 3) the proposed DST2 prompting method can help to improve LLMs' spatial-temporal understanding abilities on dynamic graphs for most tasks. The data and codes are publicly available at Github.

  • Research Article
  • Citations: 78
  • 10.1001/jamanetworkopen.2024.12687
Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models
  • May 22, 2024
  • JAMA Network Open
  • Honghao Lai + 17 more

Large language models (LLMs) may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain. To explore the feasibility and reliability of using LLMs to assess risk of bias (ROB) in randomized clinical trials (RCTs). A survey study was conducted between August 10, 2023, and October 30, 2023. Thirty RCTs were selected from published systematic reviews. A structured prompt was developed to guide ChatGPT (LLM 1) and Claude (LLM 2) in assessing the ROB in these RCTs using a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. Each RCT was assessed twice by both models, and the results were documented. The results were compared with an assessment by 3 experts, which was considered a criterion standard. Correct assessment rates, sensitivity, specificity, and F1 scores were calculated to reflect accuracy, both overall and for each domain of the Cochrane ROB tool; consistent assessment rates and Cohen κ were calculated to gauge consistency; and assessment time was calculated to measure efficiency. Performance between the 2 models was compared using risk differences. Both models demonstrated high correct assessment rates. LLM 1 reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and LLM 2 reached a significantly higher rate of 89.5% (95% CI, 87.0%-91.8%). The risk difference between the 2 models was 0.05 (95% CI, 0.01-0.09). In most domains, domain-specific correct rates were around 80% to 90%; however, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns). Domains 4 (missing outcome data), 5 (selective outcome reporting), and 6 had F1 scores below 0.50. The consistent rates between the 2 assessments were 84.0% for LLM 1 and 87.3% for LLM 2. LLM 1's κ exceeded 0.80 in 7 and LLM 2's in 8 domains. 
The mean (SD) time needed for assessment was 77 (16) seconds for LLM 1 and 53 (12) seconds for LLM 2. In this survey study of applying LLMs for ROB assessment, LLM 1 and LLM 2 demonstrated substantial accuracy and consistency in evaluating RCTs, suggesting their potential as supportive tools in systematic review processes.

  • Conference Article
  • Citations: 1
  • 10.1109/icsme.2019.00064
Processing Large Datasets of Fined Grained Source Code Changes
  • Sep 1, 2019
  • Stanislav Levin + 1 more

In the era of Big Code, when researchers seek to study an increasingly large number of repositories to support their findings, the data processing stage may require manipulating millions and more of records. In this work we focus on studies involving fine-grained AST level source code changes. We present how we extended the CodeDistillery source code mining framework with data manipulation capabilities, aimed to alleviate the processing of large datasets of fine grained source code changes. The capabilities we have introduced allow researchers to highly automate their repository mining process and streamline the data acquisition and processing phases. These capabilities have been successfully used to conduct a number of studies, in the course of which dozens of millions of fine-grained source code changes have been processed.
