Detecting Protracted Vulnerabilities in Open Source Projects
Timely resolution and disclosure of vulnerabilities are essential for maintaining the security of open-source software. However, many vulnerabilities remain unreported, unpatched, or undisclosed for extended periods, exposing users to prolonged security threats. While various vulnerability detection tools exist, they primarily focus on predicting or identifying known vulnerabilities, often failing to capture vulnerabilities that experience significant delays in resolution. In this study, we examine the vulnerability lifecycle by analyzing protracted vulnerabilities (PCVEs), which remain unresolved or undisclosed over long periods. We construct a dataset of PCVEs and conduct a qualitative analysis to uncover underlying causes of delay. To assess current automated solutions, we evaluate four state-of-the-art (SOTA) vulnerability detectors on our dataset. These tools detect only 1,059 out of 2,402 PCVEs, achieving approximately 44% coverage. To address this limitation, we propose DeeptraVul, an enhanced detection approach designed specifically for protracted cases. DeeptraVul integrates multiple development artifacts and code signals, supported by a Large Language Model (LLM)-based summarization component. For comparison, we also evaluate a standalone LLM. Our results show that DeeptraVul improves detection performance, achieving a 14% increase in coverage across all PCVEs and reaching 90% coverage on the DeeptraVul PCVE subset, outperforming existing SOTA detectors and standalone LLM-based inference.
- Research Article
16
- 10.1145/3715908
- Feb 28, 2025
- ACM Transactions on Software Engineering and Methodology
Large Language Models (LLMs) have shown impressive In-Context Learning (ICL) ability in code generation. LLMs take a prompt context consisting of a few demonstration examples and a new requirement as input, and output new programs without any parameter update. Existing studies have found that the performance of ICL-based code generation heavily depends on the quality of demonstration examples, which has given rise to research on selecting demonstration examples: given a new requirement, a few demonstration examples are selected from a candidate pool, where LLMs are expected to learn the pattern hidden in these selected demonstration examples. Existing approaches are mostly based on heuristics or random selection. However, the distribution of randomly selected examples usually varies greatly, making the performance of LLMs less robust. The heuristics retrieve examples by only considering textual similarities of requirements, leading to sub-optimal performance. To fill this gap, we propose a Large language model-Aware selection approach for In-context-Learning-based code generation named LAIL. LAIL uses LLMs themselves to select examples. It requires LLMs themselves to label a candidate example as a positive example or a negative example for a requirement. Positive examples are helpful for LLMs to generate correct programs, while negative examples are trivial and should be ignored. Based on the labeled positive and negative data, LAIL trains a model-aware retriever to learn the preference of LLMs and select demonstration examples that LLMs need. During inference, given a new requirement, LAIL uses the trained retriever to select a few examples and feeds them into LLMs to generate desired programs. We apply LAIL to four widely used LLMs and evaluate it on five code generation datasets.
Extensive experiments demonstrate that LAIL outperforms the state-of-the-art (SOTA) baselines by 11.58%, 3.33%, and 5.07% on CodeGen-Multi-16B, 1.32%, 2.29%, and 1.20% on CodeLlama-34B, and achieves 4.38%, 2.85%, and 2.74% improvements on Text-davinci-003 in terms of Pass@1 at MBJP, MBPP, and MBCPP, respectively. In addition to function-level code generation, LAIL improves the performance of LLMs on DevEval, a repository-level code generation dataset, which achieves 10.04%, 8.12%, and 4.63% improvements compared to the SOTA baselines at Pass@1, 3, and 5 on CodeLlama-7B. Human evaluation further verifies that the generated programs of LAIL are superior in correctness, code quality, and maintainability. Besides, LAIL has satisfactory transferability across different LLMs and datasets, where the retriever learned on one LLM (dataset) can be transferred to other LLMs (datasets).
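The abstract above reports results as Pass@1, Pass@3, and Pass@5 but does not say how those values were computed. As a hedged illustration only, the sketch below implements the unbiased pass@k estimator commonly used in code generation evaluation, 1 - C(n-c, k) / C(n, k), where n programs are sampled per task and c of them pass the tests. The function name `pass_at_k` and the sample numbers are assumptions made for this example, not details taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which are correct.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k); the abstract does not confirm its use.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 10 generated programs, 3 pass the unit tests.
    for k in (1, 3, 5):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

A benchmark-level score would then be the mean of this per-task estimate over all tasks in the dataset.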
- Research Article
1
- 10.1609/aaai.v39i26.34945
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale. In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that needs to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We study two attacks to exploit the ChatBug vulnerability. Additionally, we demonstrate that the success of multiple existing attacks can be attributed to the ChatBug vulnerability. We show that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research.
- Research Article
6
- 10.1038/s41598-025-92889-7
- Mar 17, 2025
- Scientific Reports
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing and understanding. LLMs are being rapidly adopted in major industry sectors including mobile computing, healthcare, finance, government, and education, driven by technology giants such as NVIDIA, OpenAI, Microsoft, Apple, Meta, Google, Broadcom, AMD, and IBM. However, due to the emerging nature of this technology, many security and privacy challenges remain unresolved that we must tackle before rolling out LLMs to critical applications (e.g., healthcare, legal). In this article, we focus on the Reinforcement Learning via Human Feedback (RLHF) process that is widely used for training LLMs, giving them the human-like feel most applications value. The RLHF process involves employing human experts to generate feedback based on an LLM’s query-response pairs and using this feedback to then retrain (fine-tune) the model. However, RLHF can also expose the LLM to malicious feedback generated by one or more individuals in the process, leading to degraded performance of the LLM and harmful responses. Most state-of-the-art (SOTA) solutions to this problem involve utilizing a KL-Divergence-based brute-force update-rejection approach that can render the whole RLHF process completely useless (model quality is not improved) in the presence of malicious entities in the process. We propose the COnsensus-Based RewArd framework (COBRA), a consensus-based technique that can effectively negate the malicious noise generated by a certain segment of the RLHF human-expert pool, leading to improved LLM training performance in a mixed-trust scenario. We have evaluated COBRA for two separate LLM use cases, Sentiment Analysis and Conversational Task. We have experimented with a wide range of LLM models (e.g., GPT-2 XL - 1.5B parameters). COBRA outperformed the standard unprotected reward generation scheme by for the generative conversational task and by for the sentiment analysis task.
We have also quantitatively compared COBRA with Coste et al. and observed state-of-the-art performance, particularly when a lower number of reward models are used ( increased reward accuracy at ).
- Research Article
1
- 10.1145/3801158
- Mar 10, 2026
- ACM Transactions on Software Engineering and Methodology
Large Language Models (LLMs) aim to generate and understand human-like text by leveraging deep learning and natural language processing techniques. In software development, LLMs can enhance the coding experience through coding automation, reducing development time and improving code quality. Code refactoring is a technique used to enhance the internal quality of the code base without altering its external functionalities. Leveraging LLMs for code refactoring can help developers improve code quality with minimal effort. This paper presents an empirical study evaluating the quality of refactored code produced by StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3. Specifically, we (1) evaluate whether the code refactored by the LLMs can improve code quality, (2) understand the differences between the types of refactoring applied by the different LLMs and compare their effectiveness, and (3) evaluate whether the quality of the refactored code generated by the LLM can be improved through one-shot prompting and chain-of-thought prompting. We analyze the refactoring capabilities of LLMs on 30 open-source Java projects. We evaluate StarCoder2, LLaMA 3, GPT-4o-mini, GPT-4o, and DeepSeek-v3 on their ability to improve static code quality metrics, pass unit tests, and reduce code smells. Our findings reveal that production-grade models such as GPT-4o and DeepSeek-v3 achieve pass@5 unit test success rates above 90% on multi-file refactorings. LLaMA 3 achieves the highest overall code smell reduction with a median reduction of 15.1%, while DeepSeek-v3 and GPT-4o achieve the greatest improvements in cohesion, coupling, and complexity. StarCoder2 demonstrates strengths in modularity improvements and systematic refactorings. Developers outperform LLMs in complex, context-sensitive refactorings such as attribute encapsulation. 
We also show that prompt engineering significantly affects LLM performance: chain-of-thought prompting improves StarCoder2's test pass rate by 1.7% and increases code smell reduction compared to zero-shot prompting. One-shot prompting also expands the variety of refactorings LLMs can perform. These results suggest that LLMs are effective for many refactoring tasks, especially when guided with tailored prompts, but benefit from integration with human expertise for architectural or semantically complex changes. By providing insights into the capabilities and best practices for integrating LLMs into the software development process, our study aims to enhance the effectiveness and efficiency of code refactoring in real-world applications.
- Research Article
- 10.1161/circ.152.suppl_3.4367224
- Nov 4, 2025
- Circulation
Background: Indications for cardiac magnetic resonance imaging (CMR) are often stored in heterogeneous, unstructured reports. Manual adjudication of indications is time-consuming and requires domain expertise. Recent large language models (LLMs) have shown promise in complex clinical interpretation and categorization tasks. No prior study has systematically evaluated the ability of state-of-the-art (SOTA) LLMs to extract indications from raw CMR reports. Research question: How well do SOTA open-source and commercial LLMs adjudicate clinical indications from real-world CMR reports? Methods: We analyzed 486 CMR reports from a large academic center. Reports were de-identified using the Stanford-Penn-MIDRC deidentification tool, and ground-truth indications were annotated by a physician expert. Eighteen LLMs varying in accessibility (8 open-source, 10 commercial), parameter size (4 to 70 billion), and training corpus (general vs medical) were evaluated. For each report, LLMs were instructed to extract the top two possible indications (counted as correct if either matched the ground-truth indication, reflecting the fact that real-world indications can fall into more than one category) from ten possible categories: oncologic therapy toxicity, cardiomyopathy/elevated troponin, chest pain/dyspnea, arrhythmia/abnormal ECG, cardiac mass/metastasis, thrombus, structural evaluation, pericarditis, risk stratification, or viability evaluation (ischemic). Results: Higher-cost commercial models (Spearman’s rank r = 0.683, p = 0.03) and larger-parameter open-source models (r = 0.307) exhibited better adjudication ability (Fig 1A, 1B). The best performing commercial LLMs performed markedly better than the top open-source LLMs (90% vs ~78% accuracy [acc]) (Fig 2). Grok 3 (91% acc, 0.94 F1-score) and OpenAI o3 (90% acc, 0.93 F1) were the best models overall, and Gemma 3 27B was the best open-source LLM (80% acc, 0.86 F1) (Fig 2).
Reasoning models performed comparably to non-reasoning models, with Grok 3 mini having the best relative cost-vs-performance (Fig 1A, 2). Interestingly, medical LLMs performed worse than their generally pretrained counterparts (e.g., MedGemma 27B vs Gemma 3 27B), suggesting domain-specific pretraining may negatively affect adjudication ability (Fig 2). Conclusion: Open-source and commercial LLMs demonstrate promise in automated, accurate extraction of indications from CMR reports. Our findings help clinician-researchers decide between LLMs for use-cases involving CMR reports.
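The scoring rule described in the Methods above (a report counts as correct if either of the model's top two indications matches the physician's label) can be sketched as follows. This is a minimal illustration, not the study's evaluation code: the function name `top2_accuracy` and the three example reports are made up, and only the category names come from the abstract.

```python
from typing import Sequence


def top2_accuracy(
    predictions: Sequence[Sequence[str]], ground_truth: Sequence[str]
) -> float:
    """Fraction of reports whose true label is among the first two predictions."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        raise ValueError("at least one report is required")
    hits = sum(
        1
        for preds, truth in zip(predictions, ground_truth)
        if truth in preds[:2]  # only the top two guesses are considered
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical model outputs for three reports (illustrative only).
    predicted = [
        ["thrombus", "cardiac mass/metastasis"],
        ["pericarditis", "chest pain/dyspnea"],
        ["risk stratification", "structural evaluation"],
    ]
    annotated = ["cardiac mass/metastasis", "pericarditis", "thrombus"]
    print(f"top-2 accuracy: {top2_accuracy(predicted, annotated):.2f}")
```

Under this rule a model is not penalized for listing the correct category second, which is why the reported accuracies are best read as top-2 rather than exact-match figures.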
- Research Article
1
- 10.17821/srels/2024/v61i5/171583
- Oct 21, 2024
- Journal of Information and Knowledge
This study primarily aims to prepare a prototype and demonstrate that libraries can develop a low-cost conversational search system using open-source software tools and Large Language Models (LLMs) through a Retrieval-Augmented Generation (RAG) framework. LLMs often hallucinate and provide outdated and non-contextualized responses. However, this experiment shows that LLMs can deliver contextualized, relevant responses when augmented with a set of relevant documents. Augmenting LLMs with relevant documents before generating answers is known as retrieval-augmented generation. The methodology involved creating a RAG pipeline using tools like LangChain, vector databases like ChromaDB, and open-source LLMs like Llama3 (a 70-billion parameter-based model). The prototype developed includes a dataset of 250+ relevant documents on the Chandrayaan-3 mission that was collected, processed, and ingested into the pipeline. Finally, the study compared responses from standard LLMs and LLMs with RAG augmentation. Key findings revealed that standard LLMs (without RAG) produced confidently incorrect, hallucinated responses against queries related to Chandrayaan-3, while LLMs with RAG consistently provided accurate, informative, and contextualized answers when supplied with a set of relevant documents before generating the response. The study concluded that open-source RAG-based systems offer a cost-effective solution for libraries to enhance information retrieval and transform libraries into dynamic information services.
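The retrieve-then-generate flow described above can be sketched without any external services. The sketch below is an assumption-laden stand-in, not the study's pipeline: it replaces the vector database with a simple word-overlap ranking, stops at building the augmented prompt rather than calling an LLM, and uses invented helper names (`tokenize`, `retrieve`, `build_prompt`) and invented library-themed documents.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by word overlap with the query (a stand-in for
    embedding similarity in a vector database) and return the best matches."""
    query_counts = Counter(tokenize(query))
    scored = []
    for doc in documents:
        doc_counts = Counter(tokenize(doc))
        overlap = sum(min(n, doc_counts[term]) for term, n in query_counts.items())
        scored.append((overlap, doc))
    # Stable sort: documents with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query: str, context_docs: List[str]) -> str:
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    # Invented example documents; a real system would ingest a curated corpus.
    corpus = [
        "The reading room opens at 9 am on weekdays.",
        "Interlibrary loans take about five working days.",
        "The reading room closes at 8 pm.",
    ]
    question = "When does the reading room open?"
    print(build_prompt(question, retrieve(question, corpus)))
```

In a full system the overlap ranking would be replaced by embedding search, and the prompt would be passed to the chosen open-source model; the structure of the two steps stays the same.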
- Research Article
- 10.54254/2755-2721/2025.22701
- May 15, 2025
- Applied and Computational Engineering
Understanding and interpreting code is a crucial task in intelligent software engineering, aiding developers and users in adjusting code for correctness and robustness. The emergence of large language models (LLMs) provides new perspectives for code interpretation tasks. However, current LLM-based code interpretation remains restricted to limited dimensions, lacks a unified evaluation standard, and is missing a comprehensive and systematic assessment methodology. To address this issue, this paper proposes an LLM code understanding evaluation method based on a multi-granularity voting mechanism, aiming to systematically investigate and analyze LLMs' performance in code interpretation tasks. First, we carefully select code snippets from open-source GitHub projects and preprocess them for LLM analysis. Second, we use identical prompts and inputs to test three popular LLMs, recording their output. During this process, we apply prompt engineering techniques to specific target code snippets and conduct repeated experiments to explore the impact of prompt engineering on LLM-generated code explanations. Next, we design evaluation metrics to quantify the LLM outputs and assess their effectiveness based on the obtained scores. Experimental results demonstrate significant differences in code analysis and generation capabilities among the evaluated general-purpose LLMs from different vendors when given identical prompts and inputs. When multiple dimensions are considered in evaluating the generated content, different LLMs exhibit varying strengths in different aspects. Additionally, applying specific prompt engineering techniques can moderate the discrepancies in code analysis and generation capabilities among different LLMs.
- Research Article
- 10.59717/ipj.energy-use.2025.100026
- Jan 1, 2025
- Energy Use
Large language models (LLMs) are increasingly adopted across scientific and engineering fields. However, applying general-purpose LLMs to specialized engineering domains imposes stringent requirements for structured knowledge, rigorous reasoning, and technical precision. Thus, the suitability of current general-purpose LLMs for practical applications in engineering domains remains questionable. To understand the mastery level of LLMs in the building science domain as one broad but specific engineering domain, in this paper, we perform a comprehensive benchmark analysis (with a benchmark dataset of 1,487 questions) to evaluate the abilities of 15 state-of-the-art (SOTA) LLMs across 12 core subject topics in the building science domain. To enable scalable and robust evaluation, we propose and validate an AI-Judger for assessment across five dimensions of abilities, i.e., knowledge and concept, logic and consistency, clarity of expression, and reflection and exploratory. Overall, SOTA general-purpose LLMs achieve only ~50% accuracy on average in answering different types of questions. The capabilities of LLMs decrease progressively from linguistic expression and factual knowledge to logical reasoning, then reflection and exploratory thinking. For different tasks, LLMs exhibit notably low accuracy on calculation (~13%), short-answer (~23%), and cloze tasks (~30%), in contrast to stronger performance on single-choice (74%) and multiple-choice questions (63%). Finally, pronounced variance of LLM performance exists across topics, with relatively low accuracy on physics fundamentals and HVAC&R-related questions (median of 20%-40%) compared to ~80% for building standards and codes. These identified gaps highlight the limitations of general-purpose LLMs in engineering contexts, clearly pointing to the necessity of developing domain-specific LLMs tailored for engineering applications.
- Research Article
61
- 10.1109/jas.2024.124971
- Feb 1, 2025
- IEEE/CAA Journal of Automatica Sinica
Software security poses substantial risks to our society because software has become part of our life. Numerous techniques have been proposed to resolve or mitigate the impact of software security issues. Among them, software testing and analysis are two of the critical methods, which significantly benefit from the advancements in deep learning technologies. Due to the successful use of deep learning in software security, recently, researchers have explored the potential of using large language models (LLMs) in this area. In this paper, we systematically review the results focusing on LLMs in software security. We analyze the topics of fuzzing, unit test, program repair, bug reproduction, data-driven bug detection, and bug triage. We deconstruct these techniques into several stages and analyze how LLMs can be used in the stages. We also discuss the future directions of using LLMs in software security, including the future directions for the existing use of LLMs and extensions from conventional deep learning research.
- Research Article
9
- 10.1016/j.jss.2025.112452
- Sep 1, 2025
- Journal of Systems and Software
Demystifying issues, causes and solutions in LLM open-source projects
- Research Article
1
- 10.25136/2409-8698.2024.4.70455
- Apr 1, 2024
- Litera
The subject of the study is the analysis and improvement of methods for determining the relevance of project names to the information content of purchases using large language models. The object of the study is a database containing the names of projects and purchases in the electric power industry, collected from open sources. The author examines in detail such aspects of the topic as the use of TF-IDF and cosine similarity metrics for primary data filtering, and also describes in detail the integration and evaluation of the effectiveness of large language models such as GigaChat, GPT-3.5, and GPT-4 in text data matching tasks. Special attention is paid to the methods of clarifying the similarity of names based on reflection introduced into the prompts of large language models, which makes it possible to increase the accuracy of data comparison. The study uses TF-IDF and cosine similarity methods for primary data analysis, as well as the large language models GigaChat, GPT-3.5, and GPT-4 for detailed verification of the relevance of project names and purchases, including reflection in model prompts to improve the accuracy of results. The novelty of the research lies in the development of a combined approach to determining the relevance of project names and purchases, combining traditional methods of processing text information (TF-IDF, cosine similarity) with the capabilities of large language models. A special contribution of the author to the research of the topic is the proposed methodology for improving the accuracy of data comparison by clarifying the results of primary selection using GPT-3.5 and GPT-4 models with optimized prompts, including reflection.
The main conclusions of the study are confirmation of the prospects of using the developed approach in the tasks of information support for procurement processes and project implementation, as well as the possibility of using the results obtained for the development of text data mining systems in various sectors of the economy. The study showed that the use of language models makes it possible to improve the value of the F2 measure to 0.65, which indicates a significant improvement in the quality of data comparison compared with basic methods.
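The primary filtering step (TF-IDF vectors compared by cosine similarity) and the F2 measure mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the abstract does not give the TF-IDF weighting, so a smoothed IDF, ln((1 + N) / (1 + df)) + 1, is assumed here, and the function names and sample project and purchase names are invented.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (raw term count times smoothed IDF)."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in df}
    return [{t: count * idf[t] for t, count in doc.items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta = 2 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Invented names, loosely themed on the electric power industry.
    names = [
        "substation transformer replacement",
        "replacement of substation transformer",
        "office furniture supply",
    ]
    project, purchase_a, purchase_b = tfidf_vectors(names)
    print("similar pair:  ", round(cosine(project, purchase_a), 3))
    print("unrelated pair:", round(cosine(project, purchase_b), 3))
```

Pairs scoring above a chosen similarity threshold would be the ones passed on to the language model for the more detailed relevance check the abstract describes.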
- Research Article
2
- 10.1145/3716822
- Mar 24, 2025
- ACM Transactions on Software Engineering and Methodology
Vulnerability disclosure is critical for ensuring the security and reliability of open source software (OSS). However, in practice, many vulnerabilities are reported and discussed on public platforms before being formally disclosed, posing significant risks to vulnerability management. Inadequate vulnerability disclosure can expose users to security threats and severely impact the stability and reliability of software systems. For example, prior work shows that over 21% of CVEs are publicly discussed before a patch is released. Despite its importance, we still lack clarity on the vulnerability disclosure practices adopted by open source communities and the preferences of practitioners regarding vulnerability management. To fill this gap, we analyzed the vulnerability disclosure practices of 8,073 OSS projects spanning from 2017 to 2023. We then conducted an empirical study by surveying practitioners about their preferences and recommendations in vulnerability disclosure management. Finally, we compared the survey results with the actual vulnerability practice observed within the OSS projects. Our results show that while over 80% of practitioners support Coordinated Vulnerability Disclosure (CVD), only 55% of vulnerabilities conform to CVD in practice. Although only 20% of practitioners advocate discussions before disclosure, 42% of vulnerabilities are discussed in issue reports before their disclosure. This study reveals the vulnerability management practices in OSS, provides valuable guidance to OSS owners, and highlights potential directions to improve the security of OSS platforms.
- Abstract
- 10.1017/s0266462325101141
- Dec 1, 2025
- International Journal of Technology Assessment in Health Care
Introduction: Integrating large language models (LLMs) into horizon scanning workflows requires understanding of baseline features, like the ability to extract data and handle noisy data, and contextual understanding to inform considerations for LLM use. We evaluated 25 LLMs to assess their applicability for horizon scanning methods in general and to inform the design and integration strategy of our unit’s advanced horizon scanning system. Methods: We developed a comprehensive framework detailing 32 features across 10 categories for 25 LLMs. To build this framework, we used ChatGPT-4 to generate a preliminary list of categories, features, and LLMs relevant to HS. We supplemented this with parameters from the 2024 LeewayHertz assessment and finalized it through team consensus. Next, we employed a human-in-the-loop approach utilizing a recursive prover-verifier-chain: Microsoft Copilot > Claude 3.5 > ChatGPT-4. Each LLM was assessed for variations in baseline features impacting their applicability in horizon scanning methods and potential integration into our horizon scanning system. Results: We identified six variable features (19%) across five categories. Nineteen of the LLMs support on-premises or self-hosted deployment. Regarding integration flexibility, only seven LLMs were open source and four lacked strong vendor support. Eighteen models offered a usage-based pricing system, allowing budget tailoring. Five LLMs excelled in handling noisy data, beneficial for horizon scanning methods dealing with diverse information sources. Seventeen models had multimodal capabilities. Conclusions: Variations in key features among the 25 candidate LLMs affected their suitability for integration into horizon scanning workflows. Units must consider the trade-offs between deployment options, open-source availability, vendor support, pricing models, data handling capabilities, and multimodal features.
This extensive framework supports assessment and selection of appropriate LLMs for horizon scanning workflows by filtering models according to these key features.
- Research Article
5
- 10.1007/s00117-024-01327-8
- Jun 7, 2024
- Radiologie (Heidelberg, Germany)
In 2023, the release of ChatGPT triggered an artificial intelligence (AI) boom. The underlying large language models (LLMs) of the nonprofit organization "OpenAI" are not freely available under open-source licenses, which does not allow on-site implementation inside secure clinic networks. However, efforts are being made by open-source communities, start-ups and large tech companies to democratize the use of LLMs. This opens up the possibility of using LLMs in a data protection-compliant manner and even adapting them to our own data. This paper aims to explain the potential of privacy-compliant local LLMs for radiology and to provide insights into the "open" versus "closed" dynamics of the currently rapidly developing field of AI. The methods comprised a PubMed search for radiology articles with LLMs and a subjective selection of references in the sense of a narrative key topic article. Various stakeholders, including large tech companies such as Meta, Google and X, but also European start-ups such as Mistral AI, contribute to the democratization of LLMs by publishing the models (open weights) or by publishing the model and source code (open source). Their performance is lower than current "closed" LLMs, such as GPT‑4 from OpenAI. Despite differences in performance, open and thus locally implementable LLMs show great promise for improving the efficiency and quality of diagnostic reporting as well as interaction with patients and enable retrospective extraction of diagnostic information for secondary use of clinical free-text databases for research, teaching or clinical application.
- Conference Article
1
- 10.18260/1-2--45767
- May 14, 2024
In this study, we investigate the types of stereotypical bias in Large Language Models (LLMs). We highlight the risks of ignoring bias in LLMs, ranging from perpetuating stereotypes to affecting hiring decisions, medical diagnostics, and criminal justice outcomes. To address these issues, we propose a novel approach to evaluate bias in LLMs using metrics developed by StereoSet [1]. Our experiments involve evaluating several proprietary and open-source LLMs (GPT4, GEMINI PRO, OPENCHAT, LLAMA) for stereotypical bias and examining the attributes that influence bias. We used 100 selected prompts from the StereoSet dataset to query the LLMs via their respective APIs. The results were evaluated using the language modeling score, stereotype score and the combination iCAT [1] score. In particular, open source LLMs showed higher levels of bias in handling stereotypes than proprietary LLMs (40% average stereotype score for the open source LLMs and 47% average stereotype score for the proprietary ones: 50% being the ideal, unbiased stereotype score). The language modeling score was even between the models, with the open source models achieving 94% and the proprietary ones 91%. The combined average iCAT score was 76.6% for the proprietary models and 62.5% for the open source models. This disparity in stereotypical bias could be due to the regulatory inspection and user testing through reinforcement learning with human feedback (RLHF) that the proprietary models are subject to. We present our findings and discuss their implications for mitigating bias in LLMs. Overall, this research contributes to the understanding of bias in LLMs and provides insights into strategies for improving fairness and equity in NLP applications.