Attacks and Defenses for Large Language Models on Coding Tasks

  • Abstract
  • Similar Papers
Abstract

Modern large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities for coding tasks, including writing and reasoning about code. They improve upon previous neural network models of code, such as code2seq or seq2seq, that already demonstrated competitive results when performing tasks such as code summarization and identifying code vulnerabilities. However, these previous code models were shown to be vulnerable to adversarial examples, i.e., small syntactic perturbations designed to "fool" the models. In this paper, we first aim to study the transferability of adversarial examples, generated through white-box attacks on smaller code models, to LLMs. We also propose a new attack that uses an LLM to generate the perturbations. Further, we propose novel cost-effective techniques to defend LLMs against such adversaries via prompting, without incurring the cost of retraining. These prompt-based defenses involve modifying the prompt to include additional information, such as examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations. Our preliminary experiments show the effectiveness of the attacks and the proposed defenses on popular LLMs such as GPT-3.5 and GPT-4.
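As a minimal illustration of the prompt-based defense idea described above, the sketch below wraps untrusted code in defensive instructions plus one few-shot example. The instruction wording and the example are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative prompt-based defense: instruct the model to undo likely
# adversarial perturbations before performing the task. The wording and
# the few-shot example are invented, not the paper's exact prompts.

DEFENSE_INSTRUCTIONS = (
    "The following code may contain small adversarial perturbations, "
    "such as misleading identifier names or inserted dead code. "
    "First rewrite the code to undo any such perturbations, then "
    "answer the question about the cleaned code."
)

FEW_SHOT_EXAMPLE = (
    "Perturbed: def sort(x): unused_flag = 0; return sorted(x)\n"
    "Cleaned:   def sort(x): return sorted(x)"
)

def build_defended_prompt(code: str, question: str) -> str:
    """Wrap untrusted code with defensive instructions and an example."""
    return (
        f"{DEFENSE_INSTRUCTIONS}\n\nExample:\n{FEW_SHOT_EXAMPLE}\n\n"
        f"Code:\n{code}\n\nQuestion: {question}"
    )
```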

Similar Papers
  • Research Article
  • Cited by 6
  • 10.1609/aaai.v39i24.34811
DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Qiming Zhu + 6 more

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving domain-specific coding tasks (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling a push-button construction from code repositories into formatted subjects under study. Interesting findings are observed by evaluating 12 representative LLMs against DOMAINEVAL. We notice that LLMs are generally good at computation tasks while falling short on cryptography and system coding tasks. The performance gap can be as much as 68.94% (80.94% - 12.0%) in some LLMs. We also observe that generating more samples can increase the overall performance of LLMs, while the domain bias may even increase. The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements.
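As a rough sketch of what one step of an auto-constructed pipeline might look like, the snippet below walks a repository and yields functions labelled with a crude directory-derived domain. All names are hypothetical; the actual DOMAINEVAL pipeline is considerably more elaborate.

```python
# Hypothetical benchmark-construction step: harvest functions from a
# repository and label them with a crude directory-derived domain.
import ast
import pathlib

def extract_subjects(repo_root: str):
    """Yield (domain, function_source) pairs from Python files in a repo."""
    for path in pathlib.Path(repo_root).rglob("*.py"):
        source = path.read_text(encoding="utf-8")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        domain = path.parent.name or "misc"  # e.g., "crypto", "system"
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                yield domain, ast.get_source_segment(source, node)
```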

  • Research Article
  • Cited by 4
  • 10.1145/3707457
A Catalog of Data Smells for Coding Tasks
  • Apr 28, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Antonio Vitale + 2 more

Large Language Models (LLMs) are increasingly becoming fundamental in supporting software developers in coding tasks. The massive datasets used for training LLMs are often collected automatically, leading to the introduction of data smells. Previous work addressed this issue by using quality filters to handle some specific smells. Still, the literature lacks a systematic catalog of the currently known data smells for coding tasks. This article presents a Systematic Literature Review (SLR) focused on articles that introduce LLMs for coding tasks. We first extracted the quality filters adopted for training and testing such LLMs, inferred the root problem behind their adoption (data smells for coding tasks), and defined a taxonomy of such smells. Our results highlight discrepancies in the adoption of quality filters between pre-training and fine-tuning stages and across different coding tasks, shedding light on areas for improvement in LLM-based software development support.
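To make the notion of quality filters concrete, here is a small sketch of the kind of filter such surveys catalog; the thresholds and the specific smells checked are assumptions, not the article's taxonomy.

```python
# Illustrative data-smell filters for code training data. Thresholds
# and checks are invented examples, not the article's taxonomy.

def looks_smelly(code: str) -> bool:
    """Flag code samples exhibiting simple, commonly filtered smells."""
    lines = code.splitlines()
    if not lines:
        return True
    if max(len(line) for line in lines) > 1000:    # likely minified/generated
        return True
    if sum("TODO" in line for line in lines) > 5:  # placeholder-heavy code
        return True
    alnum_ratio = sum(c.isalnum() for c in code) / max(len(code), 1)
    return alnum_ratio < 0.25                      # mostly symbols or blobs
```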

  • Research Article
  • Cited by 7
  • 10.1145/3695868
Building a Coding Assistant via the Retrieval-Augmented Language Model
  • Jan 17, 2025
  • ACM Transactions on Information Systems
  • Xinze Li + 8 more

Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion tasks. In this article, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure-aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and learn effective representations for code snippets and documentation. Then CONAN-G designs a dual-view code representation mechanism for implementing a retrieval-augmented code generation model. CONAN-G regards the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs and capturing structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both program language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in shorter code document lengths to improve their effectiveness on various code tasks. It shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.
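A minimal sketch of the retrieve-then-generate flow, assuming precomputed embeddings; `retrieve` and `build_rag_prompt` are stand-ins for CONAN-R and CONAN-G, not the paper's actual interfaces.

```python
# Minimal retrieval-augmented generation flow in the spirit of CONAN.
# Embeddings are assumed precomputed; names are placeholders.
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Return indices of the k most cosine-similar documents."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]

def build_rag_prompt(task: str, snippets: list[str], docs: list[str]) -> str:
    """Pair each retrieved snippet with its documentation as context."""
    context = "\n\n".join(
        f"Documentation: {d}\nCode:\n{s}" for s, d in zip(snippets, docs)
    )
    return f"{context}\n\nTask: {task}\nAnswer:"
```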

  • Conference Article
  • Cited by 2
  • 10.54941/ahfe1006232
Exploring Inductive and Deductive Qualitative Coding with AI: Investigating Inter-Rater Reliability between Large Language Model and Human Coders
  • Jan 1, 2025
  • AHFE international
  • He Zhang + 7 more

Qualitative research provides valuable insights into complex human phenomena, but its coding processes are often time-intensive and labor-intensive. The advent of Large Language Models (LLMs) has introduced new opportunities to streamline qualitative analysis. This study investigates the application of LLMs in both inductive and deductive coding tasks using real-world datasets, assessing their ability to complement traditional coding methods. To address challenges such as privacy concerns, prompt customization, and integration with qualitative workflows, we developed QualiGPT, an API-based tool that facilitates efficient and secure qualitative coding. Our evaluation shows that the consistency level between AI-generated codes and human coders is acceptable, particularly for inductive coding tasks where themes are identified without prior frameworks. In our case study using data from a Discord community, GPT-4 achieved a Cohen's Kappa of 0.57 in inductive coding, demonstrating moderate agreement with human coders. For deductive coding, the inter-rater reliability between human coders and GPT-4 reached a Fleiss' Kappa of 0.46, indicating a promising level of consistency when applying pre-established codebooks. These findings highlight the potential of LLMs to augment qualitative research by improving efficiency and consistency while maintaining the contextual depth that human researchers provide. We also observed that LLMs demonstrated higher internal consistency compared to human coders when using a codebook for deductive coding, suggesting their value in standardizing coding approaches. Additionally, we explored a novel paradigm where LLMs function not merely as coding tools but as collaborative co-researchers that independently analyze data alongside humans. This approach leverages LLMs' strengths in generating high-quality themes and providing genuine content references, thereby enriching researchers' insights while maintaining human oversight to ensure contextual understanding and ethical standards. Nevertheless, challenges remain regarding prompt engineering, domain-specific training, and the risk of fabricated information, underscoring the importance of human validation in the final analysis. This research advances human-AI collaboration in qualitative methods by exploring AI-assisted coding and highlighting future improvements in interaction design.
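For readers unfamiliar with the agreement statistics reported above, here is a short worked example of Cohen's kappa between a human coder and a model; the labels are invented for illustration.

```python
# Worked example: Cohen's kappa between a human coder and an LLM on
# the same items. Labels are made up for illustration.
from sklearn.metrics import cohen_kappa_score

human = ["praise", "question", "praise", "complaint", "question", "praise"]
model = ["praise", "question", "complaint", "complaint", "praise", "praise"]

kappa = cohen_kappa_score(human, model)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect, 0 = chance agreement
```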

  • Research Article
  • Cited by 3
  • 10.3390/electronics14071384
Evaluating Large Language Model Application Impacts on Evasive Spectre Attack Detection
  • Mar 29, 2025
  • Electronics
  • Jiajia Jiao + 3 more

This paper investigates the impact of different Large Language Models (DeepSeek, Kimi, and Doubao) on the attack detection success rate of evasive Spectre attacks while performing text, image, and code tasks. By running tasks on different Large Language Models (LLMs) concurrently with evasive Spectre attacks, a unique dataset containing LLM-induced noise was constructed. Subsequently, clustering algorithms were employed to reduce the dimension of the data and filter out representative samples for the test set. Finally, based on a random forest detection model, the study systematically evaluated the impact of different task types on the attack detection success rate. The experimental results indicate that the attack detection success rate follows the pattern “code > text > image” in both the evasive Spectre memory attack and the evasive Spectre nop attack. To further assess the influence of different architectures on evasive Spectre attacks, additional experiments were conducted on an NVIDIA RTX 3060 GPU. The results reveal that, on the RTX 3060, the attack detection success rate for code tasks decreased, while those for text and image tasks increased compared to the 2080 Ti. This finding suggests that architectural differences affect the manifestation of Hardware Performance Counters (HPCs), influencing the attack detection success rate.
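A toy version of the detection model described above: a random forest classifying attack windows from hardware-performance-counter features. The synthetic data and feature semantics are placeholders for the paper's dataset.

```python
# Toy HPC-based detector: random forest over synthetic counter features.
# Data and feature semantics are placeholders, not the paper's dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))      # e.g., cache misses, branch misses, ...
y = rng.integers(0, 2, size=1000)   # 1 = attack window, 0 = benign

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"detection accuracy: {clf.score(X_te, y_te):.2f}")
```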

  • Research Article
  • Cited by 3
  • 10.1142/s0218194024500050
Multi-Intent Inline Code Comment Generation via Large Language Model
  • Mar 23, 2024
  • International Journal of Software Engineering and Knowledge Engineering
  • Xiaowei Zhang + 4 more

Code comment generation typically refers to the process of generating concise natural language descriptions for a piece of code, which facilitates program comprehension activities. Inline code comments, as a part of code comments, are also crucial for program comprehension. Recently, the emergence of large language models (LLMs) has significantly boosted the performance of natural language processing tasks. This naturally inspires us to explore the performance of LLMs in the task of inline code comment generation. To this end, we evaluate open-source LLMs on a large-scale dataset and compare the results with the current state-of-the-art methods. Specifically, we explore model performance in the following scenarios based on the widely used evaluation metrics (i.e., BLEU, Meteor, and ROUGE-L): (1) generation with simple instruction; (2) few-shot-guided generation with random examples selected from the database; (3) few-shot-guided generation with similar examples selected from the database; and (4) adopting a re-ranking strategy for the output of LLMs. Our findings reveal that: (1) under the simple instruction scenario, LLMs could not fully realise their potential in the task of inline comment generation compared to the state-of-the-art models; (2) random few-shot examples lead to a slight improvement; (3) similar few-shot examples and the re-ranking strategy can significantly enhance the performance of LLMs; and (4) for inline comment and code snippet pairs with different intents, the “why” category achieves the best performance while the “what” category performs relatively poorly, a pattern that remains consistent across all four scenarios. Our findings shed light on future research directions for using LLMs in inline comment generation tasks.
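A small sketch of scenario (3), selecting few-shot examples similar to the query snippet before prompting; the token-overlap similarity here is a simple stand-in for whatever retrieval the paper used.

```python
# Similar-example few-shot prompting: rank candidate (code, comment)
# pairs by token overlap with the query. Similarity is a stand-in.

def select_similar_examples(query: str, pool: list[tuple[str, str]], k: int = 2):
    """Return the k (code, comment) pairs most similar to the query."""
    q_tokens = set(query.split())
    return sorted(
        pool,
        key=lambda pair: len(q_tokens & set(pair[0].split())),
        reverse=True,
    )[:k]

def build_few_shot_prompt(query: str, examples: list[tuple[str, str]]) -> str:
    """Assemble retrieved examples into an inline-comment prompt."""
    shots = "\n\n".join(f"Code: {c}\nComment: {m}" for c, m in examples)
    return f"{shots}\n\nCode: {query}\nComment:"
```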

  • Research Article
  • Cited by 13
  • 10.1145/3643758
Rocks Coding, Not Development: A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks
  • Jul 12, 2024
  • Proceedings of the ACM on Software Engineering
  • Wei Wang + 4 more

Recently, large language model (LLM)-based generative AI has been gaining momentum for its impressive high-quality performance in multiple domains, particularly after the release of ChatGPT. Many believe that such models have the potential to perform general-purpose problem-solving in software development and replace human software developers. Nevertheless, there is a lack of serious investigation into the capability of these LLM techniques in fulfilling software development tasks. In a controlled 2 x 2 between-subject experiment with 109 participants, we examined whether and to what degree working with ChatGPT was helpful in a coding task and a typical software development task, and how people work with ChatGPT. We found that while ChatGPT performed well in solving simple coding problems, its performance in supporting typical software development tasks fell short. We also observed the interactions between participants and ChatGPT and identified relations between those interactions and the outcomes. Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers and motivates the need for novel interaction mechanisms that help developers effectively work with large language models to achieve desired outcomes.

  • Research Article
  • 10.62311/nesx/rp-1-09-2021
Operationalizing Inference Quality for Code-Generating LLMs: The INFINITE Evaluation Methodology
  • Jan 30, 2021
  • International Journal of Academic and Industrial Research Innovations(IJAIRI)
  • Murali Krishna Pasupuleti

Code-generating large language models are increasingly deployed across the software lifecycle, yet headline metrics such as pass@k omit the inference process—prompting, decoding, tool use, retries, and budget consumption—that determines real-world utility. Building on this concept analysis, a budget- and interaction-aware evaluation framework, INFINITE, is proposed to operationalize inference quality. The central problem addressed is the absence of a standardized, compute-normalized methodology that remains stable across seeds and decoding settings, incorporates safety, and predicts developer-relevant outcomes. The methodology defines inference as a controlled process with explicit budgets (tokens, time, tool calls) and prescribes protocol cards for environments, prompts, and decoding policies. A multi-dimensional scoring scheme is introduced—functional correctness, efficiency, execution reliability, repair gain, safety/compliance, and stability—aggregated into a calibrated Inference Index via additive or geometric weighting with bootstrap confidence intervals. Evaluation is conducted in containerized, version-pinned environments on stratified, contamination-checked tasks across languages and code tasks. Results indicate materially improved ranking stability under seed/temperature perturbations, stronger correlation with human measures of effort and time-to-solution, and fairer comparisons after budget normalization; safety penalties meaningfully re-rank models that otherwise appear superior. The impact is to provide decision-grade evidence for procurement, deployment, and governance, enable reproducible comparison across models and strategies, and establish reporting standards that align with expectations of high-impact journals and research-intensive institutions.

Keywords: code generation, large language models, evaluation methodology, inference quality, compute normalization, budgeted inference, pass@k, execution reliability, self-repair, safety and compliance, stability, uncertainty quantification, bootstrap confidence intervals, containerized environments, reproducibility
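A compact sketch of the aggregation idea: combine per-dimension scores into one index with fixed weights and bootstrap a confidence interval over tasks. The weights and data below are illustrative, not INFINITE's calibrated defaults.

```python
# Sketch: additive weighting of dimension scores plus a percentile
# bootstrap CI over tasks. Weights and data are illustrative only.
import numpy as np

def inference_index(dim_scores: np.ndarray, weights: np.ndarray) -> float:
    """Additive weighting of per-dimension scores in [0, 1]."""
    return float(np.average(dim_scores, weights=weights))

def bootstrap_ci(per_task: np.ndarray, weights: np.ndarray,
                 n_boot: int = 2000, alpha: float = 0.05):
    """Percentile bootstrap CI for the index; rows of per_task are tasks."""
    rng = np.random.default_rng(0)
    n = per_task.shape[0]
    stats = [
        inference_index(per_task[rng.integers(0, n, n)].mean(axis=0), weights)
        for _ in range(n_boot)
    ]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

per_task = np.random.default_rng(1).uniform(size=(40, 6))  # 40 tasks, 6 dims
weights = np.array([0.3, 0.15, 0.15, 0.15, 0.15, 0.10])
print(bootstrap_ci(per_task, weights))
```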

  • Book Chapter
  • 10.1007/978-3-032-07132-3_7
Integrating LLMs with QC-OpenDRIVE: Ensuring Normative Correctness in Autonomous Driving Scenarios
  • Oct 11, 2025
  • Julian Müller + 2 more

This paper investigates the integration of Large Language Models (LLMs) with the QC-OpenDRIVE framework in order to generate syntactically and semantically correct OpenDRIVE files. OpenDRIVE files play an important role in the scenario-based validation of autonomous driving systems, as they define the static part (e.g., road layout) on which the functions are validated. While LLMs excel at generating code and similar artifacts that mostly need to be syntactically correct, validating semantic, and especially normative, correctness remains challenging. To ensure norm-adherent correctness of generated OpenDRIVE files, this paper proposes integrating LLMs with QC-OpenDRIVE in a feedback loop. While LLMs make it easy to generate different road layouts, the results often exhibit issues such as missing or unconnected roads or improper continuity. To address this issue, we have implemented check E.5.9.1 to ensure geometric continuity between connected roads, which is a key contribution of this paper. State-of-the-art models are evaluated on three tasks that create OpenDRIVE road networks, with the results validated through the feedback loop. Results show that models leveraging Retrieval Augmented Generation (RAG) or internal reasoning and using the feedback loop can generate syntactically and semantically valid outputs after iterative corrections. However, challenges remain in prompting complex scenarios and tasks, especially following geometric rules without explicit feedback. The results demonstrate the necessity of domain-specific normative validation frameworks to prepare LLMs for use in safety-critical applications. Such frameworks can enable scalable generation of edge-case scenarios while ensuring compliance with industry standards. This work bridges the gap between automated scenario generation and rigorous validation of reliable autonomous driving systems.
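A toy version of a geometric-continuity check in the spirit of the E.5.9.1 rule mentioned above: successive road segments must meet end-to-start within a tolerance. Real OpenDRIVE geometry (arcs, spirals, elevation) is far richer than this sketch.

```python
# Toy continuity check: consecutive segments must meet within `tol`.
# Real OpenDRIVE validation handles arcs, spirals, headings, etc.
import math

def continuous(end_xy: tuple[float, float],
               start_xy: tuple[float, float],
               tol: float = 1e-3) -> bool:
    """True if the gap between segment endpoints is within tolerance."""
    return math.dist(end_xy, start_xy) <= tol

# Feedback-loop step: report discontinuities back to the LLM to repair.
segments = [((0.0, 0.0), (10.0, 0.0)), ((10.0, 0.02), (20.0, 0.0))]
for i in range(len(segments) - 1):
    if not continuous(segments[i][1], segments[i + 1][0]):
        print(f"segments {i} -> {i + 1}: discontinuity, ask the LLM to fix")
```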

  • Research Article
  • 10.63337/term.2025.43286
Enhancing Reasoning in LLMs through Contrastive Estimation and Representation Engineering
  • Mar 3, 2025
  • The Edge Review
  • Tim Condello

Large Language Models (LLMs) have made tremendous strides, yet true reasoning remains a frontier challenge. These models often struggle with complex multi-step reasoning tasks, especially those requiring logic or intermediate calculations. Unlike straightforward queries (e.g., “What’s the capital of France?”), reasoning questions demand a chain of logical steps – for example, solving a math word problem or debugging code – which pushes LLMs beyond simple pattern matching. The critical value of advancing reasoning in large language models cannot be emphasized enough. Stronger reasoning abilities empower AI to solve sophisticated challenges across science, engineering, and daily decision-making, moving closer to reliable AI assistants and autonomous problem solvers. Leading research initiatives underscore this priority. OpenAI’s and Deepseek’s latest models explicitly spend “more time thinking through problems before they respond,” yielding significant advancements on hard tasks in math, coding, and science [3]. In short, enhanced reasoning is key to unlocking a new level of AI capability and trustworthiness in real-world applications.

  • Research Article
  • 10.1016/j.compbiomed.2025.110747
Infusing clinical knowledge into language models by subword optimisation and embedding initialisation.
  • Sep 1, 2025
  • Computers in biology and medicine
  • Abul Hasan + 9 more

This study introduces a novel tokenisation methodology, K-Tokeniser, to infuse clinical knowledge into language models for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on the semantic types of domain concepts (such as drugs or diseases), drawn from either a domain ontology like the Unified Medical Language System or the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is utilised to choose the optimal global token representation, realising semantic-based tokenisation. To avoid pretraining with the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets to evaluate K-Tokeniser across a wide range of clinical text analytics tasks, including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task, with a 13% increase in Micro F1 score. Furthermore, K-Tokeniser also facilitates quicker convergence: models built with it would require only 50% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task, and less than 20% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable. Code availability: Our full implementation is openly available at https://github.com/abulhasanbbk/K-Tokenizer.
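A brief sketch of the embedding-initialisation idea: a new clinical token starts from the mean of the embeddings of the subwords it replaces, so no pretraining is needed. It assumes a Hugging Face-style tokenizer; the function name is hypothetical.

```python
# Sketch: initialise a new token's embedding as the mean of its old
# subword embeddings. Assumes a Hugging Face-style tokenizer API.
import torch

def init_new_token_embedding(new_token: str, tokenizer,
                             embedding: torch.nn.Embedding) -> torch.Tensor:
    """Mean-of-subwords initialisation for one newly added token."""
    sub_ids = tokenizer.encode(new_token, add_special_tokens=False)
    with torch.no_grad():
        # Caller copies this vector into the resized embedding matrix.
        return embedding.weight[sub_ids].mean(dim=0)
```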

  • Research Article
  • Cited by 6
  • 10.15388/lmitt.2024.20
Unit Test Generation Using Large Language Models: A Systematic Literature Review
  • May 13, 2024
  • Vilnius University Open Series
  • Dovydas Marius Zapkus + 1 more

Unit testing is a fundamental aspect of software development, ensuring the correctness and robustness of code implementations. Traditionally, unit tests are manually crafted by developers based on their understanding of the code and its requirements. However, this process can be time-consuming, error-prone, and may overlook certain edge cases. In recent years, there has been growing interest in leveraging large language models (LLMs) for automating the generation of unit tests. LLMs such as GPT (Generative Pre-trained Transformer), CodeT5, StarCoder, and LLaMA have demonstrated remarkable capabilities in natural language understanding and code generation tasks. By using LLMs, researchers aim to develop techniques that automatically generate unit tests from code snippets or specifications, thus optimizing the software testing process. This paper presents a literature review of articles that use LLMs for unit test generation tasks. It also discusses the history of the most commonly used large language models and their parameters, including the first time they were used for code generation tasks. The results of this study present the large language models used for code and unit test generation tasks and their increasing popularity in the code generation domain, indicating great promise for the future of unit test generation using LLMs.
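As a minimal sketch of the prompting step such techniques share, the snippet below asks a model for pytest cases covering a function; `build_test_prompt` is an invented helper, and any chat-completion client could consume its output.

```python
# Minimal unit-test-generation prompt. The helper name is invented;
# feed the resulting string to any chat-completion client.

def build_test_prompt(function_source: str) -> str:
    """Ask for pytest tests covering normal, edge, and invalid inputs."""
    return (
        "Write pytest unit tests for the following function. "
        "Cover normal inputs, edge cases, and invalid inputs.\n\n"
        f"{function_source}\n"
    )

GCD = (
    "def gcd(a: int, b: int) -> int:\n"
    "    while b:\n"
    "        a, b = b, a % b\n"
    "    return a"
)
print(build_test_prompt(GCD))
```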

  • Research Article
  • 10.58557/(ijeh).v5i4.371
Enhancing Algorithm Learning with Large Language Models: Design and Evaluation of AlgoLLM in Higher Education Practice
  • Aug 3, 2025
  • International Journal of Education and Humanities
  • Shitong Peng + 3 more

Algorithm learning remains challenging in computer science education due to its abstract logic, steep conceptual difficulty, and the lack of personalized support in traditional settings. This study presents AlgoLLM, a modular instructional system built on large language models (LLMs) to support students through natural language explanations, code-level guidance, and feedback-based refinement. The system includes four core components: Knowledge Explainer, Exercise Generator, Code Assistant and Debugger, and Feedback Evaluator. A four-week case study was conducted with 60 undergraduate students, comparing a control group using textbooks and an experimental group using AlgoLLM. Paired and independent t-tests showed that the experimental group achieved significantly higher learning gains in post-tests (mean increase of 18.3 percent, Cohen's d = 0.94). Code accuracy and task efficiency also improved. Pearson correlation revealed a moderate relationship between LLM interaction frequency and learning gain. Questionnaire feedback indicated high perceived usefulness, clarity, and satisfaction. These results suggest that LLM-based systems like AlgoLLM can enhance algorithm comprehension and offer scalable, personalized support in technical education.

  • Research Article
  • 10.1145/3729345
SmartNote: An LLM-Powered, Personalised Release Note Generator That Just Works
  • Jun 19, 2025
  • Proceedings of the ACM on Software Engineering
  • Farbod Daneshyan + 3 more

The release note is a crucial document outlining changes in new software versions. It plays a key role in helping stakeholders recognise important changes and understand the implications behind them. Despite this fact, many developers view the process of writing software release notes as a tedious and dreadful task. Consequently, numerous tools (e.g., DeepRelease and Conventional Changelog) have been developed by researchers and practitioners to automate the generation of software release notes. However, these tools fail to consider project domain and target audience for personalisation, limiting their relevance and conciseness. Additionally, they suffer from limited applicability, often necessitating significant workflow adjustments and adoption efforts, hindering practical use and stressing developers. Despite recent advancements in natural language processing and the proven capabilities of large language models (LLMs) in various code and text-related tasks, there are no existing studies investigating the integration and utilisation of LLMs in automated release note generation. Therefore, we propose SmartNote, a novel and widely applicable release note generation approach that produces high-quality, contextually personalised release notes by leveraging LLM capabilities to aggregate, describe, and summarise changes based on code, commit, and pull request details. It categorises and scores (for significance) commits to generate structured and concise release notes of prioritised changes. We conduct human and automatic evaluations that reveal SmartNote outperforms or achieves comparable performance to DeepRelease (state-of-the-art), Conventional Changelog (off-the-shelf), and the projects' original release note across four quality metrics: completeness, clarity, conciseness, and organisation. In both evaluations, SmartNote ranked first for completeness and organisation, while clarity ranked first in the human evaluation. Furthermore, our controlled study reveals the significance of contextual awareness, while our applicability analysis confirms SmartNote's effectiveness across diverse projects.
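To illustrate the categorise-and-score step, here is a toy scorer that buckets commits by conventional-commit prefix and weights them so significant changes lead the release note; the categories and weights are assumptions, not SmartNote's actual scheme.

```python
# Toy commit scoring: bucket by conventional-commit prefix and weight.
# Categories and weights are invented, not SmartNote's actual scheme.

CATEGORY_WEIGHTS = {"feat": 3, "fix": 2, "perf": 2, "docs": 1, "chore": 0}

def score_commit(message: str) -> tuple[str, int]:
    """Return (category, significance score) for one commit message."""
    prefix = message.split(":", 1)[0].strip().lower()
    category = prefix if prefix in CATEGORY_WEIGHTS else "other"
    return category, CATEGORY_WEIGHTS.get(category, 1)

commits = ["feat: add SSO login", "chore: bump deps", "fix: crash on empty file"]
for msg in sorted(commits, key=lambda m: -score_commit(m)[1]):
    print(score_commit(msg)[0], "|", msg)
```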

  • Research Article
  • Cited by 6
  • 10.1186/s44342-024-00036-x
Comparative analysis of generative LLMs for labeling entities in clinical notes
  • Feb 6, 2025
  • Genomics & Informatics
  • Rodrigo Del Moral-González + 2 more

This paper evaluates and compares different fine-tuned variations of generative large language models (LLM) in the zero-shot named entity recognition (NER) task for the clinical domain. As part of the 8th Biomedical Linked Annotation Hackathon, we examined Llama 2 and Mistral models, including base versions and those that have been fine-tuned for code, chat, and instruction-following tasks. We assess both the number of correctly identified entities and the models’ ability to retrieve entities in structured formats. We used a publicly available set of clinical cases labeled with mentions of diseases, symptoms, and medical procedures for the evaluation. Results show that instruction fine-tuned models perform better than chat fine-tuned and base models in recognizing entities. It is also shown that models perform better when simple output structures are requested.
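A small sketch of the zero-shot setup with the simple structured output the authors found most reliable; the prompt wording and JSON schema are illustrative.

```python
# Zero-shot clinical NER with a simple JSON output format. The prompt
# wording and schema are illustrative, not the paper's exact setup.
import json

def build_ner_prompt(clinical_note: str) -> str:
    """Ask for diseases, symptoms, and procedures as one JSON object."""
    return (
        "Extract all diseases, symptoms, and medical procedures from the "
        "note below. Respond with one JSON object of the form "
        '{"diseases": [], "symptoms": [], "procedures": []}.\n\n'
        f"Note: {clinical_note}"
    )

def parse_entities(llm_output: str) -> dict:
    """Parse the model's JSON reply; fall back to empty lists on failure."""
    try:
        return json.loads(llm_output)
    except json.JSONDecodeError:
        return {"diseases": [], "symptoms": [], "procedures": []}
```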
