AI-Driven Developer Ecosystem

TL;DR

This paper introduces the AI-Driven Developer Ecosystem (AIDE), a unified, context-aware platform integrating large language models to enhance coding, CI/CD, issue classification, and collaboration, leading to improved code quality, reduced downtime, and higher developer satisfaction through continuous, intelligent support.

Abstract

The advent of Large Language Models (LLMs), including tools like GitHub Copilot and OpenAI Codex, has brought substantial changes to the field of software engineering. These technologies support developers through features such as automated code generation, smart code suggestions, and productivity enhancements. Despite these advancements, the development workflow is still scattered across multiple standalone tools used for coding, testing, documentation, and team communication. This lack of integration disrupts the development flow and negatively impacts overall team efficiency. To address these challenges, this paper proposes the AI-Driven Developer Ecosystem (AIDE)—a comprehensive development framework that harnesses the capabilities of LLMs while addressing gaps in tool interoperability and contextual awareness. AIDE functions as a unified, intelligent development environment that offers AI-assisted coding, predictive insights for continuous integration and deployment (CI/CD), automated issue classification, adaptive system architecture analysis, and harmonized documentation tools. AIDE sets itself apart from conventional development environments by providing continuous, context-aware support. It does this by analyzing real-time code changes, historical data, and team collaboration behavior. The platform also integrates collaborative tools such as Excalidraw for visual planning and embedded communication features for real-time coordination, promoting a deeply collaborative development experience that extends beyond code writing. By drawing from current academic research and industry practices, this paper illustrates how AIDE effectively addresses critical issues in intelligent software development, resulting in better code quality, minimized downtime, and increased developer satisfaction.

Similar Papers
  • Research Article
  • Cited by 1
  • 10.47363/jaicc/2023(2)442
AI-Powered Code Generation: Evaluating the Effectiveness of Large Language Models (LLMs) in Automated Software Development
  • Mar 31, 2023
  • Journal of Artificial Intelligence & Cloud Computing
  • Ravikanth Konda

The rapid evolution of Artificial Intelligence (AI) has brought about significant advancements in multiple domains, including software development. One of the most promising innovations is AI-powered code generation through Large Language Models (LLMs), such as OpenAI’s GPT-3 and GPT-4. These models, having been trained on large amounts of programming data, can produce human-readable code from natural language inputs, which holds considerable potential for simplifying and optimizing software development processes. The aim of this paper is to analyze the performance of LLMs in automated software development by testing them on a variety of tasks such as code generation, debugging, and software optimization. The research explores the strengths and weaknesses of these models in terms of key indicators such as code quality, generation time, and maintainability of the code. We observe that, although LLMs hold immense potential to automate mundane programming tasks and enhance developer productivity, they still struggle with more intricate, domain-specific programming tasks that require a higher level of understanding, for example, designing architectures and top-level decision-making. In spite of such shortcomings, LLMs can tremendously enhance software development processes, particularly for small-scale projects or as assistants to more senior developers. The paper concludes by reflecting on the potential for LLMs to transform software development processes in the future, while stressing that model reliability, code quality, and security must improve before LLMs can be applied to larger, more critical uses.

  • Research Article
  • Cited by 15
  • 10.1145/3715908
Large Language Model-Aware In-Context Learning for Code Generation
  • Feb 28, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Jia Li + 7 more

Large Language Models (LLMs) have shown impressive In-Context Learning (ICL) ability in code generation. LLMs take a prompt context consisting of a few demonstration examples and a new requirement as input, and output new programs without any parameter update. Existing studies have found that the performance of ICL-based code generation heavily depends on the quality of demonstration examples, which has motivated research on selecting demonstration examples: given a new requirement, a few demonstration examples are selected from a candidate pool, where LLMs are expected to learn the pattern hidden in these selected demonstration examples. Existing approaches are mostly based on heuristics or randomly selecting examples. However, the distribution of randomly selected examples usually varies greatly, making the performance of LLMs less robust. The heuristics retrieve examples by only considering textual similarities of requirements, leading to sub-optimal performance. To fill this gap, we propose a Large language model-Aware selection approach for In-context-Learning-based code generation named LAIL. LAIL uses LLMs themselves to select examples. It requires LLMs themselves to label a candidate example as a positive example or a negative example for a requirement. Positive examples are helpful for LLMs to generate correct programs, while negative examples are trivial and should be ignored. Based on the labeled positive and negative data, LAIL trains a model-aware retriever to learn the preference of LLMs and select demonstration examples that LLMs need. During the inference, given a new requirement, LAIL uses the trained retriever to select a few examples and feed them into LLMs to generate desired programs. We apply LAIL to four widely used LLMs and evaluate it on five code generation datasets. 
Extensive experiments demonstrate that LAIL outperforms the state-of-the-art (SOTA) baselines by 11.58%, 3.33%, and 5.07% on CodeGen-Multi-16B, 1.32%, 2.29%, and 1.20% on CodeLlama-34B, and achieves 4.38%, 2.85%, and 2.74% improvements on Text-davinci-003 in terms of Pass@1 at MBJP, MBPP, and MBCPP, respectively. In addition to function-level code generation, LAIL improves the performance of LLMs on DevEval, a repository-level code generation dataset, which achieves 10.04%, 8.12%, and 4.63% improvements compared to the SOTA baselines at Pass@1, 3, and 5 on CodeLlama-7B. Human evaluation further verifies that the generated programs of LAIL are superior in correctness, code quality, and maintainability. Besides, LAIL has satisfactory transferability across different LLMs and datasets, where the retriever learned on one LLM (dataset) can be transferred to other LLMs (datasets).
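
The Pass@1, Pass@3, and Pass@5 figures quoted above are instances of the pass@k metric. As a rough illustration only, the sketch below implements the widely used unbiased pass@k estimator: with n generated samples per problem, of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether the LAIL evaluation computes the metric in exactly this way is an assumption, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of programs sampled for the problem
    c: number of those programs that pass all tests
    k: budget of attempts being evaluated
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing program.
    # math.comb returns 0 when k > n - c, so the result is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1, k = 1), the estimator reduces to the plain fraction of problems solved on the first attempt.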

  • Research Article
  • 10.55041/isjem05150
Automating Software Release Notes with AI: A Comparative Study of Agent-Based Systems vs. LLM Fine-Tuning Approaches
  • Nov 17, 2025
  • International Scientific Journal of Engineering and Management
  • Abhishek Sharma

The increasing frequency of software deployments in Agile and DevOps-driven environments has amplified the need for efficient and accurate generation of release notes. These documents serve as essential communication artifacts that summarize code changes, feature enhancements, performance improvements, and bug fixes for internal stakeholders and end users. Traditionally, software release notes have been curated manually by developers, product managers, or technical writers—a process that is often time-consuming, inconsistent, and prone to human error. The rapid evolution of artificial intelligence (AI), particularly in the domains of intelligent agents and natural language processing (NLP), presents promising avenues for automating this critical yet repetitive task. This paper presents a comprehensive comparative study of two advanced AI methodologies: Agent-Based Systems (ABS) and Large Language Model (LLM) Fine-Tuning Approaches, with the aim of effectively and reliably automating software release note generation. Agent-Based Systems are rule-driven architectures composed of autonomous, goal-oriented agents that interact within defined environments. In the context of release note automation, these systems utilize structured event logs, commit metadata, and issue tracking systems to extract relevant data using ontologies and rule sets. The agents operate independently or cooperatively to detect, classify, and describe changes, and then convert those into standardized release summaries. Such systems offer advantages in scenarios where high levels of traceability, explainability, and control over the documentation process are required, such as in safety-critical or regulated software domains. On the other hand, LLM fine-tuning approaches leverage large-scale, pre-trained transformer models, which are further trained on domain-specific corpora, including annotated commit logs, pull request descriptions, and historical release notes. 
These models aim to infer intent and meaning from software development artifacts and generate fluent, human-like release documentation. Fine-tuned LLMs adapt to project-specific lexicons, programming idioms, and formatting standards without requiring explicitly encoded rules, making them highly suitable for dynamic and heterogeneous development environments. This research explores the operational, architectural, and performance distinctions between the two approaches using a rigorous experimental framework. The methodology involves collecting datasets from multiple open-source projects, including Kubernetes, TensorFlow, and Apache Kafka, which encompass tens of thousands of commit messages and their corresponding manually crafted release notes. A portion of the dataset is annotated to serve as a gold standard for supervised evaluation. Agent-based pipelines are constructed using a set of behavior trees and domain-specific rules. At the same time, LLM models are fine-tuned using techniques such as reinforcement learning with human feedback (RLHF), transfer learning, and low-rank adaptation (LoRA). Evaluation is conducted on metrics including semantic coverage (using BLEU and ROUGE scores), linguistic coherence (via BERTScore and human expert reviews), execution latency, scalability, and operational maintainability. The results indicate that LLM-based systems excel in natural language fluency, contextual generalization, and adaptability to evolving project vocabularies. However, they struggle with traceability and deterministic behavior in highly structured or compliance-sensitive contexts. Agent-based systems, while often more rigid and limited in language diversity, offer more substantial alignment with business logic and traceability for audit-ready documentation. A key contribution of this study is the design of a hybrid architecture that combines the deterministic preprocessing power of agents with the generative fluency of LLMs. 
In this setup, agents are responsible for extracting and organizing relevant data into structured templates, which are then passed to fine-tuned LLMs for natural language realization. This hybrid model shows promising results in achieving both accuracy and fluency, while reducing annotation and tuning overhead. Ultimately, this paper offers actionable insights for AI researchers, DevOps engineers, and product teams seeking to automate release documentation. It maps out the trade-offs between model interpretability, fluency, scalability, and compliance support, and suggests deployment patterns based on project size, regulatory requirements, and team maturity. As the landscape of AI-assisted software documentation continues to evolve, the findings of this study position both agent-based and LLM-based solutions as viable and potentially complementary options for organizations seeking to modernize their release management practices. Keywords- AI-assisted documentation, release note automation, agent-based systems, large language models, LLM fine-tuning, natural language generation, DevOps automation, software engineering, rule-based agents, transformer models, hybrid AI architectures, commit message analysis, GPT fine-tuning, software documentation intelligence, continuous delivery, change management.
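
The hybrid setup described in this abstract (rule-driven agents fill a structured template, which is then handed to a fine-tuned LLM) can be pictured with a small sketch. Everything below is hypothetical and is not the study's implementation: the regex rules, the category names, and the prompt wording are illustrative assumptions, and the LLM call itself is deliberately left out.

```python
import re
from collections import defaultdict

# Hypothetical rule set: regex on the commit subject -> release-note category.
RULES = [
    (re.compile(r"^(fix|bug)\b", re.IGNORECASE), "Bug fixes"),
    (re.compile(r"^(feat|add)\b", re.IGNORECASE), "Features"),
    (re.compile(r"^perf\b", re.IGNORECASE), "Performance"),
]


def classify(subject: str) -> str:
    """Deterministic agent step: the first matching rule wins."""
    for pattern, category in RULES:
        if pattern.match(subject):
            return category
    return "Other changes"


def build_template(commits: list[tuple[str, str]]) -> dict[str, list[dict[str, str]]]:
    """Group (sha, subject) pairs by category, keeping the sha for traceability."""
    grouped: dict[str, list[dict[str, str]]] = defaultdict(list)
    for sha, subject in commits:
        grouped[classify(subject)].append({"sha": sha, "subject": subject})
    return dict(grouped)


def render_prompt(template: dict[str, list[dict[str, str]]]) -> str:
    """Serialize the template into text that a language model could rewrite."""
    lines = ["Rewrite the following changes as release notes.",
             "Keep every commit id."]
    for category in sorted(template):
        lines.append(f"## {category}")
        for item in template[category]:
            lines.append(f"- [{item['sha']}] {item['subject']}")
    return "\n".join(lines)
```

Keeping the commit id in every template entry is one way to preserve the traceability that the abstract attributes to the agent-based side while leaving the wording to the generative model.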

  • Research Article
  • Cited by 4
  • 10.1145/3735129
MORepair: Teaching LLMs to Repair Code via Multi-Objective Fine-Tuning
  • Jan 21, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Boyang Yang + 7 more

Within the realm of software engineering, specialized tasks on code, such as program repair, present unique challenges, necessitating fine-tuning large language models (LLMs) to unlock state-of-the-art performance. Fine-tuning approaches proposed in the literature for LLMs on program repair tasks generally overlook the need to reason about the logic behind code changes, beyond syntactic patterns in the data. High-performing fine-tuning experiments also usually come at very high computational costs. With MORepair, we propose a novel perspective on the learning focus of LLM fine-tuning for program repair: we not only adapt the LLM parameters to the syntactic nuances of the task of code transformation (objective ➊), but we also specifically fine-tune the LLM with respect to the logical reason behind the code change in the training data (objective ➋). Such a multi-objective fine-tuning will instruct LLMs to generate high-quality patches. We apply MORepair to fine-tune four open-source LLMs with different sizes and architectures. Experimental results on function-level and repository-level repair benchmarks show that the implemented fine-tuning effectively boosts LLM repair performance by 11.4% to 56.0%. We further show that our fine-tuning strategy yields superior performance compared to the state-of-the-art approaches, including standard fine-tuning, Fine-tune-CoT, and RepairLLaMA.

  • Research Article
  • Cited by 27
  • 10.1145/3709358
Exploring the Capabilities of LLMs for Code-Change-Related Tasks
  • Jul 1, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Lishui Fan + 5 more

Developers deal with code-change-related tasks daily, e.g., reviewing code. Pre-trained code and code-change-oriented models have been adapted to help developers with such tasks. Recently, large language models (LLMs) have shown their effectiveness in code-related tasks. However, existing LLMs for code focus on general code syntax and semantics rather than the differences between two code versions. Thus, it is an open question how LLMs perform on code-change-related tasks. To answer this question, we conduct an empirical study using LLMs with more than 1B parameters on three code-change-related tasks, i.e., code review generation, commit message generation, and just-in-time comment update, with in-context learning (ICL) and parameter-efficient fine-tuning (PEFT, including LoRA and prefix-tuning). We observe that the performance of LLMs is poor without examples and generally improves with examples, but more examples do not always lead to better performance. LLMs tuned with LoRA have comparable performance to the state-of-the-art small pre-trained models. Larger models are not always better, but Llama 2 and Code Llama families are always the best. The best LLMs outperform small pre-trained models on the code changes that only modify comments and perform comparably on other code changes. We suggest future work should focus more on guiding LLMs to learn the knowledge specific to the changes related to code rather than comments for code-change-related tasks.

  • Research Article
  • 10.1109/access.2025.3633086
Comparative Analysis of the Code Generated by Popular Large Language Models (LLMs) for MISRA C++ Compliance
  • Jan 1, 2025
  • IEEE Access
  • Malik Muhammad Umer

Safety-critical systems are engineered systems whose failure or malfunction could result in catastrophic consequences. The software development for safety-critical systems necessitates rigorous engineering practices and adherence to certification standards like DO-178C for avionics. DO-178C is a guidance document which requires compliance to well-defined software coding standards like MISRA C++ to enforce coding guidelines that prevent the use of ambiguous, unsafe, or undefined constructs. Large Language Models (LLMs) have demonstrated significant capabilities in automatic code generation across a wide range of programming languages, including C++. Despite their impressive performance, code generated by LLMs in safety-critical domains must be carefully analyzed for conformance to MISRA C++ coding standards. In this paper, I have conducted a comparative analysis of the C++ code generated by popular LLMs including: OpenAI ChatGPT, Google Gemini, DeepSeek, Meta AI, and Microsoft Copilot for compliance with MISRA C++. The study revealed that none of the evaluated LLMs generated MISRA-compliant code despite clear prompts, with DeepSeek showing the fewest violations and Meta AI the most. While all models could correct individual violations when explicitly instructed, only ChatGPT consistently identified and resolved all targeted rule violations across complete code snippets, whereas others achieved partial success. Overall, LLMs show promise as aids for initial code generation, but they are not yet dependable for producing fully MISRA-compliant code required in safety-critical domains.

  • Book Chapter
  • Cited by 11
  • 10.1007/978-3-031-65630-9_15
Guiding Enumerative Program Synthesis with Large Language Models
  • Jan 1, 2024
  • Yixuan Li + 2 more

Pre-trained Large Language Models (LLMs) are beginning to dominate the discourse around automatic code generation with natural language specifications. In contrast, the best-performing synthesizers in the domain of formal synthesis with precise logical specifications are still based on enumerative algorithms. In this paper, we evaluate the abilities of LLMs to solve formal synthesis benchmarks by carefully crafting a library of prompts for the domain. When one-shot synthesis fails, we propose a novel enumerative synthesis algorithm, which integrates calls to an LLM into a weighted probabilistic search. This allows the synthesizer to provide the LLM with information about the progress of the enumerator, and the LLM to provide the enumerator with syntactic guidance in an iterative loop. We evaluate our techniques on benchmarks from the Syntax-Guided Synthesis (SyGuS) competition. We find that GPT-3.5 as a stand-alone tool for formal synthesis is easily outperformed by state-of-the-art formal synthesis algorithms, but our approach integrating the LLM into an enumerative synthesis algorithm shows significant performance gains over both the LLM and the enumerative synthesizer alone and the winning SyGuS competition tool.

  • Research Article
  • 10.55041/ijsrem36242
ProgAI: Enhancing Code Generation with LLMs For Real World Challenges
  • Jul 4, 2024
  • International Journal of Scientific Research in Engineering and Management
  • Afsal Ahamad A + 2 more

Large Language Models (LLMs) have shown promise in automated code generation but generate code units with errors for reasons such as hallucinations. Real-world software development, however, often involves complex requirements with complex dependencies and extensive documentation. To fill this gap, our research pivots towards evaluating LLMs in a more realistic setting: real-world repo-level code generation. We introduce ProgAI, a manually curated LLM for proficient code generation. This LLM supports code generation in four programming languages, namely C++, Java, Python, and C. We assess nine leading LLMs on code generation tasks and observe a decline in their performance. To tackle this, we present ProgAI, a novel LLM-based agent framework that employs external tools for effective code generation. ProgAI integrates four programming tools, enabling interaction with software artifacts for information retrieval, code symbol navigation, and code testing. We implement four agent strategies to optimize these tools’ usage. Our experiments on ProgAI show that ProgAI enhances LLM performance significantly, with improvements ranging from 18.1% to 25%. Further tests on the HumanEval benchmark confirm ProgAI’s adaptability and efficacy across various code generation tasks. Notably, ProgAI outperforms commercial products like GitHub Copilot, showcasing superior accuracy and efficiency. These results demonstrate ProgAI’s robust capabilities in code generation, highlighting its potential for real-world repo-level coding challenges.

  • Research Article
  • 10.1145/3777383
Large Language Models for Code Translation: An In-Depth Analysis of Code Smells and Functional Correctness
  • Nov 19, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Christof Feischl + 1 more

The conversion of program code from a given source programming language (PL) to another target PL is known as code translation, and has a wide applicability. Since Large Language Models (LLMs) have shown remarkable performance across different application fields, research considers LLMs to mitigate shortcomings of traditional approaches in code translation. However, existing literature mainly focuses on code correctness and falls short of an investigation of the resulting code quality. Hence, we conduct an in-depth analysis of code smells and code correctness obtained by LLM-based code translations to fill this gap. We consider numerous LLMs, datasets, PLs, and prompts, and we reveal that the prompt selection may have a statistically significant impact on an LLM’s performance. Our analyses further indicate that the code quality can be considered a performance dimension largely independent of the code correctness. Moreover, the exploitation of an LLM’s non-determinism, an iterative repair approach, and the collaboration of LLMs may enhance the performance if used accordingly. Surprisingly, we find that a backtranslation approach poses a viable way for mitigating code quality issues in the source, and that LLMs appear to reproduce code smells which were learned during the training process.

  • Research Article
  • Cited by 32
  • 10.1145/3643762
CORE: Resolving Code Quality Issues using LLMs
  • Jul 12, 2024
  • Proceedings of the ACM on Software Engineering
  • Nalin Wadhwa + 7 more

As software projects progress, quality of code assumes paramount importance as it affects reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra effort to revise their code to improve code quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo comprised of a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates which pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes which may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria that a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer. 
We conduct a variety of experiments on two public benchmarks to show the ability of CORE: (1) to generate code revisions acceptable to both static analysis tools and human reviewers (the latter evaluated with a user study on a subset of the Python benchmark), (2) to reduce human review effort by detecting and eliminating revisions with unintended changes, (3) to readily work across multiple languages (Python and Java), static analysis tools (CodeQL and SonarQube) and quality checks (52 and 10 checks, respectively), and (4) to achieve a fix rate comparable to a rule-based automated program repair tool but with much less engineering effort (on the Java benchmark). CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM reduced false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort. We release code, data, and supplementary material publicly at http://aka.ms/COREMSRI.
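
The proposer and ranker flow described in this abstract (generate candidates, keep those that pass the static check, then order them by a ranker score) can be sketched as a small pipeline. This is only an illustration under stated assumptions, not CORE's actual code: the three callables stand in for the proposer LLM, the static analysis tool, and the ranker LLM, and the `min_score` cutoff is a hypothetical parameter.

```python
from typing import Callable


def rank_revisions(
    source: str,
    recommendation: str,
    propose: Callable[[str, str], list[str]],   # stand-in for the proposer LLM
    passes_check: Callable[[str], bool],        # stand-in for the static analysis tool
    score: Callable[[str, str], float],         # stand-in for the ranker LLM
    min_score: float = 0.5,                     # hypothetical rejection threshold
) -> list[tuple[float, str]]:
    """Return (score, candidate) pairs, best first."""
    candidates = propose(source, recommendation)
    # Drop exact duplicates (order-preserving), then retain only the
    # candidates that pass the static quality check.
    retained = [c for c in dict.fromkeys(candidates) if passes_check(c)]
    # Score each retained candidate against the original source.
    scored = [(score(source, c), c) for c in retained]
    # Discard low-scoring candidates, e.g. suspected unintended changes.
    kept = [(s, c) for s, c in scored if s >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Passing the three stages in as plain functions keeps the sketch testable with stubs; a real system would replace them with model and tool invocations.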

  • Research Article
  • 10.3991/ijim.v20i03.59861
Large Language Model Selection for Test-Driven Prompt Android iOS Development
  • Feb 13, 2026
  • International Journal of Interactive Mobile Technologies (iJIM)
  • Muhammad Rizqullah + 1 more

Large language model (LLM) code generation research predominantly focuses on Python, with test-driven prompt engineering exclusively targeting this language. This study presents a comprehensive LLM selection framework for mobile development through rigorous empirical analysis. We conducted 8,704 evaluations across 544 programming tasks (HumanEval and MBPP datasets) on Android (Java) and iOS (Swift) platforms using four state-of-the-art LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, and Qwen 32B), two prompting strategies (base and test-driven), and two metrics (accuracy and remediation accuracy). Systematic analysis of platform-specific patterns yielded a decision tree incorporating first-attempt correctness, budget constraints, and self-hosting requirements, validated through three industry-relevant use cases. Results show test-driven prompting (TDP) achieves a +2.22 pp average accuracy improvement over baseline (95% CI [1.22–3.23 pp], p < 0.001, d = 0.3974). However, LLMs consistently underperform in mobile development (66.85%–88.87%) compared to Python-based code generation (86.90%–91.30%) regardless of model size or type. This framework establishes groundwork for platform-specific optimizations while providing practitioners with actionable guidance for model selection in mobile development contexts.

  • Research Article
  • Cited by 61
  • 10.1088/1361-6552/ad1fa2
The impact of AI in physics education: a comprehensive review from GCSE to university levels
  • Feb 6, 2024
  • Physics Education
  • Will Yeadon + 1 more

With the rapid evolution of artificial intelligence (AI), its potential implications for higher education have become a focal point of interest. This study delves into the capabilities of AI in physics education and offers actionable AI policy recommendations. Using OpenAI’s flagship gpt-3.5-turbo large language model (LLM), we assessed its ability to answer 1337 physics exam questions spanning general certificate of secondary education (GCSE), A-Level, and introductory university curricula. We employed various AI prompting techniques: Zero Shot, in-context learning, and confirmatory checking, which merges chain of thought reasoning with reflection. The proficiency of gpt-3.5-turbo varied across academic levels: it scored an average of 83.4% on GCSE, 63.8% on A-Level, and 37.4% on university-level questions, with an overall average of 59.9% using the most effective prompting technique. In a separate test, the LLM’s accuracy on 5000 mathematical operations was found to be 45.2%. When evaluated as a marking tool, the LLM’s concordance with human markers averaged at 50.8%, with notable inaccuracies in marking straightforward questions, like multiple-choice. Given these results, our recommendations underscore caution: while current LLMs can consistently perform well on physics questions at earlier educational stages, their efficacy diminishes with advanced content and complex calculations. LLM outputs often showcase novel methods not in the syllabus, excessive verbosity, and miscalculations in basic arithmetic. This suggests that at university, there’s no substantial threat from LLMs for non-invigilated physics questions. However, given the LLMs’ considerable proficiency in writing physics essays and coding abilities, non-invigilated examinations of these skills in physics are highly vulnerable to automated completion by LLMs. This vulnerability also extends to physics questions pitched at lower academic levels. 
It is thus recommended that educators be transparent about LLM capabilities with their students, while emphasizing caution against overreliance on their output due to its tendency to sound plausible but be incorrect.

  • Research Article
  • Cited by 91
  • 10.1136/bmj-2023-078538
Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis
  • Mar 20, 2024
  • BMJ
  • Bradley D Menz + 13 more

Objectives: To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health disinformation, and to evaluate the transparency of artificial intelligence (AI) developers regarding...

  • Research Article
  • Cited by 2
  • 10.1109/tse.2025.3619281
Towards Secure Code Generation With LLMs: A Study on Common Weakness Enumeration
  • Dec 1, 2025
  • IEEE Transactions on Software Engineering
  • Jianguo Zhao + 6 more

Automated code generation has revolutionized software development, enabling developers to accelerate project timelines and reduce manual coding errors significantly. As reliance on these technologies grows, the inherent weaknesses of generated code become increasingly apparent. Recent studies have shown that code produced by AI is not inherently safer or of higher quality than human-written code, often replicating existing vulnerabilities. To this end, we propose SECURECODER, which integrates Retrieval-Augmented Generation (RAG) with Common Weakness Enumeration (CWE). SECURECODER first utilizes the advanced reasoning capabilities of large language models (LLMs) to generate natural language descriptions of the code’s core business logic and functionality. Then, from a semantic perspective, it matches the requirements of the code generation task with the CWE descriptions through a multi-label classification process. Finally, based on the matched CWE, SECURECODER generates a list of security guidelines the code generation model must adhere to. Breaking down end-to-end code generation tasks into single-target tasks that LLMs excel at ensures that the generated code not only meets functional requirements but also adheres to best security practices, thereby enhancing the interpretability of the automated code generation process. After evaluating 2 programming languages and 7 LLMs on Copilot-generated code, SECURECODER has great generalization capability and could be applied to more programming languages and vulnerability types. SECURECODER could significantly decrease the security weakness in the AI-generated code and is able to mitigate more than 65% of vulnerabilities exposed to software developers. Compared to the baseline open-source LLMs, code vulnerabilities were reduced by at least 14% and the code business logic was not affected.
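
The three steps named in this abstract (describe the task, match it to CWE entries with a multi-label step, then turn the matches into security guidelines for the generator) can be pictured with a toy sketch. It is an assumption-heavy illustration, not SECURECODER itself: keyword overlap stands in for the LLM-based multi-label classifier, and the miniature table, its keywords, and the guideline wording are all hypothetical. Only the CWE identifiers are real entries.

```python
import re

# Hypothetical miniature table: CWE id -> (trigger keywords, guideline text).
CWE_TABLE = {
    "CWE-89": ({"sql", "query", "database"},
               "Use parameterized queries; never concatenate user input into SQL."),
    "CWE-79": ({"html", "render", "template"},
               "Escape user-controlled data before writing it into HTML."),
    "CWE-22": ({"file", "path", "upload"},
               "Resolve and validate paths against an allowed base directory."),
}


def match_cwes(description: str) -> list[str]:
    """Multi-label matching: return every CWE whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return [cwe for cwe, (keywords, _) in CWE_TABLE.items() if words & keywords]


def build_secure_prompt(task: str, description: str) -> str:
    """Attach the guidelines of all matched CWEs to the generation task."""
    matched = match_cwes(description)
    lines = [f"Task: {task}", "Security guidelines:"]
    lines += [f"- ({cwe}) {CWE_TABLE[cwe][1]}" for cwe in matched]
    if not matched:
        lines.append("- (none matched)")
    return "\n".join(lines)
```

Because the matching is multi-label, one task description can pull in several guidelines at once, which mirrors the abstract's point about breaking generation into narrower, single-target steps.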

  • Conference Article
  • Cited by 2
  • 10.1109/tps-isa67132.2025.00026
LLMalMorph: On the Feasibility of Generating Variant Malware Using Large-Language-Models
  • Nov 12, 2025
  • Md Ajwad Akil + 6 more

Large Language Models (LLMs) have transformed software development and automated code generation. Motivated by these advancements, this paper explores the feasibility of LLMs in modifying malware source code to generate variants. We introduce LLMalMorph, a semi-automated framework that leverages semantical and syntactical code comprehension by LLMs to generate new malware variants. LLMalMorph extracts function-level information from the malware source code and employs custom-engineered prompts coupled with strategically defined code transformations to guide the LLM in generating variants without resource-intensive fine-tuning. To evaluate LLMalMorph, we collected 10 diverse Windows malware samples of varying types, complexity and functionality and generated 618 variants. Our experiments demonstrate that LLMalMorph variants can effectively evade antivirus engines, achieving typical detection rate reductions of 10-15% across multiple complex samples. Furthermore, without explicitly targeting learning-based detectors, LLMalMorph attained attack success rates of up to 91% against a Machine Learning (ML)-based malware detector. We also discuss the limitations of current LLM capabilities in generating malware variants from source code and assess where this emerging technology stands in the broader context of malware variant generation.
