- Research Article
- 10.1145/3798054
- Mar 9, 2026
- ACM Transactions on Software Engineering and Methodology
- Zirui Chen + 3 more
Maintenance is a critical stage in the software lifecycle, ensuring that post-release systems remain reliable, efficient, and adaptable. However, manual software maintenance is labor-intensive, time-consuming, and error-prone, which highlights the urgent need for automation. Learning from maintenance activities conducted on other software systems offers an effective way to improve efficiency. In particular, recent research has demonstrated that migration-based approaches, which transfer knowledge, artifacts, or solutions from one system to another, show strong potential in tasks such as API evolution adaptation, software testing, and patch migration for fault correction. This makes migration-based maintenance a valuable research direction for advancing automated maintenance. This paper takes a step further by presenting the first systematic research agenda on migration-based approaches to software maintenance. We characterize the migration-based maintenance lifecycle through four key stages: ❶ identifying a maintenance task that can be addressed through migration, ❷ selecting suitable migration sources for the target project, ❸ matching relevant data across systems and adapting the migrated data to the target context, and ❹ validating the correctness of the migration. We also analyze the challenges that may arise at each stage. Our goal is to encourage the community to explore migration-based approaches more thoroughly and to tackle the key challenges that must be solved to advance automated software maintenance.
- Research Article
- 10.1145/3799693
- Mar 2, 2026
- ACM Transactions on Software Engineering and Methodology
- Quanjun Zhang + 6 more
Automated Program Repair (APR) attempts to patch software bugs and reduce manual debugging efforts. Very recently, with the advances in Large Language Models (LLMs), a rapidly increasing number of APR techniques have been proposed, significantly facilitating software development and maintenance and demonstrating remarkable performance. However, due to ongoing explorations in the LLM-based APR field, it is challenging for researchers to understand the current achievements, challenges, and potential opportunities. This work provides the first systematic literature review to summarize the applications of LLMs in APR between 2020 and 2025. We analyze 189 relevant papers from the perspectives of LLMs, APR, and their integration. First, we categorize existing popular LLMs that are applied to support APR and outline four types of utilization strategies for their deployment. In addition, we detail some specific repair scenarios that benefit from LLMs, e.g., semantic bugs and security vulnerabilities. Furthermore, we discuss several critical aspects of integrating LLMs into APR research, e.g., input forms and open science. Finally, we highlight a set of challenges remaining to be investigated and potential guidelines for future research. Overall, our paper provides a systematic overview of the research landscape to the APR community, helping researchers gain a comprehensive understanding of existing achievements and promoting future research. Our artifacts are publicly available at the GitHub repository: https://github.com/iSEngLab/AwesomeLLM4APR.
- Research Article
- 10.1145/3793863
- Feb 24, 2026
- ACM Transactions on Software Engineering and Methodology
- Haonan Chen + 3 more
With the rapid growth of OpenHarmony, a new distributed operating system (OS), security and privacy issues have become major concerns. While taint analysis has proven effective in Android, OpenHarmony still lacks a comparable framework. Additionally, OpenHarmony's programming language, ArkTS, has several unique features compared to other languages like TypeScript and Java, including complex lifecycles, closure mechanisms, and a distinct API ecosystem. Directly adapting existing Android taint analysis tools is therefore ineffective for OpenHarmony applications. To address this challenge, we propose HapFlow, a novel taint analysis framework tailored specifically for the OpenHarmony platform and ArkTS programs. This work presents: (1) an LLM-assisted method for identifying and categorizing source APIs for OpenHarmony, (2) a modeling approach for lifecycles and callback functions that supports the declarative UI in ArkTS, and (3) an IFDS-based taint propagation extension that accurately handles closure-induced cross-function data flows in OpenHarmony applications. Together, these enable HapFlow to perform precise inter-procedural data-flow analysis for OpenHarmony applications. We validate HapFlow's effectiveness on the HapBench benchmark, achieving 96.15% precision and 94.34% recall. Furthermore, HapFlow identifies 73 sensitive data leak flows across over 3,000 open-source projects with 8 false positives, completing over 98% of analyses within 10 seconds. These results demonstrate HapFlow's practicability and scalability for taint analysis in OpenHarmony applications.
- Research Article
- 10.1145/3797910
- Feb 23, 2026
- ACM Transactions on Software Engineering and Methodology
- Chenxi Zhang + 8 more
With the increasing complexity of modern online service systems, understanding the state and behavior of the systems is essential for ensuring their reliability and stability. Therefore, metric monitoring systems are widely used and have become important infrastructure in online service systems. Engineers usually interact with metrics data by manually writing domain-specific language (DSL) queries to achieve various analysis objectives. However, writing these queries can be challenging and time-consuming, as it requires engineers to have high programming skills and understand the context of the system. In this paper, we focus on PromQL, which is the metric query DSL provided by the widely used metric monitoring system Prometheus. We aim to simplify metrics querying by enabling engineers to interact with metrics data in Prometheus through natural language, and we call this task text-to-PromQL. Building upon this insight, this paper proposes PromCopilot, a Large Language Model-based text-to-PromQL framework. PromCopilot first uses a knowledge graph to describe the complex context of a cloud native online service system. Then, through the synergistic reasoning of LLMs and the knowledge graph, PromCopilot transforms engineers’ natural language questions into PromQL queries. To evaluate PromCopilot, we manually construct the first text-to-PromQL benchmark dataset, which contains 280 metric query questions. The experiment results show that PromCopilot is effective in text-to-PromQL. When using GPT-4 as the backbone LLM, PromCopilot achieves an accuracy of 69.1% in translating natural language questions to PromQL queries. To the best of our knowledge, this paper is the first study of text-to-PromQL, and PromCopilot pioneers a DSL generation framework for metric querying and analysis.
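To make the text-to-PromQL task concrete, the sketch below pairs natural-language questions with PromQL queries over standard Prometheus metrics. The questions and metric names (`http_requests_total`, `node_cpu_seconds_total`) are illustrative assumptions, not drawn from PromCopilot's benchmark; the queries themselves use ordinary PromQL constructs (`rate`, `avg by`, the built-in `up` metric).

```python
# Illustrative text-to-PromQL pairs: each natural-language question maps
# to a valid PromQL query. Metric names are hypothetical examples.
examples = {
    "What is the per-second HTTP request rate over the last 5 minutes?":
        "rate(http_requests_total[5m])",
    "What is the average non-idle CPU rate per instance?":
        'avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))',
    "How many scrape targets are currently down?":
        "count(up == 0)",
}

for question, query in examples.items():
    print(f"Q: {question}")
    print(f"   PromQL: {query}")
```

A text-to-PromQL system is judged on producing the right-hand side given only the left-hand side plus system context, which is what makes the knowledge-graph grounding in PromCopilot necessary: metric names and label sets are deployment-specific.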
- Research Article
- 10.1145/3797277
- Feb 20, 2026
- ACM Transactions on Software Engineering and Methodology
- Xueying Du + 11 more
Although LLMs have shown promising potential in vulnerability detection, this study reveals their limitations in distinguishing between vulnerable and similar-but-benign patched code (only 0.06–0.14 accuracy). This indicates that LLMs struggle to capture the root causes of vulnerabilities during vulnerability detection. To address this challenge, we propose enhancing LLMs with multi-dimensional vulnerability knowledge distilled from historical vulnerabilities and fixes. We design a novel knowledge-level Retrieval-Augmented Generation framework, Vul-RAG, which improves LLM accuracy by 16%–24% in distinguishing vulnerable from patched code. Additionally, the vulnerability knowledge generated by Vul-RAG can further (1) serve as high-quality explanations that improve manual detection accuracy (from 60% to 77%), and (2) detect 10 previously unknown bugs in the recent Linux kernel release, with 6 assigned CVEs.
- Research Article
- 10.1145/3797276
- Feb 19, 2026
- ACM Transactions on Software Engineering and Methodology
- Junda He + 7 more
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks from code generation to program repair, producing a massive volume of software artifacts. This surge in automated creation has exposed a critical bottleneck: the lack of scalable and reliable methods to evaluate the quality of these outputs. Human evaluation, while effective, is very costly and time-consuming. Traditional automated metrics like BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged. This approach leverages the advanced reasoning and coding capabilities of LLMs themselves to perform automated evaluations, offering a compelling path toward achieving both the nuance of human insight and the scalability of automated systems. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed. This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.
- Research Article
- 10.1145/3795887
- Feb 18, 2026
- ACM Transactions on Software Engineering and Methodology
- Weizhe Wang + 7 more
With the rapid advancement of Large Language Models (LLMs), traditional web testing research faces increasingly severe challenges and higher demands. To advance web testing toward higher levels of intelligence and automation, it is essential to systematically investigate the application and development of LLMs in this field. Accordingly, this paper proposes a comprehensive research roadmap to guide these efforts. We structure this roadmap around four pivotal research directions that span the entire testing process: (1) Adapting LLMs for Web Testing (Pre-Testing); (2) The Role of LLMs in Web Testing (In-Testing); (3) Results Analysis and Decision Support (Post-Testing); and (4) Necessity of Using LLMs in Web Testing. We contend that rigorous exploration within these domains is critical for realizing the next generation of automated and intelligent web testing frameworks. For each research direction, we summarize the latest progress in LLM applications for web testing and identify the challenges and remaining gaps that future research must address. While this roadmap is not exhaustive, our objective is to catalyze further inquiry within the academic community, thereby advancing the state-of-the-art and fully leveraging the potential of LLMs in web testing.
- Research Article
- 10.1145/3796519
- Feb 16, 2026
- ACM Transactions on Software Engineering and Methodology
- Xinyi Hou + 3 more
The Model Context Protocol (MCP) is an emerging open standard that defines a unified, bi-directional communication and dynamic discovery protocol between AI models and external tools or resources, aiming to enhance interoperability and reduce fragmentation across diverse systems. This paper conducts a systematic study of MCP from both architectural and security perspectives. We first define the full lifecycle of an MCP server, comprising four phases (creation, deployment, operation, and maintenance), further decomposed into 16 key activities that capture its functional evolution. Building on this lifecycle analysis, we construct a comprehensive threat taxonomy that categorizes security and privacy risks across four major attacker types: malicious developers, external attackers, malicious users, and security flaws, encompassing 16 distinct threat scenarios. To validate these risks, we develop and analyze real-world case studies that demonstrate concrete attack surfaces and vulnerability manifestations within MCP implementations. Based on these findings, the paper proposes a set of fine-grained, actionable security safeguards tailored to each lifecycle phase and threat category, offering practical guidance for secure MCP adoption. We also analyze the current MCP landscape, covering industry adoption, integration patterns, and supporting tools, to identify its technological strengths as well as existing limitations that constrain broader deployment. Finally, we outline future research and development directions aimed at strengthening MCP’s standardization, trust boundaries, and sustainable growth within the evolving ecosystem of tool-augmented AI systems. All collected data and implementation examples are publicly available at https://github.com/security-pride/MCP_Landscape.
- Research Article
- 10.1145/3742475
- Feb 13, 2026
- ACM Transactions on Software Engineering and Methodology
- Skyler Grandel + 3 more
Software maintenance constitutes a substantial portion of the total lifetime costs of software, with a significant portion attributed to code comprehension. Software comprehension is eased by documentation such as comments that summarize and explain code. We present ComCat, an approach to automate comment generation by augmenting Large Language Models (LLMs) with expertise-guided context to target the annotation of source code with comments that improve comprehension. Our approach enables the selection of the most relevant and informative comments for a given snippet or file containing source code. We develop the ComCat pipeline to comment C/C++ files by (1) automatically identifying suitable locations in which to place comments, (2) predicting the most helpful type of comment for each location, and (3) generating a comment based on the selected location and comment type. In a human subject evaluation, we demonstrate that ComCat-generated comments significantly improve developer code comprehension across three indicative software engineering tasks by up to 13% for 80% of participants. In addition, we demonstrate that ComCat-generated comments are at least as accurate and readable as human-generated comments and are preferred over standard ChatGPT-generated comments for up to 92% of snippets of code. Furthermore, we develop and release a dataset containing source code snippets, human-written comments, and human-annotated comment categories. ComCat leverages LLMs to offer a significant improvement in code comprehension across a variety of human software engineering tasks.
- Research Article
- 10.1145/3735555
- Feb 13, 2026
- ACM Transactions on Software Engineering and Methodology
- Peihong Lin + 5 more
Directed Grey-Box Fuzzing (DGF) can improve bug exposure efficiency by stressing bug-prone areas. Recent studies have modeled DGF as the problem of finding and optimizing paths to reach target sites. However, they still face the “multi-path” challenge. When a target site is reachable by multiple paths, it is crucial to comprehensively evaluate and effectively select these paths, as this affects the fuzzer’s choice between reaching target sites via optimal paths and enhancing path diversity toward targets to expose hidden bugs in non-optimal paths. In this article, we propose MultiGo, a directed hybrid fuzzer designed for multi-path optimization. First, we propose a new fitness metric called path difficulty to comprehensively evaluate the promising paths. This metric uses the Poisson distribution to estimate the probability of exploring basic blocks along execution paths based on statistical block frequency, distinguishing between optimal and challenging paths. With path difficulty as a key factor, a customized Contextual Multi-Armed Bandit (CMAB) model is employed to efficiently optimize path scheduling by comprehensively considering the impact of testing conditions on path scheduling. We introduce the concept of the fuzzing context to represent and evaluate testing conditions, which encompass factors such as path characteristics (e.g., path difficulty), the testing agent (e.g., fuzzing or symbolic execution), and the testing goal (e.g., path exploitation or exploration). Then, the CMAB model predicts the expected rewards for scheduling paths under different testing agents and goals, thereby optimizing path scheduling. By leveraging the CMAB model, MultiGo enhances DGF’s capability to explore easier paths and symbolic execution’s capacity to handle more complex ones, enabling efficient target reaching through optimal paths while ensuring sufficient coverage of non-optimal paths. MultiGo is evaluated on 136 target sites of 41 real-world programs from 3 benchmarks. The experimental results show that MultiGo outperforms the state-of-the-art directed fuzzers (AFLGo, SelectFuzz, Beacon, WindRanger, and DAFL) and hybrid fuzzers (SymCC and SymGo) in reaching target sites and exposing known vulnerabilities. Moreover, MultiGo also discovered 14 undisclosed vulnerabilities.
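A Poisson-based path difficulty metric of the kind the abstract describes can be sketched as follows. Assuming a block observed with hit rate λ per execution is covered at least once with probability 1 − e^(−λ), a path's difficulty can be scored as the negative log-probability of covering all its blocks, so a single rarely hit block dominates the score. This is one plausible instantiation for illustration, not MultiGo's exact formula.

```python
import math

def block_hit_probability(freq: float) -> float:
    """Poisson model: with observed hit rate `freq` per execution, the
    probability the block is exercised at least once is 1 - e^(-freq)."""
    return 1.0 - math.exp(-freq)

def path_difficulty(block_freqs: list[float]) -> float:
    """Negative log-probability of covering every block on the path;
    blocks that are almost never hit make the path much harder."""
    return -sum(math.log(block_hit_probability(f)) for f in block_freqs)

# A frequently exercised path vs. one containing a rarely hit block.
easy = path_difficulty([5.0, 4.0, 6.0])   # all blocks hit often
hard = path_difficulty([5.0, 0.01, 6.0])  # one block almost never hit
print(f"easy path difficulty: {easy:.3f}")
print(f"hard path difficulty: {hard:.3f}")
```

Under such a metric a scheduler can route low-difficulty paths to the fuzzer and high-difficulty ones to symbolic execution, which is the division of labor the CMAB model in MultiGo is described as optimizing.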