- New
- Research Article
- 10.1145/3787101
- Jan 5, 2026
- ACM Transactions on Software Engineering and Methodology
- Xiaoning Ren + 5 more
In the realm of deep learning, a variety of neuron coverage criteria for Deep Neural Networks (DNNs) have been devised to assess the quality of test suites and to guide the generation of test inputs. Recently proposed criteria, which incorporate representation distribution and causal relationships, have infused fresh vitality into this field. However, previous work has focused primarily on Convolutional Neural Networks for computer vision, leaving a research gap in coverage testing for language models. Meanwhile, with the rise of large language models, transformer-based language models have become increasingly dominant, and numerous variants have emerged. Whether coverage criteria remain effective for transformer-based language tasks, especially given these newly introduced criteria, is therefore an open problem. To address it, this study evaluates a wide range of criteria, four well-established and two state-of-the-art, across three types of transformer-based models: encoder-only, decoder-only, and encoder-decoder. Building on previous research, we conduct a comprehensive evaluation across three key areas: regarding test suite properties, 1) error-revealing capability, i.e., sensitivity to adversarial examples; 2) diversity, i.e., distribution diversity and sample fairness (category diversity); and regarding test suite generation, 3) input generation guidance, i.e., the ability to guide the generation of more valuable samples. The experimental results show that the impact of coverage criteria is multifaceted. For the error-revealing capability of test suites, the additional coverage of erroneous samples over noise samples is only 0.32%. In terms of distribution diversity and sample fairness, 26 and 30 cases, respectively, out of 33 configurations are effectively evaluated.
Additionally, incorporating neuron-wise coverage guidance during test suite generation increases the production of adversarial samples only slightly, by 4.56%. In conclusion, while current coverage criteria can serve as an antidote for assessing simple diversity, they remain largely a placebo for the core task of revealing adversarial errors, particularly when relying on an individual criterion. Consequently, their practical application requires carefully weighing the trade-off between computational overhead and potential benefit, given the massive scale of Transformers. This low cost-effectiveness ultimately highlights the urgent need to develop more robust and efficient criteria designed specifically for Transformer-based models.
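To make the notion of a neuron-wise criterion concrete, here is a minimal sketch of Neuron Activation Coverage (NAC), one of the classic criteria such studies evaluate: the fraction of neurons that any test input drives above an activation threshold. The function name and the threshold value are illustrative assumptions, not the paper's implementation.

```python
def neuron_activation_coverage(activations, threshold=0.5):
    """activations: one list of per-neuron outputs (floats) per test input.
    A neuron counts as covered if at least one input pushes it above
    the threshold; coverage is covered neurons / total neurons."""
    if not activations:
        return 0.0
    n_neurons = len(activations[0])
    covered = set()
    for per_input in activations:
        for i, a in enumerate(per_input):
            if a > threshold:
                covered.add(i)
    return len(covered) / n_neurons

# Toy suite of two inputs over a 4-neuron layer:
suite = [[0.1, 0.7, 0.2, 0.0],
         [0.6, 0.2, 0.1, 0.3]]
print(neuron_activation_coverage(suite))  # 0.5: neurons 0 and 1 covered
```

Adding a test input only helps coverage if it activates a previously uncovered neuron, which is why the paper's 0.32% additional coverage for erroneous samples is such a weak signal.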
- Research Article
- 10.1145/3786773
- Dec 29, 2025
- ACM Transactions on Software Engineering and Methodology
- Sabato Nocera + 4 more
The increasing adoption of Artificial Intelligence (AI) in software systems has raised concerns regarding transparency, accountability, and security in the AI supply chain. To address these challenges, Artificial Intelligence Bills of Materials (AIBOMs) have emerged as structured artifacts documenting AI components, datasets, tools, and methodologies used in AI-enabled systems. This paper presents the results of a Multivocal Literature Review (MLR) on AIBOMs. An MLR systematically maps existing evidence from both formal literature (FL) and gray literature (GL) to understand emerging themes in evolving fields. Our MLR synthesizes insights from FL and GL to understand the benefits, regulatory implications, structural elements, applications, and challenges behind the use of AIBOMs. Our work aims to inform practitioners and researchers about the state of AIBOMs and their role in fostering responsible AI use and deployment. The results of our MLR suggest that AIBOMs improve quality, traceability, management, and compliance in AI-enabled systems by documenting models, datasets, and their relationships. However, several challenges remain to be addressed, including immature generation and consumption tools, limited data source availability, poor interoperability with existing infrastructures, and limited stakeholder awareness.
- Research Article
- 10.1145/3786602
- Dec 29, 2025
- ACM Transactions on Software Engineering and Methodology
- Shanquan Gao + 4 more
Many third-party library (TPL) recommendation methods have been developed to help app developers find suitable TPLs, but current methods face two common limitations when evaluating the match between an app and a candidate TPL. First, they mainly perform this evaluation based on the TPL context while ignoring the app's function information. In fact, since TPLs are primarily used to support the implementation of various app functions, this information is critical for making an informed decision. Second, they focus on the relationship between the candidate TPL and the entire TPL context; however, a candidate TPL is worth considering as long as it can collaborate with certain TPLs in that context. In this study, we propose a novel model called Atten-TPL that evaluates the match between the app and the candidate TPL by mining the relationships among the candidate TPL, the TPL context, and app functions. By using an attention mechanism, Atten-TPL pays more attention to the TPLs in the TPL context that belong to the same task domain as the candidate TPL, thus mitigating the second limitation of current methods. Experiments indicate that Atten-TPL outperforms prevalent recommendation methods.
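The attention idea above can be sketched in a few lines: score the candidate TPL's embedding against each context TPL's embedding, softmax the scores into weights, and pool the context accordingly, so same-domain TPLs dominate the pooled representation. This is a generic dot-product attention sketch; the embeddings and names are assumptions, not Atten-TPL's actual architecture.

```python
import math

def attention_pool(candidate, context):
    """Dot-product attention: weight each context TPL embedding by its
    softmaxed similarity to the candidate, then return the weights and
    the weighted sum of context embeddings."""
    scores = [sum(c * x for c, x in zip(candidate, ctx)) for ctx in context]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(candidate)
    pooled = [sum(w * ctx[i] for w, ctx in zip(weights, context))
              for i in range(dim)]
    return weights, pooled

# A context TPL aligned with the candidate receives the larger weight:
w, _ = attention_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(w[0] > w[1])  # True
```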
- Research Article
- 10.1145/3786771
- Dec 29, 2025
- ACM Transactions on Software Engineering and Methodology
- Xing Hu + 7 more
Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating their effectiveness is crucial for understanding their potential in this field. In recent years, substantial efforts have been made to assess LLM performance in various SE tasks, resulting in the creation of several benchmarks tailored to this purpose. This paper offers a thorough review of 291 benchmarks, addressing three main aspects: what benchmarks are available, how benchmarks are constructed, and the future outlook for these benchmarks. We begin by examining SE tasks such as requirements engineering and design, coding assistance, software testing, AIOps, software maintenance, and quality management. We then analyze the benchmarks and their development processes, highlighting the limitations of existing benchmarks. Additionally, we discuss the successes and failures of LLMs in different software tasks and explore future opportunities and challenges for SE-related benchmarks. We aim to provide a comprehensive overview of benchmark research in SE and offer insights to support the creation of more effective evaluation tools.
- Research Article
- 10.1145/3786776
- Dec 29, 2025
- ACM Transactions on Software Engineering and Methodology
- Yuechen Li + 2 more
This paper presents a Replicated Computational Results (RCR) report for our article “Preparation and Utilization of Mixed States for Testing Quantum Programs” accepted by ACM TOSEM. The article proposes a novel type of test case tailored for unit testing of Quantum Programs (QPs), i.e., Mixed-State Test Cases (MSTCs). Compared to the Pure-State Test Cases (PSTCs) adopted in previous related works, which consider only pure states as test inputs, MSTCs can incorporate mixed states in the input domain of QPs. As claimed in our article, MSTCs improve test efficiency when covering a given input domain and also contribute to test effectiveness owing to their propensity to detect more faults. This RCR report describes how to examine the functionality of our related artifacts and replicate the empirical results of our article. We have made our artifacts publicly available, including complete code, raw data, and detailed documentation, which not only facilitates result replication but also enhances the potential for reuse in future studies.
- Research Article
- 10.1145/3785469
- Dec 24, 2025
- ACM Transactions on Software Engineering and Methodology
- Jia Xu + 5 more
Code search, which retrieves relevant code snippets from a large codebase based on natural language queries, significantly enhances software development efficiency and quality, and effective code representation is crucial for its success. While recent studies leverage both syntactic and semantic structures of code snippets to improve representation, they leave two key challenges inadequately addressed: (1) capturing long-range dependencies in these structures, and (2) effectively fusing syntactic and semantic information. To overcome these limitations, we propose CoSrch, a deep model for syntactic- and semantic-aware code representation. CoSrch first encodes code snippets as graphs to enable structured alignment and fusion, then employs a GNN-based graph encoder to capture long-range dependencies (addressing Challenge 1), along with an overlap-aware modality decomposition and fusion framework that eliminates redundancy when fusing syntactic and semantic information (addressing Challenge 2). Extensive experiments on prominent benchmarks demonstrate CoSrch’s superiority: it achieves at least a 7.60% improvement in SuccessRate@1 and a 5.41% gain in MRR on the CSN-Java dataset over the best baseline, validating its effectiveness in advancing code search performance.
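The "GNN-based graph encoder" step amounts to iterated message passing over the code graph, which is what lets information flow between syntactically distant nodes. The sketch below shows one mean-aggregation layer over a node-feature list and an edge list; it is a generic GNN layer under assumed inputs, not CoSrch's actual encoder.

```python
def gnn_layer(features, edges):
    """One mean-aggregation message-passing step: each node's new
    feature vector is the average of its own and its neighbours'
    features. Stacking k such layers propagates information across
    paths of length k, capturing longer-range dependencies."""
    n = len(features)
    neigh = {i: [i] for i in range(n)}   # include self-loop
    for u, v in edges:
        neigh[u].append(v)
        neigh[v].append(u)
    dim = len(features[0])
    out = []
    for i in range(n):
        acc = [0.0] * dim
        for j in neigh[i]:
            for d in range(dim):
                acc[d] += features[j][d]
        out.append([a / len(neigh[i]) for a in acc])
    return out

# Path graph 0-1-2 with 1-D features: one layer already mixes neighbours.
print(gnn_layer([[1.0], [3.0], [5.0]], [(0, 1), (1, 2)]))
# [[2.0], [3.0], [4.0]]
```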
- Research Article
- 10.1145/3786330
- Dec 23, 2025
- ACM Transactions on Software Engineering and Methodology
- Carol Hanna + 3 more
A hot fix is an unplanned improvement to a specific time-critical issue deployed to a software system in production. While hot fixing is an essential and common activity in software maintenance, it has never been surveyed as a research activity, so such a review is long overdue. In this paper, we conduct a comprehensive literature review of work on hot fixing. We highlight the fields where this topic has been addressed, inconsistencies we identified in the terminology, gaps in the literature, and directions for future work. Our search yielded 140 articles on the topic published between 1986 and 2024. These articles span many research areas, such as log analysis, runtime patching (also known as hot patching), and automated repair, as well as various application domains, such as security, mobile, and video games. We find that hot fix research can move forward in many directions, such as unifying existing terminology, establishing a benchmark set of hot fixes, studying the costs and frequency of hot fixes, and exploring end-to-end automation of detection, mitigation, and deployment. We discuss these avenues in detail to inspire the community to systematize hot fixing as a software engineering activity.
- Research Article
- 10.1145/3785479
- Dec 23, 2025
- ACM Transactions on Software Engineering and Methodology
- Xiao Long + 4 more
AI coding assistants (ACATs) are reshaping computer science (CS) education, yet students’ perceptions of and responses to ACATs’ suggestions remain poorly understood, especially regarding behavioral patterns, decision-making, and usability challenges. To address this gap, we conducted a study with 27 CS students, examining their interactions with three widely used ACATs across five key dimensions: interaction frequency and acceptance rate, self-perceived productivity, behavioral patterns, decision-making factors, and challenges and expectations. To support this investigation, we developed an experimental platform incorporating a VSCode extension for log data collection, screen recording, and automatic generation of personalized interview and survey questions. Our findings reveal substantial variation in ACAT acceptance rates depending on task type, recommendation method, and content. We propose a novel five-layer interaction behavior model that captures the different stages of user interaction. Notable insights include the problem-solving value of rejected AI suggestions, the inefficiencies introduced when suggestions modify existing code, which often lead to backtracking, and the high stability of “slowly accepted” suggestions. Moreover, we identify 22 decision-making factors, 11 challenges, and 23 student expectations for future ACAT improvements, such as enhanced debugging accuracy and adaptive learning of individual coding styles. This study contributes actionable design implications for improving ACAT usability, informing student interaction strategies, and guiding future research in human-software interaction, ultimately aiming to better support CS education.
- Research Article
- 10.1145/3783989
- Dec 23, 2025
- ACM Transactions on Software Engineering and Methodology
- Marius Smytzek + 3 more
Fault localization aims to identify code regions responsible for failures. Traditional techniques primarily correlate statement execution with failures; however, program behavior involves diverse execution features, including variable values, branch conditions, and definition-use pairs, which can provide richer diagnostic insights. This paper comprehensively investigates execution features for fault understanding, addressing two complementary goals. First, we conduct an empirical study of 310 bugs across 20 projects, analyzing 17 execution features and assessing their correlation with failure outcomes. Our findings suggest that fault localization benefits from a broader range of execution features: (1) Scalar pairs exhibit the strongest correlation with failures; (2) Beyond line executions, def-use pairs and functions executed are key indicators for fault localization; and (3) Combining multiple features enhances effectiveness compared to relying on individual features. Second, building on these insights, we introduce a debugging approach that learns relevant features from labeled test outcomes. The approach extracts fine-grained execution features and trains a decision tree to differentiate passing and failing runs. The trained model generates fault diagnoses that explain the underlying causes of failures. Our evaluation demonstrates that the generated diagnoses achieve high predictive accuracy. These interpretable diagnoses empower developers to debug software efficiently by providing deeper insights into failures.
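The learning step described above, training a decision tree on execution features labeled by test outcome, can be illustrated with a single-split stump over binary features (e.g., "def-use pair d executed" or "branch b taken"). This is a deliberately minimal stand-in for a full decision tree, with hypothetical feature vectors, not the paper's implementation.

```python
def best_stump(runs, labels):
    """Pick the binary execution feature whose presence best separates
    failing (True) from passing (False) runs, scored by plain accuracy.
    runs: one 0/1 feature vector per test run; labels: True = failing.
    Returns (feature_index, accuracy) of the best single-feature split."""
    n_feats = len(runs[0])
    best_feat, best_acc = None, -1.0
    for f in range(n_feats):
        # A run is "predicted failing" iff feature f was exercised.
        correct = sum(1 for r, y in zip(runs, labels) if bool(r[f]) == y)
        acc = correct / len(runs)
        if acc > best_acc:
            best_feat, best_acc = f, acc
    return best_feat, best_acc

# Feature 2 (say, a particular def-use pair) fires exactly in failing runs:
runs = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]]
labels = [True, True, False, False]
print(best_stump(runs, labels))  # (2, 1.0)
```

A real decision tree recurses on such splits, and the resulting path from root to leaf reads as an interpretable diagnosis ("fails whenever this def-use pair occurs and that branch is taken").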
- Research Article
- 10.1145/3785472
- Dec 22, 2025
- ACM Transactions on Software Engineering and Methodology
- Souvick Das + 3 more
Ensuring compliance with regulations poses considerable challenges for software development, particularly during the requirements specification phase. Traditional methods rely heavily on manual inspections that are time-consuming and prone to errors. This research proposes an innovative framework that leverages the synergy of multiple AI agents to partially automate software requirements compliance verification. The framework integrates Large Language Models (LLMs), prompt engineering, and Retrieval-Augmented Generation (RAG) to analyze, detect, and revise non-compliant requirements. The core of our proposal lies in multi-agent communication, where distinct AI agents collaborate toward the overarching goal of compliance checking. LLMs comprehend requirements specifications, while prompt engineering steers them toward compliance-related aspects. RAG techniques detect non-compliant requirements and suggest changes. Finally, a robust Human-in-the-Loop mechanism ensures accuracy, reliability, and adaptability. A tool, available online, implements the approach for practical application. We discuss its ability to identify non-compliant requirements in an extensive experimental evaluation.
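The retrieval half of such a RAG pipeline can be sketched independently of any LLM: given a requirement, fetch the most similar regulation clause, which is then handed to the model for a compliance judgement. The bag-of-words cosine retriever below is a toy assumption for illustration; production systems would use learned embeddings and a vector store.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    num = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def retrieve(requirement, clauses):
    """Return the regulation clause most lexically similar to the
    requirement -- the 'R' in RAG, before LLM-based analysis."""
    q = Counter(requirement.lower().split())
    scored = [(cosine(q, Counter(c.lower().split())), c) for c in clauses]
    return max(scored)[1]

clauses = ["data must be encrypted at rest",
           "logs retained 30 days"]
print(retrieve("store user data encrypted", clauses))
# data must be encrypted at rest
```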