  • New
  • Research Article
  • 10.1145/3787972
VulDeNoise: Outlier Detection to Reduce Label Noises for Effective Vulnerability Detection
  • Feb 7, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Yutao Hu + 4 more

While automated vulnerability detection approaches, especially those using Graph Neural Networks (GNNs), have shown remarkable promise, their effectiveness is often constrained by significant label noise in prevalent vulnerability datasets. To address this, we propose VulDeNoise, an innovative dataset denoising framework. Our approach is grounded in multi-view learning theory, postulating that correctly labeled samples exhibit consistent training dynamics across different code graph representations, whereas mislabeled samples manifest significant discrepancies due to inherent semantic conflicts. VulDeNoise operationalizes this by training a target detector on each of the three code representation graphs, constructing a loss vector for each sample from its cross-view training loss sequences, and then employing an ensemble of outlier detection algorithms to find noisy instances. We conducted extensive experiments on four prominent GNN-based detectors (Devign, DeepWukong, ReVEAL, and IVDetect) using the Big-Vul and FFmpeg+QEMU datasets. The results demonstrate that training on the Big-Vul dataset denoised by VulDeNoise consistently enhances the F1-score of these four detectors by 5-10%. In controlled experiments on the FFmpeg+QEMU dataset, where label noise was synthetically added at varying ratios, VulDeNoise achieved a denoising F1-score of up to 70%, demonstrating its high effectiveness in identifying noisy labels. Furthermore, VulDeNoise substantially outperforms state-of-the-art denoising methods such as Confident Learning and Differential Training, and even surpasses a Large Language Model (LLM)-based auditing approach. Ablation studies confirm the robustness of our design, revealing that the synergy of all three code representations and a carefully selected training duration are essential for optimal performance.
VulDeNoise offers an effective, automated solution for improving the quality of vulnerability datasets, paving the way for more reliable deep learning-based vulnerability detection models.
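The cross-view idea in the abstract above can be sketched generically: if a sample's training losses disagree across code-graph views, it is a candidate mislabel. The sketch below is illustrative only; it uses synthetic loss values and a simple robust z-score on the cross-view loss spread in place of the paper's detector ensemble, and none of its names, view counts, or thresholds come from VulDeNoise.

```python
# Illustrative sketch, NOT the authors' code: flag samples whose training
# losses conflict across views as likely label noise.
import random
import statistics

random.seed(0)
n_samples, n_views = 200, 3

# Simulate final per-view training losses: clean samples are low-loss in
# every view; mislabeled samples conflict, so one view stays high-loss.
losses = [[random.uniform(0.0, 0.2) for _ in range(n_views)]
          for _ in range(n_samples)]
noisy = set(random.sample(range(n_samples), 20))
for i in noisy:
    losses[i][0] += random.uniform(0.8, 1.2)  # injected cross-view conflict

# Score each sample by its cross-view loss spread; mislabeled samples
# disagree across views, so their spread is an outlier.
spread = [max(v) - min(v) for v in losses]
med = statistics.median(spread)
mad = statistics.median(abs(s - med) for s in spread) or 1e-9

# Robust z-score cutoff (3.5 is a conventional, assumed threshold).
flagged = {i for i, s in enumerate(spread) if (s - med) / mad > 3.5}
print(f"flagged {len(flagged)} samples; "
      f"recovered {len(flagged & noisy)}/{len(noisy)} injected noisy labels")
```

With real detectors, the per-sample loss sequences would come from training on each code graph, and the single z-score test would be replaced by an ensemble of outlier detectors.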

  • New
  • Research Article
  • 10.1145/3796225
Boosting Metamorphic Testing: A General Metamorphic Specification Language and A Supporting System
  • Feb 7, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Chang-Ai Sun + 3 more

Metamorphic testing (MT) is a black-box testing technique that alleviates the oracle problem by leveraging Metamorphic Relations (MRs) based on the domain knowledge of the software under test. In recent years, software testing researchers have made significant advances in MT's fundamental theories (e.g., MR identification and composition), methodologies (e.g., test input generation), and fault detection effectiveness across various application domains. However, some major challenges of MT remain to be addressed, such as the need for a general MR description language and for an automated system that supports all major steps of MT and integrates the various MT tasks. To address these challenges, we have developed a general MR description language (the Category-Choice Metamorphic specification Language, abbreviated as CCML), on top of which we have built an automated supporting tool (the Category-Choice Metamorphic testing Tool, abbreviated as CCMT) that integrates the various MT tasks. CCMT supports the automatic generation of test inputs and composite MRs and integrates various run-time optimization strategies. We have also conducted empirical studies to evaluate the expressiveness of CCML and the performance of CCMT in various testing aspects. Overall, our empirical findings are encouraging and demonstrate the merits of CCML and CCMT. In this regard, our work contributes to improving the fault detection effectiveness, efficiency, and practicality of MT and, hence, brings the use of MT to new heights.

  • New
  • Research Article
  • 10.1145/3796239
Unsupervised, Accurate, and Efficient Log Parsing Using Smaller Open-Source Large Language Models
  • Feb 7, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Zeyang Ma + 2 more

Log parsing transforms unstructured logs into structured templates for downstream analysis. Syntax-based parsers are fast but lose accuracy on logs that deviate from predefined rules. Recently, log parsers based on large language models (LLMs) have shown superior parsing accuracy but face three issues: (1) manual labeling for fine-tuning or in-context learning, (2) high cost due to large log volumes and LLMs' limited context size, and (3) privacy risks with commercial models. We present LibreLog, an unsupervised approach using open-source LLMs to enhance privacy and reduce cost while achieving state-of-the-art accuracy. LibreLog groups logs with a fixed-depth tree, then parses each group via: (i) similarity-scoring-based retrieval-augmented generation, (ii) self-reflection to refine templates, and (iii) a template memory to reduce LLM queries. On LogHub-2.0, LibreLog achieves a GA of 87.2, PA of 85.4, FGA of 82.3, and FTA of 65.1, with its PA and FTA outperforming prior state-of-the-art LLM-based parsers by 13.7% and 6.9%, respectively. LibreLog processes all logs in 5.94 hours, a 1.7x speedup over the fastest LLM-based parser. Using a larger LLM only for self-reflection further improves PA to 86.3 and FTA to 68.3 with a moderate (31%) increase in runtime cost. In short, LibreLog addresses the privacy and cost concerns of using commercial LLMs while achieving state-of-the-art parsing efficiency and accuracy.
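The fixed-depth-tree grouping step mentioned in the abstract (a Drain-style pre-partitioning of logs before any LLM call) can be sketched as follows. This is a minimal illustration, assuming grouping by token count plus a two-token prefix with numeric tokens masked; it is not LibreLog's actual implementation, and the depth and masking rule are assumed values.

```python
# Illustrative Drain-style grouping sketch (assumed details, not LibreLog's
# code): bucket logs by token count and their first few tokens, so each
# bucket is likely to share one template.
from collections import defaultdict

DEPTH = 2  # number of leading tokens used as the tree path (assumed)

def group_key(line: str) -> tuple:
    tokens = line.split()
    # Mask pure-number tokens so variable positions don't split groups.
    prefix = tuple("<*>" if t.isdigit() else t for t in tokens[:DEPTH])
    return (len(tokens), *prefix)

logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.9 closed",
    "Disk usage at 91 percent",
    "Disk usage at 77 percent",
]

groups = defaultdict(list)
for line in logs:
    groups[group_key(line)].append(line)

for key, members in groups.items():
    print(key, "->", len(members), "logs")
```

Each resulting group would then be handed to the LLM once (with retrieved similar templates as context), rather than querying the model per log line.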

  • New
  • Research Article
  • 10.1145/3789503
Contributions, Collaborations, and Transitions: Paid and Volunteer Developers in the Rust Community
  • Feb 6, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Yuxia Zhang + 5 more

An increasing number of companies are contributing to open source software (OSS) projects by assigning their employees to advance their business objectives. These paid developers collaborate with volunteer contributors, but the differing motivations of these two groups can sometimes lead to conflicts, which might endanger the OSS project's sustainability. This article presents a multi-method comparative study of paid developers and volunteers in Rust, currently one of the most popular open source programming languages. We compare volunteers and paid developers through contribution behavior, social collaboration, and long-term participation. Then, we solicit volunteers’ perceptions of paid developers and explore the emotions caused when volunteers transition to paid roles. We find that core paid developers tend to contribute more frequently; peripheral paid developers contribute bigger commits and focus more on implementing features; both core and peripheral paid developers collaborate more with volunteers but less intensively than expected; and being paid correlates positively with becoming a long-term contributor. Our study also reveals existing unfamiliarity and prejudices among volunteers towards paid developers, and that volunteer-to-paid transitions can evoke negative community sentiments. This study suggests that the dichotomous view of paid vs. volunteer developers is too simplistic and that further subgroups could be identified. Contributing organizations should become more sensitive to how OSS communities perceive them when they attempt to get involved and make improvements.

  • New
  • Research Article
  • 10.1145/3795773
Actionable Framework for Understanding and Improving Social and Human Factors that Influence the Requirements Management in Software Ecosystems
  • Feb 4, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Rodrigo Feitosa + 4 more

In software ecosystems (SECO), requirements management requires the collaboration of multiple actors (end-users and external developers). It is therefore essential to consider social and human factors (SHF) when performing requirements management activities in SECO. This research aims to identify which SHF influence requirements management activities in SECO through a rapid review (RR) and semi-structured interviews with professionals who perform requirements management activities in SECO. Our findings elucidate the SHF that influence requirements management in SECO, as well as the contextual characteristics that impact these factors. We also identify strategies employed by professionals to improve SHF and the barriers that stand in their way. Finally, we describe coping mechanisms that professionals use when SHF cannot be sufficiently improved. Our findings lead to the construction of an actionable framework for understanding and improving SHF in requirements management activities in SECO. We evaluated the framework through a focus group comprising subject-matter experts, resulting in its final version, titled SHFiRM-SECO. A feasibility study was then conducted to assess the practical applicability of SHFiRM-SECO. The study involved professionals performing requirements management activities and confirmed the framework's usefulness and ease of use for enhancing efficiency in managing SHF within SECO. The framework serves as a go-to reference for requirements professionals and key organizations that aim to enhance productivity and effectiveness in requirements management within SECO, and will also support researchers in refining and instantiating it in future work.

  • New
  • Research Article
  • 10.1145/3793675
Less Is More: Failing Test Generation with Large Language Models
  • Feb 4, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Tsz-On Li + 8 more

Failing test generation is challenging. It involves searching a vast space for fault-triggering test inputs and for the oracles asserting these faulty executions. Despite techniques proposed to generate tests using large language models (LLMs), they are ineffective at finding failing tests, particularly for programs that implement non-trivial coding tasks such as medium/advanced-level coding contest problems. To tackle this limitation, we are inspired by an earlier finding that the constituent snippets within a program typically implement simpler coding tasks than the program as a whole. As a result, LLMs can be leveraged to generate failing tests that target a program's constituent snippets, thereby revealing the program's defects. Leveraging this insight, we propose Microscopic Test Generation (MitGen), a novel failing test generation technique. Unlike previous approaches that generate tests to satisfy code coverage criteria, MitGen focuses on generating tests that reveal faults in a given program's constituent code snippets. We evaluate MitGen using Starcoder2-15B-instruct-v0.1, Meta-Llama-3-8B-Instruct, and CodeQwen1.5-7B-Chat on two popular benchmarks (EvoEval-Difficult and ClassEval) and 100 real-world subjects. We compare MitGen with three baselines, including the state-of-the-art approaches Differential Prompting and Pynguin, in finding failing tests. The evaluation results show that MitGen's recall is 0.66, a 112.7% improvement over the best baseline (0.31).

  • New
  • Research Article
  • 10.1145/3788879
A Research Roadmap for Augmenting Software Engineering Processes and Software Products with Generative AI
  • Jan 30, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Domenico Amalfitano + 10 more

Generative AI (GenAI) is rapidly transforming software engineering (SE) practices, influencing how SE processes are executed, as well as how software systems are developed, operated, and evolved. This paper applies design science research to build a roadmap for GenAI-augmented SE. The process consists of three cycles that incrementally integrate multiple sources of evidence, including collaborative discussions from the FSE 2025 “Software Engineering 2030” workshop, rapid literature reviews, and external feedback sessions involving peers. McLuhan’s tetrads were used as a conceptual instrument to systematically capture the transforming effects of GenAI on SE processes and software products. The resulting roadmap identifies four fundamental forms of GenAI augmentation in SE and systematically characterizes their related research challenges and opportunities. These insights are then consolidated into a set of future research directions. By grounding the roadmap in a rigorous multi-cycle process and cross-validating it among independent author teams and peers, the study provides a transparent and reproducible foundation for analyzing how GenAI affects SE processes, methods and tools, and for framing future research within this rapidly evolving area.

  • New
  • Research Article
  • 10.1145/3789665
How Hard Can It Be? Quantifying MITRE Attack Campaigns with Attack Trees and cATM Logic
  • Jan 28, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Stefano M Nicoletti + 4 more

In the cyber threat landscape, Advanced Persistent Threats carry out attack campaigns—e.g., operations Dream Job, Wocao, and WannaCry—against which cybersecurity practitioners must defend. To prioritise which of these to defend against, experts must be able to evaluate the most threatening ones: they would strongly benefit from (a) an estimation of the likelihood of each attack recorded in the wild, and (b) a transparent way to operationalise these values and compare campaigns quantitatively. Here we construct such a framework by: (1) quantifying the likelihood of attack campaigns via data-driven procedures on the MITRE knowledge base, (2) introducing a methodology for automatically modelling MITRE intelligence data, which captures any attack campaign via template attack tree models, and (3) proposing an open-source tool that performs these comparisons based on the cATM logic. Finally, we quantify the likelihood of all MITRE Enterprise campaigns, and compare the likelihoods of the Wocao and Dream Job MITRE campaigns—generated with our proposed approach—against manually built attack tree models. We demonstrate that our methodology requires substantially less modelling effort while capturing all the relevant quantitative data. To ensure broader applicability, further validation with cybersecurity experts is recommended, especially by sourcing more manually built models.
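The bottom-up likelihood propagation that attack-tree quantification typically relies on can be sketched as follows, assuming independent basic attack steps. The gate rules (AND as product of child likelihoods, OR as complement of the product of failure probabilities) are the standard ones for this independence assumption; the example tree and its probabilities are invented for illustration and are not MITRE data or the paper's cATM semantics.

```python
# Illustrative bottom-up attack-tree evaluation under independence
# (standard gate rules; the tree below is invented, not MITRE data).
def likelihood(node) -> float:
    kind = node[0]
    if kind == "leaf":          # ("leaf", p): basic attack step
        return node[1]
    probs = [likelihood(c) for c in node[1]]
    if kind == "and":           # all children must succeed
        p = 1.0
        for q in probs:
            p *= q
        return p
    if kind == "or":            # at least one child succeeds
        fail = 1.0
        for q in probs:
            fail *= 1.0 - q
        return 1.0 - fail
    raise ValueError(f"unknown node kind: {kind}")

# Tiny example: initial access via phishing OR a public exploit,
# followed by payload execution.
tree = ("and", [
    ("or", [("leaf", 0.3), ("leaf", 0.2)]),  # phishing / public exploit
    ("leaf", 0.9),                           # payload execution
])
print(round(likelihood(tree), 4))
```

Comparing two campaigns then reduces to evaluating each campaign's tree and ranking the resulting likelihoods, which is the kind of transparent operationalisation the abstract argues for.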

  • New
  • Research Article
  • 10.1145/3788873
A First Look at Bugs in LLM Inference Engines
  • Jan 28, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Mugeng Liu + 7 more

Large language model-specific inference engines (LLM inference engines for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities of cross-platform compatibility. However, a systematic understanding of these bugs remains lacking. To bridge this gap, we present the first empirical study of bugs in LLM inference engines. We mine the official repositories of 5 widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Through a rigorous open-coding process, we analyze these bugs to uncover their symptoms, root causes, commonality, fix effort, fix strategies, and temporal evolution. Our findings reveal six bug symptom types and a taxonomy of 28 root causes, shedding light on the key challenges of bug detection and localization in LLM inference engines. Based on these insights, we propose a series of actionable implications for researchers, inference engine vendors, and LLM app developers, along with general guidelines for developing LLM inference engines.

  • New
  • Research Article
  • 10.1145/3789667
Code-Enhanced Cross-Perspective Bug Question Retrieval
  • Jan 28, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Mengzhen Wang + 6 more

The bug question retrieval task aims to identify the most relevant questions in databases to find solutions for specific bugs. Existing methods often treat this as a text matching problem, primarily leveraging semantic similarities between bug descriptions for retrieval. However, these methods often overlook the semantic gap that arises when users describe bugs from different perspectives, which significantly hampers retrieval performance. To address this challenge, we first propose the Cross-Perspective Retrieval (CPR) model, which integrates a Semantic Association Module and an Information Fusion Module to align descriptions effectively, using code as auxiliary information. The Semantic Association Module establishes semantic connections between descriptions by extracting implicit information from the code and developing a coherent semantic context. Meanwhile, the Information Fusion Module employs modality contrastive learning to integrate information from both the code and the descriptions. Furthermore, we introduce CPRSearchNet, a new dataset specifically designed for cross-perspective bug question retrieval. CPRSearchNet comprises 8,785 samples, each including bug descriptions from three distinct perspectives alongside the corresponding code context, filling a critical gap in existing datasets. Experiments demonstrate that CPR significantly outperforms existing baselines on the cross-perspective bug question retrieval task, yielding substantial improvements in R@K and MRR.