Related Topics

  • Test Input Generation
  • Test Oracles
  • Black-box Testing
  • White-box Testing

Articles published on Test Inputs

1024 Search results
  • Research Article
  • 10.1007/s10439-025-03917-6
In Vivo Cervical Spine Posture Changes During Non-impact Inverted Freefalls.
  • May 1, 2026
  • Annals of biomedical engineering
  • Loay Al-Salehi + 3 more

Axial headfirst impacts can cause catastrophic cervical spine injuries when the head is rapidly decelerated and the cervical spine is compressed by the torso's inertia. The goal of this study was to quantify the cervical vertebral translations, rotations, eccentricity and curvature, and the head rotation in human subjects exposed to non-impact inverted freefalls that represented the pre-impact dynamics of a headfirst impact. Eleven human subjects were exposed to 4 headfirst freefalls (2 relaxed, 2 with pre-bracing) while secured in a race car seat fixed to a carriage that was inverted and released to freefall over 312.5 ms (0.479 m) before being decelerated to rest. Sagittal fluoroscopy of the cervical spine was acquired and analyzed at freefall onset and end to extract vertebral and head posture variables. Linear mixed models were used to assess the effect of time (freefall onset/end) and condition (relaxed/braced) on the posture variables. Subjects consistently moved their cervical spine anteriorly and inferiorly, rotated their vertebrae and head in flexion, and increased their spinal eccentricity. Small changes in spinal curvature and intervertebral angles suggested that the subjects responded using an "en bloc" rotation of the cervical spine and head about a point inferior to C6. Compared to the relaxed condition, pre-freefall bracing produced a different initial posture but a similar end posture. Despite considerable inter-subject variability, a consistent neck and head reorientation was observed, albeit with variable underlying segmental cervical spine posture, providing valuable input for cadaveric tests and computational models simulating headfirst impacts to improve injury prediction.

  • Research Article
  • 10.1080/00295450.2026.2636433
Toward Incorporating Epistemic Uncertainty of Neural Network–Based Turbulence Closures in RANS CFD Simulations
  • Apr 6, 2026
  • Nuclear Technology
  • Cody Grogan + 4 more

With increasing computational demand, neural network (NN)-based models are being developed as pretrained surrogates for different thermohydraulics phenomena. An area where this approach has shown promise is in developing higher-fidelity turbulence closures for computational fluid dynamics (CFD) simulations. The primary bottleneck to the widespread adoption of these NN-based closures for nuclear engineering applications is the uncertainty associated with them. The current paper illustrates three commonly used methods that can be used to quantify model uncertainty in NN-based turbulence closures. The NN model used for the current study is trained on data from an algebraic turbulence closure model. The uncertainty quantification (UQ) methods explored are deep ensembles, Monte Carlo dropout (MC-Dropout), and stochastic variational inference (SVI). The paper ends with a discussion on the relative performance of the three methods for quantifying epistemic uncertainties of NN-based turbulence closures and potentially how they could be further extended to quantify out-of-training uncertainties. For accuracy in turbulence modeling, this research finds that deep ensembles have the best prediction accuracy, with a root mean squared error of 4.31 × 10⁻⁴ on the testing inputs, followed by MC-Dropout and SVI. For UQ, this paper finds that each method produces unique epistemic uncertainty estimates, with deep ensembles being overconfident in regions, MC-Dropout being underconfident, and SVI producing principled uncertainty at the cost of function diversity. Finally, the paper lays out a strategy of how UQ of the NN-based turbulence closures could be incorporated into RANS CFD simulations.

  • Research Article
  • 10.1109/tvcg.2026.3658714
Interactive Visual Assessment for Text-to-Image Generation Models.
  • Apr 1, 2026
  • IEEE transactions on visualization and computer graphics
  • Xiaoyue Mi + 7 more

Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These approaches suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers of test inputs and reflects potential model failure patterns, supporting in-depth analysis using the logical reasoning ability of LLMs. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify up to 2.56 times more generation failures than conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.

  • Research Article
  • 10.1109/tse.2026.3675285
How Composite Metamorphic Relations Enhance Test Effectiveness of DNN Testing: An Empirical Study
  • Apr 1, 2026
  • IEEE Transactions on Software Engineering
  • Huayao Wu + 5 more

Metamorphic Testing (MT) is a powerful technique to alleviate the test oracle problem of DNN testing. At its core is a set of Metamorphic Relations (MRs), which indicate necessary properties that multiple test inputs and their outputs should satisfy. Recently, researchers have proposed to compose multiple individual MRs to create new Composite Metamorphic Relations (CMRs), in the hope of enhancing the cost-effectiveness of MT. However, the potential benefit of such compositions is yet to be systematically explored for DNN based systems. In this paper, we present the first empirical study that investigates the relative test effectiveness of CMRs over individual MRs in DNN testing. Our experiment is performed under five popular image recognition models and seven representative component MRs, based on which a total of 3,612 CMRs and more than 200 million pairs of source and follow-up test images are generated and exercised. The experimental results highlight the advantages of employing CMRs in DNN testing, as they generally exhibit superior failure and fault revelation effectiveness compared to any individual MR. Moreover, to further illuminate the conditions that allow CMRs to maximise their performance in DNN testing, we propose to analyse the geometric relationships of MRs in the latent embedding space of DNNs, thereby quantifying the extent to which different MRs complement each other. Accordingly, by opting for highly complementary MRs in composition, we can create CMRs that are most likely to enhance test effectiveness.
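
The idea of a metamorphic relation, and of composing several relations into one, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the study: `toy_classifier` is a hypothetical stand-in for a DNN, and the two transformations are simply ones that happen to preserve its output.

```python
from typing import Callable, List

Image = List[List[float]]
Transform = Callable[[Image], Image]


def toy_classifier(img: Image) -> int:
    """Hypothetical stand-in for a DNN: label 1 if the mean intensity exceeds 0.5."""
    pixels = [p for row in img for p in row]
    return int(sum(pixels) / len(pixels) > 0.5)


def flip_horizontal(img: Image) -> Image:
    """Component transformation 1: mirror every row."""
    return [list(reversed(row)) for row in img]


def transpose(img: Image) -> Image:
    """Component transformation 2: swap rows and columns."""
    return [list(col) for col in zip(*img)]


def compose(*transforms: Transform) -> Transform:
    """Build a composite transformation by applying the components in order."""
    def composite(img: Image) -> Image:
        for t in transforms:
            img = t(img)
        return img
    return composite


def violates_mr(model: Callable[[Image], int], transform: Transform, source: Image) -> bool:
    """MR checked here: a label-preserving transformation must not change the prediction."""
    follow_up = transform(source)
    return model(source) != model(follow_up)


source_images = [
    [[0.9, 0.8], [0.7, 0.1]],
    [[0.1, 0.2], [0.3, 0.9]],
]
cmr = compose(flip_horizontal, transpose)
violations = [violates_mr(toy_classifier, cmr, img) for img in source_images]
```

No oracle for the correct label is needed: each check only compares the prediction on a source input with the prediction on its follow-up input.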

  • Research Article
  • 10.1111/coep.70026
The value of hybrid evaluation methods for targeted social assistance programs: A case study of China's Dibao
  • Mar 14, 2026
  • Contemporary Economic Policy
  • Jiajun Lan + 3 more

Evaluating the targeting of large‐scale social assistance programs is crucial for improving public administration. Traditional approaches, relying solely on quantitative proxy means testing or qualitative community inputs, have limitations in capturing complex socioeconomic realities. This paper argues for employing hybrid methods that integrate quantitative and qualitative data sources to comprehensively assess targeting accuracy. Using China's Dibao program as a case study, we demonstrate that a hybrid approach reduced inclusion errors and enhanced program understanding among beneficiaries and communities. Our findings highlight the importance of collaborative evaluation mechanics that leverage localized knowledge while guarding against elite capture.

  • Research Article
  • Cite count: 2
  • 10.1145/3745765
Automated Unit Test Generation via Chain-of-Thought Prompt and Reinforcement Learning from Coverage Feedback
  • Mar 11, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Junwei Zhang + 4 more

Recently, Large Language Models (LLMs) have shown promising results in code generation, and several automated test generation approaches based on LLMs have been proposed. Although these approaches achieve promising performance, they suffer from two limitations. First, they lack the intrinsic understanding of the semantic intricacies and logical constructs inherent to the focal method. Second, they ignore the diversity of the generated tests and generate tests with limited code coverage. To alleviate these two limitations, in this work, we propose a novel approach named TestCTRL that optimizes LLMs for unit test generation by the Chain-of-Thought (CoT) prompt and Reinforcement Learning (RL) strategy. Specifically, we first build a new CoT dataset, containing the focal methods, corresponding unit tests, and CoT prompts. The CoT prompt includes the intention and possible test input values. Then, the CoT dataset is used to fine-tune one LLM (i.e., CodeLlama 7B) that can be seen as the policy model in RL. Meanwhile, we fine-tune another LLM (i.e., CodeGPT) as the reward model by predicting the line coverage of the focal method and its test. Moreover, we employ the Proximal Policy Optimization (PPO) algorithm to optimize the policy model and generate unit tests. We use the Defects4J benchmark to evaluate our approach from three perspectives (i.e., naturalness, validity, and code coverage). To avoid data leakage threats, we filtered out data from the CoT dataset that have the same focal method and test case names as those in Defects4J. The experimental results demonstrate that TestCTRL outperforms state-of-the-art baselines in both line and branch coverage. Besides, TestCTRL improves bug detection performance. We also investigate the reason for the proposed approach’s superiority.

  • Research Article
  • 10.1109/tse.2026.3655712
Subgraph-Oriented Testing for Deep Learning Libraries
  • Mar 1, 2026
  • IEEE Transactions on Software Engineering
  • Xiaoyuan Xie + 3 more

Deep Learning (DL) libraries, such as PyTorch, are widely used for building and deploying DL models on various hardware platforms. Meanwhile, they are found to contain bugs that lead to incorrect calculation results and cause issues like non-converging training and inaccurate prediction of DL models. Thus, many efforts have been made to test DL libraries and reveal bugs. However, existing DL library testing methods manifest limitations: model-level testing methods cause complexity in fault localization. Meanwhile, API-level testing methods often generate invalid inputs or primarily focus on extreme inputs that lead to crash failures; they also ignore testing realistic API interactions. These limitations may cause bugs to go undetected, even in frequently used APIs. To address these limitations, we propose SORT (Subgraph-Oriented Realistic Testing) to differentially test DL libraries on different hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model computation graphs, as test subjects. In this way, it introduces realistic API interaction sequences while maintaining efficiency in locating faulty APIs for observed errors. Besides, SORT prepares test inputs by referring to extensive features of runtime inputs for each API in executing real-life benchmark data. The generated inputs are expected to better simulate such valid real inputs and reveal bugs that are more likely to happen in real-life usage. Evaluation on 728 frequent subgraphs of 49 popular PyTorch models demonstrates that SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing. 18 precision bugs in PyTorch are identified and reported to PyTorch developers.

  • Research Article
  • 10.1145/3735552
Integrating Path Selection for Symbolic Execution and Variable Selection for Constraint Solving
  • Feb 13, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Shunkai Zhu + 4 more

Symbolic execution is a powerful technique that can accurately synthesize program inputs for program testing through constraint solving. Applying symbolic execution effectively means that we must solve two search problems efficiently. One is to search through the many program paths and the other is, given a particular path condition, to search through the numerous variable assignments to identify one satisfying solution. With few exceptions, existing symbolic execution engines treat constraint solvers as black boxes. As a result, the two searches are completely separated, which results in much redundancy (i.e., the same variable assignments may be tried for solving many program paths). Existing attempts at addressing this issue include those approaches based on constrained Horn clauses (in which the whole program is encoded as one constraint) and one preliminary attempt on caching and reusing partial solving results from the constraint solver. In this work, we propose SEC, which systematically computes the reward of concretizing a program path (for symbolic execution) and a variable (for constraint solving) and uses the reward as a guide for integrating the two searches. We implemented SEC based on KLEE and evaluated it on a diverse set of programs. The results show that SEC is effective, i.e., achieving 15% more code coverage than the state-of-the-art baseline symbolic execution engines. Furthermore, we show that SEC can be readily combined with a state-of-the-art concolic testing engine to improve its performance.

  • Research Article
  • 10.1007/s10664-026-10816-4
VLM-Fuzz: Vision language model assisted recursive depth-first search exploration for effective GUI testing of android apps
  • Feb 13, 2026
  • Empirical Software Engineering
  • Biniam Fisseha Demissie + 3 more

Testing Android apps effectively requires a systematic exploration of the app’s possible states by simulating user interactions and system events. While existing approaches have proposed several fuzzing techniques to generate various text inputs and trigger user and system events for GUI state exploration, achieving high code coverage remains a significant challenge in Android app testing. The main challenges are (1) reasoning about the complex and dynamic layout of GUI screens; (2) generating required inputs/events to deal with certain widgets like pop-ups; and (3) coordination between current test inputs and previous inputs to avoid getting stuck in the same GUI screen without improving test coverage. To address these problems, we propose VLM-Fuzz, a novel automated approach for Android GUI testing. At its foundation, VLM-Fuzz utilizes a heuristic-based, recursive depth-first search (DFS) strategy that is intelligently guided by a Vision Language Model (VLM) to effectively explore the app’s complex GUI states. The core innovation of VLM-Fuzz is not simply the use of a VLM, but its strategic, on-demand integration within a hybrid exploration framework. Our approach combines a fast, heuristic-based DFS for standard GUI interactions with targeted, VLM-assisted analysis for visually complex screens. We use static analysis to analyze the Android Manifest file and the runtime GUI hierarchy XML to extract the list of components, intent-filters and interactive GUI widgets. VLM is used to reason about complex GUI layout and widgets on an on-demand basis. Based on the inputs from static analysis, VLM, and the current GUI state, we use some heuristics to deal with the above-mentioned challenges. We evaluated VLM-Fuzz based on a benchmark containing 59 apps obtained from a recent work and compared it against two state-of-the-art approaches: APE and DeepGUI. 
VLM-Fuzz outperforms the best baseline by 9.0%, 3.7%, and 2.1% in terms of class coverage, method coverage, and line coverage, respectively. We also ran VLM-Fuzz on 80 recent Google Play apps (i.e., updated in 2024). VLM-Fuzz detected 52 unique crashes in 12 apps, which have been reported to respective developers.

  • Research Article
  • 10.1145/3796225
Boosting Metamorphic Testing: A General Metamorphic Specification Language and A Supporting System
  • Feb 7, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Chang-Ai Sun + 3 more

Metamorphic testing (MT) is a black-box testing technique to alleviate the oracle problem via leveraging Metamorphic Relations (MRs) based on the domain knowledge of the software under test. In recent years, software testing researchers have made significant research advancements on MT in fundamental theories (e.g., MR identification and composition), methodologies (e.g., test input generation), and fault detection effectiveness in various application domains. However, there still exist some major challenges of MT yet to be addressed, such as the needs for a general MR description language, and an automated system that supports all major steps of MT and integrates the various MT tasks. To address these challenges, we have developed a general MR description language (called the Category-Choice Metamorphic specification Language; abbreviated as CCML), through which an automated supporting tool (called the Category-Choice Metamorphic testing Tool; abbreviated as CCMT) integrating the various MT tasks has been built. CCMT supports the automatic generation of test inputs and composite MRs and integrates various run-time optimization strategies. We have also conducted empirical studies to evaluate the expressiveness of CCML and the performance of CCMT in various testing aspects. Overall, our empirical findings are encouraging, and have demonstrated the merits of CCML and CCMT. In this regard, our work contributes to improving the fault detection effectiveness, efficiency, and practicality of MT and, hence, brings the use of MT to a new height.

  • Research Article
  • 10.1145/3793675
Less Is More: Failing Test Generation with Large Language Models
  • Feb 4, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Tsz-On Li + 8 more

Failing test generation is challenging. It involves searching in a vast space for fault-triggering test inputs and the oracles asserting these faulty executions. Despite techniques proposed to generate tests using large language models (LLMs), they are ineffective in finding failing tests, particularly for programs that implement non-trivial coding tasks such as medium/advanced-level coding contest problems. To tackle this limitation, we are inspired by an earlier finding that constituent snippets within a program typically implement simpler coding tasks compared to the program as a whole. As a result, LLMs can be leveraged to generate failing tests that target a program’s constituent snippets, thereby revealing the program defects. Leveraging this insight, we propose Microscopic Test Generation (MitGen), a novel technique of failing test generation. Unlike previous approaches that generate tests to fulfill code coverage, MitGen focuses on generating tests that reveal faults in a given program’s constituent code snippets. We evaluate MitGen using Starcoder2-15B-instruct-v0.1, Meta-Llama-3-8B-Instruct and CodeQwen1.5-7B-Chat, on two popular benchmarks (EvoEval-Difficult and ClassEval) and 100 real-world subjects. We compare MitGen with three baselines, including state-of-the-art approaches (Differential Prompting and Pynguin) in finding failing tests. The evaluation results show that MitGen’s recall is 0.66, a 112.7% enhancement over the best baseline (0.31).

  • Research Article
  • 10.1038/s41598-026-38020-w
Generating borderline test samples for randomness testers via intelligent optimization and evolutionary algorithms.
  • Feb 4, 2026
  • Scientific reports
  • Peng Gao + 3 more

Ensuring information security heavily relies on high-quality random sequences for encryption keys. Physical entropy sources, despite their use in generating true random sequences, are susceptible to environmental disturbances, necessitating real-time randomness testing to maintain high entropy. However, existing methods for generating test data for real-time randomness testers face significant challenges, including producing sequences that fail to meet specific randomness criteria, constructing borderline sequences with slight non-randomness, and addressing the difficulty of simultaneously violating multiple randomness criteria. This paper introduces a dynamic test data generation framework designed to address these challenges. The framework leverages an evolutionary algorithm (EA) to transform the generation of borderline sequences into a multi-constrained optimization problem, where a large language model (LLM) acts as a dynamic parameter adjuster. By analyzing evolutionary trends in population statistics and interacting with evolutionary dynamics through a game-theoretic mechanism, the LLM adaptively adjusts scaling factors and weight coefficients, mitigating the curse of dimensionality in multi-objective optimization and enabling real-time parameter tuning. The experimental results also highlight the high quality of the generated sequences: our approach can generate borderline test data that slightly fail to satisfy the target randomness criteria, yet exhibit statistical properties very similar to those of high-entropy sources under standard test suites. These borderline sequences are fault-detectable and provide challenging, realistic test inputs for classical statistical-test-based real-time randomness testers.

  • Research Article
  • Cite count: 1
  • 10.1145/3730435
PriCod: Prioritizing Test Inputs for Compressed Deep Neural Networks
  • Jan 20, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Yinghua Li + 4 more

The widespread adoption of Deep Neural Networks (DNNs) has brought remarkable advances in machine learning. However, the computational and memory demands of complex DNNs hinder their deployment in resource-constrained environments. To address this challenge, compressed DNN models have emerged, offering a compromise between efficiency and accuracy. Nonetheless, assessing the performance of these compressed models can demand extensive testing, typically requiring high manual labeling costs, rendering the process resource-intensive and time-consuming. To mitigate these challenges, test input prioritization has emerged as a promising technique aimed at reducing labeling costs by prioritizing inputs that are more likely to be misclassified. This enables the early identification of bug-revealing tests with reduced time and manual labeling effort. In this article, we propose PriCod, a novel test prioritization approach designed for compressed DNNs. PriCod leverages the behavior disparities caused by model compression, along with the embeddings of test inputs, to effectively prioritize potentially misclassified tests. It operates on the premises that significant behavior disparities between the models indicate potential misclassifications and that inputs near decision boundaries are more likely to be misclassified. To this end, PriCod generates two types of features for each test input (i.e., deviation features and embedding features) to capture the prediction deviation caused by model compression and the proximity to decision boundaries, respectively. By combining these features, PriCod predicts the probability of misclassification for each test, ranking tests accordingly. We conduct an extensive study to evaluate the effectiveness of PriCod, comparing it with multiple test prioritization approaches. 
The experimental results demonstrate the effectiveness of PriCod, with average improvements of 7.43%–55.89% on natural test inputs, 7.92%–52.91% on noisy test inputs, and 7.03%–51.59% on adversarial test inputs, compared with existing test prioritization approaches.
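
The two premises in this abstract (large behavior disparities and closeness to a decision boundary both signal likely misclassification) can be illustrated with a deliberately simplified sketch. It is not PriCod: the abstract describes combining deviation and embedding features to predict a misclassification probability, whereas the sketch below swaps in a hand-written score and uses the output margin of the compressed model instead of embeddings. All function names and the sample probabilities are hypothetical.

```python
from typing import List, Sequence


def deviation(p_original: Sequence[float], p_compressed: Sequence[float]) -> float:
    """L1 distance between the class-probability vectors of the two models."""
    return sum(abs(a - b) for a, b in zip(p_original, p_compressed))


def boundary_proximity(p: Sequence[float]) -> float:
    """1 minus the margin between the top two probabilities (higher = closer to a boundary)."""
    top, second = sorted(p, reverse=True)[:2]
    return 1.0 - (top - second)


def prioritize(orig_probs: List[Sequence[float]], comp_probs: List[Sequence[float]]) -> List[int]:
    """Return test-input indices, most suspicious first (assumed additive score)."""
    scores = [
        deviation(po, pc) + boundary_proximity(pc)
        for po, pc in zip(orig_probs, comp_probs)
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical softmax outputs of an original and a compressed model on three inputs.
orig = [[0.90, 0.10], [0.60, 0.40], [0.80, 0.20]]
comp = [[0.88, 0.12], [0.45, 0.55], [0.70, 0.30]]
ranking = prioritize(orig, comp)
```

Input 1 is ranked first because the compressed model both disagrees with the original model and sits close to the decision boundary, so it would be labeled before the others.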

  • Research Article
  • 10.21275/sr26109233300
Eccentricity-Based Bounds for the Spectral Radius of Graph Matrices
  • Jan 15, 2026
  • International Journal of Science and Research (IJSR)
  • Sunilgar Gusai

This article presents a framework for bounding the spectral radius of classical graph matrices, specifically the adjacency and signless Laplacian matrices, using vertex eccentricity as a global structural parameter. By applying the Collatz-Wielandt characterization and Rayleigh quotient methods with the eccentricity vector as a test input, the study derives both upper and lower bounds that incorporate global distance distributions. The proposed bounds are shown to be sharp for regular and self-centered graph families and remain useful for irregular structures. Through these results, eccentricity emerges as a complementary control parameter to degree-based approaches, providing enhanced insight into how both local and global structures shape spectral behaviour.
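
As a rough illustration of using the eccentricity vector as a test vector, the sketch below computes the generic Rayleigh-quotient and Collatz-Wielandt bounds for the adjacency matrix of a connected graph. These are the textbook inequalities only; the specific bounds derived in the article are not reproduced here, and the function names and example graphs are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]


def eccentricities(adj: Graph) -> Dict[int, int]:
    """Eccentricity of every vertex of a connected graph, via BFS from each vertex."""
    ecc = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        ecc[source] = max(dist.values())
    return ecc


def spectral_radius_bounds(adj: Graph) -> Tuple[float, float]:
    """Lower and upper bounds on the adjacency spectral radius.

    Assumes a connected graph with at least two vertices, so the
    eccentricity vector x is strictly positive.
    """
    x = eccentricities(adj)
    ax = {u: sum(x[v] for v in adj[u]) for u in adj}  # entries of A x
    ratios = [ax[u] / x[u] for u in adj]              # Collatz-Wielandt ratios
    rayleigh = sum(x[u] * ax[u] for u in adj) / sum(x[u] ** 2 for u in adj)
    return max(min(ratios), rayleigh), max(ratios)


path3 = {0: [1], 1: [0, 2], 2: [1]}                   # spectral radius sqrt(2)
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # spectral radius 2
path_bounds = spectral_radius_bounds(path3)
cycle_bounds = spectral_radius_bounds(cycle4)
```

For the 4-cycle, which is regular and self-centered, the eccentricity vector is constant and both bounds coincide with the spectral radius 2; for the 3-vertex path the bounds are loose but still bracket √2.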

  • Research Article
  • 10.1145/3787101
Antidote or Placebo? Unraveling the Efficacy of Neuron Coverage Criteria on Testing Transformer-based Language Models
  • Jan 5, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Xiaoning Ren + 5 more

In the realm of deep learning, a variety of neuron coverage criteria for Deep Neural Networks (DNNs) have been devised to effectively assess the quality of test suites and facilitate the generation of test inputs. Recently proposed coverage criteria, incorporating representation distribution and causal relationships, have infused fresh vitality into this field. However, the focus of previous works is primarily on Convolutional Neural Networks for computer vision, leading to a research gap in exploring coverage testing for language models. Concurrently, with the rise of large language models, transformer-based language models have become increasingly dominant, and numerous ones have sprouted. Therefore, the effectiveness of coverage criteria in transformer-based language tasks, especially with the introduction of novel criteria, remains an unresolved open problem. To tackle it, this study examines these concerns by evaluating a wide range of criteria, including four well-established ones and two state-of-the-art criteria, across three types of transformer-based models: encoder-only, decoder-only, and encoder-decoder models. Building on previous research, we conduct a comprehensive evaluation across three key areas: regarding test suite properties, 1) Error-revealing capability, i.e., sensitivity to adversarial examples; 2) Diversity, i.e., distribution diversity and sample fairness (category diversity); and regarding test suite generation, 3) Input generation guidance, i.e., the ability to guide the generation of more valuable samples. The experimental results demonstrate that the impact of coverage criteria is multifaceted. For the error-revealing capability of test suites, the additional coverage for erroneous samples over noise samples is only 0.32%. In terms of distribution diversity and sample fairness, 26 and 30 cases, respectively, out of 33 configurations are effectively evaluated. 
Additionally, incorporating neuron-wise coverage guidance during test suite generation slightly increases the production of adversarial samples by 4.56%. In conclusion, while current coverage criteria can act as an antidote for assessing simple diversity, they remain largely a placebo for the core task of revealing adversarial errors, particularly when relying on an individual criterion. Consequently, their practical application requires carefully evaluating the trade-off between computational overhead and potential benefits given the massive scale of Transformers. However, this low cost-effectiveness ultimately highlights the urgent need to explore and develop more robust and efficient criteria designed specifically for Transformer-based models.

  • Research Article
  • 10.1038/s44387-026-00095-1
Explainable AI needs formalization.
  • Jan 1, 2026
  • NPJ artificial intelligence
  • Stefan Haufe + 6 more

The field of "explainable artificial intelligence" (XAI) seemingly addresses the desire that decisions of machine learning systems should be human-understandable. However, in its current state, XAI itself needs scrutiny. Popular methods cannot reliably answer relevant questions about ML models, their training data, or test inputs, because they systematically attribute importance to input features that are independent of the prediction target. This limits the utility of XAI for diagnosing and correcting data and models, for scientific discovery, and for identifying intervention targets. The fundamental reason for this is that current XAI methods do not address well-defined problems and are not evaluated against the targeted criteria of explanation correctness. Researchers should formally define the problems they intend to solve and design methods accordingly. This will lead to diverse use-case-dependent notions of explanation correctness and objective metrics of explanation performance that can be used to validate XAI algorithms.

  • Research Article
  • 10.1016/j.sysarc.2025.103682
MetaDTS: Distribution difference-based adaptive test input selection for Deep Neural Networks
  • Jan 1, 2026
  • Journal of Systems Architecture
  • Xiang Su + 5 more

  • Research Article
  • 10.31861/sisiot2025.2.02003
An Adaptive Benchmark Testing Method for Evaluating Arithmetic Precision
  • Dec 30, 2025
  • Security of Infocommunication Systems and Internet of Things
  • Denys Deineko

Floating-point arithmetic is inherently prone to precision errors, which can accumulate over time and significantly influence the outcomes of numerical computations. This work presents a method designed to systematically assess and compare the accuracy of various arithmetic implementations by adaptively refining test inputs in response to observed computational inaccuracies. In contrast to conventional approaches that use either fixed sets of numerical values or random sampling techniques, the method introduced here continuously updates the test set. It does so by identifying areas in the numerical domain where computational errors tend to be the most significant. The refinement process is iterative and guided by statistical analysis of previous results, ensuring that regions with elevated error levels receive more focused attention in subsequent testing phases. At the heart of the method is an adaptive process for determining which numerical values require further examination. This is achieved by analyzing the distribution of previously recorded errors and updating a decision criterion based on those findings. Specifically, thresholds for acceptable accuracy are recalculated using statistical measures such as quantiles, which reflect the severity and frequency of encountered errors. This ensures that the refinement of test inputs is driven by actual data, rather than relying on predetermined heuristics. The method begins with the generation of a diverse collection of numerical inputs that spans a broad spectrum of floating-point values, including those known to cause instability in calculations – such as extremely small or large values and those located at the boundaries of numerical precision. These inputs are then used to perform arithmetic operations including addition, subtraction, multiplication, and division. 
Two different arithmetic implementations are evaluated: the standard arithmetic used in a widely adopted programming language and an alternative, custom-developed arithmetic designed to enhance numerical accuracy. For each operation, the resulting values produced by the two arithmetic systems are compared. Measures of accuracy are derived by calculating the differences between the outputs using both absolute and relative error estimations. These differences are then statistically analyzed to detect patterns in the occurrence and magnitude of errors. Based on this analysis, if the error associated with a particular input is determined to be higher than expected, additional test values are generated in the vicinity of that input. This is accomplished through carefully controlled variations, allowing the method to explore neighboring regions where similar errors might occur. In this way, the test suite evolves over time, becoming increasingly focused on those numerical situations that are most likely to expose weaknesses in arithmetic implementations. By uncovering patterns in how errors emerge and accumulate, the method provides a structured and repeatable process for evaluating the reliability of floating-point arithmetic under varying conditions. Its targeted nature makes it especially useful for scientific and engineering applications, where computational precision is critical. In summary, this approach improves upon traditional benchmarking techniques by introducing an adaptive, data-driven strategy that emphasizes the most challenging areas of numerical computation. As such, it offers a powerful tool for the verification and validation of arithmetic systems, supporting both development and quality assurance in software that relies heavily on floating-point calculations.
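The refinement loop described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the author's implementation. Python's `fractions.Fraction` stands in for the higher-accuracy reference arithmetic, only addition is tested, the threshold is the 90th-percentile error, and the names `relative_error` and `refine` and the perturbation width are hypothetical.

```python
import random
import statistics
from fractions import Fraction
from typing import List, Tuple

Pair = Tuple[float, float]


def relative_error(a: float, b: float) -> float:
    """Relative error of float addition against exact rational addition."""
    exact = Fraction(a) + Fraction(b)  # Fraction(float) is an exact conversion
    approx = Fraction(a + b)           # the rounded floating-point result
    if exact == 0:
        return 0.0 if approx == 0 else float("inf")
    return float(abs(approx - exact) / abs(exact))


def refine(pairs: List[Pair], rng: random.Random, cuts: int = 10,
           neighbors: int = 3, spread: float = 1e-3) -> Tuple[float, List[Pair]]:
    """One refinement round.

    Recomputes the error threshold as the highest of `cuts` quantile cut
    points (the 90th percentile for cuts=10) and generates perturbed
    neighbors around every input whose error exceeds it.
    """
    errors = [relative_error(a, b) for a, b in pairs]
    threshold = statistics.quantiles(errors, n=cuts)[-1]
    new_pairs: List[Pair] = []
    for (a, b), err in zip(pairs, errors):
        if err > threshold:
            for _ in range(neighbors):
                new_pairs.append((a * (1 + rng.uniform(-spread, spread)),
                                  b * (1 + rng.uniform(-spread, spread))))
    return threshold, new_pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
suite = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(200)]
for round_index in range(3):
    threshold, additions = refine(suite, rng)
    suite.extend(additions)  # the suite grows around high-error inputs
    print(round_index, threshold, len(suite))
```

Using a quantile rather than a fixed cutoff means the threshold adapts to the observed error distribution in each round, in line with the data-driven refinement the abstract describes.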

  • Research Article
  • 10.1145/3786776
Preparation and Utilization of Mixed States for Testing Quantum Programs–RCR Report
  • Dec 29, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Yuechen Li + 2 more

This paper presents a Replicated Computational Results (RCR) report for our article “Preparation and Utilization of Mixed States for Testing Quantum Programs”, accepted by ACM TOSEM. The article proposes a novel type of test case tailored for unit testing of Quantum Programs (QPs), namely Mixed-State Test Cases (MSTCs). Compared to the Pure-State Test Cases (PSTCs) adopted in previous related works, which considered only pure states as test inputs, MSTCs can incorporate mixed states in the input domain of QPs. As claimed in our article, MSTCs can promote test efficiency when covering a given input domain, and also contribute to test effectiveness owing to their tendency to detect more faults. This RCR report describes how to examine the functionality of our related artifacts and replicate the empirical results of our article. We have made our artifacts publicly available, including complete code, raw data, and detailed documentation, which not only facilitates result replication but also enhances the potential for reuse in future studies.
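For readers unfamiliar with the distinction between pure and mixed states, the sketch below builds a density matrix from an ensemble of pure states, using the textbook definition of a mixed state as a probability-weighted sum of outer products. It then computes the purity, the trace of the squared density matrix, which equals 1 only for pure states. It does not reproduce the article's preparation procedure. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[complex]]


def density_matrix(ensemble: Sequence[Tuple[float, Sequence[complex]]]) -> Matrix:
    """Build rho = sum_i p_i |psi_i><psi_i| from (probability, state) pairs."""
    dim = len(ensemble[0][1])
    rho: Matrix = [[0j] * dim for _ in range(dim)]
    for prob, psi in ensemble:
        for i in range(dim):
            for j in range(dim):
                rho[i][j] += prob * psi[i] * complex(psi[j]).conjugate()
    return rho


def purity(rho: Matrix) -> float:
    """Return Tr(rho^2): 1 for a pure state, below 1 for a mixed state."""
    dim = len(rho)
    total = sum(rho[i][k] * rho[k][i] for i in range(dim) for k in range(dim))
    return total.real


amp = 2 ** -0.5
pure_plus = density_matrix([(1.0, [amp, amp])])           # single pure state
mixed = density_matrix([(0.5, [1, 0]), (0.5, [0, 1])])    # equal classical mix
print(purity(pure_plus))  # close to 1
print(purity(mixed))      # 0.5, the minimum for one qubit
```

A single mixed input such as `mixed` stands for a whole ensemble of pure inputs at once, which gives some intuition for the efficiency claim in the abstract.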

  • Research Article
  • 10.3390/chemengineering10010001
Integration of Machine Learning and Feature Analysis for the Optimization of Enhanced Oil Recovery and Carbon Sequestration in Oil Reservoirs
  • Dec 19, 2025
  • ChemEngineering
  • Bukola Mepaiyeda + 5 more

The dual imperative of mitigating carbon emissions and maximizing hydrocarbon recovery has amplified global interest in carbon capture, utilization, and storage (CCUS) technologies. These integrated processes hold significant promise for achieving net-zero targets while extending the productive life of mature oil reservoirs. However, their effectiveness hinges on a nuanced understanding of the complex interactions between geological formations, reservoir characteristics, and injection strategies. In this study, a comprehensive machine learning-based framework is presented for estimating CO₂ storage capacity and enhanced oil recovery (EOR) performance simultaneously in subsurface reservoirs. The methodology combines simulation-driven uncertainty quantification with supervised machine learning to develop predictive surrogate models. Simulation results were used to generate a diverse dataset of reservoir and operational parameters, which served as inputs for training and testing three machine learning models: Random Forest, Extreme Gradient Boosting (XGBoost), and Artificial Neural Networks (ANN). The models were trained to predict three key performance indicators (KPIs): cumulative oil production (bbl), oil recovery factor (%), and CO₂ sequestration volume (SCF). All three models exhibited exceptional predictive accuracy, achieving coefficients of determination (R²) greater than 0.999 across both training and testing datasets for all KPIs. Specifically, the Random Forest and XGBoost models consistently outperformed the ANN model in terms of generalization, particularly for CO₂ sequestration volume predictions. These results underscore the robustness and reliability of machine learning models for evaluating and forecasting the performance of CO₂-EOR and sequestration strategies. To enhance model interpretability and support decision-making, SHapley Additive exPlanations (SHAP) analysis was applied. 
SHAP, grounded in cooperative game theory, offers a model-agnostic approach to feature attribution by assigning an importance value to each input parameter for a given prediction. The SHAP results provided transparent and quantifiable insights into how geological and operational features such as porosity, injection rate, water production rate, pressure, etc., affect key output metrics. Overall, this study demonstrates that integrating machine learning with domain-specific simulation data offers a scalable approach for optimizing CCUS operations. The insights derived from the predictive models and SHAP analysis can inform strategic planning, reduce operational uncertainty, and support more sustainable oilfield development practices.
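The game-theoretic idea behind SHAP can be shown with an exact Shapley computation on a tiny model. This sketch averages each feature's marginal contribution over all feature orderings, with "absent" features set to a baseline value. It is not the SHAP library and not the study's model. The toy model, the baseline choice, and the function name are assumptions, and exact enumeration is only feasible for a handful of features.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start with every feature "absent"
        previous = model(current)
        for feature in order:
            current[feature] = x[feature]  # switch this feature "on"
            value = model(current)
            totals[feature] += value - previous  # marginal contribution
            previous = value
    return [total / len(orderings) for total in totals]


def toy_model(v: Sequence[float]) -> float:
    """Hypothetical surrogate with one interaction term between features 0 and 2."""
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[2]


point = [1.0, 1.0, 1.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, point, reference)
print(phi)       # the interaction term is split evenly between features 0 and 2
print(sum(phi))  # equals toy_model(point) - toy_model(reference)
```

The attributions always sum to the difference between the prediction and the baseline prediction, which is what makes per-feature values comparable across inputs.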



Copyright 2026 Cactus Communications. All rights reserved.
