Articles published on Compositional Reasoning
159 Search results
- Research Article
- 10.1109/tpami.2026.3650864
- May 1, 2026
- IEEE transactions on pattern analysis and machine intelligence
- Jiangtong Li + 9 more
Video Question-Answering (VideoQA) enables machines to interpret and respond to complex video content, advancing human-computer interaction. However, existing multimodal large language models (MLLMs) often provide incomplete or opaque explanations, and existing benchmarks mainly focus on the correctness of final answers, limiting insight into their reasoning processes and hindering both transparency and verifiability. To address this gap, we propose the Question Parsing, Video Alignment and Answer Aggregation framework (QPVA$^{3}$), which leverages a compositional graph to drive visual and logical reasoning in VideoQA. Specifically, QPVA$^{3}$ consists of three core components (the planner, executor, and reasoner) that generate the compositional graph and conduct graph-driven reasoning. For the original question, the planner parses it into the compositional graph, capturing the underlying reasoning logic and structuring it into a series of interconnected questions. For each question in the compositional graph, the executor aligns the video by selecting relevant video clips and generates answers, ensuring accurate, context-specific responses. For each question with its first-order descendants, the reasoner aggregates answers by integrating reasoning logic with visual evidence, resolving conflicts to produce a coherent and accurate response. Moreover, to assess the performance of existing MLLMs in the reasoning processes of VideoQA, we introduce novel compositional consistency metrics and construct a VideoQA benchmark (QPVA$^{3}$ Bench) with 3,492 question-video tuples, each annotated with detailed compositional graphs and fine-grained answers. We evaluate the QPVA$^{3}$ framework on QPVA$^{3}$ Bench and 5 other VideoQA benchmarks. Experimental results demonstrate that our framework improves both consistency and accuracy compared to baselines, leading to a more transparent and verifiable VideoQA system. The evaluation and benchmarking results support the approach's potential to advance the field.
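The planner/executor/reasoner flow described in this abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `QuestionNode`, `answer_graph`, and the `executor` and `reasoner` callables are hypothetical names, not the authors' API, and real components would call a video-language model rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QuestionNode:
    """One sub-question in a hypothetical compositional graph."""
    qid: str
    text: str
    children: List[str] = field(default_factory=list)  # first-order descendants


def answer_graph(
    nodes: Dict[str, QuestionNode],
    root: str,
    executor: Callable[[str], str],
    reasoner: Callable[[str, str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Answer every question reachable from `root`, children before parents."""
    answers: Dict[str, str] = {}

    def visit(qid: str) -> str:
        if qid in answers:  # shared sub-questions are answered only once
            return answers[qid]
        node = nodes[qid]
        child_answers = {c: visit(c) for c in node.children}
        own = executor(node.text)  # stand-in for clip selection + answering
        # Leaves keep the executor's answer; inner nodes are aggregated.
        answers[qid] = reasoner(node.text, own, child_answers) if child_answers else own
        return answers[qid]

    visit(root)
    return answers
```

The sketch assumes the graph is acyclic; the memoised depth-first walk is one simple way to guarantee that each parent is aggregated only after its direct descendants have answers.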
- Research Article
- 10.1080/14498596.2026.2653491
- Apr 19, 2026
- Journal of Spatial Science
- Jinyu Kai + 4 more
Existing 3D spatial relation models depend on holistic bounding volume strategies, which cannot capture local vertical variations of morphologically complex objects. High uncertainty in topological reasoning also restricts their practical application. This paper presents 3D-SCRM, a 3D Spatial Comprehensive Relation Model based on Spatial Slicing, which discretizes 3D space into 2D vertical slices. It integrates three synergistic sub-models for directional, topological and distance relations, and adopts a compositional reasoning method to reduce topological ambiguity via directional and distance constraints. Unlike traditional models, 3D-SCRM precisely depicts dynamic spatial relation changes with altitude, supporting urban planning and 3D geospatial modeling.
- Research Article
- 10.1145/3786762
- Mar 19, 2026
- ACM Transactions on Programming Languages and Systems
- Zongyuan Liu + 5 more
Very relaxed concurrency memory models, like those of the Arm-A, RISC-V and IBM Power hardware architectures, underpin much of computing but break a fundamental intuition about programs, namely that syntactic program order and the reads-from relation always both induce order in the execution. Instead, out-of-order execution is allowed except where prevented by certain pairwise dependencies, barriers, or other synchronisation. This means that there is no notion of the ‘current’ state of the program, making it challenging to design (and prove sound) syntax-directed, modular reasoning methods like Hoare logics, as usable resources cannot implicitly flow from one program point to the next. We present AxSL, a family of separation logics for relaxed hardware memory models, and instantiate it on sequential consistency and on the Arm-A memory model. The Arm-A instance captures the fine-grained reasoning underpinning the low-overhead synchronisation idioms used by high-performance systems code. We mechanise AxSL in the Iris separation logic framework, illustrate it on key examples, and prove it sound with respect to the axiomatic memory model of Arm-A. By instantiating AxSL on different memory models, we demonstrate the generality of our approach, and show that it is largely generic in the axiomatic model and in the instruction-set semantics, offering a potential way forward for compositional reasoning for other models, and for the combination of production concurrency models and full-scale ISAs.
- Research Article
- 10.1609/aaai.v40i11.37834
- Mar 14, 2026
- Proceedings of the AAAI Conference on Artificial Intelligence
- Sahil Shah + 7 more
While vision-language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi-step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates a natural language question into a logic specification that models the temporal relationship between frame-level events. Next, we construct a video automaton to model the video's frame-by-frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning.
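The idea of keeping only segments that satisfy a temporal specification can be sketched with a toy example. This is not the NeuS-QA pipeline: it swaps in a hand-written two-state scan for the single pattern "event `a`, later followed by event `b`" instead of a general logic specification and model checker, and it assumes frame-level event labels are already available as sets of strings. The function name and event names are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def segments_a_then_b(frames: List[Set[str]], a: str, b: str) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices where `a` is seen and `b` first follows it.

    State 0 (start is None): waiting for `a`.
    State 1 (start is set):  `a` seen at `start`, waiting for a later `b`.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, events in enumerate(frames):
        if start is None:
            if a in events:
                start = i
        elif b in events:
            segments.append((start, i))
            start = None  # reset and look for the next occurrence of `a`
    return segments


# Toy frame-level detections (assumed input format).
frames = [{"door_open"}, set(), {"person_enters"}, {"door_open"}, {"person_enters"}]
matches = segments_a_then_b(frames, "door_open", "person_enters")
```

Only the frames inside the returned index ranges would then be passed on to a downstream model, which is the filtering role the abstract attributes to the logic-verified segments.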
- Research Article
- 10.1016/j.neucom.2026.133150
- Feb 1, 2026
- Neurocomputing
- Jiahe Wan + 4 more
TINCLIP: Improving compositional reasoning of CLIP via textual inversion with no
- Research Article
- 10.1038/s41598-025-31627-5
- Jan 4, 2026
- Scientific reports
- Yuehua Li + 1 more
Visual Question Answering (VQA) effectively integrates image and text information to provide accurate answers to user queries. Despite the advances in multi-modal learning, traditional multi-head attention models struggle with limited interaction between attention heads and the inability to capture positional information, which are critical for modeling both intra-modal and cross-modal connections. In this work, we propose a novel position-aware collaborative attention framework to address these challenges. Our framework introduces an Inter-Head Communication Matrix (IHCM) before and after normalization in multi-head attention, enabling effective information sharing across attention heads. We design two collaborative attention components, i.e., the Intra-modal Self-Attention with Collaboration (IMSAC) for refining single-modality features and the Cross-modal Guided Attention with Collaboration (CMGAC) for leveraging textual information to guide image attention. To further enhance positional awareness, absolute positional encoding is incorporated into the self-attention mechanism, significantly improving semantic understanding in text features. We evaluate our framework on the TDIUC, VQA-CP v2, and GQA datasets to demonstrate its effectiveness and robustness. Our collaborative attention block consistently improves accuracy across various question categories, with the IMSAC and CMGAC combination achieving the best results. Comprehensive ablation studies confirm the importance of inter-head collaboration and positional encoding, highlighting their contributions to addressing the "semantic gap" and enhancing cross-modal reasoning. The proposed framework achieves competitive and superior performance compared to several recent attention-based methods, showcasing superior resilience to language bias and strong compositional reasoning capabilities.
- Research Article
- 10.30829/zero.v9i3.26781
- Dec 29, 2025
- ZERO: Jurnal Sains, Matematika dan Terapan
- Rahmat Tullah + 1 more
This study presents a structured literature review of Neuro-Symbolic Artificial Intelligence (NSAI) approaches for extracting cultural semantics and fractal features from Batik motifs. A structured multi-database screening (2015–2025) yielded 69 peer-reviewed studies, which were synthesized thematically. The review identifies three key findings: existing vision-based models generally lack explicit mechanisms for encoding intangible cultural rules; hybrid neural–symbolic approaches demonstrate improved interpretability and compositional reasoning; and fractal-based descriptors show promise for representing culturally grounded motif structures. Based on these findings, this study proposes a conceptual NSAI framework that combines symbolic knowledge representations with fractal feature modeling, without empirical validation at this stage. The synthesis highlights potential applications in motif recognition, generative motif modeling, and computer-assisted cultural heritage preservation. Overall, NSAI offers a feasible and explainable conceptual framework for modeling Batik’s intangible cultural knowledge.
- Research Article
- 10.71465/mrcis159
- Dec 25, 2025
- Multidisciplinary Research in Computing Information Systems
- Qingyuan Zhou
The exponential growth of heterogeneous data sources has created unprecedented challenges for information retrieval and knowledge extraction systems. Modern enterprises and research institutions routinely manage vast repositories containing both structured databases and unstructured text collections, yet traditional indexing approaches remain siloed in their treatment of these distinct data modalities. This research investigates compositional reasoning mechanisms that enable unified query processing across structured and unstructured data through hybrid indexing frameworks. We propose a novel architecture that integrates semantic embeddings with relational schema representations, employing gating mechanisms to dynamically balance contributions from both modalities. Our methodology combines graph-based knowledge structures with dense vector retrieval systems, implementing attention mechanisms and modular reasoning components that enable flexible query decomposition and execution. Through extensive experiments on enterprise datasets containing financial records, technical documentation, and operational logs, we demonstrate that hybrid indexing frameworks achieve superior performance in multi-hop reasoning tasks compared to single-modality approaches. The proposed system reduces query response time by 34% while improving answer accuracy by 28% on compositional queries requiring integration across database tables and document collections. These findings suggest that unified indexing strategies with compositional reasoning represent a critical enabler for next-generation question answering systems, business intelligence platforms, and knowledge management applications operating in complex data environments.
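A gating mechanism that balances two retrieval signals can be illustrated with a very small sketch. The abstract does not specify the gate's form, so this assumes a scalar sigmoid gate mixing a dense-embedding score with a structured (schema-based) score; `gated_score` and its parameters are hypothetical names, and in a learned system `gate_logit` would be predicted from the query rather than passed in.

```python
import math


def gated_score(dense_score: float, structured_score: float, gate_logit: float) -> float:
    """Blend two relevance scores with a sigmoid gate in (0, 1)."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid
    # gate near 1 favours the dense score, gate near 0 the structured score.
    return gate * dense_score + (1.0 - gate) * structured_score
```

With `gate_logit = 0` the gate is 0.5 and the result is the plain average of the two scores; larger positive logits shift weight toward the dense retriever.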
- Research Article
- 10.61173/0p6jpg91
- Dec 19, 2025
- Science and Technology of Engineering, Chemistry and Environmental Protection
- Jiacheng Shi
With the rapid growth of multimodal learning, Vision-Language Models (VLMs) have become a cutting-edge direction in artificial intelligence. Among them, the Contrastive Language–Image Pre-training (CLIP) model, based on large-scale contrastive learning, has demonstrated powerful capabilities in zero-shot transfer and cross-modal retrieval. However, CLIP’s weakly supervised training paradigm shows clear shortcomings when dealing with compositional reasoning. Therefore, this survey systematically reviews and analyzes representative methods proposed in recent years to address CLIP’s compositional reasoning limitations, including Self-supervision meets Language-Image Pre-training (SLIP), Language augmented CLIP (LaCLIP), TripletCLIP, Synthetic Perturbations for Advancing Robust Compositional Learning (SPARCL), Compositionally-aware Learning in CLIP (CLIC), and Training-Time Negation Data Generation for Negation Awareness of CLIP (TNG-CLIP). We introduce the principles and characteristics of these methods, followed by a comparative analysis of their performance on different benchmarks and how they mitigate deficiencies. Through this overview of CLIP and its derivative methods, we hope future research will focus on integrating their strengths, while also developing more efficient data synthesis techniques and more comprehensive evaluation benchmarks.
- Research Article
- 10.1177/18758967251394597
- Dec 8, 2025
- Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology
- Ngoc-Khuong Nguyen + 2 more
Large Language Models (LLMs) excel at many tasks but often struggle with complex, multi-step reasoning, leading to inconsistencies and hallucinations. To address this, we propose a neural-symbolic integration framework that enhances LLM reasoning by incorporating formal knowledge, such as logical rules, ontologies, and knowledge graphs, into their chain-of-thought (CoT) process. Our approach retrieves and integrates symbolic information to guide logical inference, resulting in more accurate and interpretable outputs. Experiments on compositional reasoning benchmarks demonstrate significant improvements over standard LLM methods. This work highlights the potential of neural-symbolic integration for developing more reliable and explainable AI systems in high-stakes applications.
- Research Article
- 10.5964/bioling.19021
- Dec 4, 2025
- Biolinguistics
- Elliot Murphy + 5 more
A core component of a successful artificial general intelligence would be the rapid creation and manipulation of grounded compositional abstractions and the demonstration of expertise in the family of recursive hierarchical syntactic objects necessary for the creative use of human language. We evaluated the recently released o3 model (OpenAI; o3-mini-high) from ChatGPT and discovered that while it succeeds on some basic linguistic tests relying on linear, surface statistics (e.g., the Strawberry Test), it fails to generalize basic phrase structure rules; it fails with comparative sentences involving semantically illegal cardinality comparisons (‘Escher sentences’); it fails to correctly rate and explain acceptability dynamics; and it fails to distinguish between instructions to generate unacceptable semantic vs. unacceptable syntactic outputs. When tasked with generating simple violations of grammatical rules, it is seemingly incapable of representing multiple parses to evaluate against various possible semantic interpretations. We ran all of these prompts multiple times again through the API and provide basic accuracy scores. In stark contrast to many recent claims that artificial language models are on the verge of replacing the field of linguistics, our results suggest not only that deep learning is hitting a wall with respect to compositionality (Marcus, 2022), but that it is hitting [ a [ stubbornly [ resilient wall ]]] that cannot readily be surmounted to reach human-like compositional reasoning simply through more compute.
- Research Article
- 10.51317/jmds.v3i1.797
- Dec 2, 2025
- Journal of Mathematics and Data Science (JMDS)
- Marion Namuki Nyongesa + 2 more
The aim of this paper is to present a framework that integrates optimisation into algebraic structures in Cartesian Closed Categories (CCCs). Traditional mathematical methods treated optimisation as an external process, which limited its foundational role in mathematics. Inspired by the finite nature hypothesis and Hilbert's sixth problem, which calls for the axiomatisation of physical principles, this study formalises optimisation as an intrinsic algebraic axiom in a way that aligns with Hilbert's vision of uniting mathematical and physical laws. The framework builds on Lawvere's categorical treatment of metric spaces and Birkhoff's HSP theorem, which the study uses to define an optimisation algebra. The study then provides proof that the class of such algebras forms a variety in universal algebra and demonstrates categorical soundness within CCCs. The proposed approach guarantees that optimisation is inherent within algebraic systems, establishing natural substructures of optimal elements and facilitating compositional reasoning in computational models. Applying this framework in dataflow networks demonstrates convergence to optimal steady states, enhancing resource utilisation and system efficiency. Future research includes using the framework in enriched categories, distributed systems and incorporating the operator in tools that can be used to solve real-world problems.
- Research Article
- 10.15587/2706-5448.2025.342365
- Oct 30, 2025
- Technology audit and production reserves
- Roman Malyi + 1 more
The object of this research is the process of selecting an architectural strategy for event-schema evolution in event-sourcing systems. This process involves complex architectural trade-offs and is a critical task for maintaining the integrity and long-term viability of the immutable event log. The addressed problem is the inconsistent performance and reliability ceiling of standard LLM prompting techniques like few-shot learning. These methods rely on heuristic pattern matching and thus lack the systematic framework required for high-stakes architectural decisions. This issue is compounded by the subjectivity inherent in the manual selection process by engineers. The principal result is the development of a rule-based “atomic taxonomy” method. This approach enabled large-scale models (GPT-5, Gemini-2.5-pro) to achieve perfect predictive performance (1.0 Macro F1-score), while simultaneously degrading the performance of most medium-sized models when compared to the few-shot prompting baseline. This divergence is explained by the cognitive demands of the task. The proposed method shifts the process from heuristic pattern matching to structured, compositional reasoning. The results indicate that large models possess the necessary architectural capabilities to execute this formal logic, whereas medium-sized models are overwhelmed by its cognitive overhead, making a simpler, example-based approach more effective for them. In practice, the findings provide a clear, actionable guideline for architects. The atomic taxonomy serves as a robust framework to assist in manual decision-making. For automated support systems, its application is recommended exclusively with large-scale LLMs capable of advanced reasoning. The study concludes that for systems leveraging smaller, more efficient models, traditional few-shot prompting remains the more reliable and superior strategy.
- Research Article
- 10.1145/3763051
- Oct 9, 2025
- Proceedings of the ACM on Programming Languages
- Lang Liu + 4 more
Given the high cost of formal verification, a large system may include differently analyzed components: a few are fully verified, and the rest are tested. Currently, there is no reasoning system that can soundly compose these heterogeneous analyses and derive the overall formal guarantees of the entire system. The traditional compositional reasoning technique—rely-guarantee reasoning—is effective for verified components, which undergo over-approximated reasoning, but not for those components that undergo under-approximated reasoning, e.g., using testing or other program analysis techniques. The goal of this paper is to develop a formal, logical foundation for composing heterogeneous analysis, deploying both over-approximated (verification) and under-approximated (testing) reasoning. We focus on systems that can be modeled as a collection of communicating processes. Each process owns its internal resources and a set of channels through which it communicates with other processes. The key idea is to quantify the guarantees obtained about the behavior of a process as a test level, which captures the constraints under which this guarantee is analyzed to be true. We design a novel proof system LabelBI based on the logic of bunched implications that enables rely-guarantee reasoning principles for a system of differently analyzed components. We develop trace semantics for this logic, against which we prove our logic is sound. We also prove cut elimination of our sequent calculus. We demonstrate the expressiveness of our logic via a case study.
- Research Article
- 10.4218/etrij.2025-0063
- Oct 1, 2025
- ETRI Journal
- Jungyu Kang + 2 more
Advancements in autonomous vehicles and smart traffic systems require vision datasets capable of capturing complex interactions and dynamic behaviors in real‐world urban environments. Although datasets such as COCO, Cityscapes, and ROAD have advanced object detection, segmentation, and action recognition, they often treat scene elements in isolation, thereby limiting their use for comprehensive understanding. This paper presents DOROS, a dataset with multilevel annotations across Agent, Location, and Behavior categories. DOROS is designed to support compositional reasoning under diverse traffic conditions. An annotation pipeline combining foundation models with structured human refinement ensures consistent, high‐quality supervision. To support structured evaluation, we introduce the Combined mAP (mask) metric, which assesses instance segmentation under strict category‐level label matching while mitigating the effects of class imbalance. Extensive experiments, including ablation studies and transformer‐based baselines, validate DOROS as a resource for structured scene understanding in complex traffic scenarios. The dataset and code will be released upon publication.
- Research Article
- 10.1007/s10817-025-09731-y
- Aug 11, 2025
- Journal of Automated Reasoning
- Lawrence Dunn + 2 more
Reasoning about substitution remains one of the most tedious and error-prone aspects of formal metatheory. We present Tealeaves, a framework implemented in Coq for developing such infrastructure generically and modularly. Tealeaves is centered on a novel categorical abstraction, decorated traversable monads (DTMs), which provide a unifying foundation for first-order syntax and enable local, compositional reasoning about syntactic operations, such as substitution, that are defined purely by their effect on individual variable occurrences. Within this framework, Tealeaves supports extensible backend modules, each implementing the metatheory of a specific concrete strategy for representing binders. Our current backends include implementations of de Bruijn indices in the style of Autosubst, as well as locally nameless in the style of LNgen. Tealeaves goes further by providing a certified translation between these representations, illustrating how DTMs reconcile their underlying structures. The framework also accommodates challenging features such as variadic and mutually-recursive binders, which are often overlooked by both theoretical treatments and practical tools. We describe the implementation and use of Tealeaves’ backends in formalized language developments, introduce the equational axioms that characterize DTMs, and conclude with a presentation of those axioms instantiated for the lambda calculus extended with a variadic binding constructor.
- Research Article
- 10.1002/iis2.70086
- Jul 1, 2025
- INCOSE International Symposium
- Isaac Amundson + 4 more
Formal methods have proved to be a valuable tool for identifying defects early in the development of safety‐critical systems. Despite that, several factors have impeded their adoption within the systems engineering community. Some of these include lack of commercially available solutions, poor integration of analysis functionality in existing model‐based systems engineering (MBSE) tools, and difficulty interpreting the results of the formal analyses. One such analysis that is popular among pockets within the aerospace community is the Assume Guarantee Reasoning Environment (AGREE), which analyzes Architecture Analysis and Design Language (AADL) models. AGREE is an open‐source property‐proving model checker that uses compositional reasoning to prove the system composition is valid based on assumptions and guarantees associated with the system components. The goals of this work are to develop a method for using AGREE in a more widely adopted commercially available tool and to take advantage of MBSE formalisms to better convey the analysis results, especially counterexamples. The hope is that this will increase the use of formal methods by high‐assurance systems developers.
- Research Article
- 10.15276/ej.02.2025.4
- Jun 24, 2025
- Economic journal Odessa polytechnic university
- Tetiana Grynko + 2 more
This article investigates visual culture as a key factor in shaping managerial thinking within the restaurant industry. The study analyzes case examples of both Ukrainian and international restaurant brands, such as Taco Love, The Oak Stave, and Chipotle. These cases demonstrate how consistent visual identity, coherent content across social media platforms, and immersive spatial aesthetics contribute to brand recognition, emotional connection with customers, and the overall efficiency of internal management practices. The research findings support the notion that visual culture fosters a new type of managerial thinking – one grounded in imaginative, spatial, and compositional reasoning. The proposed framework opens opportunities for further empirical research and the development of practical guidelines for restaurant managers, brand designers, and service strategy professionals.
- Research Article
- 10.1016/j.neucom.2025.129906
- Jun 1, 2025
- Neurocomputing
- Souvik Chowdhury + 1 more
Handling language prior and compositional reasoning issues in Visual Question Answering system
- Research Article
- 10.1049/cps2.70033
- Jan 1, 2025
- IET Cyber-Physical Systems: Theory & Applications
- Hao Ren + 1 more
Formally verifying complex model‐based designs has posed a significant challenge for complex systems, primarily due to their sheer scale and the critical nature of safety involved. A common method for tackling this challenge is the divide‐and‐conquer strategy, which leverages the system model architecture to decompose the verification tasks into smaller subtasks focused on subsystems or components. This approach entails articulating the verification goals as formal property contracts and subsequently verifying each one separately. Once the individual contracts of the subsystems or components are validated, they are integrated through formal reasoning to achieve verification at the system level, also represented as a formal property contract. However, the current procedures and tools designed for this type of compositional verification often require manual postulation of system‐level contracts and are susceptible to false alarms in verification outcomes due to over‐approximation. In this paper, we introduce our approach to compositional reasoning and verification using quantifier elimination (QE), which automates the derivation of the strongest system‐level property given the component‐level ones and their connectivity, enabling precise automated analysis for even time‐dependent and nonlinear systems. QE serves as the foundation for composition calculus, allowing us to derive the strongest system‐level property in a single step. We begin by applying this framework to properties that are time‐independent, and subsequently, we expand our approach to encompass the composition of time‐dependent properties. For the latter case, we shift the given properties over time to span the time horizon of interest, which we show to be no greater than the total time horizons of the component‐level properties. Similarly, we use QE to infer the system‐initial‐condition from the component‐level initial conditions. The automatically inferred strongest system‐level property becomes useful in verifying a postulated desired system‐level property through induction, involving the inferred strongest system‐level property and its initial condition. In this regard, we also advance the existing k‐induction based model checking by incorporating QE and formulating its base and inductive steps as QE problems. Our composition approach is uniform regardless of the type of composition (cascade/parallel/feedback) and regardless of whether the component properties being composed are time‐independent or time‐dependent. We also present a prototype verifier called ReLIC (Reduced Logic Inference for Composition), which implements our approach, and demonstrate it through several illustrative and practical examples. We also demonstrate the recent integration of our approach into an industrial verification and validation (V&V) tool suite, which allows for augmented static analysis of Simulink models and deep neural networks (DNNs).
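A small worked example may help show how quantifier elimination yields a system-level property from component-level ones. The two components and their properties below are invented for illustration and are not taken from the paper; the example covers only the time-independent cascade case.

```latex
% Illustrative cascade: component 1 maps x to y, component 2 maps y to z.
\begin{align*}
P_1(x, y) &:\; 0 \le x \le 1 \;\wedge\; y = 2x \\
P_2(y, z) &:\; z = y + 1 \\[4pt]
% The internal signal y is hidden by existential quantification,
% and eliminating the quantifier gives the strongest property over (x, z).
P_{\mathrm{sys}}(x, z) &:\; \exists y.\,\bigl(P_1(x, y) \wedge P_2(y, z)\bigr) \\
&\iff\; 0 \le x \le 1 \;\wedge\; z = 2x + 1
\end{align*}
```

Because the result is equivalent to the quantified conjunction rather than a weakening of it, a postulated system-level requirement such as $1 \le z \le 3$ can be checked against it without the false alarms that an over-approximated contract could introduce.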