Articles published on Human-Object Interaction
341 Search results
- Research Article
- 10.1038/s41598-026-49105-x
- Apr 20, 2026
- Scientific Reports
- Xuyang Li + 7 more
Semantic knowledge enhanced 3D human–object interaction learning
- Research Article
- 10.3390/biomimetics11030214
- Mar 17, 2026
- Biomimetics (Basel, Switzerland)
- Jinsong Zhang + 1 more
Monocular 3D human-object interaction (HOI) reconstruction requires jointly recovering articulated human geometry, object pose, and physically plausible contact from a single RGB image. While recent token-based methods commonly employ dense self-attention to capture global dependencies, isotropic all-to-all mixing tends to entangle spatial-geometric cues (e.g., contact locality) with channel-wise semantic cues (e.g., action/affordance), and provides limited control for representing directional and asymmetric physical influence between humans and objects. This paper presents HOIMamba, a state-space sequence modeling framework that reformulates HOI reconstruction as bidirectional, multi-scale interaction state inference. Instead of relying on symmetric correlation aggregation, HOIMamba uses structured state evolution to propagate interaction evidence. We introduce a multi-scale state-space module (MSSM) to capture interaction dependencies spanning local contact details and global body-object coordination. Building on MSSM, we propose a spatial-channel grouped SSM (SCSSM) block that factorizes interaction modeling into a spatial pathway for geometric/contact dependencies and a channel pathway for semantic/functional correlations, followed by gated fusion. HOIMamba further performs explicit bidirectional propagation between human and object states to better reflect asymmetric reciprocity in physical interactions. We evaluate HOIMamba on two public benchmarks, BEHAVE and InterCap, using Chamfer distance for human/object meshes and contact precision/recall induced by reconstructed geometry. HOIMamba achieves consistent improvements over representative prior methods. On the BEHAVE dataset, it reduces human Chamfer distance by 8.6% and improves contact recall by 13.5% compared to the strongest Transformer-based baseline, with similar gains observed on the InterCap dataset. 
Ablation studies on BEHAVE verify the contributions of state-space modeling, multi-scale inference, spatial-channel factorization, and bidirectional interaction reasoning.
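The two metric families named in the abstract above can be illustrated with a small sketch. This is not the authors' evaluation code: the function names, the unsquared symmetric form of the Chamfer distance, and the fixed distance threshold used to derive contact are all assumptions made for illustration, and published benchmarks may define these details differently.

```python
import math

def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(math.dist(point, other) for other in cloud)

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions.

    Assumption: unsquared distances; some papers use squared distances instead.
    """
    a_to_b = sum(nearest_distance(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(nearest_distance(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a

def contact_mask(human_points, object_points, threshold):
    """Mark each human point as 'in contact' if it lies within `threshold` of the object."""
    return [nearest_distance(p, object_points) <= threshold for p in human_points]

def precision_recall(predicted, truth):
    """Precision and recall of a predicted boolean contact mask against a reference mask."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The brute-force nearest-neighbour search is quadratic in the number of points; a real evaluation on dense meshes would likely use a spatial index instead.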
- Research Article
- 10.3390/technologies14030150
- Mar 1, 2026
- Technologies
- Sergey Linok + 4 more
Analyzing events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly rely on visual–text models; however, these methods often capture information implicitly from images, lacking interpretable and structured spatio-temporal object representations and their relationships. To address this issue, we introduce DyGEnc—a novel method for dynamic graph encoding. This method integrates compressed spatio-temporal representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on sequences of textual scene graphs. Extensive evaluations on the STAR and AGQA datasets demonstrate that DyGEnc improves large language model performance when addressing queries related to the history of human–object interactions. Furthermore, the proposed method can be extended to process input images by leveraging foundation models to extract explicit textual scene graphs, as validated by the evaluation results. We expect these findings to contribute to the development of robust and compact graph-based memory for long-horizon reasoning in real-world applications, as demonstrated in a robotic experiment conducted using a wheeled manipulator platform.
- Research Article
- 10.1016/j.eswa.2025.130216
- Mar 1, 2026
- Expert Systems with Applications
- Kunyuan Deng + 2 more
Egocentric human-object interaction detection: A new benchmark and method
- Research Article
- 10.1007/s42486-025-00215-x
- Feb 26, 2026
- CCF Transactions on Pervasive Computing and Interaction
- Muhammad Hamza + 2 more
Person identification from egocentric human-object interactions using 3D hand pose
- Research Article
- 10.1109/tvcg.2026.3662720
- Feb 9, 2026
- IEEE Transactions on Visualization and Computer Graphics
- Ziyi Xu + 8 more
The generation of anchor-style product promotion videos presents promising opportunities in e-commerce, advertising, and consumer engagement. Despite advancements in pose-guided human video generation, creating product promotion videos remains challenging. In addressing this challenge, we identify the integration of human-object interactions (HOI) into pose-guided human video generation as a core issue. To this end, we introduce AnchorCrafter, a novel diffusion-based system designed to generate 2D videos featuring a target human and a customized object, achieving high visual fidelity and controllable interactions. Specifically, we propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives and disentangles object and human appearance, and the HOI-motion injection, which enables complex human-object interactions by overcoming challenges in object trajectory conditioning and inter-occlusion management. Extensive experiments show that our system improves object appearance preservation by 7.5%, and achieves the best video quality compared to existing state-of-the-art approaches. It also outperforms existing approaches in maintaining human motion consistency and high-quality video generation. Project page including data, code, and Huggingface demo: https://github.com/cangcz/AnchorCrafter.
- Research Article
- 10.1016/j.asoc.2026.114765
- Feb 1, 2026
- Applied Soft Computing
- Xiaoqian Han + 4 more
Soft-label guided multi-granularity prompts learning for human-object interaction detection
- Research Article
- 10.1016/j.engappai.2025.113406
- Feb 1, 2026
- Engineering Applications of Artificial Intelligence
- Qing Ye + 3 more
Appearance-semantic graphical model for human-object interaction detection
- Research Article
- 10.1016/j.cviu.2026.104648
- Feb 1, 2026
- Computer Vision and Image Understanding
- Hang Su + 5 more
Cascaded-parallel decoders and anchor-guided query generator for human-object interaction
- Research Article
- 10.1080/2159032x.2025.2604362
- Jan 10, 2026
- Heritage & Society
- Arianna Magyaricsová
This article examines the post-excavation trajectories of iron nails from the Roman hoard uncovered at Inchtuthil, Scotland, in 1960, using the framework of object itinerary theory. Through twenty-four semi-structured interviews and questionnaires with private collectors and institutional representatives, the study traces how these artefacts have circulated globally and acquired new meanings beyond their archaeological contexts. The study introduces a dual framework of physical and ontological itineraries: the former traces the tangible dispersal of the nails, while the latter explores how their roles and values evolve through relational, representational and affective engagements. The results show that the nails have moved from utilitarian Roman tools to contemporary heritage artefacts, functioning variously as teaching aids, mementos and social connectors. Their presence in both public and private collections demonstrates how material culture mediates memory, emotion, and belonging across time and contexts. In highlighting the evolving roles of these artefacts, the article challenges static models of object interpretation and advances a more fluid understanding of heritage as shaped through ongoing human-object interactions.
- Research Article
- 10.1109/tmm.2025.3632627
- Jan 1, 2026
- IEEE Transactions on Multimedia
- Dongpan Chen + 5 more
The human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objects by inferring fine-grained triples of $\langle \textit{human, action, object} \rangle$, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from long-tailed class distributions and zero-shot problems. Current methods typically identify HOIs only from input images or label spaces in a data-driven manner and lack sufficient knowledge prompts, which limits their potential in real-world scenes. To fill this gap, this paper introduces affordance and scene knowledge as prompts at different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., the knowledge related to affordances of object instances and the knowledge associated with the scene. Subsequently, the affordance and scene knowledge specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features to fully absorb the knowledge prompts. The encoded features of these two fields and the knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both the HICO-DET and V-COCO benchmarks demonstrate that the proposed method achieves competitive results compared with state-of-the-art methods.
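To make the triple formulation in the abstract above concrete, the following sketch shows one way a detected triple might be represented and compared against a reference annotation. The class name `HOITriplet`, its fields, and the matching rule (same labels, and both boxes overlapping the reference by an IoU of at least 0.5) are illustrative assumptions, not the data format or evaluation protocol of the paper or of any specific benchmark toolkit.

```python
from dataclasses import dataclass
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2); a hypothetical convention for this sketch.
Box = Tuple[float, float, float, float]

@dataclass(frozen=True)
class HOITriplet:
    """One <human, action, object> instance (illustrative structure only)."""
    human_box: Box
    action: str
    object_box: Box
    object_label: str

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def matches(pred: HOITriplet, ref: HOITriplet, threshold: float = 0.5) -> bool:
    """Assumed rule: labels agree and both boxes overlap the reference enough."""
    return (
        pred.action == ref.action
        and pred.object_label == ref.object_label
        and iou(pred.human_box, ref.human_box) >= threshold
        and iou(pred.object_box, ref.object_box) >= threshold
    )
```

Requiring both boxes to match is what makes the task harder than plain object detection: a correct action label attached to the wrong person or the wrong object does not count.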
- Research Article
- 10.3390/technologies14010028
- Jan 1, 2026
- Technologies
- Nicole Unsihuay + 2 more
This systematic review focused on the validity of markerless motion capture (MMC) systems used for human movement assessment during tasks that involve physical interaction with objects. Five electronic databases were searched until May 2025. Eligible studies (i) assessed the validity of an MMC system, (ii) required human participants to perform tasks that involved physical interaction with objects (e.g., lifts, carrying, gait with loads), (iii) employed a marker-based reference system, and (iv) reported at least one kinematic metric. Risk of bias was assessed using the SURE checklist. A best-evidence synthesis was conducted to classify the level of evidence across included studies. Fifteen studies met eligibility (median = 21 participants per study). In general, MMC systems performed well in capturing movement waveforms (i.e., high associations with reference systems), but their level of precision (i.e., the magnitude of differences from the reference systems) still requires improvement for tasks involving human–object interactions. The tasks most often analyzed were lifts, gait with load, squatting, reaching/manipulation, and technical gestures. There was strong evidence for the validity of MMC during lifting tasks. In summary, MMC systems show promising evidence of validity for some human–object interaction tasks, especially lifting, for which strong evidence was observed across studies. In contrast, evidence for tasks including gait under load, squatting, reaching, or touchscreen interaction is limited, moderate, or conflicting. Notwithstanding these limitations, most studies had moderate- to high-quality methodology. Additional research is required to optimize protocols for studying the measurement error of MMC under human–object interaction in real-world environments.
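The distinction the review draws between waveform association and magnitude of differences can be shown with a short sketch. The review does not state which statistics the included studies used; Pearson correlation and root-mean-square error are assumed here as common stand-ins, and the joint-angle values are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: how closely two waveforms share the same shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rmse(x, y):
    """Root-mean-square error: typical size of the difference between two series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical knee-flexion angles in degrees over four frames.
marker_based = [10.0, 20.0, 30.0, 40.0]
markerless = [15.0, 25.0, 35.0, 45.0]  # same shape, constant 5-degree offset

association = pearson_r(marker_based, markerless)  # close to 1: waveform is captured
error = rmse(marker_based, markerless)             # 5 degrees: precision still off
```

A system can therefore track the shape of a movement almost perfectly while still disagreeing with the reference by a clinically relevant amount, which is the pattern the review reports.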
- Research Article
- 10.1109/access.2026.3655585
- Jan 1, 2026
- IEEE Access
- Muhammad Shurahbeel Bin Sajid + 6 more
Human-object interaction (HOI) recognition is a critical challenge in computer vision, enabling machines to interpret complex human-object dynamics, with applications in robotics, surveillance, and human-computer interaction. Traditional methods often struggle to capture long-range temporal dependencies and to effectively fuse multimodal features such as RGB appearance and optical flow, which are critical for understanding complex interactions. In this research, we propose TransFlow-HOI, a transformer-based feature fusion framework designed to capture long-range temporal dependencies by integrating RGB appearance and optical flow features. Our method addresses the limitations of conventional fusion techniques by using attention-driven multimodal representation learning combined with ensemble classification. Experimental results on the HOI4D dataset demonstrate that TransFlow-HOI achieves an average accuracy of 83% in single-stage recognition and 93% in two-stage recognition. It also outperforms existing methods on other performance measures, including precision, recall, and F1-score.
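As a rough illustration of what attention-driven fusion of two modalities means, the sketch below applies generic scaled dot-product attention so that one query vector mixes an RGB token and an optical-flow token. It is not the TransFlow-HOI architecture, which the abstract does not specify in detail; the function names and the toy feature values are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """Scaled dot-product attention for a single query.

    Returns the attention-weighted mix of `values` and the weights themselves.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights

# Toy tokens: one RGB-appearance feature and one optical-flow feature (invented values).
rgb_token = [1.0, 0.0]
flow_token = [0.0, 1.0]
fused, weights = attend(rgb_token, [rgb_token, flow_token], [rgb_token, flow_token])
```

Here the query aligns more strongly with the RGB key, so the RGB token receives the larger weight; in a trained model the queries, keys, and values would be learned projections rather than raw features.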
- Research Article
- 10.1109/access.2026.3678513
- Jan 1, 2026
- IEEE Access
- Giovanni Minelli + 4 more
Creating realistic human-world interactions with diffusion models remains a key challenge, often requiring tedious trial-and-error processes and iterative manual refinements to achieve the desired result. Current approaches either fail to seamlessly integrate new content while maintaining global consistency of the scene, or require time-consuming editing and prompt engineering, making the process impractical for large-scale applications. To address this challenge, we propose an inpainting approach that specifically tackles the complexities of generating contextual human-object interactions, which we refer to as GeCHO. Our method improves local object fidelity and global scene consistency by leveraging cross-attention maps for automated, annotation-free object placement and using ControlNet to ensure precise spatial localization. We demonstrate the practical impact of our approach through two key applications: natural image inpainting, where we achieve contextual object placement with flexible spatial control, and human-object interaction detection, where we address the problem of long-tail distributions through synthetic data generation. Our results show that the proposed method enhances realism and adherence of generated images to text prompts, simplifies the generation of complex scenes without extensive input engineering, and improves performance in computer vision tasks limited by data scarcity. The source code implementation is available here: https://github.com/johnMinelli/gecho.
- Research Article
- 10.1109/tip.2026.3682124
- Jan 1, 2026
- IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society
- Weiquan Wang + 4 more
Rendering realistic human-object interactions (HOIs) from sparse-view inputs is a challenging yet crucial task for various real-world applications. Existing methods often struggle to simultaneously achieve high rendering quality, physical plausibility, and computational efficiency. To address these limitations, we propose HOGS (Human-Object Rendering via 3D Gaussian Splatting), a novel framework for efficient HOI rendering with physically plausible geometric constraints from sparse views. HOGS represents both humans and objects as dynamic 3D Gaussians. Central to HOGS is a novel optimization process that operates directly on these Gaussians to enforce geometric consistency (i.e., preventing inter-penetration or floating contacts) to achieve physical plausibility. To support this core optimization under sparse-view ambiguity, our framework incorporates two pre-trained modules: an optimization-guided Human Pose Refiner for robust estimation under sparse-view occlusions, and a Human-Object Contact Predictor that efficiently identifies interaction regions to guide our novel contact and separation losses. Extensive experiments on both human-object and hand-object interaction datasets demonstrate that HOGS achieves state-of-the-art rendering quality and maintains high computational efficiency.
- Research Article
- 10.1109/access.2026.3659132
- Jan 1, 2026
- IEEE Access
- Junwen Chen + 1 more
Human–Object Interaction Detection (HOID) has benefited greatly from advances in modern detection architectures and vision-language foundation models. In this paper, we present two progressively improved HOID frameworks—SOV-STG-VLA and Hybrid-SOV—that jointly push the frontier of accurate and efficient interaction understanding. SOV-STG-VLA reformulates HOI prediction as a subject–object–verb (SOV) decoding problem and introduces a Split Target Guided (STG) denoising strategy that accelerates convergence while enhancing structural consistency. Furthermore, a Vision–Language Advisor (VLA) integrates priors from large multimodal models to enrich verb semantics and improve HOI classification. Building upon this, Hybrid-SOV aligns HOID with the latest object-detection paradigms by incorporating an efficient hybrid encoder and a query-selection mechanism that directly constructs HOI queries from visual features, eliminating predefined embeddings and enabling more interpretable decoding. When coupled with DINO-v3 foundation features, Hybrid-SOV achieves state-of-the-art accuracy with superior inference efficiency. Extensive experiments on HICO-DET and V-COCO demonstrate that our frameworks not only advance HOID performance but also establish an effective path toward bridging detection architectures with vision-language foundation models.
- Research Article
- 10.1016/j.jvcir.2025.104660
- Jan 1, 2026
- Journal of Visual Communication and Image Representation
- Jing Han + 5 more
Detecting human-object interactions with image category-guided and query denoising
- Research Article
- 10.1145/3770746
- Dec 19, 2025
- ACM Transactions on Graphics
- Jintao Lu + 5 more
Animating human-scene interactions such as picking and placing a wide range of objects with different geometries is a challenging task, especially in a cluttered environment where interactions with complex articulated containers are involved. The main difficulty lies in the sparsity of the motion data compared to the wide variation of the objects and environments, as well as the poor availability of transition motions between different actions, increasing the complexity of the generalization to arbitrary conditions. To cope with this issue, we develop a system that tackles the interaction synthesis problem as a hierarchical goal-driven task. Firstly, we develop a bimanual scheduler that plans a set of keyframes for simultaneously controlling the two hands to efficiently achieve the pick-and-place task from an abstract goal signal such as the target object selected by the user. Next, we develop a neural implicit planner that generates hand trajectories to guide reaching and leaving motions across diverse object shapes/types and obstacle layouts. Finally, we propose a linear dynamic model for our DeepPhase controller that incorporates a Kalman filter to enable smooth transitions in the frequency domain, resulting in a more realistic and effective multi-objective control of the character. Our system can synthesize a rich variety of natural pick-and-place movements that adapt to different object geometries, container articulations, and scene layouts.
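The abstract above mentions a Kalman filter used to smooth transitions. The sketch below is a generic scalar Kalman filter with a constant-state model, included only to show the predict and update steps; it operates on a plain time series, whereas the paper applies its filter in the frequency domain of a DeepPhase controller, which is not reproduced here. All parameter names and default values are assumptions.

```python
def kalman_smooth(measurements, process_var, measurement_var,
                  initial_estimate=0.0, initial_var=1.0):
    """Filter a noisy scalar series with a constant-state Kalman filter.

    process_var: how much the true value is assumed to drift per step.
    measurement_var: how noisy each measurement is assumed to be.
    """
    estimate, variance = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is assumed constant, so only uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement according to their uncertainty.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        estimates.append(estimate)
    return estimates

# An abrupt jump from 0 to 1 is turned into a gradual approach toward 1.
smoothed = kalman_smooth([1.0] * 10, process_var=0.01, measurement_var=1.0)
```

A larger `measurement_var` relative to `process_var` yields slower, smoother transitions; setting `measurement_var` to zero makes the filter follow the measurements exactly.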
- Research Article
- 10.1007/s44267-025-00102-0
- Dec 1, 2025
- Visual Intelligence
- Fang Nan + 4 more
The aim of human-object interaction (HOI) detection is to identify triplets consisting of a human, a verb, and an object. Although existing methods leverage vision-language models (e.g., CLIP) to transfer textual information for unseen compositions, they often fail to capture the fine-grained visual cues that are essential for complex interactions, such as spatial configurations and object affordances. In this paper, we introduce visual guidance as an alternative approach. We define, for the first time, a visual-guided HOI detection task that aims to detect unseen HOI categories using a small number of guidance examples. To support this new task, we have constructed a new benchmark dataset, which contains one base set and four novel sets, taking into account the peculiarities of HOI. We then propose a VG-HOI model with progressive guidance, query reconstruction, and a conditional uncoupling decoder to supplement common HOI knowledge and task-specific cues and so improve the generalization capability of our model. In addition, we explore a new guidance sampling strategy, disentangled guidance, for real-world scenarios. Our in-depth analysis of the experimental results shows that the proposed model improves generalization in visual-guided HOI detection.
- Research Article
- 10.1109/tnnls.2025.3591538
- Nov 1, 2025
- IEEE Transactions on Neural Networks and Learning Systems
- Fan Yang + 7 more
To enable robots to use tools, the initial step is teaching robots to employ dexterous gestures for touching specific areas precisely where tasks are performed. Affordance features of objects serve as a bridge in the functional interaction between agents and objects. However, leveraging these affordance cues to help robots achieve functional tool grasping remains unresolved. To address this, we propose a granularity-aware affordance feature extraction method for locating functional affordance areas and predicting dexterous coarse gestures. We study the intrinsic mechanisms of human tool use. On the one hand, we use fine-grained affordance features of object-functional finger contact areas to locate functional affordance regions. On the other hand, we use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures. Additionally, we introduce a model-based postprocessing module that transforms affordance localization and gesture prediction into executable robotic actions. This forms GAAF-Dex, a complete framework that learns granularity-aware affordances from human-object interaction to enable tool-based functional grasping with dexterous hands. Unlike fully supervised methods that require extensive data annotation, we employ a weakly supervised approach to extract relevant cues from exocentric (Exo) images of hand-object interactions to supervise feature extraction in egocentric (Ego) images. To support this approach, we have constructed a small-scale dataset, functional affordance hand (FAH)-object interaction dataset, which includes nearly 6k images of functional hand-object interaction Exo images and Ego images of 18 commonly used tools performing six tasks. Extensive experiments on the dataset demonstrate that our method outperforms state-of-the-art methods, and real-world localization and grasping experiments validate the practical applicability of our approach. 
The source code and the established dataset are available at https://github.com/yangfan293/GAAF-DEX.