Articles published on Human-Object Interaction Detection
116 Search results
- Research Article
- 10.1016/j.eswa.2025.130216
- Mar 1, 2026
- Expert Systems with Applications
- Kunyuan Deng + 2 more
Egocentric human-object interaction detection: A new benchmark and method
- Research Article
1
- 10.1016/j.asoc.2026.114765
- Feb 1, 2026
- Applied Soft Computing
- Xiaoqian Han + 4 more
Soft-label guided multi-granularity prompts learning for human-object interaction detection
- Research Article
- 10.1016/j.engappai.2025.113406
- Feb 1, 2026
- Engineering Applications of Artificial Intelligence
- Qing Ye + 3 more
Appearance-semantic graphical model for human-object interaction detection
- Research Article
- 10.1109/access.2026.3678513
- Jan 1, 2026
- IEEE Access
- Giovanni Minelli + 4 more
Creating realistic human-world interactions with diffusion models remains a key challenge, often requiring tedious trial-and-error processes and iterative manual refinements to achieve the desired result. Current approaches either fail to seamlessly integrate new content while maintaining global consistency of the scene, or require time-consuming editing and prompt engineering, making the process impractical for large-scale applications. To address this challenge, we propose an inpainting approach that specifically tackles the complexities of generating contextual human-object interactions, which we refer to as GeCHO. Our method improves local object fidelity and global scene consistency by leveraging cross-attention maps for automated, annotation-free object placement and using ControlNet to ensure precise spatial localization. We demonstrate the practical impact of our approach through two key applications: natural image inpainting, where we achieve contextual object placement with flexible spatial control, and human-object interaction detection, where we address the problem of long-tail distributions through synthetic data generation. Our results show that the proposed method enhances realism and adherence of generated images to text prompts, simplifies the generation of complex scenes without extensive input engineering, and improves performance in computer vision tasks limited by data scarcity. The source code implementation is available here: https://github.com/johnMinelli/gecho.
- Research Article
- 10.1109/tmm.2025.3632627
- Jan 1, 2026
- IEEE Transactions on Multimedia
- Dongpan Chen + 5 more
The human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objects by inferring fine-grained triples of $\langle \textit{human, action, object} \rangle$, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from long-tailed class distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, which consequently limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts on different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., the knowledge related to affordances of object instances and the knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features to fully absorb the knowledge prompts. These two encoded features of different fields and knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.
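The triple of human, action, and object described in this abstract can be pictured as a small record type. The following is a minimal sketch, not code from any of the listed papers: the names `HOITriplet`, `iou`, and `matches` are hypothetical, and the matching rule (same action and object class, with both boxes overlapping the ground truth at an IoU of at least 0.5) is an assumption modelled on the convention commonly used when scoring HOI predictions on benchmarks such as HICO-DET.

```python
from dataclasses import dataclass
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HOITriplet:
    """One <human, action, object> instance with its two bounding boxes."""
    human_box: Box
    action: str
    object_label: str
    object_box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred: HOITriplet, gt: HOITriplet, thresh: float = 0.5) -> bool:
    """Assumed rule: classes agree and both boxes overlap enough."""
    return (
        pred.action == gt.action
        and pred.object_label == gt.object_label
        and iou(pred.human_box, gt.human_box) >= thresh
        and iou(pred.object_box, gt.object_box) >= thresh
    )
```

Under this sketch, a prediction with well-localized boxes but the wrong verb does not count as correct, which is one reason rare action classes are hard to score well on.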
- Research Article
- 10.1109/access.2026.3659132
- Jan 1, 2026
- IEEE Access
- Junwen Chen + 1 more
Human–Object Interaction Detection (HOID) has benefited greatly from advances in modern detection architectures and vision-language foundation models. In this paper, we present two progressively improved HOID frameworks—SOV-STG-VLA and Hybrid-SOV—that jointly push the frontier of accurate and efficient interaction understanding. SOV-STG-VLA reformulates HOI prediction as a subject–object–verb (SOV) decoding problem and introduces a Split Target Guided (STG) denoising strategy that accelerates convergence while enhancing structural consistency. Furthermore, a Vision–Language Advisor (VLA) integrates priors from large multimodal models to enrich verb semantics and improve HOI classification. Building upon this, Hybrid-SOV aligns HOID with the latest object-detection paradigms by incorporating an efficient hybrid encoder and a query-selection mechanism that directly constructs HOI queries from visual features, eliminating predefined embeddings and enabling more interpretable decoding. When coupled with DINO-v3 foundation features, Hybrid-SOV achieves state-of-the-art accuracy with superior inference efficiency. Extensive experiments on HICO-DET and V-COCO demonstrate that our frameworks not only advance HOID performance but also establish an effective path toward bridging detection architectures with vision-language foundation models.
- Research Article
- 10.1007/s44267-025-00102-0
- Dec 1, 2025
- Visual Intelligence
- Fang Nan + 4 more
The aim of human-object interaction (HOI) detection is to identify the triplets consisting of a human, a verb, and an object. Although existing methods leverage vision-language models (e.g., CLIP) to transfer textual information for unseen compositions, they often fail to capture the fine-grained visual cues that are essential for complex interactions, such as spatial configurations and object affordances. In this paper, we introduce visual guidance as an alternative approach to achieving the desired outcome. We define a new visual-guided HOI detection task for the first time, aiming at detecting unseen HOI categories using a small number of guidance examples. To support this new task, we have constructed a new benchmark dataset, which contains one base set and four novel sets, taking into account the peculiarities of HOI. Then, we propose a VG-HOI model with progressive guidance, query reconstruction, and a conditional uncoupling decoder to supplement common HOI knowledge and task-specific cues to improve the generalization capability of our model. Besides, we explore a new guidance sampling strategy, disentangled guidance, for real-world scenarios. Our in-depth analysis of the experimental results shows that the proposed model can improve the ability to generalize when detecting visual-guided HOI.
- Research Article
2
- 10.1016/j.neucom.2025.130882
- Oct 1, 2025
- Neurocomputing
- Tianlun Luo + 6 more
Exploring interaction concepts for human–object-interaction detection via global- and local-scale enhancing
- Research Article
2
- 10.1016/j.neucom.2025.130709
- Oct 1, 2025
- Neurocomputing
- Tianlun Luo + 6 more
Simple yet effective: An explicit query-based relation learner for human–object-interaction detection
- Research Article
1
- 10.1109/tcyb.2025.3587037
- Sep 1, 2025
- IEEE Transactions on Cybernetics
- Weibo Jiang + 5 more
Human-object interaction (HOI) detection tackles the problem of joint localization and classification of HOIs. Recent HOI detection methods are mainly based on transformer networks, where the explicit priors at the object level (e.g., scene layout, object appearance, or category) are usually fed into the transformer to improve the object query ability. Though these methods have achieved remarkable results, they did not pay enough attention to the implicit action-level information, which is the fundamental element of HOI. In this work, we propose an interaction-aware transformer network (IATN) to obtain the interaction-aware query, by jointly utilizing implicit action-level priors and explicit object-level priors. Specifically, we design an action-aware module (AAM) to aggregate implicit action priors from the scene level and instance level, respectively. Then, we design an action-oriented graph (AOG), where human feature and object feature are graph nodes and action semantics represent graph edges, to aggregate priors jointly from action level and object level. Afterwards, the interaction-aware query is acquired and finally adopted to obtain the HOI predictions. Besides, we leverage knowledge distillation to enhance the action-level priors by transferring the final HOI predictions to the intermediate features. Extensive experiments on HICO-DET and V-COCO datasets verify the effectiveness of our proposed interaction-aware model.
- Research Article
1
- 10.1016/j.patcog.2025.112242
- Aug 1, 2025
- Pattern Recognition
- Yuequan Yang + 5 more
CORE-CLIP: Smart Collaborative Reasoning Driven by CLIP for Human-Object Interaction Detection
- Research Article
- 10.1007/s10489-025-06730-9
- Jul 10, 2025
- Applied Intelligence
- Huanchun Peng + 7 more
Human-object interaction detection based on adaptive contrastive learning and class-specific feature enhancement
- Research Article
2
- 10.1016/j.neunet.2025.107348
- Jul 1, 2025
- Neural Networks: The Official Journal of the International Neural Network Society
- Weiying Xue + 5 more
Towards zero-shot human-object interaction detection via vision-language integration
- Research Article
- 10.3390/info16060474
- Jun 6, 2025
- Information
- Gyubin Park + 1 more
In Human–Object Interaction (HOI) detection, class imbalance severely limits the performance of a model on infrequent interaction categories. To overcome this problem, a Re-Splitting algorithm has been developed. This algorithm implements DreamSim-based clustering and performs k-means-based partitioning to restructure the train–test splits. By doing so, the approach balances rare and frequent interaction classes, thereby increasing robustness. A Real-World test dataset has also been introduced. This dataset is comparable to a truly independent benchmark. It is designed to address class distribution bias, which is commonly present in traditional test sets. However, as shown in the Experiment and Evaluation subsection, a high level of performance can be achieved for the general case using different few-shot and rare-class training instances. Models trained solely on the re-split dataset show significant improvements in rare-class mAP, particularly for one-stage models. Evaluation on the Real-World test dataset further highlights previously overlooked model performance and supports fair dataset structuring. The methods are validated with extensive experiments using five one-stage and two two-stage models. Our analysis shows that reshaping dataset distributions increases rare-class detection by as much as 8.0 mAP. This study paves the way for balanced training and evaluation, leading to the formulation of a general framework for scalable, fair, and generalizable HOI detection.
- Research Article
1
- 10.1007/s11227-025-07308-5
- May 3, 2025
- The Journal of Supercomputing
- Limin Xia + 1 more
Comprehensive context learning for two-stage human-object interaction detection
- Research Article
- 10.1007/s11263-025-02445-z
- Apr 28, 2025
- International Journal of Computer Vision
- Hong-Bo Zhang + 5 more
Interaction Confidence Attention for Human–Object Interaction Detection
- Research Article
1
- 10.1609/aaai.v39i9.32972
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Yongchao Xu + 4 more
Human-object interaction (HOI) detection aims to detect the spatial positions of human-object pairs and recognize their interactions. Existing single-branch, two-branch, and three-branch methods struggle to strike an appropriate trade-off among efficiency, multi-task decoupling, and collaborative learning, and they also fail to identify rare and complex interaction categories effectively. In this work, we propose a novel Efficient Mamba-based Disentangled Progressive Learning (HOIMamba) for HOI Detection to absorb the advantages of the existing three approaches and adaptively aggregate multi-level interaction semantics guided by cross-task bidirectional information contexts. Specifically, HOIMamba builds an efficient and effective decoder through cascaded Low-Rank Adaptations (LoRAs), with high efficiency, thorough decoupling of tasks, and good multi-task collaborative learning. Furthermore, to alleviate the recognition problem of interactions in difficult HOI samples, a novel Mamba-based comprehensive progressive learning strategy with Cross-enhance Mamba (CEM) blocks and Detection Context Propagation (DCP) blocks is designed to gradually excavate interaction-related discriminative cues from four levels. CEM blocks automatically aggregate context to generate diverse task-shared semantics and simultaneously realize the cross-task interaction between human and object branches, guiding the interaction branch to extract more expressive HOI representation. DCP blocks further transfer the comprehensive interaction context to human and object branches to achieve rich and effective information exchange, facilitating the model to discover more HOI instances. Extensive experimental results on two standard benchmarks demonstrate the effectiveness of our HOIMamba.
- Research Article
3
- 10.1609/aaai.v39i4.32411
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Mingda Jia + 3 more
Spatial contexts, such as the backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent advancements in HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.
- Research Article
- 10.52783/jisem.v10i30s.4891
- Mar 31, 2025
- Journal of Information Systems Engineering and Management
- Sakshi Birthi
This article presents ReCap Pro, a framework that corrects auto-generated captions by dealing with the possible errors in nouns and verbs in the caption. While caption correction has been attempted earlier, it is observed that it has never been tried as a meta-learning-based approach. The work described in this article offers few-shot learning enabling faster learning with fewer samples of images, solving one of the critical limitations of the traditional data-intensive caption generation models. An object detection model trained using Reptile Meta-Learning is employed to detect the correct nouns and a human-object interaction (HOI) detection model trained using Prototypical Networks is used to detect the verbs in the image. The proposed method addresses a long-standing limitation of existing caption generation models that rely on large amounts of training data and can be used as an extra layer of performance enhancer with existing caption generators.
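The abstract above mentions Prototypical Networks for few-shot verb detection. As a rough illustration of the general technique only, and not of ReCap Pro itself, the sketch below builds one prototype per class as the mean of a few support embeddings and assigns a query to the nearest prototype by squared Euclidean distance. The function names and the toy 2-D embeddings are hypothetical; in practice the embeddings would come from a learned encoder, which is omitted here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def class_prototypes(support: List[Tuple[Vector, str]]) -> Dict[str, List[float]]:
    """Average the support embeddings of each class into one prototype."""
    grouped: Dict[str, List[Vector]] = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query: Vector, prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest to the query."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Toy support set: two hypothetical verb classes with two examples each.
support_set = [
    ([0.0, 0.0], "hold"),
    ([0.0, 2.0], "hold"),
    ([10.0, 10.0], "ride"),
    ([12.0, 10.0], "ride"),
]
verb_prototypes = class_prototypes(support_set)
```

With these toy values, `classify([1.0, 1.0], verb_prototypes)` picks the `hold` prototype. Because adding a class only requires averaging a few new support embeddings, this style of classifier suits settings with few labelled samples.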
- Research Article
10
- 10.1007/s44267-025-00074-1
- Mar 17, 2025
- Visual Intelligence
- Hang Zhang + 3 more
Human-centered dynamic scene understanding plays a pivotal role in enhancing the capability of robotic and autonomous systems. Within it, video-based human-object interaction (V-HOI) detection is a crucial semantic scene understanding task that aims to comprehensively understand HOI relationships within a video to benefit the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant advances in accurate detection on specific datasets, they still lack the general reasoning ability of humans to effectively induce HOI relationships. In this study, we propose V-HOI multi-LLMs collaborated reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that can improve the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a cross-agents reasoning scheme to leverage the LLM to perform reasoning from different aspects. In the second stage, we perform multi-LLMs debate to get the final reasoning answer based on the different knowledge in different LLMs. Additionally, we develop an auxiliary training strategy using CLIP, a large vision-language model, to enhance the base V-HOI models’ discriminative ability to better cooperate with LLMs. We validate the superiority of our design by demonstrating its effectiveness in improving the predictive accuracy of the base V-HOI model through reasoning from multiple perspectives.