Fusion of Independent and Interactive Features for Human-Object Interaction Detection
Human-Object Interaction (HOI) detection, which aims to identify humans and objects with interactive behaviors in images and predict the behaviors between them, is of great significance for semantic understanding. Existing works primarily focus on exploring the fine-grained semantic features of humans and objects, as well as the spatial relationships between them. However, these methods do not leverage the contextual information within the interaction area, which could potentially be valuable for predicting interaction behavior. To investigate the impact of contextual information on behavior prediction, we propose a novel approach that extracts both independent and interactive features and fuses them. Specifically, our method extracts interaction features from the interaction region. These features are then merged with fine-grained independent features of humans and objects. Finally, the fused features are used to predict interaction behavior. In addition, the feature fusion module adds no extra storage or computation cost to our method. Experiments demonstrate the effectiveness of our method, which achieves state-of-the-art performance on two benchmark HOI datasets, namely HICO-DET and V-COCO.
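The abstract above does not specify how the interaction region is defined or how the fusion avoids extra cost; a common choice for the former is the union of the two boxes, and a parameter-free choice for the latter is element-wise addition. The sketch below illustrates those two assumptions only — the function names and design are hypothetical, not the paper's actual method:

```python
def union_box(h, o):
    """Interaction region as the union of the human and object boxes,
    each given as (x1, y1, x2, y2). A common definition of the area
    from which interaction features are pooled; an assumption here,
    not necessarily the paper's exact design."""
    return (min(h[0], o[0]), min(h[1], o[1]), max(h[2], o[2]), max(h[3], o[3]))

def fuse(independent, interactive):
    """Parameter-free fusion by element-wise addition: one way a fusion
    module can avoid adding learned weights, and hence extra storage."""
    return [a + b for a, b in zip(independent, interactive)]
```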
- Research Article
7
- 10.1109/access.2021.3113781
- Jan 1, 2021
- IEEE Access
Humans often leverage spatial clues to categorize scenes in a fraction of a second. This form of intelligence is very relevant in time-critical situations (e.g., when driving a car) and valuable to transfer to automated systems. This work investigates the predictive power of solely processing spatial clues for scene understanding in 2D images and compares such an approach with the predictive power of visual appearance. To this end, we design the laboratory task of predicting the identity of two objects (e.g., “man” and “horse”) and their relationship or predicate (e.g., “riding”) given exclusively the ground truth bounding box coordinates of both objects. We also measure the performance attainable in Human Object Interaction (HOI) detection, a real-world spatial task, which includes a setting where ground truth boxes are not available at test time. An additional goal is to identify the principles necessary to effectively represent a spatial template, that is, the visual region in which two objects involved in a relationship expressed by a predicate occur. We propose a scale-, mirror-, and translation-invariant representation that captures the spatial essence of the relationship, i.e., a canonical spatial representation. Tests in two benchmarks reveal: (1) High performance is attainable by using exclusively spatial information in all tasks. (2) In HOI detection, the canonical template outperforms the other spatial and visual representations as well as several state-of-the-art baselines. (3) Simple fusion of visual and spatial features substantially improves performance. (4) Our methods fare remarkably well with a small amount of data and rare categories. Our results obtained on the Visual Genome (VG) and the Humans Interacting with Common Objects - Detection (HICO-DET) datasets indicate that great predictive power can be obtained from spatial clues alone, opening up possibilities for performing fast scene understanding at a glance.
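A scale-, mirror-, and translation-invariant box-pair encoding of the kind this abstract describes can be sketched as follows: translate so the subject box is centered at the origin, scale by the subject box size, and mirror so the object always lies on the same side. This is an illustrative sketch of the general idea, not the paper's exact canonical representation:

```python
def canonical_pair(subj, obj):
    """Encode a (subject, object) box pair, each (x1, y1, x2, y2), in a
    translation-, scale-, and mirror-invariant way. Illustrative only."""
    sx = (subj[0] + subj[2]) / 2.0          # subject box center
    sy = (subj[1] + subj[3]) / 2.0
    scale = max(subj[2] - subj[0], subj[3] - subj[1])  # subject size as the unit

    def norm(box):
        return [(box[0] - sx) / scale, (box[1] - sy) / scale,
                (box[2] - sx) / scale, (box[3] - sy) / scale]

    s, o = norm(subj), norm(obj)
    # Mirror horizontally so the object center lies at non-negative x,
    # making left/right-flipped configurations map to the same encoding.
    if (o[0] + o[2]) / 2.0 < 0:
        s = [-s[2], s[1], -s[0], s[3]]
        o = [-o[2], o[1], -o[0], o[3]]
    return s + o
```

By construction, translating, uniformly rescaling, or horizontally mirroring both boxes leaves the encoding unchanged.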
- Research Article
- 10.1016/j.daach.2024.e00364
- Jul 24, 2024
- Digital Applications in Archaeology and Cultural Heritage
Human object interaction detection in paintings using multi-task learning
- Research Article
4
- 10.1145/3674980
- Oct 30, 2024
- ACM Transactions on Multimedia Computing, Communications, and Applications
In recent years, one-stage HOI (Human–Object Interaction) detection methods tend to divide the original task into multiple sub-tasks by using a multi-branch network structure. However, insufficient attention has been paid to information communication between these branches. The inference approach in the cascaded structure is singular, while fully parallel methods disrupt the associations between different pieces of information. Besides, noise interference may occur during the fusion of different features and thus affect the detection performance. To address these issues, this article proposes a one-stage three-branch parallel HOI detection method, which treats HOI as three separate sub-tasks (human detection, object detection, and interaction detection) and leverages three distinct reasoning relationships to generate richer relational information. Firstly, an auxiliary feature fusion (AFF) module is introduced, which integrates features originally extracted independently to form fused features enriched with supplementary information. This approach strengthens communication between branches in the network while handling the three sub-tasks concurrently, thereby facilitating the exchange of more contextual information. Secondly, to mitigate noise interference generated during the fusion process, a fusion noise suppression (FNS) module is introduced, which effectively suppresses noise and enhances the model’s performance in interaction detection tasks. Finally, experiments are conducted on two major benchmark datasets, and experimental results show that our HOI detection method is superior to previous methods. Also, ablation studies confirm the effectiveness of all the components in our proposed method.
- Conference Article
100
- 10.1109/cvpr46437.2021.01441
- Jun 1, 2021
Human-Object Interaction (HOI) detection, inferring the relationships between humans and objects from images/videos, is a fundamental task for high-level scene understanding. However, HOI detection usually suffers from the open long-tailed nature of interactions with objects, whereas humans have an extremely powerful compositional perception ability that lets them recognize rare or unseen HOI samples. Inspired by this, we devise a novel HOI compositional learning framework, termed Fabricated Compositional Learning (FCL), to address the problem of open long-tailed HOI detection. Specifically, we introduce an object fabricator to generate effective object representations, and then combine verbs and fabricated objects to compose new HOI samples. With the proposed object fabricator, we are able to generate large-scale HOI samples for rare and unseen categories to alleviate the open long-tailed issues in HOI detection. Extensive experiments on the most popular HOI detection dataset, HICO-DET, demonstrate the effectiveness of the proposed method for imbalanced HOI detection, significantly improving the state-of-the-art performance on rare and unseen HOI categories. Code is available at https://github.com/zhihou7/HOI-CL.
- Conference Article
- 10.1109/icmtma.2019.00168
- Apr 1, 2019
The rapid development of deep learning in recent years has led to breakthroughs in more and more problems, such as image recognition and object detection. Human-object interaction (HOI) detection is an important issue in image comprehension; it has higher practical value than image classification and object detection but is still not well solved. Different from HOI recognition, HOI detection must not only determine the presence of HOIs but also estimate their locations, so the problem is defined as predicting a human and an object bounding box together with an interaction class label that connects them. In this paper, we propose an approach that uses multiple features to solve the HOI detection task. We use different methods to extract visual features and spatial features from images, and fuse them to provide more effective support for HOI detection. The core of our approach is the Visual Feature Extraction Model, which is based on a convolutional neural network with an attention module and can provide more detailed visual features. We validate our approach on the recently introduced Verbs in COCO (V-COCO) dataset, and the results indicate that our approach achieves substantial improvements in HOI detection.
- Research Article
9
- 10.1016/j.cviu.2024.104091
- Jul 20, 2024
- Computer Vision and Image Understanding
UAHOI: Uncertainty-aware robust interaction learning for HOI detection
- Conference Article
224
- 10.1109/cvpr46437.2021.01165
- Jun 1, 2021
We propose HOI Transformer to tackle human object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple the HOI task into separate stages of object detection and interaction classification or introduce a surrogate interaction problem. In contrast, our method, named HOI Transformer, streamlines the HOI pipeline by eliminating the need for many hand-designed components. HOI Transformer reasons about the relations of objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to enforce HOI predictions in a unified way. Our method is conceptually much simpler and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves 26.61% AP on HICO-DET and 52.9% AP_role on V-COCO, surpassing previous methods with the advantage of being much simpler. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer.
- Research Article
3
- 10.1016/j.neunet.2023.11.002
- Nov 13, 2023
- Neural Networks
Human–Object Interaction detection via Global Context and Pairwise-level Fusion Features Integration
- Book Chapter
3
- 10.1007/978-3-031-20497-5_36
- Jan 1, 2022
In recent years, rapid progress has been made in detecting and identifying single object instances. In order to understand the situation in a scene, computers need to recognize how humans interact with surrounding objects. Human-object interaction (HOI) detection aims to identify a set of interactions in images or videos. It involves localizing the interacting subjects and objects and classifying the interaction types, and it is crucial for realizing high-level semantic understanding of human-centered scenarios. The study of HOI detection is also conducive to promoting research on other advanced visual tasks. In this paper, we introduce previous works on deep-learning-based HOI detection, organized along the two primary development trends of sequential and parallel methods. Secondly, we summarize the main challenges faced by the HOI detection task. Further, we introduce the most popular HOI detection datasets, including image and video datasets, and the main metrics. Finally, we summarize future research directions for the HOI detection task. Keywords: Human-object interaction (HOI) detection, Computer vision, Deep learning
- Research Article
4
- 10.1109/tnnls.2021.3131154
- Sep 1, 2023
- IEEE Transactions on Neural Networks and Learning Systems
The existing works on human-object interaction (HOI) detection usually rely on expensive large-scale labeled image datasets. However, in real scenes, labeled data may be insufficient, and some rare HOI categories have few samples. This poses great challenges for deep-learning-based HOI detection models. Existing works tackle it by introducing compositional learning or word embedding but still need large-scale labeled data or extremely rely on the well-learned knowledge. In contrast, the freely available unlabeled videos contain rich motion-relevant information that can help infer rare HOIs. In this article, we creatively propose a multitask learning (MTL) perspective to assist in HOI detection with the aid of motion-relevant knowledge learning on unlabeled videos. Specifically, we design the appearance reconstruction loss (ARL) and sequential motion mining module in a self-supervised manner to learn more generalizable motion representations for promoting the detection of rare HOIs. Moreover, to better transfer motion-related knowledge from unlabeled videos to HOI images, a domain discriminator is introduced to decrease the domain gap between two domains. Extensive experiments on the HICO-DET dataset with rare categories and the V-COCO dataset with minimum supervision demonstrate the effectiveness of motion-aware knowledge implied in unlabeled videos for HOI detection.
- Research Article
1
- 10.1016/j.imavis.2024.105249
- Sep 2, 2024
- Image and Vision Computing
Exploring the synergy between textual identity and visual signals in human-object interaction
- Research Article
8
- 10.1109/jbhi.2024.3525054
- May 1, 2025
- IEEE journal of biomedical and health informatics
Vision transformers have achieved remarkable success in image classification. The dual-branch vision transformer generates more features by taking advantage of feature fusion. Inspired by this, a dual-branch vision transformer with a Real-Time Share feature encoder is proposed for retinal image classification tasks. The approach processes image patches of varying sizes (base and large) through two independent branches and implements multi-stage Real-Time feature fusion via the Real-Time Share feature encoder. This encoder enables the branches to complement each other's features at each encoding stage, facilitating finer feature learning and enhancing the self-attention information passed to subsequent stages. It significantly boosts feature representation and classification performance. Additionally, a straightforward and effective feature fusion method, L-Times Attention Fusion, is proposed: vector concatenation for the Real-Time Share feature in the earlier (L-1) encoding stages and element-wise addition for overall feature fusion at the L-th stage, achieving more efficient feature integration. The method was validated on a retinal image dataset. Results show that the approach outperforms the recent Cross-ViT by 5.61% in average top-1 accuracy with lower FLOPs and fewer model parameters, without relying on pre-trained weights, highlighting stronger self-learning feature capabilities and reduced reliance on extensive pre-training data.
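The two-rule fusion this abstract describes (concatenation for the first L-1 encoder stages, element-wise addition at the L-th stage) can be sketched on plain per-stage feature vectors. The function name and shapes below are assumptions for illustration, not the paper's code:

```python
import numpy as np

def l_times_fusion(feats_a, feats_b):
    """Sketch of L-Times Attention Fusion as described in the abstract:
    fuse the two branches' per-stage features by concatenation for
    stages 1..L-1 and by element-wise addition at the final stage L.
    `feats_a` / `feats_b` are equal-length lists of np arrays, one per stage."""
    L = len(feats_a)
    fused = []
    for i in range(L - 1):
        fused.append(np.concatenate([feats_a[i], feats_b[i]]))  # stages 1..L-1
    fused.append(feats_a[-1] + feats_b[-1])                     # stage L
    return fused
```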
- Conference Article
7
- 10.1109/wacv48630.2021.00127
- Jan 1, 2021
Human object interaction (HOI) detection is an important task in image understanding and reasoning. It takes the form of an HOI triplet 〈human, verb, object〉, requiring bounding boxes for the human and object, and the action between them, for task completion. In other words, this task requires strong supervision for training, which is, however, hard to procure. A natural solution is to pursue weakly-supervised learning, where we only know the presence of certain HOI triplets in images but their exact location is unknown. Most weakly-supervised learning methods make no provision for leveraging strongly supervised data when it is available; and indeed a naive combination of these two paradigms in HOI detection fails to make them contribute to each other. In this regard we propose a mixed-supervised HOI detection pipeline: thanks to a specific design of momentum-independent learning, it learns seamlessly across these two types of supervision. Moreover, in light of the annotation insufficiency in mixed supervision, we introduce an HOI element swapping technique to synthesize diverse and hard negatives across images and improve the robustness of the model. Our method is evaluated on the challenging HICO-DET dataset. It performs close to or even better than many fully-supervised methods by using a mixed amount of strong and weak annotations; furthermore, it outperforms representative state-of-the-art weakly- and fully-supervised methods under the same supervision.
- Research Article
4
- 10.1609/aaai.v38i6.28422
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes.
- Research Article
19
- 10.1016/j.neucom.2023.126243
- Apr 27, 2023
- Neurocomputing
From detection to understanding: A survey on representation learning for human-object interaction