Effects of Motion-Relevant Knowledge From Unlabeled Video to Human-Object Interaction Detection.
Existing work on human-object interaction (HOI) detection usually relies on expensive, large-scale labeled image datasets. However, in real scenes, labeled data may be insufficient, and some rare HOI categories have few samples. This poses great challenges for deep-learning-based HOI detection models. Existing works tackle the problem by introducing compositional learning or word embeddings, but they still need large-scale labeled data or rely heavily on well-learned knowledge. In contrast, freely available unlabeled videos contain rich motion-relevant information that can help infer rare HOIs. In this article, we propose a multitask learning (MTL) perspective that assists HOI detection by learning motion-relevant knowledge from unlabeled videos. Specifically, we design an appearance reconstruction loss (ARL) and a sequential motion mining module, both self-supervised, to learn more generalizable motion representations that promote the detection of rare HOIs. Moreover, to better transfer motion-related knowledge from unlabeled videos to HOI images, a domain discriminator is introduced to reduce the gap between the two domains. Extensive experiments on the HICO-DET dataset with rare categories and the V-COCO dataset with minimum supervision demonstrate the effectiveness of the motion-aware knowledge implied in unlabeled videos for HOI detection.
- Conference Article
102
- 10.1109/cvpr46437.2021.01441
- Jun 1, 2021
Human-Object Interaction (HOI) detection, which infers the relationships between humans and objects from images or videos, is a fundamental task for high-level scene understanding. However, HOI detection usually suffers from the open long-tailed nature of interactions with objects, whereas humans have a powerful compositional perception ability that lets them recognize rare or unseen HOI samples. Inspired by this, we devise a novel HOI compositional learning framework, termed Fabricated Compositional Learning (FCL), to address the problem of open long-tailed HOI detection. Specifically, we introduce an object fabricator to generate effective object representations, and then combine verbs and fabricated objects to compose new HOI samples. With the proposed object fabricator, we are able to generate large-scale HOI samples for rare and unseen categories to alleviate the open long-tailed issues in HOI detection. Extensive experiments on the most popular HOI detection dataset, HICO-DET, demonstrate the effectiveness of the proposed method for imbalanced HOI detection and show that it significantly improves on the state-of-the-art performance for rare and unseen HOI categories. Code is available at https://github.com/zhihou7/HOI-CL.
- Research Article
1
- 10.1016/j.daach.2024.e00364
- Jul 24, 2024
- Digital Applications in Archaeology and Cultural Heritage
Human object interaction detection in paintings using multi-task learning
- Research Article
1
- 10.1016/j.imavis.2024.105249
- Sep 2, 2024
- Image and Vision Computing
Exploring the synergy between textual identity and visual signals in human-object interaction
- Conference Article
- 10.1109/icmtma.2019.00168
- Apr 1, 2019
The rapid development of deep learning in recent years has led to breakthroughs in more and more problems, such as image recognition and object detection. Human-object interaction (HOI) detection is an important problem in image comprehension that has higher practical value than image classification and object detection, but it is still not well solved. Unlike HOI recognition, HOI detection must not only determine the presence of HOIs but also estimate their locations, so the problem is defined as predicting a human bounding box and an object bounding box together with an interaction class label that connects them. In this paper, we propose an approach that uses multiple features to solve the HOI detection task. We use different methods to extract visual features and spatial features from images, and fuse them to provide more effective support for HOI detection. The core of our approach is the Visual Feature Extracted Model, which is based on a convolutional neural network with an attention module and can provide more detailed visual features. We validate our approach on the recently introduced Verbs in COCO (V-COCO) dataset, and the results indicate that our approach achieves extraordinary improvements in HOI detection.
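The problem definition in the abstract above (a human box, an object box, and an interaction label connecting them) can be illustrated with a small data structure. The sketch below is not taken from the paper: the names `HOIInstance`, `iou`, and `is_match` are hypothetical, and the 0.5 IoU threshold is an assumption based on the criterion commonly used when scoring HOI detections on benchmarks such as V-COCO and HICO-DET.

```python
from dataclasses import dataclass
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HOIInstance:
    """One HOI detection: a human box, an object box, and the label linking them."""
    human_box: Box
    object_box: Box
    interaction: str


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: HOIInstance, gt: HOIInstance, thr: float = 0.5) -> bool:
    """A prediction counts as correct only if the label and BOTH boxes agree.

    The threshold value is an assumption, not something stated in the abstract.
    """
    return (
        pred.interaction == gt.interaction
        and iou(pred.human_box, gt.human_box) >= thr
        and iou(pred.object_box, gt.object_box) >= thr
    )
```

Requiring both boxes to overlap is what separates detection from recognition here: a correct label attached to a poorly localized human or object is still counted as a miss.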
- Conference Article
308
- 10.1109/cvpr42600.2020.00056
- Jun 1, 2020
We propose a single-stage Human-Object Interaction (HOI) detection method that outperforms all existing methods on the HICO-DET dataset at 37 fps on a single Titan XP GPU. It is the first real-time HOI detection method. Conventional HOI detection methods are composed of two stages, i.e., human-object proposal generation and proposal classification. Their effectiveness and efficiency are limited by the sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet <human point, interaction point, object point>. Human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely a point detection branch and a point matching branch. The point detection branch predicts the three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. The human point and the object point originating from the same interaction point are considered a matched pair. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection. Isolated detection boxes that are unlikely to form meaningful HOI triplets are suppressed, which increases the precision of HOI detection. Moreover, the matching between human and object detection boxes is applied only around a limited number of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new application-oriented database named HOI-A, which serves as a good supplement to the existing datasets.
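The point-triplet geometry described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not the PPDM implementation: the function names are hypothetical, and choosing the detected center nearest to the displaced interaction point is an assumed matching rule (the abstract states only that two displacements are predicted from each interaction point).

```python
import numpy as np


def interaction_point(human_center, object_center):
    """Midpoint of the human and object box centers, as defined in the abstract."""
    h = np.asarray(human_center, dtype=float)
    o = np.asarray(object_center, dtype=float)
    return (h + o) / 2.0


def match_pair(inter_pt, disp_to_human, disp_to_object, human_centers, object_centers):
    """Pick the detected human and object centers implied by one interaction point.

    Assumption: the matched detection is the one whose center lies closest to
    (interaction point + predicted displacement).
    """
    inter_pt = np.asarray(inter_pt, dtype=float)
    target_h = inter_pt + np.asarray(disp_to_human, dtype=float)
    target_o = inter_pt + np.asarray(disp_to_object, dtype=float)
    humans = np.asarray(human_centers, dtype=float)
    objects = np.asarray(object_centers, dtype=float)
    h_idx = int(np.argmin(np.linalg.norm(humans - target_h, axis=1)))
    o_idx = int(np.argmin(np.linalg.norm(objects - target_o, axis=1)))
    return h_idx, o_idx
```

Because matching starts from a small set of candidate interaction points rather than from every human-object combination, the cost grows with the number of candidate interactions instead of the number of possible pairs, which is the saving the abstract refers to.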
- Research Article
11
- 10.1016/j.cviu.2024.104091
- Jul 20, 2024
- Computer Vision and Image Understanding
UAHOI: Uncertainty-aware robust interaction learning for HOI detection
- Research Article
13
- 10.1007/s10489-020-01794-1
- Jul 21, 2020
- Applied Intelligence
Human-Object Interaction (HOI) detection is a new kind of human-centric visual relationship detection task that is significant for the deep understanding of visual scenes. Due to the complexity of the visual scene in an image, HOI detection is still a challenging task, the most critical part of which is feature extraction and representation. Some existing approaches rely solely on local region information for HOI detection without using global contextual information, even though global contextual information contributes to this task for some HOI categories. Other approaches incorporate global contextual information for HOI detection while losing local region information. In this work, we propose a multi-stream neural network architecture composed of three special modules that employs both local region information and global contextual information for HOI detection. This model can detect HOI categories based not only on local region information but also on global contextual information. Our model more fully considers all HOI categories in the dataset. Compared with other existing approaches, the proposed model shows improved performance on the V-COCO and HICO-DET benchmark datasets, especially when predicting rare HOI categories.
- Research Article
7
- 10.1109/access.2021.3113781
- Jan 1, 2021
- IEEE Access
Humans often leverage spatial clues to categorize scenes in a fraction of a second. This form of intelligence is very relevant in time-critical situations (e.g., when driving a car) and valuable to transfer to automated systems. This work investigates the predictive power of solely processing spatial clues for scene understanding in 2D images and compares such an approach with the predictive power of visual appearance. To this end, we design the laboratory task of predicting the identity of two objects (e.g., "man" and "horse") and their relationship or predicate (e.g., "riding") given exclusively the ground truth bounding box coordinates of both objects. We also measure the performance attainable in Human Object Interaction (HOI) detection, a real-world spatial task, which includes a setting where ground truth boxes are not available at test time. An additional goal is to identify the principles necessary to effectively represent a spatial template, that is, the visual region in which two objects involved in a relationship expressed by a predicate occur. We propose a scale-, mirror-, and translation-invariant representation that captures the spatial essence of the relationship, i.e., a canonical spatial representation. Tests in two benchmarks reveal: (1) High performance is attainable by using exclusively spatial information in all tasks. (2) In HOI detection, the canonical template outperforms the rest of spatial, visual, and several state-of-the-art baselines. (3) Simple fusion of visual and spatial features substantially improves performance. (4) Our methods fare remarkably well with a small amount of data and rare categories. Our results obtained on the Visual Genome (VG) and the Humans Interacting with Common Objects - Detection (HICO-DET) datasets indicate that great predictive power can be obtained from spatial clues alone, opening up possibilities for performing fast scene understanding at a glance.
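One way to obtain a scale-, mirror-, and translation-invariant encoding of a box pair is sketched below. It is an assumption-laden illustration, not the representation proposed in the paper above: the function name `canonical_pair` is hypothetical, and the specific normalization choices (origin at the subject center, scaling by the union-box diagonal, flipping so the object lies to the right) are picked only to demonstrate the three invariances.

```python
import numpy as np


def canonical_pair(subject_box, object_box):
    """Encode two (x1, y1, x2, y2) boxes as an 8-vector invariant to
    translation, uniform scaling, and horizontal mirroring (illustrative only)."""
    s = np.asarray(subject_box, dtype=float)
    o = np.asarray(object_box, dtype=float)

    # Translation invariance: move the origin to the subject box center.
    cx, cy = (s[0] + s[2]) / 2.0, (s[1] + s[3]) / 2.0
    shift = np.array([cx, cy, cx, cy])
    s, o = s - shift, o - shift

    # Scale invariance: divide by the diagonal of the box enclosing both boxes.
    width = max(s[2], o[2]) - min(s[0], o[0])
    height = max(s[3], o[3]) - min(s[1], o[1])
    diag = float(np.hypot(width, height))
    if diag == 0.0:
        raise ValueError("boxes must not both be degenerate points")
    s, o = s / diag, o / diag

    # Mirror invariance: flip horizontally when the object center is left of
    # the subject center, so the object always ends up on the right.
    if (o[0] + o[2]) / 2.0 < 0.0:
        s = np.array([-s[2], s[1], -s[0], s[3]])
        o = np.array([-o[2], o[1], -o[0], o[3]])

    return np.concatenate([s, o])
```

With such an encoding, a rider to the left of a horse and the mirrored scene at a different image position and size map to the same vector, so a classifier trained on the vectors sees one spatial template rather than many variants.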
- Research Article
6
- 10.1109/mce.2023.3343919
- Nov 1, 2024
- IEEE Consumer Electronics Magazine
This paper systematically summarizes and discusses recent research on image-based human-object interaction (HOI) detection, which aims to detect human-object pairs and recognize the interactive behaviors between humans and objects in an image. It has plenty of applications and can serve as the basis to assist higher-level tasks of visual understanding. We introduce existing methods by categorizing them into two main groups based on the model structure: one-stage and two-stage approaches. We further divide one-stage methods into point-based, region-based, and query-based methods. Similarly, the two-stage methods are divided into HOI detection with multi-stream modeling, HOI detection with human parts and pose, HOI detection with compositional learning, HOI detection with graph-based modeling, and HOI detection with query-based modeling. According to this taxonomy, we also summarize and analyze the core ideas behind each strategy. Then, we present the details of the experimental protocols, evaluation metrics, datasets, and the evaluation results of the most recent representative methods. Finally, we discuss the HOI detection task's main open challenges and future trends.
- Conference Article
227
- 10.1109/cvpr46437.2021.01165
- Jun 1, 2021
We propose HOI Transformer to tackle human object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple HOI task into separated stages of object detection and interaction classification or introduce surrogate interaction problem. In contrast, our method, named HOI Transformer, streamlines the HOI pipeline by eliminating the need for many hand-designed components. HOI Transformer reasons about the relations of objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to force HOI predictions in a unified way. Our method is conceptually much simpler and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves 26.61% AP on HICO-DET and 52.9% AP_role on V-COCO, surpassing previous methods with the advantage of being much simpler. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer.
- Book Chapter
3
- 10.1007/978-3-031-20497-5_36
- Jan 1, 2022
In recent years, rapid progress has been made in detecting and identifying single object instances. To understand the situation in a scene, computers need to recognize how humans interact with surrounding objects. Human-object interaction (HOI) detection aims to identify a set of interactions in images or videos. It involves localizing the interacting subjects and objects and classifying the interaction types. It is crucial for high-level semantic understanding of people-centered scenarios. The study of HOI detection is also conducive to promoting research on other advanced visual tasks. In this paper, we first review previous deep-learning-based work on HOI detection, organized by the two primary development trends: sequential and parallel methods. Secondly, we summarize the main challenges faced by the HOI detection task. Further, we introduce the most popular HOI detection datasets, including image and video datasets, and the main metrics. Finally, we summarize future research directions for the HOI detection task. Keywords: Human-object interaction (HOI) detection; Computer vision; Deep learning
- Research Article
9
- 10.1109/tpami.2024.3386891
- Oct 1, 2024
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Human-Object Interaction (HOI) detection aims to understand human activities by detecting interaction triplets. Previous HOI detection methods adopt a two-stage instance-driven paradigm. Unfortunately, many non-interactive human-object pairs generated by the first stage are the main obstacle impeding HOI detectors from high efficiency and promising performance. To remedy this, we propose a novel top-down interaction-driven paradigm, detecting interactions first and bridging interactive human-object pairs through interactions. We formulate HOI as a point triplet <human point, interaction point, object point> and design a Parallel Point Detection and Matching (PPDM) framework. We further take advantage of two-stage methods and propose a novel framework, PPDM++, that detects the interactive human-object pairs by PPDM, then extracts region features for each pair to predict actions. The core of PPDM/PPDM++ is to convert the instance-driven bottom-up paradigm to an interaction-driven top-down paradigm, thus avoiding additional computation costs from traversing a tremendous number of non-interactive pairs. Benefiting from the advanced paradigm, PPDM/PPDM++ has achieved significant performance gains with high efficiency. PPDM-DLA-34 has achieved 19.94 mAP with 42 FPS as the first real-time HOI detector, and PPDM++-SwinB achieves 30.1 mAP with 17 FPS on HICO-DET dataset. We also built an application-oriented database named HOI-A, a supplement to the existing datasets.
- Conference Article
7
- 10.1109/wacv48630.2021.00127
- Jan 1, 2021
Human object interaction (HOI) detection is an important task in image understanding and reasoning. Its output takes the form of an HOI triplet 〈human, verb, object〉, which requires bounding boxes for the human and the object as well as the action between them. In other words, this task requires strong supervision for training, which is hard to procure. A natural way to overcome this is weakly-supervised learning, where we know only that certain HOI triplets are present in an image but not their exact locations. Most weakly-supervised learning methods make no provision for leveraging strongly supervised data when it is available; indeed, a naive combination of these two paradigms in HOI detection fails to make them contribute to each other. In this regard, we propose a mixed-supervised HOI detection pipeline that, thanks to a specific design of momentum-independent learning, learns seamlessly across these two types of supervision. Moreover, in light of the annotation insufficiency in mixed supervision, we introduce an HOI element swapping technique to synthesize diverse and hard negatives across images and improve the robustness of the model. Our method is evaluated on the challenging HICO-DET dataset. It performs close to or even better than many fully-supervised methods by using a mixed amount of strong and weak annotations; furthermore, it outperforms representative state-of-the-art weakly- and fully-supervised methods under the same supervision.
- Research Article
20
- 10.1016/j.neucom.2023.126243
- Apr 27, 2023
- Neurocomputing
From detection to understanding: A survey on representation learning for human-object interaction
- Research Article
5
- 10.1609/aaai.v38i6.28422
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes.