From detection to understanding: A survey on representation learning for human-object interaction
- Research Article
1
- 10.1016/j.daach.2024.e00364
- Jul 24, 2024
- Digital Applications in Archaeology and Cultural Heritage
Human object interaction detection in paintings using multi-task learning
- Conference Article
102
- 10.1109/cvpr46437.2021.01441
- Jun 1, 2021
Human-Object Interaction (HOI) detection, inferring the relationships between humans and objects from images/videos, is a fundamental task for high-level scene understanding. However, HOI detection usually suffers from the open long-tailed nature of interactions with objects, while humans have an extremely powerful compositional perception ability to recognize rare or unseen HOI samples. Inspired by this, we devise a novel HOI compositional learning framework, termed Fabricated Compositional Learning (FCL), to address the problem of open long-tailed HOI detection. Specifically, we introduce an object fabricator to generate effective object representations, and then combine verbs and fabricated objects to compose new HOI samples. With the proposed object fabricator, we are able to generate large-scale HOI samples for rare and unseen categories to alleviate the open long-tailed issues in HOI detection. Extensive experiments on the most popular HOI detection dataset, HICO-DET, demonstrate the effectiveness of the proposed method for imbalanced HOI detection and show significant improvements over the state of the art on rare and unseen HOI categories. Code is available at https://github.com/zhihou7/HOI-CL.
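The compositional idea in this abstract (pair verb representations with fabricated object representations to obtain extra training samples) can be illustrated with a minimal sketch. It is not the FCL implementation: the names `fabricate_object` and `compose_samples`, the use of Gaussian noise around a class embedding in place of a learned fabricator, and the `valid_pairs` filter are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Set, Tuple


def fabricate_object(class_embedding: List[float], noise_scale: float,
                     rng: random.Random) -> List[float]:
    """Hypothetical stand-in for an object fabricator: perturb a class
    embedding with Gaussian noise (the real fabricator is a learned module)."""
    return [x + rng.gauss(0.0, noise_scale) for x in class_embedding]


def compose_samples(verb_features: Dict[str, List[float]],
                    object_embeddings: Dict[str, List[float]],
                    valid_pairs: Set[Tuple[str, str]],
                    noise_scale: float = 0.1,
                    seed: int = 0) -> List[Tuple[Tuple[str, str], List[float]]]:
    """Pair every verb feature with a fabricated object feature and keep only
    the (verb, object) combinations listed as meaningful in valid_pairs."""
    rng = random.Random(seed)
    samples = []
    for verb, verb_feat in verb_features.items():
        for obj, embedding in object_embeddings.items():
            if (verb, obj) not in valid_pairs:
                continue  # skip implausible compositions such as ("eat", "horse")
            fake_obj = fabricate_object(embedding, noise_scale, rng)
            # A composed HOI sample: the verb feature concatenated with the object feature.
            samples.append(((verb, obj), verb_feat + fake_obj))
    return samples


verbs = {"ride": [1.0, 0.0], "eat": [0.0, 1.0]}
objects = {"horse": [0.5, 0.5], "apple": [0.2, 0.8]}
pairs = {("ride", "horse"), ("eat", "apple")}
composed = compose_samples(verbs, objects, pairs)
```

Because fabricated objects need no real image region, a rare pair can be sampled as often as needed, which is the sense in which composition can ease a long-tailed label distribution.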
- Research Article
11
- 10.1016/j.cviu.2024.104091
- Jul 20, 2024
- Computer Vision and Image Understanding
UAHOI: Uncertainty-aware robust interaction learning for HOI detection
- Conference Article
- 10.1109/icmtma.2019.00168
- Apr 1, 2019
The rapid development of deep learning in recent years has led to breakthroughs in more and more problems, such as image recognition and object detection. Human-object interaction (HOI) detection is an important problem in image comprehension that has higher practical value than image classification and object detection, but it is still not well solved. Unlike HOI recognition, HOI detection must not only determine the presence of HOIs but also estimate their locations, so the problem is defined as predicting a human and an object bounding box with an interaction class label that connects them. In this paper, we propose an approach that uses multiple features to solve the HOI detection task. We use different methods to extract visual features and spatial features from images, and use their fusion to provide more effective support for HOI detection. The core of our approach is the Visual Feature Extracted Model, which is based on a convolutional neural network with an attention module and can provide more detailed visual features. We validate our approach on the recently introduced Verbs in COCO (V-COCO) dataset, and the results indicate that our approach achieves substantial improvements in HOI detection.
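The task definition above (a human box, an object box, and an interaction label connecting them) maps naturally onto a small record type. The sketch below is illustrative only: `HOIDetection` and `spatial_feature` are hypothetical names, and the particular spatial feature (normalized center offset plus area ratio) is an assumption, since the abstract does not say which spatial features the authors use.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class HOIDetection:
    """One predicted interaction: two boxes linked by an interaction label."""
    human_box: Box
    object_box: Box
    interaction: str
    score: float


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def spatial_feature(human_box: Box, object_box: Box) -> List[float]:
    """Assumed spatial descriptor: offset of the object center from the human
    center, normalized by the human box size, plus the object/human area ratio."""
    hx, hy = center(human_box)
    ox, oy = center(object_box)
    width = human_box[2] - human_box[0]
    height = human_box[3] - human_box[1]
    return [(ox - hx) / width, (oy - hy) / height, area(object_box) / area(human_box)]


detection = HOIDetection(human_box=(0.0, 0.0, 10.0, 20.0),
                         object_box=(10.0, 0.0, 20.0, 10.0),
                         interaction="hold", score=0.9)
feature = spatial_feature(detection.human_box, detection.object_box)
```

In a multi-feature pipeline of the kind described, a vector like `feature` would be fused with visual features before interaction classification. The sketch assumes a non-degenerate human box (positive width and height).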
- Research Article
6
- 10.1109/mce.2023.3343919
- Nov 1, 2024
- IEEE Consumer Electronics Magazine
This paper systematically summarizes and discusses recent research on image-based human-object interaction (HOI) detection, which aims to detect human-object pairs and recognize the interactive behaviors between humans and objects in an image. It has plenty of applications and can serve as the basis to assist higher-level tasks of visual understanding. We introduce existing methods by categorizing them into two main groups based on the model structure: one-stage and two-stage approaches. We further divide one-stage methods into point-based, region-based, and query-based methods. Similarly, the two-stage methods are divided into HOI detection with multi-stream modeling, HOI detection with human parts and pose, HOI detection with compositional learning, HOI detection with graph-based modeling, and HOI detection with query-based modeling. According to this taxonomy, we also summarize and analyze the core ideas behind each strategy. Then, we present the details of the experimental protocols, evaluation metrics, datasets, and the evaluation results of the most recent representative methods. Finally, we discuss the HOI detection task's main open challenges and future trends.
- Book Chapter
3
- 10.1007/978-3-031-20497-5_36
- Jan 1, 2022
In recent years, rapid progress has been made in detecting and identifying single object instances. To understand what is happening in a scene, computers need to recognize how humans interact with surrounding objects. Human-object interaction (HOI) detection aims to identify a set of interactions in images or videos. It involves localizing the interacting subjects and objects and classifying the interaction types. It is crucial for high-level semantic understanding of human-centered scenes. The study of HOI detection also helps advance research on other high-level visual tasks. In this paper, we first review previous deep-learning-based work on HOI detection, organized around the two primary development trends of sequential and parallel methods. Second, we summarize the main challenges faced by the HOI detection task. Further, we introduce the most popular HOI detection datasets, including image and video datasets, and the main metrics. Finally, we summarize future research directions for the HOI detection task.
Keywords: Human-object interaction (HOI) detection; Computer vision; Deep learning
- Conference Article
227
- 10.1109/cvpr46437.2021.01165
- Jun 1, 2021
We propose HOI Transformer to tackle human object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple HOI task into separated stages of object detection and interaction classification or introduce surrogate interaction problem. In contrast, our method, named HOI Transformer, streamlines the HOI pipeline by eliminating the need for many hand-designed components. HOI Transformer reasons about the relations of objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to force HOI predictions in a unified way. Our method is conceptually much simpler and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves 26.61% AP on HICO-DET and 52.9% AP_role on V-COCO, surpassing previous methods with the advantage of being much simpler. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer.
- Research Article
4
- 10.1109/tnnls.2021.3131154
- Sep 1, 2023
- IEEE Transactions on Neural Networks and Learning Systems
Existing works on human-object interaction (HOI) detection usually rely on expensive large-scale labeled image datasets. However, in real scenes, labeled data may be insufficient, and some rare HOI categories have few samples. This poses great challenges for deep-learning-based HOI detection models. Existing works tackle it by introducing compositional learning or word embeddings but still need large-scale labeled data or rely heavily on well-learned prior knowledge. In contrast, freely available unlabeled videos contain rich motion-relevant information that can help infer rare HOIs. In this article, we propose a multitask learning (MTL) perspective to assist HOI detection with the aid of motion-relevant knowledge learned from unlabeled videos. Specifically, we design the appearance reconstruction loss (ARL) and a sequential motion mining module in a self-supervised manner to learn more generalizable motion representations for promoting the detection of rare HOIs. Moreover, to better transfer motion-related knowledge from unlabeled videos to HOI images, a domain discriminator is introduced to reduce the gap between the two domains. Extensive experiments on the HICO-DET dataset with rare categories and the V-COCO dataset with minimum supervision demonstrate the effectiveness of motion-aware knowledge implied in unlabeled videos for HOI detection.
- Research Article
1
- 10.1016/j.imavis.2024.105249
- Sep 2, 2024
- Image and Vision Computing
Exploring the synergy between textual identity and visual signals in human-object interaction
- Research Article
5
- 10.1609/aaai.v38i6.28422
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes.
- Conference Article
7
- 10.1109/wacv48630.2021.00127
- Jan 1, 2021
Human object interaction (HOI) detection is an important task in image understanding and reasoning. It takes the form of an HOI triplet 〈human, verb, object〉, requiring bounding boxes for the human and the object, and the action between them, to complete the task. In other words, this task requires strong supervision for training, which is hard to procure. A natural way to overcome this is to pursue weakly-supervised learning, where we only know the presence of certain HOI triplets in images but their exact locations are unknown. Most weakly-supervised learning methods make no provision for leveraging data with strong supervision when it is available; indeed, a naive combination of these two paradigms in HOI detection fails to make them benefit each other. In this regard, we propose a mixed-supervised HOI detection pipeline, built on a specific design of momentum-independent learning that learns seamlessly across these two types of supervision. Moreover, in light of the annotation insufficiency in mixed supervision, we introduce an HOI element swapping technique to synthesize diverse and hard negatives across images and improve the robustness of the model. Our method is evaluated on the challenging HICO-DET dataset. It performs close to or even better than many fully-supervised methods by using a mixed amount of strong and weak annotations; furthermore, it outperforms representative state-of-the-art weakly- and fully-supervised methods under the same supervision.
- Conference Article
- 10.1117/12.2606873
- Nov 11, 2021
The goal of human-object interaction (HOI) detection is to localize both the human and the object in a picture and recognize the interactions between them. The cues for an HOI are typically scattered across the image. Traditional CNN-based methods are unable to aggregate this scattered information. Many newer methods utilize contextual features cropped from the CNN outputs, which are sometimes not effective enough. To overcome this challenge, we utilize the deformable transformer to aggregate the whole feature output from the CNNs. The attention mechanism and query-based predictions are the keys. The success of methods based on graph neural networks has shown the attention mechanism to be effective at aggregating contextual information image-wide. The queries can extract the features of each human-object pair without mixing in the features of other instances. The deformable transformer can extract effective embeddings, and the prediction heads can be fairly simple. Experimental results show that the proposed method is effective in HOI detection.
- Conference Article
28
- 10.1109/iccv48922.2021.01322
- Oct 1, 2021
In this work, we study the problem of human-object interaction (HOI) detection with large vocabulary object categories. Previous HOI studies are mainly conducted in the regime of limited object categories (e.g., 80 categories). Their solutions may face new difficulties in both object detection and interaction classification due to the increasing diversity of objects (e.g., 1000 categories). Different from previous methods, we formulate HOI detection as a query problem. We propose a unified model to jointly discover the target objects and predict the corresponding interactions based on the human queries, thereby eliminating the need for generic object detectors, extra steps to associate human-object instances, and multi-stream interaction recognition. This is achieved by a repurposed Transformer unit and a novel cascade detection over multi-scale feature maps. We observe that such a highly-coupled solution brings benefits for both object detection and interaction classification in a large vocabulary setting. To study the new challenges of large vocabulary HOI detection, we assemble two datasets from the publicly available SWiG and 100 Days of Hands datasets. Experiments on these datasets validate that our proposed method can achieve a notable mAP improvement on HOI detection with a faster inference speed than existing one-stage HOI detectors. Our code is available at https://github.com/scwangdyd/large_vocabulary_hoi_detection.
- Conference Article
308
- 10.1109/cvpr42600.2020.00056
- Jun 1, 2020
We propose a single-stage Human-Object Interaction (HOI) detection method that outperforms all existing methods on the HICO-DET dataset at 37 fps on a single Titan XP GPU. It is the first real-time HOI detection method. Conventional HOI detection methods are composed of two stages, i.e., human-object proposal generation and proposal classification. Their effectiveness and efficiency are limited by the sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet 〈human point, interaction point, object point〉. Human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely a point detection branch and a point matching branch. The point detection branch predicts the three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. The human point and the object point originating from the same interaction point are considered a matched pair. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection. Isolated detection boxes that are unlikely to form meaningful HOI triplets are suppressed, which increases the precision of HOI detection. Moreover, the matching between human and object detection boxes is only applied around a limited number of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new application-oriented database named HOI-A, which serves as a good supplement to the existing datasets.
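The matching step described in this abstract can be sketched in a few lines: each interaction point carries two predicted displacements, and the human and object centers nearest to the displaced locations form the matched pair. This is a simplified illustration, not the PPDM code. The names `InteractionPoint` and `match_triplets`, the plain Euclidean nearest-neighbor rule, and the `max_dist` rejection threshold are assumptions.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class InteractionPoint:
    location: Point   # detected interaction point (ideally the human/object midpoint)
    to_human: Point   # predicted displacement toward the human center
    to_object: Point  # predicted displacement toward the object center
    verb: str


def distance(a: Point, b: Point) -> float:
    return hypot(a[0] - b[0], a[1] - b[1])


def nearest(target: Point, candidates: List[Point]) -> int:
    """Index of the candidate center closest to the target location."""
    return min(range(len(candidates)), key=lambda i: distance(target, candidates[i]))


def match_triplets(interactions: List[InteractionPoint],
                   human_centers: List[Point],
                   object_centers: List[Point],
                   max_dist: float = 10.0) -> List[Tuple[int, str, int]]:
    """Return (human index, verb, object index) for each interaction point whose
    displaced locations land close enough to a detected human and object."""
    if not human_centers or not object_centers:
        return []
    triplets = []
    for ip in interactions:
        h_target = (ip.location[0] + ip.to_human[0], ip.location[1] + ip.to_human[1])
        o_target = (ip.location[0] + ip.to_object[0], ip.location[1] + ip.to_object[1])
        h = nearest(h_target, human_centers)
        o = nearest(o_target, object_centers)
        # Interaction points with no nearby detection are dropped (assumed threshold rule).
        if (distance(h_target, human_centers[h]) <= max_dist
                and distance(o_target, object_centers[o]) <= max_dist):
            triplets.append((h, ip.verb, o))
    return triplets


humans = [(10.0, 10.0), (100.0, 100.0)]
objects = [(30.0, 10.0), (200.0, 200.0)]
points = [
    InteractionPoint((20.0, 10.0), (-10.0, 0.0), (10.0, 0.0), "hold"),
    InteractionPoint((300.0, 300.0), (0.0, 0.0), (0.0, 0.0), "ride"),
]
matched = match_triplets(points, humans, objects)
```

Matching runs once per candidate interaction point rather than once per human-object pair, which is consistent with the abstract's point that restricting matching to filtered interaction points saves computation.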
- Conference Article
- 10.1117/12.2604708
- Aug 6, 2021
In recent years, deep neural networks have achieved impressive progress in object detection. However, detecting the interactions between objects is still challenging. Many researchers pay attention to human-object interaction (HOI) detection as a basic task in detailed scene understanding. Most conventional HOI detectors work in a two-stage manner and are usually slow at inference. One-stage methods that directly detect HOI triplets in parallel break through the limitations of object detection, but the extracted features are still insufficient. To overcome these drawbacks, we propose an improved one-stage HOI detection approach, in which an attention aggregation module and a dynamic point matching strategy play key roles. The attention aggregation explicitly enhances the semantic expressiveness of interaction points by aggregating contextually important information, while the matching strategy can effectively filter negative HOI pairs at the inference stage. Extensive experiments on two challenging HOI detection benchmarks, V-COCO and HICO-DET, show that our method achieves competitive performance compared to the state of the art without any additional human pose or language features.