Auxiliary Feature Fusion and Noise Suppression for HOI Detection

Abstract

In recent years, one-stage HOI (Human–Object Interaction) detection methods have tended to divide the original task into multiple sub-tasks by using a multi-branch network structure. However, insufficient attention has been paid to information communication between these branches. Cascaded structures offer only a single inference path, while fully parallel methods disrupt the associations between different pieces of information. In addition, noise interference may arise when different features are fused, degrading detection performance. To address these issues, this article proposes a one-stage, three-branch parallel HOI detection method, which treats HOI detection as three separate sub-tasks (human detection, object detection, and interaction detection) and leverages three distinct reasoning relationships to generate richer relational information. First, an auxiliary feature fusion (AFF) module is introduced, which integrates features originally extracted independently to form fused features enriched with supplementary information. This approach strengthens communication between branches in the network while handling the three sub-tasks concurrently, thereby facilitating the exchange of more contextual information. Second, to mitigate the noise interference generated during the fusion process, a fusion noise suppression (FNS) module is introduced, which suppresses noise and improves the model's performance on interaction detection. Finally, experiments on two major benchmark datasets show that the proposed HOI detection method outperforms previous methods, and ablation studies confirm the effectiveness of each component.
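
The abstract does not describe the internals of the AFF or FNS modules, so the following is only a hypothetical illustration of the general idea: one branch's features are enriched with features from the other two branches, and small-magnitude values in the fused result are then damped. The function names `fuse_features` and `soft_threshold`, the weight `aux_weight`, and the threshold `tau` are all assumptions made for this sketch. Soft thresholding is a generic denoising operation standing in for FNS; it is not the authors' method.

```python
import math

def fuse_features(own, aux_a, aux_b, aux_weight=0.5):
    """Hypothetical fusion: add a weighted sum of two auxiliary branches
    to a branch's own feature vector (element-wise)."""
    if not (len(own) == len(aux_a) == len(aux_b)):
        raise ValueError("feature vectors must have the same length")
    return [o + aux_weight * (a + b) for o, a, b in zip(own, aux_a, aux_b)]

def soft_threshold(features, tau=0.1):
    """Generic soft thresholding (a stand-in for noise suppression):
    shrink every value toward zero by tau and zero out values below tau."""
    out = []
    for x in features:
        magnitude = abs(x) - tau
        out.append(math.copysign(magnitude, x) if magnitude > 0 else 0.0)
    return out

# Toy feature vectors for the three branches (made-up numbers).
human_feat = [1.0, 0.0, -2.0]
object_feat = [0.5, 0.1, 0.0]
interaction_feat = [0.5, -0.1, 1.0]

# Enrich the interaction branch with the other two, then damp weak responses.
fused = fuse_features(interaction_feat, human_feat, object_feat)
cleaned = soft_threshold(fused, tau=0.1)
print(fused, cleaned)
```

In a trained network both steps would be learned layers rather than fixed arithmetic; the sketch only shows the order of operations the abstract implies, fusion first and suppression second.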

Similar Papers
  • Research Article
  • Citations: 1
  • 10.1016/j.daach.2024.e00364
Human object interaction detection in paintings using multi-task learning
  • Jul 24, 2024
  • Digital Applications in Archaeology and Cultural Heritage
  • Maya Antoun + 1 more

  • Conference Article
  • 10.1109/icmtma.2019.00168
An Optimization Model for Human-Object Interaction Detection Inspired by Multi-features
  • Apr 1, 2019
  • Hailan Kuang + 3 more

The rapid development of deep learning in recent years has led to breakthroughs in a growing number of problems, such as image recognition and object detection. Human-object interaction (HOI) detection is an important problem in image comprehension; it has higher practical value than image classification and object detection but is still not well solved. Unlike HOI recognition, HOI detection must not only determine the presence of HOIs but also estimate their locations, so the problem is defined as predicting a human bounding box and an object bounding box together with an interaction class label that connects them. In this paper, we propose an approach that uses multiple features to solve the HOI detection task. We use different methods to extract visual features and spatial features from images, and use their fusion to provide more effective support for HOI detection. The core of our approach is the Visual Feature Extracted Model, which is based on a convolutional neural network with an attention module and can provide more detailed visual features. We validate our approach on the recently introduced Verbs in COCO (V-COCO) dataset, and the results indicate that our approach achieves considerable improvements in HOI detection.
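
The task definition in this abstract (a human box, an object box, and an interaction label connecting them) can be made concrete with a small data structure. The sketch below is illustrative only: the names `HOI`, `iou`, and `is_match` are hypothetical, and the 0.5 IoU threshold is an assumption based on common HOI benchmark practice, not something the abstract states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HOI:
    """One HOI instance: boxes are (x1, y1, x2, y2) in pixels."""
    human_box: tuple
    object_box: tuple
    interaction: str

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, gt, threshold=0.5):
    """A prediction matches a ground truth when the label agrees and both
    boxes overlap enough (the threshold is an assumed convention)."""
    return (
        pred.interaction == gt.interaction
        and iou(pred.human_box, gt.human_box) >= threshold
        and iou(pred.object_box, gt.object_box) >= threshold
    )

gt = HOI((0, 0, 10, 10), (20, 20, 30, 30), "hold")
good = HOI((1, 1, 10, 10), (20, 20, 30, 29), "hold")
wrong_verb = HOI((1, 1, 10, 10), (20, 20, 30, 29), "ride")
print(is_match(good, gt), is_match(wrong_verb, gt))
```

This makes the difference from HOI recognition visible: a correct label alone is not enough, because both boxes must also be localized.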

  • Conference Article
  • Citations: 102
  • 10.1109/cvpr46437.2021.01441
Detecting Human-Object Interaction via Fabricated Compositional Learning
  • Jun 1, 2021
  • Zhi Hou + 4 more

Human-Object Interaction (HOI) detection, which infers the relationships between humans and objects from images or videos, is a fundamental task for high-level scene understanding. However, HOI detection usually suffers from the open long-tailed nature of interactions with objects, whereas humans have a powerful compositional perception ability for recognizing rare or unseen HOI samples. Inspired by this, we devise a novel HOI compositional learning framework, termed Fabricated Compositional Learning (FCL), to address the problem of open long-tailed HOI detection. Specifically, we introduce an object fabricator to generate effective object representations, and then combine verbs and fabricated objects to compose new HOI samples. With the proposed object fabricator, we are able to generate large-scale HOI samples for rare and unseen categories to alleviate the open long-tailed issues in HOI detection. Extensive experiments on the most popular HOI detection dataset, HICO-DET, demonstrate the effectiveness of the proposed method for imbalanced HOI detection, which significantly improves the state-of-the-art performance on rare and unseen HOI categories. Code is available at https://github.com/zhihou7/HOI-CL.

  • Conference Article
  • Citations: 308
  • 10.1109/cvpr42600.2020.00056
PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection
  • Jun 1, 2020
  • Yue Liao + 5 more

We propose a single-stage Human-Object Interaction (HOI) detection method that outperforms all existing methods on the HICO-DET dataset while running at 37 fps on a single Titan XP GPU. It is the first real-time HOI detection method. Conventional HOI detection methods are composed of two stages, i.e., human-object proposal generation and proposal classification. Their effectiveness and efficiency are limited by this sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet <human point, interaction point, object point>. The human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely a point detection branch and a point matching branch. The point detection branch predicts the three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. The human point and the object point originating from the same interaction point are considered a matched pair. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection. Isolated detection boxes that are unlikely to form meaningful HOI triplets are suppressed, which increases the precision of HOI detection. Moreover, the matching between human and object detection boxes is only applied around a limited number of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new application-oriented database named HOI-A, which serves as a good supplement to the existing datasets.
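
The point-triplet geometry described in this abstract can be sketched in a few lines. The abstract states that the interaction point is the midpoint of the human and object box centers and that two displacements point from it back to those centers; the nearest-point rule used in `match_pair` below is an assumption made for illustration, as are all function names. It is not taken from the PPDM code.

```python
from math import hypot

def box_center(box):
    """Center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def midpoint(p, q):
    """Interaction point: midpoint of the human and object points."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

def nearest_index(target, points):
    """Index of the candidate point closest to the target (Euclidean)."""
    return min(
        range(len(points)),
        key=lambda i: hypot(points[i][0] - target[0], points[i][1] - target[1]),
    )

def match_pair(interaction_pt, disp_to_human, disp_to_object,
               human_points, object_points):
    """Follow the two predicted displacements from the interaction point and
    pick the nearest detected human and object points (assumed rule)."""
    human_target = (interaction_pt[0] + disp_to_human[0],
                    interaction_pt[1] + disp_to_human[1])
    object_target = (interaction_pt[0] + disp_to_object[0],
                     interaction_pt[1] + disp_to_object[1])
    return (nearest_index(human_target, human_points),
            nearest_index(object_target, object_points))

# Toy detections (made-up coordinates).
human_points = [box_center(b) for b in [(0, 0, 20, 40), (100, 0, 120, 40)]]
object_points = [box_center(b) for b in [(40, 10, 60, 30), (200, 100, 220, 120)]]

interaction_pt = midpoint(human_points[0], object_points[0])
# Slightly noisy displacements, as a network prediction might be.
pair = match_pair(interaction_pt, (-19, 1), (21, -1), human_points, object_points)
print(interaction_pt, pair)
```

Because matching starts from interaction points, a detected box that no displacement lands near is never paired, which is the suppression effect the abstract mentions.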

  • Conference Article
  • Citations: 227
  • 10.1109/cvpr46437.2021.01165
End-to-End Human Object Interaction Detection with HOI Transformer
  • Jun 1, 2021
  • Cheng Zou + 10 more

We propose HOI Transformer to tackle human object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple the HOI task into separate stages of object detection and interaction classification or introduce a surrogate interaction problem. In contrast, our method, named HOI Transformer, streamlines the HOI pipeline by eliminating the need for many hand-designed components. HOI Transformer reasons about the relations of objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to force HOI predictions in a unified way. Our method is conceptually much simpler and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves 26.61% AP on HICO-DET and 52.9% AP_role on V-COCO, surpassing previous methods with the advantage of being much simpler. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer.

  • Conference Article
  • Citations: 28
  • 10.1109/iccv48922.2021.01322
Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection
  • Oct 1, 2021
  • Suchen Wang + 5 more

In this work, we study the problem of human-object interaction (HOI) detection with large-vocabulary object categories. Previous HOI studies are mainly conducted in the regime of limited object categories (e.g., 80 categories). Their solutions may face new difficulties in both object detection and interaction classification due to the increasing diversity of objects (e.g., 1000 categories). Different from previous methods, we formulate HOI detection as a query problem. We propose a unified model to jointly discover the target objects and predict the corresponding interactions based on the human queries, thereby eliminating the need for generic object detectors, extra steps to associate human-object instances, and multi-stream interaction recognition. This is achieved by a repurposed Transformer unit and a novel cascade detection over multi-scale feature maps. We observe that such a highly coupled solution brings benefits for both object detection and interaction classification in a large-vocabulary setting. To study the new challenges of large-vocabulary HOI detection, we assemble two datasets from the publicly available SWiG and 100 Days of Hands datasets. Experiments on these datasets validate that our proposed method can achieve a notable mAP improvement on HOI detection with a faster inference speed than existing one-stage HOI detectors. Our code is available at https://github.com/scwangdyd/large_vocabulary_hoi_detection.

  • Research Article
  • Citations: 11
  • 10.1016/j.cviu.2024.104091
UAHOI: Uncertainty-aware robust interaction learning for HOI detection
  • Jul 20, 2024
  • Computer Vision and Image Understanding
  • Mu Chen + 2 more

  • Research Article
  • Citations: 9
  • 10.3390/e26030205
Pairwise CNN-Transformer Features for Human-Object Interaction Detection
  • Feb 27, 2024
  • Entropy
  • Hutuo Quan + 5 more

Human-object interaction (HOI) detection aims to localize and recognize the relationship between humans and objects, which helps computers understand high-level semantics. In HOI detection, two-stage and one-stage methods have distinct advantages and disadvantages. The two-stage methods can obtain high-quality human-object pair features based on object detection but lack contextual information. The one-stage transformer-based methods can model good global features but cannot benefit from object detection. The ideal model should have the advantages of both methods. Therefore, we propose the Pairwise Convolutional neural network (CNN)-Transformer (PCT), a simple and effective two-stage method. The model both fully utilizes the object detector and has rich contextual information. Specifically, we obtain pairwise CNN features from the CNN backbone. These features are fused with pairwise transformer features to enhance the pairwise representations. The enhanced representations are superior to using CNN and transformer features individually. In addition, the global features of the transformer provide valuable contextual cues. We fairly compare the performance of pairwise CNN and pairwise transformer features in HOI detection. The experimental results show that the previously neglected CNN features still have a significant edge. Compared to state-of-the-art methods, our model achieves competitive results on the HICO-DET and V-COCO datasets.

  • Research Article
  • Citations: 6
  • 10.1109/mce.2023.3343919
Human–Object Interaction Detection: An Overview
  • Nov 1, 2024
  • IEEE Consumer Electronics Magazine
  • Jia Wang + 3 more

This paper systematically summarizes and discusses recent research on image-based human-object interaction (HOI) detection, which aims to detect human-object pairs and recognize the interactive behaviors between humans and objects in an image. It has plenty of applications and can serve as the basis to assist higher-level tasks of visual understanding. We introduce existing methods by categorizing them into two main groups based on the model structure: one-stage and two-stage approaches. We further divide one-stage methods into point-based, region-based, and query-based methods. Similarly, the two-stage methods are divided into HOI detection with multi-stream modeling, HOI detection with human parts and pose, HOI detection with compositional learning, HOI detection with graph-based modeling, and HOI detection with query-based modeling. According to this taxonomy, we also summarize and analyze the core ideas behind each strategy. Then, we present the details of the experimental protocols, evaluation metrics, datasets, and the evaluation results of the most recent representative methods. Finally, we discuss the HOI detection task's main open challenges and future trends.

  • Book Chapter
  • Citations: 3
  • 10.1007/978-3-031-20497-5_36
Human-Object Interaction Detection: A Survey of Deep Learning-Based Methods
  • Jan 1, 2022
  • Fang Li + 3 more

In recent years, rapid progress has been made in detecting and identifying single object instances. In order to understand the situation in the scene, computers need to recognize how humans interact with surrounding objects. Human-object interaction (HOI) detection aims to identify a set of interactions in images or videos. It involves the positioning of interactive subjects and objects and the classification of interactive types. It is crucial to realize high-level semantic understanding of people-centered scenarios. The study of HOI detection is also conducive to promoting the research of other advanced visual tasks. In this paper, we introduce the previous works on HOI detection based on deep learning, which are raised from the two primary development trends of sequential and parallel methods. Secondly, we summarize the main challenges faced by the HOI detection task. Further, we introduce the most popular HOI detection datasets, including image and video datasets, and main metrics. Finally, we summarize the future research directions for the HOI detection task.
Keywords: Human-object interaction (HOI) detection; Computer vision; Deep learning

  • Research Article
  • Citations: 4
  • 10.1109/tnnls.2021.3131154
Effects of Motion-Relevant Knowledge From Unlabeled Video to Human-Object Interaction Detection
  • Sep 1, 2023
  • IEEE Transactions on Neural Networks and Learning Systems
  • Xue Lin + 4 more

Existing works on human-object interaction (HOI) detection usually rely on expensive large-scale labeled image datasets. However, in real scenes, labeled data may be insufficient, and some rare HOI categories have few samples. This poses great challenges for deep-learning-based HOI detection models. Existing works tackle this by introducing compositional learning or word embeddings but still need large-scale labeled data or rely heavily on well-learned knowledge. In contrast, freely available unlabeled videos contain rich motion-relevant information that can help infer rare HOIs. In this article, we propose a multitask learning (MTL) perspective to assist HOI detection with the aid of motion-relevant knowledge learned from unlabeled videos. Specifically, we design the appearance reconstruction loss (ARL) and a sequential motion mining module in a self-supervised manner to learn more generalizable motion representations for promoting the detection of rare HOIs. Moreover, to better transfer motion-related knowledge from unlabeled videos to HOI images, a domain discriminator is introduced to decrease the gap between the two domains. Extensive experiments on the HICO-DET dataset with rare categories and the V-COCO dataset with minimum supervision demonstrate the effectiveness of the motion-aware knowledge implied in unlabeled videos for HOI detection.

  • Research Article
  • Citations: 1
  • 10.1016/j.imavis.2024.105249
Exploring the synergy between textual identity and visual signals in human-object interaction
  • Sep 2, 2024
  • Image and Vision Computing
  • Pinzhu An + 1 more

  • Research Article
  • Citations: 20
  • 10.1016/j.neucom.2023.126243
From detection to understanding: A survey on representation learning for human-object interaction
  • Apr 27, 2023
  • Neurocomputing
  • Tianlun Luo + 3 more

  • Research Article
  • Citations: 5
  • 10.1609/aaai.v38i6.28422
Toward Open-Set Human Object Interaction Detection
  • Mar 24, 2024
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Mingrui Wu + 4 more

This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes.

  • Conference Article
  • Citations: 7
  • 10.1109/wacv48630.2021.00127
Detecting Human-Object Interaction with Mixed Supervision
  • Jan 1, 2021
  • Suresh Kirthi Kumaraswamy + 2 more

Human object interaction (HOI) detection is an important task in image understanding and reasoning. Its output takes the form of an HOI triplet 〈human, verb, object〉, which requires bounding boxes for the human and the object as well as the action between them. In other words, this task requires strong supervision for training, which is hard to procure. A natural way to overcome this is to pursue weakly-supervised learning, where we only know that certain HOI triplets are present in an image but not their exact locations. Most weakly-supervised learning methods make no provision for leveraging strongly supervised data when it is available; indeed, a naive combination of these two paradigms in HOI detection fails to let them benefit each other. In this regard, we propose a mixed-supervised HOI detection pipeline built on a specific design of momentum-independent learning that learns seamlessly across these two types of supervision. Moreover, in light of the annotation insufficiency in mixed supervision, we introduce an HOI element swapping technique to synthesize diverse and hard negatives across images and improve the robustness of the model. Our method is evaluated on the challenging HICO-DET dataset. It performs close to or even better than many fully-supervised methods by using a mixed amount of strong and weak annotations; furthermore, it outperforms representative state-of-the-art weakly- and fully-supervised methods under the same supervision.
