Human-object Interaction Detection Research Articles

Detecting and recognizing individual humans or objects is not sufficient to understand the visual world, so learning how humans interact with surrounding objects has become a core technology. However, convolution operations are weak at depicting visual interactions between instances, since they only build blocks that process one local neighborhood at a time. To address this problem, we draw on how humans perceive HOIs and introduce a two-stage trainable reasoning mechanism, referred to as the GID block. The GID block breaks through local neighborhoods and captures long-range dependencies between pixels at both the global level and the instance level of the scene to help detect interactions between instances. Furthermore, we construct a multi-stream network called GID-Net, a human-object interaction detection framework consisting of a human branch, an object branch and an interaction branch. Semantic information at the global level and the local level is efficiently reasoned over and aggregated in each branch. We compared GID-Net with existing state-of-the-art methods on two public benchmarks, V-COCO and HICO-DET. The results show that GID-Net outperforms the best-performing existing methods on both benchmarks, validating its efficacy in detecting human-object interactions.
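The contrast between a local neighborhood and a long-range dependency can be illustrated with a small sketch. The code below is not the GID block; it is generic dot-product attention, used here only as an assumed stand-in to show how every position in a scene can aggregate information from every other position regardless of distance. The function name `global_aggregate` and the toy feature vectors are hypothetical.

```python
import math


def global_aggregate(features):
    """Generic dot-product attention over all positions (an illustrative
    stand-in, not the GID block described in the abstract).

    features: list of equal-length feature vectors, one per position.
    Returns one aggregated vector per position.
    """
    outputs = []
    for query in features:
        # Similarity of this position to every position in the scene.
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        # Numerically stable softmax over the similarity scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of all feature vectors, near or far.
        dim = len(query)
        outputs.append([
            sum(w * feat[d] for w, feat in zip(weights, features))
            for d in range(dim)
        ])
    return outputs


# Toy example: positions 0 and 2 are similar but not adjacent.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
aggregated = global_aggregate(toy)
```

In this toy input, position 0 draws on position 2 as strongly as on itself even though the two are not neighbors, which is the kind of dependency a single local convolution window cannot express.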

Human-Object Interaction (HOI) detection is a new kind of human-centric visual relationship detection task that is important for deep understanding of visual scenes. Because of the complexity of the visual scene in an image, HOI detection remains challenging, and its most critical part is feature extraction and representation. Some existing approaches rely solely on local region information and ignore global contextual information, even though global context contributes to detecting some HOI categories. Other approaches incorporate global contextual information but lose local region information. In this work, we propose a multi-stream neural network architecture composed of three special modules that employs both local region information and global contextual information for HOI detection. The model can detect HOI categories that depend on local region information as well as those that depend on global contextual information, and so covers the HOI categories in the dataset more fully. Compared with existing approaches, the proposed model shows improved performance on the V-COCO and HICO-DET benchmark datasets, especially when predicting rare HOI categories.

Articles published on Human-object Interaction Detection

Hybrid Deep Learning Vision-based Models for Human Object Interaction Detection by Knowledge Distillation

GID-Net: Detecting human-object interaction with global and instance dependency

Improved human-object interaction detection through skeleton-object relations

Multi-stream neural network fused with local information and global information for HOI detection

Detecting Human-Object Interactions via Functional Generalization

Three‐stream network with context convolution module for human–object interaction detection

Interact as You Intend: Intention-Driven Human-Object Interaction Detection
