Human-object Interaction Recognition Research Articles

Recent two-stage detector-based methods have shown strong results in Human-Object Interaction (HOI) detection, helped by the successful application of transformers. However, these methods extract global contextual features only through instance-level attention, without considering human–object interaction pairs, and the fusion and enhancement of interaction-pair features remains underexplored. Guiding global context extraction with human–object interaction pairs, rather than with individual instances, makes fuller use of the semantics between human–object pairs, which helps HOI recognition. To this end, we propose a two-stage Global Context and Pairwise-level Fusion Features Integration Network (GFIN) for HOI detection. Specifically, the first stage employs an object detector for instance feature extraction. The second stage captures semantically rich visual information through three proposed modules: the Global Contextual Feature Extraction Encoder (GCE), the Pairwise Interaction Query Decoder (PID), and the Human-Object Pairwise-level Attention Fusion Module (HOF). GCE extracts the global context memory with the proposed crossover-residual mechanism and then integrates it with the local instance memory from the DETR object detector. HOF uses the proposed pairwise-level attention mechanism to fuse and enhance the first stage's multi-layer features. PID takes the query sequence from HOF and the memory from GCE as input and outputs multi-label interaction recognition results. Finally, comprehensive experiments on the HICO-DET and V-COCO datasets demonstrate that the proposed GFIN significantly outperforms state-of-the-art methods. Code is available at https://github.com/ddwhzh/GFIN.
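To make the two-stage idea concrete, the sketch below shows one common way first-stage detections could be turned into candidate human–object pairs for a second stage to classify. It is an illustration only and is not taken from the GFIN code: the function name `build_pairs`, the detection dictionary layout, the `person` label, and the `min_score` threshold are all assumptions.

```python
from itertools import product


def build_pairs(detections, human_label="person", min_score=0.2):
    """Return (human_index, object_index) candidates from detector output.

    Hypothetical helper: each detection is a dict with "label" and "score".
    """
    # Keep only detections the (assumed) threshold considers reliable.
    kept = [(i, d) for i, d in enumerate(detections) if d["score"] >= min_score]
    humans = [i for i, d in kept if d["label"] == human_label]
    # Pair every human with every other kept instance (humans included,
    # since interactions between two people are also possible).
    return [(h, o) for h, (o, _) in product(humans, kept) if h != o]


detections = [
    {"label": "person", "score": 0.9},
    {"label": "cup", "score": 0.8},
    {"label": "person", "score": 0.7},
    {"label": "chair", "score": 0.1},  # dropped by the threshold
]
pairs = build_pairs(detections)
print(pairs)
```

Each resulting pair would then receive its own multi-label interaction prediction, since one person can perform several actions on the same object.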


Recognition of human activities is an essential field in computer vision. Most human activities consist of interactions between humans and objects. Many works on human-object interaction (HOI) recognition have achieved acceptable results in recent years. Still, they are fully supervised and need labeled training data for all HOIs. Because the space of human-object interactions is enormous, listing and providing training data for all possible categories is costly and impractical. To solve this problem, we propose an approach for scaling human-object interaction recognition in video data through zero-shot learning. Our method recognizes a verb and an object from the video and composes them into an HOI class. Recognizing verbs and objects instead of HOIs allows new combinations of verbs and objects to be identified, so the system can identify an HOI class it has not seen during training. We introduce a neural network architecture that can understand and represent the video data. The proposed system learns verbs and objects from the available training data during the training phase and identifies verb-object pairs in a video at test time. As a result, the system can identify HOI classes formed from different combinations of objects and verbs. Also, we propose to use lateral information when combining verbs and objects to make valid verb-object pairs, which helps prevent the detection of rare and probably wrong HOIs. The lateral information comes from word embedding techniques. Furthermore, we propose a new feature aggregation method for aggregating high-level features extracted from video frames before feeding them to the classifier. We show that this feature aggregation method is more effective for actions that include multiple subactions. We evaluated our system on the recently introduced, challenging Charades dataset, which has many HOI categories in videos. We show that our proposed system can detect unseen HOI classes in addition to acceptably recognizing seen ones. Therefore, the number of classes identifiable by the system is greater than the number of classes used for training.
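The composition step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the word-embedding information is applied, so the cosine-similarity prior, the function names `cosine` and `rank_hoi`, the `alpha` exponent, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_hoi(verb_scores, object_scores, verb_emb, object_emb, alpha=1.0):
    """Rank verb-object pairs by p(verb) * p(object) * prior(verb, object)."""
    ranked = []
    for verb, pv in verb_scores.items():
        for obj, po in object_scores.items():
            # Assumed prior: cosine similarity rescaled from [-1, 1] to [0, 1].
            prior = (1.0 + cosine(verb_emb[verb], object_emb[obj])) / 2.0
            ranked.append(((verb, obj), pv * po * prior ** alpha))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Toy classifier outputs and toy embeddings (made up for illustration).
verb_scores = {"drink": 0.7, "open": 0.3}
object_scores = {"cup": 0.6, "door": 0.4}
verb_emb = {"drink": [1.0, 0.0], "open": [0.0, 1.0]}
object_emb = {"cup": [0.9, 0.1], "door": [0.1, 0.9]}

ranked = rank_hoi(verb_scores, object_scores, verb_emb, object_emb)
best_pair, best_score = ranked[0]
print(best_pair, round(best_score, 3))
```

Because verbs and objects are scored separately, a pair such as ("open", "cup") can be ranked even if it never appeared as a labeled HOI class, while the prior pushes implausible combinations down the list.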



Articles published on Human-object Interaction Recognition

Semantic-Aware Dynamic Generation Networks for Few-Shot Human-Object Interaction Recognition.

Effective human–object interaction recognition for edge devices in intelligent space

Human–Object Interaction detection via Global Context and Pairwise-level Fusion Features Integration

Exploring Spatio–Temporal Graph Convolution for Video-Based Human–Object Interaction Recognition

Human–object interaction recognition based on interactivity detection and multi-feature fusion

Exploiting Human Pose and Scene Information for Interaction Detection

SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition

Language-guided graph parsing attention network for human-object interaction recognition

HOME: 3D Human–Object Mesh Topology-Enhanced Interaction Recognition in Images

Distillation of human–object interaction contexts for action recognition

A Graph-Based Approach to Recognizing Complex Human Object Interactions in Sequential Data

Deep learning and RGB-D based human action, human–human and human–object interaction recognition: A survey

Automated Parts-Based Model for Recognizing Human–Object Interactions from Aerial Imagery with Fully Convolutional Network

An Intelligent HealthCare Monitoring Framework for Daily Assistant Living

An Intelligent Framework for Recognizing Social Human-Object Interactions

A Review of Vision-Based Techniques Applied to Detecting Human-Object Interactions in Still Images

Multi-stream Network for Human-object Interaction Detection

Cascaded Parsing of Human-Object Interaction Recognition.

Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning.

Few-Shot Human-Object Interaction Recognition With Semantic-Guided Attentive Prototypes Network.
