Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning.

Vali Ollah Maraghi, Karim Faez, Miguel Cazorla

doi:10.1155/2021/9922697

Open Access

https://doi.org/10.1155/2021/9922697

Abstract

Recognition of human activities is an essential field in computer vision. Most human activities consist of interactions between humans and objects. Many works on human-object interaction (HOI) recognition have achieved acceptable results in recent years. Still, they are fully supervised and need labeled training data for all HOIs. Due to the enormous space of human-object interactions, listing and providing training data for all possible categories is costly and impractical. To solve this problem, we propose an approach for scaling human-object interaction recognition in video data through zero-shot learning. Our method recognizes a verb and an object in the video and combines them into an HOI class. Recognizing verbs and objects instead of HOIs allows new combinations of verbs and objects to be identified, so the system can identify a new HOI class that it has not seen during training. We introduce a neural network architecture that can understand and represent the video data. The proposed system learns verbs and objects from the available training data in the training phase and identifies verb-object pairs in a video at test time, so it can identify HOI classes with different combinations of objects and verbs. We also propose using lateral information to combine the verbs and objects into valid verb-object pairs, which helps prevent the detection of rare and probably wrong HOIs. The lateral information comes from word embedding techniques. Furthermore, we propose a new feature aggregation method for aggregating the high-level features extracted from video frames before feeding them to the classifier. We show that this feature aggregation method is more effective for actions that include multiple subactions. We evaluated our system on the recently introduced, challenging Charades dataset, which has many HOI categories in videos. We show that our proposed system can detect unseen HOI classes in addition to achieving acceptable recognition of seen types. Therefore, the number of classes identifiable by the system is greater than the number of classes used for training.
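
The abstract does not spell out how verb and object predictions are combined with the word-embedding information, so the following is only a minimal sketch of one plausible way to do it. It assumes, hypothetically, that the recognizer outputs a probability per verb and per object, and that each word has an embedding vector; the function name `rank_hoi_pairs`, the `alpha` weight, the product scoring rule, and all toy values are illustrative assumptions, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_hoi_pairs(verb_probs, object_probs, verb_vecs, object_vecs, alpha=1.0):
    """Score every verb-object pair, including pairs never seen together.

    Assumed scoring rule (not taken from the paper):
    P(verb) * P(object) * prior ** alpha, where the prior is the
    embedding cosine similarity rescaled from [-1, 1] to [0, 1].
    """
    scored = []
    for verb, p_verb in verb_probs.items():
        for obj, p_obj in object_probs.items():
            prior = (1.0 + cosine(verb_vecs[verb], object_vecs[obj])) / 2.0
            scored.append(((verb, obj), p_verb * p_obj * prior ** alpha))
    # Highest-scoring pair first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy, hand-made values for illustration only.
verb_probs = {"open": 0.7, "eat": 0.3}
object_probs = {"door": 0.6, "sandwich": 0.4}
verb_vecs = {"open": [1.0, 0.0], "eat": [0.0, 1.0]}
object_vecs = {"door": [0.9, 0.1], "sandwich": [0.1, 0.9]}

ranking = rank_hoi_pairs(verb_probs, object_probs, verb_vecs, object_vecs)
best_pair = ranking[0][0]
```

In this toy setup the implausible pair ("eat", "door") is pushed to the bottom of the ranking even though both words receive non-trivial probabilities, which is the kind of filtering of rare and probably wrong HOIs that the lateral information is meant to provide.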

Highlights

Humans play a significant role in most of the activities that take place in the world.
How can we reduce the need for training data when training a human-object interaction (HOI) recognizer model? Can a recognition model be taught with data from only part of the target classes? We focus on this question and tackle it through zero-shot learning for scaling HOI recognition in video data.
We present the results of our method and compare it with some other works. The dataset used is introduced first, and the implementation setups are described.

Summary

Introduction

Humans play a significant role in most of the activities that take place in the world. The Charades dataset [12] was recently released for HOI understanding tasks in video data. It has many human-object interaction categories and is suitable for action detection, HOI recognition, and video captioning. This dataset contains 9848 video clips of HOIs captured in real environments. It has 157 categories of human activities, including some actions with “no interaction”; 149 of these categories can be considered valid verb-object pairs. This study’s primary purpose is to reduce the need for data to train an HOI recognition system by increasing the number of identifiable HOIs without increasing the HOIs in the training data. We pursue this purpose through zero-shot learning, in which we decompose the HOI into a verb (human action) and an object and recognize them in the video.
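
To illustrate why decomposing an HOI into a verb and an object enlarges the set of identifiable classes, the sketch below enumerates the verb-object combinations that become reachable from a small set of seen pairs. The helper `zero_shot_candidates` and the toy pairs are hypothetical and are not taken from Charades or from the paper.

```python
from itertools import product


def zero_shot_candidates(seen_pairs):
    """Return verb-object combinations that never occur in the training pairs."""
    verbs = sorted({verb for verb, _ in seen_pairs})
    objects = sorted({obj for _, obj in seen_pairs})
    # Every verb can in principle be composed with every object.
    all_pairs = set(product(verbs, objects))
    return sorted(all_pairs - set(seen_pairs))


# Toy training pairs for illustration only.
seen_pairs = [("open", "door"), ("close", "door"), ("open", "book"), ("hold", "cup")]
unseen = zero_shot_candidates(seen_pairs)
```

Here four seen pairs cover three verbs and three objects, so nine combinations are expressible and five of them are unseen. Not every unseen combination is a meaningful interaction, which is why some additional signal is needed to decide which pairs are valid.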

Related Works

Results and Discussion

Method

Conclusion