An Efficient Human Instance-Guided Framework for Video Action Recognition.

Inwoong Lee, Sanghoon Lee, Dongyoon Wee, Doyoung Kim

doi:10.3390/s21248309

Open Access

Abstract

Human action recognition has been widely studied in computer vision in recent years. Recent studies have used two-stream networks that combine appearance and motion features, but most of these approaches address clip-level video action recognition. In contrast to traditional methods, which generally use entire images, we propose a new human instance-level video action recognition framework. In this framework, we represent instance-level features using human boxes and keypoints, and the resulting action region features serve as inputs to the temporal action head network, which makes the framework more discriminative. We also propose novel temporal action head networks composed of several modules that capture different temporal dynamics. In the experiments, the proposed models achieve performance comparable to state-of-the-art approaches on two challenging datasets. Furthermore, we evaluate the proposed features and networks to verify their effectiveness. Finally, we analyze the confusion matrix and visualize the recognized actions at the human instance level in scenes containing several people.

Highlights

Human action recognition is a highly active research area with various industrial applications, including visual surveillance, video communication, gaming control, and sports analysis [1,2,3,4].
Because using the entire image area for action recognition brings in much information unrelated to the action, we focus only on the regions relevant to human actions through the extracted action region features.
We investigate two kinds of features, the basic and the outermost box-based action region features, both guided by the tracked human instance boxes.
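
To make the box-guided idea concrete, here is a minimal sketch of one plausible reading of an "outermost" box: the smallest axis-aligned box that encloses every tracked box of one human instance across the frames of a clip. This interpretation, the `(x_min, y_min, x_max, y_max)` box layout, and the function name `outermost_box` are assumptions made for illustration; the text above does not define the construction, and the authors' actual features may be computed differently.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def outermost_box(tracked_boxes: List[Box]) -> Box:
    """Return the smallest box enclosing all tracked boxes of one instance.

    Hypothetical helper: this is an illustrative interpretation of an
    "outermost" box, not the authors' published definition.
    """
    if not tracked_boxes:
        raise ValueError("tracked_boxes must contain at least one box")
    x_min = min(b[0] for b in tracked_boxes)
    y_min = min(b[1] for b in tracked_boxes)
    x_max = max(b[2] for b in tracked_boxes)
    y_max = max(b[3] for b in tracked_boxes)
    return (x_min, y_min, x_max, y_max)


# One person tracked over three frames (made-up coordinates).
track = [
    (10.0, 20.0, 50.0, 120.0),
    (14.0, 18.0, 56.0, 118.0),
    (20.0, 22.0, 60.0, 125.0),
]
print(outermost_box(track))  # (10.0, 18.0, 60.0, 125.0)
```

Under this reading, cropping every frame with the same enclosing box keeps the region spatially fixed over time, whereas cropping each frame with its own tracked box follows the person more tightly.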

Summary

Introduction

Human action recognition is a highly active research area with various industrial applications, including visual surveillance, video communication, gaming control, and sports analysis [1,2,3,4]. With the recent development of human instance segmentation [9,10] and deep learning technology [11,12,13], human instance-level video action recognition has begun to attract considerable attention [14,15,16,17,18,19,20,21]. Because it requires both distinguishing human instances from the background and localizing them, human instance-level video action recognition is a very challenging research area. Because human instances are difficult to obtain, research on human instance-level video action recognition has begun to progress only recently.

Methods

Results

Conclusion