Abstract

Action recognition models have shown strong performance on various video datasets. Nevertheless, because existing datasets contain little data on target actions, they are insufficient for the action recognition applications required by industry. To satisfy this requirement, datasets composed of highly available target actions have been created, but because their video data are generated in a specific environment, it is difficult for them to capture the varied characteristics of actual environments. In this paper, we introduce the new ETRI-Activity3D-LivingLab dataset, which provides action sequences recorded in actual environments and helps address the network generalization issue caused by dataset shift. When an action recognition model is trained on the ETRI-Activity3D and KIST SynADL datasets and evaluated on the ETRI-Activity3D-LivingLab dataset, performance can degrade severely because the datasets were captured in different environments (domains). To reduce this dataset shift between the training and testing datasets, we propose a close-up of maximum activation, which magnifies the most activated part of a video input in detail. In addition, we present various experimental results and analyses that illustrate the dataset shift and demonstrate the effectiveness of the proposed method.

Highlights

  • There have been significant improvements in action recognition research [1,2,3,4,5,6,7,8], with various industrial applications including surveillance systems, human–computer interaction, virtual reality, sports video analysis, and home-care robots [9,10,11,12,13,14,15,16,17]

  • We investigate practical problems of action recognition in terms of dataset shift

  • Applying the proposed method improves the performance of two state-of-the-art action recognition models by up to 10–12% in accuracy. These results show that the proposed close-up of maximum activation can enhance the performance of action recognition models by cropping the activated parts of a video
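The paper does not give implementation details in this excerpt, but the core idea of the close-up of maximum activation — find the most activated spatial location, crop a window around it, and magnify the crop back to the input size — can be sketched as below. This is a minimal illustration, not the authors' implementation; the function name, the `crop_ratio` parameter, and the use of a precomputed per-frame activation (saliency) map are assumptions, and the nearest-neighbour resize stands in for whatever interpolation the paper uses.

```python
import numpy as np

def close_up_of_max_activation(frame, activation_map, crop_ratio=0.5):
    """Illustrative sketch (not the paper's code): crop the region around
    the strongest activation and zoom it back to the original frame size.

    frame:          (H, W, C) array for one video frame.
    activation_map: (h, w) spatial activation/saliency map for that frame,
                    e.g. channel-averaged CNN features (assumed given).
    crop_ratio:     crop side length relative to the frame (assumed knob).
    """
    H, W = frame.shape[:2]
    h, w = activation_map.shape
    # Location of the maximum activation, rescaled to frame coordinates.
    iy, ix = np.unravel_index(np.argmax(activation_map), (h, w))
    cy, cx = int(iy * H / h), int(ix * W / w)
    # Crop window centred on the peak, clamped to the frame boundaries.
    ch, cw = int(H * crop_ratio), int(W * crop_ratio)
    y0 = min(max(cy - ch // 2, 0), H - ch)
    x0 = min(max(cx - cw // 2, 0), W - cw)
    crop = frame[y0:y0 + ch, x0:x0 + cw]
    # Nearest-neighbour upsampling of the crop back to (H, W): this is
    # the "close-up" that magnifies the most activated part in detail.
    ys = (np.arange(H) * ch / H).astype(int)
    xs = (np.arange(W) * cw / W).astype(int)
    return crop[ys][:, xs]
```

Applied per frame before feeding the clip to the recognition model, such a close-up discards background that differs between capture environments, which is one plausible reading of how cropping activated parts mitigates dataset shift.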


Introduction

There have been significant improvements in action recognition research [1,2,3,4,5,6,7,8], with various industrial applications including surveillance systems, human–computer interaction, virtual reality, sports video analysis, and home-care robots [9,10,11,12,13,14,15,16,17]. Public datasets such as Kinetics [1], UCF [10], and HMDB [11] are generally used in action recognition; because they were made by clipping and collecting existing videos, they often lack the specific target actions an application needs. This lack of target actions forces users to acquire data directly.
