Abstract

Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. Existing CNN-based methods apply low-resolution feature maps to predict high-level semantic labels; however, the finer spatial information needed for action representation is lost. In this paper, we propose a novel stacked spatial network (SSN), which integrates multi-layer feature maps in an end-to-end manner. A spatial feature extraction network based on an encoder-decoder is first used to obtain multi-level, multi-resolution spatial features under the supervision of high-level semantic labels. The multi-level features are aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Then, a refined spatial network (RSN) is proposed to aggregate the spatial network and the SSN. In particular, the learned representation of RSN comprises two components, representing semantic label information and local fine-grained spatial information. Extensive experimental results on the UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN.
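The fusion step described above can be sketched in code. The summary does not give the exact layer configuration, so the following is a minimal NumPy illustration under assumed shapes: lower-resolution feature maps from a hypothetical encoder-decoder are upsampled to the finest resolution and stacked along the channel axis. The names `upsample_nearest` and `stacked_spatial_fusion` are our own, not from the paper.

```python
import numpy as np

def upsample_nearest(fmap, target_hw):
    """Nearest-neighbour upsampling of a (C, H, W) feature map.
    Assumes the target size is an integer multiple of the input size."""
    c, h, w = fmap.shape
    th, tw = target_hw
    rows = np.repeat(np.arange(h), th // h)
    cols = np.repeat(np.arange(w), tw // w)
    return fmap[:, rows][:, :, cols]

def stacked_spatial_fusion(feature_maps):
    """Fuse multi-level feature maps: bring every level to the finest
    spatial resolution, then concatenate along the channel axis."""
    target_hw = max((f.shape[1], f.shape[2]) for f in feature_maps)
    upsampled = [upsample_nearest(f, target_hw) for f in feature_maps]
    return np.concatenate(upsampled, axis=0)

# Three hypothetical decoder levels, coarse to fine.
levels = [np.random.rand(64, 7, 7),
          np.random.rand(32, 14, 14),
          np.random.rand(16, 28, 28)]
fused = stacked_spatial_fusion(levels)
print(fused.shape)  # (112, 28, 28)
```

In an end-to-end network the upsampling and concatenation would be differentiable layers (e.g. learned deconvolutions), but the shape bookkeeping is the same: all levels share one spatial grid, and channels accumulate across levels.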

Highlights

  • The task of human action recognition is to identify the behavior of people automatically in video sequences

  • We propose a refined spatial network (RSN), which combines a spatial network and a stacked spatial network (SSN)

  • We demonstrate the advantages of the proposed refined spatial network (RSN) in comparison with other works for action recognition

Summary

INTRODUCTION

The task of human action recognition is to identify the behavior of people automatically in video sequences. Extracting effective spatial features that carry fine-grained spatial descriptions together with high-level semantics remains a substantial challenge. To address this problem, we propose a novel stacked spatial network (SSN) for action recognition. Current CNN-based classification pipelines still adopt iDT for generating video representations, e.g. the trajectory-pooled deep-convolutional descriptor (TDD) [33]. These traditional hand-crafted methods are unable to capture abundant semantic information and are computationally expensive. Commonly, traditional action recognition methods use low-resolution feature maps, computed from the visual input, to predict high-level semantic labels. These approaches give little consideration to capturing the finer spatial information needed for action representation.

STACKED SPATIAL NETWORK
Findings
CONCLUSION