Abstract

One of the most challenging tasks in the field of artificial intelligence is human action recognition. In this paper, we propose a novel long-term temporal feature learning architecture for recognizing human actions in video, named Pseudo Recurrent Residual Neural Networks (P-RRNNs), which exploits the recurrent architecture and composes each unit with different connections. A two-stream CNN model (GoogLeNet) is employed to extract local temporal and spatial features, respectively. The local spatial and temporal features are then integrated into global long-term temporal features by our proposed two-stream P-RRNNs. Finally, a softmax layer fuses the outputs of the two-stream P-RRNNs for action recognition. Experimental results on two standard databases, UCF101 and HMDB51, demonstrate the outstanding performance of the proposed architectures for human action recognition.
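The abstract's pipeline (per-frame CNN features per stream, recurrent aggregation into a global descriptor, then softmax-level fusion) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the aggregation rule `h = h + tanh(W @ x_t)`, the layer sizes, and the equal-weight late fusion are all assumptions made here for clarity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def residual_recurrent_pool(features, W):
    # features: (T, D) per-frame CNN features from one stream.
    # Assumed pseudo-residual recurrent step: the hidden state carries
    # an identity shortcut, h_t = h_{t-1} + tanh(W @ x_t).
    h = np.zeros(W.shape[0])
    for x_t in features:
        h = h + np.tanh(W @ x_t)
    return h

rng = np.random.default_rng(0)
T, D, H, C = 16, 32, 24, 101          # frames, feature dim, hidden dim, classes (UCF101)
spatial = rng.normal(size=(T, D))     # RGB-stream features (stand-in for GoogLeNet output)
temporal = rng.normal(size=(T, D))    # optical-flow-stream features

W_s, W_t = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, D)) * 0.1
V_s, V_t = rng.normal(size=(C, H)) * 0.1, rng.normal(size=(C, H)) * 0.1

# Each stream pools local features into a global long-term descriptor,
# then the two streams' class scores are fused at the softmax level.
p_spatial = softmax(V_s @ residual_recurrent_pool(spatial, W_s))
p_temporal = softmax(V_t @ residual_recurrent_pool(temporal, W_t))
p_fused = 0.5 * (p_spatial + p_temporal)   # equal-weight late fusion (assumed)
pred = int(np.argmax(p_fused))
```

The fused score vector remains a valid probability distribution, so the argmax directly yields the predicted action class.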

Highlights

  • Human action recognition in video is an important and active research topic with various useful applications, such as intelligent video surveillance, video retrieval, human-computer interaction and smart home appliances [1]–[3]

  • Our main contributions can be summarized as follows: First, we introduce residual learning into the recurrent structure and propose Pseudo Recurrent Residual Neural Networks (P-RRNNs) to model long-term temporal features

  • METHODS: we describe the key components of the P-RRNNs, including the long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, and the pseudo recurrent residual network architectures
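As background for the recurrent components named above, a minimal LSTM cell can be written as below. These are the standard textbook LSTM gate equations, not the paper's specific variant; the dimensions and weight initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    # W maps gate name -> weight matrix over the concatenated [x, h] input.
    xh = np.concatenate([x, h])
    i = sigmoid(W["i"] @ xh)   # input gate
    f = sigmoid(W["f"] @ xh)   # forget gate
    o = sigmoid(W["o"] @ xh)   # output gate
    g = np.tanh(W["g"] @ xh)   # candidate cell update
    c_new = f * c + i * g      # cell state: gated blend of old state and update
    h_new = o * np.tanh(c_new) # hidden state exposed to the next layer
    return h_new, c_new

rng = np.random.default_rng(1)
D, H = 8, 4                                            # input and hidden sizes (assumed)
W = {k: rng.normal(size=(H, D + H)) * 0.1 for k in "ifog"}
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                                     # unroll over a short sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W)
```

The GRU follows the same pattern with two gates instead of three and no separate cell state.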

Summary

INTRODUCTION

Human action recognition in video is an important and active research topic with various useful applications, such as intelligent video surveillance, video retrieval, human-computer interaction and smart home appliances [1]–[3]. (S. Yu et al.: Learning Long-Term Temporal Features With Deep Neural Networks for Human Action Recognition.) CNNs with 3-dimensional (3D) convolutional kernels have been proposed to directly learn spatio-temporal features for action recognition [34]–[36]. To tackle the overfitting problem, Yu et al. [42] proposed a single-layer pi-LSTM architecture to learn long-term information for action recognition. We experimentally evaluate the proposed method, which learns richer semantic features and models longer-term temporal information in video for action recognition. Our main contributions can be summarized as follows: First, we introduce residual learning into the recurrent structure and propose Pseudo Recurrent Residual Neural Networks (P-RRNNs) to model long-term temporal features.
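The core contribution, introducing residual learning into the recurrent structure, can be sketched as a GRU update wrapped in an identity shortcut. The exact connection pattern of the paper's P-RRNN units is not given in this summary, so the form `h_t = h_{t-1} + GRU(x_t, h_{t-1})` below is one plausible assumed instantiation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ResidualGRUCell:
    """A standard GRU update wrapped in an identity shortcut, so the
    recurrent unit only needs to learn the *change* of the hidden state
    (assumed form; the paper's exact connections may differ)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_dim, input_dim + hidden_dim)
        self.Wz = rng.normal(size=shape) * 0.1  # update gate
        self.Wr = rng.normal(size=shape) * 0.1  # reset gate
        self.Wh = rng.normal(size=shape) * 0.1  # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        gru_out = (1.0 - z) * h + z * h_tilde   # plain GRU output
        return h + gru_out                      # identity (residual) shortcut

rng = np.random.default_rng(2)
cell = ResidualGRUCell(input_dim=8, hidden_dim=4)
h = np.zeros(4)
for t in range(6):
    h = cell.step(rng.normal(size=8), h)
```

As with residual CNN blocks, the shortcut gives the gradient a direct path through time, which is the motivation for combining residual learning with recurrence.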

RELATED WORKS
METHODS
RESIDUAL NETWORKS
Findings
EXPERIMENTS