Abstract

Modeling spatiotemporal representations is one of the most essential yet challenging problems in video action recognition. Existing methods cannot accurately model either the correlations between spatial and temporal features or the global temporal dependencies. Inspired by the two-stream network for video action recognition, we propose an encoder–decoder framework named Two-Stream Bidirectional Long Short-Term Memory (LSTM) Residual Network (TBRNet), which exploits the interaction between spatiotemporal representations and global temporal dependencies. In the encoding phase, a two-stream architecture built on the proposed Residual Convolutional 3D (Res-C3D) network extracts features, with residual connections inserted between the two pathways, and the resulting features are fused into the encoder's short-term spatiotemporal features. In the decoding phase, these short-term spatiotemporal features are first fed into a temporal attention-based bidirectional LSTM (BiLSTM) network to obtain long-term bidirectional attention-pooled dependencies, which are then integrated with the short-term features to capture global spatiotemporal relationships. On two benchmark datasets, UCF101 and HMDB51, a series of experiments verifies the effectiveness of the proposed TBRNet, which achieves results competitive with, and in some cases better than, existing state-of-the-art approaches.
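The exact Res-C3D configuration is specified in the paper's Proposed Approach section; purely to illustrate the idea it names (a C3D-style 3D-convolutional backbone with residual shortcuts), here is a minimal PyTorch sketch in which the block structure, channel counts, and names are our own assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ResC3DBlock(nn.Module):
    """Hypothetical residual block over 3D convolutions, sketching the
    Res-C3D idea: two 3x3x3 convolutions plus an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) clip features
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut eases gradient flow

# Example: a 16-frame clip at 112x112 with 64 feature channels
features = torch.randn(2, 64, 16, 112, 112)
assert ResC3DBlock(64)(features).shape == features.shape
```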

Highlights

  • With the rapid development of the mobile Internet and the continuous updating of video capture devices, the number of video resources is growing explosively

  • In the encoding phase, in contrast to the original two-stream network, which extracts appearance and motion features separately, our proposed two-stream encoder consists of a spatial appearance stream and a temporal motion stream with multiplicative residual connections inserted between the two pathways (see the sketch after this list)

  • Long-term recurrent convolutional network (LRCN) [6] is an end-to-end framework that classifies actions in video sequences using Long Short-Term Memory (LSTM) applied to features extracted by convolutional neural networks (CNNs)
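The page does not include the authors' code, but the "multiplicative residual connections" in the second highlight admit a simple reading: features from the motion stream multiplicatively gate the appearance stream, while an additive identity shortcut is preserved. A minimal sketch under that assumption (the function name and tensor shapes are ours):

```python
import torch

def cross_stream_residual(appearance: torch.Tensor,
                          motion: torch.Tensor) -> torch.Tensor:
    """One plausible form of a multiplicative cross-stream residual
    connection: motion features modulate appearance features, while the
    identity shortcut keeps a direct path for gradient backpropagation."""
    # Both inputs: (batch, channels, time, height, width), equal shapes
    return appearance + appearance * motion

# Example with matching two-stream feature maps
a = torch.randn(2, 64, 16, 14, 14)   # spatial appearance stream
m = torch.randn(2, 64, 16, 14, 14)   # temporal motion stream
fused = cross_stream_residual(a, m)  # same shape as the inputs
```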


Summary

Introduction

With the rapid development of the mobile Internet and the continuous updating of video capture devices, the number of video resources is growing explosively. In the encoding phase, in contrast to the original two-stream network, which extracts appearance and motion features separately, our proposed two-stream encoder consists of a spatial appearance stream and a temporal motion stream with multiplicative residual connections inserted between the two pathways. We accurately model the interactions between spatial and temporal features with this cross-stream residual design, whose connections also benefit the backpropagation of gradients. We effectively capture global spatiotemporal dependencies by integrating the local short-term spatiotemporal features with long-term bidirectional attention-pooled temporal dependencies. The rest of the paper is organized as follows: in Section 2, related work is briefly reviewed.
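As a concrete (non-authoritative) illustration of the decoding phase described above, the sketch below shows one standard way to realize temporal attention pooling over BiLSTM outputs; the module name, hidden size, and scoring function are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    """Minimal temporal-attention BiLSTM: scores each time step of the
    bidirectional hidden states and pools them into one clip-level vector."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)  # one logit per time step

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim) short-term spatiotemporal features
        h, _ = self.bilstm(x)                       # (batch, time, 2*hidden)
        alpha = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (batch, time)
        pooled = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)     # (batch, 2*hidden)
        return pooled, alpha

# Example: 16 fused feature vectors of size 512 per clip
pooled, weights = AttentiveBiLSTM(512, 256)(torch.randn(2, 16, 512))
```

The attention-pooled vector can then be integrated with the short-term spatiotemporal features before classification, matching the integration step the abstract describes.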

Introduction
Video Action Recognition
Attention Mechanism
Residual Learning
Proposed Approach
Residual Network
Res-C3D Network
BiLSTM Network
Temporal Attention Mechanism
Cross-Stream Connections
Experiments
Datasets and Implementation Details
Analysis of Res-C3D Network
Analysis of Cross-Stream Connections
Analysis of Fusion Strategies
Analysis of Attention-Based BiLSTM
Comparison with State-of-the-Art Models
Findings
Conclusions