Abstract
Modeling spatiotemporal representations is one of the most essential yet challenging issues in video action recognition. Existing methods lack the capacity to accurately model either the correlations between spatial and temporal features or the global temporal dependencies. Inspired by the two-stream network for video action recognition, we propose an encoder–decoder framework named Two-Stream Bidirectional Long Short-Term Memory (LSTM) Residual Network (TBRNet), which exploits the interaction between spatiotemporal representations and global temporal dependencies. In the encoding phase, the two-stream architecture, based on the proposed Residual Convolutional 3D (Res-C3D) network, extracts features with residual connections inserted between the two pathways, and the features are then fused to form the short-term spatiotemporal features of the encoder. In the decoding phase, these short-term spatiotemporal features are first fed into a temporal attention-based bidirectional LSTM (BiLSTM) network to obtain long-term bidirectional attention-pooled dependencies. These temporal dependencies are then integrated with the short-term spatiotemporal features to obtain global spatiotemporal relationships. On two benchmark datasets, UCF101 and HMDB51, we verified the effectiveness of the proposed TBRNet through a series of experiments, achieving results competitive with, or better than, existing state-of-the-art approaches.
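The temporal attention pooling over BiLSTM outputs described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the scoring vector `w`, the per-timestep features `h`, and the function names are assumptions, and in practice `h` would come from a trained BiLSTM and `w` would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_pool(h, w):
    """h: (T, D) per-timestep BiLSTM outputs; w: (D,) learned scoring vector.
    Returns the attention-pooled feature (D,) and the weights (T,)."""
    scores = h @ w           # (T,) relevance score for each timestep
    alpha = softmax(scores)  # attention weights, non-negative and summing to 1
    pooled = alpha @ h       # weighted sum of timestep features over time
    return pooled, alpha

# Toy example with random stand-ins for BiLSTM outputs.
T, D = 8, 16
rng = np.random.default_rng(0)
h = rng.normal(size=(T, D))
w = rng.normal(size=D)
pooled, alpha = temporal_attention_pool(h, w)
```

The pooled vector is a convex combination of the per-timestep features, so frames with higher attention scores contribute more to the long-term representation.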
Highlights
With the rapid development of the mobile Internet and the continuous updating of video capture devices, the number of video resources is growing explosively.
In the encoding phase, in contrast to the original two-stream network that extracts appearance and motion features separately, our proposed two-stream encoder consists of a spatial appearance stream and a temporal motion stream with multiplicative residual connections inserted between the two pathways.
Long-term recurrent convolutional network (LRCN) [6] is an end-to-end framework that classifies the action in video sequences using Long Short-Term Memory (LSTM) on features extracted by convolutional neural networks (CNNs).
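The multiplicative cross-stream residual connection highlighted above can be sketched as follows. This is a minimal NumPy illustration under assumptions: the exact placement and form of the interaction in TBRNet are not specified here, and the function name and elementwise gating are illustrative only.

```python
import numpy as np

def cross_stream_residual(x_app, x_mot):
    """Hypothetical multiplicative cross-stream residual connection:
    motion features gate the appearance features elementwise, and the
    modulated signal is added back onto the identity path, so gradients
    can also flow unchanged through the skip connection."""
    return x_app + x_app * x_mot  # identity + multiplicative interaction

# Toy feature maps standing in for the two streams' activations (N, C).
rng = np.random.default_rng(1)
x_app = rng.normal(size=(4, 32))  # spatial appearance stream
x_mot = rng.normal(size=(4, 32))  # temporal motion stream
fused = cross_stream_residual(x_app, x_mot)
```

Because the identity term is preserved, the connection injects motion information into the appearance pathway without blocking backpropagation through the original appearance features.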
Summary
With the rapid development of the mobile Internet and the continuous updating of video capture devices, the number of video resources is growing explosively. In the encoding phase, in contrast to the original two-stream network that extracts appearance and motion features separately, our proposed two-stream encoder consists of a spatial appearance stream and a temporal motion stream with multiplicative residual connections inserted between the two pathways. We accurately modeled the interactions between spatial and temporal features with a two-stream encoder whose cross-stream residual connections benefit the backpropagation of gradients, and we effectively captured global spatiotemporal dependencies by integrating the local short-term spatiotemporal features with the long-term bidirectional temporal dependencies. The rest of the paper is organized as follows: Section 2 briefly reviews related work.