Abstract
Video description has been widely used in the computer vision community for many applications. Typical approaches are based on the encoder-decoder framework: the encoder extracts fixed-length video representation vectors from the upper-layer output of pre-trained convolutional neural networks (CNNs), and the decoder uses recurrent neural networks to generate a textual sentence. However, the upper layers of CNNs contain low-resolution but semantically strong features, while the lower layers contain high-resolution but semantically weak features. Existing methods rarely exploit this multi-scale information of CNNs for video description; ignoring it leads to descriptions that are neither detailed nor comprehensive. This paper applies hierarchical convolutional long short-term memory (ConvLSTM) within the encoder-decoder framework to extract features from both the upper and lower layers of CNNs. Moreover, multiple network structures are designed to explore the spatio-temporal feature extraction performance of ConvLSTM, with accuracy approaching its optimum at three ConvLSTM layers. To further improve the language quality of the video descriptions, an attention mechanism is applied to the visual features output by the ConvLSTM. Extensive experimental results demonstrate that the proposed method outperforms existing approaches.
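The core building block here is the ConvLSTM cell, which replaces the matrix multiplications of a standard LSTM with 2-D convolutions so the hidden state preserves the spatial layout of the CNN feature maps. Below is a minimal PyTorch sketch of such a cell; the channel counts, kernel size, and single-gate-convolution layout are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a ConvLSTM cell; dimensions are assumptions, not the
# paper's exact AttHCLSTM configuration.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed by 2-D convolutions, so the
    hidden and cell states keep the spatial layout of CNN feature maps."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # preserve spatial size
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state  # hidden/cell maps, each (B, hidden, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)),
                                 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Usage: run T frames of CNN feature maps through the cell.
B, C, Hc, H, W, T = 2, 256, 64, 7, 7, 16  # illustrative sizes
cell = ConvLSTMCell(C, Hc)
h = torch.zeros(B, Hc, H, W)
c = torch.zeros(B, Hc, H, W)
for t in range(T):
    frame_feats = torch.randn(B, C, H, W)  # stand-in for CNN features
    h, c = cell(frame_feats, (h, c))
```

Stacking such cells, with each layer fed by a different depth of the CNN, yields the hierarchical encoder the abstract describes.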
Highlights
Video description is the process of automatically interpreting video content into natural textual language
To the best of our knowledge, our approach is one of the first to integrate ConvLSTM for exploring long-term spatio-temporal features for video captioning. It differs from existing sequence-learning video description methods, which first extract the spatial features of the video and then use recurrent neural networks (RNNs) to extract its temporal features
The proposed framework mainly consists of three parts: a feature extraction layer based on pre-trained convolutional neural networks (CNNs), a feature encoding layer built from hierarchical ConvLSTMs, and an attention-based feature decoding layer
Summary
Video description is the process of automatically interpreting video content into natural textual language. This paper's main contributions are as follows. To the best of our knowledge, our approach is one of the first to integrate ConvLSTM for exploring long-term spatio-temporal features for video captioning. It differs from existing sequence-learning video description methods, which first extract the spatial features of the video and then use RNNs to extract its temporal features; those methods use only the fixed-length video representation vectors output by the upper layers of a pre-trained CNN. PROPOSED METHODOLOGY: Our novel encoder-decoder framework for video description, named attention-based hierarchical ConvLSTM (AttHCLSTM), is introduced. It mainly consists of three parts: a feature extraction layer based on a pre-trained CNN, a feature encoding layer built from hierarchical ConvLSTMs, and an attention-based feature decoding layer. The internal principles of each layer are demonstrated in that part of the paper; a sketch of one decoding step follows.
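The attention-based decoder described above can be made concrete with a small sketch: at each word step, the decoder's hidden state scores the per-frame encoder outputs (here assumed to be spatially pooled into vectors), forms a weighted context, and feeds it together with the previous word into an LSTM. All names and dimensions (AttnDecoder, feat_dim, etc.) are hypothetical stand-ins, not the paper's exact AttHCLSTM implementation.

```python
# Hedged sketch of one attention-based decoding step over encoder outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoder(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # simple additive-style scorer
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, h, c):
        # feats: (B, T, feat_dim) per-frame ConvLSTM outputs, spatially pooled.
        B, T, _ = feats.shape
        h_rep = h.unsqueeze(1).expand(B, T, h.size(1))
        scores = self.attn(torch.cat([feats, h_rep], dim=2)).squeeze(2)  # (B, T)
        alpha = F.softmax(scores, dim=1)                  # attention weights over frames
        context = (alpha.unsqueeze(2) * feats).sum(dim=1) # (B, feat_dim)
        x = torch.cat([self.embed(word_ids), context], dim=1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c                          # vocabulary logits, new state

# Usage with illustrative sizes: one decoding step from a start token.
B, T, D, E, Hd, V = 2, 16, 512, 300, 512, 10000
dec = AttnDecoder(D, E, Hd, V)
feats = torch.randn(B, T, D)                              # stand-in encoder outputs
h = torch.zeros(B, Hd); c = torch.zeros(B, Hd)
logits, h, c = dec.step(torch.zeros(B, dtype=torch.long), feats, h, c)
```

Repeating this step, feeding back the predicted word each time, generates the output sentence word by word.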