Abstract
Video description has been widely used in the computer vision community for many applications. Typical approaches are based on the encoder-decoder framework: the encoder extracts fixed-length video representation vectors from the upper-layer output of pre-trained convolutional neural networks (CNNs), and the decoder uses recurrent neural networks to generate a textual sentence. However, the upper layers of CNNs contain low-resolution but semantically strong features, while the lower layers contain high-resolution but semantically weak features. Existing methods rarely exploit this multi-scale information of CNNs for video description; ignoring it leads to descriptions that are neither detailed nor comprehensive. This paper applies hierarchical convolutional long short-term memory (ConvLSTM) within the encoder-decoder framework to extract features from both the upper and lower layers of CNNs. Moreover, multiple network structures are designed to explore the spatio-temporal feature extraction performance of ConvLSTM, with accuracy approaching its optimum at three ConvLSTM layers. To further improve the language quality of the video descriptions, an attention mechanism is applied to the visual features output by the ConvLSTM. Extensive experimental results demonstrate that the proposed method outperforms existing approaches.
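The core building block here is the ConvLSTM cell, which replaces the matrix multiplications of a standard LSTM with 2-D convolutions so the hidden state preserves the spatial layout of the CNN feature maps. Below is a minimal PyTorch sketch of such a cell; the channel counts, kernel size, and single-gate-convolution layout are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a ConvLSTM cell; dimensions are assumptions, not the
# paper's exact AttHCLSTM configuration.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed by 2-D convolutions, so the
    hidden and cell states keep the spatial layout of CNN feature maps."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # preserve spatial size
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state  # hidden/cell maps, each (B, hidden, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)),
                                 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Usage: run T frames of CNN feature maps through the cell.
B, C, Hc, H, W, T = 2, 256, 64, 7, 7, 16  # illustrative sizes
cell = ConvLSTMCell(C, Hc)
h = torch.zeros(B, Hc, H, W)
c = torch.zeros(B, Hc, H, W)
for t in range(T):
    frame_feats = torch.randn(B, C, H, W)  # stand-in for CNN features
    h, c = cell(frame_feats, (h, c))
```

Stacking such cells, with each layer fed by a different depth of the CNN, yields the hierarchical encoder the abstract describes.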
Highlights
Video description is the process of automatically interpreting video content into natural textual language
To the best of our knowledge, our approach is one of the first to integrate ConvLSTM for exploring long-term spatio-temporal features for video captioning. It differs from existing sequence-learning video description methods, which first extract the spatial features of the video and then use recurrent neural networks (RNNs) to extract its temporal features
The proposed framework mainly consists of three parts: a feature extraction layer based on pre-trained convolutional neural networks (CNNs), a feature encoding layer built from hierarchical ConvLSTMs, and an attention-based feature decoding layer
Summary
Video description is the process of automatically interpreting video content into natural textual language. This paper's main contributions are as follows. To the best of our knowledge, our approach is one of the first to integrate ConvLSTM for exploring long-term spatio-temporal features for video captioning. It differs from existing sequence-learning video description methods, which first extract the spatial features of the video and then use RNNs to extract its temporal features; those methods use only the fixed-length video representation vectors output by the upper layers of a pre-trained CNN. PROPOSED METHODOLOGY: Our novel encoder-decoder framework for video description, named attention-based hierarchical ConvLSTM (AttHCLSTM), is introduced. It mainly consists of three parts: a feature extraction layer based on a pre-trained CNN, a feature encoding layer built from hierarchical ConvLSTMs, and an attention-based feature decoding layer. The internal principles of each layer are demonstrated in that part of the paper; a sketch of one decoding step follows.
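The attention-based decoder described above can be made concrete with a small sketch: at each word step, the decoder's hidden state scores the per-frame encoder outputs (here assumed to be spatially pooled into vectors), forms a weighted context, and feeds it together with the previous word into an LSTM. All names and dimensions (AttnDecoder, feat_dim, etc.) are hypothetical stand-ins, not the paper's exact AttHCLSTM implementation.

```python
# Hedged sketch of one attention-based decoding step over encoder outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoder(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # simple additive-style scorer
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, h, c):
        # feats: (B, T, feat_dim) per-frame ConvLSTM outputs, spatially pooled.
        B, T, _ = feats.shape
        h_rep = h.unsqueeze(1).expand(B, T, h.size(1))
        scores = self.attn(torch.cat([feats, h_rep], dim=2)).squeeze(2)  # (B, T)
        alpha = F.softmax(scores, dim=1)                  # attention weights over frames
        context = (alpha.unsqueeze(2) * feats).sum(dim=1) # (B, feat_dim)
        x = torch.cat([self.embed(word_ids), context], dim=1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c                          # vocabulary logits, new state

# Usage with illustrative sizes: one decoding step from a start token.
B, T, D, E, Hd, V = 2, 16, 512, 300, 512, 10000
dec = AttnDecoder(D, E, Hd, V)
feats = torch.randn(B, T, D)                              # stand-in encoder outputs
h = torch.zeros(B, Hd); c = torch.zeros(B, Hd)
logits, h, c = dec.step(torch.zeros(B, dtype=torch.long), feats, h, c)
```

Repeating this step, feeding back the predicted word each time, generates the output sentence word by word.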