Abstract

The massive influx of text, image, and video data on the internet has made computer vision tasks challenging in the big data domain. Generating descriptions for video content remains an arduous task in computer vision. Visual captioning involves integrating visual information with natural language descriptions. This paper proposes an encoder-decoder framework in which a 2D-Convolutional Neural Network (CNN) model and a layered Long Short Term Memory (LSTM) serve as the encoder, and an LSTM model integrated with an attention mechanism serves as the decoder, trained with a hybrid loss function. Visual feature vectors extracted from the video frames using the 2D-CNN model capture spatial features; these feature vectors are then fed into the layered LSTM to capture temporal information. The attention mechanism enables the decoder to focus on relevant objects and to correlate the visual context with the language content, producing semantically correct captions. The visual features and GloVe word embeddings are input to the decoder to generate natural semantic descriptions for the videos. The performance of the proposed framework is evaluated on the Microsoft Video Description (MSVD) video captioning benchmark using several well-known evaluation metrics. The experimental findings indicate that the proposed framework outperforms state-of-the-art techniques: it improves on all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with scores of 78.4, 64.8, 54.2, 43.7, 32.3, and 70.7, respectively. The improvement across all scores indicates a stronger grasp of the input context, which results in more accurate caption prediction.
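As a rough illustration of the encoder side described above, the sketch below extracts per-frame spatial features with a pretrained 2D-CNN and passes them through a layered LSTM for temporal encoding. This is a minimal PyTorch sketch, not the paper's implementation: the ResNet-50 backbone, the two-layer depth, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoEncoder(nn.Module):
    """2D-CNN per-frame features followed by a layered LSTM (illustrative sizes)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_layers=2):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, frames):                       # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))       # (B*T, feat_dim, 1, 1) spatial features
        feats = feats.flatten(1).view(b, t, -1)      # (B, T, feat_dim)
        outputs, state = self.lstm(feats)            # layered LSTM captures temporal order
        return outputs, state                        # outputs: (B, T, hidden_dim)
```

The encoder outputs one hidden vector per frame, which the attention-based decoder can then weight when generating each word.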

Highlights

  • Due to rapid improvements in low-cost camera technology, the number of images and videos is growing explosively

  • Bin et al. [4] used a two-layered Long Short Term Memory (LSTM) network for visual encoding and natural-language generation. They capture the global temporal structure of video clips by encoding the frame sequence with forward and backward directional LSTMs, with attention added to the original network structure

  • The decoder part is defined as a combination of attention and a single LSTM layer (see the sketch after this list)
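The attention-plus-single-LSTM decoder mentioned in the last highlight can be sketched as follows. This is an assumed additive-attention formulation with illustrative dimensions (the 300-d embedding matches GloVe; the hidden size and scoring function are assumptions), not the authors' exact decoder.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Single-layer LSTM decoder with additive attention (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        # The paper uses GloVe embeddings; random initialization stands in here.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc_outputs, h, c, word_ids):
        # Score every encoder time step against the current decoder state,
        # then build a context vector as the attention-weighted sum.
        expanded = h.unsqueeze(1).expand_as(enc_outputs)          # (B, T, H)
        weights = torch.softmax(
            self.attn(torch.cat([enc_outputs, expanded], dim=-1)), dim=1)
        context = (weights * enc_outputs).sum(dim=1)              # (B, H)
        h, c = self.lstm(torch.cat([self.embed(word_ids), context], dim=-1), (h, c))
        return self.out(h), h, c                                  # vocabulary logits
```

At each decoding step, the previous word id and the attended visual context are fed to the LSTM cell; greedy decoding takes the argmax of the logits until an end-of-sentence token is produced.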


Summary

Introduction

Due to rapid improvements in low-cost camera technology, the number of images and videos is growing explosively. The work in [20] suggests a method that learns to extract specific information from an image and guides a decoder model to generate text descriptions. These approaches utilize only the available image content for captioning and do not include any temporal information. The success of several image captioning approaches attracted researchers to the temporal information available in the sequence of frames of a video and to generating a suitable description or caption for the visual content. Bin et al. [4] used a two-layered LSTM for visual encoding and natural-language generation; they capture the global temporal structure of video clips by encoding the frame sequence with forward and backward directional LSTMs, with attention added to the original network structure. The proposed framework outperformed several state-of-the-art methods.
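Bin et al.'s forward-and-backward temporal encoding can be approximated in PyTorch with a bidirectional LSTM over pre-extracted frame features. This is a hedged sketch of the general technique with assumed dimensions, not their exact network:

```python
import torch
import torch.nn as nn

# Bidirectional temporal encoding over pre-extracted CNN frame features
# (batch size, frame count, and feature dimension are illustrative).
frame_feats = torch.randn(8, 30, 2048)             # (batch, frames, CNN feature dim)
bi_lstm = nn.LSTM(input_size=2048, hidden_size=512,
                  num_layers=2, batch_first=True, bidirectional=True)
outputs, _ = bi_lstm(frame_feats)                  # (8, 30, 1024): forward ++ backward
# Each time step now carries context from both temporal directions,
# which an attention mechanism can weight when generating each word.
```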

Proposed methodology
Experimental results and analysis
Layer Stacked LSTM
Method
Limitations
Conclusion

