Abstract

The massive influx of text, image, and video data on the internet has made computer vision tasks challenging in the big data domain. Generating descriptions for video content remains an arduous task in computer vision. Visual captioning involves integrating visual information with natural language descriptions. This paper proposes an encoder-decoder framework in which a 2D-Convolutional Neural Network (CNN) model and a layered Long Short Term Memory (LSTM) serve as the encoder, and an LSTM model integrated with an attention mechanism serves as the decoder, trained with a hybrid loss function. Visual feature vectors extracted from the video frames using the 2D-CNN model capture spatial features; these feature vectors are then fed into the layered LSTM to capture temporal information. The attention mechanism enables the decoder to focus on relevant objects and to correlate the visual context with the language content, producing semantically correct captions. The visual features and GloVe word embeddings are input to the decoder to generate natural semantic descriptions for the videos. The performance of the proposed framework is evaluated on the Microsoft Video Description (MSVD) video captioning benchmark using several well-known evaluation metrics. The experimental findings indicate that the proposed framework outperforms state-of-the-art techniques: it improves on all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with scores of 78.4, 64.8, 54.2, 43.7, 32.3, and 70.7, respectively. The improvement across all scores indicates a stronger grasp of the input context, which results in more accurate caption prediction.
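As a rough illustration of the encoder side described above, the sketch below extracts per-frame spatial features with a pretrained 2D-CNN and passes them through a layered LSTM for temporal encoding. This is a minimal PyTorch sketch, not the paper's implementation: the ResNet-50 backbone, the two-layer depth, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoEncoder(nn.Module):
    """2D-CNN per-frame features followed by a layered LSTM (illustrative sizes)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_layers=2):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, frames):                       # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))       # (B*T, feat_dim, 1, 1) spatial features
        feats = feats.flatten(1).view(b, t, -1)      # (B, T, feat_dim)
        outputs, state = self.lstm(feats)            # layered LSTM captures temporal order
        return outputs, state                        # outputs: (B, T, hidden_dim)
```

The encoder outputs one hidden vector per frame, which the attention-based decoder can then weight when generating each word.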

Highlights

  • Due to rapid improvements in low-cost camera technology, the number of images and videos is growing explosively

  • Bin et al. [4] used a two-layered Long Short Term Memory (LSTM) network for visual encoding and natural-language generation. They capture the global temporal structure of video clips by encoding the frame sequence with forward and backward directional LSTMs, with attention added to the original network structure

  • The decoder part is defined as a combination of attention and a single LSTM layer (see the sketch after this list)
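The attention-plus-single-LSTM decoder mentioned in the last highlight can be sketched as follows. This is an assumed additive-attention formulation with illustrative dimensions (the 300-d embedding matches GloVe; the hidden size and scoring function are assumptions), not the authors' exact decoder.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Single-layer LSTM decoder with additive attention (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        # The paper uses GloVe embeddings; random initialization stands in here.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc_outputs, h, c, word_ids):
        # Score every encoder time step against the current decoder state,
        # then build a context vector as the attention-weighted sum.
        expanded = h.unsqueeze(1).expand_as(enc_outputs)          # (B, T, H)
        weights = torch.softmax(
            self.attn(torch.cat([enc_outputs, expanded], dim=-1)), dim=1)
        context = (weights * enc_outputs).sum(dim=1)              # (B, H)
        h, c = self.lstm(torch.cat([self.embed(word_ids), context], dim=-1), (h, c))
        return self.out(h), h, c                                  # vocabulary logits
```

At each decoding step, the previous word id and the attended visual context are fed to the LSTM cell; greedy decoding takes the argmax of the logits until an end-of-sentence token is produced.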


Summary

Introduction

Due to rapid improvements in low-cost camera technology, the number of images and videos is growing explosively. The work in [20] suggests a method that learns to extract specific information from an image and guides a decoder model to generate text descriptions. These approaches utilize only the available image content for captioning and do not include any temporal information. The success of several image captioning approaches attracted researchers to the temporal information available in the sequence of frames of a video and to generating a suitable description or caption for the visual content. Bin et al. [4] used a two-layered LSTM for visual encoding and natural-language generation; they capture the global temporal structure of video clips by encoding the frame sequence with forward and backward directional LSTMs, with attention added to the original network structure. The proposed framework outperformed several state-of-the-art methods.
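Bin et al.'s forward-and-backward temporal encoding can be approximated in PyTorch with a bidirectional LSTM over pre-extracted frame features. This is a hedged sketch of the general technique with assumed dimensions, not their exact network:

```python
import torch
import torch.nn as nn

# Bidirectional temporal encoding over pre-extracted CNN frame features
# (batch size, frame count, and feature dimension are illustrative).
frame_feats = torch.randn(8, 30, 2048)             # (batch, frames, CNN feature dim)
bi_lstm = nn.LSTM(input_size=2048, hidden_size=512,
                  num_layers=2, batch_first=True, bidirectional=True)
outputs, _ = bi_lstm(frame_feats)                  # (8, 30, 1024): forward ++ backward
# Each time step now carries context from both temporal directions,
# which an attention mechanism can weight when generating each word.
```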

Proposed methodology
Experimental results and analysis
Layer Stacked LSTM
Method
Limitations
Conclusion

