Automatically describing video content in natural language is a long-standing challenge in computer vision. Although existing methods that capture relational information among objects have made significant strides in recent years, the detailed geometric and temporal information of objects remains underexplored. To address this problem, a novel Spatio-Temporal Aware Graph is proposed to capture more elaborate visual representations that exploit the detailed spatio-temporal cues of the extracted object features. Through graph-structured aggregation, the proposed model captures not only the interactions among objects but also their detailed spatio-temporal relations. Meanwhile, a Frame Similarity Graph is constructed over frame features to learn comprehensive representations, extracting the global information that object features lack. Moreover, to capture rich video semantics from different perspectives, multiple video representations, namely appearance and motion information, are utilised to learn discriminative representations. Experiments on two prevalent benchmarks, the Microsoft Video Description Corpus (MSVD) and Microsoft Research Video to Text (MSR-VTT), demonstrate that the proposed approach achieves state-of-the-art performance on several widely used evaluation metrics: BLEU-4, METEOR, ROUGE, and CIDEr.
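
To make the graph-structured aggregation concrete, the sketch below shows one plausible instantiation of the Frame Similarity Graph: edge weights are derived from pairwise cosine similarities of frame features, and one round of GCN-style message passing refines each frame representation with global context. This is a minimal illustration, not the authors' exact formulation; the module name `FrameSimilarityGraph`, the similarity-based adjacency, and all dimensions are assumptions.

```python
# Minimal sketch of similarity-graph aggregation over frame features.
# Assumes a GCN-style update; the exact adjacency construction and
# projection used in the paper may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameSimilarityGraph(nn.Module):
    """Refines frame features by aggregating over a fully connected graph
    whose edge weights come from feature similarity (hypothetical design)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # learnable transform of node features

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, dim), one feature vector per sampled frame
        x = F.normalize(frames, dim=-1)          # unit-norm for cosine similarity
        adj = torch.softmax(x @ x.t(), dim=-1)   # (T, T) row-normalised adjacency
        return F.relu(adj @ self.proj(frames))   # one aggregation step: (T, dim)

# Usage: refine 26 sampled frame features of dimension 512.
frames = torch.randn(26, 512)
refined = FrameSimilarityGraph(512)(frames)
print(refined.shape)  # torch.Size([26, 512])
```

The same aggregation pattern could, in principle, be applied to the object-level Spatio-Temporal Aware Graph by additionally encoding bounding-box geometry and frame indices into the edge weights, which is where the detailed spatial and temporal relations described above would enter.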