Caption Generation Research Articles

Topic modelling (TM) has shown significant progress in boosting the effectiveness of image captioning in the last few years. Although important improvements have been shown in previous topic-guided image captioning models, some challenges remain unsolved, such as the independence of the topic predictors and the sentence generators, resulting in ineffective exploitation of semantic information. Also, all the predicted topics or the top-one topic are used throughout the whole captioning task without considering the current time step's linguistic context, which deviates the captioning network to focus on inaccurate image objects. To tackle these challenges, we propose a novel image captioning method consisting of four modules: enhanced topic predictor (ETP), retrieval-based topics re-weighting module (RTR), subsequent topic predictor (STP), and caption generation module. The prediction and generation modules are trained in an end-to-end manner to promote the efficient use of topics by predicting suitable topics at each time step. ETP predicts the topics using the image features, and is enhanced with topic embedding (TE). The RTR is only applied in the testing stage for re-weighting the topics predicted by ETP. In each time step, the STP automatically predicts concise topics subsets to alleviate the diversity of the image topics. Compared with the existing topic-based models, our model can automatically generate more accurate and diverse captions, boosting the explainability of how the topics influence the generated word in each time step. Extensive experiments on the MS-COCO and Flickr30K benchmark datasets show that our method enhances the overall image captioning's performance and the topic prediction task, and outperforms many recent image captioning approaches in terms of the evaluation metrics.

Training supervised video captioning model requires coupled video-caption pairs. However, for many targeted languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task aiming to train models without coupled video-caption pairs in target language. To solve the task, a natural choice is to employ a two-step pipeline system: first utilizing video-to-pivot captioning model to generate captions in pivot language and then utilizing pivot-to-target translation model to translate the pivot captions to the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, generating visual irrelevant target captions; 2) the errors in the generated pivot captions will be propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns source visual and target language domains to inject the source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference by completely skipping the generation of pivot captions. To enhance the cross-modality injection of the VIM, UVC-VI further introduces a pluggable video encoder, i.e., Multimodal Collaborative Encoder (MCE). The experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE can achieve 4% and 7% relative margins on the CIDEr scores to current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.

Caption Generation Research Articles

Related Topics

Articles published on Caption Generation

Switching Text-Based Image Encoders for Captioning Images With Text

A Deep Learning Model to Generate Image Captions

Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Dense Video Captioning With Early Linguistic Information Fusion

Error-Driven Learning: Development of Vocabulary Learning Support System that Utilizes Suggestibility of Error by Image Generation

Image Captioning With Novel Topics Guidance and Retrieval-Based Topics Re-Weighting

Enhanced Image Captioning Using Features Concatenation and Efficient Pre-Trained Word Embedding

Image caption extraction to aid visual learning

TSFNet: Triple-Steam Image Captioning

An Efficient Deep Learning based Hybrid Model for Image Caption Generation

High-level and Low-level Feature Set for Image Caption Generation with Optimized Convolutional Neural Network

Context Aware Image Sentiment Classification using Deep Learning Techniques

VLSP 2021 - VieCap4H Challenge: Automatic Image Caption Generation for Healthcare Domain in Vietnamese

VieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.

Image to Caption Generator

Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning

Generating news image captions with semantic discourse extraction and contrastive style-coherent learning

Guide and interact: scene-graph based generation and control of video captions

Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks.

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Caption Generation Research Articles

Related Topics

Articles published on Caption Generation

Switching Text-Based Image Encoders for Captioning Images With Text

A Deep Learning Model to Generate Image Captions

Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Dense Video Captioning With Early Linguistic Information Fusion

Error-Driven Learning: Development of Vocabulary Learning Support System that Utilizes Suggestibility of Error by Image Generation

Image Captioning With Novel Topics Guidance and Retrieval-Based Topics Re-Weighting

Enhanced Image Captioning Using Features Concatenation and Efficient Pre-Trained Word Embedding

Image caption extraction to aid visual learning

TSFNet: Triple-Steam Image Captioning

An Efficient Deep Learning based Hybrid Model for Image Caption Generation

High-level and Low-level Feature Set for Image Caption Generation with Optimized Convolutional Neural Network

Context Aware Image Sentiment Classification using Deep Learning Techniques

VLSP 2021 - VieCap4H Challenge: Automatic Image Caption Generation for Healthcare Domain in Vietnamese

VieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.

Image to Caption Generator

Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning

Generating news image captions with semantic discourse extraction and contrastive style-coherent learning

Guide and interact: scene-graph based generation and control of video captions

Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks.