Visual Captioning Research Articles

Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. As a result, it is necessary to collect and label data-text pairs for training, which is both costly and time-consuming. To relax the dependency on labeled data for downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder to learn to generate text by reconstructing the input text given its coordinate in the shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences across different languages given the coordinate of the input data in the common space. Within this unified framework, given visual data (images or videos) as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation.
We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and "believable" outputs and significantly outperforms existing zero-shot methods.
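
As a rough illustration of steps (i) and (ii) above, the sketch below projects two domains into a shared space and measures how well their coordinates line up. It is a toy under stated assumptions, not the ZeroNLG implementation: the linear projections, the feature sizes, the random inputs, and the names `project` and `alignment_loss` are all hypothetical, and the squared-distance objective is only one plausible way to express "aligning coordinates".

```python
import numpy as np

def project(features, weight):
    """Map domain-specific features to unit-length coordinates in a shared space."""
    coords = features @ weight
    return coords / np.linalg.norm(coords, axis=-1, keepdims=True)

def alignment_loss(coords_a, coords_b):
    """Mean squared distance between corresponding coordinates of two domains."""
    return float(np.mean(np.sum((coords_a - coords_b) ** 2, axis=-1)))

rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(4, 16))  # hypothetical image/video features
text_feats = rng.normal(size=(4, 12))    # hypothetical sentence features
W_visual = rng.normal(size=(16, 8))      # hypothetical projection for the visual domain
W_text = rng.normal(size=(12, 8))        # hypothetical projection for the text domain

z_visual = project(visual_feats, W_visual)  # step (i): coordinates in the shared space
z_text = project(text_feats, W_text)
loss = alignment_loss(z_visual, z_text)     # step (ii): quantity a trainer would minimize
```

Because both sets of coordinates are unit-length, the loss lies between 0 (identical coordinates) and 4 (opposite coordinates). Step (iii), the auto-encoder that decodes text from a coordinate, is omitted here.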

For automated visual captioning, existing neural encoder-decoder methods commonly use either a simple sequence-to-sequence mechanism or an attention-based mechanism. Attention-based models attend to specific visual areas or objects, using a single heat map that indicates which portion of the image is most important rather than treating all objects in the image equally. These models usually combine Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures. CNNs generally extract global visual signals that provide only global information about the main objects, their attributes, and their relationships, but fail to provide local (regional) information within objects, such as lines, corners, curves, and edges. On the one hand, missing some of the information and details in local visual signals may lead to misprediction, misidentification of objects, or missing the main object(s) entirely. On the other hand, superfluous visual signals, which may come from objects in the foreground or background, produce meaningless and irrelevant descriptions. To address these concerns, we created a new adaptive joint signals attention image captioning model for global and local signals. First, the proposed model extracts global visual signals at the image level and local visual signals at the object level. The joint signal attention model (JSAM) plays a dual role in visual signal extraction and non-visual signal prediction. JSAM first selects meaningful global and regional visual signals, discards irrelevant ones, and integrates the selected signals. Subsequently, in the language model, JSAM decides at each time step how to attend to visual or non-visual signals to generate accurate, descriptive, and elegant sentences. Lastly, we examine the efficiency and superiority of the proposed model over recent similar image captioning models through experiments on the MS-COCO dataset.
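
The per-time-step choice between visual and non-visual signals can be pictured with a generic adaptive-attention gate. The sketch below is not the authors' JSAM: the dot-product scoring, the `sentinel` vector standing in for the non-visual signal, and all names and sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_context(regions, hidden, sentinel):
    """Blend attended visual signals with a non-visual signal for one time step."""
    scores = regions @ hidden             # one relevance score per visual region
    sentinel_score = sentinel @ hidden    # relevance score of the non-visual signal
    weights = softmax(np.append(scores, sentinel_score))
    alpha, beta = weights[:-1], weights[-1]  # visual weights, non-visual weight
    context = alpha @ regions + beta * sentinel
    return context, alpha, beta

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))  # hypothetical global and local visual signals
hidden = rng.normal(size=8)        # hypothetical language-model state at this step
sentinel = rng.normal(size=8)      # hypothetical non-visual (linguistic) signal

context, alpha, beta = adaptive_context(regions, hidden, sentinel)
```

Here `beta` close to 1 means the step relies mostly on the non-visual signal (for example, when emitting a function word), while `beta` close to 0 means it relies mostly on the image regions.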

Articles published on Visual Captioning

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation.

Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning

A comprehensive survey on deep-learning-based visual captioning

A Survey of Vision and Language Related Multi-Modal Task

Affective word embedding in affective explanation generation for fine art paintings

Sentiment Analysis of Image with Text Caption using Deep Learning Techniques.

Semantic context driven language descriptions of videos using deep neural network

Imagine, Reason and Write: Visual Storytelling with Graph Knowledge and Relational Reasoning

Consensus Graph Representation Learning for Better Grounded Image Captioning

Hierarchical Memory Decoder for Visual Narrating

Image captions: global-local and joint signals attention model (GL-JSAM)

Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling

SibNet: Sibling Convolutional Encoder for Video Captioning.

Survey of deep learning and architectures for visual captioning—transitioning between media and natural languages

Stateful human-centered visual captioning system to aid video surveillance

Cross-Modality Retrieval by Joint Correlation Learning

Show, Reward, and Tell

Hierarchical LSTMs with Adaptive Attention for Visual Captioning.

Deep Learning for Image-to-Text Generation: A Technical Overview

Superimposing visual cues and captions over slides to teach computer operations.
