Textual Attention Research Articles

SummaryNews image captioning aims to generate captions or descriptions for news images automatically, serving as draft captions for creating news image captions manually. News image captions are different from generic captions as news image captions contain more detailed information such as entity names and events. Therefore, both images on news and the accompanying text are the source of generating caption of news image. Pointer‐generator network is a neural method defined for text summarization. This article proposes the Multimodal pointer‐generation network by incorporating visual information into the original network for news image captioning. The multimodal attention mechanism is proposed by splitting attention into visual attention paid to the image and textual attention paid to the text. The multimodal pointer mechanism is proposed by using both textual attention and visual attention to compute pointer distributions, where visual attention is first transformed into textual attention via the word‐image relationships. The multimodal coverage mechanism is defined to reduce repetitions of attentions or repetitions of pointer distributions. Experiments on the DailyMail test dataset and the out‐of‐domain BBC test dataset show that the proposed model outperforms the original pointer‐generator network, the generic image captioning method, the extractive news image captioning method, and the LDA‐based method according BLEU, METEOR, and ROUGL‐L evaluations. Experiments also show that the proposed multimodal coverage mechanisms can improve the model, and that transforming visual attention to pointer distributions can improve the model.

Fine-grained visual categorization is to recognize hundreds of subcategories belonging to the same basic-level category, which is a highly challenging task due to the quite subtle and local visual distinctions among similar subcategories. Most existing methods generally learn part detectors to discover discriminative regions for better categorization performance. However, not all parts are beneficial and indispensable for visual categorization, and the setting of part detector number heavily relies on prior knowledge as well as experimental validation. As is known to all, when we describe the object of an image via textual descriptions, we mainly focus on the pivotal characteristics, and rarely pay attention to common characteristics as well as the background areas. This is an involuntary transfer from human visual attention to textual attention, which leads to the fact that textual attention tells us how many and which parts are discriminative and significant to categorization. So textual attention could help us to discover visual attention in image. Inspired by this, we propose a fine-grained visual-textual representation learning (VTRL) approach, and its main contributions are: (1) Fine-grained visual-textual pattern mining devotes to discovering discriminative visual-textual pairwise information for boosting categorization performance through jointly modeling vision and text with generative adversarial networks (GANs), which automatically and adaptively discovers discriminative parts. (2) Visual-textual representation learning jointly combines visual and textual information, which preserves the intra-modality and inter-modality information to generate complementary fine-grained representation, as well as further improves categorization performance.

Textual Attention Research Articles

Articles published on Textual Attention

Diffusion model-based text-guided enhancement network for medical image segmentation

Image caption generation using a dual attention mechanism

Adaptive Text Denoising Network for Image Caption Editing

Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention

Global-Guided Asymmetric Attention Network for Image-Text Matching

RNIC-A retrospect network for image captioning

A Multimodal Framework for Video Caption Generation

Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning

MSCAN: Multimodal Self-and-Collaborative Attention Network for image aesthetic prediction tasks

Adversarial Attribute-Text Embedding for Person Search With Natural Language Query

A news image captioning approach based on multimodal pointer‐generator network

Against British Influences: Home Rule and the Autonomy of Irish Popular Culture in Ireland’s Juvenile Periodicals

Textual-Visual Reference-Aware Attention Network for Visual Dialog

Image caption generation with dual attention mechanism

Biography as Geography

VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis

Cognitive Effect of Tracing Gesture in the Learning from Mathematics Worked Examples

Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Fine-Grained Visual-Textual Representation Learning

Dualattn-GAN: Text to Image Synthesis With Dual Attentional Generative Adversarial Network

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Textual Attention Research Articles

Articles published on Textual Attention

Diffusion model-based text-guided enhancement network for medical image segmentation

Image caption generation using a dual attention mechanism

Adaptive Text Denoising Network for Image Caption Editing

Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention

Global-Guided Asymmetric Attention Network for Image-Text Matching

RNIC-A retrospect network for image captioning

A Multimodal Framework for Video Caption Generation

Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning

MSCAN: Multimodal Self-and-Collaborative Attention Network for image aesthetic prediction tasks

Adversarial Attribute-Text Embedding for Person Search With Natural Language Query

A news image captioning approach based on multimodal pointer‐generator network

Against British Influences: Home Rule and the Autonomy of Irish Popular Culture in Ireland’s Juvenile Periodicals

Textual-Visual Reference-Aware Attention Network for Visual Dialog

Image caption generation with dual attention mechanism

Biography as Geography

VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis

Cognitive Effect of Tracing Gesture in the Learning from Mathematics Worked Examples

Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Fine-Grained Visual-Textual Representation Learning

Dualattn-GAN: Text to Image Synthesis With Dual Attentional Generative Adversarial Network