Image Captioning Research Articles

This research investigates the challenges that arise because text-to-image generation (TTI) focuses predominantly on English, a consequence of the lack of annotated image-caption data in other languages. The resulting inequitable access to TTI technology in non-English-speaking regions motivates research on multilingual TTI (mTTI) and on the potential of neural machine translation (NMT) to facilitate its development. The study presents two main contributions. First, a systematic empirical study employing a multilingual multi-modal encoder evaluates standard cross-lingual NLP methods applied to mTTI, including TRANSLATE TRAIN, TRANSLATE TEST, and ZERO-SHOT TRANSFER. Second, a novel parameter-efficient approach called Ensemble Adapter (ENSAD) is introduced, leveraging multilingual text knowledge within the mTTI framework to avoid the language gap and enhance mTTI performance. Additionally, the research addresses challenges associated with transformer-based TTI models, such as slow generation and complexity for high-resolution images. It proposes hierarchical transformers and local parallel autoregressive generation techniques to overcome these limitations. A 6B-parameter transformer pretrained with a cross-modal general language model (CogLM) and fine-tuned for fast super-resolution results in a new text-to-image system, denoted as It, which demonstrates competitive performance compared to the state-of-the-art DALL-E-2. Furthermore, It supports interactive text-guided editing on images, offering a versatile and efficient solution for text-to-image generation.

Keywords: Text-to-image generation, Multilingual TTI (mTTI), Neural machine translation (NMT), Cross-lingual NLP, Ensemble Adapter (ENSAD), Hierarchical transformers, Super-resolution, Transformer-based models, Cross-modal general language model (CogLM).

Detecting misogynous memes poses a significant challenge due to the presence of multiple modalities (image + text). The inherent complexity arises from the lack of direct correspondence between the textual and visual elements, where an image and its overlaid text often convey disparate meanings. Memes conveying messages of hatred or taunting, particularly those targeted towards women, are even harder to interpret. This article introduces the MISTRA framework, which leverages variational autoencoders to reduce the dimensionality of the large image features before fusing multimodal features. The framework also harnesses the capabilities of large language models through transfer learning to develop fusion embeddings by extracting and concatenating features from different modalities (image, text, and image-generated caption text) for the misogyny classification task. The components of the framework include state-of-the-art models such as the Vision Transformer model (ViT), a textual model (DistilBERT), CLIP (Contrastive Language–Image Pre-training), and BLIP (Bootstrapping Language–Image Pre-training for Unified Vision-Language Understanding and Generation). Our experiments are conducted on the SemEval-2022 Task 5 MAMI dataset. To establish baselines, we perform separate experiments using the Naive Bayes machine learning classifier on meme texts and ViT on meme images. We evaluate the performance on six different bootstrap samples and report evaluation metrics such as precision, recall, and Macro-F1 score for each bootstrap sample. Additionally, we compute confidence intervals on our evaluation scores and conduct paired t-tests to determine whether our best-performing model differs significantly from the other experiments. The experimental results demonstrate that the dimensionality reduction approach on multimodal features with a multilayer perceptron classifier achieved the highest performance with a Macro-F1 score of 71.5 percent, outperforming the baseline approaches in individual modalities.
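The evaluation protocol described in the abstract above (Macro-F1 on six bootstrap samples, a confidence interval, and a paired t-test) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the binary label set, the fixed seed, and the hard-coded critical value 2.571 (the two-sided 95% Student t value for 5 degrees of freedom, matching six samples) are all assumptions made for illustration.

```python
import math
import random
import statistics


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_scores(y_true, pred_a, pred_b, n_samples=6, seed=0):
    """Score two models on the same resamples so their scores are paired."""
    rng = random.Random(seed)
    n = len(y_true)
    scores_a, scores_b = [], []
    for _ in range(n_samples):
        # Sample n indices with replacement (one bootstrap sample).
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        scores_a.append(macro_f1(truth, [pred_a[i] for i in idx]))
        scores_b.append(macro_f1(truth, [pred_b[i] for i in idx]))
    return scores_a, scores_b


def confidence_interval(scores, t_crit=2.571):
    """Mean +/- t_crit * standard error; the default t_crit assumes 6 scores."""
    mean = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half_width, mean + half_width


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences (degrees of freedom: n - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        return math.nan  # identical differences: the statistic is undefined
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

Given gold labels and the predictions of two classifiers, `bootstrap_scores` yields six paired Macro-F1 scores per model. The statistic from `paired_t_statistic` would then be compared against the same critical value to judge whether the difference is significant at the 5% level.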

Articles published on Image Captioning

A multimodal machine learning approach to generate news articles from geo-tagged images

Image Caption Prediction Using Deep Learning

Image Caption Generator

Research on image caption generation method based on multi-modal pre-training model and text mixup optimization

Attribute guided fusion network for obtaining fine-grained image captions

Image Caption Bot for Assistive Vision

Clustering swap prediction for image-text pre-training

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

A survey of neurosymbolic visual reasoning with scene graphs and common sense knowledge

Bridging Languages through Images: A Multilingual Text-to-Image Synthesis Approach

Automatic Image Captioning using Deep Learning

SAMT-generator: A second-attention for image captioning based on multi-stage transformer network

A Novel Pretrained General-purpose Vision Language Model for the Vietnamese Language

Paradoxes of Global History

MISTRA: Misogyny Detection through Text–Image Fusion and Representation Analysis

The Role of Attention Mechanism in Generating Image Captions: An Innovative Approach with Neural Network-Based Seq2seq Model

A Comparative Study of Feature Extraction Models for Image Caption Generation

Recent Advances in Synthesis and Interaction of Speech, Text, and Vision

Image Caption Generator using Deep Learning

