Image Captioning Research Articles

Image captioning is an interdisciplinary area that uses techniques from computer vision and natural language processing to produce a textual description of a picture. The image captioning task consists of understanding the scene in an image by identifying the objects and associated actions it contains, in order to create a meaningful, human-like caption. Such captions can be used in a wide range of applications, including image retrieval, video indexing, assistive technology for the visually impaired, content-based image search, biomedicine, and autonomous cars. Earlier approaches relied on classical machine learning with extensive use of hand-crafted features such as the Scale-Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), the Histogram of Oriented Gradients (HOG), and combinations of these features. Extracting hand-crafted features from large datasets is neither straightforward nor practical, and many deep learning-based techniques were later proposed. Deep learning retrieval-based and template-based approaches were presented; however, both had drawbacks such as missing crucial objects. Recent breakthroughs in deep learning and natural language processing have considerably improved the performance of image captioning systems through attention mechanisms, transformer-based architectures, multimodal connections, object-detection-based encoder-decoder models, and many others. This survey explores some of the most recent techniques for image captioning, along with the datasets and evaluation measures that have been employed in deep learning-based automatic image captioning. The ultimate intention of this study is to act as a guide for researchers by emphasizing future directions for research work. Index Terms: image captioning, computer vision, deep learning, textual description, natural language processing.

Detecting misogynous memes poses a significant challenge due to the presence of multiple modalities (image + text). The inherent complexity arises from the lack of direct correspondence between the textual and visual elements, where an image and its overlaid text often convey disparate meanings. Additionally, memes conveying messages of hatred or taunting, particularly targeted towards women, present additional comprehension difficulties. This article introduces the MISTRA framework, which leverages variational autoencoders for dimensionality reduction of the large-sized image features before fusing multimodal features. The framework also harnesses the capabilities of large language models through transfer learning to develop fusion embeddings by extracting and concatenating features from different modalities (image, text, and image-generated caption text) for the misogynous classification task. The components of the framework include state-of-the-art models such as the Vision Transformer model (ViT), a textual model (DistilBERT), CLIP (Contrastive Language–Image Pre-training), and BLIP (Bootstrapping Language–Image Pre-training for Unified Vision-Language Understanding and Generation). Our experiments are conducted on the SemEval-2022 Task 5 MAMI dataset. To establish baseline models, we perform separate experiments using the Naive Bayes machine learning classifier on meme texts and ViT on meme images. We evaluate the performance on six different bootstrap samples and report evaluation metrics such as precision, recall, and Macro-F1 score for each bootstrap sample. Additionally, we compute the confidence interval on our evaluation scores and conduct paired t-tests to determine whether our best-performing model differs significantly from the other experiments. The experimental results demonstrate that the dimensionality reduction approach on multimodal features with a multilayer perceptron classifier achieved the highest performance with a Macro-F1 score of 71.5 percent, outperforming the baseline approaches in individual modalities.
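
The evaluation protocol described in this abstract (Macro-F1 on six bootstrap samples, a confidence interval on the scores, and paired t-tests between experiments) can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not say how the bootstrap samples were drawn, so resampling test predictions with replacement is an assumption here, and the function names `macro_f1`, `bootstrap_macro_f1`, `mean_confidence_interval`, and `paired_t_statistic` are hypothetical. Only the Python standard library is used.

```python
import math
import random
import statistics


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_macro_f1(y_true, y_pred, n_samples=6, seed=0):
    """Macro-F1 on n_samples resamples (with replacement) of the predictions.

    Assumption: resampling test predictions; the abstract does not specify this.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(macro_f1([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    return scores


def mean_confidence_interval(scores, t_crit=2.571):
    """Two-sided 95% interval for the mean score.

    The default t_crit is Student's t critical value for 5 degrees of
    freedom, which matches six bootstrap samples.
    """
    mean = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half_width, mean + half_width


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two experiments scored on the same samples."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

With six samples, the resulting t statistic would be compared against the same critical value (about 2.571 at the 5% level, two-sided) to decide whether two experiments differ significantly.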

Articles published on Image Captioning

Understanding remote sensing imagery like reading a text document: What can remote sensing image captioning offer?

An Analysis on Recent Approaches for Image Captioning

Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning

A multimodal machine learning approach to generate news articles from geo-tagged images

Image Caption Prediction Using Deep Learning

Image Caption Generator

Research on image caption generation method based on multi-modal pre-training model and text mixup optimization

Attribute guided fusion network for obtaining fine-grained image captions

Image Caption Bot for Assistive Vision

Clustering swap prediction for image-text pre-training

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning

A survey of neurosymbolic visual reasoning with scene graphs and common sense knowledge

Bridging Languages through Images: A Multilingual Text-to-Image Synthesis Approach

Automatic Image Captioning using Deep Learning

SAMT-generator: A second-attention for image captioning based on multi-stage transformer network

A Novel Pretrained General-purpose Vision Language Model for the Vietnamese Language

Paradoxes of Global History

MISTRA: Misogyny Detection through Text–Image Fusion and Representation Analysis

The Role of Attention Mechanism in Generating Image Captions: An Innovative Approach with Neural Network-Based Seq2seq Model
