Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Abstract

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranks first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
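The distinction the abstract draws between self-attention (intra-modal) and co-attention (inter-modal) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single head, no learned projections, and no causal mask on the words, and the names `attention`, `self_attention`, `cross_attention`, `regions`, and `words` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def self_attention(x):
    """Intra-modal: queries, keys, and values all come from one modality."""
    return attention(x, x, x)

def cross_attention(words, regions):
    """Inter-modal: word queries attend over image-region keys and values."""
    return attention(words, regions, regions)

# Toy inputs (assumed): three image-region features and two word features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]

attended_regions = self_attention(regions)
attended_words = cross_attention(self_attention(words), attended_regions)
```

Here `attended_words` holds one vector per word, each a weighted average of the region features. A full Transformer block would add learned projections, multiple heads, residual connections, and feed-forward layers on top of this.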

Similar Papers
  • Research Article
  • 10.56726/irjmets46115
MULTIMODAL TRANSFORMER-BASED IMAGE CAPTION GENERATION
  • Nov 17, 2023
  • International Research Journal of Modernization in Engineering Technology and Science
  • Pratiksha Magadum + 4 more

Recent research has made notable progress in improving vision-language multi-modal tasks like image captioning and visual question answering. Typically, these efforts employ an encoder-decoder framework with a CNN-based image encoder for visual feature extraction and an RNN-based caption decoder with attention mechanisms. These models primarily address inter-modal interactions while neglecting intra-modal ones. Inspired by the success of the Transformer model in translation, a Multi-modal Transformer (MT) is introduced for image captioning. It captures both intra- and inter-modal interactions within a unified attention block for intricate multi-modal reasoning. Multi-view visual features further enhance performance. Evaluation on the MSCOCO dataset demonstrates significant improvements, securing top rankings.

  • Research Article
  • Citations: 21
  • 10.1145/3573891
Dynamic Convolution-based Encoder-Decoder Framework for Image Captioning in Hindi
  • Mar 24, 2023
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Santosh Kumar Mishra + 3 more

In sequence-to-sequence modeling tasks, such as image captioning, machine translation, and visual question answering, encoder-decoder architectures are state of the art. In the image captioning task, an encoder, a convolutional neural network (CNN), encodes input images into a fixed-dimensional vector representation, whereas a decoder, a recurrent neural network, performs language modeling and generates the target descriptions. Recent CNNs use the same operation over every pixel; however, not all image pixels are equally important. To address this, the proposed method uses a dynamic convolution-based encoder for image encoding or feature extraction, long short-term memory (LSTM) as a decoder for language modeling, and X-Linear attention to make the system robust. Encoders, attentions, and decoders are important aspects of the image captioning task; therefore, we experiment with various encoders, decoders, and attention mechanisms. Most existing work on image captioning has been carried out for the English language. We propose a novel approach for caption generation from images in Hindi. Hindi, widely spoken in South Asia and India, is the fourth most-spoken language globally; it is India’s official language. The proposed method uses a dynamic convolution operation on the encoder side to obtain a better image encoding quality. The Hindi image captioning dataset is manually created by translating the popular MSCOCO dataset from English to Hindi. The performance of the proposed method is compared with other baselines in terms of BLEU scores, and the results show that the proposed method outperforms them. Manual human assessment of the adequacy and fluency of the generated captions further confirms the efficacy of the proposed method in generating good-quality captions.

  • Research Article
  • Citations: 65
  • 10.1109/tcyb.2020.2997034
Chinese Image Caption Generation via Visual Attention and Topic Modeling.
  • Jun 22, 2020
  • IEEE Transactions on Cybernetics
  • Maofu Liu + 4 more

Automatic image captioning performs the cross-modal conversion from image visual content to natural language text. Involving computer vision (CV) and natural language processing (NLP), it has become one of the most sophisticated research issues in artificial intelligence. Based on deep neural networks, the neural image caption (NIC) model has achieved remarkable performance in image captioning, yet some essential challenges remain, such as the deviation between the descriptive sentences generated by the model and the intrinsic content expressed by the image, the low accuracy of the image scene description, and the monotony of generated sentences. In addition, most current datasets and methods for image captioning are in English. However, given the differences between Chinese and English in syntax and semantics, it is necessary to develop specialized Chinese image caption generation methods. To solve the aforementioned problems, we design the NICVATP2L model via visual attention and topic modeling, in which the visual attention mechanism reduces the deviation and the topic model improves the accuracy and diversity of generated sentences. Specifically, in the encoding phase, a convolutional neural network (CNN) and a topic model are used to extract visual and topic features of the input images, respectively. In the decoding phase, an attention mechanism is applied to the image visual features to obtain visual region features. Finally, the topic features and the visual region features are combined to guide a two-layer long short-term memory (LSTM) network in generating Chinese image captions. To validate our model, we have conducted experiments on the Chinese AIC-ICC image dataset. The experimental results show that our model can automatically generate more informative and descriptive captions in Chinese in a more natural way, and it outperforms the existing NIC image captioning model.

  • Research Article
  • Citations: 1
  • 10.21275/sr22809213717
Image Caption Generator Using Convolutional Neural Network Algorithm
  • Aug 5, 2022
  • International Journal of Science and Research (IJSR)
  • Shaik Parvez

Automatically describing an image with a sentence in a natural language such as English is a very difficult challenge. It requires knowledge of both natural language processing and image processing. The fusion of computer vision and natural language processing has received a lot of interest recently thanks to the advent of deep learning. This field is exemplified by image captioning, which teaches a computer to express an image's visual information in one or more phrases. In addition to recognizing the objects and the scene, high-level image semantics requires the ability to analyze the state, the properties, and the relationships of these objects. Although image captioning is a challenging and intricate endeavor, numerous researchers have made substantial advances. In artificial intelligence (AI), computer vision and natural language processing are used to automatically describe an image's contents. The regenerative neural model is developed; it builds on machine translation and computer vision. Using this technique, natural phrases are produced that describe the image. Convolutional neural networks (CNN) and recurrent neural networks (RNN) are components of this architecture: the RNN is used for phrase generation, while the CNN is used to extract features from images. The model has been trained to produce captions that, given an input image, describe the image almost exactly. On various datasets, the model's precision and the fluency of the language it learns from image descriptions are examined. These tests demonstrate that the model frequently provides precise descriptions for an input image.

  • Conference Article
  • Citations: 3
  • 10.23919/eusipco55093.2022.9909888
Automated Image Captioning with Multi-layer Gated Recurrent Unit
  • Aug 29, 2022
  • Ozge Taylan Moral + 3 more

Describing the semantic content of an image via natural language, known as image captioning, has recently attracted substantial interest in the computer vision and language processing communities. Current image captioning approaches are mainly based on an encoder-decoder framework in which visual information is extracted by an image encoder and captions are generated by a text decoder, using convolutional neural networks (CNN) and recurrent neural networks (RNN), respectively. Although this framework is promising for image captioning, it has limitations in using the encoded visual information to generate grammatically and semantically correct captions in the RNN decoder. More specifically, the RNN decoder is ineffective in using the contextual information from the encoded data due to its limited ability to capture long-term complex dependencies. Inspired by the advantages of the gated recurrent unit (GRU), in this paper we propose an extension of the conventional RNN by introducing a multi-layer GRU that modulates the most relevant information inside the unit to enhance the semantic coherence of captions. Experimental results on the MSCOCO dataset show the superiority of our proposed approach over the state-of-the-art approaches in several performance metrics.

  • Research Article
  • 10.53106/222344892023101302003
Image Captioning Based on Fine-grained Relationships with Multiscale Regions of Interest
  • Oct 1, 2023
  • 理工研究國際期刊
  • 林亮宇 林亮宇 + 1 more

With the rapid development of machine learning, image captioning techniques are becoming increasingly advanced. Recent image captioning research introduces Region Proposal Networks (RPN) and attention mechanisms. An RPN extracts features of specific object regions in the image, reducing the probability that noise is treated as visual features, while the attention mechanism lets the model focus on the mapping from objects to caption words. However, current approaches have deficiencies: both the RPN and the attention mechanism focus on single object regions and lack fine-grained visual features between objects. As a result, the caption generator produces ambiguous relationship descriptions. In this paper, to make the relationship descriptions in image captioning more fine-grained, we propose an image captioning model that generates sentences using multi-scale regions of interest (ROIs) between two different objects. Our proposed architecture includes Region Proposal Networks, Fully Convolutional Neural Networks (FCNN), and Long Short-Term Memory (LSTM) cells. Compared to existing work, we extract not only object regions but also multi-scale ROIs between two different objects as visual features. Some multi-scale ROIs are noise, which is screened out using Intersection-over-Union (IoU). Each ROI is passed through the FCNN to extract visual features; a fusion mechanism and a sorting network then produce sorted fusion features, and finally the LSTM learns the transformation from these features to a complete sentence. Hierarchical attribute supervision during training helps the caption generator learn how to generate fine-grained attributes. The proposed architecture can use more precise verbs to describe object actions in dynamic images and achieves higher scores on n-gram-based metrics.
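The Intersection-over-Union measure mentioned in this abstract can be sketched as follows. The box format `(x1, y1, x2, y2)` and the function name `iou` are assumptions for illustration; the abstract does not state which threshold or screening rule the authors apply.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # overlap area 1, union area 7
```

A screening step would compare such a score against a chosen cutoff to decide which ROIs to keep.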

  • Research Article
  • Citations: 97
  • 10.1016/j.ipm.2019.102178
Image caption generation with dual attention mechanism
  • Dec 12, 2019
  • Information Processing & Management
  • Maofu Liu + 4 more


  • Research Article
  • 10.55041/ijsrem27770
Synthesis of Vision and Language: Multifaceted Image Captioning Application
  • Dec 23, 2023
  • INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Arpit Gupta + 2 more

The rapid advancement in image captioning has been a pivotal area of research, aiming to mimic human-like understanding of visual content. This paper presents an innovative approach that integrates attention mechanisms and object features into an image captioning model. Leveraging the Flickr8k dataset, this research explores the fusion of these components to enhance image comprehension and caption generation. Furthermore, the study showcases the implementation of this model in a user-friendly application using FastAPI and ReactJS, offering text-to-speech translation in multiple languages. The findings underscore the efficacy of this approach in advancing image captioning technology. This tutorial outlines the construction of an image caption generator, employing a Convolutional Neural Network (CNN) for image feature extraction and a Long Short-Term Memory (LSTM) network for Natural Language Processing (NLP). Keywords: Convolutional Neural Networks, Long Short Term Memory, Attention Mechanism, Transformer Architecture, Vision Transformers, Transfer Learning, Multimodal fusion, Deep Learning Models, Pre-Trained Models, Image Processing Techniques

  • Research Article
  • Citations: 43
  • 10.1016/j.compeleceng.2021.107114
Image captioning in Hindi language using transformer networks
  • Apr 17, 2021
  • Computers & Electrical Engineering
  • Santosh Kumar Mishra + 4 more


  • Research Article
  • 10.66108/mna.v4i3.102
Image caption generation using transfer learning using LSTM and DenseNet
  • Dec 21, 2025
  • Machines and Algorithms
  • Abdul Jabbar

Image captioning consists of describing images by identifying the main objects of an image, the features of the objects, and their associations. An effective system should also produce syntactically and semantically correct sentences. Deep learning methods can be effective in addressing the complications involved in this task. The article presents an advanced deep learning architecture for image captioning that combines three technologies: machine vision, machine translation, and transfer learning. A state-of-the-art CNN architecture, the DenseNet201 model, is used to perform this task. DenseNet201 is a convolutional neural network (CNN) that converts the image data into a feature vector. After this CNN, a recurrent neural network (RNN) is used to encode the images using this vector. The encoded text is then passed through another RNN, a Long Short-Term Memory (LSTM) network, where the feature vector is decoded to produce a sequence of words that forms the image description. The Flickr8k dataset is used to test the effectiveness of the proposed model, and the performance of the model is measured with the BLEU metric, which gives a quantitative evaluation of the model.

  • Research Article
  • Citations: 34
  • 10.1109/lgrs.2021.3135711
Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer
  • Jan 1, 2022
  • IEEE Geoscience and Remote Sensing Letters
  • Shuo Zhuang + 5 more

Remote sensing image captioning (RSIC), which describes image content in natural language, has great significance in image understanding. Existing methods are mainly based on deep learning and rely on the encoder–decoder model to generate sentences. In the decoding process, a recurrent neural network (RNN) or long short-term memory (LSTM) is normally applied to sequentially generate image captions. In this letter, the transformer encoder–decoder is combined with grid features to improve RSIC performance. First, a pretrained convolutional neural network (CNN) is used to extract grid-based visual features, which are encoded as vectorial representations. Then, the transformer outputs semantic descriptions to bridge visual features and natural language. In addition, the self-critical sequence training (SCST) strategy is applied to further optimize the image captioning model and improve the quality of generated sentences. Extensive experiments are conducted on three public datasets: RSICD, UCM-Captions, and Sydney-Captions. Experimental results demonstrate the effectiveness of the SCST strategy, and the proposed method achieves superior performance compared with state-of-the-art image captioning approaches on the RSICD dataset.
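The self-critical sequence training (SCST) strategy named in this abstract can be sketched at the level of a single caption. SCST, as commonly described, scores a sampled caption against the model's own greedy-decoded caption and uses the reward difference to weight the sample's log-likelihood. The function name `scst_loss` and the toy numbers are assumptions; which reward the authors use (CIDEr is a common choice) is not stated here.

```python
def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled caption.

    sample_log_probs: per-word log-probabilities of the sampled caption.
    sample_reward:    metric score of the sampled caption.
    greedy_reward:    metric score of the greedy caption, used as the baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall below it.
    return -advantage * sum(sample_log_probs)

# Toy values (assumed): the sample scores 1.0, the greedy caption 0.4.
loss = scst_loss([-0.5, -0.5], sample_reward=1.0, greedy_reward=0.4)
```

In a real training loop the log-probabilities would be differentiable tensors and the rewards would be treated as constants.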

  • Conference Article
  • Citations: 26
  • 10.18653/v1/d19-5205
English to Hindi Multi-modal Neural Machine Translation and Hindi Image Captioning
  • Jan 1, 2019
  • Sahinur Rahman Laskar + 3 more

Machine Translation (MT) techniques are widely used in an attempt to minimize the communication gap among people from diverse linguistic backgrounds. We have participated in the Workshop on Asian Translation 2019 (WAT2019) multi-modal translation task. There are three submission tracks, namely multi-modal translation, Hindi-only image captioning, and text-only translation, for English to Hindi translation. The main challenge is to provide a precise MT output. The multi-modal concept incorporates textual and visual features in the translation task. In this work, the multi-modal translation track relies on a pre-trained convolutional neural network (CNN), the 19-layer Visual Geometry Group network (VGG19), to extract image features, and on an attention-based Neural Machine Translation (NMT) system for translation. A merge model of a recurrent neural network (RNN) and a CNN is used for Hindi-only image captioning. The text-only translation track is based on the transformer model of the NMT system. The official results of the WAT2019 translation task show that our multi-modal NMT system achieved a Bilingual Evaluation Understudy (BLEU) score of 20.37, a Rank-based Intuitive Bilingual Evaluation Score (RIBES) of 0.642838, and an Adequacy-Fluency Metrics (AMFM) score of 0.668260 on the challenge test data, and a BLEU score of 40.55, RIBES of 0.760080, and AMFM score of 0.770860 on the evaluation test data, in English to Hindi multi-modal translation.

  • Report Series
  • Citations: 8
  • 10.29007/hxhn
Multimodal Neural Machine Translation Using CNN and Transformer Encoder
  • Apr 2, 2019
  • EasyChair preprint
  • Hiroki Takushima + 3 more

Multimodal machine translation uses images related to source language sentences as inputs to improve translation quality. Previous multimodal Neural Machine Translation (NMT) models, which incorporate visual features of each image region into an encoder for source language sentences or into an attention mechanism between an encoder and a decoder, cannot capture the relations between visual features from different image regions. This paper proposes a new multimodal NMT model, which encodes an input image using a Convolutional Neural Network (CNN) and a Transformer encoder. In particular, the proposed image encoder first extracts visual features from each image region using a CNN, and then encodes the input image on the basis of the extracted visual features using a Transformer encoder, where the relations between visual features from the image regions are captured by the self-attention mechanism of the Transformer encoder. Experiments on the English-German translation task using the Multi30k data set show that the proposed model achieves an improvement of 0.96 BLEU points over a baseline Transformer NMT model without image inputs and of 0.47 BLEU points over a baseline multimodal Transformer NMT model without a Transformer encoder for images.

  • Research Article
  • 10.55041/ijsrem26789
Automatic Intelligence Caption Generator
  • Nov 1, 2023
  • INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Trushna Kapadnis + 4 more

An Image Caption Generator is a sophisticated AI system that combines computer vision and natural language processing to automatically create descriptive textual captions for images. This technology uses deep learning, particularly Convolutional Neural Networks (CNNs), to analyze and extract meaningful visual features from the input image. These features capture details about the objects, scenes, and elements within the image. Subsequently, a natural language processing model, often built on Recurrent Neural Networks (RNNs) or Transformers, processes these visual features and generates coherent, contextually relevant captions. Post-processing steps may be applied to enhance the quality of the generated text. The primary aim of Image Caption Generators is to facilitate image understanding, improve accessibility, and enhance content searchability by providing human-readable descriptions for visual content. This technology is instrumental in various fields, including content tagging, accessibility tools for the visually impaired, and enhancing user experiences in multimedia content management systems, ultimately bridging the gap between visual and textual information for a more comprehensive and human-like interpretation of images. Key Words: Image Recognition, Internet, Image-To-Caption, Contextual Understanding, Image Captioning.

  • Book Chapter
  • Citations: 5
  • 10.4018/979-8-3693-5643-2.ch009
A Deep Learning-Based Efficient Image Captioning Approach for Hindi Language
  • Apr 5, 2024
  • Vishal Jayaswal + 2 more

An image caption is a statement that simply conveys the contents of an image. Image captioning requires both digital image processing and natural language processing. Previously, the majority of image captioning research was done for the English language, while far less work exists for Hindi. Hindi is the national language of India and the fourth most widely spoken language in the world. The vast majority of Indians speak Hindi. This was the main reason behind the choice to develop a Hindi-language image captioning algorithm. In this chapter, an effective deep learning-based image captioning model built on an encoder-decoder architecture is proposed for the Hindi language. The encoding process uses a convolutional neural network (CNN), while the decoding process employs a recurrent neural network (RNN) with an attention mechanism. For the implementation, the Hindi version of the Flickr 8k dataset is used, and the BLEU score is used to evaluate image captioning performance.
