Abstract
The convergence of computer vision and natural language processing in artificial intelligence has attracted significant interest in recent years, driven largely by advances in deep learning. One notable application born of this synergy is the automatic description of images in English. Image captioning requires a computer to interpret the visual content of an image and translate it into one or more descriptive sentences. Generating meaningful descriptions demands an understanding of the state, properties, and relationships of the depicted objects, and therefore a grasp of high-level image semantics. Automatic captioning is thus a complex task that intertwines image analysis with text generation. Central to this process is attention, which determines what to describe and in what order. While transformer architectures have proven successful in text analysis and machine translation, adapting them to image captioning poses unique challenges: the semantic units of an image (typically regions identified by an object detection model) are structured differently from those of a sentence (individual words), and little effort has been devoted to tailoring transformer architectures to the structural characteristics of images. In this study, we introduce the Image Transformer, a novel architecture comprising a modified encoding transformer and an implicit decoding transformer, in which the inner architecture of the original transformer layer is expanded to better accommodate the structure of images. Using only region features as inputs, our model achieves state-of-the-art performance on the MS COCO dataset. The CNN-Transformer model presented here detects objects within images and conveys that information as text; a natural application is assisting people with visual impairments through text-to-speech messages, improving their access to information and supporting their cognitive abilities. This paper examines the fundamental concepts and standard procedures of image captioning and introduces a generative CNN-Transformer model as a significant advancement in the field.
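To make the pipeline concrete, the following is a minimal PyTorch sketch of a generic region-features-to-caption transformer of the kind the abstract describes. It is not the paper's Image Transformer: the modified encoding and implicit decoding layers are not reproduced, and the feature dimension, layer counts, vocabulary size, and all other hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CaptioningTransformer(nn.Module):
    """Minimal region-features-to-caption transformer (illustrative sketch)."""

    def __init__(self, feat_dim=2048, d_model=512, nhead=8,
                 num_layers=3, vocab_size=10000, max_len=32):
        super().__init__()
        # Project detector region features (e.g. pooled vectors from an
        # object detection CNN) into the transformer's embedding space.
        self.region_proj = nn.Linear(feat_dim, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, regions, captions):
        # regions:  (B, R, feat_dim) features for R detected regions.
        # captions: (B, T) token ids of the partial caption (teacher forcing).
        memory = self.encoder(self.region_proj(regions))
        t = captions.size(1)
        pos = torch.arange(t, device=captions.device)
        tgt = self.tok_embed(captions) + self.pos_embed(pos)
        # Causal mask: each position may attend only to earlier tokens.
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # (B, T, vocab_size) next-token logits

# Smoke test with random inputs (36 regions, a 12-token caption prefix).
model = CaptioningTransformer()
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

At inference time such a model would be run autoregressively, feeding each predicted token back as input until an end-of-sentence token is produced; the resulting caption can then be passed to a text-to-speech system for the accessibility use case described above.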