Abstract

Many vision–language models that output natural language, such as image-captioning models, use image features merely to ground the captions; most of their performance can be attributed to the language model, which does the heavy lifting. This phenomenon has persisted even as transformer-based architectures have become the preferred backbone of recent state-of-the-art vision–language models. In this paper, we make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. This is achieved by applying a 1D Fourier transform to the image features, first along the hidden dimension and then along the sequence dimension. These extracted features, combined with the region-proposal image features, yield a richer image representation that can then be queried to produce the associated captions, which demonstrate a deeper understanding of image–object–location relationships than those of similar models. Extensive experiments on the MSCOCO benchmark dataset yield CIDEr-D, BLEU-1, and BLEU-4 scores of 130, 80.5, and 39, respectively.
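
The two-stage Fourier step described above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the tensor shapes, the fourier_mix function name, and the concatenation-based fusion with the region-proposal features are assumptions.

```python
import torch


def fourier_mix(region_feats: torch.Tensor) -> torch.Tensor:
    """Apply a 1D FFT over the hidden dimension, then over the
    sequence (region) dimension, keeping only the real part.

    region_feats: (batch, num_regions, hidden_dim) image region features.
    Returns a tensor of the same shape with globally mixed features.
    """
    # FFT along the hidden (feature) dimension.
    mixed = torch.fft.fft(region_feats, dim=-1)
    # FFT along the sequence (region) dimension.
    mixed = torch.fft.fft(mixed, dim=1)
    # Keep the real part so downstream layers stay real-valued.
    return mixed.real


if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)  # e.g. 36 region proposals, 512-d each
    fourier_feats = fourier_mix(feats)
    # One possible fusion (an assumption): concatenate along the region axis
    # so the caption decoder can attend over both the raw region features
    # and the Fourier-mixed features.
    enriched = torch.cat([feats, fourier_feats], dim=1)  # (2, 72, 512)
    print(enriched.shape)
```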

Highlights

  • Describing an image is quite a trivial task for humans, yet for many years, until recently, it was very challenging for a machine to perform at a level close to that of humans

  • The major breakthrough came with the introduction of deep learning-based image-captioning systems [1], inspired by the success of deep learning-based neural machine translation systems [2], which ushered in arguably the most significant era in vision–language research

  • A convolutional neural network (CNN) encoder, pre-trained on a large image classification dataset such as ImageNet [6], extracts the image features into a fixed-length vector that is fed to the decoder LSTM, which learns to generate captions one word at a time by conditioning them on both the image features and the previously generated words (see the sketch after this list)
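
A minimal sketch of that CNN-encoder/LSTM-decoder pipeline is given below. It is not the exact model of [1]: the ResNet-50 backbone (loaded via recent torchvision), the layer sizes, and the class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CNNLSTMCaptioner(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner in the spirit of [1]."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # Pre-trained CNN encoder; the final classification layer is replaced
        # so the network outputs a fixed-length image feature vector.
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # LSTM decoder conditioned on the image feature and previous words.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        img_feat = self.encoder(images).unsqueeze(1)     # (B, 1, embed_dim)
        word_emb = self.embed(captions)                  # (B, T, embed_dim)
        # The image feature is fed as the first "word" of the input sequence.
        inputs = torch.cat([img_feat, word_emb], dim=1)  # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.head(hidden)                         # next-word logits
```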


Introduction

Describing an image is quite a trivial task for humans, yet for many years, until recently, it was very challenging for a machine to perform at a level close to that of humans. The challenge of automatic image captioning stems from the fact that it is a vision-to-language task that requires a machine both to understand the semantic content of an image and to have language understanding as well as generation capabilities. It is further complicated by being an ambiguous task, with many accurate descriptions possible for each image.
