Abstract

We review the existing literature on generating text from visual data under the umbrella of cross-modal generation, which allows us to compare and contrast various approaches that take visual data as input and produce text as output, without limiting the analysis to narrowly defined areas such as image captioning. We break image-to-text generation methods down into generative and non-generative image captioning and visual dialogue, with further distinctions drawn within each area. We introduce template methods and discuss the existing research in light of them, highlighting both the salient commonalities between different approaches and significant departures from the templates. Where it is of interest, we also compare templates across distinct areas. To achieve a comprehensive review, we focus on research papers published at eight leading machine learning conferences in the years 2016–2021, as well as a number of papers that do not conform to our search criteria but nonetheless come from leading venues. To our knowledge, this is the first review to provide a systematic description of the current state of image-to-text generation and to tie distinct research areas together by viewing them through the lens of cross-modal generation.
