Vision to Language: Methods, Metrics and Datasets

Naeha Sharif, Wei Liu, Uzair Nadeem, Mohammed Bennamoun, Syed Afaq Ali Shah

doi:10.1007/978-3-030-49724-8_2

Abstract

Alan Turing’s pioneering vision in the 1950s of machines capable of thinking like humans is still what Artificial Intelligence (AI) and Deep Learning research aspires to realize, 70 years on. With replicating or modeling human intelligence as the ultimate goal, AI’s Holy Grail is to create systems that can perceive and reason about the world as humans do and perform tasks such as visual interpretation and processing, speech recognition, decision-making and language understanding. In this quest, two of the dominant subfields of AI, Computer Vision and Natural Language Processing, attempt to create systems that can fully understand the visual world and achieve human-like language processing, respectively. The ability to interpret and describe visual content in natural language is one of the most distinctive human capabilities. While humans accomplish this with ease, it is very hard for a machine to mimic this complex process. The past decade has seen significant research effort on computational tasks that involve translation between, and fusion of, two modalities of human communication, namely Vision and Language. Moreover, the unprecedented success of deep learning has further propelled research on tasks that link images to sentences, instead of just tags (as done in Object Recognition and Classification). This chapter discusses the fundamentals of generating natural language descriptions of images, the prominent and state-of-the-art methods and their limitations, the various challenges in image captioning, and future directions for advancing this technology toward practical real-world applications. It also serves as a reference to a comprehensive list of data resources for training deep captioning models and of the metrics currently in use for model evaluation.

Full Text