Abstract

Alan Turing’s pioneering 1950s vision of machines capable of thinking like humans is still what Artificial Intelligence (AI) and Deep Learning research aspires to realize, 70 years on. With replicating or modeling human intelligence as the ultimate goal, AI’s Holy Grail is to create systems that can perceive and reason about the world as humans do and perform tasks such as visual interpretation and processing, speech recognition, decision-making and language understanding. In this quest, two of the dominant subfields of AI, Computer Vision and Natural Language Processing, attempt to create systems that can fully understand the visual world and achieve human-like language processing, respectively. The ability to interpret and describe visual content in natural language is one of the most distinctive human capabilities. While humans accomplish this with ease, it is very hard for a machine to mimic this complex process. The past decade has seen significant research effort on computational tasks that involve translation between, and fusion of, the two modalities of human communication, namely Vision and Language. Moreover, the unprecedented success of deep learning has further propelled research on tasks that link images to sentences rather than just tags (as in Object Recognition and Classification). This chapter discusses the fundamentals of generating natural language descriptions of images, the prominent and state-of-the-art methods and their limitations, the various challenges in image captioning, and future directions for pushing this technology toward practical real-world applications. It also serves as a reference to a comprehensive list of data resources for training deep captioning models and of the metrics currently used for model evaluation.
