Abstract
Natural language processing (NLP) and computer vision (CV) are two of the most important areas in the field of artificial intelligence (AI). CV studies how computers can be taught to perceive and interpret digital visual content such as images and videos. NLP is the discipline that enables computers to process, understand, and even synthesise human language. With the advent and development of deep learning over the last decade, a continual flow of innovations and breakthroughs has convincingly pushed the boundaries and improved the state of the art in both vision and language modelling.

A noteworthy trend is that research in the two areas has begun to interact, and prior work has repeatedly demonstrated the benefits of this interaction. In general, vision and language interact along two directions: vision to language and language to vision. The former primarily recognises or describes visual content with a set of individual words or a natural sentence, in the form of tags [1], answers [2], captions [3–5], and comments [6]. A tag, for example, usually denotes a specific object, action, or event in the visual content. An answer is a statement made in response to a question about the facts depicted in an image or video. A caption is a natural-language utterance (typically a sentence) that describes the visual content in richer detail than tags or answers. The latter direction, language to vision, aims to convert text into an image or a video. For example, given the textual description "this small bird has a short beak and dark stripe down the top, the wings are a mix of brown, white, and black," the goal of text-to-image synthesis is to generate a bird image that satisfies all of these attributes.

This study examines recent state-of-the-art AI advances in both vision to language, particularly image/video captioning, and language to vision. Real-world deployments in both domains are also presented as illustrations of how AI improves user engagement and transforms consumer experiences in industrial applications. The remainder of the paper is organised as follows. Section II outlines the evolution of vision to language by laying out a brief road map of significant image/video captioning techniques, distilling a typical encoder–decoder structure, and summarising results on a common benchmark; practical applications of vision to language are then discussed. Section III describes the technological advances in language to vision across different scenarios and generation methodologies, and summarises progress in language to image, language to video, and AI-powered applications. Finally, Section IV concludes the paper.
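To make the encoder–decoder structure distilled in Section II concrete, the following is a minimal PyTorch sketch of an image captioning model. It is an illustration only, not the architecture evaluated in this paper: the toy convolutional encoder, the vocabulary size, and all dimensions are assumptions chosen for brevity. In practice the encoder would be a pretrained CNN (e.g. a ResNet) and the decoder would typically incorporate attention over spatial features.

```python
import torch
import torch.nn as nn

class EncoderDecoderCaptioner(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner (illustrative only).

    The encoder maps an image to a fixed-length feature vector; the
    decoder generates the caption one token at a time, conditioned on
    that vector via teacher forcing during training.
    """

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # Toy convolutional encoder; a pretrained CNN would be used in practice.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image feature as the first "token" of the sequence,
        # then let the LSTM predict each next word of the caption.
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, words], dim=1)   # (B, T+1, E)
        hidden, _ = self.decoder(inputs)            # (B, T+1, H)
        return self.proj(hidden)                    # (B, T+1, V) word scores

# Usage: per-position scores over a hypothetical 10 000-word vocabulary.
model = EncoderDecoderCaptioner(vocab_size=10_000)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 10_000, (4, 12))
logits = model(images, captions)
print(logits.shape)  # torch.Size([4, 13, 10000])
```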