Abstract

In recent years, the confluence of computer vision and natural language processing, propelled by advances in deep learning, has attracted significant interest. Among its notable applications, image captioning stands out, enabling computers to describe visual content in one or more sentences. This task entails not only identifying objects and scenes but also analyzing their attributes, states, and interrelations, and then generating meaningful descriptions that capture the high-level semantics of an image. Although inherently complex, image captioning has seen remarkable progress through the efforts of numerous researchers. This paper offers a comprehensive review of three prominent image captioning methodologies built on deep neural networks: CNN-RNN, CNN-CNN, and reinforcement-learning-based frameworks. Each approach is accompanied by a detailed analysis of representative works, elucidating their respective contributions. The evaluation metrics pertinent to these methods are then discussed, followed by a synthesis of their advantages and primary challenges. Through this examination, the review aims to provide insight into the evolving landscape of image captioning and to highlight avenues for further exploration and innovation.
