Abstract

Recent advances in deep learning have enabled machines to see, hear, and even speak; in some cases, machines have even outperformed humans at these complex tasks. Such improvements have reignited interest in many fields. Image captioning, which lies at the intersection of computer vision and natural language processing, has recently received significant attention. Deep learning-based image captioning models represent a great improvement over traditional methods. However, most work in image captioning is based on supervised deep learning methods. Recently, unsupervised image captioning has started to gain momentum. This paper presents the first survey that focuses on unsupervised and semi-supervised image captioning techniques and methods. Additionally, the survey shows how such methods can be used under different data availability and data pairing settings, where some methods require paired data while others can operate on unpaired data. Furthermore, special cases of unpaired data, such as cross-domain and cross-lingual image captioning, are also discussed. Finally, the survey presents a discussion of the challenges and future research directions of image captioning.
