Abstract

In recent years, automatically generating natural language descriptions for videos has attracted considerable attention in computer vision and natural language processing research. Video understanding underpins several applications, such as video retrieval and indexing, but video captioning remains a particularly challenging task because of the complex and diverse nature of video content. Bridging video content and natural language sentences is still an open problem, motivating a range of methods that aim to better understand videos and generate sentences automatically. Deep learning approaches have drawn increasing attention in video processing because of their strong performance and high-speed computing capability. This survey discusses methods that use end-to-end encoder-decoder frameworks based on deep learning to generate natural language descriptions for video sequences. It also reviews the datasets used for video captioning and image captioning, as well as the evaluation metrics used to measure the performance of different video captioning models.

