Abstract

In recent years, automatically generating natural language descriptions for videos has attracted considerable attention in computer vision and natural language processing research. Video understanding underpins several applications, such as video retrieval and indexing, but video captioning remains a particularly challenging task because of the complex and diverse nature of video content. Bridging video content and natural language sentences is still an open problem, motivating a range of methods that aim to better understand videos and generate sentences automatically. Deep learning approaches have drawn increasing attention in video processing because of their strong performance and high-speed computing capability. This survey discusses methods that use end-to-end encoder-decoder frameworks based on deep learning to generate natural language descriptions for video sequences. It also reviews the datasets used for video captioning and image captioning, as well as the evaluation metrics used to measure the performance of different video captioning models.

