Abstract

Video description is the automatic generation of natural language sentences that describe the content of a given video. It has several application areas, such as video subtitling, human–robot interaction, and assistance for the visually impaired. In this paper, we review state-of-the-art deep learning techniques for video description. Most of these methods use a 3D CNN to convert videos into multi-dimensional feature arrays, apply a word embedding technique such as GloVe to featurize the text descriptions, and finally train an RNN, an LSTM, or a variant of the two to generate the output sentence. We describe the benchmark datasets used by these methods and the evaluation metrics employed, identifying the pros and cons of each metric, and we report the state-of-the-art performance on these benchmarks. Wherever applicable, we also list the advantages and drawbacks of each method. This survey can serve as a preparatory read for anyone entering the field.
