Abstract

Video description is the automatic generation of natural language sentences that describe the content of a given video. It has several application areas, such as video subtitling, human–robot interaction, and assistance for the visually impaired. In this paper, we review state-of-the-art deep learning techniques for video description. Most of these methods use a 3D CNN to convert videos into multi-dimensional feature arrays, apply a word embedding technique such as GloVe to featurize the text descriptions, and finally train an RNN, an LSTM, or a variant of the two to generate the output sentence. We describe the benchmark datasets used by these methods and the evaluation metrics employed, identifying the pros and cons of each metric, and we report the state-of-the-art performance on these benchmarks. Wherever applicable, we also list the advantages and drawbacks of each method. This survey can serve as a preparatory read for anyone entering the field.
