Abstract

Wildlife videos often exhibit complex dynamics, and generating captions for wildlife clips involves both natural language processing and computer vision. Current video captioning techniques have shown encouraging results, but they derive captions from video frames alone, ignoring the audio track. In this paper we propose to generate natural language video captions from both audio and visual information, using a deep architecture that combines convolutional neural networks (CNNs) for feature extraction with recurrent neural networks (RNNs) for caption generation. Experimental results on a corpus of wildlife clips show that fusing audio information substantially improves video description performance.
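Since the abstract describes the approach only at a high level, the following is a minimal PyTorch sketch of one plausible realization: pre-extracted CNN frame features and audio features are projected into a shared space, fused by concatenation, and fed to an LSTM decoder that emits caption tokens. All class names, dimensions, and the concatenation-based fusion strategy are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical audio-visual captioning model. The abstract does not specify
# the architecture, so every dimension, layer choice, and the fusion scheme
# (temporal mean pooling + concatenation) below is an assumption.
class AudioVisualCaptioner(nn.Module):
    def __init__(self, vocab_size, visual_dim=2048, audio_dim=128,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        # Project pre-extracted CNN frame features and audio features
        # (e.g. log-mel statistics) into a shared hidden space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # RNN decoder conditioned on the fused audio-visual context.
        self.decoder = nn.LSTM(embed_dim + 2 * hidden_dim, hidden_dim,
                               batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, audio_feats, captions):
        # visual_feats: (batch, n_frames, visual_dim) from a CNN encoder
        # audio_feats:  (batch, n_windows, audio_dim)
        # captions:     (batch, seq_len) token ids, used for teacher forcing
        v = self.visual_proj(visual_feats).mean(dim=1)   # temporal mean pool
        a = self.audio_proj(audio_feats).mean(dim=1)
        context = torch.cat([v, a], dim=-1)              # audio-visual fusion
        seq_len = captions.size(1)
        ctx = context.unsqueeze(1).expand(-1, seq_len, -1)
        x = torch.cat([self.embed(captions), ctx], dim=-1)
        hidden, _ = self.decoder(x)
        return self.out(hidden)                          # (batch, seq_len, vocab)

# Toy usage with random tensors standing in for real CNN/audio extractors.
model = AudioVisualCaptioner(vocab_size=1000)
logits = model(torch.randn(2, 8, 2048), torch.randn(2, 8, 128),
               torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```

In a setup like this, the model would typically be trained with token-level cross-entropy against reference captions; more elaborate fusion schemes (e.g. attention over frames and audio windows) are common alternatives to the simple concatenation shown here.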
