Abstract

Video captioning is a visual understanding task that generates grammatically and semantically meaningful descriptions of video content, and it is of interest to both computer vision (CV) and natural language processing (NLP). Recent advances in the computing power of mobile platforms have led to many video captioning applications that use CV and NLP techniques. These applications mainly rely on an encoder-decoder approach that runs over an internet connection, employing convolutional neural networks (CNNs) in the encoder and recurrent neural networks (RNNs) in the decoder. However, this approach is not powerful enough to produce accurate captions, and its dependence on online data transfer slows the response. In this paper, therefore, the encoder-decoder approach is extended with a sequence-to-sequence model built on a multi-layer gated recurrent unit (GRU) to generate semantically more coherent captions. In the encoder, visual features are extracted from each video frame with a ResNet-101 CNN and fed to the multi-layer GRU-based decoder for caption generation. The proposed approach is compared with state-of-the-art approaches in experiments on the MSVD dataset under eight performance metrics. In addition, the proposed approach is embedded in our custom-designed Android application, WeCap, which generates captions faster and without an internet connection.
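
To make the described encoder-decoder pipeline concrete, the following is a minimal PyTorch sketch of a ResNet-101 frame encoder feeding a multi-layer GRU caption decoder. It is not the authors' implementation: the class names, embedding and hidden dimensions, the frozen backbone, and the mean-pooling of frame features used to initialise the decoder state are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class FrameEncoder(nn.Module):
    """Extracts one feature vector per video frame with a pretrained ResNet-101."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        # Drop the final classification layer; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # assumption: freeze the CNN, train only the decoder

    def forward(self, frames):
        # frames: (batch, num_frames, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (b*t, 2048, 1, 1)
        return feats.view(b, t, -1)                   # (b, t, 2048)


class GRUCaptionDecoder(nn.Module):
    """Multi-layer GRU decoder that generates a caption from frame features."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512,
                 hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_hidden = nn.Linear(feat_dim, hidden_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.num_layers = num_layers

    def forward(self, frame_feats, captions):
        # Mean-pool frame features to initialise every GRU layer's hidden state
        # (an illustrative choice; the paper may condition the decoder differently).
        video_vec = frame_feats.mean(dim=1)                    # (b, feat_dim)
        h0 = torch.tanh(self.init_hidden(video_vec))           # (b, hidden_dim)
        h0 = h0.unsqueeze(0).repeat(self.num_layers, 1, 1)     # (layers, b, hidden_dim)
        emb = self.embed(captions)                             # (b, seq_len, embed_dim)
        out, _ = self.gru(emb, h0)
        return self.out(out)                                   # (b, seq_len, vocab_size)
```

In this sketch the decoder is trained with teacher forcing (ground-truth caption tokens as input); at inference time the same modules would be run step by step, feeding each predicted token back into the GRU.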
