Abstract

Video captioning with deep learning is one of the most challenging problems in machine vision and artificial intelligence. In this paper, a new boosted and parallel architecture based on Long Short-Term Memory (LSTM) networks is proposed for video captioning. The proposed architecture comprises two LSTM layers and a word selection module. The first LSTM layer is responsible for encoding frame features extracted by a pre-trained deep Convolutional Neural Network (CNN). The second layer uses a novel architecture that leverages several decoding LSTMs arranged in parallel within a boosting framework. This layer, called the Boosted and Parallel LSTM (BP-LSTM) layer, is constructed by iteratively training LSTM networks with a variant of the AdaBoost algorithm during the training phase. During the testing phase, the outputs of the BP-LSTMs are combined concurrently using a maximum-probability criterion and the word selection module. We evaluated the proposed algorithm on two well-known video captioning datasets and compared the results with state-of-the-art algorithms. The results show that the proposed architecture considerably improves the accuracy of the generated captions.
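The following is a minimal, illustrative sketch of how the word selection module might combine the outputs of the parallel decoding LSTMs at test time, assuming each decoder produces a per-time-step softmax distribution over the vocabulary and that the boosting weights (`alphas`) scale each decoder's contribution. The function and parameter names are hypothetical and not taken from the paper.

```python
import numpy as np

def select_words(decoder_probs, vocab, alphas=None):
    """Hypothetical word-selection module: combine per-decoder word
    distributions at each time step with a maximum-probability criterion.

    decoder_probs: array of shape (num_decoders, T, vocab_size) holding the
        softmax outputs of the parallel decoding LSTMs (assumed precomputed).
    vocab: list mapping word indices to word strings.
    alphas: optional AdaBoost-style weights, one per decoder.
    """
    probs = np.asarray(decoder_probs, dtype=float)
    if alphas is not None:
        # Scale each decoder's distribution by its boosting coefficient.
        probs = probs * np.asarray(alphas, dtype=float)[:, None, None]

    caption = []
    for t in range(probs.shape[1]):
        # Pick the single highest-scoring (decoder, word) pair at step t.
        step = probs[:, t, :]
        word_idx = int(np.unravel_index(step.argmax(), step.shape)[1])
        caption.append(vocab[word_idx])
    return caption

# Toy usage: 3 parallel decoders, 2 time steps, 4-word vocabulary.
vocab = ["a", "man", "is", "running"]
rng = np.random.default_rng(0)
decoder_probs = rng.dirichlet(np.ones(4), size=(3, 2))  # shape (3, 2, 4)
print(select_words(decoder_probs, vocab, alphas=[0.5, 0.3, 0.2]))
```

In practice the decoders would be run step by step with the selected word fed back as the next input; the sketch above only shows the combination rule over already-computed distributions.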
