Att-BiL-SL: Attention-Based Bi-LSTM and Sequential LSTM for Describing Video in the Textual Formation

Shakil Ahmed, Siam Bin Shawkat, A F M Saifuddin Saif, Farzad Rahman, Md Mostofa Nurannabi Shakil, Md Mostofa Jaman, Md Imtiaz Hanif, Borshan Sarker Sonok, Jahid Hasan, Md Mazid Ul Haque, Hasan Muhommod Sabbir

doi:10.3390/app12010317

Abstract

As technology advances, people around the world have ever easier access to internet-enabled devices, and as a result video data is growing rapidly. The spread of portable devices such as action cameras, mobile cameras, and motion cameras also contributes to the faster growth of video data. Data from these multiple sources needs more maintenance and processing before it can be put to use. Given the enormous amount of video data, end-users cannot navigate it fully. In recent years, many research works have addressed this issue by generating descriptions from images or recorded visual scenes. Description generation for video, known as video captioning, is more complex than single-image captioning. Various advanced neural networks have been used to perform video captioning. In this paper, we propose an attention-based Bi-LSTM and sequential LSTM (Att-BiL-SL) encoder-decoder model for describing video in textual format. The model consists of a two-layer attention-based Bi-LSTM and a one-layer sequential LSTM for video captioning. The model also extracts the universal and native temporal features from the video frames for smooth sentence generation from optical frames. It uses word embedding with a soft attention mechanism and a beam search optimization algorithm to generate qualitative results. We find that the proposed architecture performs better than various existing state-of-the-art models.
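To illustrate the beam search step mentioned in the abstract, the following is a minimal sketch in plain Python, not the implementation used in this work. The callback `step_log_probs`, the lookup table `TOY_MODEL`, and the `<eos>` token are hypothetical stand-ins for a trained decoder; in the real model the next-word probabilities would come from the LSTM decoder.

```python
import math

def beam_search(step_log_probs, beam_width, max_len, eos):
    """Return (tokens, log_prob) of the best sequence found by beam search.

    step_log_probs(prefix) is a hypothetical callback that returns a dict
    mapping each candidate next token to its log-probability given prefix.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_log_probs(seq).items():
                candidates.append((seq + (token,), score + logp))
        # Keep only the beam_width best hypotheses at this step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses remain eligible at max_len
    return max(finished, key=lambda c: c[1])

# Toy next-word table (assumption, for illustration only).
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"man": 0.55, "dog": 0.45},
    ("the",): {"man": 0.9, "dog": 0.1},
}

def toy_step(prefix):
    # Any prefix not in the table ends the sentence with probability 1.
    probs = TOY_MODEL.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy_tokens, greedy_score = beam_search(toy_step, 1, 5, "<eos>")
beam_tokens, beam_score = beam_search(toy_step, 2, 5, "<eos>")
```

With a beam width of 1 the search is greedy and commits to the locally best first word, whereas a width of 2 keeps the second-best first word alive and can recover a sentence with a higher overall probability.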

Highlights

In recent years, people have gained easier access to a massive amount of visual data on various online platforms, which usually contains sound, visual scenes, and sometimes textual data
We propose an attention-based bi-LSTM [53] as an encoder in our framework
We obtained the best results on both datasets with the Xception + Inflated 3D attention-based Bi-LSTM and Sequential LSTM Convolutional Neural Network (CNN) model
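As a rough illustration of the soft attention idea behind the attention-based encoder, the sketch below computes attention weights over a list of encoder states and the resulting context vector. It assumes plain dot-product scoring between each encoder state and the current decoder state; the scoring function actually used in the model may differ (for example, a learned additive score), and the names `soft_attention`, `encoder_states`, and `decoder_state` are hypothetical.

```python
import math

def soft_attention(encoder_states, decoder_state):
    """Return (context, weights) for dot-product soft attention.

    encoder_states: list of equal-length feature vectors, one per frame.
    decoder_state:  vector of the same length (assumed scoring query).
    """
    # Alignment score of each encoder state with the decoder state.
    scores = [
        sum(h_d * s_d for h_d, s_d in zip(h, decoder_state))
        for h in encoder_states
    ]
    # Numerically stable softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return context, weights

context, weights = soft_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 1.0])
same_context, same_weights = soft_attention([[1.0, 2.0], [1.0, 2.0]], [0.5, 0.5])
```

Because every frame receives a non-zero weight, the decoder attends to all frames at once, with emphasis on those whose features best match its current state.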

Summary

Introduction

In recent years, people have gained easier access to a massive amount of visual data on various online platforms, which usually contains sound, visual scenes, and sometimes textual data. A report projected that video (e.g., Netflix and YouTube) might grow to 82% of web traffic in 2021, up from 73% of total traffic in 2016 [1]. This integrated, multi-source information requires more analytic processing power and a huge amount of storage space. A succinct video synopsis featuring the essential parts of the video will help in indexing and faster retrieval of the required information. It will likewise be valuable for navigation, notification, and information analysis throughout a large video.

Methods

Results

Conclusion