Abstract

Since the September 11 attacks, security and surveillance measures have changed across the globe. Surveillance cameras are now installed almost everywhere to capture video footage. Although convenient, these cameras produce video in enormous volumes, and the major challenge faced by security agencies is analyzing the surveillance video data collected and generated daily. Problems related to these videos are twofold: (1) understanding the contents of the video streams, and (2) converting the video contents to condensed formats, such as textual interpretations and summaries, to save storage space. In this paper, we propose a video description framework for surveillance data. The framework is based on multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks. For each task, a parallel pipeline is derived from the base visual geometry group (VGG)-16 model; the tasks are scene recognition, action recognition, object recognition, and human face-specific feature recognition. Experimental results on the TRECViD, UET Video Surveillance (UETVS), and AGRIINTRUSION datasets show that the model outperforms state-of-the-art methods, achieving METEOR (Metric for Evaluation of Translation with Explicit ORdering) scores of 33.9%, 34.3%, and 31.2%, respectively. Our results show that the framework has distinct advantages over traditional rule-based models for recognizing video content and generating natural language descriptions.
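The sketch below illustrates, in PyTorch, the kind of architecture the abstract describes: a shared VGG-16 convolutional base feeding parallel task-specific heads for scene, action, object, and face-feature recognition, followed by a bidirectional recurrent decoder for language generation. It is a minimal illustration, not the authors' released code; the head sizes, hidden dimensions, and pooling choices are assumptions made for the example.

```python
# Minimal sketch (assumed design, not the authors' implementation):
# a shared VGG-16 backbone with one head per high-level feature (HLF) task,
# plus a bidirectional LSTM that maps HLF sequences to word logits.
import torch
import torch.nn as nn
from torchvision import models


class MultiTaskHLF(nn.Module):
    """Shared VGG-16 features with a parallel classification head per HLF task."""

    def __init__(self, num_scenes=10, num_actions=20, num_objects=50, num_face_attrs=8):
        super().__init__()
        vgg = models.vgg16(weights=None)      # pretrained weights optional
        self.backbone = vgg.features          # shared convolutional base
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        feat_dim = 512 * 7 * 7
        # One parallel pipeline per task, derived from the shared base.
        self.heads = nn.ModuleDict({
            "scene":  nn.Linear(feat_dim, num_scenes),
            "action": nn.Linear(feat_dim, num_actions),
            "object": nn.Linear(feat_dim, num_objects),
            "face":   nn.Linear(feat_dim, num_face_attrs),
        })

    def forward(self, frames):                # frames: (B, 3, 224, 224)
        x = self.pool(self.backbone(frames)).flatten(1)
        return {task: head(x) for task, head in self.heads.items()}


class BiLSTMCaptioner(nn.Module):
    """Bidirectional LSTM decoder stub: HLF feature sequence -> word logits."""

    def __init__(self, feat_dim, vocab_size, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, hlf_seq):               # hlf_seq: (B, T, feat_dim)
        h, _ = self.rnn(hlf_seq)
        return self.out(h)                    # (B, T, vocab_size)
```

In such a setup, the per-task losses from the parallel heads would be summed during training, and the pooled HLF vectors for each frame would form the input sequence to the bidirectional recurrent decoder.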

Highlights

  • There is an exponential increase in digital multimedia, resulting in the generation of enormous amounts of video data

  • Experimental results are reported on the TRECViD, UET Video Surveillance (UETVS) and AGRIINTRUSION datasets

  • We compare the results of this multitask learning-based framework with two baseline models, a long short-term memory (LSTM) semantic compositional network (SCN-LSTM) and a multimodal stochastic recurrent neural network (MS-RNN), as well as with video description approaches based on deep neural networks


Introduction

There is an exponential increase in digital multimedia, resulting in the generation of enormous amounts of video data. The growing rate of multimedia content uploaded to the Internet calls for automatic interpretation and description of videos for the retrieval of important information. This is useful in surveillance, security, human–computer interaction, and robotic intelligence, and it even helps visually impaired people. Among these applications, automatic description of videos in natural language is gaining interest: a video is given to a deep learning framework that converts it into one or more sentences. Early approaches to this task were rule based; more complex rules were applied in [5], which used a relatively large vocabulary to generate sentences. These approaches require monotonous work when the data are huge, and their results were lacking on large datasets such as Microsoft Common Objects in Context (MS COCO) [8].

