Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping

Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu

doi:10.3390/electronics10212654

Abstract

In violence recognition, accuracy is reduced by time-axis misalignment and semantic deviation between the visual and auditory information in multimedia. This paper therefore proposes a method for auditory-visual information fusion based on autoencoder mapping. First, a feature extraction model based on the CNN-LSTM framework is established, and multimedia segments are used as whole inputs to solve the problem of time-axis misalignment between visual and auditory information. Then, a shared semantic subspace is constructed with an autoencoder mapping model and optimized by semantic correspondence, which addresses audiovisual semantic deviation and fuses visual and auditory information at the level of segment features. Finally, the whole network is used to identify violence. The experimental results show that the method makes good use of the complementarity between modalities. Compared with single-modality information, the multimodal method achieves better results.

Highlights

The wide application of high-definition multimedia data acquisition equipment has helped safeguard public security and protect the safety of people and property
This paper proposes a violence recognition model that uses a convolutional neural network (CNN)-long short-term memory network (LSTM) architecture for segment-level feature extraction and uses the autoencoder [27] model to represent the shared semantic subspace mapping for audiovisual information fusion
This paper proposes an auditory-visual information fusion model based on an autoencoder for violent behavior recognition

Summary

Introduction

The wide application of high-definition multimedia data acquisition equipment has helped safeguard public security and protect the safety of people and property. Visual and auditory information can diverge in what they express semantically: for example, a video may show normal behavior accompanied by the sound of an explosion, or it may show violent behavior without any abnormal background sound. Both cases are problems that need to be solved in multimodal feature fusion. This paper proposes a violence recognition model that uses a CNN-LSTM architecture for segment-level feature extraction and uses the autoencoder [27] model to represent the shared semantic subspace mapping for audiovisual information fusion. Through this approach, we seek to circumvent problems related to the time-axis misalignment of audiovisual information.
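The shared-subspace idea described above can be illustrated with a small sketch. This is not the authors' implementation: the single-layer encoders and decoders, the tanh activation, the contrastive form of the correspondence term, and every name (`make_params`, `fusion_loss`, `margin`, `weight`) are assumptions chosen for illustration. The inputs stand in for segment-level audio and visual feature vectors that a CNN-LSTM extractor would produce.

```python
import math
import random

def linear(x, w, b):
    """Apply y = W x + b to one feature vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (hypothetical setup)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def make_params(dim_a, dim_v, dim_z, seed=0):
    """One encoder and one decoder per modality, sharing the code size dim_z."""
    rng = random.Random(seed)
    return {
        "enc_a": init_layer(dim_a, dim_z, rng),
        "enc_v": init_layer(dim_v, dim_z, rng),
        "dec_a": init_layer(dim_z, dim_a, rng),
        "dec_v": init_layer(dim_z, dim_v, rng),
    }

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fusion_loss(audio, visual, params, corresponds, margin=1.0, weight=0.5):
    """Return (loss, fused_code) for one audio/visual segment pair.

    The loss is autoencoder reconstruction error plus an assumed contrastive
    semantic-correspondence term on the two shared-subspace codes.
    """
    # Map each modality into the shared semantic subspace.
    za = [math.tanh(v) for v in linear(audio, *params["enc_a"])]
    zv = [math.tanh(v) for v in linear(visual, *params["enc_v"])]
    # Each decoder reconstructs its own modality from its code.
    recon = mse(linear(za, *params["dec_a"]), audio)
    recon += mse(linear(zv, *params["dec_v"]), visual)
    # Pull codes together when the modalities agree semantically;
    # push them at least `margin` apart when they do not.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zv)))
    corr = dist ** 2 if corresponds else max(0.0, margin - dist) ** 2
    # Concatenated codes serve as the fused segment-level representation.
    return recon + weight * corr, za + zv
```

In this sketch the fused code would be passed to a downstream violence classifier; training the weights (for example by gradient descent on the returned loss) is left out to keep the example short.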

Auditory and Visual Feature Extraction Method

Auditory Feature Extraction Based on CNN-LSTM

Visual Feature Based on CNN-ConvLSTM

The Deep Network for Auditory Visual Information Fusion

Shared Semantic Subspace Based on Autoencoder

Model Optimization Based on Semantic Correspondence

Network Structure

Algorithm Realization

Dataset

Validation of Feature Combination Method

Test Results

Conclusions

Journal: Electronics	Publication Date: Oct 29, 2021
Citations: 7	License type: CC BY 4.0
