Abstract

In the process of violence recognition, accuracy is reduced by two problems: time-axis misalignment and semantic deviation between the visual and auditory information in multimedia. Therefore, this paper proposes an auditory-visual information fusion method based on autoencoder mapping. First, a feature extraction model based on a CNN-LSTM framework is established, and whole multimedia segments are used as input to address the time-axis misalignment of visual and auditory information. Then, a shared semantic subspace is constructed with an autoencoder mapping model and optimized by semantic correspondence, which resolves the audiovisual semantic deviation and fuses the visual and auditory information at the segment level. Finally, the whole network is used to recognize violence. The experimental results show that the method makes good use of the complementarity between modalities and that, compared with single-modality information, the multimodal method achieves better results.
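As a rough illustration of the segment-level feature extraction step, the sketch below encodes a whole segment (here, a sequence of audio spectrogram frames) with a per-frame CNN followed by an LSTM, so that a single feature vector summarizes the entire segment rather than individual frames. All layer sizes, input shapes, and the class name AudioCNNLSTM are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch (assumed shapes and layer sizes): a CNN-LSTM branch that
# turns a whole multimedia segment into one segment-level feature vector.
import torch
import torch.nn as nn

class AudioCNNLSTM(nn.Module):
    """CNN over per-frame spectrogram patches, LSTM over the frame sequence."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame encoder
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),                              # -> 32 * 4 * 4 = 512
        )
        self.lstm = nn.LSTM(512, feat_dim, batch_first=True)

    def forward(self, x):                              # x: (B, T, 1, H, W)
        b, t = x.shape[:2]
        frames = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h, _) = self.lstm(frames)                  # summarize the whole segment
        return h[-1]                                   # (B, feat_dim) segment feature

# Usage: a batch of 4 segments, each with 20 spectrogram frames of 64x64.
feats = AudioCNNLSTM()(torch.randn(4, 20, 1, 64, 64))
print(feats.shape)  # torch.Size([4, 128])
```

Because each modality is summarized per segment rather than per frame, the later fusion stage can operate on segment-level features and does not require frame-by-frame alignment of audio and video.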

Highlights

  • The wide application of high-definition multimedia data acquisition equipment has guaranteed public social security and greatly protected the safety of people and property

  • This paper proposes a violence recognition model that uses a convolutional neural network (CNN)-long short-term memory (LSTM) architecture for fragment-level feature extraction and an autoencoder [27] model to learn a shared semantic subspace mapping for audiovisual information fusion

  • This paper proposes an auditory-visual information fusion model based on an autoencoder for violent behavior recognition


Summary

Introduction

The wide application of high-definition multimedia data acquisition equipment has guaranteed public social security and greatly protected the safety of people and property. However, the semantic expression bias of visual and auditory information remains a problem: for example, normal behavior may be shown in a video accompanied by the sound of an explosion, or violent behavior may be shown without any abnormal background sound. Both of these are problems that need to be solved in the process of multi-modal feature fusion. This paper proposes a violence recognition model that uses a CNN-LSTM architecture for fragment-level feature extraction and an autoencoder [27] model to learn a shared semantic subspace mapping for audiovisual information fusion. Through this approach, we seek to circumvent problems related to the time-axis misalignment of audiovisual information.
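To make the fusion stage concrete, the following sketch shows one plausible form of autoencoder mapping into a shared semantic subspace: each modality's segment-level feature is encoded into a common space, reconstruction losses keep the codes informative, and a correspondence term pulls the audio and visual codes of the same segment together. The dimensions, the simple summed loss, and the class name SharedSubspaceAE are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch (assumed dimensions and loss weighting): autoencoders map
# audio and visual segment features into one shared semantic subspace; a
# correspondence term aligns the two codes of the same segment, and the
# fused code is what a downstream violence classifier would consume.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSubspaceAE(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=256, shared_dim=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(audio_dim, shared_dim), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(visual_dim, shared_dim), nn.ReLU())
        self.dec_a = nn.Linear(shared_dim, audio_dim)
        self.dec_v = nn.Linear(shared_dim, visual_dim)

    def forward(self, a, v):
        za, zv = self.enc_a(a), self.enc_v(v)          # project into shared subspace
        recon = F.mse_loss(self.dec_a(za), a) + F.mse_loss(self.dec_v(zv), v)
        corr = F.mse_loss(za, zv)                      # semantic correspondence term
        fused = torch.cat([za, zv], dim=-1)            # segment-level fused feature
        return fused, recon + corr

# Usage: fuse one batch of segment-level audio/visual features.
model = SharedSubspaceAE()
fused, loss = model(torch.randn(4, 128), torch.randn(4, 256))
print(fused.shape, loss.item())                        # torch.Size([4, 128]) ...
```

How the reconstruction and correspondence terms are weighted against the recognition loss is a design choice; this sketch simply sums them.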

Auditory and Visual Feature Extraction Method
Auditory Feature Extraction Based on CNN-LSTM
Visual Feature Extraction Based on CNN-ConvLSTM
The Deep Network for Auditory-Visual Information Fusion
Shared Semantic Subspace Based on Autoencoder
Model Optimization Based on Semantic Correspondence
Network Structure
Algorithm Realization
Dataset
Validation of the Feature Combination Method
Test Results
Conclusions