Sound Can Help Us See More Clearly.

Yongsheng Li, Zhengping Jin, Tengfei Tu, Qiaoyan Wen, Jishuai Li, Hua Zhang

doi:10.3390/s22020599

Abstract

In video action classification, existing network frameworks often use only video frames as input. When the object involved in the action does not appear prominently in the frame, the network cannot classify the action accurately. We introduce a new neural network structure that uses sound to assist with such tasks. The raw sound wave is converted into a sound texture, which serves as the input to the network. Furthermore, to exploit the rich multi-modal information (images and sound) in the video, we designed a two-stream framework. In this work, we hypothesize that sound data can be used to solve motion recognition tasks. To demonstrate this, we designed a neural network based on sound texture to perform video action classification. We then fuse this network with a deep neural network that takes consecutive video frames as input to construct a two-stream network, which we call A-IN. Finally, on the Kinetics dataset, we compare the proposed A-IN with an image-only network. The experimental results show that the recognition accuracy of the two-stream model that uses sound features is 7.6% higher than that of the network using only video frames. This shows that making rational use of the rich information in video can improve classification performance.

Highlights

The sheer volume of video data nowadays demands robust video classification techniques that can effectively recognize human actions and complex events for applications such as video search, summarization, or intelligent surveillance
We propose a neural network structure for solving video action recognition, which uses the sound texture in the video as input
To make full use of the multi-modal information provided by the video, and inspired by the two-stream network, we propose a two-stream network structure that uses frames and sound, called A-IN

Summary

Introduction

The sheer volume of video data nowadays demands robust video classification techniques that can effectively recognize human actions and complex events for applications such as video search, summarization, or intelligent surveillance. At the same time, when the objects interacting in the action occupy too small a portion of the frame and are not displayed in a prominent position, it is difficult to distinguish the action category effectively using only the image information in the video. The sound in the video originates from the interaction between objects. Specific audio can be the main discriminator for certain actions (such as “washing”) and for the objects involved in the action. Because of these correlations, we believe that the sound information that occurs in synchronization with the visual signal in the video can provide rich training features, which can be used to train the video action classification model.
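
The two-stream idea described above can be illustrated with a minimal late-fusion sketch. The text here does not say how A-IN combines its frame and sound streams, so the weighted averaging of class probabilities below is an assumption chosen for illustration, not the authors' method. The function names, the `audio_weight` parameter, the class labels, and the example scores are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(frame_logits, audio_logits, audio_weight=0.5):
    """Weighted average of the two streams' class probabilities.

    audio_weight is a hypothetical hyperparameter (0 = frames only,
    1 = sound only); it does not come from the paper.
    """
    if len(frame_logits) != len(audio_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    p_frame = softmax(frame_logits)
    p_audio = softmax(audio_logits)
    return [
        (1.0 - audio_weight) * pf + audio_weight * pa
        for pf, pa in zip(p_frame, p_audio)
    ]


def predict(frame_logits, audio_logits, audio_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(frame_logits, audio_logits, audio_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for three illustrative classes.
CLASSES = ["washing", "reading", "typing"]
frame_scores = [1.0, 1.2, 1.1]   # frames are ambiguous: object is barely visible
audio_scores = [3.0, 0.5, 0.2]   # sound strongly suggests the first class

frames_only = CLASSES[predict(frame_scores, audio_scores, audio_weight=0.0)]
fused_label = CLASSES[predict(frame_scores, audio_scores, audio_weight=0.5)]
print(frames_only, fused_label)
```

With these made-up scores, the frame stream alone narrowly prefers the wrong class, while the fused prediction follows the confident sound stream. This mirrors the motivating case in the text, where the interacting object is too small or too poorly placed for the image to be decisive.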



Journal: Sensors
Publication Date: Jan 13, 2022
Citations: 1
License type: CC BY 4.0


