Abstract
In video action classification, existing network frameworks often use only video frames as input. When the object involved in the action does not appear prominently in the frame, the network cannot classify the action accurately. We introduce a new neural network structure that uses sound to assist with such tasks. The raw sound wave is converted into a sound texture, which serves as the input to the network. Furthermore, to exploit the rich multi-modal information (images and sound) in the video, we design a two-stream framework. In this work, we hypothesize that sound data can be used to solve action recognition tasks. To demonstrate this, we design a neural network based on sound texture to perform video action classification. We then fuse this network with a deep neural network that operates on consecutive video frames to construct a two-stream network, which we call A-IN. Finally, on the Kinetics dataset, we compare the proposed A-IN with the image-only network. The experimental results show that the recognition accuracy of the two-stream model that uses sound features is 7.6% higher than that of the network using video frames alone. This demonstrates that making rational use of the rich information in the video can improve classification performance.
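As a rough illustration of what converting a sound wave into a sound texture can involve, the sketch below computes time-averaged statistics of an amplitude envelope. It is a heavily simplified stand-in, not the feature pipeline used in this work: the abstract does not specify the filter bank or the statistics, so the single full-band envelope, the `window` parameter, and the function names `envelope` and `texture_statistics` are all assumptions made for illustration.

```python
import math


def envelope(signal, window):
    """Crude amplitude envelope: rectify, then average over non-overlapping windows.

    Assumption: a real sound-texture front end would first split the signal
    into several frequency bands; this sketch uses the full band only.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    rectified = [abs(x) for x in signal]
    env = []
    for start in range(0, len(rectified) - window + 1, window):
        chunk = rectified[start:start + window]
        env.append(sum(chunk) / window)
    return env


def texture_statistics(env):
    """Time-averaged summary statistics of an envelope (mean, variance, skewness)."""
    if not env:
        raise ValueError("envelope is empty")
    n = len(env)
    mean = sum(env) / n
    variance = sum((e - mean) ** 2 for e in env) / n
    std = math.sqrt(variance)
    # Skewness is undefined for a constant envelope; report 0.0 in that case.
    skewness = (
        sum((e - mean) ** 3 for e in env) / (n * std ** 3) if std > 0 else 0.0
    )
    return {"mean": mean, "variance": variance, "skewness": skewness}


# Hypothetical input: a 440 Hz tone with slow amplitude modulation, 8 kHz sampling.
sample_rate = 8000
wave = [
    (0.5 + 0.5 * math.sin(2 * math.pi * 2 * t / sample_rate))
    * math.sin(2 * math.pi * 440 * t / sample_rate)
    for t in range(sample_rate)
]
features = texture_statistics(envelope(wave, window=200))
```

The resulting dictionary is a fixed-length description of the clip regardless of its duration, which is the property that makes such statistics convenient as network input.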
Highlights
The sheer volume of video data nowadays demands robust video classification techniques that can effectively recognize human actions and complex events for applications such as video search, summarization, or intelligent surveillance
We propose a neural network structure for video action recognition that takes the sound texture of the video as input
To make full use of the multi-modal information in the video, and inspired by the two-stream network, we propose a two-stream network structure that uses frames and sound, called A-IN
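The idea of combining a frame stream with a sound stream can be sketched as a late fusion of the two streams' class scores. The text here does not say how A-IN fuses its streams, so the weighted averaging of softmax probabilities below is an assumption, as are the names `late_fusion` and `audio_weight`, the example logits, and the labels other than "washing".

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(frame_logits, audio_logits, audio_weight=0.5):
    """Weighted average of the per-stream class probabilities.

    Assumption: both streams score the same classes in the same order.
    """
    if len(frame_logits) != len(audio_logits):
        raise ValueError("both streams must score the same number of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    frame_probs = softmax(frame_logits)
    audio_probs = softmax(audio_logits)
    return [
        (1.0 - audio_weight) * f + audio_weight * a
        for f, a in zip(frame_probs, audio_probs)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical scores for one clip in which the acting object is barely visible.
labels = ["washing", "cooking", "reading"]
frame_logits = [1.0, 1.1, 0.2]   # frame stream is nearly undecided
audio_logits = [3.0, 0.1, 0.0]   # sound stream strongly favours "washing"

fused = late_fusion(frame_logits, audio_logits)
frame_only_label = labels[argmax(softmax(frame_logits))]
fused_label = labels[argmax(fused)]
```

In this made-up case the frame stream alone narrowly prefers "cooking", while the fused scores select "washing", which mirrors the motivation for adding sound when the visual evidence is weak.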
Summary
The sheer volume of video data nowadays demands robust video classification techniques that can effectively recognize human actions and complex events for applications such as video search, summarization, or intelligent surveillance. At the same time, when the objects interacting in the action occupy too small a portion of the frame and are not displayed prominently, it is difficult to distinguish the action category effectively using only the image information in the video. The sound in a video originates from the interaction between objects. Specific audio can be the main discriminator for certain actions (such as “washing”) and for the objects involved in the action. Because of these correlations, we believe that the sound that occurs in synchronization with the visual signal can provide rich training features, which can be used to train a video action classification model.