Abstract

Traditional action recognition algorithms typically attend only to a video's RGB features or optical-flow features and make poor use of the audio information in the video. Building on RGB and optical-flow features, this paper introduces the processing of audio information and classifies videos using element-level, fine-grained multi-modal fusion. Experimental comparison shows that the proposed multi-modal fusion algorithm improves accuracy by 7.38% on the HMDB51 dataset and by 3.18% on the UCF101 dataset over simple modal splicing (concatenation of modality features). The results also demonstrate that introducing the audio modality effectively improves model performance.
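To illustrate the contrast the abstract draws, the following is a minimal sketch of simple modal splicing versus element-level fusion of three modality feature vectors. All names, dimensions, and the weighted-sum fusion rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical per-modality feature vectors, assumed already extracted
# and projected to a common dimensionality D (an assumption for this
# sketch, not taken from the paper).
D = 512
rgb_feat = np.random.randn(D)    # RGB stream feature
flow_feat = np.random.randn(D)   # optical-flow stream feature
audio_feat = np.random.randn(D)  # audio stream feature

# Baseline: simple modal splicing, i.e. concatenation into a 3*D vector;
# each output dimension carries information from only one modality.
spliced = np.concatenate([rgb_feat, flow_feat, audio_feat])

# Element-level fusion: combine the modalities element by element
# (here, a hypothetical weighted sum), so every dimension of the fused
# vector mixes information from all three streams.
w = np.array([0.4, 0.3, 0.3])  # hypothetical per-modality weights
fused = w[0] * rgb_feat + w[1] * flow_feat + w[2] * audio_feat

print(spliced.shape, fused.shape)  # (1536,) vs. (512,)
```

In a trained model the fusion weights would be learned rather than fixed; the point of the sketch is only that element-level fusion interacts the modalities per dimension, whereas splicing keeps them side by side.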
