Abstract

Human action recognition has considerable research and application value in intelligent video surveillance, human-computer interaction, and related fields. To improve the accuracy of human action recognition for video understanding, we study the extraction of human motion features and attentional fusion methods. This paper makes two main contributions. First, based on an analysis of why optical flow is effective, we present a novel dynamic feature representation called Human-Object Contour (HOC), which combines object understanding with contextual information. Second, following the principle of stacking in ensemble learning, we propose the Attentional Multi-modal Fusion Network (AMFN). Rather than averaging modalities with fixed weights, AMFN uses attention to weight the modalities according to the characteristics of each video. Experiments show that HOC is complementary to the static appearance feature and that our fusion network improves action recognition accuracy. Our approach achieves state-of-the-art performance on the HMDB51 (72.2%) and UCF101 (96.0%) datasets.
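The contrast between fixed-weight averaging and attention-based modality selection can be illustrated with a minimal sketch. This is not the authors' AMFN: the function names, the linear attention scorer `w_att`, the per-video descriptor `video_feature`, and the choice of three modalities are all assumptions made for illustration, and the inputs are random placeholders rather than real data.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def fixed_average_fusion(class_scores):
    # Baseline: every modality gets the same weight for every video.
    # class_scores has shape (num_modalities, num_classes).
    return class_scores.mean(axis=0)

def attentional_fusion(class_scores, video_feature, w_att):
    # Hypothetical attention scorer: one logit per modality, computed
    # from a per-video descriptor by a linear map (w_att is assumed to
    # be learned; here it is just a placeholder matrix).
    logits = w_att @ video_feature        # shape (num_modalities,)
    weights = softmax(logits)             # per-video modality weights
    fused = weights @ class_scores        # shape (num_classes,)
    return fused, weights

# Placeholder inputs (random, for shape illustration only).
rng = np.random.default_rng(0)
num_modalities, num_classes, feature_dim = 3, 51, 8
class_scores = softmax(rng.normal(size=(num_modalities, num_classes)), axis=1)
video_feature = rng.normal(size=feature_dim)
w_att = rng.normal(size=(num_modalities, feature_dim))

fused, weights = attentional_fusion(class_scores, video_feature, w_att)
baseline = fixed_average_fusion(class_scores)
```

In this sketch the attention weights depend on `video_feature`, so two videos can weight the same modalities differently. When all attention logits are equal, the weights become uniform and the result reduces to the fixed-weight average.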
