Abstract

Recognizing the actions performed in a video is challenging for an intelligent system because videos exhibit wide variation and contain an enormous amount of information. Attention mechanisms focus on key target regions, ignore irrelevant information, and extract more discriminative features. In recent years, attention mechanisms have been introduced into video recognition. Although this has produced a rich literature, most research on attention aims to aggregate local features. Instead of aggregating features, we propose to aggregate decisions based on local spatio-temporal attention regions for action recognition, an approach inspired by ensemble learning. The proposed decision fusion module is easy to interpret and architecture-independent. In this article, the regions around the body joints are regarded as the key regions. We use the regions of the 3-D feature maps that correspond to the body joints as the basic local features for local classification. Finally, all the local classification results are combined to make a global decision. Furthermore, when training the network, we can selectively add supervision to the local and global decisions. We show experimentally that the proposed mechanism improves recognition performance on multiple datasets, which demonstrates its effectiveness.
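
The pipeline outlined in the abstract (joint-centred regions of a 3-D feature map, one local classification per region, then a fused global decision) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the function names, the fixed square window around each joint, the per-joint linear classifiers, the averaging of local probabilities as the fusion rule, and the weighted cross-entropy used to show selective supervision.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_region_features(fmap, joints, radius=1):
    """Pool one local feature per body joint (hypothetical helper).

    fmap:   (C, T, H, W) 3-D feature map from any backbone.
    joints: (T, J, 2) integer (row, col) joint positions, already scaled
            to feature-map coordinates and lying inside the map.
    Returns an array of shape (J, C).
    """
    C, T, H, W = fmap.shape
    J = joints.shape[1]
    feats = np.zeros((J, C))
    for j in range(J):
        for t in range(T):
            r, c = joints[t, j]
            r0, r1 = max(r - radius, 0), min(r + radius + 1, H)
            c0, c1 = max(c - radius, 0), min(c + radius + 1, W)
            # Average the window around the joint in frame t.
            feats[j] += fmap[:, t, r0:r1, c0:c1].mean(axis=(1, 2))
    return feats / T  # average over time

def fuse_decisions(feats, weights, biases):
    """Classify each joint region, then fuse the decisions.

    feats: (J, C); weights: (J, C, K); biases: (J, K).
    Fusion by averaging probabilities is an assumed rule.
    """
    logits = np.einsum("jc,jck->jk", feats, weights) + biases
    local_probs = softmax(logits, axis=-1)   # (J, K) local decisions
    global_probs = local_probs.mean(axis=0)  # (K,) global decision
    return local_probs, global_probs

def combined_loss(local_probs, global_probs, label,
                  local_weight=1.0, global_weight=1.0):
    # Cross-entropy on the global decision plus the mean cross-entropy
    # on the local decisions; a weight of 0 switches that term off.
    eps = 1e-12
    global_ce = -np.log(global_probs[label] + eps)
    local_ce = -np.log(local_probs[:, label] + eps).mean()
    return global_weight * global_ce + local_weight * local_ce

# Toy example with random data (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
C, T, H, W, J, K = 8, 4, 7, 7, 5, 3
fmap = rng.standard_normal((C, T, H, W))
joints = rng.integers(0, 7, size=(T, J, 2))
weights = rng.standard_normal((J, C, K))
biases = np.zeros((J, K))

feats = joint_region_features(fmap, joints)
local_probs, global_probs = fuse_decisions(feats, weights, biases)
predicted_action = int(global_probs.argmax())
loss = combined_loss(local_probs, global_probs, label=1, local_weight=0.5)
```

Because each row of `local_probs` is a complete per-joint prediction, one can inspect which body regions support the final decision, which is one way to read the interpretability claim. The fusion step consumes only a feature map and joint positions, so in this sketch it does not depend on the backbone that produced the map.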
