Abstract

In this paper, we propose VT-BPAN, a novel approach that combines a Vision Transformer (VT), bilinear pooling, and attention network fusion for effective human action recognition (HAR). The proposed method improves recognition accuracy through two advances: (1) an effective two-stream feature pooling and fusion mechanism that combines RGB frames and skeleton data to enrich the spatial–temporal feature representation, and (2) a spatially lightweight vision transformer that reduces computational cost. We evaluate the framework on three widely used video action datasets and show that the proposed approach performs on par with state-of-the-art methods.
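To make the two-stream fusion idea concrete, the following is a minimal, hypothetical PyTorch sketch of bilinear pooling over RGB and skeleton features, as the abstract describes. All dimensions, the low-rank (Hadamard-product) factorisation of the bilinear interaction, and the classifier head are illustrative assumptions, not the paper's actual VT-BPAN design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamBilinearFusion(nn.Module):
    """Sketch: fuse RGB and skeleton feature vectors via bilinear pooling.

    Dimensions and the low-rank factorisation are assumptions for
    illustration; they are not taken from the paper.
    """

    def __init__(self, rgb_dim=512, skel_dim=256, fused_dim=1024, num_classes=60):
        super().__init__()
        # Project both streams to a shared dimension before the interaction.
        self.proj_rgb = nn.Linear(rgb_dim, fused_dim)
        self.proj_skel = nn.Linear(skel_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, rgb_feat, skel_feat):
        # rgb_feat: (B, rgb_dim), skel_feat: (B, skel_dim)
        x = self.proj_rgb(rgb_feat)
        y = self.proj_skel(skel_feat)
        # Element-wise product: a common low-rank stand-in for the full
        # outer-product bilinear pooling, used here to keep the sketch compact.
        z = x * y
        # Signed square-root and L2 normalisation, standard post-processing
        # after bilinear pooling.
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
        z = F.normalize(z, dim=-1)
        return self.classifier(z)

# Usage with dummy pooled features from the two backbones.
model = TwoStreamBilinearFusion()
rgb = torch.randn(4, 512)    # e.g. pooled transformer features of RGB frames
skel = torch.randn(4, 256)   # e.g. pooled skeleton-stream features
logits = model(rgb, skel)    # shape: (4, num_classes)
```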
