Abstract

Action recognition is widely used in fields such as video classification and security monitoring. However, existing methods, such as those based on 3D convolutional neural networks, often struggle to capture global information, while transformer-based approaches suffer from high computational complexity. We introduce the Multi-Scale Video Longformer network (MSVL), built upon a 3D Longformer architecture with a “local attention + global features” attention mechanism, which reduces computational complexity while preserving global modeling capability. Specifically, MSVL gradually reduces the video feature resolution and increases the feature dimension across four stages. In the lower layers of the network (stages 1 and 2), local window attention alleviates local redundancy and computational demands, while global tokens retain global features. In the higher layers (stages 3 and 4), the local window attention becomes a dense computation, enhancing overall performance. Extensive experiments on UCF101 (97.6%), HMDB51 (72.9%), and the assembly action dataset (100.0%) demonstrate the effectiveness and efficiency of MSVL.
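
To make the “local attention + global features” idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a flattened 1D token sequence rather than the 3D video windows MSVL uses, and the function names (`longformer_mask`, `masked_attention`) and parameters (`num_local`, `num_global`, `window`) are hypothetical. Global tokens attend to every token and are attended to by every token; each remaining token attends only to neighbours inside a fixed window.

```python
import numpy as np


def longformer_mask(num_local: int, num_global: int, window: int) -> np.ndarray:
    """Boolean mask of shape [T, T], T = num_global + num_local.

    mask[i, j] is True when query i may attend to key j.
    Global tokens occupy the first num_global positions (an assumption
    made for this sketch).
    """
    total = num_global + num_local
    mask = np.zeros((total, total), dtype=bool)
    # Global tokens see everything, and every token sees the global tokens.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # Local tokens see neighbours within +/- window // 2 positions.
    idx = np.arange(num_local)
    band = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    mask[num_global:, num_global:] = band
    return mask


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention restricted to the pairs allowed by mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Every row allows at least the token itself, so the row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = longformer_mask(num_local=8, num_global=1, window=3)
    x = rng.standard_normal((9, 4))
    out = masked_attention(x, x, x, mask)
    print(out.shape, int(mask.sum()), "of", mask.size, "pairs attended")
```

With a window at least as wide as the sequence, the mask becomes all True and the computation reduces to ordinary dense attention, which loosely mirrors the transition the abstract describes for stages 3 and 4.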
