Abstract

Fine-grained motion information is crucial for improving video action recognition. Great strides have been made by directly subtracting adjacent frames or frame features to learn motion representations. However, existing methods lack the ability to focus on motion details in action-related regions and to model video events at multiple scales. To address these problems, this paper proposes a Fine-grained Motion Enhancement Network (FMENet), which comprises an Inter-frame Difference Motion Enhancement (IDME) module and a Multi-fiber Multi-level SpatioTemporal (MMST) module. Specifically, the IDME module combines inter-frame local difference attention with a foreground encoder to separate foreground motion-related features from static features, and applies inter-frame feature subtraction to learn fine-grained motion features. Complementary to the motion features provided by the IDME module, the MMST module introduces multi-level receptive fields in both the spatial and temporal dimensions to further encode the semantic and appearance information of events. Finally, the IDME and MMST modules are embedded in a standard 2D framework to construct FMENet while introducing only limited computational cost. Extensive experiments on the Mini-Kinetics-200, Something–Something V1, HMDB-51 and UCF-101 datasets demonstrate the superiority of our method over other methods.
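To make the core idea concrete, the sketch below illustrates inter-frame feature subtraction combined with a simple difference-derived attention map, the general mechanism the abstract describes. It is a minimal NumPy illustration under assumed tensor shapes, not the paper's actual IDME module (which additionally uses a learned foreground encoder); the function name and the sigmoid-based weighting are our own illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interframe_difference_enhance(feats):
    """Illustrative inter-frame difference enhancement (not the paper's IDME).

    feats: array of shape (T, C, H, W) -- per-frame feature maps.
    Each frame is re-weighted by a spatial attention map derived from
    its feature difference with the next frame (the last frame uses a
    zero difference).
    """
    T = feats.shape[0]
    # Adjacent-frame feature subtraction: d_t = f_{t+1} - f_t.
    diffs = np.zeros_like(feats)
    diffs[: T - 1] = feats[1:] - feats[:-1]
    # Collapse channels into one spatial attention map per frame and
    # squash to (0, 1); motion-heavy regions receive higher weight.
    attn = sigmoid(np.abs(diffs).mean(axis=1, keepdims=True))
    # Residual modulation keeps the static appearance pathway intact.
    return feats * (1.0 + attn)

# Toy usage: 4 frames, 8 channels, 6x6 spatial grid.
x = np.random.randn(4, 8, 6, 6)
y = interframe_difference_enhance(x)
print(y.shape)  # (4, 8, 6, 6)
```

Because the attention weight lies in (0, 1) and is applied residually, enhanced features stay between 1x and 2x the original magnitude, so appearance information is amplified in motion regions rather than replaced.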
