Abstract
Most current few-shot action recognition approaches follow the metric learning paradigm, measuring the distance between sub-sequences (frames, frame combinations, or clips) of different actions for classification. However, this disordered distance metric over action sub-sequences ignores the long-term temporal relations of actions, which may cause significant metric deviations. Moreover, the distance metric suffers from the distinctive temporal distributions of different actions, including intra-class temporal offsets and inter-class local similarity. In this paper, we propose a novel few-shot action recognition framework, the Frame-to-frame Temporal Alignment Network (FTAN), to address these challenges. Specifically, an attention-based temporal alignment (ATA) module computes the distance between corresponding frames of different actions along the temporal dimension, achieving frame-to-frame temporal alignment. Meanwhile, a Temporal Context Module (TCM) increases inter-class diversity by enriching frame-level feature representations, and a Frames Cyclic Shift Module (FCSM) performs frame-level temporal cyclic shifts to reduce intra-class inconsistency. In addition, we introduce temporal and global contrastive objectives to assist in learning discriminative and class-agnostic visual features. Experimental results show that the proposed architecture achieves state-of-the-art performance on the HMDB51, UCF101, Something-Something V2, and Kinetics-100 datasets.
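To make the frame-to-frame idea concrete, the sketch below illustrates a distance between temporally aligned videos and a cyclic shift over query frames to absorb intra-class temporal offsets. This is a minimal illustration only, not the authors' ATA or FCSM implementation; the tensor shapes, function names, and the choice of cosine distance are assumptions.

```python
# Minimal sketch (not the paper's implementation): frame-to-frame distance
# between two videos plus a temporal cyclic shift of the query, loosely
# mirroring the roles of frame-level alignment and FCSM described above.
import torch
import torch.nn.functional as F

def frame_to_frame_distance(support, query):
    """support, query: (T, D) per-frame features.
    Returns the mean cosine distance between corresponding frames
    (frame t of the support vs. frame t of the query)."""
    sim = F.cosine_similarity(support, query, dim=-1)  # shape (T,)
    return (1.0 - sim).mean()

def cyclic_shift_distance(support, query):
    """Hypothetical stand-in for a frame-level cyclic shift: try every
    cyclic shift of the query along the temporal axis and keep the
    smallest frame-to-frame distance."""
    T = query.shape[0]
    dists = [
        frame_to_frame_distance(support, torch.roll(query, shifts=s, dims=0))
        for s in range(T)
    ]
    return torch.stack(dists).min()

# Usage example: two videos of 8 frames with 512-dim features each.
support = torch.randn(8, 512)
query = torch.randn(8, 512)
print(cyclic_shift_distance(support, query))
```

In a few-shot episode, such a distance would be computed between the query video and each class prototype, with classification by nearest prototype; the attention-based alignment in ATA replaces the naive index-wise correspondence assumed here.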