Abstract
The collection and annotation of large-scale video data pose significant challenges, prompting the exploration of few-shot models that recognize unseen actions from limited training samples. However, most existing models rely on complex temporal alignment strategies and neglect the importance of constructing accurate class prototypes for action recognition. To address this limitation, we propose an unsupervised prototype self-calibration network based on hybrid attention contrastive learning (UPSHC), which refines class prototypes through complementary strategies. Our network comprises three novel components: a hybrid attention network (HAN), a dual adaptive contrastive learning mechanism (DACL), and an unsupervised prototype self-calibration mechanism (UPSC). The HAN integrates self-attention and cross-attention to highlight important spatiotemporal regions, reducing intra-class variation and enlarging inter-class separation. The DACL mechanism constructs prototype-centered and query-centered contrastive losses that thoroughly exploit the similarities between prototypes and samples, improving the model’s sensitivity to subtle inter-class differences; it further employs adaptive margins to dynamically adjust the distances between positive and negative samples, avoiding the unreliability of fixed margins. The UPSC module uses unlabeled query samples in an unsupervised manner to optimize the target prototypes without requiring additional training. Experimental results on three benchmark datasets demonstrate that UPSHC achieves state-of-the-art performance, underscoring the effectiveness of our prototype enhancement approach for few-shot action recognition.
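As a rough illustration of the hybrid attention idea, the following PyTorch module combines a self-attention pass over each video's spatiotemporal tokens with a cross-attention pass from query tokens to support tokens. The module name, head count, and residual/LayerNorm wiring are assumptions for illustration only; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Hypothetical HAN-style sketch: self-attention within each token
    sequence plus cross-attention from query to support tokens.
    Layer choices and residual wiring are assumptions, not the paper's."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, support_tokens):
        # query_tokens:   (B, Tq, D) spatiotemporal tokens of query videos
        # support_tokens: (B, Ts, D) tokens of the support videos
        # Self-attention highlights informative spatiotemporal regions
        # within each sequence independently.
        q, _ = self.self_attn(query_tokens, query_tokens, query_tokens)
        s, _ = self.self_attn(support_tokens, support_tokens, support_tokens)
        q = self.norm(query_tokens + q)
        s = self.norm(support_tokens + s)
        # Cross-attention aligns query tokens with support tokens, one way
        # to reduce intra-class variation before prototypes are built.
        c, _ = self.cross_attn(q, s, s)
        return self.norm(q + c), s
```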
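To make the dual adaptive contrastive learning idea concrete, the sketch below gives a minimal PyTorch rendering of one plausible formulation: a query-centered loss that pulls each query toward its class prototype and pushes it from the others by an adaptive margin, paired with a prototype-centered loss computed over the queries each prototype attracts. The function name, the temperature `tau`, and the specific margin rule (half the detached gap between the positive and the hardest negative similarity) are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def dual_adaptive_contrastive_loss(prototypes, queries, query_labels, tau=0.1):
    """Hypothetical DACL-style sketch, not the paper's exact formulation.

    prototypes:   (C, D) tensor, one embedding per class prototype.
    queries:      (Q, D) tensor of query-sample embeddings.
    query_labels: (Q,)   long tensor with each query's class index.
    """
    protos = F.normalize(prototypes, dim=-1)
    qs = F.normalize(queries, dim=-1)
    sim = qs @ protos.t()                           # (Q, C) cosine similarities

    pos = sim.gather(1, query_labels[:, None])      # (Q, 1) positive similarity
    neg_mask = torch.ones_like(sim, dtype=torch.bool)
    neg_mask.scatter_(1, query_labels[:, None], False)
    neg = sim[neg_mask].view(sim.size(0), -1)       # (Q, C-1) negatives

    # Adaptive margin (assumption): half the current gap between the
    # positive and the hardest negative, detached so it acts as a target.
    margin = 0.5 * (pos - neg.max(dim=1, keepdim=True).values).detach().clamp(min=0)

    # Query-centered loss: each query is pulled to its own prototype and
    # pushed away from the remaining prototypes by at least the margin.
    logits_q = torch.cat([pos - margin, neg], dim=1) / tau
    target_q = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    loss_q = F.cross_entropy(logits_q, target_q)

    # Prototype-centered loss: the symmetric view, each prototype
    # contrasted over all queries (softmax across queries, soft targets).
    soft = F.one_hot(query_labels, protos.size(0)).t().float()   # (C, Q)
    soft = soft / soft.sum(dim=1, keepdim=True).clamp(min=1.0)
    loss_p = -(soft * F.log_softmax(sim.t() / tau, dim=1)).sum(dim=1).mean()

    return loss_q + loss_p
```

Detaching the margin keeps it from feeding degenerate gradients back into the encoder, and clamping it at zero disables the push once the positive already leads every negative.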
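The unsupervised prototype self-calibration step can be sketched in the same spirit. Assuming a standard transductive, training-free recipe (soft-assign each unlabeled query to the prototypes by similarity, then blend each prototype with the query features it attracts), a minimal version might look as follows; the softmax assignment and the blend weight `alpha` are assumptions rather than the paper's published procedure.

```python
import torch
import torch.nn.functional as F

def calibrate_prototypes(prototypes, queries, alpha=0.5, tau=0.1):
    """Hypothetical UPSC-style, training-free prototype calibration.

    Soft-assigns each unlabeled query to the class prototypes by cosine
    similarity, aggregates the query features per class with those
    weights, and blends the result back into the original prototypes.
    alpha and the softmax assignment are illustrative assumptions.
    """
    protos = F.normalize(prototypes, dim=-1)     # (C, D)
    qs = F.normalize(queries, dim=-1)            # (Q, D)

    # (Q, C) soft assignment of unlabeled queries to prototypes.
    weights = F.softmax(qs @ protos.t() / tau, dim=1)

    # Per-class weighted mean of the query features: (C, D).
    agg = weights.t() @ qs
    agg = agg / weights.sum(dim=0)[:, None].clamp(min=1e-8)

    # Blend: keep the support-based prototype, nudge it toward the
    # query statistics it attracts. No gradient steps are needed.
    return F.normalize(alpha * protos + (1 - alpha) * agg, dim=-1)
```

Because the update is a closed-form blend of existing features, it can be applied per episode at test time with no parameter updates, consistent with the abstract's claim that calibration requires no additional training.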