Abstract
Few-shot action recognition (FSAR) has attracted significant attention in recent years. Although many methods rely on a single modality, there is a growing trend towards using multi-modal data. However, existing FSAR methods often employ simple fusion techniques, such as concatenation or comparison, which may not fully exploit multi-modal information. We explore multi-modal information from different views and then perform multi-view fusion. Based on the textual and visual features extracted by the CLIP backbone, we propose MDMF, a method that uses these modalities at two levels. At the first level, we extract visual information from two views: Local Temporal Context and Global Temporal Context. Within each view, we fuse visual features with textual information through concatenation and Cross-Transformer operations. We then use metric comparison to derive, for each view, a probability distribution for classification under the meta-learning paradigm. At the second level, we fuse the probability distributions from both views to make the final decision. During training, for each query, we also calculate the posterior distributions of the textual and visual modalities within each view using the text information distance and the visual information distance. Based on these distributions, we group query samples according to the view that is more reliable for them. We then enhance the representation of the less reliable view of those samples through mutual distillation. By exploring multi-modal data through a multi-view approach, our few-shot action recognition model shows the potential for higher accuracy and improved robustness. Our code is available at https://github.com/cofly2014/CLIP-MDMF.
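The two-level decision described above can be illustrated with a minimal sketch. It is not the released implementation: the function names, the negative squared Euclidean distance used as the metric, and the weighted average used to fuse the two views are all assumptions made for illustration, since the abstract does not specify them. The feature fusion (concatenation and Cross-Transformer) and the mutual distillation step are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def view_probabilities(query, prototypes):
    """First level (assumed metric): class probabilities for one view.

    query: (D,) feature of the query video in this view.
    prototypes: (N, D) one class prototype per row.
    Scores are negative squared Euclidean distances to each prototype.
    """
    dists = ((prototypes - query) ** 2).sum(axis=1)
    return softmax(-dists)

def fuse_views(p_local, p_global, weight=0.5):
    """Second level (assumed rule): weighted average of the two views."""
    fused = weight * p_local + (1.0 - weight) * p_global
    return fused / fused.sum()

# Hypothetical 5-way episode with 8-dimensional features per view.
protos_local = 3.0 * np.eye(5, 8)
protos_global = -2.0 * np.eye(5, 8)
q_local = protos_local[2] + 0.1    # query close to class 2 in the local view
q_global = protos_global[2] + 0.1  # query close to class 2 in the global view

p_local = view_probabilities(q_local, protos_local)
p_global = view_probabilities(q_global, protos_global)
fused = fuse_views(p_local, p_global)
prediction = int(fused.argmax())
```

Here both views agree on class 2, so the fused distribution does as well. When the views disagree, the `weight` parameter (also hypothetical) controls which view dominates the final decision.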