Multi-view distillation based on multi-modal fusion for few-shot action recognition (CLIP-MDMF)

Fei Guo, Yikang Wang, Han Qi, Wenping Jin, Li Zhu, Jing Sun

doi:10.1016/j.knosys.2024.112539

Abstract

In recent years, the field of few-shot action recognition (FSAR) has garnered significant attention. Although many methods primarily rely on mono-modal data, there is a growing trend towards utilizing multi-modal data. However, existing FSAR methods often employ simplistic fusion techniques, such as concatenation or comparison, which may not fully leverage the potential of multi-modal information. We aim to explore multi-modal information from different views and subsequently perform multi-view fusion. Based on the textual and visual modality information extracted by the CLIP backbone, we propose an MDMF method that comprehensively utilizes these modalities at two levels. At the first level, we extract visual information from two views: Local Temporal Context and Global Temporal Context. Within each view, we fuse visual features with textual information through concatenation and Cross-Transformer operations. We then employ metric comparison to derive probability distributions for classification under the meta-learning paradigm for each view. At the second level, we fuse the probability distributions from both views to make the final decision. Concurrently, during training, for each query, we calculate the posterior distributions of the textual and visual modalities within each view using text information distance and visual information distance. Based on these distributions, we group query samples with higher view reliability. Subsequently, we enhance the representation of the less reliable view of specific samples through mutual distillation. By delving deep into multi-modal data through a multi-view approach, our few-shot action recognition model demonstrates the potential for achieving higher accuracy and enhanced robustness. Our code is available at https://github.com/cofly2014/CLIP-MDMF.
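The two-level idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the equal-weight fusion rule, the toy distances, and the reliability criterion (probability assigned to the true class) are all assumptions made for illustration. The abstract only states that per-view probability distributions come from metric comparison, that the two views are fused for the final decision, and that the less reliable view is improved through mutual distillation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def view_probs(distances):
    """Turn query-to-class distances into class probabilities.

    A smaller distance yields a higher probability (metric comparison).
    """
    return softmax([-d for d in distances])


def fuse(p_local, p_global, weight=0.5):
    """Second-level fusion. ASSUMPTION: a convex combination of the two views."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_local, p_global)]


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student), a common distillation loss."""
    return sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(teacher, student))


def mutual_distillation(p_local, p_global, label):
    """Pick the teacher view for one training query and return the loss.

    ASSUMPTION: the view that puts more mass on the true class is treated
    as more reliable and teaches the other view.
    """
    if p_local[label] >= p_global[label]:
        return "local->global", kl_divergence(p_local, p_global)
    return "global->local", kl_divergence(p_global, p_local)


# Toy 3-way episode with hypothetical distances for one query (true class 0).
local_dist = [0.2, 1.5, 1.1]   # Local Temporal Context view
global_dist = [0.9, 1.0, 0.8]  # Global Temporal Context view

p_local = view_probs(local_dist)
p_global = view_probs(global_dist)
p_final = fuse(p_local, p_global)
direction, loss = mutual_distillation(p_local, p_global, label=0)

print("fused prediction:", p_final.index(max(p_final)))
print("distillation direction:", direction, "loss:", round(loss, 4))
```

In this toy episode the global view alone would pick the wrong class, while the fused distribution still selects class 0, and the local view acts as the teacher for the global view. The paper derives reliability from text and visual information distances, which this sketch does not model.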

More From: Knowledge-Based Systems
