In multimodal sentiment analysis (MSA), the strategy used to fuse multimodal features strongly influences model performance. Previous works often struggle to integrate heterogeneous data without fully exploiting the rich semantic content of text, resulting in weak cross-modal information association. We propose an MSA model based on Text-Driven Crossmodal Fusion and Mutual Information Estimation, called TeD-MI. TeD-MI comprises a Stacked Text-Driven Crossmodal Fusion (STDC) module, which efficiently fuses the three modalities under the guidance of the text modality to optimize the fused feature representation and enhance semantic understanding. In addition, TeD-MI includes a mutual information estimation module that balances preserving task-related information against filtering out irrelevant noise. Comprehensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed model achieves improvements of varying degrees on most evaluation metrics.
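The sketch below illustrates one plausible form of text-driven crossmodal fusion, where text features act as queries attending over audio and visual features so the fused representation stays anchored to textual semantics; it is a minimal illustration, not the authors' implementation, and all module names, dimensions, and the stacking depth are assumptions.

```python
# Minimal sketch of stacked text-driven crossmodal fusion (illustrative only).
import torch
import torch.nn as nn


class TextDrivenFusionLayer(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Text queries attend over audio and over visual sequences separately.
        self.attn_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, text, audio, visual):
        t2a, _ = self.attn_audio(text, audio, audio)    # text -> audio attention
        t2v, _ = self.attn_visual(text, visual, visual)  # text -> visual attention
        fused = self.norm(text + t2a + t2v)              # residual fusion anchored on text
        return self.norm(fused + self.ffn(fused))


class StackedTextDrivenFusion(nn.Module):
    """Stack several fusion layers; each layer refines the text-anchored representation."""

    def __init__(self, dim: int = 128, depth: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(TextDrivenFusionLayer(dim) for _ in range(depth))

    def forward(self, text, audio, visual):
        for layer in self.layers:
            text = layer(text, audio, visual)
        return text


# Usage: batch of 8 utterances, 20 time steps, feature dim 128 per modality (assumed shapes).
fusion = StackedTextDrivenFusion(dim=128, depth=3)
t = torch.randn(8, 20, 128)
a = torch.randn(8, 20, 128)
v = torch.randn(8, 20, 128)
out = fusion(t, a, v)   # (8, 20, 128) fused, text-driven representation
print(out.shape)
```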