Cross-modal Alignment Research Articles

Human 3D skeleton-based action recognition has received increasing interest in recent years. Inspired by the excellent ability of the multi-modal model, some pioneer attempts to employ diverse modalities, i.e., skeleton and language, to construct the skeleton-language model and have shown compelling results. Yet, these attempts model the data representation as deterministic point estimation, ignoring a key issue that descriptions of similar motions are uncertain and ambiguous, which brings about restricted comprehension of complex concept hierarchies and impoverished cross-modal alignment reliability. To tackle this challenge, this paper proposes a new Uncertain Skeleton-Language Learning Framework (USLLF) to capture the semantic ambiguity among diverse modalities in a probabilistic manner for the first time. USLLF consists of both inter- and intra-modal uncertainties. Specifically, first, we integrate the language (text) generated by ChatGPT with the generic skeleton-based network and develop a deterministic multi-modal baseline, which can be easily achieved via any off-the-shelf skeleton and text encoders. Then, based on this baseline, we explicitly model the intra-modal (skeleton/language) uncertainties as the Gaussian distributions using the new uncertainty networks capable of learning the distributional embeddings of modalities. Following this, these embeddings are aligned and formulated as inter-modal (skeleton-language) uncertainty using both the contrastive and negative log-likelihood objectives to alleviate the cross-modal alignment error. Experimental results on NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets show that our approach outperforms the proposed baseline and achieves comparable performance with a high inference efficiency compared to the state-of-the-art methods. Besides, we also deliver insightful analyses on how learned uncertainty reduces the impact of uncertain and ambiguous data on model performance.

Read full abstract

Nowadays, with the rapid development of consumer Unmanned Aerial Vehicles (UAVs), utilizing UAV platforms for visual surveillance has become very attractive, and a key part of this is remote vision-based pedestrian attribute recognition. Pedestrian Attribute Recognition (PAR) is dedicated to predicting multiple attribute labels of a single pedestrian image extracted from surveillance videos and aerial imagery, which presents significant challenges in the computer vision community due to factors such as poor imaging quality and substantial pose variations. Despite recent studies demonstrating impressive advancements in utilizing complicated architectures and exploring relations, most of them may fail to fully and systematically consider the inter-region, inter-attribute, and region-attribute mapping relations simultaneously and be stuck in the dilemma of information redundancy, leading to the degradation of recognition accuracy. To address the issues, we construct a novel Mask-Relation-Guided Transformer (MRG-T) framework that consists of three relation modeling modules to fully exploit spatial and semantic relations in the model learning process. Specifically, we first propose a Masked Region Relation Module (MRRM) to focus on precise spatial attention regions to extract more robust features with masked random patch training. To explore the semantic association of attributes, we further present a Masked Attribute Relation Module (MARM) to extract intrinsic and semantic inter-attribute relations with an attribute label masking strategy. Based on the cross-attention mechanism, we finally design a Region and Attribute Mapping Module (RAMM) to learn the cross-modal alignment between spatial regions and semantic attributes. We conduct comprehensive experiments on three public benchmarks such as PETA, PA-100K, and RAPv1, and conduct inference on a large-scale airborne person dataset named PRAI-1581. The extensive experimental results demonstrate the superior performance of our method compared to state-of-the-art approaches and validate the effectiveness of mask-relation-guided modeling in the remote vision-based PAR task.

Read full abstract

Cross-modal Alignment Research Articles

Related Topics

Articles published on Cross-modal Alignment

Text-and-Image Learning Transformer for Cross-modal Person Re-identification

Protein complex structure modeling by cross-modal alignment between cryo-EM maps and protein sequences

Cascaded Cross-modal Alignment for Visible-Infrared Person Re-Identification

Language-guided temporal primitive modeling for skeleton-based action recognition

Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.

Modeling the skeleton-language uncertainty for 3D action recognition

SARD: Fake news detection based on CLIP contrastive learning and multimodal semantic alignment

Improving Image-Text Matching With Bidirectional Consistency of Cross-Modal Alignment

Text-to-Image Vehicle Re-Identification: Multi-Scale Multi-View Cross-Modal Alignment Network and a Unified Benchmark

SMART: Syntax-Calibrated Multi-Aspect Relation Transformer for Change Captioning.

Cross-modal semantic aligning and neighbor-aware completing for robust text–image person retrieval

Cross-modal learning using privileged information for long-tailed image classification

Unified multimodal fusion transformer for few shot object detection for remote sensing images

Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval

Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding

CiteNet: Cross-modal incongruity perception network for multimodal sentiment prediction

Text-based person search via cross-modal alignment learning

CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation.

MRG-T: Mask-Relation-Guided Transformer for Remote Vision-Based Pedestrian Attribute Recognition in Aerial Imagery

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Cross-modal Alignment Research Articles

Related Topics

Articles published on Cross-modal Alignment

Text-and-Image Learning Transformer for Cross-modal Person Re-identification

Protein complex structure modeling by cross-modal alignment between cryo-EM maps and protein sequences

Cascaded Cross-modal Alignment for Visible-Infrared Person Re-Identification

Language-guided temporal primitive modeling for skeleton-based action recognition

Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.

Modeling the skeleton-language uncertainty for 3D action recognition

SARD: Fake news detection based on CLIP contrastive learning and multimodal semantic alignment

Improving Image-Text Matching With Bidirectional Consistency of Cross-Modal Alignment

Text-to-Image Vehicle Re-Identification: Multi-Scale Multi-View Cross-Modal Alignment Network and a Unified Benchmark

SMART: Syntax-Calibrated Multi-Aspect Relation Transformer for Change Captioning.

Cross-modal semantic aligning and neighbor-aware completing for robust text–image person retrieval

Cross-modal learning using privileged information for long-tailed image classification

Unified multimodal fusion transformer for few shot object detection for remote sensing images

Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval

Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding

CiteNet: Cross-modal incongruity perception network for multimodal sentiment prediction

Text-based person search via cross-modal alignment learning

CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation.

MRG-T: Mask-Relation-Guided Transformer for Remote Vision-Based Pedestrian Attribute Recognition in Aerial Imagery

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning