Cross-modal Representation Learning Research Articles

Integrating multispectral data has been demonstrated to be an effective solution for illumination-invariant pedestrian detection; in particular, RGB and thermal images provide complementary information for handling illumination changes. However, most current multispectral detectors fuse the multimodal features by simple concatenation, without discovering their latent relationships. In this paper, we propose a cross-modal feature learning (CFL) module, based on a split-and-aggregation strategy, to explicitly explore both the shared and the modality-specific representations of paired RGB and thermal images. We insert the proposed CFL module into multiple layers of a two-branch pedestrian detection network to learn cross-modal representations at diverse semantic levels. By introducing a segmentation-based auxiliary task, the multimodal network is trained end-to-end by jointly optimizing a multi-task loss. In addition, to alleviate the reliance of existing multispectral pedestrian detectors on thermal images, we propose a knowledge distillation framework to train a student detector, which receives only RGB images as input and distills the cross-modal representations under the guidance of a well-trained multimodal teacher detector. To facilitate the cross-modal knowledge distillation, we design different distillation loss functions for the feature, detection and segmentation levels. Experimental results on the public KAIST multispectral pedestrian benchmark validate that the proposed cross-modal representation learning and distillation method achieves robust performance.
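The abstract does not spell out its distillation losses, so the following is only a minimal sketch of the general idea, under stated assumptions: a feature-level term (mean squared error between student and teacher features) plus a detection-level term (KL divergence between temperature-softened class distributions). The function names, the weights `w_feat` and `w_det`, and the temperature are all hypothetical, and the segmentation-level term is omitted.

```python
import math

def mse(student, teacher):
    # Feature-level term (assumed form): mean squared error between feature vectors.
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def softmax(logits, temperature=1.0):
    # Temperature-softened softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_feat, teacher_feat,
                      student_logits, teacher_logits,
                      w_feat=1.0, w_det=1.0, temperature=2.0):
    # Hypothetical combination: weighted sum of the feature- and detection-level terms.
    feat_term = mse(student_feat, teacher_feat)
    det_term = kl_divergence(softmax(teacher_logits, temperature),
                             softmax(student_logits, temperature))
    return w_feat * feat_term + w_det * det_term

# Usage: the RGB-only student is pulled toward the multimodal teacher's outputs.
loss = distillation_loss([0.2, 0.9, 0.1], [0.3, 0.8, 0.0],
                         [1.0, -1.0], [2.0, -2.0])
```

In a real detector this term would be added to the student's ordinary detection loss; the sketch only shows how teacher outputs can serve as training targets when thermal input is unavailable.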

Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search (AVS), is a core theme in multimedia data management and retrieval. The success of AVS depends on cross-modal representation learning that encodes both query sentences and videos into common spaces for semantic similarity computation. Inspired by the initial success of a few previous works in combining multiple sentence encoders, this paper takes a step forward by developing a new and general method for effectively exploiting diverse sentence encoders. The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, different from prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces. Such a property prevents the matching from being dominated by a specific encoder that produces an encoding vector much longer than those of other encoders. Second, in order to explore complementarities among the individual common spaces, we propose multi-space multi-loss learning. As extensive experiments on four benchmarks (MSR-VTT, TRECVID AVS 2016-2019, TGIF and MSVD) show, SEA surpasses the state of the art. In addition, SEA is extremely easy to implement. All this makes SEA an appealing solution for AVS and promising for continuously advancing the task by harvesting new sentence encoders.
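The multi-space matching idea can be illustrated with a small sketch. It assumes, for illustration only, that each sentence encoder has its own pair of linear projections into a common space and that the per-space cosine similarities are summed; the helper names and the toy projection matrices are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def project(matrix, vec):
    # Linear projection: one output value per matrix row.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def multi_space_similarity(query_encodings, video_feature, text_projs, video_projs):
    # One common space per sentence encoder; each space contributes one
    # cosine score in [-1, 1], regardless of that encoder's output length.
    total = 0.0
    for q, w_text, w_video in zip(query_encodings, text_projs, video_projs):
        total += cosine(project(w_text, q), project(w_video, video_feature))
    return total

# Usage: two encoders with 2-dim and 4-dim outputs, one 3-dim video feature.
query_encodings = [[1.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
text_projs = [[[1, 0], [0, 1]],
              [[1, 0, 0, 0], [0, 1, 0, 0]]]
video_projs = [[[1, 0, 0], [0, 1, 0]],
               [[1, 0, 0], [1, 0, 0]]]
score = multi_space_similarity(query_encodings, [1.0, 0.0, 0.0],
                               text_projs, video_projs)
```

Because every space contributes a bounded score, the longer 4-dim encoding cannot outweigh the 2-dim one, which is the property the abstract attributes to encoder-specific common spaces.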

Articles published on Cross-modal Representation Learning

Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning

Cross-Modal Representation Learning for Lightweight and Accurate Facial Action Unit Detection

Deep Cross-Modal Representation Learning and Distillation for Illumination-Invariant Pedestrian Detection

SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries

Adversarial Learning-Based Semantic Correlation Representation for Cross-Modal Retrieval

Semi-supervised cross-modal representation learning with GAN-based Asymmetric Transfer Network

Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences
