Abstract

Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building one common space by aligning different modalities and ignore the complementary information across modalities, such as the intra-modal local structures. In other words, they focus only on object-level alignment and ignore structure-level alignment. To tackle this problem, we propose MTLS, a novel symmetric multimodal representation learning framework that transfers local structures across different modalities. A customized soft metric learning strategy and an iterative parameter learning process are designed to symmetrically transfer local structures and enhance the cluster structures in intra-modal representations. A bidirectional retrieval loss based on multi-layer neural networks is utilized to align the two modalities. MTLS is instantiated with image and text data and shows superior performance on image-text retrieval and image clustering, outperforming state-of-the-art multimodal learning methods by up to 32% in terms of R@1 on text-image retrieval and 16.4% in terms of AMI on clustering.
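As a rough illustration of the modality-aligning component, the sketch below shows a bidirectional hinge-based retrieval loss over projected image and text embeddings. The margin value, cosine similarity, and hardest-negative mining are illustrative assumptions, not necessarily MTLS's exact formulation.

```python
# Sketch of a bidirectional retrieval (triplet) loss for aligning two modalities
# in a common space. Margin value and hardest-negative mining are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalRetrievalLoss(nn.Module):
    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (batch, dim); row i of each forms a matched pair.
        img_emb = F.normalize(img_emb, dim=1)
        txt_emb = F.normalize(txt_emb, dim=1)
        scores = img_emb @ txt_emb.t()           # cosine similarity matrix
        pos = scores.diag().view(-1, 1)          # similarities of matched pairs

        # Hinge costs for the two retrieval directions.
        cost_i2t = (self.margin + scores - pos).clamp(min=0)      # image -> text
        cost_t2i = (self.margin + scores - pos.t()).clamp(min=0)  # text -> image

        # Zero out the matched pairs on the diagonal.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_i2t = cost_i2t.masked_fill(mask, 0.0)
        cost_t2i = cost_t2i.masked_fill(mask, 0.0)

        # Penalize only the hardest negative in each direction.
        return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()
```

Summing the image-to-text and text-to-image hinge terms is what makes such a loss bidirectional: each modality is used in turn as the query and as the retrieved item.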

Highlights

  • Multimodal data, such as image-text and speech-video, commonly exists in the real world and is critical for applications such as image captioning [1,2], visual question answering [3,4], and audio-visual speech recognition [5].

  • Because no intra-modal representations are learned in the baseline models, i.e., Mean Vector (MV) and the canonical correlation analysis (CCA) variants, and the attention-based models, i.e., Stacked Cross Attention Networks (SCAN) and Bidirectional Focal Attention Network (BFAN), require multiple regions of each image, we only report the clustering results of the original image embeddings (i.e., ResNet152) and the image representations learned by Visual Semantic Embedding (VSE), VSE++, Order, Two-Branch Neural Networks (TBNN), Multimodal Tensor Fusion Network (MTFN), and our MTLS.

  • We propose a novel multimodal representation learning framework, MTLS, which symmetrically transfers local structures across modalities via a customized soft metric learning strategy and an iterative parameter learning process.


Summary

Introduction

Multimodal data, such as image-text and speech-video, commonly exists in the real world and is critical for applications such as image captioning [1,2], visual question answering [3,4], and audio-visual speech recognition [5]. Most existing multimodal representation learning approaches aim to project the multimodal data into a common space by aligning different modalities with similarity constraints. These methods focus only on object-level alignment, i.e., they try to align two corresponding objects in different modalities. Structure-level alignment, in contrast, can enhance the local structure in one modality by learning from the other modality, which is beneficial for classification and clustering. Neural networks, such as autoencoders, are common tools to learn joint multimodal representations that fuse unimodal representations and are trained to perform a particular task [5,10]. MTLS is instantiated with image-text data, and the learned multimodal representations are evaluated on cross-modal retrieval and image clustering tasks. The superior image clustering performance and the visualization results demonstrate that the local structures are successfully transferred across modalities and complement the original image representations.
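To make the idea of structure-level alignment concrete, the following sketch transfers the intra-modal neighbourhood structure of one modality to the other by matching row-wise similarity distributions. The temperature, the KL-divergence form, and the choice of text as the source modality are assumptions for illustration rather than MTLS's customized soft metric learning strategy.

```python
# Sketch of structure-level alignment: the intra-modal neighbourhood structure
# of the text modality is used as a soft target for the image modality.
# Temperature and KL-divergence form are illustrative assumptions.
import torch
import torch.nn.functional as F


def neighbour_logits(emb: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Pairwise cosine similarities scaled by temperature, with self-pairs masked out."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / temperature
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return sim.masked_fill(mask, float("-inf"))  # an item is not its own neighbour


def structure_transfer_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Pull the image neighbourhood distribution towards the text one."""
    with torch.no_grad():
        target = F.softmax(neighbour_logits(txt_emb), dim=1)    # source-modality structure
    log_pred = F.log_softmax(neighbour_logits(img_emb), dim=1)  # target-modality structure
    return F.kl_div(log_pred, target, reduction="batchmean")
```

Applying the same loss with the roles of the two modalities swapped would make the transfer symmetric, in the spirit of the framework's name.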

Related Work
Multimodal Representations with Local Structure Transferring
Representation Encoding
Local Structure Transferring
Modality Aligning
Learning Algorithm
Experiments
Implementation Details
Experimental Setup
Cross-Modal Retrieval
Image Clustering
Visualization
Findings
Conclusions and Future Work