Abstract
With the explosive growth of multimedia data and its multi-source, heterogeneous character, cross-modal learning has attracted increasing attention from both academia and industry. Cross-modal representation and cross-modal generation are the two core problems of this field. Cross-modal representation studies feature learning and information fusion across modalities: by exploiting the complementarity among modalities and removing inter-modal redundancy, it obtains more effective feature representations. Cross-modal generation studies knowledge transfer across modalities: by exploiting inter-modal semantic consistency, it converts data between different modalities and thereby improves cross-modal transferability. This paper systematically reviews recent domestic and international progress in four areas: 1) traditional cross-modal representation learning, 2) representation learning with multimodal large models, 3) image-to-text cross-modal conversion, and 4) cross-modal image generation.

Traditional cross-modal representation falls into two categories: joint representation and coordinated representation. Joint representation maps the information from several single modalities into a shared representation space, whereas coordinated representation processes each modality separately and aligns the resulting representations under similarity constraints, so that the cross-modal representations can be learned mutually. For large-model representation learning, the self-supervised learning ability of deep neural networks (DNNs), especially Transformer-based architectures, is leveraged to exploit large-scale unlabeled data: beyond the supervised learning paradigm, a large model is pre-trained on unlabeled data and then fine-tuned with a small amount of labeled data from downstream tasks. Compared with a model trained for a specific task, a pre-trained model offers better versatility and transferability, and fine-tuning further optimizes it for the downstream task.

The development of image-to-text conversion methods (image captioning and video captioning) is summarized, covering end-to-end, semantic-based, and style-based approaches, and the current state of cross-modal conversion between image and text is analyzed for image captioning, video captioning and subtitle semantic analysis, and visual question answering. Cross-modal generation methods are then summarized with respect to the joint representation of cross-modal information, image generation techniques, text-to-image generation, and generation based on pre-trained models. In recent years, generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs) have greatly advanced cross-modal generation tasks; thanks to the strong adaptability and generation ability of DDPMs, the constraint of fragile texture synthesis has been alleviated to some extent, and the growth of GAN-based and DDPM-based methods is summarized and analyzed further. For each subfield, the paper discusses the main challenges, compares progress at home and abroad, traces the line of development and the research frontier, and finally identifies future trends and potential breakthroughs for cross-modal representation and generation.
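To make the coordinated-representation idea above concrete, the following is a minimal sketch of aligning two modality-specific feature streams in a shared space under a CLIP-style contrastive (similarity) constraint. The `CoordinatedAlignment` class, the projection heads, and all dimensions are illustrative assumptions, not a specific method from the surveyed literature.

```python
# Coordinated representation sketch: each modality keeps its own encoder/projection,
# and only the similarity structure of the shared space is constrained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordinatedAlignment(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # image-side projection head
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # text-side projection head
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_feats, txt_feats):
        # L2-normalize so that dot products are cosine similarities.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        # Matched image-text pairs lie on the diagonal of the similarity matrix.
        targets = torch.arange(img.size(0), device=img.device)
        loss_i2t = F.cross_entropy(logits, targets)      # image -> text retrieval
        loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image retrieval
        return (loss_i2t + loss_t2i) / 2

# Usage with random stand-in features for a batch of 8 image-text pairs:
model = CoordinatedAlignment()
loss = model(torch.randn(8, 2048), torch.randn(8, 768))
loss.backward()
```

The symmetric two-direction loss is what enforces the "mutual learning under similarity constraints" property: neither modality is collapsed into the other's space, yet matched pairs are pulled together and mismatched pairs pushed apart.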
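The pre-train/fine-tune paradigm mentioned above can likewise be sketched in a few lines: a large pre-trained backbone is frozen and a small task head is trained on a modest amount of labeled downstream data. The Transformer encoder here is a toy stand-in for a large self-supervised model; all names and shapes are illustrative assumptions.

```python
# Pre-train/fine-tune sketch: freeze the general-purpose backbone, train a small head.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)  # stands in for a large encoder pre-trained on unlabeled data

for p in backbone.parameters():
    p.requires_grad = False      # keep the pre-trained representation intact

head = nn.Linear(256, 10)        # task-specific classifier, trained from scratch
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

# One fine-tuning step on a toy labeled batch (8 sequences of 16 tokens).
x, y = torch.randn(8, 16, 256), torch.randint(0, 10, (8,))
feats = backbone(x).mean(dim=1)  # pool token features into one vector per sample
loss = nn.functional.cross_entropy(head(feats), y)
loss.backward()
opt.step()
```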
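Finally, the DDPM objective that underlies the diffusion-based generators discussed above reduces to predicting the noise added at a random timestep. The schedule below follows the standard linear schedule of Ho et al. (2020); the `ToyDenoiser` stand-in and all hyperparameters are illustrative assumptions, whereas real text-to-image systems condition a U-Net or Transformer denoiser on text embeddings.

```python
# DDPM training-loss sketch: diffuse a clean sample to timestep t, predict the noise.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

def ddpm_loss(denoiser, x0, cond):
    """One training step: predict the noise injected at a random timestep t."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    # Forward (diffusion) process: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * noise
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
    # The denoiser sees the noisy sample, the timestep, and the condition (e.g. text).
    pred = denoiser(x_t, t, cond)
    return F.mse_loss(pred, noise)

class ToyDenoiser(torch.nn.Module):
    def __init__(self, dim=64, cond_dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + cond_dim + 1, 256),
            torch.nn.SiLU(),
            torch.nn.Linear(256, dim),
        )
    def forward(self, x_t, t, cond):
        t_feat = (t.float() / T).unsqueeze(-1)   # crude scalar timestep embedding
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

# Usage on a toy batch of 8 feature vectors with 32-dim conditioning:
loss = ddpm_loss(ToyDenoiser(), torch.randn(8, 64), torch.randn(8, 32))
loss.backward()
```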