Cross-modal Data Research Articles

Cross-modal remote sensing image-text retrieval (CMRSITR) is a challenging topic in the remote sensing (RS) community. It has gained growing attention because it can be flexibly used in many practical applications. In the current deep era, with the help of deep convolutional neural networks (DCNNs), many successful CMRSITR methods have been proposed. Most of them first learn valuable features from RS images and texts respectively. Then, the obtained visual and textual features are mapped into a common space for the final retrieval. The above operations are feasible, however, two difficulties are still to be solved. One is that the semantics within the visual and textual features are misaligned due to the independent learning manner. The other one is that the deep links between RS images and texts cannot be fully explored by simple common space mapping. To overcome the above challenges, we propose a new model named interacting-enhancing feature transformer (IEFT) for CMRSITR, which regards the RS images and texts as a whole. First, a simple feature embedding module (FEM) is developed to map images and texts into the visual and textual feature spaces. Second, an information interacting-enhancing module (IIEM) is designed to simultaneously model the inner relationships between RS images and texts and enhance the visual features. IIEM consists of three feature interacting-enhancing (FIE) blocks, each of which contains an inter-modality relationship interacting (IMRI) sub-block and a visual feature enhancing (VFE) sub-block. The duty of IMRI is to exploit the hidden relations between cross-modal data, while the responsibility of VFE is to improve the visual features. By combining them, semantic bias can be mitigated, and the complex contents of RS images can be studied. Finally, the retrieval module (RM) is constructed to generate the matching scores for deciding the search results. Extensive experiments are conducted on four public RS data sets. The positive results demonstrate that our IEFT can achieve superior retrieval performance compared with many existing methods. Our source codes are available at https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/IEFT.

Data augmentation has become one of the keys to alleviating the over-fitting of models on training data and improving the generalization capabilities on testing data. Most existing data augmentation methods only focus on one modality, which is incapable when facing multiple data modalities. Some prior works try to interpolate with random coefficients in the latent space to generate new samples, which can generically work for any data modality. However, these works ignore the extra information conveyed by multimodality data. In fact, the extra information in one modality can provide semantic directions to generate more meaningful samples in another modality. This paper proposes Cross-modal Data Augmentation (CMDA), a simple yet effective data augmentation method to alleviate the over-fitting issue and improve the generalization performance. We evaluate CMDA on unsupervised and supervised tasks of different modalities, on which CMDA consistently and significantly outperforms baselines. For instance, CMDA improves the unsupervised anomaly detection baseline in vision modality from the AUROC <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$76.46\%, 73.07\%$</tex-math></inline-formula> and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$64.36\%$</tex-math></inline-formula> to <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$83.25\%, 76.22\%$</tex-math></inline-formula> and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$70.57\%$</tex-math></inline-formula> on three different datasets, respectively. Besides, extensive experiments demonstrate that CMDA is applicable to various neural network architectures. Furthermore, prior methods that interpolate in the latent space need to work with downstream tasks to construct the latent space. In contrast, CMDA can work with or without downstream tasks, which makes the applicability of CMDA more extensive. Our source code is publicly available for non-commercial or research use at <uri xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">https://github.com/Anfeather/CMDA</uri> .

Cross-modal Data Research Articles

Related Topics

Articles published on Cross-modal Data

Bidirectional Attentional Interaction Networks for RGB-D salient object detection

Cross on Cross Attention: Deep Fusion Transformer for Image Captioning

Transformer-based cross-modal multi-contrast network for ophthalmic diseases diagnosis

Development and validation of an fMRI-informed EEG model of reward-related ventral striatum activation

Cross-modal Person Re-identification Based on Hybrid Learning Networks

Semantic-embedding Guided Graph Network for cross-modal retrieval

Contrastive learning-based general Deepfake detection with multi-scale RGB frequency clues

Semantic Completion and Filtration for Image–Text Retrieval

Cardiac LGE MRI Segmentation With Cross-Modality Image Augmentation and Improved U-Net.

BCINet: Bilateral cross-modal interaction network for indoor scene understanding in RGB-D images

Unsupervised Cross-Media Graph Convolutional Network for 2D Image-Based 3D Model Retrieval

Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval

Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations

Cross-Modal Data Augmentation for Tasks of Different Modalities

RGB-D Crowd Counting With Cross-Modal Cycle-Attention Fusion and Fine-Coarse Supervision

Infrared time-sensitive target detection technology based on cross-modal data augmentation

Deep Cross-Modal Hashing Based on Semantic Consistent Ranking

Deep continual hashing with gradient-aware memory for cross-modal retrieval

A dual tri-path CNN system for brain tumor segmentation

Hybrid-attention based Feature-reconstructive Adversarial Hashing Networks for Cross-modal Retrieval

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Cross-modal Data Research Articles

Related Topics

Articles published on Cross-modal Data

Bidirectional Attentional Interaction Networks for RGB-D salient object detection

Cross on Cross Attention: Deep Fusion Transformer for Image Captioning

Transformer-based cross-modal multi-contrast network for ophthalmic diseases diagnosis

Development and validation of an fMRI-informed EEG model of reward-related ventral striatum activation

Cross-modal Person Re-identification Based on Hybrid Learning Networks

Semantic-embedding Guided Graph Network for cross-modal retrieval

Contrastive learning-based general Deepfake detection with multi-scale RGB frequency clues

Semantic Completion and Filtration for Image–Text Retrieval

Cardiac LGE MRI Segmentation With Cross-Modality Image Augmentation and Improved U-Net.

BCINet: Bilateral cross-modal interaction network for indoor scene understanding in RGB-D images

Unsupervised Cross-Media Graph Convolutional Network for 2D Image-Based 3D Model Retrieval

Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval

Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations

Cross-Modal Data Augmentation for Tasks of Different Modalities

RGB-D Crowd Counting With Cross-Modal Cycle-Attention Fusion and Fine-Coarse Supervision

Infrared time-sensitive target detection technology based on cross-modal data augmentation

Deep Cross-Modal Hashing Based on Semantic Consistent Ranking

Deep continual hashing with gradient-aware memory for cross-modal retrieval

A dual tri-path CNN system for brain tumor segmentation

Hybrid-attention based Feature-reconstructive Adversarial Hashing Networks for Cross-modal Retrieval