Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval
Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space in which the distance between text and video representations can be measured. In common practice, video representation is constructed by feeding clips into 3D convolutional neural networks for coarse-grained global visual feature extraction. In addition, several studies have attempted to align the local objects of a video with the text. However, these representations share a drawback: they neglect rich fine-grained relation features that capture spatial-temporal object interactions and that benefit the mapping of textual entities in real-world retrieval systems. To tackle this problem, we propose the adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. With this newly proposed visual representation, we also integrate an adversarial learning strategy into AME-Net to further narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space; experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our framework consistently outperforms state-of-the-art methods.
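The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the general adversarial-alignment idea it describes: a modality discriminator tries to tell text embeddings from video embeddings, while a gradient reversal layer pushes the encoders toward modality-invariant features. All names here (GradReverse, ModalityDiscriminator, the 512-dimensional joint space) are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Predicts whether a joint-space embedding came from text or video."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 2))

    def forward(self, z, lambd=1.0):
        return self.net(GradReverse.apply(z, lambd))

# Hypothetical usage: text_emb and video_emb are encoder outputs in the joint space.
disc = ModalityDiscriminator(512)
text_emb, video_emb = torch.randn(8, 512), torch.randn(8, 512)
logits = disc(torch.cat([text_emb, video_emb], dim=0))
labels = torch.cat([torch.zeros(8), torch.ones(8)]).long()
adv_loss = nn.functional.cross_entropy(logits, labels)  # added to the retrieval loss
```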
- Research Article
11
- 10.1145/3627103
- Dec 9, 2023
- ACM Transactions on Multimedia Computing, Communications, and Applications
The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatio-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, which may make them unable to mine high-level fine-grained visual relations effectively. These limitations result in their inability to distinguish videos with the same visual components but different relations. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the Transformer architecture to effectively bridge text-video modalities in a complementary manner by combining local spatio-temporal relations and global temporal information. Specifically, local video representations are encoded using multiple Transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependencies; we call this module the Fine-grained Spatio-temporal Transformer (FST). Global video representations are encoded using a multi-layer Transformer block to learn global temporal features. Finally, we align the spatio-temporal relation and global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on the MSR-VTT, MSVD, and YouCook2 datasets. The results demonstrate the effectiveness of our proposed model. Our code is publicly available at https://github.com/lionel-hing/BiC-Net.
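As a rough illustration of the two-space matching scheme described above (not the paper's FST module itself), the sketch below encodes a video twice with separate Transformer branches and sums text-video similarities from both embedding spaces. Dimensions, pooling, and layer counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoSpaceRetrieval(nn.Module):
    """Encode video twice (local relations, global temporal), match text in both spaces."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        self.local_enc, self.global_enc = enc(), enc()
        self.txt_local, self.txt_global = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, local_tokens, global_tokens, text_feat):
        # Mean-pool each encoded token sequence into one vector per video.
        v_loc = self.local_enc(local_tokens).mean(dim=1)
        v_glo = self.global_enc(global_tokens).mean(dim=1)
        t_loc, t_glo = self.txt_local(text_feat), self.txt_global(text_feat)
        sim = lambda a, b: F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()
        return sim(t_loc, v_loc) + sim(t_glo, v_glo)  # complementary similarity

model = TwoSpaceRetrieval()
scores = model(torch.randn(4, 36, 512), torch.randn(4, 16, 512), torch.randn(4, 512))
```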
- Conference Article
8
- 10.1109/ijcnn52387.2021.9533421
- Jul 18, 2021
Recognition of insect pests in the wild plays a key role in crop protection. Large-scale pest recognition in natural scenes is extremely challenging due to significant intra-class variation and small inter-class variation within sub-categories. Existing works typically use state-of-the-art convolutional neural networks (CNNs) to extract global features directly for pest classification, while neglecting the effectiveness of fine-grained features for identifying visually similar pest categories under a specific super-category. In this paper, we propose a saliency guided discriminative learning network (SGDL-Net) to tackle these problems. The proposed SGDL-Net simultaneously mines global features and fine-grained features in a multi-task learning manner. We design two branches with shared parameters for pest datasets with a hierarchical structure: the raw branch and the fine-grained branch. The raw branch extracts coarse-grained features, i.e., global features, while the fine-grained branch mines fine-grained features through a fine-grained feature mining module (FFMM) as a way to constrain feature learning in the raw branch. In particular, we leverage a salient object location module (SOLM) to locate the salient object in the image and feed it to the fine-grained branch. Finally, through the co-training of the two branches, SGDL-Net is able to learn combined coarse-grained and fine-grained discriminative features via a single CNN. Experimental results show that SGDL-Net achieves state-of-the-art performance on the benchmark insect pest recognition dataset IP102. Meanwhile, ablation studies demonstrate the promise of its application to other hierarchically structured datasets (e.g., CIFAR-100).
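A minimal sketch of the shared-parameter two-branch idea, under stated assumptions: the same backbone scores both the raw image and a salient crop of it, and the two cross-entropy terms are co-trained. The SOLM and FFMM modules are not reproduced; the salient crop is assumed to come from an external saliency detector, and resnet50 is an assumed backbone.

```python
import torch
import torch.nn as nn
import torchvision

class TwoBranchPestNet(nn.Module):
    """One shared CNN applied to the raw image and to a salient crop of it."""
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = torchvision.models.resnet50(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, image, salient_crop):
        # Shared parameters: the same network scores both views.
        return self.backbone(image), self.backbone(salient_crop)

def joint_loss(raw_logits, fine_logits, target):
    """Co-training objective: the fine-grained view constrains the raw branch."""
    ce = nn.functional.cross_entropy
    return ce(raw_logits, target) + ce(fine_logits, target)

img, crop = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
raw_logits, fine_logits = TwoBranchPestNet(num_classes=102)(img, crop)  # 102 as in IP102
loss = joint_loss(raw_logits, fine_logits, torch.tensor([3, 7]))
```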
- Research Article
12
- 10.1016/j.neucom.2022.01.094
- Jan 29, 2022
- Neurocomputing
FeatInter: Exploring fine-grained object features for video-text retrieval
- Research Article
32
- 10.1016/j.inffus.2024.102454
- May 7, 2024
- Information Fusion
Similar modality completion-based multimodal sentiment analysis under uncertain missing modalities
- Research Article
1
- 10.12182/20240360208
- Mar 20, 2024
- Sichuan da xue xue bao. Yi xue ban = Journal of Sichuan University. Medical science edition
The fully automatic segmentation of glioma and its subregions is fundamental for computer-aided clinical diagnosis of tumors. In the segmentation of brain magnetic resonance imaging (MRI), convolutional neural networks with small convolutional kernels can only capture local features and are ineffective at integrating global features, which narrows the receptive field and leads to insufficient segmentation accuracy. This study aims to use dilated convolution to address the inadequate global feature extraction of 3D-UNet. 1) Algorithm construction: A 3D-UNet model with three pathways for richer global contextual feature extraction, called 3DGE-UNet, is proposed. A global contextual feature extraction (GE) module was designed and integrated at the first, second, and third skip connections of the 3D-UNet network to fully extract global features at different scales from the images. The extracted global features were then overlaid with the upsampled feature maps to expand the model's receptive field and achieve deep fusion of features at different scales, thereby enabling end-to-end automatic segmentation of brain tumors. 2) Algorithm validation: The image data came from the publicly available Brain Tumor Segmentation Challenge (BraTS) 2019 dataset, which includes preoperative MRI images of 335 patients across four modalities (T1, T1ce, T2, and FLAIR) along with tumor annotations made by physicians. The dataset was divided into training, validation, and testing sets at an 8:1:1 ratio, with the physician-labelled tumor images used as the gold standard. The algorithm's segmentation performance on the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) was then evaluated on the test set using the Dice coefficient (overall effectiveness), sensitivity (detection rate of lesion areas), and the 95% Hausdorff distance (segmentation accuracy of tumor boundaries). Performance was measured for both the 3D-UNet model without the GE module and the 3DGE-UNet model with the GE module to internally validate the effectiveness of the GE module. Additionally, the same indicators were evaluated for the 3DGE-UNet model, ResUNet, UNet++, nnUNet, and UNETR, and the convergence of these five models was compared to externally validate the effectiveness of the 3DGE-UNet model. 1) In the internal validation, the 3DGE-UNet model achieved mean Dice values of 91.47%, 87.14%, and 83.35% for segmenting the WT, TC, and ET regions in the test set, respectively, the best overall scores. These were superior to the corresponding scores of the traditional 3D-UNet model (89.79%, 85.13%, and 80.90%), indicating a significant improvement in segmentation accuracy across all three regions (P<0.05). Compared with the 3D-UNet model, the 3DGE-UNet model showed higher sensitivity for ET (86.46% vs. 80.77%, P<0.05), performing better at detecting lesion areas: it tended to identify and capture the positive areas more comprehensively, thereby effectively reducing the likelihood of missed diagnoses. The 3DGE-UNet model also excelled at segmenting the edges of the WT, producing a mean 95% Hausdorff distance superior to that of the 3D-UNet model (8.17 mm vs. 13.61 mm, P<0.05), while its performance for TC (8.73 mm vs. 7.47 mm) and ET (6.21 mm vs. 5.45 mm) was similar to that of the 3D-UNet model. 2) In the external validation, the other four algorithms outperformed the 3DGE-UNet model only in the mean Dice for TC (87.25%), the mean sensitivity for WT (94.59%), the mean sensitivity for TC (86.98%), and the mean 95% Hausdorff distance for ET (5.37 mm), and these differences were not statistically significant (P>0.05). The 3DGE-UNet model also converged more rapidly during training than the other external models. In conclusion, the 3DGE-UNet model can effectively extract and fuse feature information at different scales, improving the accuracy of brain tumor segmentation.
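The GE module itself is not specified in the abstract beyond its use of dilated convolution, so the following is a hypothetical sketch of a dilated-convolution context block placed at a 3D-UNet skip connection; channel counts and dilation rates are assumptions.

```python
import torch
import torch.nn as nn

class GlobalContextExtraction3D(nn.Module):
    """Parallel 3D dilated convolutions widen the receptive field at a skip connection."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations)
        self.fuse = nn.Conv3d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, skip):
        ctx = torch.cat([b(skip) for b in self.branches], dim=1)
        return skip + self.fuse(ctx)  # residual: skip features enriched with context

# At each of the first three skip connections (shapes assumed):
x = torch.randn(1, 32, 16, 64, 64)           # (N, C, D, H, W) encoder feature map
enriched = GlobalContextExtraction3D(32)(x)  # same shape, larger receptive field
```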
- Research Article
32
- 10.1016/j.eswa.2020.113465
- Apr 22, 2020
- Expert Systems with Applications
A deceptive review detection framework: Combination of coarse and fine-grained features
- Research Article
24
- 10.1016/j.patcog.2023.109636
- Apr 24, 2023
- Pattern Recognition
BDNet: A BERT-based dual-path network for text-to-image cross-modal person re-identification
- Research Article
- 10.3390/bioengineering11100958
- Sep 25, 2024
- Bioengineering (Basel, Switzerland)
Magnetic resonance imaging (MRI) diagnosis, enhanced by deep learning methods, plays a crucial role in medical image processing, facilitating precise clinical diagnosis and optimal treatment planning. Current methodologies predominantly focus on feature extraction from the image domain, which often results in the loss of global features during down-sampling processes. However, the unique global representational capacity of MRI K-space is often overlooked. In this paper, we present a novel MRI K-space-based global feature extraction and dual-path attention fusion network. Our proposed method extracts global features from MRI K-space data and fuses them with local features from the image domain using a dual-path attention mechanism, thereby achieving accurate MRI segmentation for diagnosis. Specifically, our method consists of four main components: an image-domain feature extraction module, a K-space domain feature extraction module, a dual-path attention feature fusion module, and a decoder. We conducted ablation studies and comprehensive comparisons on the Brain Tumor Segmentation (BraTS) MRI dataset to validate the effectiveness of each module. The results demonstrate that our method exhibits superior performance in segmentation diagnostics, outperforming state-of-the-art methods with improvements of up to 63.82% in the HD95 distance evaluation metric. Furthermore, we performed generalization testing and complexity analysis on the Automated Cardiac Diagnosis Challenge (ACDC) MRI cardiac segmentation dataset. The findings indicate robust performance across different datasets, highlighting strong generalizability and favorable algorithmic complexity. Collectively, these results suggest that our proposed method holds significant potential for practical clinical applications.
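The paper's exact modules are not given in the abstract; the sketch below only illustrates the underlying idea of pairing image-domain features with K-space features (obtained via a 2D FFT) and fusing the two paths with a learned attention gate. The gating design and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class KSpaceFusion(nn.Module):
    """Fuse image-domain features with features computed from the K-space (2D FFT)."""
    def __init__(self, in_ch, feat_ch):
        super().__init__()
        self.img_path = nn.Conv2d(in_ch, feat_ch, 3, padding=1)
        # K-space is complex; stack real and imaginary parts as channels.
        self.k_path = nn.Conv2d(2 * in_ch, feat_ch, 3, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(2 * feat_ch, feat_ch, 1), nn.Sigmoid())

    def forward(self, image):
        k = torch.fft.fftshift(torch.fft.fft2(image, norm="ortho"), dim=(-2, -1))
        k_feat = self.k_path(torch.cat([k.real, k.imag], dim=1))
        i_feat = self.img_path(image)
        a = self.gate(torch.cat([i_feat, k_feat], dim=1))  # per-pixel attention weights
        return a * i_feat + (1 - a) * k_feat

x = torch.randn(2, 1, 128, 128)    # single-channel MRI slice, assumed shape
fused = KSpaceFusion(1, 32)(x)     # (2, 32, 128, 128)
```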
- Research Article
3
- 10.3390/rs17030503
- Jan 31, 2025
- Remote Sensing
Remote sensing cross-modal text-image retrieval constitutes a pivotal component of multi-modal retrieval in remote sensing, central to which is the process of learning integrated visual and textual representations. Prior research predominantly emphasized the overarching characteristics of remote sensing images or employed attention mechanisms for meticulous alignment. However, these investigations, to some degree, overlooked the intricacies inherent in the textual descriptions accompanying remote sensing images. In this paper, we introduce a novel cross-modal retrieval model, specifically tailored for remote sensing image-text pairs, leveraging attention correction and filtering mechanisms. The proposed model is architected around four primary components: an image feature extraction module, a text feature extraction module, an attention correction module, and an attention filtering module. Within the image feature extraction module, the Visual Graph Neural Network (VIG) serves as the principal encoder, augmented by a multi-tiered node feature fusion mechanism. This ensures a comprehensive understanding of remote sensing images. For text feature extraction, both the Bidirectional Gated Recurrent Unit (BGRU) and the Graph Attention Network (GAT) are employed as encoders, furnishing the model with an enriched understanding of the associated text. The attention correction module minimizes potential misalignments in image-text pairings, specifically by modulating attention weightings in cases where there is a unique correlation between visual area attributes and textual descriptors. Concurrently, the attention filtering module diminishes the influence of extraneous visual sectors and terms in the image-text matching process, thereby enhancing the precision of cross-modal retrieval. Extensive experimentation on both the RSICD and RSITMD datasets yielded commendable results, attesting to the superior efficacy of the proposed methodology in remote sensing cross-modal text-image retrieval.
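As a loose illustration of the attention-filtering idea only (not the paper's module), the snippet below zeroes out weak word-region attention links and renormalizes before aggregation; the cosine-attention form and the threshold value are assumptions.

```python
import torch
import torch.nn.functional as F

def filtered_cross_attention(text_feats, region_feats, threshold=0.1):
    """Cross-modal attention where weak text-region links are zeroed, then renormalized."""
    # Cosine-similarity attention between every word and every visual region.
    attn = F.normalize(text_feats, dim=-1) @ F.normalize(region_feats, dim=-1).t()
    attn = F.softmax(attn, dim=-1)
    attn = torch.where(attn >= threshold, attn, torch.zeros_like(attn))  # filter
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)         # renormalize
    return attn @ region_feats  # each word attends only to relevant regions

words = torch.randn(12, 256)    # 12 word features (assumed dimensions)
regions = torch.randn(36, 256)  # 36 visual-region features
attended = filtered_cross_attention(words, regions)  # (12, 256)
```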
- Research Article
17
- 10.1016/j.neucom.2024.127828
- May 10, 2024
- Neurocomputing
MAFormer: A transformer network with multi-scale attention fusion for visual recognition
- Conference Article
6
- 10.1145/3459637.3482158
- Oct 26, 2021
Cross-modal retrieval is a classic task in the multimedia community, which aims to search for semantically similar results from different modalities. The core of cross-modal retrieval is to learn the most correlated features in a common feature space for the multi-modal data so that the similarity can be directly measured. In this paper, we propose a novel model using optimal transport for bridging the heterogeneity gap in cross-modal retrieval tasks. Specifically, we calculate the optimal transport plans between feature distributions of different modalities and then minimize the transport cost by optimizing the feature embedding functions. In this way, the feature distributions of multi-modal data can be well aligned in the common feature space. In addition, our model combines the complementary losses in different levels: 1) semantic level, 2) distributional level, and 3) pairwise level for improving cross-modal retrieval performance. In extensive experiments, our method outperforms many other cross-modal retrieval methods, which proves the efficacy of using optimal transport in cross-modal retrieval tasks.
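Entropic optimal transport with Sinkhorn iterations is the standard differentiable way to compute such transport plans; a minimal sketch follows. The Euclidean ground cost, the regularizer eps, and the iteration count are assumptions, and the returned cost would be minimized alongside the paper's other losses.

```python
import torch

def sinkhorn_ot_cost(text_emb, video_emb, eps=0.05, iters=50):
    """Entropic-OT transport cost between two embedding batches (Sinkhorn iterations)."""
    cost = torch.cdist(text_emb, video_emb, p=2)        # pairwise ground cost
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=cost.device)  # uniform marginals
    nu = torch.full((m,), 1.0 / m, device=cost.device)
    K = torch.exp(-cost / eps)
    u = torch.ones_like(mu)
    for _ in range(iters):                              # Sinkhorn fixed-point updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)          # optimal transport plan
    return (plan * cost).sum()                          # differentiable transport cost

text_e, video_e = torch.randn(8, 512), torch.randn(8, 512)
loss = sinkhorn_ot_cost(text_e, video_e)  # minimized w.r.t. the embedding networks
```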
- Conference Article
5
- 10.1109/icpr.2016.7899903
- Dec 1, 2016
We consider the problem of joint modeling of videos and their corresponding textual descriptions (e.g. sentences or phrases). Our approach consists of three components: the video representation, the textual representation, and a joint model that links videos and text. Our video representation uses the state-of-the-art deep 3D ConvNet to capture the semantic information in the video. Our textual representation uses the recent advancement in learning word and sentence vectors from large text corpus. The joint model is learned to score the correct (video, text) pairs higher than the incorrect ones. We demonstrate our approach in several applications: 1) retrieving sentences given a video; 2) retrieving videos given a sentence; 3) zero-shot action recognition in videos.
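The pairwise objective described above is the classic margin-based ranking loss over in-batch (video, text) pairs; below is a minimal sketch, assuming cosine similarity in the joint space and summing over both retrieval directions.

```python
import torch
import torch.nn.functional as F

def ranking_loss(video_emb, text_emb, margin=0.2):
    """Hinge ranking loss: matched (video, text) pairs must outscore mismatched ones."""
    scores = F.normalize(video_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    pos = scores.diag().unsqueeze(1)                   # correct pairs on the diagonal
    cost_t = (margin + scores - pos).clamp(min=0)      # rank texts for each video
    cost_v = (margin + scores - pos.t()).clamp(min=0)  # rank videos for each text
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_t.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()

# video_emb from the 3D ConvNet head, text_emb from the sentence-vector head,
# both assumed to be projected into the same joint space.
loss = ranking_loss(torch.randn(16, 300), torch.randn(16, 300))
```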
- Research Article
8
- 10.1360/ssi-2019-0292
- Jun 1, 2020
- SCIENTIA SINICA Informationis
In recent years, the growth of video resources has created demand for fine-grained retrieval of video moments, such as highlight moments in sports events and the re-creation of specific video content. In this context, research on cross-modal video moment retrieval, which attempts to output a video moment that matches an input query text, is gradually emerging. Existing solutions primarily focus on global or local feature representations of the query text and video moments, but they ignore the matching of semantic relations contained in the two. For example, given the query text “a person is playing basketball”, existing retrieval systems may incorrectly return a video moment of “a person holding a basketball”, because they do not consider the semantic relationship in “a person playing basketball”. Therefore, this paper proposes a cross-modal relationship alignment framework, which we refer to as CrossGraphAlign, for cross-modal video moment retrieval. The proposed framework constructs a textual relationship graph and a visual relationship graph to model the semantic relations in the query text and the video moments, then evaluates the similarity between textual and visual relations through cross-modally aligned graph convolutional networks to help build a more accurate video moment retrieval system. Experimental results on the publicly available cross-modal video retrieval datasets TACoS and ActivityNet Captions demonstrate that the proposed method can effectively utilize semantic relationships to improve the recall rate of cross-modal video moment retrieval.
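As a minimal sketch of the graph-convolutional alignment idea (not CrossGraphAlign itself), the snippet below runs one GCN layer over each modality's relationship graph and compares pooled graph embeddings; node features, adjacency construction, and dimensions are all assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: aggregate neighbor features via the adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Row-normalize the adjacency (self-loops included) and propagate.
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        adj = adj / adj.sum(dim=-1, keepdim=True)
        return torch.relu(self.proj(adj @ x))

# Hypothetical use: one GCN per modality over its relationship graph,
# then compare pooled graph embeddings across modalities.
text_gcn, vis_gcn = GCNLayer(300, 256), GCNLayer(2048, 256)
t = text_gcn(torch.randn(5, 300), torch.randint(0, 2, (5, 5)).float()).mean(0)
v = vis_gcn(torch.randn(7, 2048), torch.randint(0, 2, (7, 7)).float()).mean(0)
similarity = torch.cosine_similarity(t, v, dim=0)
```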
- Conference Article
34
- 10.1145/3343031.3351067
- Oct 15, 2019
As a natural extension of image-based cross-modal recipe retrieval, retrieving a specific video given a recipe as the query is seldom explored. Various temporal and spatial elements are hidden in cooking videos. In addition, current image-based cross-modal recipe retrieval approaches mostly emphasize understanding textual and visual content independently, overlooking the interaction between them. In this work, we propose a new problem of video-based cross-modal recipe retrieval and thoroughly investigate it under the attention paradigm. In particular, we first exploit a parallel-attention network to independently learn the representations of videos and recipes. Next, a co-attention network is proposed to explicitly emphasize the cross-modal interactive features between videos and recipes. Meanwhile, a cross-modal fusion sub-network is proposed to learn both the independent and collaborative dynamics, which can enhance the associated representations of videos and recipes. Last but not least, the embedding vectors of videos and recipes stemming from the joint network are optimized with a pairwise ranking loss. Extensive experiments on a self-collected dataset have verified the effectiveness and rationality of our proposed solution.
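A minimal sketch of the co-attention idea described above, under assumed shapes: each modality attends to the other through a shared affinity matrix, and the attended features are fused with the originals before pooling. This is illustrative only; the paper's parallel-attention and fusion sub-networks are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Each modality attends to the other; attended features are fused with the originals."""
    def __init__(self, dim):
        super().__init__()
        self.video_q = nn.Linear(dim, dim)
        self.recipe_q = nn.Linear(dim, dim)

    def forward(self, video, recipe):  # (B, Tv, D), (B, Tr, D)
        affinity = self.video_q(video) @ self.recipe_q(recipe).transpose(1, 2)
        v2r = F.softmax(affinity, dim=-1) @ recipe                  # video attends recipe
        r2v = F.softmax(affinity.transpose(1, 2), dim=-1) @ video   # recipe attends video
        # Fuse independent and interactive dynamics, then pool to one vector each.
        return (video + v2r).mean(1), (recipe + r2v).mean(1)

v, r = CoAttention(512)(torch.randn(4, 20, 512), torch.randn(4, 30, 512))
```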
- Research Article
1
- 10.1080/10106049.2024.2375572
- Jan 1, 2024
- Geocarto International
Transformer models boost building-extraction accuracy by capturing global features from images. However, the potential of convolutional networks for local feature extraction remains underutilized in CNN + Transformer models, limiting performance. To harness convolutional networks for local feature extraction, we propose a feature attention large kernel (ALK) module and a dual-encoder network for building extraction from high-resolution images. The model integrates an attention-based large-kernel encoder, a ResNet50-Transformer encoder, a Channel Transformer (CTrans) module, and a decoder. By efficiently capturing local and global building features from both convolutional and positional perspectives, the dual encoder enhances performance. Moreover, replacing skip connections with the CTrans module mitigates semantic inconsistency during feature fusion, ensuring better multidimensional feature integration. Experimental results demonstrate superior extraction of local and global features compared to other models, showcasing the potential of enhanced local feature extraction for advancing CNN + Transformer models.
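The ALK module is not detailed in the abstract; the sketch below shows the general large-kernel-attention pattern it appears to build on: a depthwise convolution, a dilated depthwise convolution, and a pointwise convolution whose output reweights the input. Kernel sizes and the dilation rate are assumptions.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Attention map from a decomposed large kernel: depthwise + dilated depthwise + 1x1."""
    def __init__(self, ch):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 5, padding=2, groups=ch)
        self.dw_dilated = nn.Conv2d(ch, ch, 7, padding=9, dilation=3, groups=ch)
        self.pw = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))  # large effective receptive field
        return x * attn                              # reweight local features

feat = torch.randn(1, 64, 128, 128)
out = LargeKernelAttention(64)(feat)  # same shape, attention-modulated
```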