Image-text Retrieval Research Articles

Tomato leaf disease control in the field of smart agriculture urgently requires attention and reinforcement. This paper proposes a method called LAFANet for image-text retrieval, which integrates image and text information for joint analysis of multimodal data, helping agricultural practitioners to provide more comprehensive and in-depth diagnostic evidence to ensure the quality and yield of tomatoes. First, we focus on six common tomato leaf disease images and text descriptions, creating a Tomato Leaf Disease Image-Text Retrieval Dataset (TLDITRD), introducing image-text retrieval into the field of tomato leaf disease retrieval. Then, utilizing ViT and BERT models, we extract detailed image features and sequences of textual features, incorporating contextual information from image-text pairs. To address errors in image-text retrieval caused by complex backgrounds, we propose Learnable Fusion Attention (LFA) to amplify the fusion of textual and image features, thereby extracting substantial semantic insights from both modalities. To delve further into the semantic connections across various modalities, we propose a False Negative Elimination-Adversarial Negative Selection (FNE-ANS) approach. This method aims to identify adversarial negative instances that specifically target false negatives within the triplet function, thereby imposing constraints on the model. To bolster the model's capacity for generalization and precision, we propose Adversarial Regularization (AR). This approach involves incorporating adversarial perturbations during model training, thereby fortifying its resilience and adaptability to slight variations in input data. Experimental results show that, compared with existing ultramodern models, LAFANet outperformed existing models on TLDITRD dataset, with top1, top5, and top10 reaching 83.3% and 90.0%, and top1, top5, and top10 reaching 80.3%, 93.7%, and 96.3%. LAFANet offers fresh technical backing and algorithmic insights for the retrieval of tomato leaf disease through image-text correlation.

Fine-grained image-text retrieval has been a hot research topic to bridge the vision and languages, and its main challenge is how to learn the semantic correspondence across different modalities. The existing methods mainly focus on learning the global semantic correspondence or intramodal relation correspondence in separate data representations, but which rarely consider the intermodal relation that interactively provide complementary hints for fine-grained semantic correlation learning. To address this issue, we propose a relation-aggregated cross-graph (RACG) model to explicitly learn the fine-grained semantic correspondence by aggregating both intramodal and intermodal relations, which can be well utilized to guide the feature correspondence learning process. More specifically, we first build semantic-embedded graph to explore both fine-grained objects and their relations of different media types, which aim not only to characterize the object appearance in each modality, but also to capture the intrinsic relation information to differentiate intramodal discrepancies. Then, a cross-graph relation encoder is newly designed to explore the intermodal relation across different modalities, which can mutually boost the cross-modal correlations to learn more precise intermodal dependencies. Besides, the feature reconstruction module and multihead similarity alignment are efficiently leveraged to optimize the node-level semantic correspondence, whereby the relation-aggregated cross-modal embeddings between image and text are discriminatively obtained to benefit various image-text retrieval tasks with high retrieval performance. Extensive experiments evaluated on benchmark datasets quantitatively and qualitatively verify the advantages of the proposed framework for fine-grained image-text retrieval and show its competitive performance with the state of the arts.

Image-text Retrieval Research Articles

Related Topics

Articles published on Image-text Retrieval

Hypergraph clustering based multi-label cross-modal retrieval

AENet: Image Retrieval of Kazakh Handwritten Documents Based on Attention Mechanism and Feature Aggregation

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Multi-view and region reasoning semantic enhancement for image-text retrieval

Bridging Modalities: A Survey of Cross-Modal Image-Text Retrieval

Clustering swap prediction for image-text pre-training

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

A Novel Pretrained General-purpose Vision Language Model for the Vietnamese Language

High-Accuracy Tomato Leaf Disease Image-Text Retrieval Method Utilizing LAFANet.

Unsupervised deep hashing with multiple similarity preservation for cross-modal image-text retrieval

Heterogeneous Graph Fusion Network for cross-modal image-text retrieval

Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Soft Contrastive Cross-Modal Retrieval

Multiview adaptive attention pooling for image–text retrieval

Perceive, Reason, and Align: Context-guided cross-modal correlation learning for image–text retrieval

Retraction Note: Simulation of cross-modal image-text retrieval algorithm under convolutional neural network structure and hash method

Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image-Text Retrieval.

Integrating listwise ranking into pairwise-based image-text retrieval

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Image-text Retrieval Research Articles

Related Topics

Articles published on Image-text Retrieval

Hypergraph clustering based multi-label cross-modal retrieval

AENet: Image Retrieval of Kazakh Handwritten Documents Based on Attention Mechanism and Feature Aggregation

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Multi-view and region reasoning semantic enhancement for image-text retrieval

Bridging Modalities: A Survey of Cross-Modal Image-Text Retrieval

Clustering swap prediction for image-text pre-training

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

A Novel Pretrained General-purpose Vision Language Model for the Vietnamese Language

High-Accuracy Tomato Leaf Disease Image-Text Retrieval Method Utilizing LAFANet.

Unsupervised deep hashing with multiple similarity preservation for cross-modal image-text retrieval

Heterogeneous Graph Fusion Network for cross-modal image-text retrieval

Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Soft Contrastive Cross-Modal Retrieval

Multiview adaptive attention pooling for image–text retrieval

Perceive, Reason, and Align: Context-guided cross-modal correlation learning for image–text retrieval

Retraction Note: Simulation of cross-modal image-text retrieval algorithm under convolutional neural network structure and hash method

Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image-Text Retrieval.

Integrating listwise ranking into pairwise-based image-text retrieval