Flickr30K Datasets Research Articles

Information Recommendation (IR) systems are conventionally designed to operate within a single modality at a time, such as Text2Text or Image2Image. Cross-modal recommendation, by contrast, aims to support recommendation across different modalities, such as Text2Image. In recent years, significant strides have been made in developing neural recommender models that are multimodal and capable of generalizing across a broad spectrum of domains in a zero-shot manner, thanks to the robust representation capabilities of neural networks. These architectures generate embeddings for assets (i.e., content uploaded on a platform by users), giving a concise representation of their semantics and allowing comparison through similarity ranking. In this paper, we present ZCCR, a Zero-shot Content-based Crossmodal Recommendation System that leverages knowledge from large-scale pretrained Vision-Language Models (VLMs) such as CLIP and ALBEF to redefine the recommendation task as a zero-shot retrieval task, eliminating the need for labeled data or prior knowledge about the recommended content. Furthermore, ZCCR performs crossmodal similarity search on an optimized index, such as FAISS, to improve the speed of recommendation. The goal is to recommend to the user assets that are similar to those they have previously uploaded, commonly referred to as their user profile. Within the user profile, we identify “areas of interest”: groups of assets associated with specific user interests, such as cooking, sports, or cars. To identify these areas of interest and construct the search query for the retrieval operation, ZCCR employs an innovative use of Agglomerative Clustering. This technique groups a user's past assets by similarity without requiring prior knowledge of the number of clusters. Once the areas of interest, or clusters, are identified, the centroid of each cluster is used as the search seed to find similar assets.
Experimental results demonstrate the efficiency of the selected components in terms of search time and retrieval performance on modified MSCOCO and Flickr30K datasets tailored for the recommendation task. Furthermore, ZCCR outperforms both a baseline tagging system (BT) and a more advanced tag system which utilizes a Large Language Model (LLM) to extract embeddings from tags. The results show that, even against the latter, embeddings extracted directly from raw assets outperform intermediate tags generated by other tools. The code implementation to reproduce all experiments and results shown in this paper is provided at the following link: ZCCR-experiments.
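
The pipeline described in this abstract (cluster a user's past assets without fixing the number of clusters, then use each cluster centroid as a search seed) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the ZCCR implementation: the function names are hypothetical, toy 2-D vectors stand in for CLIP or ALBEF embeddings, average linkage with a cosine-distance threshold of 0.5 is an arbitrary choice, and a brute-force scan stands in for an optimized index such as FAISS.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Mean of the vectors, re-normalized to unit length."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)

def agglomerative_clusters(embeddings, distance_threshold=0.5):
    """Average-linkage agglomerative clustering (illustrative choice).

    Merging stops once the closest pair of clusters is farther apart than
    distance_threshold, so the number of clusters is not fixed in advance.
    Returns a list of clusters, each a list of indices into embeddings.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    cosine_distance(embeddings[i], embeddings[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > distance_threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def recommend(profile, catalog, top_k=1, distance_threshold=0.5):
    """Return top_k catalog ids per area of interest found in the profile.

    profile: list of embeddings of the user's past assets.
    catalog: dict mapping a (hypothetical) asset id to its embedding.
    """
    profile = [normalize(v) for v in profile]
    catalog = {key: normalize(v) for key, v in catalog.items()}
    results = []
    for members in agglomerative_clusters(profile, distance_threshold):
        seed = centroid([profile[i] for i in members])
        # Brute-force ranking; an index such as FAISS would replace this scan.
        ranked = sorted(catalog, key=lambda k: cosine_distance(seed, catalog[k]))
        results.append(ranked[:top_k])
    return results

# Toy data: two past assets point one way, two point another way.
profile = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9]]
catalog = {
    "pasta_photo": [1.0, 0.0],
    "football_caption": [0.0, 1.0],
    "mixed": [1.0, 1.0],
}
print(recommend(profile, catalog))
```

Because images and text share one embedding space in a vision-language model, the catalog in such a sketch could mix both modalities, which is what makes the retrieval step cross-modal.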

As a fundamental and challenging task in bridging the language and vision domains, Image-Text Retrieval (ITR) aims to search the other modality for target instances that are semantically relevant to a given query; its key challenge is measuring semantic similarity across modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) directly exploiting region-level features based on bottom-up attention, in which each region is treated equally, hurts the accuracy of the representation; (2) the mini-batch based end-to-end training mechanism limits the scale of negative sample pairs. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we design two simple but effective Global representation based Semantic Enhancement (GSE) modules. One learns the global representation via self-attention and is denoted the Self-Guided Enhancement (SGE) module. The other benefits from the pre-trained CLIP model, which provides a novel scheme to exploit and transfer knowledge from an off-the-shelf model, and is denoted the CLIP-Guided Enhancement (CGE) module. Moreover, we incorporate the training mechanism of MoCo into ITR, in which two dynamic queues are employed to enrich and enlarge the scale of negative sample pairs. Meanwhile, a Unified Training Objective (UTO) is developed to learn from both mini-batch based and dynamic queue based samples. Extensive experiments on the benchmark MSCOCO and Flickr30K datasets demonstrate the superiority of USER in both retrieval accuracy and inference efficiency. For instance, compared with the existing best method NAAF, R@1 of our USER on the MSCOCO 5K Testing set improves by 5% and 2.4% on caption retrieval and image retrieval, respectively, without any external knowledge or pre-trained model, while enjoying over 60 times faster inference speed.
Our source code will be released at https://github.com/zhangy0822/USER.
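
The MoCo-style mechanism mentioned in this abstract (a momentum-updated key encoder plus two dynamic queues that supply extra negatives) can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the USER method or its Unified Training Objective: the class and function names are hypothetical, embeddings are assumed to be L2-normalized lists of floats, encoder parameters are reduced to flat lists, and a plain InfoNCE loss with temperature 0.07 is used in place of the paper's objective.

```python
import math
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: key parameters slowly track the query parameters."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: -log softmax of the positive logit."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]

class NegativeQueues:
    """Two fixed-size FIFO queues of past key embeddings, one per modality."""

    def __init__(self, size):
        # deque(maxlen=...) silently drops the oldest entries when full.
        self.image_keys = deque(maxlen=size)
        self.text_keys = deque(maxlen=size)

    def enqueue(self, image_batch, text_batch):
        self.image_keys.extend(image_batch)
        self.text_keys.extend(text_batch)

def retrieval_loss(image_emb, text_emb, queues):
    """Symmetric loss for one matched image-text pair.

    Negatives for the image query come from the text queue and vice versa,
    so the number of negatives is set by the queue size, not the batch size.
    """
    image_to_text = info_nce(image_emb, text_emb, list(queues.text_keys))
    text_to_image = info_nce(text_emb, image_emb, list(queues.image_keys))
    return image_to_text + text_to_image
```

In a training loop built on this sketch, each step would compute the loss against the current queue contents, then call `enqueue` with the new key embeddings and `momentum_update` on the key encoder's parameters.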

Articles published on Flickr30K Datasets

VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval

Novel concept-based image captioning models using LSTM and multi-encoder transformer architecture

A method for image–text matching based on semantic filtering and adaptive adjustment

Zero-Shot Content-Based Crossmodal Recommendation System

A transformer-based Urdu image caption generation

Multi-level Symmetric Semantic Alignment Network for image–text matching

RagBERT: Relationship-aligned and grammar-wise BERT model for image captioning

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

CAST: Cross-Modal Retrieval and Visual Conditioning for image captioning

The Role of Attention Mechanism in Generating Image Captions: An Innovative Approach with Neural Network-Based Seq2seq Model

IMAGE CAPTION GENERATOR USING DEEP LEARNING

Automatic Image Caption Generation with Deep Learning

Towards bridging the semantic gap between image and text: An empirical approach

Cascade & allocate: A cross-structure adversarial attack against models fusing vision and language

Heterogeneous Graph Fusion Network for cross-modal image-text retrieval

TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip

Adaptive Semantic-Enhanced Transformer for Image Captioning

Ar-CM-ViMETA: Arabic Image Captioning based on Concept Model and Vision-based Multi-Encoder Transformer Architecture

USER: Unified Semantic Enhancement With Momentum Contrast for Image-Text Retrieval

Synthesis of Vision and Language: Multifaceted Image Captioning Application
