Multimodal Retrieval Research Articles

Multimodal retrieval has received widespread attention because it can supply large amounts of related data to support the development of computational social systems (CSSs). However, existing works still face the following challenges: 1) they rely on a tedious manual labeling process when extended to CSSs, which introduces subjective errors and consumes considerable time and labor; 2) they train only on strongly aligned data and neglect adjacency information, which leads to poor robustness and makes the semantic heterogeneity gap difficult to bridge; and 3) they map features into real-valued form, which results in high storage costs and low retrieval efficiency. To address these issues in turn, we design a web-knowledge-driven multimodal retrieval framework called unsupervised and robust graph convolutional hashing (URGCH).
The specific implementation is as follows. First, a "secondary semantic self-fusion" approach is proposed, which extracts semantically rich features through pretrained neural networks and constructs a joint semantic matrix through semantic fusion, eliminating the manual labeling process. Second, an "adaptive computing" approach is designed, which constructs enhanced semantic graph features by infusing knowledge from neighborhoods and uses graph convolutional networks for knowledge fusion coding; this enables URGCH to bridge the semantic modality gap while obtaining robust features. Third, combined with hash learning, the multimodal data are mapped into binary codes, which reduces storage requirements and improves retrieval efficiency. Finally, we perform extensive experiments on the web dataset. The results show that URGCH outperforms the other baselines by about 1%–3.7% in mean average precision (MAP), displays superior performance in all aspects, and can provide meaningful multimodal data retrieval services to CSSs.
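The three ideas named in this abstract (a fused similarity matrix, binary hash codes, and MAP evaluation) can be illustrated with a minimal sketch. This is not the URGCH implementation: the cosine-based fusion with a weight `alpha`, the sign-threshold binarization, and all function names are assumptions chosen for illustration, and the graph convolutional encoder is omitted entirely.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def joint_similarity(image_feats, text_feats, alpha=0.5):
    """Fuse per-modality cosine similarities into one matrix.

    Assumption: a plain weighted sum; the paper's fusion rule may differ.
    """
    n = len(image_feats)
    return [
        [
            alpha * cosine(image_feats[i], image_feats[j])
            + (1 - alpha) * cosine(text_feats[i], text_feats[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def binarize(vec):
    """Map a real-valued vector to a binary code by thresholding at zero."""
    return [1 if x >= 0 else 0 for x in vec]


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query."""
    return sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))


def average_precision(ranked_relevance):
    """Average precision for one query, given 0/1 relevance in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

Averaging `average_precision` over all queries gives MAP. Binary codes make the storage and speed argument concrete: each item needs only one bit per code dimension, and ranking reduces to counting differing bits.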

Deep cross-modal hashing has advanced the field of multi-modal retrieval thanks to its efficiency and low storage cost, but its vulnerability to backdoor attacks is rarely studied. Notably, current deep cross-modal hashing methods require large-scale training data, so poisoned samples with imperceptible triggers can easily be camouflaged in the training data to plant backdoors in the victim model. Existing backdoor attacks, however, focus on the uni-modal vision domain, and the multi-modal gap and hash quantization weaken their attack performance. To address these challenges, we propose an invisible black-box backdoor attack against deep cross-modal hashing retrieval in this article. To the best of our knowledge, this is the first attempt in this research field. Specifically, we develop a flexible trigger generator to produce the attacker's specified triggers; it learns the sample semantics of the non-poisoned modality to bridge the cross-modal attack gap. Then, we devise an input-aware injection network, which embeds the generated triggers into benign samples in a sample-specific, stealthy form and realizes cross-modal semantic interaction between triggers and poisoned samples. Because the attacker has no knowledge of the victim model, we allow any cross-modal hashing knockoff to facilitate the black-box backdoor attack and to alleviate the attack weakening caused by hash quantization. Moreover, we propose a confusing perturbation and mask strategy to induce high-performance victim models to focus on the imperceptible triggers in poisoned samples. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art attack performance against deep cross-modal hashing retrieval. In addition, we investigate the influence of transferable attacks, few-shot poisoning, multi-modal poisoning, perceptibility, and potential defenses on backdoor attacks.
Our code and datasets are available at https://github.com/tswang0116/IB3A.

Articles published on Multimodal Retrieval

Identifying Implicit Social Biases in Vision-Language Models

Application of Multimedia Information Retrieval Technology in Japanese Text Content Information Query Platform

Flexible Dual Multi-Modal Hashing for Incomplete Multi-Modal Retrieval

An Interactive Multi-Modal Query Answering System with Retrieval-Augmented Large Language Models

Fast unsupervised multi-modal hashing based on piecewise learning

A Web Knowledge-Driven Multimodal Retrieval Method in Computational Social Systems: Unsupervised and Robust Graph Convolutional Hashing

Invisible Black-Box Backdoor Attack against Deep Cross-Modal Hashing Retrieval

Multimodal bird information retrieval system

Visual Language – Let the Product Say What You Want

Learning to disentangle and fuse for fine-grained multi-modality ship image retrieval

Asymmetric Supervised Fusion-Oriented Hashing for Cross-Modal Retrieval.

Multiple Pseudo-Siamese Network with Supervised Contrast Learning for Medical Multi-modal Retrieval

LCEMH: Label Correlation Enhanced Multi-modal Hashing for efficient multi-modal retrieval

Constrained Bipartite Graph Learning for Imbalanced Multi-Modal Retrieval

Hashing Fake: Producing Adversarial Perturbation for Online Privacy Protection Against Automatic Retrieval Models

Annotate and retrieve in vivo images using hybrid self-organizing map

A Multi-Modal Retrieval Model for Mathematical Expressions Based on ConvNeXt and Hesitant Fuzzy Set

Towards Human–Machine Recognition Alignment: An Adversarilly Robust Multimodal Retrieval Hashing Framework

Feature Fusion Based on Transformer for Cross-modal Retrieval

One for more: Structured Multi-Modal Hashing for multiple multimedia retrieval tasks
