Abstract

Cross-modal similarity query has become a prominent research topic for managing multimodal datasets such as images and texts. Existing research generally focuses on query accuracy by designing complex deep neural network models and hardly considers query efficiency and interpretability simultaneously, both of which are vital properties of a cross-modal semantic query processing system on large-scale datasets. In this work, we investigate multi-grained common semantic embedding representations of images and texts and integrate an interpretable query index into the deep neural network by developing a novel Multi-grained Cross-modal Query with Interpretability (MCQI) framework. The main contributions are as follows: (1) By integrating coarse-grained and fine-grained semantic learning models, a multi-grained cross-modal query processing architecture is proposed to ensure the adaptability and generality of query processing. (2) To capture the latent semantic relations between images and texts, the framework combines an LSTM with an attention mechanism, which enhances accuracy for cross-modal queries and lays the foundation for interpretable query processing. (3) An index structure and a corresponding nearest-neighbor query algorithm are proposed to boost the efficiency of interpretable queries. (4) A distributed query algorithm is proposed to improve the scalability of our framework. Compared with state-of-the-art methods on widely used cross-modal datasets, the experimental results show the effectiveness of our MCQI approach.
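
To make the multi-grained idea concrete, the following minimal sketch shows one plausible way a query score could balance fine-grained (local) and coarse-grained (global) similarities with a weight factor, which the paper mentions as balancing the two feature granularities. The function name, the use of cosine similarity, and the default weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def multi_grained_score(img_fine, txt_fine, img_coarse, txt_coarse, alpha=0.5):
    """Hypothetical combination of the two granularities described in the
    abstract: alpha balances fine-grained and coarse-grained similarity.
    All inputs are assumed to be L2-normalized embedding vectors, so the
    dot product equals cosine similarity."""
    fine = float(np.dot(img_fine, txt_fine))        # patch/word-level match
    coarse = float(np.dot(img_coarse, txt_coarse))  # whole image/sentence match
    return alpha * fine + (1.0 - alpha) * coarse
```

A larger alpha favors local patch-to-phrase alignment, while a smaller alpha favors the global image-to-sentence match.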

Highlights

  • With the rapid development of computer science and technology, multimedia data such as images and texts have been proliferating on the Internet and have become a primary medium through which humans perceive the world

  • Cross-modal similarity query has become an essential technique with wide applications, such as search engines and multimedia data management

  • The second stage is the index construction stage, in which an M-tree index and an inverted index are integrated to process efficient and interpretable queries; we introduce it in terms of the embedding representations of multimodal data and interpretable query processing
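
The index construction stage described above lends itself to a small illustration. The sketch below is a toy stand-in, not the paper's algorithm: a brute-force scan replaces the real M-tree, and a plain dictionary plays the role of the inverted index that maps each result to the patch relation tuples explaining why it matched. All names (`HybridIndex`, `knn`, `patch_tuples`) are hypothetical.

```python
import heapq
import numpy as np

class HybridIndex:
    """Toy stand-in for the hybrid index: a metric index for kNN (here a
    brute-force scan instead of a real M-tree) plus an inverted index from
    object ids to the patch relation tuples that make a match explainable."""

    def __init__(self):
        self.ids, self.vecs = [], []
        self.explanations = {}  # id -> list of (image_patch, text_phrase) tuples

    def insert(self, obj_id, vec, patch_tuples):
        self.ids.append(obj_id)
        self.vecs.append(np.asarray(vec, dtype=np.float64))
        self.explanations[obj_id] = patch_tuples

    def knn(self, query, k):
        """Return the k nearest neighbors by Euclidean distance, each paired
        with the patch tuples that make the result interpretable."""
        query = np.asarray(query, dtype=np.float64)
        dists = [(np.linalg.norm(v - query), i) for i, v in zip(self.ids, self.vecs)]
        nearest = heapq.nsmallest(k, dists, key=lambda t: t[0])
        return [(i, d, self.explanations[i]) for d, i in nearest]
```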


Summary

Introduction

With the rapid development of computer science and technology, multimedia data such as images and texts have been proliferating on the Internet and have become a primary medium through which humans perceive the world. However, the numerous parameters of deep neural networks make the query process and its results difficult to explain; that is, those models have weak interpretability, which is an important property of a general and reliable cross-modal query system. Our core insight is that we can leverage a deep neural network model to capture multi-grained cross-modal common semantics and build an efficient hybrid index with interpretability and scalability. To ensure the adaptability and generality of our framework, when training common feature vectors for different data types we first capture coarse-grained and fine-grained semantic information by designing different networks and then combine them. To capture the latent semantic relations between images and texts, the framework combines an LSTM with an attention mechanism, which enhances accuracy for cross-modal queries and lays the foundation for interpretable query processing.
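
As a rough illustration of the LSTM-plus-attention idea, the minimal sketch below (assuming PyTorch; the module name, dimensions, and single-layer design are assumptions, not the paper's exact network) encodes the text with an LSTM whose final hidden state attends over CNN region features of the image. The attention weights double as the alignment signal that makes results interpretable.

```python
import torch
import torch.nn as nn

class TextGuidedAttention(nn.Module):
    """Minimal sketch: an LSTM encodes the text, and its final hidden state
    attends over image region features to produce a fine-grained common
    embedding. All dimensions below are illustrative assumptions."""

    def __init__(self, vocab_size=10000, word_dim=300, hidden_dim=512, region_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(region_dim, hidden_dim)  # map regions into text space

    def forward(self, tokens, regions):
        # tokens: (B, T) word ids; regions: (B, R, region_dim) CNN region features
        _, (h, _) = self.lstm(self.embed(tokens))       # h: (1, B, hidden_dim)
        query = h.squeeze(0).unsqueeze(1)               # (B, 1, hidden_dim)
        keys = self.proj(regions)                       # (B, R, hidden_dim)
        attn = torch.softmax(query @ keys.transpose(1, 2), dim=-1)  # (B, 1, R)
        attended = (attn @ keys).squeeze(1)             # (B, hidden_dim)
        # the attention weights over regions are the interpretable alignment
        return attended, attn.squeeze(1)
```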

Cross-modal Retrieval
Latent Semantic Alignment
Cross-modal Hashing
Distributed Similarity Query
Proposed Model
Fine-grained Embedding Learning with Local Semantics
Embedding Representations of Multimodal Data
Coarse-grained Embedding Learning with Global Semantics
Multi-grained Objective Function
Optimization
Interpretable Query Processing
Index Construction
Interpretable kNN Query
Distributed Algorithm
Selection of Pivot Points
Query-Sensitive Load Balancing
Computation of pn
Distributed kNN Query Algorithm
Experiment Setup
Verification of Observation 1
Performance of Query Accuracy
Performance of Query Time
Conclusion