Abstract

Cross-modal information retrieval based on subjective information aims to enable flexible media retrieval services, such as allowing users to specify an image to search for audio clips whose impression is similar to that of the image. Existing methods focus on building cross-modal relationships using objective information (such as standard captions). However, such relationships can be built only between pieces of media that are originally related, which limits the flexibility of cross-modal media retrieval. This research leverages the subjective information in media clips for similarity calculation to achieve greater flexibility. We propose a novel cross-modal stochastic neighbor embedding technique called c-SNE. c-SNE extracts features of subjective information from pieces of media and maps them into a common embedding space. It is a learning technique that bridges the heterogeneous gap between the modal distributions using label-weighted SNE, allowing users to find media that share the same subjective information as a query medium. Our experimental results on benchmark datasets demonstrate that the proposed method performs effectively in cross-modal distribution alignment and retrieval. Furthermore, our user study with ten users and 600 data points confirmed that c-SNE outperforms three related methods in actual usage situations from the users' perspective.
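To make the idea of a label-weighted SNE objective concrete, the sketch below shows one plausible form of such a loss for aligning two modalities in a shared embedding space. The function name `label_weighted_sne_loss`, the uniform-plus-boost construction of the target distribution, and the parameters `sigma` and `same_label_weight` are illustrative assumptions, not the paper's exact formulation of c-SNE.

```python
# A minimal, illustrative sketch of a label-weighted SNE loss for aligning
# two modalities in a shared embedding space. The weighting scheme (boosting
# the target probability of same-label cross-modal pairs) is an assumption.
import torch
import torch.nn.functional as F


def label_weighted_sne_loss(z_a, z_b, labels_a, labels_b,
                            sigma=1.0, same_label_weight=2.0):
    """KL(P || Q) between a label-weighted target distribution P and the
    neighbor distribution Q induced by the shared embeddings.

    z_a, z_b: (n, d) and (m, d) embeddings of the two modalities.
    labels_a, labels_b: (n,) and (m,) integer impression labels.
    """
    # Pairwise squared distances between the modalities in the shared space.
    d2 = torch.cdist(z_a, z_b).pow(2)                         # (n, m)
    # Q: SNE-style softmax of negative distances over neighbors.
    q = F.softmax(-d2 / (2.0 * sigma ** 2), dim=1)
    # P: uniform target, up-weighted where labels agree, then renormalized.
    w = torch.ones_like(d2)
    w[labels_a.unsqueeze(1) == labels_b.unsqueeze(0)] = same_label_weight
    p = w / w.sum(dim=1, keepdim=True)
    # Per-row KL divergence, averaged, as in the standard SNE objective.
    return (p * (p.add(1e-12).log() - q.add(1e-12).log())).sum(dim=1).mean()


# Example: 4 image embeddings vs. 5 audio embeddings in an 8-dim shared space.
z_img = torch.randn(4, 8, requires_grad=True)
z_aud = torch.randn(5, 8, requires_grad=True)
loss = label_weighted_sne_loss(z_img, z_aud,
                               torch.tensor([0, 1, 0, 2]),
                               torch.tensor([0, 0, 1, 2, 1]))
loss.backward()  # gradients flow back to both modalities' encoders
```

Minimizing a loss of this shape pulls same-label pairs from different modalities together in the common space, which is what lets a query in one modality retrieve media in another that share its subjective impression.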
