The explosive growth of multimedia data on the Internet has magnified the challenge of information retrieval. Multimedia data usually appear in different modalities, such as image, text, video, and audio. Unsupervised cross-modal hashing techniques, which support searching across multi-modal data, have gained importance in large-scale retrieval tasks because of their low storage cost and high retrieval efficiency. Existing methods learn hash functions by transforming high-dimensional data into discrete hash codes; however, the original manifold structure and semantic correlation are often not well preserved in the compact codes. We propose a novel unsupervised cross-modal hashing method that addresses this problem from two perspectives. On the one hand, the semantic correlation in the textual space and the local geometric structure in the visual space are reconstructed by unified hashing features seamlessly and simultaneously. On the other hand, \(\ell _{2,1}\)-norm penalties are imposed on the projection matrices separately to learn relevant and discriminative hash codes. The experimental results indicate that our proposed method achieves improvements of 1%, 6%, 9%, and 2% over the best comparison method on four publicly available datasets (WiKi, PASCAL-VOC, UCI Handwritten Digit, and NUS-WIDE), respectively. In conclusion, the proposed framework, which combines hash function learning and multimodal graph embedding, is effective in learning hash codes and achieves superior retrieval performance compared with state-of-the-art methods.
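For reference, the \(\ell_{2,1}\)-norm penalty mentioned above is conventionally defined as the sum of the Euclidean norms of a matrix's rows, which encourages row sparsity and thus selects relevant, discriminative projection directions; the display below states this standard definition for a generic projection matrix \(W \in \mathbb{R}^{d \times c}\) (the exact regularized objective used in the paper may differ).
\[
\|W\|_{2,1} \;=\; \sum_{i=1}^{d} \big\| w^{i} \big\|_{2} \;=\; \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} W_{ij}^{2}},
\]
where \(w^{i}\) denotes the \(i\)-th row of \(W\). Because whole rows are penalized jointly, minimizing this term drives uninformative rows toward zero rather than zeroing individual entries, which is why it is commonly used for feature selection in hash-function learning.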