Abstract

With the rapid growth of multimodal data, cross-modal search has attracted wide research interest. Owing to their efficiency in storage and computation, hashing-based methods are widely used for large-scale cross-modal retrieval. Most existing hashing methods are built on binary supervision, which reduces the complex relationships of multi-label data to a simple similar/dissimilar distinction, and few methods have exploited the rich semantic information implicit in multi-label data to improve retrieval accuracy. In this paper, a multi-level semantic supervision generation approach is proposed that explores label relevance, and a deep hashing framework is designed for multi-label image-text cross-modal retrieval. The framework simultaneously captures both the binary similarity and the complex multi-level semantic structure of data across modalities. Moreover, the effects of three different convolutional neural networks, CNN-F, VGG-16, and ResNet-50, on retrieval performance are compared. Experimental results on an open-source cross-modal dataset show that our approach outperforms several state-of-the-art hashing methods, and that retrieval with the CNN-F backbone is better than with VGG-16 or ResNet-50.
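
To make the idea of multi-level supervision concrete, the sketch below derives a graded similarity matrix from multi-hot label vectors instead of the usual binary similar/dissimilar labels. The cosine relevance measure, the quantization into num_levels discrete levels, and the function name multilevel_similarity are illustrative assumptions for this sketch, not the paper's exact construction.

# Minimal sketch (not the paper's exact construction): derive multi-level
# semantic supervision from multi-hot label vectors. The cosine relevance
# and the quantization into discrete levels are illustrative assumptions.
import numpy as np

def multilevel_similarity(labels, num_levels=4):
    # labels: (n, c) binary matrix, labels[i, k] = 1 if sample i has label k.
    # Returns an (n, n) integer matrix with entries in {0, ..., num_levels};
    # 0 means no shared labels, larger values mean stronger semantic relevance.
    labels = labels.astype(float)
    norms = np.linalg.norm(labels, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # guard against empty label sets
    unit = labels / norms
    cos = np.clip(unit @ unit.T, 0.0, 1.0)  # pairwise cosine relevance
    # Binary supervision would stop at (cos > 0); multi-level supervision
    # keeps the graded relevance by quantizing it into discrete levels.
    return np.ceil(cos * num_levels).astype(int)

# Four samples over four labels, with identical, overlapping,
# and disjoint label sets:
y = np.array([[1, 1, 0, 0],
              [1, 1, 1, 1],
              [1, 0, 1, 0],
              [0, 0, 0, 1]])
print(multilevel_similarity(y))

In this example, identical label sets receive the top level, partially overlapping sets intermediate levels, and disjoint sets level 0; binary supervision would collapse every nonzero entry to 1 and discard this gradation.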
