Abstract

In this paper, we propose a clustering-based semi-supervised cross-modal retrieval method to alleviate the problem of insufficient annotation in cross-modal datasets. First, we reconstruct the cross-modal data as scene graph structures to filter out meaningless information. Second, we extract embedding representations of images and texts and project them into a common space. Finally, we propose a clustering-based classification method with a modality-independent constraint to discriminate samples. Experimental results on three widely used cross-modal datasets show that our method achieves significant performance improvements over state-of-the-art methods.
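To make the three-stage pipeline in the abstract concrete, the following is a minimal Python sketch of the general idea: modality-specific features are projected into a common space, pooled embeddings are clustered to produce pseudo-labels for unlabeled samples, and a modality-independent constraint penalizes the gap between paired image and text embeddings. All names, dimensions, and the specific clustering and penalty choices are illustrative assumptions, not the authors' exact formulation.

# Illustrative sketch only; not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Assume image/text features have already been extracted
# (e.g., from scene-graph-based encoders).
img_feats = rng.normal(size=(200, 512))   # hypothetical image embeddings
txt_feats = rng.normal(size=(200, 300))   # hypothetical text embeddings

# Project both modalities into a shared common space
# (random linear maps stand in for learned projections).
d_common = 128
W_img = rng.normal(size=(512, d_common)) / np.sqrt(512)
W_txt = rng.normal(size=(300, d_common)) / np.sqrt(300)
img_common = img_feats @ W_img
txt_common = txt_feats @ W_txt

# Cluster the pooled common-space embeddings to obtain pseudo-labels,
# providing the semi-supervised signal for unlabeled samples.
all_common = np.vstack([img_common, txt_common])
pseudo_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(all_common)

# A simple modality-independent constraint: the mean distance between the
# common-space embeddings of paired image and text samples.
modality_gap = np.linalg.norm(img_common - txt_common, axis=1).mean()
print("mean cross-modal gap:", modality_gap)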
