Abstract

Learning common representations for different modalities of data is the key component in cross-modal retrieval. Most existing deep approaches learn multiple networks that independently project each sample into a common representation. However, each representation is extracted only from the corresponding sample, which entirely ignores its relationships with other data. It is therefore challenging to learn effective common representations when supervised multi-modal training data are scarce, as in few-shot cross-modal retrieval, and how to efficiently exploit the information contained in other examples remains underexplored. In this work, we present the Self-Others Net, a few-shot cross-modal retrieval model that fully exploits the information contained in both the sample itself and other samples. First, we propose a self-network to fully exploit the correlations latent in the data itself: it integrates features from different layers and extracts multi-level information. Second, an others-network models the relationships among all samples: it learns a Mahalanobis tensor and mixes the prototypes of all data to capture non-linear dependencies for common representation learning. Extensive experiments on three benchmark datasets demonstrate clear improvements of the proposed method over state-of-the-art methods.

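To make the two components concrete, the sketch below shows one way a self-network (multi-level feature fusion per modality) and an others-network (prototype mixing under a learned Mahalanobis-style metric) could fit together in PyTorch. The class names, layer sizes, the residual mixing rule, and the factorization M = LᵀL are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a Self-Others-style model (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfNetwork(nn.Module):
    """Projects one modality into the common space, fusing multi-level features."""

    def __init__(self, in_dim: int, hidden_dim: int = 512, common_dim: int = 256):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # Fuse the two intermediate levels before projecting to the common space.
        self.fuse = nn.Linear(2 * hidden_dim, common_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.layer1(x)                      # lower-level features
        h2 = self.layer2(h1)                     # higher-level features
        fused = torch.cat([h1, h2], dim=-1)      # multi-level fusion
        return F.normalize(self.fuse(fused), dim=-1)


class OthersNetwork(nn.Module):
    """Refines each representation by mixing the prototypes of all samples,
    weighted by a learned Mahalanobis-style similarity."""

    def __init__(self, common_dim: int = 256):
        super().__init__()
        # Learnable factor L; using M = L^T L keeps the metric positive semi-definite.
        self.L = nn.Parameter(torch.eye(common_dim))

    def forward(self, z: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
        # Mahalanobis-style distance d(z, p) = ||L (z - p)||^2
        diff = z.unsqueeze(1) - prototypes.unsqueeze(0)   # (B, P, D)
        proj = diff @ self.L.t()                          # (B, P, D)
        dist = (proj ** 2).sum(-1)                        # (B, P)
        weights = F.softmax(-dist, dim=-1)                # closer prototypes weigh more
        mixed = weights @ prototypes                      # (B, D)
        return F.normalize(z + mixed, dim=-1)             # residual refinement


if __name__ == "__main__":
    image_feat = torch.randn(8, 2048)   # e.g., CNN image features (assumed dims)
    text_feat = torch.randn(8, 300)     # e.g., word-vector text features (assumed dims)

    img_net, txt_net = SelfNetwork(2048), SelfNetwork(300)
    others = OthersNetwork()

    z_img, z_txt = img_net(image_feat), txt_net(text_feat)
    prototypes = torch.cat([z_img, z_txt], dim=0)   # prototypes from all samples in the episode
    z_img_refined = others(z_img, prototypes)
    z_txt_refined = others(z_txt, prototypes)
    print(z_img_refined.shape, z_txt_refined.shape)  # torch.Size([8, 256]) each
```

In this reading, the refined image and text representations would then be compared (e.g., by cosine similarity) for retrieval; the exact loss and episode construction are left open here since the abstract does not specify them.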