Abstract

Learning common representations for different modalities of data is the key component in cross-modal retrieval. Most existing deep approaches learn multiple networks that independently project each sample into a common representation. However, each representation is extracted only from the corresponding sample, which entirely ignores its relationships with other data. It is therefore challenging to learn effective common representations when supervised multi-modal training data are scarce, as in few-shot cross-modal retrieval, and how to efficiently exploit the information contained in other examples remains underexplored. In this work, we present the Self-Others Net, a few-shot cross-modal retrieval model that fully exploits the information contained in both the sample itself and other samples. First, we propose a self-network to fully exploit the correlations latent in the data itself: it integrates features from different layers and extracts multi-level information. Second, an others-network models the relationships among all samples: it learns a Mahalanobis tensor and mixes the prototypes of all data to capture non-linear dependencies for common representation learning. Extensive experiments on three benchmark datasets demonstrate clear improvements of the proposed method over state-of-the-art methods.

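To make the two components concrete, the sketch below shows one way a self-network (multi-level feature fusion per modality) and an others-network (prototype mixing under a learned Mahalanobis-style metric) could fit together in PyTorch. The class names, layer sizes, the residual mixing rule, and the factorization M = LᵀL are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a Self-Others-style model (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfNetwork(nn.Module):
    """Projects one modality into the common space, fusing multi-level features."""

    def __init__(self, in_dim: int, hidden_dim: int = 512, common_dim: int = 256):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # Fuse the two intermediate levels before projecting to the common space.
        self.fuse = nn.Linear(2 * hidden_dim, common_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.layer1(x)                      # lower-level features
        h2 = self.layer2(h1)                     # higher-level features
        fused = torch.cat([h1, h2], dim=-1)      # multi-level fusion
        return F.normalize(self.fuse(fused), dim=-1)


class OthersNetwork(nn.Module):
    """Refines each representation by mixing the prototypes of all samples,
    weighted by a learned Mahalanobis-style similarity."""

    def __init__(self, common_dim: int = 256):
        super().__init__()
        # Learnable factor L; using M = L^T L keeps the metric positive semi-definite.
        self.L = nn.Parameter(torch.eye(common_dim))

    def forward(self, z: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
        # Mahalanobis-style distance d(z, p) = ||L (z - p)||^2
        diff = z.unsqueeze(1) - prototypes.unsqueeze(0)   # (B, P, D)
        proj = diff @ self.L.t()                          # (B, P, D)
        dist = (proj ** 2).sum(-1)                        # (B, P)
        weights = F.softmax(-dist, dim=-1)                # closer prototypes weigh more
        mixed = weights @ prototypes                      # (B, D)
        return F.normalize(z + mixed, dim=-1)             # residual refinement


if __name__ == "__main__":
    image_feat = torch.randn(8, 2048)   # e.g., CNN image features (assumed dims)
    text_feat = torch.randn(8, 300)     # e.g., word-vector text features (assumed dims)

    img_net, txt_net = SelfNetwork(2048), SelfNetwork(300)
    others = OthersNetwork()

    z_img, z_txt = img_net(image_feat), txt_net(text_feat)
    prototypes = torch.cat([z_img, z_txt], dim=0)   # prototypes from all samples in the episode
    z_img_refined = others(z_img, prototypes)
    z_txt_refined = others(z_txt, prototypes)
    print(z_img_refined.shape, z_txt_refined.shape)  # torch.Size([8, 256]) each
```

In this reading, the refined image and text representations would then be compared (e.g., by cosine similarity) for retrieval; the exact loss and episode construction are left open here since the abstract does not specify them.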