Abstract

Cross-modal recognition and matching with privileged information are important and challenging problems in computer vision. The cross-modal scenario deals with matching across different modalities and must account for the large variations present both across and within each modality. The privileged-information scenario deals with the situation in which not all information available during training is available at testing time; hence, algorithms must leverage this extra information during the training stage itself. We show that for multi-modal data, either of these situations may arise if one modality is absent during testing. Here, we propose a novel framework that handles both scenarios seamlessly, with applications to matching multi-modal data. The proposed approach jointly uses data from the two modalities to build a canonical representation that encompasses information from both modalities. We explore four different types of canonical representations for different types of data. The algorithm computes dictionaries and a canonical representation for data from both modalities, such that the transformed sparse coefficients of both modalities are equal to those of the canonical representation. The sparse coefficients are finally matched using a Mahalanobis metric. Extensive experiments on different data sets, involving RGB-D, text-image, and audio-image data, show the effectiveness of the proposed framework.
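
To make the final matching step concrete, the following is a minimal sketch of comparing sparse coefficients from two modalities with a Mahalanobis metric. It is illustrative only: the variable names (alpha_gallery, alpha_probe, M) are hypothetical, the coefficients and metric matrix are random stand-ins, and the learned dictionaries and canonical representation from the paper's training stage are not shown.

```python
import numpy as np

# Hypothetical setup: sparse codes for a gallery (modality 1) and probes
# (modality 2), plus a Mahalanobis matrix M. In the actual framework these
# would come from the learned dictionaries and metric; here they are random.
rng = np.random.default_rng(0)
d = 64                        # sparse-code dimensionality (assumed)
n_gallery, n_probe = 100, 10

alpha_gallery = rng.standard_normal((n_gallery, d))   # modality-1 codes
alpha_probe = rng.standard_normal((n_probe, d))        # modality-2 codes

# Random symmetric positive-definite stand-in for a learned metric.
A = rng.standard_normal((d, d))
M = A @ A.T + np.eye(d)

def mahalanobis_dist(x, Y, M):
    """Squared Mahalanobis distance from vector x to every row of Y."""
    diff = Y - x
    return np.einsum('ij,jk,ik->i', diff, M, diff)

# Nearest-gallery match for each probe code under the metric.
matches = [int(np.argmin(mahalanobis_dist(x, alpha_gallery, M)))
           for x in alpha_probe]
print(matches)
```

Matching in the shared sparse-coefficient space is what lets a probe from one modality be compared against a gallery from another, since both sets of coefficients are constrained to agree with the canonical representation during training.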
