Multimodal metric learning aims to transform heterogeneous data into a common subspace where cross-modal similarity computation can be performed directly, and it has received much attention in recent years. Typically, existing methods are designed for nonhierarchically labeled data. Such methods fail to exploit the intercategory correlations in the label hierarchy and therefore cannot achieve optimal performance on hierarchically labeled data. To address this problem, we propose a novel metric learning method for hierarchically labeled multimodal data, named deep hierarchical multimodal metric learning (DHMML). It learns multilayer representations for each modality by establishing a layer-specific network corresponding to each layer in the label hierarchy. In particular, a multilayer classification mechanism is introduced to enable the layerwise representations not only to preserve the semantic similarities within each layer but also to retain the intercategory correlations across different layers. In addition, an adversarial learning mechanism is proposed to bridge the cross-modality gap by producing indistinguishable features for different modalities. By integrating the multilayer classification and adversarial learning mechanisms, DHMML obtains hierarchical, discriminative, modality-invariant representations for multimodal data. Experiments on two benchmark datasets demonstrate the superiority of the proposed DHMML method over several state-of-the-art methods.
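The abstract describes the architecture only at a high level. The PyTorch sketch below is one possible way to wire together the components it names: layer-specific encoders per modality and per hierarchy layer, one classifier per layer (the multilayer classification mechanism), and a binary modality discriminator for the adversarial mechanism. All module names, dimensions, and the choice of PyTorch are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the components described in the DHMML abstract.
# Hypothetical structure; dimensions and layer sizes are placeholders.
import torch
import torch.nn as nn


class LayerSpecificEncoder(nn.Module):
    """Maps one modality's features into the common subspace for one hierarchy layer."""
    def __init__(self, in_dim, common_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, common_dim))

    def forward(self, x):
        return self.net(x)


class DHMMLSketch(nn.Module):
    def __init__(self, img_dim, txt_dim, common_dim, classes_per_layer):
        super().__init__()
        num_layers = len(classes_per_layer)  # layers of the label hierarchy
        # one encoder per modality and per hierarchy layer
        self.img_encoders = nn.ModuleList(
            [LayerSpecificEncoder(img_dim, common_dim) for _ in range(num_layers)])
        self.txt_encoders = nn.ModuleList(
            [LayerSpecificEncoder(txt_dim, common_dim) for _ in range(num_layers)])
        # multilayer classification mechanism: one classifier per hierarchy layer
        self.classifiers = nn.ModuleList(
            [nn.Linear(common_dim, c) for c in classes_per_layer])
        # modality discriminator (image vs. text) for adversarial learning
        self.discriminator = nn.Sequential(
            nn.Linear(common_dim, 256), nn.ReLU(),
            nn.Linear(256, 2))

    def forward(self, img, txt):
        # per-layer common-space representations for each modality
        img_feats = [enc(img) for enc in self.img_encoders]
        txt_feats = [enc(txt) for enc in self.txt_encoders]
        return img_feats, txt_feats
```

Under this reading, training would combine a cross-entropy classification loss at each hierarchy layer (using the labels of that layer) with an adversarial objective in which the encoders try to make image and text representations indistinguishable to the discriminator, e.g. via alternating updates or a gradient reversal layer. The exact losses and how cross-layer intercategory correlations are enforced are not specified in the abstract.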