Abstract
Bridging visual and textual representations plays a central role in multimedia data understanding. The main challenge is that images and texts reside in heterogeneous spaces, which makes it difficult to preserve semantic consistency between the two modalities. To narrow the modality gap, most recent methods resort to extra object detectors or parsers to obtain hierarchical representations. In this work, we address this problem with our Multi-Task Hierarchical Convolutional Neural Network (MT-HCN), which mines hierarchical semantic information without any extra supervision. First, from the perspective of representation architecture, we leverage the intrinsic hierarchical structure of Convolutional Neural Networks (CNNs) to decompose the representations of both modalities into two semantically complementary levels, i.e., exterior representations and concept representations. The former focus on discovering fine-grained low-level associations between the two modalities, while the latter emphasize capturing higher-level abstract semantics. Specifically, we present a Self-Supervised Clustering (SSC) loss to preserve more fine-grained semantic clues in the exterior representations. It is built by treating multiple image/text pairs with similar exteriors as one category. In addition, we propose a novel harmonious bidirectional triplet ranking (HBTR) loss, which mitigates the adverse effects of biased and noisy negative samples. Besides the hardest negatives, it also constrains the distance between the positive pairs and the centroid of the negative pairs. Extensive experimental results on two popular cross-modal retrieval benchmarks demonstrate that the proposed MT-HCN achieves competitive results compared with state-of-the-art methods.
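The abstract describes the HBTR loss only in words, so the following is a minimal sketch of one way such a loss could be written, not the authors' formulation. It combines a hardest-negative hinge term with a second hinge term against the negatives' centroid, in both retrieval directions. The function name `hbtr_loss_sketch`, the parameters `margin` and `weight`, the use of cosine similarity, and the use of the mean negative similarity as a stand-in for the centroid of negative pairs are all assumptions made for illustration.

```python
import numpy as np


def hbtr_loss_sketch(img, txt, margin=0.2, weight=1.0):
    """Illustrative bidirectional triplet ranking loss (assumed form).

    img, txt: (n, d) arrays with n >= 2; row i of img is paired with row i of txt.
    margin, weight: hypothetical hyperparameters, not taken from the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

    sim = img @ txt.T                      # sim[i, j]: image i vs. text j
    n = sim.shape[0]
    pos = np.diag(sim)                     # similarities of the matched pairs
    neg_mask = ~np.eye(n, dtype=bool)      # True for every mismatched pair

    # Hardest negative per query: the most similar mismatched item.
    masked = np.where(neg_mask, sim, -np.inf)
    hard_i2t = masked.max(axis=1)          # image -> text direction
    hard_t2i = masked.max(axis=0)          # text -> image direction

    # Assumed centroid term: mean similarity to all negatives, which equals
    # the similarity to the (unnormalised) mean of the negative embeddings.
    neg_only = np.where(neg_mask, sim, 0.0)
    cent_i2t = neg_only.sum(axis=1) / (n - 1)
    cent_t2i = neg_only.sum(axis=0) / (n - 1)

    # Hinge on both terms, in both directions.
    hard_term = (np.maximum(0.0, margin + hard_i2t - pos)
                 + np.maximum(0.0, margin + hard_t2i - pos))
    cent_term = (np.maximum(0.0, margin + cent_i2t - pos)
                 + np.maximum(0.0, margin + cent_t2i - pos))
    return float(np.mean(hard_term + weight * cent_term))
```

In this sketch the centroid term averages over all negatives in the batch, so a single noisy or mislabeled negative has less influence on it than on the hardest-negative term. This is one plausible reading of how the loss reduces the effect of biased and noisy negatives.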