Abstract

Effective visual-semantic representation is critical to the image-text matching task. Various methods have been proposed to enrich image representations with semantic concepts, and considerable progress has been achieved. However, the internal hierarchical structure of both images and text, which could effectively enhance the semantic representation, is rarely explored in image-text matching. In this work, we propose a Hierarchical Visual-Semantic Network (HVSN) with fine-grained semantic alignment to exploit this hierarchical structure. Specifically, we first model the spatial or semantic relationships between objects and aggregate them into visual semantic concepts with the Local Relational Attention (LRA) module. We then employ a Gated Recurrent Unit (GRU) to learn the relationships between visual semantic concepts and generate the global image representation. On the text side, we compose phrase features from related words and then generate the text representation by learning the relationships between these phrases. In addition, the model is trained by jointly optimizing the image-text retrieval and phrase alignment tasks to capture the fine-grained interplay between vision and language. Our approach achieves state-of-the-art performance on the Flickr30K and MS-COCO datasets. On Flickr30K, it outperforms the current state-of-the-art method by a relative 3.9% on text retrieval with an image query and 1.3% on image retrieval with a text query (Recall@1). On MS-COCO, HVSN improves image retrieval by a relative 2.3% and text retrieval by 1.2%. Both quantitative and visual ablation studies are provided to verify the effectiveness of the proposed modules.
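To make the image-side pipeline described above concrete, the following is a minimal sketch in PyTorch of how object regions could be aggregated into visual semantic concepts by an attention module and then summarized by a GRU. The layer names, dimensions, and the specific attention form are illustrative assumptions, not the authors' exact implementation, which is detailed in the full paper.

```python
# Minimal sketch (PyTorch), assuming precomputed region features of shape
# (batch, num_regions, dim), e.g. from an object detector. All hyperparameters
# below are placeholders, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalRelationalAttention(nn.Module):
    """Aggregate object regions into a smaller set of visual semantic concepts."""

    def __init__(self, dim: int, num_concepts: int):
        super().__init__()
        # Learnable concept queries attend over the region features.
        self.concept_queries = nn.Parameter(torch.randn(num_concepts, dim))
        self.key_proj = nn.Linear(dim, dim)
        self.value_proj = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, D)
        keys = self.key_proj(regions)
        values = self.value_proj(regions)
        # Attention weights between each concept query and every region.
        scores = torch.einsum("cd,brd->bcr", self.concept_queries, keys)
        attn = F.softmax(scores / keys.size(-1) ** 0.5, dim=-1)
        # Each concept is an attention-weighted mixture of related regions.
        return torch.einsum("bcr,brd->bcd", attn, values)  # (B, C, D)


class ImageEncoderSketch(nn.Module):
    """Image side: regions -> concepts (attention) -> GRU -> global embedding."""

    def __init__(self, dim: int = 1024, num_concepts: int = 8):
        super().__init__()
        self.lra = LocalRelationalAttention(dim, num_concepts)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        concepts = self.lra(regions)            # (B, C, D)
        _, hidden = self.gru(concepts)          # relate concepts sequentially
        return F.normalize(hidden.squeeze(0), dim=-1)  # global image embedding


# Usage: 36 detected regions with 1024-d features per image.
encoder = ImageEncoderSketch(dim=1024, num_concepts=8)
image_embedding = encoder(torch.randn(2, 36, 1024))  # (2, 1024)
```

The text side would follow the same two-level pattern under this sketch: compose word features into phrase features, then relate the phrases to form the sentence embedding, with the joint retrieval and phrase-alignment losses applied on top.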
