Image–text retrieval is an important form of cross-modal retrieval that has recently attracted much attention. When aggregating the features of image or text fragments (regions in an image or words in a sentence), existing image–text retrieval methods often ignore the relative importance of each fragment to the global semantics of the image or text, which yields less effective image and text representations. To address this problem, we propose an image–text retrieval method named Global-aware Fragment Representation Aggregation Network (GFRAN). Specifically, GFRAN first uses a fine-grained multimodal information interaction module, based on the self-attention mechanism, to model both the intra-modality and inter-modality relationships between image regions and words. Then, guided by the global image or text feature, it aggregates the image or text fragment features according to their attention weights over the global feature, highlighting the fragments that contribute more to the overall semantics of the image or text. Extensive experiments on two benchmark datasets, Flickr30K and MS-COCO, show that the proposed GFRAN model outperforms several state-of-the-art baselines.
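
As a rough illustration of the global-aware aggregation step described above, the following minimal sketch weights each fragment by its attention score against a global feature and sums the result. It is not the GFRAN implementation: the abstract does not specify how the global feature or the attention scores are computed, so the mean-pooled global feature, the scaled dot-product scoring, and the function names `softmax` and `global_aware_aggregate` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def global_aware_aggregate(fragments, global_feature=None):
    """Aggregate fragment features using attention weights over a global feature.

    fragments: array of shape (n_fragments, dim), e.g. region or word features.
    global_feature: array of shape (dim,). ASSUMPTION: if not given, the mean of
    the fragments is used as the global feature; the paper may define it differently.
    """
    fragments = np.asarray(fragments, dtype=float)
    if global_feature is None:
        global_feature = fragments.mean(axis=0)
    # ASSUMPTION: scaled dot-product scoring between each fragment and the global feature.
    scores = fragments @ global_feature / np.sqrt(fragments.shape[1])
    weights = softmax(scores)
    # Weighted sum: fragments closer to the global semantics contribute more.
    aggregated = weights @ fragments
    return aggregated, weights


# Toy example: two fragments agree with each other, one differs.
toy_fragments = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_aggregated, toy_weights = global_aware_aggregate(toy_fragments)
```

In the toy example, the two fragments that align with the mean-pooled global feature receive larger weights than the third, so they dominate the aggregated representation.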