Abstract

The rapid development of sensor technology and mobile devices has brought a flood of social images, and large-scale social image collections have attracted increasing attention from researchers. Existing approaches generally rely on recognizing object instances individually with geo-tags, visual patterns, etc. However, a social image represents a web of interconnected relations; these relations between entities carry semantic meaning and help a viewer differentiate between object instances. This article approaches the joint learning of social images from the perspective of spatial relationships. Specifically, the model consists of three parts: (a) a module for deep semantic understanding of images based on a residual network (ResNet); (b) a deep semantic analysis module for text that goes beyond traditional bag-of-words methods; and (c) a joint reasoning module in which text weights are obtained from image features via self-attention, together with a novel tree-based clustering algorithm. Experimental results on the Flickr30k and Microsoft COCO datasets demonstrate the effectiveness of the approach; moreover, our method takes spatial relations into account during matching.
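For concreteness, the following is a minimal sketch of the three modules described above, assuming a PyTorch/torchvision setup. The class names, dimensions, and the exact attention form are illustrative assumptions rather than the authors' implementation, and the tree-based clustering step is omitted.

# Minimal sketch (assumed, not the authors' code) of the three-module pipeline:
# (a) ResNet image encoder, (b) sequence-aware text encoder (beyond bag-of-words),
# (c) joint reasoning where image features attend over word features to produce text weights.
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """(a) Deep semantic image features from a pre-trained ResNet."""
    def __init__(self, embed_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.proj = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)    # (B, 2048) pooled features
        return self.proj(feats)                     # (B, embed_dim)

class TextEncoder(nn.Module):
    """(b) Sequence-aware word features instead of a bag-of-words representation."""
    def __init__(self, vocab_size, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, T)
        word_feats, _ = self.gru(self.embed(tokens))
        return word_feats                           # (B, T, embed_dim)

class JointReasoning(nn.Module):
    """(c) Image-conditioned attention weights over words (hypothetical form)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.scale = embed_dim ** -0.5

    def forward(self, img_feat, word_feats):        # (B, D), (B, T, D)
        # The image feature acts as the query; the words act as keys and values.
        attn = torch.softmax(
            torch.bmm(word_feats, img_feat.unsqueeze(2)).squeeze(2) * self.scale, dim=1
        )                                           # (B, T) text weights
        text_feat = torch.bmm(attn.unsqueeze(1), word_feats).squeeze(1)  # (B, D) weighted text feature
        return text_feat, attn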

Highlights

  • With the rise of cheap sensors, mobile terminals, and social networks, research on social images is making good progress, including image retrieval, object classification, and scene understanding

  • We aim to develop a method that learns the spatial relations across separate visual objects and texts for social image understanding. Therefore, this paper proposes a cross-modal framework that builds a joint model of texts and images to extract features and combines the advantages of the self-attention mechanism and deep learning models, generating interactive effects

  • We focus on two image-text tasks: spatial relation modeling and image-text matching. The former covers both image-to-image and image-to-text scenarios, and the definitions are straightforward: given an input image, the goal is to find the semantically meaningful relationships between entities. The second task is to find the sentences that best match the input image (a minimal matching sketch follows this list)
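The sketch below illustrates the second task under simple assumptions: given an image embedding and candidate sentence embeddings (e.g., produced by the encoders sketched earlier), sentences are ranked by cosine similarity. The function name and top-k choice are illustrative, not taken from the paper.

# Minimal sketch (assumed) of image-to-text matching by cosine-similarity ranking.
import torch
import torch.nn.functional as F

def rank_sentences(img_feat: torch.Tensor, sent_feats: torch.Tensor, top_k: int = 5):
    """img_feat: (D,) image embedding; sent_feats: (N, D) candidate sentence embeddings."""
    img = F.normalize(img_feat.unsqueeze(0), dim=-1)    # (1, D) unit-length image vector
    sents = F.normalize(sent_feats, dim=-1)             # (N, D) unit-length sentence vectors
    sims = (sents @ img.t()).squeeze(1)                 # (N,) cosine similarities
    scores, indices = sims.topk(min(top_k, sents.size(0)))
    return indices.tolist(), scores.tolist()

# Usage: best_ids, best_scores = rank_sentences(image_embedding, sentence_embeddings)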


Summary

Introduction

With the rise of cheap sensors, mobile terminals, and social networks, research on social images is making good progress, including image retrieval, object classification, and scene understanding. Wang et al. [5] present an algorithm to learn the relations between scenes, objects, and texts with the help of image-level labels. Such a training process requires a large amount of paired image and text data, and spatial relationships in textual descriptions are very scarce in reality. Motivated by these observations, we aim to develop a method that learns the spatial relations across separate visual objects and texts for social image understanding. Previously proposed methods usually require additional annotations of relations, while ours demands only image-level annotations.

Cross-Modal Reasoning Framework
Similarity Network
Cross-Modal Matching
Experiments and Results
Comparison with the State-of-the-Art
