Abstract

Dense captioning is a recent image-understanding task in which a model must simultaneously locate a salient region in an image and describe it. This naturally divides a dense captioning model into two parts: one that detects regions of interest and one that generates a language caption for each region. Previous methods handle these two parts in a relatively simple way, predicting object coordinates from the feature map of the last convolutional layer of an RPN and modeling regional captions with an LSTM. However, the RPN structure is insufficient to cope with the large number of objects in complex datasets, and the LSTM fails to effectively exploit global image information during regional caption training, which leaves room to improve dense captioning performance. In this paper, we propose a novel Cross-scale Fusion with Global Attribute model (CSGA) that allows the two parts of the dense captioning model to be trained end-to-end without interfering with each other. Furthermore, our model uses a one-stage object detector with a feature-map fusion operation across multiple detection scales to improve the object detection part, and combines image features with a global high-level attribute to improve regional caption training. We design a variety of model architectures and conduct extensive experiments. Empirical results on the Visual Genome dataset show that our model achieves competitive performance, with an mAP of 8.33.
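To make the two ideas in the abstract concrete, the sketch below shows one plausible PyTorch realization of (a) fusing a coarse feature map into a finer detection scale in a one-stage detector, and (b) an LSTM caption decoder conditioned on both region features and a global attribute vector. This is a minimal illustration, not the paper's exact architecture: the channel counts, the 1x1 lateral convolution, nearest-neighbour upsampling, and concatenation-based fusion are all assumed design choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Fuse a coarser (deeper) feature map into a finer detection scale.

    Hypothetical sketch: upsample the coarse map, project its channels
    with a 1x1 conv, and merge by element-wise addition.
    """
    def __init__(self, coarse_ch, fine_ch):
        super().__init__()
        self.lateral = nn.Conv2d(coarse_ch, fine_ch, kernel_size=1)

    def forward(self, fine, coarse):
        # Resize the coarse map to the fine map's spatial resolution.
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
        return fine + self.lateral(up)

class GlobalAttributeCaptioner(nn.Module):
    """LSTM caption decoder conditioned on region features plus a
    global high-level attribute vector.

    Hypothetical sketch: feeding the fused region/global vector as the
    first LSTM input is one plausible way to inject global context.
    """
    def __init__(self, region_dim, attr_dim, vocab_size,
                 hidden=512, embed=512):
        super().__init__()
        self.fuse = nn.Linear(region_dim + attr_dim, embed)
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, region_feat, global_attr, captions):
        # Step 0: the fused visual vector primes the LSTM.
        visual = self.fuse(torch.cat([region_feat, global_attr], dim=-1))
        tokens = self.embed(captions)                        # (B, T, E)
        inputs = torch.cat([visual.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)                        # (B, T+1, H)
        return self.out(hidden[:, 1:])                       # word logits
```

In this scheme the detector's fused multi-scale maps supply the region features, while an attribute predictor over the whole image supplies `global_attr`, so the caption decoder sees both local and global evidence at every step of training.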
