Abstract
Recently, image attributes containing high-level semantic information have been widely used in computer vision tasks, including visual recognition and image captioning. Existing attribute extraction methods map visual concepts to the probabilities of frequently used words by directly applying Convolutional Neural Networks (CNNs). These methods have two main problems. First, words of different parts of speech (POSs) are handled in the same way, although non-nominal words can hardly be mapped to visual regions by CNNs alone. Second, synonymous nominal words are treated as independent, distinct words, so the similarities among them are ignored. In this paper, a novel Refined Universal Detection (RUDet) method is proposed to solve these two problems. Specifically, a Refinement (RF) module is designed to extract refined attributes of non-nominal words based on the attributes of nominal words and visual features. In addition, a Word Tree (WT) module is constructed to integrate synonymous nouns, which ensures that similar words hold similar and more accurate probabilities. Moreover, a Feature Enhancement (FE) module is adopted to enhance the ability to mine different visual concepts at different scales. Experiments conducted on the large-scale Microsoft (MS) COCO dataset demonstrate the effectiveness of the proposed method.
Highlights
Attribute extraction is an important process in various computer vision tasks
Quantitative results: following [3], Average Precision (AP), a metric for multi-label classification problems, is used in the evaluation
The precision and recall values are calculated from the numbers of true positive and false positive instances at different probability thresholds
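The evaluation described above can be sketched as follows for a single attribute word. This is a minimal illustration, not the paper's evaluation code: it assumes the common non-interpolated form of AP (the mean of the precision values at the rank of each positive sample), since the exact variant used in [3] is not stated here. The function names `precision_recall_points` and `average_precision` are hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float, int]]:
    """Sweep the probability threshold from high to low.

    Each sample's score is used as one threshold. For every threshold,
    return (precision, recall, label of the sample just admitted).
    """
    num_pos = sum(labels)
    if num_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank samples by predicted probability, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    points = []
    tp = fp = 0
    for i in order:
        if labels[i] == 1:
            tp += 1  # true positive: predicted present, actually present
        else:
            fp += 1  # false positive: predicted present, actually absent
        precision = tp / (tp + fp)
        recall = tp / num_pos
        points.append((precision, recall, labels[i]))
    return points


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP (assumed variant): mean precision at each positive."""
    points = precision_recall_points(scores, labels)
    num_pos = sum(labels)
    return sum(p for p, _, is_pos in points if is_pos == 1) / num_pos


if __name__ == "__main__":
    # Hypothetical probabilities for one attribute over four images.
    scores = [0.9, 0.8, 0.7, 0.6]
    labels = [1, 0, 1, 0]  # 1 = the word applies to the image
    print(average_precision(scores, labels))
```

In a multi-label setting, this per-attribute value would typically be computed for every word in the vocabulary and then averaged, although the averaging scheme is likewise an assumption here.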
Summary
Attribute extraction is an important process in various computer vision tasks. For example, high probabilities for the words “man”, “food”, “eating”, and “delicious” indicate that the image probably shows a man eating delicious food. This shows that attributes contain high-level semantic information. Attributes are especially important in image captioning, the process of generating natural-language sentence descriptions of a given image based on its objects, together with their actions and relationships. Recent work shows that attributes containing high-level semantic information can significantly improve the performance of caption generation [3,4].