Abstract
Visual relationship detection (VRD), a challenging task in image understanding, suffers from a vague connection between relationship patterns and visual appearance. This issue is caused by the high diversity of relationship-independent visual appearance, where inexplicit and redundant cues may not contribute to relationship detection and may even confuse the detector. Previous relationship detection models have made remarkable progress by leveraging external textual information or scene-level interaction to complement relationship detection cues. In this work, we propose the Contextual Coefficients Excitation Feature (CCEF), a focal visual representation that is adaptively recalibrated from the original visual feature responses by explicitly modeling the interdependencies between features and their contextual coefficients. Specifically, the contextual coefficients are obtained by computing both spatial coefficients and generated-label ones. In addition, a conditional Wasserstein Generative Adversarial Network (WGAN) regularized with a relationship classification loss is designed to alleviate the inadequate training of generated-label coefficients caused by the long-tail distribution of relationships. Experimental results demonstrate the effectiveness of our method on relationship detection. In particular, our method improves the recall of predicting unseen relationships from the zero-shot set from 8.5% to 23.2%.
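The recalibration idea described above can be illustrated with a minimal sketch. This is not the authors' exact formulation: the way the two coefficient sources are combined (multiplicatively) and the sigmoid gating are assumptions here, in the spirit of excitation-style feature recalibration; `excite`, `spatial_coef`, and `label_coef` are hypothetical names.

```python
import numpy as np

def excite(features, spatial_coef, label_coef):
    """Recalibrate visual feature responses with contextual coefficients.

    Illustrative sketch only: we assume the contextual coefficient
    combines the spatial and generated-label coefficients
    multiplicatively, then gates the original features through a
    sigmoid so that uninformative channels are suppressed.
    """
    coef = spatial_coef * label_coef       # combine the two coefficient sources
    gate = 1.0 / (1.0 + np.exp(-coef))     # squash to (0, 1)
    return features * gate                 # channel-wise excitation

# Toy example: a 4-dim visual feature with per-channel coefficients.
f = np.array([1.0, 2.0, 3.0, 4.0])
s = np.array([0.5, 0.1, 0.9, 0.2])   # hypothetical spatial coefficients
g = np.array([1.0, 0.0, 1.0, 0.5])   # hypothetical generated-label coefficients
ccef = excite(f, s, g)
```

Because the gate lies in (0, 1), the recalibrated feature keeps the original responses' signs while scaling down channels whose contextual coefficients are small.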
Highlights
With the rapid development of deep learning and image recognition [1,2,3,4,5], visual relationship detection [6], a higher-level visual understanding task, has become a popular research topic
We propose a novel Contextual Coefficients Excitation Feature (CCEF), which is a focal representation based on a new relationship space
CCEF (Ours − A + S + G): Uses the visual representation described in Section 3.1
Summary
With the rapid development of deep learning and image recognition [1,2,3,4,5], visual relationship detection [6], a higher-level visual understanding task, has become a popular research topic. Visual relationship detection aims to recognize the various visually observable predicates between a subject and an object, where the subject and object are a pair of objects in the image. The task is challenging because most existing relationship detection methods [8,9] treat each type of relationship predicate as a class, leading to high diversity of visual appearance that varies greatly across relationship instances. This visual diversity weakens the correlation between relationship predicates and visual appearance and confuses the detector.