Abstract

Referring expression comprehension (REC) requires locating the image region referred to by an expression; one of the key challenges is distinguishing the correct object from others of the same category using the described relationships. Existing two-stage methods explicitly establish visual relationships among objects based on spatial information, including location and scale. This paper investigates the role of relational features. We find that the predicted result often becomes incorrect when the scale of a region changes: the trained model statistically tends to predict larger regions and performs worse on objects of smaller scales. To alleviate this problem, we propose a Scale-Insensitive Network (SINet) that improves robustness to scale information during visual relational feature modeling. Specifically, a category-wise random pooling module is designed to efficiently change object scales, and SINet takes the original and resized regions as inputs simultaneously. We introduce a consistency loss that trains the model to remain correct under different scales. Our method can be integrated into existing two-stage methods to reduce their dependence on scale information and promote their utilization of key visual features. Extensive experiments on three commonly used datasets, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the superiority of SINet over state-of-the-art two-stage methods in terms of REC accuracy.
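The abstract does not give the training objective in closed form; as a minimal sketch under our own assumptions (the rescaled region $\tilde{r}$, the weight $\lambda$, and the choice of a symmetric KL divergence are all hypothetical, not taken from the paper), the scale-consistency training could take a form such as

$$\mathcal{L} = \mathcal{L}_{\mathrm{REC}}(r) + \mathcal{L}_{\mathrm{REC}}(\tilde{r}) + \lambda\left[ D_{\mathrm{KL}}\!\left(p(\cdot \mid r)\,\|\,p(\cdot \mid \tilde{r})\right) + D_{\mathrm{KL}}\!\left(p(\cdot \mid \tilde{r})\,\|\,p(\cdot \mid r)\right) \right],$$

where $p(\cdot \mid r)$ is the model's matching distribution over candidate regions given the original region $r$, and $\tilde{r}$ is its counterpart after category-wise random pooling; the consistency term penalizes predictions that change when only the scale changes.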
