Abstract

Neural Referring Expression Generation (REG) models have shown promising results in generating expressions that uniquely describe visual objects. However, current REG models still lack the ability to produce diverse and unambiguous referring expressions (REs). To address the lack of diversity, we propose generating a set of diverse REs rather than a single one-shot RE. To reduce the ambiguity of REs, we directly optimise non-differentiable test metrics using reinforcement learning (RL), and we show that our approaches achieve better results under multiple different settings. Specifically, we first present a novel RL approach to REG training which, instead of drawing one sample per input, averages over multiple samples to normalise the reward during RL training. Secondly, we present an innovative REG model that utilises an object attention mechanism to explicitly incorporate information about the target object, optimised with our proposed RL approach. Thirdly, we propose a novel transformer model, optimised with RL, that exploits different levels of visual information. Our human evaluation demonstrates the effectiveness of this model: we improve the state-of-the-art task-success results on RefCOCO testA and testB from 76.95% to 81.66% and from 78.10% to 83.33% respectively, and on RefCOCO+ testA from 58.85% to 83.33%. Finally, we present a thorough comparison of diverse decoding strategies (sampling- and maximisation-based) and how they control the trade-off between quality and diversity.
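The multi-sample reward normalisation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the general idea of centring each sample's reward by the mean over the samples drawn for the same input, which serves as a baseline without requiring a separate value network. All names and values are illustrative.

```python
def advantages(rewards):
    """Centre each sampled RE's reward by the mean reward over the
    k samples drawn for the same input, yielding the advantage used
    in the policy-gradient update (a self-critical-style baseline)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Toy example: k = 4 sampled referring expressions for one target object,
# each scored by a non-differentiable metric (e.g. CIDEr or task success).
rewards = [0.2, 0.5, 0.8, 0.5]
adv = advantages(rewards)
# The per-sample policy-gradient loss would then be -adv[i] * log p(sample_i),
# so above-average samples are reinforced and below-average ones suppressed.
```

Because the baseline is the empirical mean of the same batch of samples, the advantages sum to zero for each input, which reduces the variance of the gradient estimate.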
