Abstract

We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative ‘image guessing’ game between two agents – Q-BOT and A-BOT – who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end – from pixels to multi-agent multi-round dialog to game reward.

We demonstrate two experimental results.

First, as a ‘sanity check’ demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabularies, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that the two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.

Second, we conduct large-scale real-image experiments on the VisDial dataset [5], where we pretrain on dialog data with supervised learning (SL) and show that the RL-finetuned agents significantly outperform supervised pretraining. Interestingly, the RL Q-BOT learns to ask questions that A-BOT is good at, ultimately resulting in more informative dialog and a better team.
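
To make the cooperative game concrete, here is a minimal, self-contained PyTorch toy (not the authors' code) in the spirit of the synthetic-world ‘sanity check’: each image is a set of one-hot attributes, Q-BOT and A-BOT exchange ungrounded symbols for two rounds, Q-BOT then guesses the attributes, and both policies are trained jointly with REINFORCE on the shared game reward. All names (QBot, ABot, play_episode), vocabulary sizes, and the two-round protocol are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_Q, VOCAB_A, N_ATTR, N_VALS, ROUNDS, HIDDEN = 3, 4, 3, 4, 2, 32

class QBot(nn.Module):
    """Questioner: never sees the image; asks symbols, then guesses attributes."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(VOCAB_Q + VOCAB_A, HIDDEN)
        self.speak = nn.Linear(HIDDEN, VOCAB_Q)          # question-symbol policy
        self.guess = nn.Linear(HIDDEN, N_ATTR * N_VALS)  # final guess policy

class ABot(nn.Module):
    """Answerer: sees the image; replies with one symbol per round."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(VOCAB_Q + N_ATTR * N_VALS, HIDDEN)
        self.speak = nn.Linear(HIDDEN, VOCAB_A)          # answer-symbol policy

def play_episode(qbot, abot, image):
    """One game: ROUNDS of Q/A dialog, then Q-BOT guesses all attributes."""
    hq, ha = torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)
    last_q, last_a = torch.zeros(1, VOCAB_Q), torch.zeros(1, VOCAB_A)
    logprob = torch.zeros(1)
    for _ in range(ROUNDS):
        # Q-BOT samples a question symbol from its dialog state.
        hq = qbot.rnn(torch.cat([last_q, last_a], dim=1), hq)
        q_dist = torch.distributions.Categorical(logits=qbot.speak(hq))
        q_sym = q_dist.sample()
        logprob = logprob + q_dist.log_prob(q_sym)
        last_q = F.one_hot(q_sym, VOCAB_Q).float()
        # A-BOT samples an answer symbol conditioned on the question and image.
        ha = abot.rnn(torch.cat([last_q, image.unsqueeze(0)], dim=1), ha)
        a_dist = torch.distributions.Categorical(logits=abot.speak(ha))
        a_sym = a_dist.sample()
        logprob = logprob + a_dist.log_prob(a_sym)
        last_a = F.one_hot(a_sym, VOCAB_A).float()
    # After the dialog, Q-BOT samples one guess per attribute.
    hq = qbot.rnn(torch.cat([last_q, last_a], dim=1), hq)
    g_dist = torch.distributions.Categorical(
        logits=qbot.guess(hq).view(N_ATTR, N_VALS))
    guess = g_dist.sample()
    logprob = logprob + g_dist.log_prob(guess).sum()
    target = image.view(N_ATTR, N_VALS).argmax(dim=1)
    reward = (guess == target).float().mean()  # shared team reward
    return logprob, reward

qbot, abot = QBot(), ABot()
opt = torch.optim.Adam(list(qbot.parameters()) + list(abot.parameters()), lr=1e-3)
for step in range(5000):
    attrs = torch.randint(N_VALS, (N_ATTR,))        # a random synthetic "image"
    image = F.one_hot(attrs, N_VALS).float().view(-1)
    logprob, reward = play_episode(qbot, abot, image)
    # REINFORCE: a constant baseline keeps the gradient estimate unbiased.
    loss = -(reward - 0.5) * logprob
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The structural point matches the abstract: Q-BOT never observes the image, so information about the attributes can reach it only through the symbol protocol the two agents invent under the shared reward.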
