Abstract

Recent years have seen significant advancements in artificial intelligence (AI). Still, we are far from intelligent agents that can visually perceive their surroundings, reason, and interact with humans in natural language, thereby being an integral part of our lives. As a step towards such a grand goal, this thesis proposes Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from any specific downstream task to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of responses. We collect VisDial, a large dataset of human dialogs on real images, to benchmark progress. To tackle several challenges in Visual Dialog, we build CorefNMN, a neural network-based model that reasons explicitly about its inputs by performing visual coreference resolution. To demonstrate the effectiveness of such models, we test them on the VisDial dataset and on CLEVR-Dialog, a large diagnostic dataset that we synthetically generate with fully annotated dialog states. By breaking down the performance of these models according to history dependency, coreference distance, etc., we show that our models quantitatively outperform other approaches and are qualitatively more interpretable, grounded, and consistent—all of which are desirable for an AI system. We then apply Visual Dialog to visually-grounded, goal-driven dialog without the need for additional supervision. Goal-driven dialog agents specialize in a downstream task (or goal), making them deployable and interesting to study. We also train these agents from scratch solely to maximize their performance (goal rewards) and study the emergent language. Our findings show that while most agent-invented languages are effective (i.e., achieve near-perfect task rewards), they are decidedly not interpretable or compositional. All our datasets and code are publicly available to encourage future research.
