Abstract

Reasoning, a trait of cognitive intelligence, is regarded as a crucial ability that distinguishes humans from other species. However, neural networks now pose a challenge to this human ability. Text-to-image synthesis is a task at the intersection of vision and language, wherein the goal is to learn multimodal representations between image and text features. Hence, it requires a high-level reasoning ability that understands the relationships between objects in the given text and generates high-quality images based on that understanding. Text-to-image translation can thus be regarded as the visual thinking of neural networks. In this study, our model infers the complicated relationships between objects in the given text and generates the final image by leveraging the previous history. We define diverse novel adversarial loss functions and finally demonstrate the one that best elevates the reasoning ability of text-to-image synthesis. Remarkably, most of our models possess their own reasoning ability. Quantitative and qualitative comparisons with several existing methods demonstrate the superiority of our approach.
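
The abstract's central mechanism, generating the final image by leveraging the previous history, can be pictured as an iterative, history-conditioned generation loop. The PyTorch sketch below is a minimal illustration under that reading; every module name, layer shape, and the GRU-based history tracker are assumptions for exposition, not the architecture actually proposed in the paper.

    # Minimal sketch of history-conditioned, iterative text-to-image generation.
    # Every module name, shape, and the GRU-based history tracker is an
    # illustrative assumption, not the architecture proposed in the paper.
    import torch
    import torch.nn as nn

    class IterativeTextToImage(nn.Module):
        def __init__(self, vocab_size, text_dim=256, state_dim=512, noise_dim=100):
            super().__init__()
            self.noise_dim = noise_dim
            self.embed = nn.Embedding(vocab_size, text_dim)
            self.sent_rnn = nn.GRU(text_dim, text_dim, batch_first=True)
            # Recurrent tracker: folds each new instruction into a history state.
            self.history = nn.GRUCell(text_dim, state_dim)
            # Toy deconvolutional generator: history state + noise -> 64x64 RGB.
            self.generator = nn.Sequential(
                nn.Linear(state_dim + noise_dim, 128 * 8 * 8),
                nn.Unflatten(1, (128, 8, 8)),
                nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 16x16
                nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 32x32
                nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 64x64
                nn.Tanh(),
            )

        def forward(self, instructions):
            # instructions: list of (batch, seq_len) LongTensors, one per turn.
            batch = instructions[0].size(0)
            state = torch.zeros(batch, self.history.hidden_size)
            image = None
            for tokens in instructions:
                _, sent = self.sent_rnn(self.embed(tokens))   # (1, batch, text_dim)
                state = self.history(sent.squeeze(0), state)  # fold in new turn
                z = torch.randn(batch, self.noise_dim)
                image = self.generator(torch.cat([state, z], dim=1))
            return image  # the final image reflects the whole instruction history

Each turn folds the new instruction into a compact recurrent state instead of re-encoding the full text, which is one common way to let later instructions build on earlier ones.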

Highlights

  • We identify each of them as Visual Thinking Least Squares (VTLS), Visual Thinking Hinge (VTH), Visual Thinking Relativistic (VTR), Visual Thinking Relativistic Average (VTRA), Visual Thinking Relativistic Average Least Squares (VTRALS), and Visual Thinking Relativistic Average Hinge (VTRAH); a sketch of these objective families is given after this list

  • An accurate evaluation of our task must first identify whether all the objects mentioned by the Teller are present in the synthesized image and, second, whether they appear at the exact locations
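
The six variants named above correspond to standard adversarial objective families: least squares, hinge, relativistic, and relativistic average, with VTRALS and VTRAH combining the relativistic-average comparison with the least-squares and hinge forms. The PyTorch sketch below writes out those generic objectives over raw discriminator logits; the paper's exact conditioning and loss weighting are not specified here and may differ.

    # Generic GAN objective families behind the named variants; `real` and
    # `fake` are raw discriminator logits. The paper's exact conditioning and
    # weighting are assumptions left out of this sketch.
    import torch
    import torch.nn.functional as F

    def lsgan_losses(real, fake):
        # Least-squares objectives (the VTLS family).
        d = 0.5 * ((real - 1) ** 2).mean() + 0.5 * (fake ** 2).mean()
        g = 0.5 * ((fake - 1) ** 2).mean()
        return d, g

    def hinge_losses(real, fake):
        # Hinge objectives (the VTH family).
        d = F.relu(1.0 - real).mean() + F.relu(1.0 + fake).mean()
        g = -fake.mean()
        return d, g

    def relativistic_losses(real, fake):
        # Relativistic objectives (the VTR family): score how much more
        # realistic the real sample looks than the paired fake sample.
        d = F.binary_cross_entropy_with_logits(real - fake, torch.ones_like(real))
        g = F.binary_cross_entropy_with_logits(fake - real, torch.ones_like(fake))
        return d, g

    def relativistic_average_losses(real, fake):
        # Relativistic average objectives (the VTRA family): compare each
        # sample against the mean logit of the opposite class. Swapping the
        # cross-entropy here for the least-squares or hinge forms above gives
        # the VTRALS and VTRAH combinations.
        d = (F.binary_cross_entropy_with_logits(real - fake.mean(), torch.ones_like(real))
             + F.binary_cross_entropy_with_logits(fake - real.mean(), torch.zeros_like(fake)))
        g = (F.binary_cross_entropy_with_logits(fake - real.mean(), torch.ones_like(fake))
             + F.binary_cross_entropy_with_logits(real - fake.mean(), torch.zeros_like(real)))
        return d, g

For instance, d_loss, g_loss = relativistic_average_losses(D(x_real), D(x_fake)) plugs directly into a standard alternating discriminator/generator update.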

Introduction

They are typically medium to large birds, usually grey or white, often with black markings on the head or wings. They typically have harsh wailing or squawking calls, longish bills, and webbed feet. Even after reading these sentences, it is not easy to immediately imagine what they describe. The sentences describe a seagull, i.e., a bird that anyone who sees a photo of it recognizes intuitively. Why is it difficult to immediately perceive something described in words? The vast information in the world is gradually shifting from text to images. Static textual information is changing into vivid visual information, and text is being used as an auxiliary material to
