Abstract

Reasoning, a trait of cognitive intelligence, is regarded as a crucial ability that distinguishes humans from other species. However, neural networks now pose a challenge to this human ability. Text-to-image synthesis is a task at the intersection of vision and language, wherein the goal is to learn multimodal representations between image and text features. Hence, it requires a high-level reasoning ability that understands the relationships between objects in the given text and generates high-quality images based on that understanding. Text-to-image translation can thus be regarded as the visual thinking of neural networks. In this study, our model infers the complicated relationships between objects in the given text and generates the final image by leveraging the previous history. We define diverse novel adversarial loss functions and finally demonstrate the one that best elevates the reasoning ability of text-to-image synthesis. Remarkably, most of our models possess their own reasoning ability. Quantitative and qualitative comparisons with several existing methods demonstrate the superiority of our approach.
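
The abstract's central mechanism, generating the final image by leveraging the previous history, can be pictured as an iterative, history-conditioned generation loop. The PyTorch sketch below is a minimal illustration under that reading; every module name, layer shape, and the GRU-based history tracker are assumptions for exposition, not the architecture actually proposed in the paper.

    # Minimal sketch of history-conditioned, iterative text-to-image generation.
    # Every module name, shape, and the GRU-based history tracker is an
    # illustrative assumption, not the architecture proposed in the paper.
    import torch
    import torch.nn as nn

    class IterativeTextToImage(nn.Module):
        def __init__(self, vocab_size, text_dim=256, state_dim=512, noise_dim=100):
            super().__init__()
            self.noise_dim = noise_dim
            self.embed = nn.Embedding(vocab_size, text_dim)
            self.sent_rnn = nn.GRU(text_dim, text_dim, batch_first=True)
            # Recurrent tracker: folds each new instruction into a history state.
            self.history = nn.GRUCell(text_dim, state_dim)
            # Toy deconvolutional generator: history state + noise -> 64x64 RGB.
            self.generator = nn.Sequential(
                nn.Linear(state_dim + noise_dim, 128 * 8 * 8),
                nn.Unflatten(1, (128, 8, 8)),
                nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 16x16
                nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 32x32
                nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 64x64
                nn.Tanh(),
            )

        def forward(self, instructions):
            # instructions: list of (batch, seq_len) LongTensors, one per turn.
            batch = instructions[0].size(0)
            state = torch.zeros(batch, self.history.hidden_size)
            image = None
            for tokens in instructions:
                _, sent = self.sent_rnn(self.embed(tokens))   # (1, batch, text_dim)
                state = self.history(sent.squeeze(0), state)  # fold in new turn
                z = torch.randn(batch, self.noise_dim)
                image = self.generator(torch.cat([state, z], dim=1))
            return image  # the final image reflects the whole instruction history

Each turn folds the new instruction into a compact recurrent state instead of re-encoding the full text, which is one common way to let later instructions build on earlier ones.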

Highlights

  • We identify each of them as Visual Thinking Least Squares (VTLS), Visual Thinking Hinge (VTH), Visual Thinking Relativistic (VTR), Visual Thinking Relativistic Average (VTRA), Visual Thinking Relativistic Average Least Squares (VTRALS), and Visual Thinking Relativistic Average Hinge (VTRAH); a sketch of these objective families is given after this list

  • An accurate evaluation of our task must first identify whether all the objects mentioned by the Teller are present in the synthesized image and, second, whether they appear at the exact locations
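
The six variants named above correspond to standard adversarial objective families: least squares, hinge, relativistic, and relativistic average, with VTRALS and VTRAH combining the relativistic-average comparison with the least-squares and hinge forms. The PyTorch sketch below writes out those generic objectives over raw discriminator logits; the paper's exact conditioning and loss weighting are not specified here and may differ.

    # Generic GAN objective families behind the named variants; `real` and
    # `fake` are raw discriminator logits. The paper's exact conditioning and
    # weighting are assumptions left out of this sketch.
    import torch
    import torch.nn.functional as F

    def lsgan_losses(real, fake):
        # Least-squares objectives (the VTLS family).
        d = 0.5 * ((real - 1) ** 2).mean() + 0.5 * (fake ** 2).mean()
        g = 0.5 * ((fake - 1) ** 2).mean()
        return d, g

    def hinge_losses(real, fake):
        # Hinge objectives (the VTH family).
        d = F.relu(1.0 - real).mean() + F.relu(1.0 + fake).mean()
        g = -fake.mean()
        return d, g

    def relativistic_losses(real, fake):
        # Relativistic objectives (the VTR family): score how much more
        # realistic the real sample looks than the paired fake sample.
        d = F.binary_cross_entropy_with_logits(real - fake, torch.ones_like(real))
        g = F.binary_cross_entropy_with_logits(fake - real, torch.ones_like(fake))
        return d, g

    def relativistic_average_losses(real, fake):
        # Relativistic average objectives (the VTRA family): compare each
        # sample against the mean logit of the opposite class. Swapping the
        # cross-entropy here for the least-squares or hinge forms above gives
        # the VTRALS and VTRAH combinations.
        d = (F.binary_cross_entropy_with_logits(real - fake.mean(), torch.ones_like(real))
             + F.binary_cross_entropy_with_logits(fake - real.mean(), torch.zeros_like(fake)))
        g = (F.binary_cross_entropy_with_logits(fake - real.mean(), torch.ones_like(fake))
             + F.binary_cross_entropy_with_logits(real - fake.mean(), torch.zeros_like(real)))
        return d, g

For instance, d_loss, g_loss = relativistic_average_losses(D(x_real), D(x_fake)) plugs directly into a standard alternating discriminator/generator update.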

Introduction

They are typically medium to large birds, usually grey or white, often with black markings on the head or wings. They typically have harsh wailing or squawking calls, longish bills, and webbed feet. Even after reading these sentences, it is not easy to immediately imagine what they describe. The sentences describe a seagull, i.e., a bird that anyone who sees a photo of it recognizes intuitively. Why is it difficult to immediately perceive something described in words? The vast information in the world is gradually shifting from text to images. Static textual information is changing into vivid visual information, and text is being used as an auxiliary material to
