Abstract

With the increasing number of videos on the Web, scene retrieval from input descriptions has become an important application. However, it remains a challenging task because of the semantic gap between text features and video features. In this paper, we address this problem with a text-to-image Generative Adversarial Network (GAN), a deep learning model that generates images from their corresponding descriptions and that has become an active research topic in recent years. We propose a new retrieval framework, “Query is GAN”, based on the text-to-image GAN, which drastically improves scene retrieval performance through simple procedures. Our novel idea is to use images generated by the text-to-image GAN as queries for the scene retrieval task. In addition, unlike many studies on text-to-image GANs, which have mainly focused on generating high-quality images, we reveal that the generated images have reasonable visual features suitable for use as queries even though they are not visually pleasing. We show the effectiveness of the proposed framework through an experimental evaluation in which scene retrieval is performed on real video datasets.

Highlights

  • With the increasing number of videos on the Web, methods of retrieval that provide users with scenes corresponding to their descriptions have become important topics of study [1]–[5]

  • We propose a new scene retrieval framework, Query is GAN, based on a text-to-image Generative Adversarial Network (GAN) [28]–[35]

  • In this study, we reveal that images generated by the text-to-image GAN, although not visually pleasing, have reasonable visual features suitable for use as queries, as supported by evidence obtained in the experiments
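
The core idea in the highlights above, using a generated image as the retrieval query, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the feature vectors are toy values standing in for visual features of a GAN-generated query image and of candidate scenes, cosine similarity is a swapped-in similarity measure, and the names `cosine_similarity`, `rank_scenes`, `query_feature`, and `scene_features` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_scenes(
    query_feature: Sequence[float],
    scene_features: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank scenes by similarity to the query feature, best match first."""
    scored = [
        (scene_id, cosine_similarity(query_feature, feature))
        for scene_id, feature in scene_features.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: toy 3-dimensional features. In practice the query feature would
# be extracted from the image generated from the text description, and the
# scene features from video frames, by some visual feature extractor.
query_feature = [0.9, 0.1, 0.0]
scene_features = {
    "scene_a": [1.0, 0.0, 0.0],
    "scene_b": [0.0, 1.0, 0.0],
    "scene_c": [0.5, 0.5, 0.0],
}

ranking = rank_scenes(query_feature, scene_features)
for scene_id, score in ranking:
    print(f"{scene_id}: {score:.4f}")
```

Because the query and the candidates are compared in the same visual feature space, the text-to-video matching problem reduces to an image-to-video one. The sketch does not model how the image is generated or which features the framework actually uses.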


Summary

Introduction

With the increasing number of videos on the Web, methods of retrieval that provide users with scenes corresponding to their descriptions have become important topics of study [1]–[5]. The scene retrieval task has been studied by many researchers, and there have been many reports that propose text-based retrieval methods [6] and content-based retrieval methods [7]. With the rapid growth of deep learning technologies, studies on scene retrieval have moved to the next stage. Realizing scene retrieval is difficult because several important challenges must be tackled simultaneously. Videos and their corresponding descriptions are regarded as modalities with different semantic spaces. It is necessary to match these two different modalities to

