Abstract
Scene text retrieval aims to localize and search for all text instances in scene images that match a query text. This cross-modal task has significant applications in domains such as intelligent transportation systems and social media analysis. In practice, ensuring that the same content is represented consistently across the two modalities is crucial for improving retrieval accuracy. This paper addresses the issue by introducing a stylized middle modality, which fuses the graphical query text with the style of the extracted text proposal. To this end, we propose a stylized middle modality learning (SM\(^{2}\)L) framework. The stylized middle modality enables the network to jointly enforce constraints on visual feature coherence and textual semantic consistency during optimization, thereby minimizing the modality gap in the retrieval space. This brings two major advantages: 1) SM\(^{2}\)L seamlessly benefits scene text retrieval, and 2) the proposed learning paradigm adds no extra computational cost in the inference phase. Extensive experiments demonstrate that the proposed method considerably outperforms state-of-the-art retrieval methods.
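The joint constraint described above — keeping the stylized middle modality close to both the graphical query (semantic consistency) and the scene-text proposal (visual coherence) — can be illustrated with a toy objective. The following is a minimal sketch under assumed names; `middle_modality_loss` and the cosine-distance formulation are illustrative, not the paper's actual loss.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize feature vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def middle_modality_loss(query_emb, middle_emb, proposal_emb):
    """Toy joint objective (illustrative, not the paper's formulation):
    the middle-modality embedding should be close to BOTH the query
    embedding (text semantic consistency) and the proposal embedding
    (visual feature coherence). Loss = mean sum of cosine distances."""
    q, m, p = map(l2_normalize, (query_emb, middle_emb, proposal_emb))
    sem = 1.0 - np.sum(q * m, axis=-1)  # query  <-> middle distance
    vis = 1.0 - np.sum(p * m, axis=-1)  # proposal <-> middle distance
    return float(np.mean(sem + vis))

# When the middle modality aligns with both sides, the loss vanishes;
# misalignment with either side increases it.
feat = np.random.default_rng(0).normal(size=(4, 16))
print(middle_modality_loss(feat, feat, feat))  # -> ~0.0
```

Because such constraints are applied only during optimization, the middle modality can be discarded at inference, which is why no extra computational cost is incurred at retrieval time.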
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications