Abstract
Scene text retrieval aims to localize and search for all text instances in scene images that match a query text. This cross-modal task has significant applications in domains such as intelligent transportation systems and social media analysis. In practice, ensuring that the same content is represented consistently across the two modalities is crucial for improving retrieval accuracy. This paper addresses the issue by introducing a stylized middle modality, which fuses the graphical query text with the style of the extracted text proposal. To this end, we propose a stylized middle modality learning (SM\(^{2}\)L) framework. The stylized middle modality enables the network to jointly enforce constraints on visual feature coherence and textual semantic consistency during optimization, thereby minimizing the modality gap in the retrieval space. This brings two major advantages: 1) SM\(^{2}\)L seamlessly benefits scene text retrieval, and 2) the proposed learning paradigm adds no extra computational cost in the inference phase. Extensive experiments demonstrate that the proposed method considerably outperforms state-of-the-art retrieval methods.
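The joint constraint described above — keeping the stylized middle modality close to both the graphical query (semantic consistency) and the scene-text proposal (visual coherence) — can be illustrated with a toy objective. The following is a minimal sketch under assumed names; `middle_modality_loss` and the cosine-distance formulation are illustrative, not the paper's actual loss.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize feature vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def middle_modality_loss(query_emb, middle_emb, proposal_emb):
    """Toy joint objective (illustrative, not the paper's formulation):
    the middle-modality embedding should be close to BOTH the query
    embedding (text semantic consistency) and the proposal embedding
    (visual feature coherence). Loss = mean sum of cosine distances."""
    q, m, p = map(l2_normalize, (query_emb, middle_emb, proposal_emb))
    sem = 1.0 - np.sum(q * m, axis=-1)  # query  <-> middle distance
    vis = 1.0 - np.sum(p * m, axis=-1)  # proposal <-> middle distance
    return float(np.mean(sem + vis))

# When the middle modality aligns with both sides, the loss vanishes;
# misalignment with either side increases it.
feat = np.random.default_rng(0).normal(size=(4, 16))
print(middle_modality_loss(feat, feat, feat))  # -> ~0.0
```

Because such constraints are applied only during optimization, the middle modality can be discarded at inference, which is why no extra computational cost is incurred at retrieval time.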
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications