Abstract

Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has made remarkable progress. However, the LM only optimizes the joint probability of the character estimates produced by the Vision Model (VM) within the single language modality, ignoring visual-semantic relations across modalities. As a result, LM-based methods generalize poorly to challenging conditions, such as text with weak or multiple semantics or arbitrary shapes. To mitigate this issue, in this paper we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN), which reasons over and combines multimodal visual-semantic information for accurate scene text recognition. Specifically, MVSTRN builds a bridge between vision and language through its unified architecture and learns to reason about visual semantics by reconstructing the original image from the latent text representation, closing the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module combines the visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments demonstrate that MVSTRN achieves state-of-the-art performance on several benchmarks.
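The abstract does not describe how the MMF module is implemented. One common way to fuse per-character visual and textual features is a learned gate that weighs the two streams channel-wise before classification; the sketch below is a minimal illustration under that assumption, not the authors' MMF design. The class name `GatedMultimodalFusion`, the feature shapes, and the gating scheme are hypothetical.

```python
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    """Illustrative gated fusion of visual and semantic character features.

    Both inputs are per-character feature sequences of shape (B, T, D):
    `f_vis` from the vision model and `f_sem` from the language model.
    A learned gate decides, per character and per channel, how much to
    trust each modality before the final classification layer.
    """
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, f_vis: torch.Tensor, f_sem: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([f_vis, f_sem], dim=-1))  # (B, T, D) gate in [0, 1]
        fused = g * f_vis + (1.0 - g) * f_sem             # convex combination per channel
        return self.classifier(fused)                     # (B, T, num_classes) logits

if __name__ == "__main__":
    # Example: 2 images, up to 25 characters, 512-d features, 37 character classes.
    fusion = GatedMultimodalFusion(d_model=512, num_classes=37)
    f_vis = torch.randn(2, 25, 512)   # visual features from the VM
    f_sem = torch.randn(2, 25, 512)   # semantic features from the LM
    logits = fusion(f_vis, f_sem)
    print(logits.shape)               # torch.Size([2, 25, 37])
```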
