Abstract
Recent advancements in scene text recognition have predominantly focused on leveraging textual semantics. However, an over-reliance on linguistic priors can impede a model’s ability to handle irregular text scenes, including non-standard word usage, occlusions, severe distortions, or stretching. The key challenges lie in effectively localizing occlusions, perceiving multi-scale text, and inferring text based on scene context. To address these challenges and enhance visual capabilities, we introduce the Graph Reasoning Model (GRM). The GRM employs a novel feature fusion method to align spatial context information across different scales, beginning with a feature aggregation stage that extracts rich spatial contextual information from various feature maps. Visual reasoning representations are then obtained through graph convolution. We integrate the GRM module with a language model to form a two-stream architecture called GRNet. This architecture combines pure visual predictions with joint visual-linguistic predictions to produce the final recognition results. Additionally, we propose a dynamic iteration refinement for the language model to prevent over-correction of prediction results, ensuring a balanced contribution from both visual and linguistic cues. Extensive experiments demonstrate that GRNet achieves state-of-the-art average recognition accuracy across six mainstream benchmarks. These results highlight the efficacy of our multi-modal approach in scene text recognition, particularly in challenging scenarios where visual reasoning plays a crucial role.
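For readers unfamiliar with the graph convolution step mentioned above, the following is a minimal sketch of one generic graph convolution layer using the common symmetric normalization with self-loops. It is not the GRM implementation: the abstract does not specify how nodes or edges are built, so the node features, the chain-shaped adjacency matrix, the weight matrix, and the function names `normalize_adjacency` and `graph_conv` are all assumptions chosen for illustration.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2}: symmetric normalization with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops so each node keeps its own features
    deg = a_hat.sum(axis=1)                 # node degrees (always >= 1 because of self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(features: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One generic graph convolution step: ReLU(A_norm @ H @ W)."""
    return np.maximum(normalize_adjacency(adj) @ features @ weight, 0.0)


# Hypothetical toy input: 4 nodes (e.g. spatial positions of a feature map), 3 channels each.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 3))

# Assumed adjacency: a simple chain 0-1-2-3 linking neighbouring positions.
adj = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

weight = rng.standard_normal((3, 2))        # projects 3 input channels to 2 output channels
out = graph_conv(features, adj, weight)     # shape (4, 2): one updated feature vector per node
```

Each output row mixes a node's own features with those of its neighbours, which is the basic mechanism by which graph convolution can propagate context between text regions.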