Abstract

Scene text in natural images carries rich semantic information, but its widely varying appearance makes accurate recognition challenging. In this work, we propose an arbitrary-shaped scene text recognition method that learns multiple representations of text in the scale space and fuses them with attention mechanisms. Because distinctive visual features of text often appear at different scales, we generate a family of multi-scale representations of the input text image through multiple encoder branches with progressively increasing scale parameters; these representations capture complementary appearance characteristics of the text. We further introduce edge map features as a supplementary high-frequency representation that carries useful text cues. We then refine the multi-scale representations with in-scale and cross-scale attention mechanisms and adaptively aggregate them into an enhanced representation of the text, which improves recognition accuracy. The proposed method achieves competitive results on several scene text benchmarks, demonstrating its effectiveness in recognizing text of various shapes.
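
The idea of building several scale-space views of one image, adding an edge map, and aggregating them with attention-like weights can be illustrated with a minimal NumPy sketch. This is not the proposed network: the learned encoder branches are replaced here by fixed Gaussian smoothing, the edge map is assumed to be a plain gradient magnitude, and the attention scores are a hand-made contrast heuristic rather than learned in-scale and cross-scale attention. The function names, the `sigmas` values, and the scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D normalized Gaussian kernel truncated at about 3 sigma.
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    # Separable Gaussian smoothing with edge padding; output keeps the input shape.
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def edge_map(img):
    # Gradient magnitude as a stand-in high-frequency representation (assumption).
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def fuse_multiscale(img, sigmas=(1.0, 2.0, 4.0)):
    # One representation per scale parameter, plus the edge map.
    reps = [gaussian_blur(img, s) for s in sigmas]
    reps.append(edge_map(img))
    stack = np.stack(reps)  # shape: (num_reps, H, W)

    # Hypothetical score: deviation of each pixel from its representation's mean.
    # In a trained model these scores would come from learned attention layers.
    scores = np.abs(stack - stack.mean(axis=(1, 2), keepdims=True))

    # Softmax across representations, so the weights at each pixel sum to 1.
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)

    fused = (weights * stack).sum(axis=0)
    return fused, weights

# Usage on a random stand-in for a grayscale text image.
img = np.random.default_rng(0).random((32, 100))
fused, weights = fuse_multiscale(img)
print(fused.shape, weights.shape)
```

The softmax across the stacked representations is what makes the aggregation adaptive per pixel: each location draws more heavily on whichever representation scores highest there, which is the role the attention mechanisms play in the method described above.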
