Abstract

Scene text recognition has attracted substantial research interest in computer vision for decades because of its wide range of applications. However, it remains a challenging task due to variations in text appearance, including perspective distortion, text-line curvature, text style, and font size. Almost all existing state-of-the-art methods adopt an attention-based encoder-decoder framework built on RNNs. Inspired by the outstanding performance of the transformer in natural language processing, which also adopts an encoder-decoder framework but discards recurrent units, we develop a recognition network based on the transformer (RNBT). We also modify the loss function to mitigate the poor recognition performance of the encoder-decoder framework on images whose text is longer than that seen in the training set. The whole network can be trained end-to-end using only images and image-level annotations. Extensive experiments on public benchmarks, including CUTE80, SVT-Perspective, IIIT5K, SVT, and the ICDAR datasets, show that the proposed method achieves excellent performance on both regular and irregular text.
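To make the described setup concrete, the sketch below shows one way a transformer-based, RNN-free recognizer of this kind could be wired together: a convolutional backbone produces a feature sequence, and a transformer encoder-decoder predicts characters autoregressively. This is only an illustrative assumption; the abstract does not specify the RNBT backbone, layer counts, vocabulary, or loss modification, so all sizes and names here (TransformerTextRecognizer, d_model, max_len, etc.) are hypothetical.

```python
# Minimal sketch of a transformer encoder-decoder text recognizer
# (no recurrent units). All architectural choices are assumptions,
# not the paper's actual RNBT configuration.
import torch
import torch.nn as nn


class TransformerTextRecognizer(nn.Module):
    def __init__(self, vocab_size=100, d_model=256, nhead=8,
                 num_encoder_layers=4, num_decoder_layers=4, max_len=64):
        super().__init__()
        # Small convolutional backbone: image -> sequence of d_model features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height to 1, keep width
        )
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.tgt_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            batch_first=True,
        )
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 3, H, W); tgt_tokens: (B, T) character indices
        feats = self.backbone(images)              # (B, d_model, 1, W')
        feats = feats.squeeze(2).transpose(1, 2)   # (B, W', d_model)
        feats = feats + self.pos_emb[:, :feats.size(1)]
        tgt = self.tgt_emb(tgt_tokens) + self.pos_emb[:, :tgt_tokens.size(1)]
        # Causal mask so each position attends only to earlier characters.
        causal = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        out = self.transformer(feats, tgt, tgt_mask=causal)
        return self.classifier(out)                # (B, T, vocab_size)


if __name__ == "__main__":
    model = TransformerTextRecognizer()
    logits = model(torch.randn(2, 3, 32, 128), torch.randint(0, 100, (2, 10)))
    print(logits.shape)  # torch.Size([2, 10, 100])
```

In such a design, training would typically use teacher forcing with a cross-entropy loss over characters; the abstract's length-robust loss modification is not detailed here and is therefore not reproduced in this sketch.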
