Abstract

This research focuses on Scene Text Recognition (STR), a crucial component of artificial intelligence applications such as image retrieval, office automation, and intelligent traffic systems. Recent studies have shown that semantic-aware approaches significantly improve STR performance, and context-aware STR methods have become mainstream. Among these, fusing visual and language models has proven remarkably effective. We propose a novel method, PABINet, that comprises three key components: a Visual-Language Decoder, a Language Model, and a Fusion Model. First, during training, the Visual-Language Decoder masks the original labels in the Transformer decoder using permutation masks, each of which is unique. This strengthens word memorization and learning from contextual semantic information, yielding robust semantic knowledge. During inference, the Visual-Language Decoder generates results by autonomous autoregressive (AR) decoding. Subsequently, the Language Model checks and corrects the output of the Visual-Language Decoder using a cloze mask, achieving context-aware, autonomous, bidirectional inference. Finally, the Fusion Model concatenates the outputs of both models and refines them through iterative layers. Experimental results demonstrate that PABINet performs well on images of varying quality. When trained on synthetic data, PABINet sets a new STR benchmark (average accuracy of 92.41%), and when trained on real data, it establishes new state-of-the-art results (average accuracy of 96.28%).
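
The two masking schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not specify PABINet's exact masking convention, so the code below assumes the common one: under a permutation mask, a position may attend only to positions decoded earlier in the chosen order, and under a cloze mask, a position may attend to every position except itself. The function names `permutation_mask` and `cloze_mask` are hypothetical and are not taken from the paper.

```python
import numpy as np

def permutation_mask(perm):
    """Boolean attention mask for one decoding order (assumed convention).

    perm lists character positions in the order they are decoded.
    allowed[i, j] is True when position i may attend to position j,
    i.e. when j is decoded strictly before i in that order.
    """
    perm = np.asarray(perm)
    length = perm.size
    rank = np.empty(length, dtype=int)
    rank[perm] = np.arange(length)  # rank[p] = step at which position p is decoded
    return rank[:, None] > rank[None, :]

def cloze_mask(length):
    """Boolean attention mask where each position sees all others but not itself."""
    return ~np.eye(length, dtype=bool)

# Left-to-right order reproduces the usual causal (autoregressive) mask.
left_to_right = permutation_mask([0, 1, 2])

# A different order gives a different mask over the same label.
shuffled = permutation_mask([2, 0, 1])

# The cloze mask hides only the diagonal, so context flows from both sides.
bidirectional = cloze_mask(3)
```

With the identity order the permutation mask reduces to ordinary left-to-right decoding, which is consistent with the abstract's description of autoregressive inference, while the cloze mask gives the bidirectional context the Language Model is said to use for correction.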
