Abstract
This paper proposed an efficient and novel end-to-end text detection and recognition framework called DiZNet. DiZNet is built upon a core representation using text detail maps and employs the classical lightweight ResNet18 as the backbone for the text detection and recognition algorithm model. The redesigned Text Attention Head (TAH) takes multiple shallow backbone features as input, effectively extracting pixel-level information of text in images and global text positional features. The extracted text features are integrated into the stackable Feature Pyramid Enhancement Fusion Module (FPEFM). Supervised with text detail map labels, which include boundary information and texture of important text, the model predicts text detail maps and fuses them into the text detection and recognition heads. Through end-to-end testing on publicly available natural scene text benchmark datasets, our approach demonstrates robust generalization capabilities and real-time detection speeds. Leveraging the advantages of text detail map representation, DiZNet achieves a good balance between precision and efficiency on challenging datasets. For example, DiZNet achieves 91.2% Precision and 85.9% F-measure at a speed of 38.4 FPS on Total-Text and 83.8% F-measure at a speed of 30.0 FPS on ICDAR2015, it attains 83.8% F-measure at 30.0 FPS. The code is publicly available at: https://github.com/DiZ-gogogo/DiZNet
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: Journal of Visual Communication and Image Representation
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.