Abstract

Scene text detection has become a popular topic in computer vision research. Most current research is based on deep learning, using Convolutional Neural Networks (CNNs) to extract the visual features of images. However, due to the limited size of convolution kernels, CNNs can only extract local features within small receptive fields and cannot capture global features. In this paper, to improve the accuracy of scene text detection, a feature enhancement module is added to the text detection model. This module acquires global features of an image by computing multi-head self-attention over the feature map. The improved model extracts local features with CNNs and global features with the feature enhancement module, and then fuses the two so that visual features at different levels of the image are captured. A shifted window is used in the self-attention calculation, which reduces the computational complexity from quadratic to linear in the product of the image width and height. Experiments are conducted on the multi-oriented text dataset ICDAR2015 and the multi-language text dataset MSRA-TD500. Compared with the baseline method DBNet, the F1-score improves by 0.5% on ICDAR2015 and 3.5% on MSRA-TD500, indicating the effectiveness of the improvement.
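The abstract does not give implementation details of the feature enhancement module, but the idea of window-based multi-head self-attention fused with CNN features can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's exact module: the channel count, window size, number of heads, and fusion by residual addition are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class WindowAttentionEnhancer(nn.Module):
    """Sketch of a feature enhancement module: per-window multi-head
    self-attention over a CNN feature map, fused back by addition."""
    def __init__(self, channels=256, num_heads=8, window=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                       # x: (B, C, H, W) CNN feature map
        B, C, H, W = x.shape
        w = self.window                         # assumes H and W divisible by w
        # Partition the map into non-overlapping w x w windows so the attention
        # cost grows linearly with H*W instead of quadratically.
        x_ = x.reshape(B, C, H // w, w, W // w, w)
        x_ = x_.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)   # (B*nWin, w*w, C)
        attn_out, _ = self.attn(x_, x_, x_)     # self-attention within each window
        x_ = self.norm(x_ + attn_out)
        # Restore the (B, C, H, W) spatial layout.
        x_ = x_.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        enhanced = x_.reshape(B, C, H, W)
        return x + enhanced                     # fuse local (CNN) and global features

# Usage: enhance a backbone feature map before the detection head.
feat = torch.randn(2, 256, 64, 64)
out = WindowAttentionEnhancer()(feat)
print(out.shape)  # torch.Size([2, 256, 64, 64])
```

The shifted-window variant described in the abstract would additionally offset the window grid in alternating blocks so that information flows across window boundaries; the sketch above shows only the plain window partition.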
