Abstract

Arbitrary-shape scene text detection remains a challenging task in the computer vision community. Compared with anchor-based methods, segmentation-based methods detect text instances by generating accurate text components, which gives them an advantage on curved and irregular text. In this paper, we propose a novel text-component extraction network for arbitrary-shape scene text detection. It detects different text components through two parallel branches. In the first branch, a feature redistribution module (FRM) is proposed to extract text-related features and filter out non-text information; these text-related features are then aggregated to generate text boundary maps. The same features are also fed to the second branch, which progressively generates text kernel maps. The second branch consists of a modified Transformer decoder with a multi-level progressive supervision strategy, which captures spatial details and establishes dependencies between different text regions. In this way, our method generates accurate text components without additional geometric calculations. Thanks to the powerful Transformer decoder and an efficient differentiable binarization module, our method not only achieves advanced detection accuracy but also offers competitive inference speed. Specifically, with a ResNet-18 backbone, our method runs at 43.5 FPS and achieves an F-measure of 83.3% on the Total-Text dataset, 1.43 times faster than the latest state-of-the-art method. With a ResNet-50 backbone, it achieves an F-measure of 85.2%, outperforming the latest state-of-the-art method by 0.2% on Total-Text. Code is available at: https://github.com/As-David/TransText.
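The differentiable binarization module mentioned above presumably follows the standard approximate step function popularized by the DB detector, where a probability map P is binarized against a learned threshold map T via a steep sigmoid. Below is a minimal sketch of that operation; the function name, the toy maps, and the amplification factor k = 50 (the value used in the original DB formulation) are illustrative assumptions, not taken from this paper's released code:

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate, differentiable step function that binarizes a
    probability map P against a per-pixel threshold map T:
        B = 1 / (1 + exp(-k * (P - T)))
    A large k makes the sigmoid close to a hard step while keeping
    gradients usable for end-to-end training."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Pixels well above the threshold map to ~1, pixels well below to ~0.
P = np.array([[0.9, 0.6],
              [0.4, 0.1]])
T = np.full_like(P, 0.5)
B = differentiable_binarization(P, T)
```

Because the mapping is smooth everywhere, the threshold map can be supervised jointly with the probability map, which is what makes the module cheap at inference time: the steep sigmoid can simply be replaced by a hard threshold.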
