Abstract

Arbitrary-shape text spotting remains a challenging computer vision task. In this paper, we propose an end-to-end trainable unified framework for arbitrary-shape text spotting that overcomes the limitations inherent in existing methods. Specifically, we propose to perceive and understand text at different levels of semantics, i.e., holistic-, pixel-, and sequence-level semantics, and then unify the recognized semantics for robust text spotting. To implement the framework, we customize the detection and mask branches of Mask R-CNN to explore both holistic- and pixel-level semantics for text recognition. Based on the recognition results, the text spotting task can then be formulated in the two-dimensional feature space. Then, by feeding the two-dimensional feature maps into an additional text recognition branch, our framework further delivers one-dimensional sequence-level semantics for text recognition via an attention-based sequence-to-sequence network. Finally, the results from all three levels of semantics are merged to produce the final result. Our framework is therefore capable of simultaneously recognizing text from both the one- and two-dimensional perspectives, achieving highly comprehensive text recognition. In addition, because some existing datasets lack character-level annotations, the rich descriptions of text produced by our framework allow us to use only word-level annotations as weak supervision for training a robust text spotting model. Experiments on ICDAR 2013, ICDAR 2015, and Total-Text show that our framework achieves state-of-the-art performance for both detection and recognition.
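The sequence-level branch described above attends over the two-dimensional feature map at each decoding step. The following is a minimal NumPy sketch of one additive-attention decoding step over a flattened feature map; all shapes, weight names, and the additive (Bahdanau-style) scoring form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(feats, h, Wf, Wh, v):
    """One additive-attention step: score every spatial position of the
    flattened 2D feature map against the decoder state, then pool."""
    scores = np.tanh(feats @ Wf + h @ Wh) @ v   # (N,) one score per position
    alpha = softmax(scores)                     # attention weights over H*W positions
    context = alpha @ feats                     # (C,) attended context vector
    return context, alpha

# Toy shapes (assumed): a 4x5 feature map with 8 channels, decoder state size 16.
H, W, C, D, A = 4, 5, 8, 16, 32
feats = rng.standard_normal((H * W, C))  # flattened 2D feature map
h = rng.standard_normal(D)               # current decoder hidden state
Wf = rng.standard_normal((C, A))         # feature projection
Wh = rng.standard_normal((D, A))         # hidden-state projection
v = rng.standard_normal(A)               # scoring vector

context, alpha = attention_step(feats, h, Wf, Wh, v)
```

In a full sequence-to-sequence recognizer, `context` would be combined with the decoder state to predict the next character, and the step repeated until an end-of-sequence symbol is emitted.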
