Abstract

In this article, we focus on the scene text recognition problem, one of the challenging sub-fields of computer vision due to the arbitrary appearance of text in natural scenes. Recently, scene text recognition has achieved state-of-the-art performance thanks to advances in deep learning. Encoder-decoder architectures, consisting of a feature extractor and a sequence module, are now widely used for scene text recognition. At the decoder, connectionist temporal classification (CTC), attention mechanisms, and the Transformer (self-attention) are the three main approaches in recent research. The CTC decoder is flexible and can handle sequences of widely varying length because it aligns sequence features with labels in a frame-wise manner. The attention decoder learns richer and deeper feature representations, captures the position of each character more precisely, and achieves more robust and accurate performance on both regular and irregular scene text. Moreover, a novel decoder mechanism is introduced in our study. The proposed architecture has several advantages: the model can be trained end-to-end even with multiple decoders, and it can handle sequences of arbitrary length and images of arbitrary shape. Extensive experiments on standard benchmarks demonstrate that our model improves performance on both regular and irregular text recognition.
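For illustration only, the following is a minimal sketch of such a multi-decoder setup, not the paper's actual implementation: one shared encoder (a CNN feature extractor plus a BiLSTM sequence module) feeds both a frame-wise CTC head and a step-wise attention decoder, and the two losses are summed so the whole model trains end-to-end. The class count, layer sizes, the blank/<GO> index, and the equal loss weighting are all assumptions.

```python
# A minimal sketch (assumptions noted inline), not the authors' model.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 37  # hypothetical: 26 letters + 10 digits + 1 blank/<GO>
HIDDEN = 256

class Recognizer(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor: collapse height to 1 and keep width as the time
        # axis, so images of arbitrary width yield sequences of arbitrary length.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),              # (B, 128, 1, W')
        )
        self.rnn = nn.LSTM(128, HIDDEN, bidirectional=True, batch_first=True)
        # CTC head: per-frame class scores, aligned to labels frame-wise.
        self.ctc_head = nn.Linear(2 * HIDDEN, NUM_CLASSES)
        # Attention decoder: predicts one character per step from a context
        # vector computed over the encoder features.
        self.embed = nn.Embedding(NUM_CLASSES, HIDDEN)
        self.att_proj = nn.Linear(2 * HIDDEN, HIDDEN)
        self.att_cell = nn.LSTMCell(HIDDEN + 2 * HIDDEN, HIDDEN)
        self.att_out = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, images, targets):
        f = self.cnn(images).squeeze(2).permute(0, 2, 1)  # (B, W', 128)
        feats, _ = self.rnn(f)                            # (B, W', 512)
        ctc_logits = self.ctc_head(feats)
        # Attention decoding with teacher forcing over `targets`.
        B = feats.size(0)
        h = feats.new_zeros(B, HIDDEN)
        c = feats.new_zeros(B, HIDDEN)
        proj = self.att_proj(feats)                       # (B, W', HIDDEN)
        prev = targets.new_zeros(B)                       # assumed <GO> = 0
        att_logits = []
        for t in range(targets.size(1)):
            score = torch.bmm(proj, h.unsqueeze(2)).squeeze(2)      # (B, W')
            alpha = F.softmax(score, dim=1)
            ctx = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)   # (B, 512)
            h, c = self.att_cell(torch.cat([self.embed(prev), ctx], 1), (h, c))
            att_logits.append(self.att_out(h))
            prev = targets[:, t]                          # teacher forcing
        return ctc_logits, torch.stack(att_logits, dim=1)

# Joint end-to-end training step: both decoder losses share one encoder.
model = Recognizer()
images = torch.randn(4, 1, 32, 128)                    # (B, C, H, W)
targets = torch.randint(1, NUM_CLASSES, (4, 10))       # labels; 0 reserved
target_len = torch.full((4,), 10, dtype=torch.long)

ctc_logits, att_logits = model(images, targets)
log_probs = ctc_logits.log_softmax(2).permute(1, 0, 2)  # (T, B, C) for CTC
input_len = torch.full((4,), log_probs.size(0), dtype=torch.long)
ctc_loss = F.ctc_loss(log_probs, targets, input_len, target_len, blank=0)
att_loss = F.cross_entropy(att_logits.reshape(-1, NUM_CLASSES),
                           targets.reshape(-1))
(ctc_loss + att_loss).backward()                        # assumed 1:1 weighting
```

Because the CTC branch needs no per-step alignment and the attention branch is unrolled over the label length, the same shared features serve both decoders, which is what makes joint end-to-end training with multiple decoders possible in this kind of design.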
