Abstract

Text Image Recognition (TIR) has recently achieved many remarkable successes with methods using convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Among them, the encoder-decoder (ED) attention model with RNNs outperforms the others. This attention-based model, however, learns sequentially along the text-line direction, as is characteristic of RNNs. As a result, the model's capacity for parallel learning is limited, learning errors accumulate, and it still struggles to recognize long text lines. In this paper, we propose a TIR model based on multi-head self-attention (MSA) that does not rely on RNNs. During the training stage, the model learns, in parallel, the correspondence between character positions in output sequences and visual features extracted by CNNs. In addition, we employ domain adaptation (DA) on character-level features so that the proposed model is robust to the various character styles in testing data. We perform experiments with English, Japanese, and Chinese text image datasets. The experimental results show that our proposed method increases the recognition rates by 10.01~15.76% on each dataset in comparison with the earlier RNN-based method.
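The core idea in the abstract is replacing sequential RNN decoding with multi-head self-attention, which attends to all positions of the CNN feature sequence at once. The sketch below is a minimal, illustrative NumPy implementation of multi-head self-attention over a sequence of visual feature vectors (e.g. feature-map columns of a text-line image); it is not the paper's actual architecture, and the randomly initialized projection weights stand in for learned parameters.

```python
import numpy as np

def multi_head_self_attention(x, num_heads, rng):
    """Minimal multi-head self-attention over a sequence of feature vectors.

    x: (seq_len, d_model) array; d_model must be divisible by num_heads.
    Returns the attended features and the per-head attention weights.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Random projections stand in for learned Q/K/V parameters (illustration only).
    w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    # Split projections into heads: (num_heads, seq_len, d_head).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention, computed for every position at once --
    # this is the parallelism that sequential RNN decoding lacks.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model).
    return out.transpose(1, 0, 2).reshape(seq_len, d_model), weights
```

Because every output position attends over the whole input in one matrix product, all character positions can be learned in parallel during training, in contrast to an RNN decoder that must consume the sequence step by step.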
