Abstract

Models based on very deep convolutional networks and attention mechanisms now possess powerful feature-extraction abilities, improving text recognition performance in natural scenes. However, very deep models built from stacked layers carry a massive number of parameters, which limits the deployment of text recognition algorithms on storage-constrained devices. In this paper, we propose a lightweight and effective backbone called the Recursive Residual Transformer Network (RRTrN) for scene text recognition. Specifically, by leveraging recursive learning together with a combination of convolutional layers and a transformer unit, RRTrN achieves powerful feature extraction while significantly reducing the number of parameters. This reduction promotes the deployment of our text recognition algorithm on storage-constrained devices, making it more accessible for practical applications. Furthermore, a recursive distillation strategy is presented to balance the inference time and performance of recursive learning, enhancing the practicality and efficiency of RRTrN. Extensive experiments on mainstream benchmarks and popular models verify the generalization of RRTrN, which achieves state-of-the-art recognition performance on five datasets. Notably, a classical STR model built on RRTrN can gain roughly 3 percentage points of recognition accuracy or reduce its parameter count by 80%.
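The parameter savings come from recursive learning: the same block is applied repeatedly, so effective depth grows while the stored weights stay fixed. The following is a minimal, generic sketch of that idea in NumPy — `residual_block` and `W_shared` are illustrative stand-ins, not the paper's actual conv-plus-transformer unit.

```python
import numpy as np

def residual_block(x, W):
    # One residual unit: x + ReLU(x @ W). A simplified stand-in for
    # the conv + transformer unit described in the abstract.
    return x + np.maximum(x @ W, 0.0)

rng = np.random.default_rng(0)
d, T = 64, 4                          # feature width, recursion depth
x = rng.standard_normal((1, d))

# Recursive learning: the SAME weight matrix is reused T times,
# so depth grows while the parameter count stays constant.
W_shared = rng.standard_normal((d, d)) * 0.01
h = x
for _ in range(T):
    h = residual_block(h, W_shared)
shared_params = W_shared.size

# Conventional stacking: T distinct blocks, hence T times the parameters.
stacked_params = sum(rng.standard_normal((d, d)).size for _ in range(T))

print(shared_params, stacked_params)
```

Under this toy setup the recursive variant stores one d-by-d matrix regardless of depth, while plain stacking stores T of them — the same trade-off that lets RRTrN cut parameters without losing depth.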
