Abstract

In recent years, scene text detection methods have made significant progress with the rise of neural networks. However, because of the limited receptive field of convolutional neural networks (CNNs) and the simple representation of text with rectangular bounding boxes, previous methods may be insufficient for more challenging text instances. To solve this problem, this paper proposes a scene text detection network based on cross-scale feature fusion (CSFF-Net). The framework is built on the lightweight backbone network ResNet, and feature learning is enhanced by embedding the depth weighted convolution module (DWCM) while retaining the original feature information extracted by the CNN. A 3D-Attention module is also introduced to merge the context information of adjacent areas and thereby refine the features at each spatial scale. In addition, the Feature Pyramid Network (FPN) processes cross-layer information flow by simple element-wise addition, which cannot completely solve the interdependence problem. This paper therefore introduces a Cross-Level Feature Fusion Module (CLFFM) into FPN; the resulting network is called the Cross-Level Feature Pyramid Network (Cross-Level FPN). The proposed CLFFM handles cross-layer information flow better and outputs detailed feature information, thus improving the accuracy of text region detection. Compared with the original network framework, the proposed framework performs better at detecting text in images of complex scenes, and extensive experiments on three challenging datasets validate the feasibility of our approach.

Highlights

  • Text has become one of the essential means of conveying information in the contemporary world, and there is a wide variety of textual information in the social scenes we live in

  • Existing convolutional neural network (CNN)-based text detection algorithms [1,2] can be divided into roughly two categories: regression-based and segmentation-based methods

  • To capture shallow-layer and deep-layer feature information together, we propose the Cross-Level Feature Pyramid Network, which models the feature information extracted on two adjacent feature layers to further enhance feature extraction


Summary

Introduction

Text has become one of the essential means of conveying information in the contemporary world, and the scenes we live in contain a wide variety of textual information. Scene text detection is the process of locating text regions in an image with a detection network and representing them with polygonal bounding boxes. Existing CNN-based text detection algorithms [1,2] can be divided into roughly two categories: regression-based and segmentation-based methods. In regression-based scene text detection algorithms [3,4,5,6,7,8,9,10,11,12], text objects are usually represented as rectangles or quadrilaterals with a certain orientation. Although detection is fast and avoids the errors that accumulate over multiple stages, most existing regression-based methods cannot handle the text detection problem accurately and efficiently because of the limited form of the text representation (axis-aligned rectangles, rotated rectangles or quadrilaterals). In particular, they do not perform well on curved text in datasets such as Total-Text [13], which hampers subsequent text recognition in the overall optical character recognition (OCR) engine.
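To make the limitation of rectangular representations concrete, the following minimal sketch compares the area of a curved text region with the area of its axis-aligned bounding box. The curved text line is modelled as a half-ring sampled as a polygon; this shape, the radii, and all function names are illustrative assumptions and are not taken from the paper or from any dataset.

```python
import math


def shoelace_area(points):
    """Area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def arc_text_polygon(r_inner, r_outer, start_deg, end_deg, steps=32):
    """Hypothetical curved text line: an annular sector sampled as a polygon."""
    angles = [
        math.radians(start_deg + (end_deg - start_deg) * i / steps)
        for i in range(steps + 1)
    ]
    # Walk along the outer arc, then back along the inner arc to close the shape.
    outer = [(r_outer * math.cos(a), r_outer * math.sin(a)) for a in angles]
    inner = [(r_inner * math.cos(a), r_inner * math.sin(a)) for a in reversed(angles)]
    return outer + inner


def axis_aligned_box_area(points):
    """Area of the smallest axis-aligned rectangle enclosing all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Assumed example: a text line bent into a half circle, 20 units tall.
polygon = arc_text_polygon(r_inner=80.0, r_outer=100.0, start_deg=0.0, end_deg=180.0)
text_area = shoelace_area(polygon)
box_area = axis_aligned_box_area(polygon)
fill_ratio = text_area / box_area

print(f"polygon area: {text_area:.1f}")
print(f"bounding box area: {box_area:.1f}")
print(f"fraction of the box covered by text: {fill_ratio:.3f}")
```

In this assumed example the text region covers less than a third of its axis-aligned box, so most of the box is background. A polygonal representation, as used for curved-text benchmarks, avoids that slack.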

