Abstract

Accurate and efficient text detection in natural scenes is a fundamental yet challenging task in computer vision, especially when dealing with arbitrarily-oriented texts. Most contemporary text detection methods are designed to identify horizontal or approximately horizontal text, which cannot satisfy practical detection requirements for real-world images such as image streams or videos. To address this gap, we propose a novel method called Rotational You Only Look Once (R-YOLO), a robust real-time convolutional neural network (CNN) model for detecting arbitrarily-oriented text in natural scene images. First, a rotated anchor box with angle information is used as the text bounding box over various orientations. Second, features of various scales are extracted from the input image to determine the probability, confidence, and inclined bounding boxes of the text. Finally, Rotational Distance Intersection over Union Non-Maximum Suppression (R-DIoU NMS) is used to eliminate redundant detections and retain the most accurate ones. Benchmark experiments are conducted on four popular datasets, i.e., ICDAR2015, ICDAR2013, MSRA-TD500, and ICDAR2017-MLT. The results indicate that the proposed R-YOLO method significantly outperforms state-of-the-art methods in terms of detection efficiency while maintaining high accuracy; for example, it achieves an F-measure of 82.3% at 62.5 fps at 720p resolution on the ICDAR2015 dataset.
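
As a rough illustration of the suppression step, the sketch below implements a greedy NMS over rotated rectangles whose overlap score is the rotated IoU minus a normalized center-distance penalty (a DIoU-style term). The box format (cx, cy, w, h, theta, score), the helper names, and the use of shapely for polygon intersection are illustrative assumptions, not the exact R-DIoU NMS formulation used in the paper.

    # Illustrative sketch only; not the authors' exact R-DIoU NMS implementation.
    import numpy as np
    from shapely.geometry import Polygon

    def rotated_corners(cx, cy, w, h, theta):
        # 4 corners of a rectangle centered at (cx, cy), rotated by theta radians.
        c, s = np.cos(theta), np.sin(theta)
        offsets = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
        rot = np.array([[c, -s], [s, c]])
        return offsets @ rot.T + np.array([cx, cy])

    def rotated_diou(box_a, box_b):
        # Rotated IoU minus a normalized center-distance penalty (DIoU-style score).
        pa = Polygon(rotated_corners(*box_a[:5]))
        pb = Polygon(rotated_corners(*box_b[:5]))
        inter = pa.intersection(pb).area
        union = pa.area + pb.area - inter
        iou = inter / union if union > 0 else 0.0
        center_dist2 = (box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
        minx, miny, maxx, maxy = pa.union(pb).bounds  # bounds of the enclosing region
        diag2 = (maxx - minx) ** 2 + (maxy - miny) ** 2
        return iou - center_dist2 / diag2 if diag2 > 0 else iou

    def rotated_diou_nms(boxes, threshold=0.5):
        # Greedy suppression: keep the highest-scoring box, drop boxes overlapping it too much.
        # boxes: list of (cx, cy, w, h, theta, score) tuples.
        remaining = sorted(boxes, key=lambda b: b[5], reverse=True)
        keep = []
        while remaining:
            best = remaining.pop(0)
            keep.append(best)
            remaining = [b for b in remaining if rotated_diou(best, b) < threshold]
        return keep

For example, calling rotated_diou_nms(detections, threshold=0.5) on the predicted inclined boxes would return only the highest-confidence, non-redundant detections.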

Highlights

  • Texts in natural scenes, including road traffic signs, billboards, and shopping mall signs, play a crucial role in our daily lives, providing essential information about society and our environment

  • Compared with SegLink [18], He et al. [25], EAST [19], He et al. [40], DSRN [41], TextBoxes++ [24], and RRD [20], which are one-step methods, our F-measure is higher by 7.3%, 5.3%, 1.6%, 1.3%, 0.9%, 0.6%, and 0.1%, respectively

  • The recall rate is improved from 71.5% to 82.9%, and the F-measure is improved from 80.1% to 86.4%, while the speed decreases by only 0.2 fps


Summary

Introduction

Texts in natural scenes, such as road traffic signs, billboards, and shopping mall signs, play a crucial role in our daily lives, providing essential information about society and our environment. Compared with standard text in documents or on the internet, texts in natural scenes are highly varied in size, font, color, language, and orientation. They are often captured under varying illumination, against complex backgrounds, and from multiple viewing angles, which makes text detection and recognition challenging. Several methods have attempted to address the arbitrarily-oriented text detection problem [14,15,16,17,18,19,20,21,22,23,24,25]. These methods follow a two-stage strategy based on deep CNNs.
