A Large Chinese Text Dataset in the Wild

Tai-Ling Yuan, Zhe Zhu, Cheng-Jun Li, Shi-Min Hu, Kun Xu, Tai-Jiang Mu

doi:10.1007/s11390-019-1923-y

Abstract

In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for some more complicated character sets such as Chinese text. Lack of training data has always been a problem, especially for deep learning methods which require massive training data. In this paper, we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3 850 unique ones annotated by experts in over 30 000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character’s background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
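The per-character annotation described in the abstract (underlying character, bounding box, six attributes) could be modeled roughly as follows. This is only an illustrative sketch: the class name, field names, the (x, y, width, height) box layout, the yes/no attribute type, and the attribute key `complex_background` are assumptions made for this example and are not taken from the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class CharAnnotation:
    """One annotated character instance (hypothetical schema, not the dataset's format)."""
    text: str                                # the underlying character
    bbox: Tuple[float, float, float, float]  # assumed layout: (x, y, width, height) in pixels
    # Attribute flags; treating them as yes/no values and the key names are assumptions.
    attributes: Dict[str, bool] = field(default_factory=dict)


def unique_characters(annotations: List[CharAnnotation]) -> Set[str]:
    """Return the set of distinct characters among the annotated instances."""
    return {a.text for a in annotations}


def with_attribute(annotations: List[CharAnnotation], name: str) -> List[CharAnnotation]:
    """Return the instances whose attribute `name` is set to True."""
    return [a for a in annotations if a.attributes.get(name, False)]


# Made-up sample records, for illustration only.
sample = [
    CharAnnotation("中", (10.0, 20.0, 32.0, 32.0), {"complex_background": True}),
    CharAnnotation("文", (44.0, 20.0, 32.0, 32.0), {"complex_background": False}),
    CharAnnotation("中", (5.0, 80.0, 16.0, 16.0), {"complex_background": True}),
]

print(len(sample), "instances,", len(unique_characters(sample)), "unique characters")
```

The same distinction between character instances and unique characters underlies the abstract's figures of about 1 million instances drawn from 3 850 unique characters.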

Journal: Journal of Computer Science and Technology
Publication Date: May 1, 2019
Citations: 73

Similar Papers

Detecting of Vertically-Oriented Texts in Images Containing Natural Scenes
Yi Ling Ong ... Almon Chai
07 Dec 2020

A Hybrid Deep Neural Network for Urdu Text Recognition in Natural Images
Asghar Ali ... Mark Pickering
01 Jul 2019

Text Proposals Based on Windowed Maximally Stable Extremal Region for Scene Text Detection
Feng Su ... Lan Wang
01 Nov 2017

A novel text structure feature extractor for Chinese scene text detection and recognition
Xiaohang Ren ... Xiaokang Yang
01 Dec 2016
