A large-scale dataset for Chinese historical document recognition and analysis

Yongxin Shi,Dezhi Peng,Yuyi Zhang,Jiahuan Cao,Lianwen Jin

doi:10.1038/s41597-025-04495-x

Open Access

https://doi.org/10.1038/s41597-025-04495-x

Journal: Scientific Data
Publication Date: Jan 29, 2025
License type: CC BY-NC-ND

Abstract

The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to use deep-learning techniques to automate recognition and analysis. However, the existing Chinese historical document datasets on which deep-learning models heavily rely suffer from limited scale, insufficient character categories, and a lack of book-level annotations. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, more than 200 times the scale of existing datasets. Additionally, it is the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of HisDoc1B. We believe that HisDoc1B can provide valuable resources to advance research in this domain.
