Abstract

Oracle bone script (OBS) is the earliest known ancient Chinese writing system and the ancestor of modern Chinese. As the Chinese writing system is the oldest continuously used system in the world, the study of OBS plays an important role in both linguistic and historical research. To enable advanced machine learning methods to process OBS automatically, we construct an information system for OBS (IsOBS) that symbolizes, serializes, and stores OBS data at the character level, based on efficient databases and retrieval modules. We also apply few-shot learning methods to build an effective OBS character recognition module, which can recognize a large number of OBS characters (especially those with only a handful of examples) and makes the system easy to use. The demo system of IsOBS can be found at http://isobs.thunlp.org/. In the future, we will add more OBS data to the system, and we hope that IsOBS can support further efforts in automatically processing OBS and advance scientific progress in this field.

Highlights

  • Oracle bone script (OBS) refers to characters carved on animal bones or turtle plastrons

  • As shown in Figure 2, a single OBS document, carved on one animal bone or turtle plastron, contains anywhere from fewer than ten to more than one hundred characters

  • As OBS was used for divination in ancient China, these documents cover a variety of topics, including war, ceremonial sacrifice, agriculture, as well as births, illnesses, and deaths of royal members (Flad et al., 2008)


Summary

Introduction

Oracle bone script (OBS) refers to characters carved on animal bones or turtle plastrons. OBS documents constitute the earliest Chinese textual corpora, and analyzing and understanding OBS is of great significance to historical research. Because manually processing ancient languages is often complicated and time-consuming, some efforts have been devoted to applying machine learning techniques in this field. The above three challenges make it difficult to use existing machine learning methods for understanding OBS, and the third one is the most crucial. To this end, we construct an information system for OBS (IsOBS) to symbolize and serialize OBS data at the character level, so that we can utilize machine learning methods to process OBS in the future: (1) We construct an OBS character database, where each character is matched to its corresponding modern Chinese character (if it has been deciphered) and incorporates a variety of its glyphs. We believe IsOBS can serve as a foundation to support further research (especially character recognition and language understanding) on automatically processing OBS in the future.
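
The character database described above (each OBS character linked to a modern Chinese character when deciphered, plus several glyph variants) could be modeled roughly as follows. This is only a minimal sketch under stated assumptions: the class names, field names, identifiers, and file names are hypothetical illustrations and do not reflect the actual IsOBS schema or storage backend.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OBSCharacter:
    """One OBS character entry (hypothetical schema, not the IsOBS one)."""
    char_id: str                   # hypothetical internal identifier
    modern: Optional[str] = None   # modern Chinese character; None if undeciphered
    glyphs: List[str] = field(default_factory=list)  # glyph image names (made up)


class CharacterDatabase:
    """In-memory stand-in for a character-level OBS database."""

    def __init__(self) -> None:
        self._by_id: Dict[str, OBSCharacter] = {}
        self._by_modern: Dict[str, List[str]] = {}

    def add(self, entry: OBSCharacter) -> None:
        # Reject duplicate identifiers so the modern-character index stays consistent.
        if entry.char_id in self._by_id:
            raise ValueError(f"duplicate character id: {entry.char_id}")
        self._by_id[entry.char_id] = entry
        if entry.modern is not None:
            self._by_modern.setdefault(entry.modern, []).append(entry.char_id)

    def find_by_modern(self, modern: str) -> List[OBSCharacter]:
        """Return every OBS entry matched to the given modern character."""
        return [self._by_id[i] for i in self._by_modern.get(modern, [])]

    def undeciphered(self) -> List[OBSCharacter]:
        """Return entries that have no modern counterpart yet."""
        return [e for e in self._by_id.values() if e.modern is None]


db = CharacterDatabase()
db.add(OBSCharacter("obs-0001", modern="人", glyphs=["g1.png", "g2.png"]))
db.add(OBSCharacter("obs-0002", glyphs=["g3.png"]))  # not yet deciphered
```

Keeping `modern` optional mirrors the "if it has been deciphered" condition, and storing a list of glyphs per entry reflects that one character can appear in several written forms.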

Application Scenarios
Character Database for OBS
Document Database for OBS
Character Recognition and Information Retrieval Modules
System Framework and Details
OBS Databases
Character Recognition Module
Experiment and Evaluation
Dataset
Evaluation Metric
Neural Network Hyper-Parameters
Overall Results
