Abstract

The identification of names and dates in larger corpora of historical texts is important for both traditional and digitally mediated research; it is part of reading as well as of exploring digital corpora. This paper introduces a number of issues concerning named entity recognition (NER) for classical Chinese. In particular, it introduces the “Digital Archive of Buddhist Temple Gazetteers” (http://buddhistinformatics.ddbc.edu.tw/fosizhi/) as a benchmark corpus for NER on classical Chinese and illustrates how marked-up corpora can provide answers to questions that could not otherwise be addressed. The “Digital Archive of Buddhist Temple Gazetteers” is an open-source, open-access archive of local histories of Chinese Buddhist sites. Names and dates were encoded in XML/TEI and linked to authority databases. The archive, which contains classical texts in a variety of genres, can serve as test data for experiments in NER and part-of-speech (POS) tagging. The data is made available as part of the article. We also show that for classical Chinese even a custom-made person name dictionary, created during the markup of the corpus, cannot in turn be used to parse the same corpus successfully without further intervention.
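The limits of dictionary lookup mentioned above can be illustrated with a minimal sketch. It assumes a plain greedy longest-match search over unsegmented text, which is not necessarily the method used in the paper. The function name `find_names`, the two-entry dictionary, and both sample strings are constructed for illustration and are not taken from the archive.

```python
def find_names(text, names):
    """Greedy longest-match lookup of dictionary names in unsegmented text.

    Returns a list of (start_index, matched_name) tuples.
    """
    if not names:
        return []
    max_len = max(len(name) for name in names)
    hits = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in names:
                hits.append((i, candidate))
                i += n
                break
        else:
            # No dictionary entry starts at this position.
            i += 1
    return hits


# Hypothetical two-entry person name dictionary (illustrative only).
names = {"道安", "慧遠"}

# Constructed sentence in which both strings really are person names.
genuine = find_names("慧遠師事道安", names)

# Constructed phrase ("cultivate the way, calm the mind") in which the
# characters 道安 straddle a word boundary and do not name a person.
spurious = find_names("修道安心", names)

print(genuine)
print(spurious)
```

Because classical Chinese is written without word boundaries, the lookup reports 道安 in both strings, although only the first refers to a person. Separating such cases requires context or manual review, which is one way to read the "further intervention" noted above.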
