Abstract

Information extraction from historical text is challenging because of the lack of data to train natural language processing tools. This chapter evaluates the utility of in-domain training data for data-driven profiling of characters, verbs, and toponyms and reports a case study on a corpus of Chinese Buddhist text. As is typical for such a corpus, the Chinese Buddhist Canon has few annotated linguistic resources other than lexica of names, places, and domain-specific terms. We apply a lexicon-based approach for named entity recognition and then report an analysis of the “who,” “what,” and “where” of the Canon: who the characters were, what they did, and where they were. Experimental results also show that even a small amount of word segmentation, part-of-speech, and dependency annotation can improve accuracy in named entity recognition and in extraction of character-verb associations.
