Abstract

There is a small but growing literature on large-scale statistical modeling of Chinese language texts. Ouyang analyzed a corpus of over 40,000 ancient documents downloaded from multiple sources, using it to plot the temporal distributions of word frequencies and the geographic distributions of authors. Huang and Yu modeled the SongCi poetry corpus, first converting it to tonally marked pinyin to preserve poetically important pronunciation information. Nichols and colleagues reported initial modeling of the Chinese Text Project corpus in a conference paper. (Further below, we describe differences between this corpus and the Handian.) With additional collaborators, this group has now conducted two studies that are currently unpublished but under review. In the first, they apply topic models to address scholarly questions about the relationships among important texts of ancient Chinese philosophy. In the second, they use topic modeling to investigate the concepts of mind and body in ancient Chinese philosophy. Although we share similar scholarly objectives with these researchers, our approach in this paper is unique in that, for the first time anywhere, we bring the benefits of computational modeling of ancient Chinese texts to a robust public platform that is mirrored on both sides of the Pacific. More than just a useful portal to the texts, our approach foregrounds the interpretive issues surrounding topic models and makes more sophisticated exploration and analysis of interpretive questions possible for experts and novices alike.

Highlights

  • There is a small but growing literature on large-scale statistical modeling of Chinese language texts

  • Ouyang analyzed a corpus of over 40,000 ancient documents downloaded from multiple sources

  • Although we share similar scholarly objectives with these researchers, our approach in this paper is unique in that, for the first time anywhere, we bring the benefits of computational modeling of ancient Chinese texts to a robust public platform that is mirrored on both sides of the Pacific

Summary

Introduction

There is a small but growing literature on large-scale statistical modeling of Chinese language texts. Ouyang analyzed a corpus of over 40,000 ancient documents downloaded from multiple sources, using it to plot the temporal distributions of word frequencies and the geographic distributions of authors.[1] Huang and Yu modeled the SongCi poetry corpus, first converting it to tonally marked pinyin to preserve poetically important pronunciation information.[2] Nichols and colleagues reported initial modeling of the Chinese Text Project corpus[3] in a conference paper. With additional collaborators, this group has since conducted two studies. In the first, they apply topic models to address scholarly questions about the relationships among important texts of ancient Chinese philosophy. In the second, they use topic modeling to investigate the concepts of mind and body in ancient Chinese philosophy.[4] Although we share similar scholarly objectives with these researchers, our approach in this paper is unique in that, for the first time anywhere, we bring the benefits of computational modeling of ancient Chinese texts to a robust public platform that is mirrored on both sides of the Pacific. Rather than try to demarcate “philosophy” from the rest, we decided to pursue our computational inquiry with as broad a corpus as we could locate.
