Categorical ambiguity and information content

Chu-Ren Huang, Ru-Yng Chang

doi:10.3115/1118824.1118829

Abstract

Assignment of grammatical categories is the fundamental step in natural language processing, and ambiguity resolution is one of the most challenging NLP tasks, one that is still beyond the power of machines. When the two questions are combined, resolution of categorical ambiguity is a problem that computational linguistic systems handle reasonably well but on which they still cannot match the excellence of human beings. The task is even more challenging in Chinese language processing because of the poverty of morphological information to mark categories and the lack of a convention to mark word boundaries. In this paper, we investigate the nature of categorical ambiguity in Chinese based on the Sinica Corpus. The study differs crucially from previous studies in that it directly measures information content as the degree of ambiguity. This method not only offers an alternative interpretation of ambiguity but also allows a different measure of success for categorical disambiguation: instead of precision or recall, we can measure by how much the information load has been reduced. The approach also allows us to identify which words are the most ambiguous in terms of information content. The somewhat surprising result reinforces the Saussurean view that, underlying the systemic linguistic structure, the assignment of linguistic content to each linguistic symbol is arbitrary.
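
As an illustration of measuring ambiguity by information content, the sketch below computes the Shannon entropy of a word's category distribution and the reduction in information load after disambiguation. The abstract does not state the exact formula, so the use of Shannon entropy is an assumption here, and the function names and the toy tag counts are hypothetical rather than taken from the Sinica Corpus.

```python
import math
from collections import Counter


def category_entropy(tags):
    """Shannon entropy (in bits) of the category tags observed for one word.

    Assumption: ambiguity is modeled as the entropy of the word's
    category distribution; the abstract does not give the formula.
    """
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one tag is required")
    # sum of p * log2(1/p) over all observed categories
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def information_reduction(tags_before, tags_after):
    """Bits of ambiguity removed by a (hypothetical) disambiguation step."""
    return category_entropy(tags_before) - category_entropy(tags_after)


# Hypothetical toy data: a word seen equally often as noun (N) and verb (V).
ambiguous = ["N", "N", "V", "V"]
# After disambiguation in a given context, only one category remains.
resolved = ["N", "N", "N", "N"]

print(category_entropy(ambiguous))
print(information_reduction(ambiguous, resolved))
```

Under this reading, a word whose occurrences are spread evenly over many categories carries more information load than a word with one dominant category, even if both have the same number of possible categories.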

Full Text