Related Factors of Document Classification Performance in a Highly Inflectional Language

Kyongho Min

doi:10.1007/978-3-540-45080-1_87

Abstract

This paper describes the relationships between document classification performance and its related factors for a highly inflectional language that forms monolithic compound noun terms. The factors are the number of class feature sets, the size of the training or testing documents, the ratio of overlapping class features among 8 classes, and the ratio of non-overlapping class feature sets. The system is composed of three phases: a Korean morphological analyser called HAM [11]; compound noun phrase analysis together with extraction of terms whose syntactic categories are noun, name, verb, or adjective; and an effective document classification algorithm based on preferred class score heuristics. The best algorithm in this paper, Weighted PCSICF, which is based on inverse class frequency, shows an inversely proportional relationship between its performance and both the number of class feature sets and the ratio of non-overlapping class feature sets.
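
To make the idea of inverse class frequency concrete, the following is a minimal sketch and not the paper's definition of Weighted PCSICF, which the abstract does not spell out. It assumes that each class is represented by a set of feature terms, that a term's weight is log(N / cf), where N is the number of classes and cf is the number of classes whose feature set contains the term, and that a document is assigned to the class with the highest sum of term frequency times weight. The function names and the toy English terms are hypothetical.

```python
import math
from collections import Counter


def inverse_class_frequency(class_features):
    """Weight each term by log(N / cf): N classes, cf classes containing the term.

    Assumption: this is a generic ICF weighting, not the paper's exact formula.
    """
    n_classes = len(class_features)
    # Each class contributes a term at most once because features are sets.
    cf = Counter(term for feats in class_features.values() for term in feats)
    return {term: math.log(n_classes / count) for term, count in cf.items()}


def classify(doc_terms, class_features):
    """Return (best_class, scores) using summed tf * icf over matching features."""
    icf = inverse_class_frequency(class_features)
    tf = Counter(doc_terms)
    scores = {
        label: sum(tf[t] * icf[t] for t in feats if t in tf)
        for label, feats in class_features.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: three classes with partly overlapping feature sets.
classes = {
    "sports": {"game", "team", "score"},
    "economy": {"market", "stock", "score"},
    "politics": {"vote", "party", "team"},
}
best, scores = classify(["team", "game", "game", "market"], classes)
```

Under this weighting, a term that appears in every class feature set receives weight zero, so only terms that discriminate between classes contribute to the score. That is one way to see why the degree of overlap among class features is a natural factor to study.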

Full Text