Abstract

This paper compares the effectiveness of document classification algorithms for Korean, a highly inflectional and derivational language that forms monolithic compound noun terms. The system is composed of three phases: (1) morphological analysis with a Korean morphological analyser called HAM [10], (2) compound noun phrase analysis and extraction of terms whose syntactic categories are noun, proper noun, verb, and adjective, and (3) document classification with various algorithms based on preferred class score heuristics. We focus on comparing document classification methods, including a simple voting method and preferred class score heuristics that employ two factors, ICF (inverse class frequency) and IDF (inverse document frequency), with and without term frequency weighting. In addition, this paper compares algorithms that use different class feature sets filtered by the four syntactic categories. Relative to the algorithms that do not use syntactic information for class feature sets, the algorithms that do use it differ in performance by between -3.3% and +4.7%. Of the 20 algorithms tested, PCSIDF FV (i.e. Filtering Verb Terms) and Weighted PCSIDF FV show the best performance, with an F-measure of 74.2%. For the Weighted PCSICF algorithm, using syntactic information to select class feature sets decreased document classification performance by 1.3% to 3.3%.
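
The abstract names the ICF factor but does not give its formula, so the following is only a rough sketch under stated assumptions, not the paper's actual method. It assumes ICF is defined by analogy with IDF as log(number of classes / number of classes containing the term), and that a document's preferred class is the one with the highest summed score over its terms. The function names `train_icf` and `classify` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_icf(class_docs):
    """Build per-class term counts and an ICF table.

    class_docs maps a class label to a list of tokenized documents.
    ASSUMPTION: icf(t) = log(n_classes / number of classes containing t).
    """
    class_terms = {
        label: Counter(term for doc in docs for term in doc)
        for label, docs in class_docs.items()
    }
    n_classes = len(class_terms)
    # Each class contributes at most 1 to a term's class frequency.
    class_freq = Counter(term for terms in class_terms.values() for term in terms)
    icf = {term: math.log(n_classes / cf) for term, cf in class_freq.items()}
    return class_terms, icf


def classify(tokens, class_terms, icf, weighted=False):
    """Return the class with the highest summed ICF score, or None.

    With weighted=True, each term's score is multiplied by its
    frequency in the input document (term frequency weighting).
    """
    scores = defaultdict(float)
    for term, tf in Counter(tokens).items():
        for label, terms in class_terms.items():
            if term in terms:
                scores[label] += icf[term] * (tf if weighted else 1)
    return max(scores, key=scores.get) if scores else None


# Toy data, invented for illustration only.
class_docs = {
    "sports": [["game", "team", "win"]],
    "finance": [["market", "stock", "win"]],
}
class_terms, icf = train_icf(class_docs)
predicted = classify(["team", "win", "game"], class_terms, icf)
```

Under this assumed definition, a term that occurs in every class (here `win`) gets an ICF of zero and so does not influence the decision, while class-specific terms dominate the score. An IDF variant would count documents rather than classes in the same way.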
