Improving Domain Dictionary-Based Text Categorization Using Self-Partition Model

Wenliang Chen, Jingbo Zhu, Muhua Zhu, Li Zhang, and Tianshun Yao
Natural Language Processing Lab, Northeastern University, Shenyang, 110004, P. R. China

International Journal of Computer Processing of Languages, Vol. 18, No. 03, pp. 197-210 (2005)
https://doi.org/10.1142/S0219427905001304

Abstract

In this paper, we present a novel model for improving the performance of Domain Dictionary-based text categorization. The proposed model is named the Self-Partition Model (SPM). SPM groups candidate words into predefined clusters, which are generated according to the structure of the Domain Dictionary. Using these learned clusters as features, we propose a novel text representation. The experimental results show that a text categorization system based on the proposed representation performs better than the Domain Dictionary-based text categorization system. It also performs better than a system based on Bag-of-Words when the number of features is small and the training corpus is small.

This research was supported in part by the National Natural Science Foundation of China and Microsoft Asia Research (No. 60203019), the National Natural Science Foundation of China (No. 60473140) and the Key Project of the Chinese Ministry of Education (No. 104065).

Keywords: Text Categorization; Text Representation; Domain Knowledge; Word Clustering

