Machine learning for Asian language text classification

Fuchun Peng, Xiangji Huang

doi:10.1108/00220410710743306

Abstract

Purpose: This research compares several machine learning techniques on text classification for Asian languages such as Chinese and Japanese, in which written text carries no word boundary information. The paper advocates a simple language modeling based approach for this task.

Design/methodology/approach: Naïve Bayes, maximum entropy models, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were applied to Chinese text, and a segmentation-based approach was compared with a non-segmentation-based approach.

Findings: There were two findings. First, the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features. Second, classification with word-level features normally yields improved classification performance, but classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but beyond a certain level of segmentation accuracy it stops improving and can even decrease.

Practical implications: Applying the findings to real web text classification is ongoing work.

Originality/value: The paper is highly relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
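To make the non-segmentation-based language modeling idea concrete, here is a minimal sketch of a classifier that trains one character n-gram language model per class and assigns a document to the class whose model gives it the highest probability. The abstract does not specify the n-gram order or the smoothing method, so the bigram order, the add-one smoothing, the class name `CharNgramLMClassifier`, and the toy training sentences are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


class CharNgramLMClassifier:
    """One character n-gram language model per class.

    Assumptions (not from the paper): add-one smoothing, bigrams by default.
    """

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> (context, char) counts
        self.context_counts = defaultdict(Counter)  # class -> context counts
        self.class_docs = Counter()                 # class -> number of training docs
        self.vocab = set()                          # all characters seen in training

    def _ngrams(self, text):
        # Pad with a start symbol so the first character also has a context.
        padded = "\x02" * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for context, char in self._ngrams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
                self.vocab.add(char)
        return self

    def log_score(self, text, label):
        # log P(class) + sum of log P(char | context, class), add-one smoothed.
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total_docs = sum(self.class_docs.values())
        score = math.log(self.class_docs[label] / total_docs)
        for context, char in self._ngrams(text):
            num = self.ngram_counts[label][(context, char)] + 1
            den = self.context_counts[label][context] + v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.class_docs, key=lambda label: self.log_score(text, label))


# Toy, made-up training data: raw Chinese strings, no word segmentation applied.
train_texts = ["股票市场今日上涨", "银行下调贷款利率", "球队赢得冠军比赛", "球员打进制胜一球"]
train_labels = ["finance", "finance", "sports", "sports"]

clf = CharNgramLMClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("股票上涨"))
print(clf.predict("球队比赛"))
```

Because the model works directly on characters, no word segmenter is needed; a segmentation-based variant would feed word tokens instead of characters into the same counting scheme.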


Journal: Journal of Documentation
Publication Date: May 1, 2007
Citations: 45

Similar Papers

A Hybrid Algorithm for Text Classification Based on CNN-BLSTM with Attention
Lei Fu ... Yi Liu
01 Nov 2018

Improving the effectiveness of language modeling approaches to information retrieval
Yuanhua Lv
ACM SIGIR Forum | VOL. 46
21 Dec 2012

A Chinese Character-Level and Word-Level Complementary Text Classification Method
Wentong Chen ... Yuexin Wu
01 Dec 2020

Domain-Aligned Data Augmentation for Low-Resource and Imbalanced Text Classification
Nikolaos Stylianou ... Despoina Chatzakou
01 Jan 2023
