Chinese Text Classification Research Articles

This paper discusses text representation and feature selection strategies for Chinese text classification based on n-grams. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four feature selection methods and three text representation weights are compared in exhaustive experiments. Both a C-SVC classifier and a Naive Bayes classifier are used to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which includes more than 14,000 texts divided into 12 classes. The experiments concern: (1) a performance comparison among different feature selection strategies: absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency; (2) a comparison of the sparseness and feature correlation in the “text by feature” matrices produced by the four feature selection methods; (3) a performance comparison among three term weights: 0/1 logical value, n-gr...
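To make the terms in this abstract concrete, here is a minimal sketch of character n-gram extraction, selection by absolute text frequency, and 0/1 logical weighting. It assumes that "absolute text frequency" means the number of texts in which an n-gram occurs; the abstract does not define it, and the two-step strategy itself is not reproduced here. The function names, the `min_df` threshold, and the toy texts are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def select_by_text_frequency(texts, n=2, min_df=2):
    """Keep n-grams that occur in at least min_df texts.

    Assumption: this threshold on the number of containing texts stands in
    for the abstract's "absolute text frequency" criterion.
    """
    df = Counter()
    for text in texts:
        # Count each n-gram once per text, however often it repeats.
        df.update(set(char_ngrams(text, n)))
    return sorted(gram for gram, count in df.items() if count >= min_df)


def binary_vector(text, features, n=2):
    """Represent a text with 0/1 logical weights over the selected n-grams."""
    present = set(char_ngrams(text, n))
    return [1 if feature in present else 0 for feature in features]


# Toy texts, invented for illustration only.
texts = ["我喜欢足球", "足球比赛", "我喜欢音乐"]
features = select_by_text_frequency(texts, n=2, min_df=2)
matrix = [binary_vector(text, features) for text in texts]
```

Each row of `matrix` is one row of a "text by feature" matrix. Working on character n-grams avoids the need for a word segmenter, which is the usual motivation for n-gram features in Chinese.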

Purpose: The purpose of this research is to compare several machine learning techniques on the task of text classification for Asian languages such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task.

Design/methodology/approach: Naïve Bayes, maximum entropy, support vector machine, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were applied to Chinese text. A segmentation-based approach was compared with a non-segmentation-based approach.

Findings: There were two findings. First, the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features. Second, classification with word-level features normally yields improved classification performance, but classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but it eventually stops improving, and can even decrease, after a certain level of segmentation accuracy.

Practical implications: Applying the findings to real web text classification is ongoing work.

Originality/value: The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
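As an illustration of what a language modeling based classifier without segmentation can look like, the following sketch trains one character bigram model per class and assigns a text to the class whose model gives it the highest log-likelihood. This is an assumption-laden toy, not the paper's model: the add-one smoothing, the uniform class prior, the `^` start marker, the class name `CharBigramLMClassifier`, and the training texts are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


class CharBigramLMClassifier:
    """One add-one-smoothed character bigram model per class (a sketch)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # class -> bigram counts
        self.contexts = defaultdict(Counter)  # class -> counts of the first character
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            padded = "^" + text  # "^" marks the start of the text
            self.vocab.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[label][a + b] += 1
                self.contexts[label][a] += 1
        return self

    def log_likelihood(self, text, label):
        padded = "^" + text
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            numerator = self.bigrams[label][a + b] + 1
            denominator = self.contexts[label][a] + v
            total += math.log(numerator / denominator)
        return total

    def predict(self, text):
        # Uniform class prior assumed: compare likelihoods only.
        return max(self.bigrams, key=lambda c: self.log_likelihood(text, c))


# Toy training data, invented for illustration only.
train_texts = ["足球比赛很精彩", "篮球比赛开始", "音乐会很好听", "我喜欢听音乐"]
train_labels = ["sports", "sports", "music", "music"]
clf = CharBigramLMClassifier().fit(train_texts, train_labels)
prediction = clf.predict("足球比赛")
```

Because the model works directly on characters, no word boundary information is needed; a segmentation-based variant would apply the same idea to word sequences produced by a segmenter.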

Articles published on Chinese Text Classification

A Fast Algorithm for Chinese Text Categorization Based on Key Tree

A logistic regression-based smoothing method for Chinese text categorization

Chinese Text Classification with a KNN Classifier Using an Adjusted Feature Weighting Method

Application of TSVM Incremental Learning in Web Text Categorization

Chinese text classification by the Naïve Bayes Classifier and the associative classifier with multiple confidence threshold values

Chinese short text classification based on hyponymy relation

The Chinese Text Categorization System with Category Priorities

N-grams based feature selection and text representation for Chinese Text Classification

Method for Chinese short text classification based on feature extension

Non-Independent Term Selection for Chinese Text Categorization

Technology for Chinese text categorization based on reverse matching algorithm

The Chinese text categorization system with association rule and category priority

Machine learning for Asian language text classification
