Abstract

This paper discusses text representation and feature selection strategies for n-gram-based Chinese text classification. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four feature selection methods and three text representation weights are compared in exhaustive experiments. Both a C-SVC classifier and a Naive Bayes classifier are used to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which contains more than 14,000 texts divided into 12 classes. Our experiments concern: (1) a performance comparison among different feature selection strategies: absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency; (2) a comparison of the sparseness and feature correlation in the "text by feature" matrices produced by the four feature selection methods; (3) a performance comparison among three term weights: 0/1 logical value, n-gr...
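
The four frequency statistics named above can be illustrated with a minimal Python sketch. The abstract does not define them, so the definitions here are assumptions made for illustration: "text frequency" is taken as the number of texts containing an n-gram, "n-gram frequency" as its total number of occurrences, and the "relative" variants as those counts divided by the number of texts and the total number of n-grams, respectively. The function names `char_ngrams` and `frequency_tables` and the three sample strings are hypothetical and are not taken from the paper or from TanCorpV1.0.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequency_tables(texts, n=2):
    """Compute four frequency statistics per n-gram over a list of texts.

    The definitions below are assumptions for illustration only;
    the paper may define the relative frequencies differently.
    """
    text_freq = Counter()  # number of texts that contain the n-gram
    gram_freq = Counter()  # total occurrences of the n-gram
    for text in texts:
        grams = char_ngrams(text, n)
        gram_freq.update(grams)
        text_freq.update(set(grams))  # count each text at most once

    total_texts = len(texts)
    total_grams = sum(gram_freq.values())
    return {
        g: {
            "abs_text_freq": text_freq[g],
            "rel_text_freq": text_freq[g] / total_texts,
            "abs_ngram_freq": gram_freq[g],
            "rel_ngram_freq": gram_freq[g] / total_grams,
        }
        for g in gram_freq
    }


# Toy example with three short, made-up texts (not from the corpus).
texts = ["中国经济", "中国文化", "经济文化"]
stats = frequency_tables(texts, n=2)
for gram, row in sorted(stats.items()):
    print(gram, row)
```

In a feature selection setting, one would typically keep only the n-grams whose chosen statistic exceeds a threshold; the threshold and the choice of statistic are exactly what the experiments in the paper compare.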
