Abstract

As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract the feature vectors that represent texts, which is very important for text classification. In this paper, we propose SM-CHI, a feature selection algorithm based on synonym merging. An improved CHI formula and synonym merging are used to select feature words, so that classification accuracy improves and the feature dimension is reduced. In addition, for the feature words selected by SM-CHI, we present three weight calculation algorithms to explore the best method for updating feature weights. Finally, we designed three comparative experiments, which show that classification accuracy is highest when using improved CHI formula 2, setting the threshold α to 0.8, and updating the feature weight with the largest weight among the synonyms, respectively.
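To make the merging and weight-update idea concrete, here is a minimal sketch. It is not the paper's algorithm. The function `merge_synonyms`, the greedy grouping strategy, and the toy similarity table are all assumptions made for illustration. The paper's similarity measure is not reproduced here. Only the threshold α = 0.8 and the "largest weight among the synonyms" update rule come from the abstract.

```python
from typing import Callable, Dict

def merge_synonyms(weights: Dict[str, float],
                   similarity: Callable[[str, str], float],
                   alpha: float = 0.8) -> Dict[str, float]:
    """Greedily merge words whose similarity to a group representative is >= alpha.

    Each merged feature keeps the representative's name and the largest
    weight found in its group (hypothetical grouping, for illustration only).
    """
    merged: Dict[str, float] = {}
    # Visit words from highest to lowest weight so that the heaviest word
    # becomes the representative of its synonym group.
    for word in sorted(weights, key=lambda w: (-weights[w], w)):
        for rep in merged:
            if similarity(word, rep) >= alpha:
                # Largest-weight update rule.
                merged[rep] = max(merged[rep], weights[word])
                break
        else:
            # No sufficiently similar representative: start a new group.
            merged[word] = weights[word]
    return merged

# Toy similarity table (made-up values, not from any synonym dictionary).
TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "truck")): 0.6,
}

def toy_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)

weights = {"car": 0.4, "automobile": 0.7, "truck": 0.5}
result = merge_synonyms(weights, toy_similarity)
print(result)  # {'automobile': 0.7, 'truck': 0.5}
```

In this toy run, "car" and "automobile" fall into one group because their similarity (0.9) exceeds α, while "truck" (0.6) stays separate, so the feature dimension drops from three to two.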

Highlights

  • With the development of the Internet, the amount of Chinese text information is growing exponentially

  • This paper mainly studies the influence of feature selection and synonym merging on the accuracy of classification in automatic text classification

  • We presented a new feature selection algorithm named SM-CHI based on an improved CHI [4] formula and synonym merging to achieve efficient feature selection and dimension reduction
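For readers unfamiliar with CHI-based feature selection, the sketch below computes the standard chi-square statistic for a term and a category, then keeps the top-scoring terms. This is the textbook formula, not the improved CHI formula proposed in the paper, which is not given on this page. The function names and the toy counts are assumptions.

```python
from typing import Dict, List

def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate contingency table: treat the term as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def select_top_features(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy counts (made up): one term perfectly tied to the category, one independent of it.
perfect = chi_square(10, 0, 0, 10)
independent = chi_square(5, 5, 5, 5)
top = select_top_features({"a": 1.0, "b": 3.0, "c": 2.0}, 2)
print(perfect, independent, top)  # 20.0 0.0 ['b', 'c']
```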


Summary

Introduction

With the development of the Internet, the amount of Chinese text information is growing exponentially. In one step of the model, an improved TF-IDF method calculates a feature weight for each word to generate the feature vector of each text.

2.1 Classification model

Most current text classification methods are based on the VSM, in which each text is represented as a (feature vector, label) pair. The work in [19] proposed a text feature selection method based on “TongYiCi Cilin” that reduces the feature dimension while preserving data integrity and classification accuracy. The model proposed in this paper, named SM-CHI, is a text classification model based on synonym merging. Unlike [19], we merge synonyms after CHI-based feature selection, and we propose three improved weighting methods for the merged feature words.
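As an illustration of how a VSM feature vector can be built from TF-IDF weights, the following sketch uses the plain textbook TF-IDF (term frequency times log inverse document frequency). It is not the improved TF-IDF method mentioned above, whose definition is not given on this page. The function name `tfidf_vectors` and the toy documents are assumptions.

```python
import math
from collections import Counter
from typing import List

def tfidf_vectors(docs: List[List[str]], features: List[str]) -> List[List[float]]:
    """Build one TF-IDF vector per tokenized document over the given feature words."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vec = []
        for f in features:
            tf = counts[f] / total
            idf = math.log(n / df[f]) if df[f] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Toy corpus (made up); each text becomes a (feature vector, label) pair.
docs = [["spam", "offer", "offer"], ["news", "report"]]
labels = ["spam", "news"]
features = ["offer", "news", "spam"]
vectors = tfidf_vectors(docs, features)
samples = list(zip(vectors, labels))
```

Each entry of `samples` pairs a feature vector with its label, which is the form a VSM-based classifier would take as training input.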

Text classification model based on semantic similarity
Synonym merging algorithm based on “Tong YiCi
Feature selection method based on the synonym merging
14: Select the first 200 words as the features
Conclusions