A feature selection method based on synonym merging in text classification system

Haipeng Yao,Chong Liu,Luyao Wang,Peiying Zhang

doi:10.1186/s13638-017-0950-z

Abstract

As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract the feature vectors that represent texts, which is very important for text classification. In this paper, a feature selection algorithm based on synonym merging, named SM-CHI, is proposed. An improved CHI formula and synonym merging are used to select feature words so that classification accuracy improves and the feature dimension is reduced. In addition, for the feature words selected by SM-CHI, this paper presents three weight calculation algorithms to explore the best feature weight update method. Finally, we designed three comparative experiments and showed that classification accuracy is highest when choosing improved CHI formula 2, setting the threshold α to 0.8, and using the largest weight among the synonyms to update the feature weight.

Highlights

With the development of the Internet, the amount of Chinese text information shows an exponential growth trend
This paper mainly studies the influence of feature selection and synonym merging on the accuracy of classification in automatic text classification
We presented a new feature selection algorithm named SM-CHI based on an improved CHI [4] formula and synonym merging to achieve efficient feature selection and dimension reduction

Summary

Introduction

With the development of the Internet, the amount of Chinese text information is growing exponentially. In the feature weighting step, an improved TF-IDF method is used to calculate the feature weight of each word and to generate the feature vector of each text.

2.1 Classification model

Most current text classification methods are based on VSM, where each text is represented as a (feature vector, label) pair. The work in [19] proposed a text feature selection method based on “TongYiCi Cilin” to reduce the feature dimension of the data while preserving data integrity and classification accuracy. The model proposed in this paper is a text classification model based on synonym merging, named SM-CHI. Unlike [19], we merge synonyms after CHI-based feature selection, and we propose three improved weighting methods for the merged feature words.
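The order of operations described above (CHI-based selection first, synonym merging second) can be illustrated with a minimal sketch. It rests on several assumptions: it uses the textbook chi-square statistic, not the improved CHI formulas of the paper, which this summary does not reproduce; `synonym_of` is a hypothetical lookup standing in for a thesaurus such as “TongYiCi Cilin”; the threshold α is left out because its exact role is not stated here; and the function names are illustrative only. The merge step keeps the largest weight among the synonyms, the variant the abstract reports as most accurate.

```python
def chi_square(docs, labels, term, category):
    """Textbook chi-square score of a (term, category) pair.

    docs is a list of word sets, labels the matching list of categories.
    This is NOT the paper's improved CHI formula (assumption).
    """
    n = len(docs)
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Rank terms by their best score over all categories and keep the top k."""
    vocab = sorted(set().union(*docs))
    cats = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    return sorted(vocab, key=lambda t: -scores[t])[:k]


def merge_synonyms(weights, synonym_of):
    """Collapse synonyms onto one representative, keeping the largest weight.

    synonym_of maps a word to its representative (hypothetical lookup).
    """
    merged = {}
    for word, w in weights.items():
        rep = synonym_of.get(word, word)
        merged[rep] = max(merged.get(rep, 0.0), w)
    return merged
```

For example, with weights `{"cheap": 0.4, "low-cost": 0.7}` and `synonym_of = {"low-cost": "cheap"}`, the two entries collapse into a single feature `"cheap"` with weight 0.7, so the feature dimension shrinks by one while the strongest signal is retained.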

Text classification model based on semantic similarity

Synonym merging algorithm based on “Tong YiCi

Feature selection method based on the synonym merging

14: Selects the first 200 words as the feature

Conclusions

Journal: EURASIP Journal on Wireless Communications and Networking
Publication Date: Oct 5, 2017
Citations: 9
License type: open-access
