A Semantic-based Feature Extraction Method Using Categorical Clustering for Persian Document Classification

Saeedeh Davoudi, Sayeh Mirzaei

doi:10.1109/csicc52343.2021.9420602

Abstract

Natural Language Processing (NLP) is one of the promising fields of artificial intelligence. In recent decades, a high volume of text data has been generated through the Internet. Such data is a valuable source of information that can be used in fields such as information retrieval, search engines, and recommender systems. One practical task in text mining is document classification. In this paper, we focus on Persian document classification. We introduce a new feature extraction approach that combines K-means clustering with Word2Vec to obtain semantically relevant and discriminative word representations. We call the proposed approach CC-Word2Vec (Categorical Clustering-Word2Vec), since we retrain the Word2Vec model using the word clusters of each category obtained by the K-means algorithm. We use 200 documents from the 5 most frequent categories of the Hamshahri news dataset to evaluate our method. We pass the extracted word vectors to Multi-Layer Perceptron (MLP) and Gradient Boosting (GB) classifiers and compare the performance of the proposed approach with the Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec methods. The new approach improved the accuracy of both the GB and MLP models in comparison with TF-IDF and Word2Vec.
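As a rough illustration of the categorical clustering step only, the following minimal sketch groups the word vectors of each category with a plain K-means loop. It is not the authors' implementation: the function names, the 2-D toy vectors, and the deterministic initialization are all assumptions made for this example. In practice the vectors would come from a trained Word2Vec model, and the abstract does not specify how the resulting clusters are fed back into retraining, so that step is left out.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def cluster_words_by_category(category_vectors, k):
    """Return {category: [set_of_words, ...]}, one list of word clusters per category."""
    result = {}
    for category, word_vecs in category_vectors.items():
        words = list(word_vecs)
        labels = kmeans([word_vecs[w] for w in words], k)
        clusters = [set() for _ in range(k)]
        for word, lab in zip(words, labels):
            clusters[lab].add(word)
        result[category] = clusters
    return result


# Hypothetical toy embeddings (2-D) standing in for Word2Vec vectors.
toy = {
    "sport": {"goal": (0.0, 0.1), "team": (5.0, 5.0),
              "match": (0.1, 0.0), "coach": (5.1, 4.9)},
    "economy": {"bank": (1.0, 1.0), "tax": (9.0, 9.0),
                "loan": (1.1, 0.9), "budget": (9.1, 9.1)},
}
clusters = cluster_words_by_category(toy, k=2)
```

Clustering is run separately for each category, so the resulting word groups are category-specific, which is the property the name "Categorical Clustering" suggests.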

Full Text