Abstract
In recent years, the volume of data produced on social media has grown exponentially. Social media have become a valuable source of information for analysts and for businesses seeking to grow. Automatic document classification is an essential step in extracting knowledge from these sources. In automatic text classification, words are treated as a set of features. Selecting useful features from each text reduces the size of the feature vector and improves classification performance. Many algorithms have been applied to automatic text classification. Although the methods proposed for other languages are applicable and comparable, classification and feature selection in Persian text have not been studied sufficiently. The present research is conducted on Persian text, and the introduction of a Persian dataset is part of its contribution. This article presents an approach to improving the performance of Persian text classification. The authors extracted 85,000 Persian messages from the Idekav-system, which is a Telegram search engine. The proposed approach to processing and classifying this textual data is based on expanding the feature vector by adding selected features, obtained with the most widely used feature selection methods based on local and global filters. The new feature vector is then filtered by a secondary feature selection phase, which selects the more appropriate features among those added in the first step in order to enhance the effect of applying wrapper methods on classification performance. In the third step, combined filter-based methods and a combination of the results of different learning algorithms are used to achieve higher accuracy. At the end of the three selection stages, the proposed method increased accuracy up to 0.945 and reduced training time and computation on the Persian dataset.
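The first selection stage described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: chi-square is used as the filter score (the abstract does not name the specific filters), the per-class score stands in for a "local" filter, the maximum over classes stands in for a "global" filter, and the expanded feature set is the union of their top-k terms. The function names and the toy English corpus are hypothetical.

```python
def chi_square(docs, labels, term, cls):
    """Chi-square statistic between presence of `term` and membership in `cls`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cls = label == cls
        if has_term and in_cls:
            a += 1          # term present, document in class
        elif has_term:
            b += 1          # term present, document in another class
        elif in_cls:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def local_scores(docs, labels):
    """Assumed 'local' filter: one score table per class."""
    vocab = sorted({t for doc in docs for t in doc})
    classes = sorted(set(labels))
    return {
        cls: {t: chi_square(docs, labels, t, cls) for t in vocab}
        for cls in classes
    }


def top_k(scores, k):
    """Highest-scoring k terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def expanded_features(docs, labels, k):
    """Union of the global top-k terms and each class's local top-k terms."""
    local = local_scores(docs, labels)
    vocab = list(next(iter(local.values())))
    # Assumed 'global' filter: maximum of the per-class scores.
    global_scores = {t: max(local[c][t] for c in local) for t in vocab}
    selected = set(top_k(global_scores, k))
    for cls_scores in local.values():
        selected |= set(top_k(cls_scores, k))
    return sorted(selected)


# Toy corpus (hypothetical): each document is a set of tokens.
docs = [
    {"buy", "house"},
    {"sell", "house"},
    {"hello", "friend"},
    {"good", "morning", "friend"},
]
labels = ["request", "request", "chat", "chat"]
features = expanded_features(docs, labels, k=2)
```

With only two classes the local and global rankings coincide; the union becomes larger than a single ranking once there are three or more classes whose most discriminative terms differ.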
Highlights
The rapid progress of, and easy access to, Internet technologies, multimedia, and social networks have drastically changed human life.
The relevant documents with the highest and lowest scales are extracted; since numerical rankings cannot be applied to all phrases and sentences that are part of a review, filter methods are used based on the characteristics of the studied language.
The authors carry out preprocessing steps on the phrases, such as removing Persian stop words and stemming, and use the resulting feature matrix to calculate the score of each phrase.
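The preprocessing mentioned in the last highlight can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stop-word set is a tiny hypothetical sample, "stemming" is reduced to stripping the plural suffix "ها", and the feature matrix is a simple binary term-document matrix. The phrase-scoring formula is not given in the text, so it is left out.

```python
# Hypothetical, deliberately tiny resources; a real system would use a full
# Persian stop-word list and a proper stemmer.
STOP_WORDS = {"و", "در", "به", "از", "که", "را"}
PLURAL_SUFFIX = "ها"
ZWNJ = "\u200c"  # zero-width non-joiner, often written before the plural suffix


def stem(token):
    """Naive stemmer (assumption): strip the plural suffix from longer tokens."""
    if token.endswith(PLURAL_SUFFIX) and len(token) > len(PLURAL_SUFFIX) + 2:
        token = token[: -len(PLURAL_SUFFIX)]
    return token.rstrip(ZWNJ)


def preprocess(text):
    """Whitespace tokenization, stop-word removal, then stemming."""
    return [stem(t) for t in text.split() if t not in STOP_WORDS]


def feature_matrix(texts):
    """Binary term-document matrix: one row per text, one column per term."""
    docs = [preprocess(t) for t in texts]
    vocab = sorted({t for doc in docs for t in doc})
    rows = [[1 if term in doc else 0 for term in vocab] for doc in docs]
    return vocab, rows


messages = ["خرید کتابها در تهران", "فروش خانه در تهران"]
vocab, rows = feature_matrix(messages)
```

The length guard in `stem` keeps short words that merely end in the same letters from being truncated; a real stemmer would need far more careful rules.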
Summary
The rapid progress of, and easy access to, Internet technologies, multimedia, and social networks have drastically changed human life. Social networks have a considerable impact on the potential value of businesses [3]. They are widespread and highly regarded among users. 60% of Iranians use Telegram [9], which has become a popular and extensively used social network in various fields, such as the development of certain Internet businesses, and contains valuable information. The request-type messages exchanged among Telegram users are among these data with hidden knowledge. For example, a Telegram message may contain a request for help with buying a house or a product. If such a request is identified and sent to the owners of related businesses, it will promote business development. The authors are dealing with Telegram text data, and it is necessary to process and classify the text to
Khalifeh Zadeh & Zare Chahooki, An Effective Method of Feature Selection in Persian Text for Improving the