Abstract
Text classification, the automatic assignment of documents to predefined categories, is one of the primary steps toward knowledge extraction from raw textual data. In such tasks, words are treated as a set of features. Because traditional feature selection methods produce high-dimensional, sparse feature vectors, most text classification methods proposed for this purpose lack performance and accuracy. Many algorithms have been applied to the problem of automatic text categorization; for this reason, we tried to use newer methods from information extraction, natural language processing, and machine learning. This paper proposes an innovative approach to improve classification performance on Persian text. Naive Bayes classifiers, which are widely used for text classification in machine learning, are based on conditional probability. We compared the Gaussian, Multinomial, and Bernoulli variants of the naive Bayes algorithm with the SVM algorithm. For statistical text representation, TF, TF-IDF, and character-level 3-grams [6,9] were used. Finally, experimental results on 10 newsgroups are reported.
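The statistical representations named above (TF, TF-IDF, and character-level 3-grams) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy documents, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for illustration. Because Python strings are Unicode, the same code applies unchanged to Persian text.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf(text, n=3):
    """Term frequency: each n-gram count divided by the total n-gram count."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def idf(docs, n=3):
    """Smoothed inverse document frequency (an assumed, common variant)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each n-gram at most once per document.
        df.update(set(char_ngrams(doc, n)))
    return {gram: math.log((1 + n_docs) / (1 + d)) + 1 for gram, d in df.items()}


def tf_idf(docs, n=3):
    """Return one sparse TF-IDF vector (a dict) per document."""
    weights = idf(docs, n)
    return [{g: f * weights[g] for g, f in tf(doc, n).items()} for doc in docs]


# Toy documents (hypothetical, not the paper's corpus).
docs = ["text mining", "text classification", "data mining"]
vectors = tf_idf(docs)
```

In this sketch, a 3-gram that occurs in only one document (such as `dat`) receives a higher weight than one shared by several documents (such as `min`), which is the effect TF-IDF weighting is meant to have.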
Highlights
With the advent of information technology, organizations and companies are increasingly turning to the Internet to transfer their information
Since this paper has three objectives (comparing the weighting methods, assessing the effect of reducing the feature vector, and selecting the best machine-learning algorithm for classifying Persian texts), four tests are performed on the texts
When the algorithms are compared without eliminating prepositions, the above algorithm still outperforms the rest with 82.4% accuracy, but this is lower than in the first method, and training takes longer because of the larger feature vector
Summary
With the advent of information technology, organizations and companies are increasingly turning to the Internet to transfer their information. Given that about 80% of information is in the form of text, companies need data retrieval and mining tools to keep up with their rivals and to compete by using the information they obtain at the right time and at low cost. Text mining is an important part of data mining that organizes large sets of text documents to capture their hidden knowledge. This field includes the classification of texts and the extraction of relationships, entities, and events, which are widely used in data retrieval to organize documents. First is the supervised approach, which is commonly used when documents are assigned to pre-defined, labelled categories based on their contents.
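As a rough illustration of the supervised approach, the sketch below trains a multinomial naive Bayes classifier on labelled documents and assigns a new document to one of the pre-defined categories. It is a simplified example under stated assumptions, not the authors' implementation: the class name, the whitespace tokenizer, the Laplace (add-one) smoothing, and the toy English training data are all hypothetical, and only the multinomial variant is shown.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.split()                 # assumed whitespace tokenizer
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior plus the sum of log conditional word probabilities.
            score = math.log(n_label / n_docs)
            for token in doc.split():
                if token in self.vocab:  # skip words never seen in training
                    count = self.word_counts[label][token]
                    score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy labelled data (hypothetical, not the paper's corpus).
train_docs = ["team wins match", "player scores goal",
              "market shares fall", "bank raises rates"]
train_labels = ["sport", "sport", "economy", "economy"]
model = MultinomialNB().fit(train_docs, train_labels)
```

Calling `model.predict("team scores goal")` selects the category whose training documents make the observed words most probable. Working in log space avoids numerical underflow when many small probabilities are multiplied.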