Abstract
Text classification requires prior extraction of features that describe the text documents in a collection. These features are usually based on the occurrence frequencies of words or word N-grams in documents, i.e. a vector space model of document representation is built. Feature selection reduces redundancy in this high-dimensional representation of text data, which can significantly improve text classification performance. In the present paper, feature selection methods are studied in terms of the accuracy and F-measure of text classification for different numbers of selected attributes (word N-grams), different classifiers, and different datasets. The obtained results can guide further pre-processing steps that modify the vector space model in order to improve it for text classification.
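The sketch below illustrates the general workflow the abstract describes, not the authors' exact pipeline: a vector space model built from word N-gram occurrence counts, selection of a fixed number of attributes, and evaluation of a classifier with accuracy and F-measure. The dataset (scikit-learn's 20 Newsgroups), the chi-squared selector, the Naive Bayes classifier, and the value of k are all illustrative assumptions not named in the abstract.

```python
# Minimal sketch of the described workflow (assumed dataset, selector,
# classifier, and k; the paper's own choices may differ).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Vector space model: occurrence frequencies of unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Feature selection: keep the k N-grams most associated with the class labels.
selector = SelectKBest(chi2, k=10_000)  # k is an assumed value
X_train_sel = selector.fit_transform(X_train, train.target)
X_test_sel = selector.transform(X_test)

# Train a classifier on the reduced representation and evaluate it.
clf = MultinomialNB().fit(X_train_sel, train.target)
pred = clf.predict(X_test_sel)
print("accuracy:", accuracy_score(test.target, pred))
print("macro F1:", f1_score(test.target, pred, average="macro"))
```

Varying k, the selection criterion, the classifier, and the dataset in such a loop is the kind of comparison the paper reports.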