Abstract

Automatic text categorization (ATC) is the task of assigning predefined classes to text documents based on their textual content. A large number of features are extracted from the text documents, and each document is represented as a feature vector. However, these feature vectors contain many redundant features, which incur high processing overhead and can sometimes reduce classification performance. Feature selection schemes are therefore used to select the most relevant features from a document's feature vector, reducing the processing cost and improving the performance of the classification system. In this paper, mutual information-based weighted feature selection algorithms are applied to automatic text categorization on the Ohsumed test collection, a subset of the MEDLINE database that is available among the KEEL text classification datasets. The experiments use implementations of four learners (SVM, kNN, DT, and NB) together with nine feature selection algorithms from the FEAST toolbox: BetaGamma, CMIM, MRMR, MIFS, JMI, DISR, ICAP, Condred, and CIFE. Extensive experiments are carried out, with accuracy as the performance measure, to compare the nine feature selection algorithms on the text document dataset. The results suggest that weighted feature selection enhances the classification performance on text documents.
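To make the idea of mutual information-based feature selection concrete, the following is a minimal sketch of a greedy mRMR-style criterion (relevance to the class minus average redundancy with features already chosen), one of the nine criteria named above. It is not the FEAST implementation and does not reproduce the weighting scheme used in the paper. The function names `mutual_information` and `mrmr_select`, the term names, and the toy term-presence data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def mrmr_select(columns, labels, k):
    """Greedy mRMR-style selection over a dict of {feature_name: column}.

    Each step picks the feature with the highest relevance I(X;Y) minus its
    mean mutual information with the features selected so far.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in columns.items()}
    selected = []
    while len(selected) < min(k, len(columns)):
        best_name, best_score = None, float("-inf")
        for name, col in columns.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(mutual_information(col, columns[s])
                                 for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: 8 documents, binary class label,
# and binary term-presence features (1 = term occurs in the document).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "heart":   [1, 1, 1, 0, 0, 0, 0, 0],  # informative
    "cardiac": [1, 1, 1, 0, 0, 0, 0, 0],  # exact duplicate of "heart"
    "the":     [1, 1, 1, 1, 1, 1, 1, 1],  # constant, carries no information
    "pulse":   [0, 0, 0, 1, 0, 0, 0, 0],  # weaker, but complements "heart"
}

selected = mrmr_select(columns, labels, k=2)
print(selected)
```

In this toy setting the duplicate term "cardiac" is as relevant as "heart" on its own, but the redundancy penalty causes the second pick to be "pulse" instead, which illustrates why redundancy-aware criteria can produce smaller, more useful feature sets than ranking by relevance alone.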
