Abstract
The number of documents available on the web is increasing rapidly, so automatic document classification is needed to help humans organize them. Text classification is one of the most common tasks in text mining. To build a model that can classify a document, the words are the main source of features. Because a corpus contains a very large number of words, we must select the features that are most significant to the class labels. Feature selection was introduced to improve the classification task and to reduce the high dimensionality of the feature space; it has become one of the most familiar solutions to the high-dimensionality problem in document classification. Selecting good features therefore plays an important role in text classification, increasing both model accuracy and computational efficiency. This paper presents an empirical study of three of the most widely used feature selection methods, Term Frequency (TF), Mutual Information (MI), and Chi-square (X2), combined with two distinct classifiers, Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB). The experiments are carried out on the commonly used 20-Newsgroups and Reuters benchmark datasets as well as on our own dataset. Because the number of features to keep is a parameter of the task, we evaluate the best 10 to 20 percent of features. The results of the six experiments that were conducted show that Chi-square gives the best performance for text classification.
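The pipeline described in the abstract (term counts, percentile-based feature selection, then a classifier) can be sketched with scikit-learn. This is an illustrative reconstruction, not the authors' code: the tiny corpus, the library choice, and the fixed 20-percent cutoff are assumptions for the sake of a runnable example; the paper itself sweeps cutoffs from 10 to 20 percent over 20-Newsgroups, Reuters, and the authors' dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the paper's datasets (20-Newsgroups,
# Reuters, and the authors' own collection).
docs = [
    "the striker scored a late goal in the match",
    "the goalkeeper saved a penalty in the final",
    "the senate passed the new budget bill today",
    "the president signed the trade bill into law",
]
labels = ["sport", "sport", "politics", "politics"]

# Term-frequency features, then keep the top 20% of terms ranked by
# the chi-square statistic (the upper end of the paper's 10%-20%
# sweep), then classify with Multinomial Naive Bayes.
model = Pipeline([
    ("tf", CountVectorizer()),
    ("chi2", SelectPercentile(chi2, percentile=20)),
    ("mnb", MultinomialNB()),
])
model.fit(docs, labels)

print(model.predict(["parliament debated the bill"]))
```

Swapping `MultinomialNB` for `sklearn.svm.LinearSVC`, or `chi2` for `mutual_info_classif`, reproduces the other method/classifier combinations the abstract compares.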