A novel filter feature selection method using rough set for short text data

Rasim Cekik, Alper Kursat Uysal

doi:10.1016/j.eswa.2020.113691

Abstract

The high dimensionality problem is an important concern for short text classification because it affects both the computational cost and the accuracy of classifiers. Short text data, besides being high dimensional, also has an incomplete, inconsistent, and sparse structure. Selecting important features that provide a better representation is one solution to the high dimensionality problem. In this study, we developed a novel filter feature selection method, Proportional Rough Feature Selector (PRFS), which uses rough sets to make a regional distinction according to the value set of a term, identifying documents that exactly belong to a class and documents that possibly belong to a class. Documents that possibly belong to a class are penalized by multiplying by a coefficient named α. Additionally, the effect of sparsity in the term vector space is calculated with the help of rough sets. PRFS is compared with state-of-the-art filter feature selection methods: Gini index, information gain, distinguishing feature selector, and the recently proposed max–min ratio and normalized difference measure. The comparison is carried out using various feature sizes on four different short text datasets, with Macro-F1 as the success measure. Experimental results demonstrate that PRFS offers either better or competitive performance with respect to the other feature selection methods in terms of Macro-F1. This study may be a pioneering study in this research field, as it proposes a novel feature selection method for short text classification using rough set theory.
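The regional distinction described in the abstract can be illustrated with the standard rough set lower and upper approximations. The sketch below is only an illustration under stated assumptions, not the PRFS formula, which the abstract does not give: it assumes a binary term value (1 if the term occurs in the document, 0 otherwise), and the names `rough_regions` and `weighted_support`, the toy data, and the default `alpha=0.5` are all hypothetical.

```python
from collections import defaultdict


def rough_regions(term_values, labels, target):
    """Split documents into rough set regions for one term and one class.

    Returns (lower, boundary): documents that certainly belong to `target`
    given the term value, and documents that only possibly belong to it.
    """
    # Documents sharing the same term value are indiscernible from each other.
    blocks = defaultdict(set)
    for doc_id, value in enumerate(term_values):
        blocks[value].add(doc_id)

    members = {doc_id for doc_id, label in enumerate(labels) if label == target}

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= members:   # whole block lies inside the class
            lower |= block
        if block & members:    # block overlaps the class
            upper |= block
    return lower, upper - lower


def weighted_support(term_values, labels, target, alpha=0.5):
    """Hypothetical score: certain documents count fully, possible ones
    are scaled by alpha. This is NOT the scoring function of the paper."""
    lower, boundary = rough_regions(term_values, labels, target)
    return len(lower) + alpha * len(boundary)


# Toy data (assumed): term presence per document and the document labels.
term_values = [1, 1, 0, 0, 1, 0]
labels = ["sport", "sport", "news", "sport", "sport", "news"]

lower, boundary = rough_regions(term_values, labels, "sport")
score = weighted_support(term_values, labels, "sport", alpha=0.5)
```

Here every document containing the term is labeled "sport", so those documents fall in the lower approximation, while the documents without the term mix both classes and end up in the boundary region that the α coefficient down-weights.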

Full Text