A novel redistribution-based feature selection for text classification

Murat Okkalioglu

doi:10.1016/j.eswa.2023.123119

Abstract

Text classification is the process of automatically assigning text documents to predefined labels, and it is gaining importance with the growing volume of data. In the vector space model, terms within documents must be quantified before they can be used with classifiers. The abundance of unique terms associated with documents can lead to unwieldy vectors. Feature selection, which reduces the number of terms by removing irrelevant ones, is a common solution to this issue. In this paper, a novel feature selection method called Amount of ReDistribution to Establish Neutrality (ARDEN) is proposed for the task of text classification. ARDEN is designed from a statistical distance perspective: it measures the distance of a term to its neutral counterpart, which represents the least distinguishing term, one with uniform document frequencies across all classes. ARDEN introduces a method to measure this distance, providing insight into the degree to which a term deviates from neutrality and thus capturing its distinguishing power. ARDEN is experimentally tested against state-of-the-art feature selection methods. The results suggest that it is a competitive feature selection method, exhibiting superior performance in terms of the summary statistics μ and δ′. Furthermore, ARDEN is designed for multi-class text classification problems and does not require a globalization function. Additionally, the proposed method is compared against the Wasserstein statistical distance metric and clearly outperforms it. Finally, we also propose the δ′ evaluation criterion to facilitate the interpretation of experimental outcomes across multiple feature sizes.
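
The idea of a neutral counterpart can be illustrated with a small sketch. The abstract does not state the ARDEN formula, so the score below is only an assumed stand-in: it counts how much document frequency would have to be moved between classes to make a term's per-class document frequencies uniform. The function name `redistribution_to_neutral` and the example counts are hypothetical and are not taken from the paper.

```python
def redistribution_to_neutral(doc_freqs):
    """Illustrative score only (an assumption, not the paper's ARDEN formula).

    doc_freqs holds one document frequency per class for a single term.
    The neutral counterpart has the same total spread uniformly over the
    classes; the score is the total excess above that uniform level, i.e.
    the amount that would have to be redistributed to reach neutrality.
    """
    if not doc_freqs:
        raise ValueError("doc_freqs must contain at least one class")
    uniform_level = sum(doc_freqs) / len(doc_freqs)
    # Only classes above the uniform level give documents away.
    return sum(df - uniform_level for df in doc_freqs if df > uniform_level)


# Hypothetical document frequencies of three terms over three classes.
neutral_term = [10, 10, 10]   # already uniform, least distinguishing
skewed_term = [20, 10, 0]     # partly concentrated in one class
exclusive_term = [30, 0, 0]   # appears in one class only

scores = {
    "neutral": redistribution_to_neutral(neutral_term),
    "skewed": redistribution_to_neutral(skewed_term),
    "exclusive": redistribution_to_neutral(exclusive_term),
}
```

Under this assumed score, a term that is farther from neutrality receives a larger value, so ranking terms by the score and keeping the top ones mirrors the feature selection use described above. Because the score is computed over all classes at once, it yields a single value per term without a separate step that combines class-wise scores.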

Full Text