Abstract

By reducing data dimensionality, feature selection (FS) can improve the accuracy and reduce the computational cost of machine learning models, especially on high-dimensional text datasets. To improve robustness, ensemble feature selection (EFS), in which different aggregation methods are applied, has recently attracted considerable attention. This paper proposes a four-stage EFS method called re-ranking and TOPSIS-based ensemble feature selection (RTEFS). In the first stage of RTEFS, features are extracted from the text corpus. In the second stage, a union subset is constructed from the outputs of six filter-based FS methods applied to the preprocessed feature vectors. A re-ranking stage is then applied to re-evaluate the features in this union subset. In the ensemble feature ranking stage, the TOPSIS method is used to aggregate the ranking lists produced by two groups of FS methods. In the final stage, the two fused rankings are combined via the multi-objective genetic algorithm NSGA-III. To demonstrate the superiority of the proposed method, experiments are performed on the 20-Newsgroups and Reuters-21578 datasets with Support Vector Machine and K-Nearest Neighbors classifiers. Results show that RTEFS achieves higher accuracy and F-measure scores than the base methods.
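
For illustration only, the TOPSIS aggregation step described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes each column of the score matrix holds one filter method's importance scores (higher is better) and that the methods are weighted uniformly; the function name topsis_rank and the random example data are hypothetical.

import numpy as np

def topsis_rank(scores, weights=None):
    """Rank alternatives (here: features) with standard TOPSIS.

    scores : (n_features, n_methods) array; each column is one FS method's
             importance scores, higher = better (an assumption of this sketch).
    weights: optional per-method weights; uniform if None.
    Returns closeness coefficients; rank features by descending value.
    """
    X = np.asarray(scores, dtype=float)
    n_features, n_methods = X.shape
    w = np.full(n_methods, 1.0 / n_methods) if weights is None else np.asarray(weights, dtype=float)

    # Vector-normalize each criterion column, then apply the weights.
    norm = np.linalg.norm(X, axis=0)
    norm[norm == 0] = 1.0            # guard against all-zero columns
    V = (X / norm) * w

    # Ideal-best and ideal-worst points (all criteria treated as "benefit").
    best, worst = V.max(axis=0), V.min(axis=0)

    # Euclidean distances of each feature to the ideal points.
    d_best = np.linalg.norm(V - best, axis=1)
    d_worst = np.linalg.norm(V - worst, axis=1)

    # Closeness coefficient: 1 = ideal best, 0 = ideal worst.
    return d_worst / (d_best + d_worst + 1e-12)

# Example: fuse three hypothetical filter scores for five features.
rng = np.random.default_rng(0)
scores = rng.random((5, 3))
closeness = topsis_rank(scores)
fused_order = np.argsort(-closeness)   # best-ranked feature first
print(fused_order)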

