A simple and efficient filter feature selection method via document-term matrix unitization

Qing Li,Shuai Zhao,Tengjiao He,Jinming Wen

doi:10.1016/j.patrec.2024.02.025

Abstract

Text processing tasks commonly face the challenge of high dimensionality. One of the most effective remedies is to preprocess text data with feature selection, which picks from the native feature space of the text the features most useful for subsequent operations (e.g., classification). This reduces the dimensionality of the feature space and improves the efficiency and accuracy of those operations. This paper proposes a straightforward and efficient filter feature selection method for text processing based on document-term matrix unitization (DTMU). Unlike previous filter feature selection methods, which concentrate on defining scoring criteria, our method improves feature selection by unitizing each column of the document-term matrix. This mitigates feature-to-feature influence and reinforces the role of the weighting proportion within each feature. The scoring criterion then subtracts the sum of a feature’s weights over negative samples from the sum of its weights over positive samples and takes the absolute value. We conduct numerical experiments to compare DTMU with four advanced filter feature selection methods (max–min ratio metric, proportional rough feature selector, least loss, and relative discrimination criterion) and two classical ones (Chi-square and information gain). The experiments use four datasets with feature spaces on the order of ten thousand dimensions (book, dvd, music, movie) and two datasets with feature spaces on the order of a thousand dimensions (imdb, amazon_cells), all sourced from Amazon product reviews and movie reviews. The results show that DTMU selects features that are more useful for subsequent operations and achieves a higher dimensionality reduction rate than the other six methods. Moreover, DTMU generalizes well across various classifiers and across datasets of different dimensionality. Notably, the average CPU time for a single run of DTMU is 1.455 s.
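The two steps described in the abstract (column unitization, then an absolute difference of class-wise weight sums) can be sketched in a few lines of NumPy. This is only an illustrative reading of the abstract, not the authors' implementation: the function names `dtmu_scores` and `select_top_k` are hypothetical, the labels are assumed to be binary (1 = positive, 0 = negative), and the abstract does not say which norm is used for unitization, so the L2 norm is assumed here and exposed as the `norm_ord` parameter.

```python
import numpy as np


def dtmu_scores(X, y, norm_ord=2):
    """Score each term of a document-term matrix.

    X: array of shape (n_docs, n_terms) holding term weights.
    y: array of shape (n_docs,) with 1 for positive and 0 for negative samples.
    norm_ord: norm used to unitize columns (assumption: L2; the abstract
        does not name the norm).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Unitize every column (term) so that it has unit norm.
    # All-zero columns keep a divisor of 1 and therefore stay zero.
    norms = np.linalg.norm(X, ord=norm_ord, axis=0)
    safe_norms = np.where(norms > 0, norms, 1.0)
    U = X / safe_norms

    # Sum the unitized weights per class, then take |positive - negative|.
    pos_sum = U[y == 1].sum(axis=0)
    neg_sum = U[y == 0].sum(axis=0)
    return np.abs(pos_sum - neg_sum)


def select_top_k(X, y, k, norm_ord=2):
    """Return the indices of the k highest-scoring terms, best first."""
    scores = dtmu_scores(X, y, norm_ord=norm_ord)
    return np.argsort(-scores, kind="stable")[:k]
```

For a toy matrix with rows `[3, 1, 0]`, `[4, 1, 0]`, `[0, 1, 0]`, `[0, 1, 5]` and labels `[1, 1, 0, 0]`, the second term appears equally in both classes and scores 0, while the first and third terms are class-specific and are the ones `select_top_k(X, y, 2)` returns.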


Journal: Pattern Recognition Letters	Publication Date: Mar 19, 2024
Citations: 1

Similar Papers

Two Parallelized Filter Methods for Feature Selection Based on Spark
Reine Marie Ndéla Marone ... Demba Kande
14 Dec 2018

An efficient feature selection method based on improved elephant herding optimization to classify high‐dimensional biomedical data
Harpreet Singh ... Manpreet Kaur
Expert Systems | VOL. 39
16 May 2022

A novel filter feature selection method using rough set for short text data
Rasim Cekik ... Alper Kursat Uysal
Expert Systems with Applications | VOL. 160
06 Jul 2020

SVM-FuzCoC: A novel SVM-based feature selection method using a fuzzy complementary criterion
S.P Moustakidis ... J.B Theocharis
Pattern Recognition | VOL. 43
10 May 2010
